NL2034409A - A biological invasion identification method based on multi-source data fusion analysis - Google Patents

A biological invasion identification method based on multi-source data fusion analysis

Info

Publication number
NL2034409A
Authority
NL
Netherlands
Prior art keywords
data
text
invasive
matrix
probability matrix
Prior art date
Application number
NL2034409A
Other languages
Dutch (nl)
Inventor
Chen Biyun
Original Assignee
Univ Yancheng Teachers
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Univ Yancheng Teachers filed Critical Univ Yancheng Teachers
Publication of NL2034409A publication Critical patent/NL2034409A/en


Classifications

    • G06F18/253 Fusion techniques of extracted features
    • G06F16/29 Geographical information databases
    • G06F16/353 Clustering; Classification into predefined classes
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/24 Classification techniques
    • G06F18/2411 Classification techniques based on the proximity to a decision surface, e.g. support vector machines
    • G06F18/2415 Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G06N20/10 Machine learning using kernel methods, e.g. support vector machines [SVM]
    • G06N3/045 Combinations of networks
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G06N3/047 Probabilistic or stochastic networks
    • G06N3/09 Supervised learning
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V10/454 Integrating biologically inspired filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G06V10/764 Image or video recognition using classification, e.g. of video objects
    • G06V10/82 Image or video recognition using neural networks
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The present invention discloses a method for identifying biological invasions based on multi-source data fusion analysis, comprising the steps of: acquiring a multi-source data set containing invasive organism data and tagging the invasive organism data, the data set including text data, image data, temporal data and geolocation data; classifying the text data and outputting a text probability matrix with markers; identifying the location of invasive organisms in the image data, determining boundaries and sizes, and training an image probability matrix with markers; one-hot encoding the temporal data and constructing a time-space feature matrix from the encoded data together with the geolocation data; constructing a multi-feature vector from the text probability matrix, the image probability matrix and the time-space feature matrix; assigning weights to the multi-feature vector and training a binary classifier using a machine learning algorithm; and feeding the data to be predicted into the binary classifier to obtain invasive organism data.

Description

A BIOLOGICAL INVASION IDENTIFICATION METHOD BASED ON MULTI-SOURCE
DATA FUSION ANALYSIS
Technical field
The present invention relates to the field of big data and artificial intelligence technology, and in particular to a biological invasion identification method based on multi-source data fusion analysis.
Background technology
As globalisation accelerates and land use patterns change, biological invasions have become a worldwide ecological security problem. The means to combat them at this stage include: establishing appropriate monitoring systems to identify the species, number, distribution and role of alien species; increasing education on the dangers of biological invasions and raising community awareness to prevent them; and actively searching for identification and control techniques for invasive alien species in order to effectively curb the current trend of biological invasion.
At present, artificial intelligence technology is becoming a new engine in the field of ecological resources, but it is worth noting that fusing data from multiple sources raises problems of matching data structure and accuracy: the monitoring information obtained contains many different types of data, and the question is how to use such multi-characteristic data for rapid identification and intelligent diagnosis of alien species and, on that basis, for risk analysis and prognosis.
Summary of the invention
The present invention addresses problems in existing solutions and provides a method for identifying biological invasions based on fusion analysis of data from multiple sources, comprising the steps of:
Obtaining a multi-source data set containing invasive organism data and tagging the invasive organism data; said data set includes: text data, image data, time data, and geographic location data.
Classifying said text data and outputting a text probability matrix with markers.
For said image data, identifying the location of the invasive organisms in the images, determining the boundaries and sizes, and training an image probability matrix with markers.
One-hot encoding said temporal data and constructing a time-space feature matrix from the encoded data together with said geolocation data.
Constructing a multi-feature vector based on said text probability matrix, said image probability matrix and said time-space feature matrix; assigning weights to said multi-feature vector and training a binary classifier using a machine learning algorithm.
Feeding the data to be predicted into the binary classifier to obtain invasive organism data.
Further, classifying said text data specifically comprises: removing stop words from said text data, constructing N-gram features using Fast-Text, subjecting the text content to a sliding window operation of size N in byte order, and finally forming a sequence of byte fragments of length N. The resulting sequence is used as a text feature candidate set from which important features are filtered, and a text probability matrix with markers is output using Softmax.
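The sliding-window N-gram construction just described can be sketched as follows. This is a minimal illustration, assuming character-level trigrams with '<' and '>' as the prefix and suffix markers mentioned later in the description; it is not the patent's implementation:

```python
def char_ngrams(word, n=3):
    """Slide a window of size n over '<word>' to build byte-fragment
    (character n-gram) features, FastText-style; '<' marks the prefix
    and '>' marks the suffix of the word."""
    marked = f"<{word}>"
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]

# Example: trigram candidate set for one word
print(char_ngrams("frog"))  # ['<fr', 'fro', 'rog', 'og>']
```

The resulting fragments form the candidate feature set from which important features are then filtered.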
Further, identifying the location of the invasive organism in said image data specifically comprises: determining the location of the invasive organism to be identified by means of an image recognition algorithm such as a CNN, zooming in, determining the boundaries and the image size, and training an image probability matrix with markers using the CNN.
Further, assigning weights to said multi-feature vector and training a binary classifier using a machine learning algorithm specifically comprises: normalising said multi-feature vector, assigning weights using the entropy weighting method, and training a binary classifier using a machine learning algorithm such as SVM.
Further, feeding the data to be predicted into the binary classifier to obtain invasive organism data specifically comprises: inputting the data to be predicted and using the SVM to produce the final marker; when the output marker is 1, the data uploaded by the user at that location and time period is true, meaning that invasive species have been present there and should be dealt with promptly.
Further, said method for identifying biological invasions based on fusion analysis of multiple sources of data further comprises: using a GM model on said time-space feature matrix to predict the migration or reproduction patterns of future invasive organisms.
Further, the classifier in said binary classifier can also be replaced by methods such as random forest, logistic regression or a neural network.
Compared to the prior art, the present invention provides a biological invasion identification method based on multi-source data fusion analysis, which has the following beneficial effects:
This multi-source data fusion analysis-based biological invasion identification method incorporates textual data, image data, temporal data and geographic location data, and is applied to the rapid identification and monitoring of species, while also predicting the development trend of species over time, providing a basis for the formulation of corresponding rational and efficient conservation and management measures.
Description of attached drawings
Fig 1 shows the Fast-Text flowchart;
Fig 2 shows a flowchart specific to the output layer from the fully connected layer;
Fig 3 shows the probability distribution of the 11-100 training sets;
Fig 4 shows the test set recall statistics;
Fig 5 shows a graph of test set accuracy statistics;
Fig 6 shows a plot of the stochastic predictions of the training model;
Fig 7 shows a plot of the rate of change metrics for the training model;
Fig 8 shows a schematic diagram of the actual geographical location of the training model;
Fig 9 shows a map of thermal predictions;
Fig 10 shows the overall flowchart of the biological invasion identification method based on multi-source data fusion analysis.
Specific implementation method
Specific embodiments of the invention are described further below in conjunction with the accompanying drawings. The following embodiments are only intended to illustrate the technical solutions of the invention more clearly and cannot be used to limit the scope of protection of the invention.
Example 1
The present invention provides a biological invasion identi- fication method based on fusion analysis of multiple sources of data, as shown in the overall flowchart in Fig 10, comprising the steps of:
Obtain a multi-source dataset containing invasive organism data and tag the invasive organism data; the dataset includes: text data, image data, temporal data, and geographic location data.
The dataset consists of text, images, time and geographical location, expanding the range of data in the dataset. By dividing the dataset into a training set and a test set, controlled experiments can be carried out to facilitate verification of the technical effectiveness of this algorithm.
Classification of the text data and output of the text probability matrix with markers: classifying the text data specifically comprises removing stop words from the text data, constructing N-gram features using Fast-Text, subjecting the text content to a sliding window operation of size N in byte order, and finally forming a sequence of byte fragments of length N. The resulting sequence is used as a candidate set from which important features are filtered, and a text probability matrix with markers is output using Softmax.
The data provided by people commenting on invasive organisms can be useful for laboratory identification, so the textual data can have a significant impact on the identification of invasive organisms. To preserve the morphological features within each of its 5 words, feature extraction is performed on the word vector of each record. This is a sliding window operation of size N in byte order, resulting in a sequence of byte fragments of length N, where '<' denotes a prefix and '>' denotes a suffix. A trigram built between '<' and '>' can then be used to represent a word, and 5 such vectors can be superimposed to better represent the word vector. The discrete variables are converted into continuous vectors by Embedding to form the word vector of a record, W_j = [w_1j, w_2j, ..., w_ij, ..., w_mj], which denotes the word vector of the j-th record. Fig 1 shows the flowchart of Fast-Text: the Embedding-processed word vectors are used as input features, the hidden layer averages over the multiple word vectors, and the negative log-likelihood is used as the loss function:

    -(1/N) * Σ_{n=1..N} y_n · log(f(B A x_n)),

where N is the number of texts, x_n is the word features of the n-th text, y_n is its label, and A and B are weight matrices: A converts the input to a textual representation and B computes the category by linear transformation. f is the Softmax function used to calculate the probability of the final classification. Softmax uses a Huffman-tree-based hierarchical structure to classify the text data and output the probabilities for each category.
The expression for the hierarchical Softmax (reconstructed here from the garbled source) is:

    p(w_c | w_t) = exp(s(w_t, w_c)) / Σ_{w ∈ V} exp(s(w_t, w)),

where p(w_c | w_t) is the final probability of the text and w denotes the word vector.
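For illustration, this probability amounts to a softmax over scores; the hierarchical (Huffman-tree) variant computes the same distribution as a product of binary decisions along the tree path, reducing the cost from O(V) to O(log V). The function below is a sketch of the flat form only, not the patent's implementation:

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities exp(s_c) / sum_w exp(s_w)."""
    m = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
assert abs(sum(probs) - 1.0) < 1e-9          # a valid probability distribution
```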
The final probability matrix for the n records is obtained, as shown in Table 1, where T_n denotes the n records, L_k denotes that there are k categories, and p_ij denotes the probability of category j in the i-th record.
Table 1: Text data probability matrix
      L_1   ...   L_j   ...   L_k
T_1   p_11  ...   p_1j  ...   p_1k
T_i   p_i1  ...   p_ij  ...   p_ik
T_n   p_n1  ...   p_nj  ...   p_nk
Identifying the location of invasive organisms in images, determining boundaries and sizes, and training an image probability matrix with markers on the image data: the location of the invasive organism is identified in the image data by means of an image recognition algorithm such as a CNN, zooming in, determining the boundaries and the size of the image, and training the image probability matrix with markers using the CNN.
It is important to extract features from the image data, as the information uploaded by the public has a great influence on the laboratory judgement. First, the image data is preprocessed: data that is not an image is deleted and the file suffix is normalised. A CNN (convolutional neural network) is mainly divided into convolutional layers (CONV), pooling layers (POOL) and fully connected layers (FC). CNNs are mainly used to extract features step by step, from local features to overall features, to perform image recognition and other functions. Processing images with a CNN is divided into four stages: the input layer, the convolutional and downsampling layers, the fully connected layer and the output layer. The input layer takes RGB colour images; the RGB components are convolved with the convolution-layer weights W to obtain the individual C layers, which are then downsampled to obtain the individual S layers. After the activation function, the output of these layers is called a Feature-Map; the fully connected layer expands each element of the Feature-Map in turn, arranging them in a column, and Softmax is used for classification in the output layer.
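The CONV → POOL → FC → Softmax pipeline described above can be sketched in PyTorch (the framework the embodiment later says it uses). The layer counts, channel widths and 64×64 greyscale input size below are illustrative assumptions, not values from the patent:

```python
import torch
import torch.nn as nn

class InvasiveSpeciesCNN(nn.Module):
    """Minimal CONV -> POOL -> FC pipeline: 1-channel greyscale input,
    feature extraction, then class probabilities via Softmax."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1),   # C layer
            nn.ReLU(),
            nn.MaxPool2d(2),                             # S (downsampling) layer
            nn.Conv2d(8, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # 64x64 input -> two 2x pools -> 16x16 Feature-Map with 16 channels
        self.classifier = nn.Linear(16 * 16 * 16, num_classes)

    def forward(self, x):
        feat = self.features(x)
        flat = feat.flatten(1)                           # expand the Feature-Map into a column
        return torch.softmax(self.classifier(flat), dim=1)

model = InvasiveSpeciesCNN()
probs = model(torch.randn(4, 1, 64, 64))                 # batch of 4 greyscale images
```

Each row of `probs` sums to 1, giving the per-class probabilities used to build the image probability matrix.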
Fig 2 is a specific flow diagram from the fully connected layer to the output layer, where the invasive Asian organisms are used as the input-layer data. Since the dataset consists of RGB colour images, three separate 2D kernels are used to scale and greyscale the images, quickly converting the 3-channel RGB colour images to 1-channel greyscale. After multiple rounds of convolution, pooling and activation, the features are extracted and passed through a fully connected layer, and the probability of being an invasive Asian organism is output using the Softmax function.
Here X (height × width × channel) is the input pixel matrix and Y is the output matrix: the input is convolved and pooled, the multidimensional data is flattened and connected to a fully connected layer, and the output is the class probability for classification using a traditional Softmax. A vector of size T × L is obtained, where each value represents the probability of the input with respect to the samples. This gives the classification probability for the image file, denoted C.
The final probability matrix for the image data is obtained as shown in Table 2, with q_ij denoting the probability of category j in the i-th record.
Table 2: Picture data probability matrix
      L_1   ...   L_j   ...   L_k
T_1   q_11  ...   q_1j  ...   q_1k
T_i   q_i1  ...   q_ij  ...   q_ik
T_n   q_n1  ...   q_nj  ...   q_nk
One-hot encoding of the temporal data and construction of a time-space feature matrix from the encoded data together with the geolocation data; construction of a multi-feature vector based on the text probability matrix, the image probability matrix and the time-space feature matrix; assigning weights to the multi-feature vector and training a binary classifier using machine learning algorithms. Specifically: the multi-feature vector is normalised, the weight coefficients are assigned using the entropy weight method, and a binary classifier is trained using a machine learning algorithm such as SVM; the classifier in the binary classifier can also be replaced by methods such as random forest, logistic regression or a neural network.
Inputting the data to be predicted into the binary classifier to obtain invasive organism data, specifically: the data to be predicted is input and the final marker is produced using the SVM; when the output marker is 1, the data uploaded by the user at that location and time is true, meaning that invasive species have been present there and should be dealt with promptly.
The method for identifying biological invasions based on multi-source data fusion analysis further comprises: using a GM model on the time-space feature matrix to predict the migration or reproduction patterns of future invasive organisms.
The dataset consists of sighting-report data aggregated by the Washington State Department of Agriculture for December 2020; it contains a spreadsheet of 4440 sighting reports and 3305 user-uploaded images. Sighting reports that have been laboratory-identified are marked with a 1 for invasive organisms and a 0 otherwise. Randomly, 70% of the total data is assigned to the training set and the remaining data to the test set.
For the sighting reports provided by the population, each report is independent of the others; the information is not continuous between reports, nor between individual characteristic values, but discrete and disordered. Features can therefore be digitised using one-hot coding, also known as one-bit-valid coding. This encodes N states using N-bit status registers, each state having its own register bit and only one bit being valid at any given time.
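The N-bit status-register encoding just described can be sketched as follows; the time-period labels are hypothetical examples, not categories from the patent:

```python
def one_hot(states, value):
    """Encode a discrete value with N status-register bits: exactly one bit
    (the matching state) is 1, all others are 0."""
    return [1 if s == value else 0 for s in states]

# Hypothetical discrete time periods for the temporal feature T
periods = ["morning", "afternoon", "evening", "night"]
print(one_hot(periods, "evening"))  # [0, 0, 1, 0]
```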
As invasive organisms are exotic species, there may be few occurrences, and the GM model is well suited to small amounts of incomplete information, calculating the range of variation for the time of occurrence of invasive organisms as well as their geographical location. The GM(1,1) model can be expressed as Y = Bu, with the predicted sample range indicating the regional division. In order to ensure the viability of the GM(1,1) modelling approach, the necessary tests need to be carried out on the known data. Let the original data sequence be x^(0) = (x^(0)(1), x^(0)(2), ..., x^(0)(n)), where x^(0) is the original time-data sequence, and calculate the rank ratios of the series. The series can be modelled with GM(1,1) and grey predictions can be made if all ratios fall within the allowable coverage interval (e^(-2/(n+1)), e^(2/(n+1))); otherwise, apply an appropriate transformation to the data, e.g. the translation y^(0)(k) = x^(0)(k) + c, k = 1, 2, ..., n.
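The rank-ratio test and grey prediction just described can be sketched as follows. This is the standard GM(1,1) formulation (accumulated generating operation, background values, least-squares fit of Y = Bu) offered as an assumption of what the embodiment computes, not the patent's own code:

```python
import math

def gm11(x0, steps=1):
    """GM(1,1) sketch: rank-ratio test, least-squares fit, restored forecast."""
    n = len(x0)
    # Rank-ratio test: every lambda(k) must lie in (e^{-2/(n+1)}, e^{2/(n+1)})
    lo, hi = math.exp(-2 / (n + 1)), math.exp(2 / (n + 1))
    if not all(lo < x0[k - 1] / x0[k] < hi for k in range(1, n)):
        raise ValueError("series fails the rank-ratio test; translate the data first")
    # Accumulated generating operation (AGO) and background values z1
    x1 = [sum(x0[:k + 1]) for k in range(n)]
    z1 = [0.5 * (x1[k] + x1[k - 1]) for k in range(1, n)]
    # Least-squares solution of Y = B u for u = (a, b), with B = [[-z, 1], ...]
    m = n - 1
    sz, szz = sum(z1), sum(z * z for z in z1)
    sy = sum(x0[1:])
    szy = sum(z * y for z, y in zip(z1, x0[1:]))
    det = szz * m - sz * sz
    a = (sz * sy - m * szy) / det          # development coefficient
    b = (szz * sy - sz * szy) / det        # grey action quantity
    # Restored forecasts x0_hat(k+1) = (x0(1) - b/a)(1 - e^a) e^{-a k}
    c = x0[0] - b / a
    return [c * (1 - math.exp(a)) * math.exp(-a * k) for k in range(n, n + steps)]

print(gm11([100, 105, 110, 116, 122]))     # one-step-ahead forecast, roughly 128
```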
The model is used to derive the range of areas where events occur in the different records, and the planning forecasts collate the range of trends. For the range predicted by the GM, a judgement is first made as to whether a record lies within the range: if it does, the value is 1, otherwise 0. L(Location) ∈ {0, 1} is used to denote the geographic feature value of the record. For the temporal feature T, where event occurrence is uncertain, one-hot coding is used to effectively differentiate between different time periods as discrete values.
The first step is data feature extraction, normalised according to the four different features (text, image, time, location). The text data and image data provided by the public have a high degree of authenticity, whereas the time and geographical ranges are relatively broad and do not carry specific meaning, so the overall values need to be weighted, with the assigned weights summing to 1.
For the extracted probabilistic features F for text data and C for image data, the default value of 1/k is used. Since the events are independent of each other, the probability of k events occurring at the same moment is only 1/k, and the k time probabilities should sum to 1. For missing times and geographical locations, the average of the preceding and following records can be used to fill the gaps.
For multi-characteristic events, which are inherently random, entropy values are calculated to determine the degree of randomness and disorder of the event and to judge the degree of dispersion of each indicator; the greater the dispersion of an indicator, the greater its influence on the overall evaluation.
The eigenvector X = {x_1, ..., x_n} first requires a Z-score normalisation of the values between the individual matrices, z_j = (x_j − μ)/σ, where μ and σ denote the mean and standard deviation of the feature vector X.
Information entropy is used to calculate the entropy weight of each indicator: H_j = −k Σ_{i=1..m} y_ij ln y_ij (j = 1, 2, ..., n), where H_j denotes the information entropy of the j-th indicator.
In order to ensure that 0 ≤ H_j ≤ 1, it is common to take k = 1/ln(m) before calculating the deviation degree d_j = 1 − H_j for each indicator.
The normalised eigenvector matrix is multiplied by the indicator weights w_j to obtain the weighted multi-feature evaluation matrix V.
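The entropy weighting steps above (column-normalise, compute H_j with k = 1/ln m, derive deviation degrees, then normalise the weights to sum to 1) can be sketched as follows; this assumes the Z-scored values have already been shifted positive so the logarithm is defined:

```python
import math

def entropy_weights(matrix):
    """Entropy weight method sketch over an m-by-n indicator matrix
    (m records, n indicators): H_j = -k * sum_i y_ij ln y_ij, k = 1/ln(m),
    d_j = 1 - H_j, weights proportional to d_j."""
    m = len(matrix)
    k = 1.0 / math.log(m)
    weights = []
    for col in zip(*matrix):
        total = sum(col)
        y = [v / total for v in col]                        # column-normalised shares
        h = -k * sum(v * math.log(v) for v in y if v > 0)   # information entropy H_j
        weights.append(1.0 - h)                             # deviation degree d_j
    s = sum(weights)
    return [w / s for w in weights]                         # weights sum to 1

# A constant indicator has maximal entropy, so it receives (near) zero weight
w = entropy_weights([[1.0, 9.0], [1.0, 1.0], [1.0, 2.0]])
```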
As a larger number of feature vectors is provided, there is no need to use an overly complex algorithm in the final multi-feature fusion classification, since probability statistics for the first round of predictive classification have already been computed on the data. A traditional SVM classification model is sufficient.
In the SVM classification problem we are given the input data X = {X_1, ..., X_N} and the learning targets Y = {y_1, ..., y_N}.
The input data here are X = [F, C, L, T]. Put simply, F, C, L and T mean that each sample of the input data contains multiple features, thus forming the feature space: X = [X_1, ..., X_n] ∈ 𝒳.
The learning targets are binary variables representing the negative class and the positive class. If such a feature space X exists for the input data, a hyperplane is sought as the decision boundary, w^T X + b = 0, which separates the positive and negative learning targets and makes the point-to-plane distance of every sample satisfy y_i (w^T X_i + b) ≥ 1, where the parameters w and b are the normal vector and intercept of the hyperplane, respectively.
The decision boundary satisfying this condition in effect constructs two parallel hyperplanes as interval boundaries to discriminate the classification of a sample:

    w^T X_i + b ≥ +1  ⇒  y_i = +1  (positive sample)
    w^T X_i + b ≤ −1  ⇒  y_i = −1  (negative sample)
All samples above the upper interval boundary belong to the positive class and those below the lower interval boundary belong to the negative class. The distance between the two interval boundaries, d = 2/‖w‖, is defined as the margin.
The positive and negative class samples lying on the interval boundaries are the support vectors.
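A hedged sketch of the soft-margin idea above: rather than a full SVM solver, sub-gradient descent on the hinge loss finds w and b satisfying y_i (w·X_i + b) ≥ 1 on separable data. The toy data and hyperparameters are illustrative assumptions, not from the patent:

```python
def train_linear_svm(X, y, epochs=200, lr=0.05, lam=0.01):
    """Sub-gradient descent on the soft-margin hinge loss.
    Labels must be +1 / -1; lam is the regularisation strength."""
    dim = len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:                       # inside the margin: hinge gradient step
                w = [wj + lr * (yi * xj - lam * wj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:                                # outside: only the regularisation shrink
                w = [wj * (1 - lr * lam) for wj in w]
    return w, b

def predict(w, b, x):
    """Sign of w.x + b: +1 for the positive class, -1 for the negative class."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Linearly separable toy data: positives around (2, 2), negatives around (-2, -2)
X = [[2, 2], [3, 2], [2, 3], [-2, -2], [-3, -2], [-2, -3]]
y = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(X, y)
assert all(predict(w, b, xi) == yi for xi, yi in zip(X, y))
```

The margin width of the fitted boundary is d = 2/‖w‖, as in the derivation above.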
The Fast-Text tool was used to classify the text data. The training set is first subjected to a data equalisation operation to obtain the probability distributions of the 100 training sets, which eventually peak in the course of continuous training. As shown in Fig 3, the probabilities converge to 0.9 or 0.1 from roughly the 10th round onwards; even with the uneven sample, Fast-Text does a good job of balancing the sample and judging the probabilistic events.
The sample data were shuffled for training, and the training results for the Asian invasive organism problem, shown in Figs 4 and 5, give a recall and accuracy of 94.6%.
In practice, the PyTorch framework is used to build the CNN models, with 70% of the dataset used for training and 30% for testing. Due to the uneven data sample, simple oversampling was performed, and the final trained model performed well on the test set. Several images were randomly selected and predicted using the trained model, as shown in Fig 6 (True values are the actual categories, Predictions are the model's predictions; Negative indicates that the image is an invasive Asian organism, Positive indicates that it is not).
Finally, the metrics for evaluating the training model are shown in Table 3 and Fig 7.
Table 3: Metrics for training models
Indicator Score
Precision 93.11%
Recall 100.00%
ACC 96.32%
F1-score 96.43%
AUC 99.99%
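The indicators in Table 3 (precision, recall, ACC, F1) can all be computed from the confusion-matrix counts, as sketched below; the toy labels are hypothetical, not the patent's test data:

```python
def binary_metrics(y_true, y_pred):
    """Precision, recall, accuracy and F1 for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    acc = (tp + tn) / len(y_true)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, acc, f1

p, r, a, f = binary_metrics([1, 1, 0, 0, 1], [1, 1, 0, 1, 1])
# tp=3, fp=1, fn=0, tn=1 -> precision 0.75, recall 1.0, accuracy 0.8
```

(AUC additionally requires the ranked prediction scores rather than hard labels.)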
Given the small number of invasive organism occurrences, the GM model is well suited to small amounts of incomplete information, calculating the range of variation for the time of invasive organism occurrence as well as the geographical location.
Take c so that all data columns fall within the allowable coverage. After calculating the grade ratio values, it was found that the grade ratio test values for both data series lie within the standard interval [0.857, 1.166], implying that the data are suitable for constructing a GM(1,1) model. After testing the data, the development coefficient a, the grey action b and the posterior error ratio C are calculated for the GM model, as shown in Table 4:
Table 4: Results of the model construction
Index                      Longitude    Latitude
Development coefficient a  0.0001       0.0001
Grey action b              -122.7586    48.9994
Posterior error ratio C    0.0468       0.5956
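The grade ratio test above can be sketched as follows. The admissible band (e^(-2/(n+1)), e^(2/(n+1))) is the standard GM(1,1) admissibility criterion, and the sample series is invented for illustration:

```python
import math

def grade_ratio_test(series):
    """Check that every grade (stepwise) ratio lambda(k) = x(k-1)/x(k)
    lies in the admissible band required before fitting GM(1,1)."""
    n = len(series)
    lo, hi = math.exp(-2.0 / (n + 1)), math.exp(2.0 / (n + 1))
    ratios = [series[k - 1] / series[k] for k in range(1, n)]
    return all(lo < r < hi for r in ratios), (lo, hi)

# Invented, slowly varying series: all ratios fall inside the band.
ok, band = grade_ratio_test([100.0, 102.0, 105.0, 103.0, 106.0])
print(ok)
```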
The posterior error ratio C values for both models are less than 0.65, and the longitude model's value of 0.0468 is below 0.35, indicating that the longitude model is particularly good. Forecasts are therefore made for latitude and longitude, and the residuals are tested after prediction, including relative errors and grade deviations. For both the longitude and latitude data, the maximum relative errors are less than 0.1, and the grade deviation values are also less than 0.1, meaning that the stricter requirements are met and the model fit is at a high level; the relevant ranges are plotted according to their geographical location, as shown in Fig 8.
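A minimal GM(1,1) sketch, assuming the standard formulation (first-order accumulation, background values, least-squares estimates of the development coefficient a and grey action b). It illustrates the grey model described above and is not the patent's exact implementation:

```python
import math

def gm11_forecast(x0, steps=1):
    """Fit GM(1,1) to series x0 and extrapolate `steps` values beyond it."""
    n = len(x0)
    x1 = [sum(x0[:i + 1]) for i in range(n)]                 # 1-AGO accumulation
    z = [0.5 * (x1[k] + x1[k - 1]) for k in range(1, n)]     # background values
    y = x0[1:]
    # Least squares for x0(k) + a*z(k) = b via the 2x2 normal equations
    szz = sum(v * v for v in z)
    sz, sy = sum(z), sum(y)
    szy = sum(zv * yv for zv, yv in zip(z, y))
    det = szz * (n - 1) - sz * sz
    a = (sz * sy - (n - 1) * szy) / det      # development coefficient
    b = (szz * sy - sz * szy) / det          # grey action quantity
    # Time response: x1_hat(k+1) = (x0(1) - b/a) e^{-ak} + b/a, then de-accumulate
    x1_hat = [(x0[0] - b / a) * math.exp(-a * k) + b / a for k in range(n + steps)]
    x0_hat = [x1_hat[0]] + [x1_hat[k] - x1_hat[k - 1] for k in range(1, n + steps)]
    return a, b, x0_hat

# Invented, slowly growing series; forecast one step ahead.
a, b, fit = gm11_forecast([100.0, 103.0, 106.1, 109.3, 112.6], steps=1)
print(round(a, 4), round(fit[5], 1))
```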
After predicting the latitude and longitude, the difference between the latitude and longitude of two points is used to calculate the distance between them. The goodness of fit of the two models was: latitude R-squared 71.45%, longitude R-squared 95.31%. As Figure 8 shows, the samples verified as true invasive Asian organisms fall within the latitude range [48.7775, 49.1494] and the longitude range [-123.9431, -122.4186].
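The patent does not state which distance formula it uses; a common choice for converting latitude/longitude differences into a physical distance is the haversine great-circle formula, sketched here on the corners of the verified range:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    h = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(h))

# Opposite corners of the verified latitude/longitude range from Fig 8
d = haversine_km(48.7775, -123.9431, 49.1494, -122.4186)
print(round(d, 1))
```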
For the extracted probabilistic features, F for text data and C for image data, the default value 1/k is used: since the events are mutually independent, the probability of k events occurring at the same moment is only 1/k. For the range predicted by the GM model, a judgement is first made as to whether the record falls within the range: the geographic feature L(Location) of the record is 1 if it does and 0 if it does not. For the temporal feature T, given the uncertainty of event occurrence, one-hot coding is used to differentiate effectively between the discrete values of different time periods.
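One-hot coding of a time period can be sketched as below; the quarterly granularity is an assumption for illustration, since the patent does not fix the period length:

```python
def one_hot_period(month, periods=("Q1", "Q2", "Q3", "Q4")):
    """One-hot encode a month (1-12) into a quarterly time-period vector,
    the kind of discrete temporal feature T described above."""
    idx = (month - 1) // 3
    return [1 if i == idx else 0 for i in range(len(periods))]

print(one_hot_period(7))  # July falls in Q3 -> [0, 0, 1, 0]
```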
Correlation weights need to be determined between the features; the weight assigned to each feature is calculated using the entropy weighting method, as shown in Table 5:
Table 5: Table for the distribution of weights for different features
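The entropy weighting method referred to above can be sketched as follows. This is the standard formulation (features whose values vary more across samples carry more information and receive larger weights); the sample matrix is invented for illustration:

```python
import math

def entropy_weights(matrix):
    """Entropy weighting over a samples-by-features matrix of
    non-negative values: lower column entropy -> larger weight."""
    n = len(matrix)
    m = len(matrix[0])
    entropies = []
    for j in range(m):
        col = [row[j] for row in matrix]
        total = sum(col)
        p = [v / total for v in col]
        # Normalised Shannon entropy of the column
        e = -sum(pi * math.log(pi) for pi in p if pi > 0) / math.log(n)
        entropies.append(e)
    d = [1 - e for e in entropies]       # degree of divergence per feature
    return [di / sum(d) for di in d]     # weights normalised to sum to 1

# Invented matrix: the second feature varies far more than the first.
w = entropy_weights([[0.9, 0.2], [0.8, 0.9], [0.7, 0.1]])
print([round(v, 3) for v in w])
```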
Several of the eigenvalues are linearly inseparable, i.e. the separating surface in the feature space is a hypersurface. Using a non-linear function, the non-linearly separable problem can be mapped from the original feature space into a higher-dimensional Hilbert space and thereby transformed into a linearly separable problem; linear regression is used to calculate Table 6.
Table 6: Linear regression calculation table
R2 < 0 was found, so the data may not have any linear relationship. For such multi-feature hyperplane problems, the kernel function converges well; a radial basis function kernel is therefore used for classification. SVM multi-feature fusion analysis was performed, and the results are highly accurate compared with the Fast-Text and CNN neural network models alone, and generalise better than predictions from individual data sources.
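The radial basis function kernel named above is K(x, y) = exp(-gamma * ||x - y||^2); a minimal sketch:

```python
import math

def rbf_kernel(x, y, gamma=1.0):
    """RBF kernel K(x, y) = exp(-gamma * ||x - y||^2), used to map the
    linearly inseparable features into a higher-dimensional space."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq)

print(rbf_kernel([1.0, 0.0], [1.0, 0.0]))  # identical points -> 1.0
print(rbf_kernel([1.0, 0.0], [0.0, 1.0]))  # squared distance 2 -> exp(-2)
```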
Training on 300 sets of data and evaluation of 521 sets of data for classification resulted in Table 7:
Table 7: Classification assessment table
The diagonal contains the numbers of correct predictions: across the 521 records, only one invasive organism was misidentified, while the remaining records were correctly identified as invasive, together with the extent of their occurrence. Predictions were then made for population data that had not yet been experimentally judged, and heat maps were plotted over time from the predicted data, as shown in Fig 9. Traces of invasive Asian organisms may still be present in parts of Washington in the second half of the year, and it may not be possible to eliminate this hazard in the short term.
A comprehensive evaluation of the multi-feature data fusion analysis algorithm, given in Table 8, shows that the algorithm combines multiple feature data sources well and makes reasonable judgements about different events.
Table 8: Comprehensive evaluation of multi-feature data fu- sion analysis algorithms
The classifiers in the binary classifier can also be replaced by methods such as random forests, logistic regression and neural networks.
The technical features of the above embodiments can be com- bined in any number of ways. For the sake of brevity, not all pos- sible combinations of the technical features of the above embodi- ments have been described, however, as long as these combinations of technical features are not contradictory, they should be con- sidered to be within the scope of the present specification.

Claims (7)

CLAIMS

1. A method for identifying biological invasions based on multi-source data fusion analysis, characterized in that it comprises the steps of: acquiring multi-source data sets containing invasive organism data and tagging the invasive organism data, said data sets comprising text data, image data, time data and geolocation data; classifying said text data and outputting a text probability matrix with markers; identifying the location of the invasive organism in the image, determining its boundary and size, and training an image probability matrix with markers for said image data; one-hot encoding said temporal data and constructing a time-space feature matrix from the encoded data together with said geolocation data; constructing a multi-feature vector from said text probability matrix, said image probability matrix and said time-space feature matrix; constructing a binary classifier by assigning weights to said multi-feature vectors and training the binary classifier using machine learning algorithms; feeding the data to be predicted into the binary classifier to obtain invasive organism data.

2. The method for identifying biological invasions based on multi-source data fusion analysis as claimed in claim 1, characterized in that classifying said text data specifically comprises: said text data is deactivated, Fast-Text is used to construct features with N-grams, and the text content is subjected to a sliding-window operation of size N in byte order, producing a sequence of byte fragments of length N; the resulting sequence is used as a candidate set of text features, important features are filtered out, and a text probability matrix with markers is output using Soft-Max.

3. The method for identifying biological invasions based on multi-source data fusion analysis as claimed in claim 1, characterized in that outputting a probability matrix with markers specifically comprises: the location of the invasive organism to be identified is determined for said image data by the CNN image recognition algorithm, the location is zoomed in on, the boundaries and the image size are determined, and an image probability matrix with markers is trained using the CNN.

4. The method for identifying biological invasions based on multi-source data fusion analysis as claimed in claim 1, characterized in that assigning weights to said multi-feature vectors and training a binary classifier using a machine learning algorithm specifically comprises: said multi-feature vectors are normalized, weights are assigned using the entropy weighting method, and a binary classifier is trained using the machine learning algorithm SVM.

5. The method for identifying biological invasions based on multi-source data fusion analysis as claimed in claim 1, characterized in that feeding the data to be predicted into a binary classifier to obtain invasive organism data specifically comprises: the data to be predicted is input and the final marker is produced using the SVM; when the output marker is 1, the data uploaded by the user is true for that location at that time, meaning that invasive species have been present there and must be addressed immediately.

6. The method for identifying biological invasions based on multi-source data fusion analysis as claimed in claim 1, characterized in that the method further comprises: using the GM model to predict future migration or reproduction patterns of invasive organisms from said time-space feature matrix.

7. The method for identifying biological invasions based on multi-source data fusion analysis as claimed in claim 1, characterized in that the classifiers in said binary classifier comprise: random forest, logistic regression, and neural network.
NL2034409A 2022-05-25 2023-03-23 A biological invasion identification method based on multi-source data fusion analysis NL2034409A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210575412.2A CN114943290B (en) 2022-05-25 2022-05-25 Biological intrusion recognition method based on multi-source data fusion analysis

Publications (1)

Publication Number Publication Date
NL2034409A true NL2034409A (en) 2023-05-19

Family

ID=82908603

Family Applications (2)

Application Number Title Priority Date Filing Date
NL2034214A NL2034214A (en) 2022-05-25 2023-02-23 A biological invasion identification method based on multi-source data fusion analysis
NL2034409A NL2034409A (en) 2022-05-25 2023-03-23 A biological invasion identification method based on multi-source data fusion analysis

Family Applications Before (1)

Application Number Title Priority Date Filing Date
NL2034214A NL2034214A (en) 2022-05-25 2023-02-23 A biological invasion identification method based on multi-source data fusion analysis

Country Status (2)

Country Link
CN (1) CN114943290B (en)
NL (2) NL2034214A (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117109664B (en) * 2023-10-20 2023-12-22 生态环境部华南环境科学研究所(生态环境部生态环境应急研究所) Wetland ecological environment monitoring device and system

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20110110895A (en) * 2010-04-02 2011-10-10 제주대학교 산학협력단 System for fusing realtime image and context data by using position and time information
US9652915B2 (en) * 2014-02-28 2017-05-16 Honeywell International Inc. System and method having biometric identification intrusion and access control
US10846308B2 (en) * 2016-07-27 2020-11-24 Anomalee Inc. Prioritized detection and classification of clusters of anomalous samples on high-dimensional continuous and mixed discrete/continuous feature spaces
CN107832718B (en) * 2017-11-13 2020-06-05 重庆工商大学 Finger vein anti-counterfeiting identification method and system based on self-encoder
CA2992333C (en) * 2018-01-19 2020-06-02 Nymi Inc. User access authorization system and method, and physiological user sensor and authentication device therefor
CN109165387A (en) * 2018-09-20 2019-01-08 南京信息工程大学 A kind of Chinese comment sentiment analysis method based on GRU neural network
CN109347863B (en) * 2018-11-21 2021-04-06 成都城电电力工程设计有限公司 Improved immune network abnormal behavior detection method
CN109934354A (en) * 2019-03-12 2019-06-25 北京信息科技大学 Abnormal deviation data examination method based on Active Learning
CN111046946B (en) * 2019-12-10 2021-03-02 昆明理工大学 Burma language image text recognition method based on CRNN
CN112990262B (en) * 2021-02-08 2022-11-22 内蒙古大学 Integrated solution system for monitoring and intelligent decision of grassland ecological data
CN113343770B (en) * 2021-05-12 2022-04-29 武汉大学 Face anti-counterfeiting method based on feature screening
CN113537355A (en) * 2021-07-19 2021-10-22 金鹏电子信息机器有限公司 Multi-element heterogeneous data semantic fusion method and system for security monitoring
CN113793405A (en) * 2021-09-15 2021-12-14 杭州睿胜软件有限公司 Method, computer system and storage medium for presenting distribution of plants
CN113822233B (en) * 2021-11-22 2022-03-22 青岛杰瑞工控技术有限公司 Method and system for tracking abnormal fishes cultured in deep sea

Also Published As

Publication number Publication date
NL2034214A (en) 2023-05-19
CN114943290B (en) 2023-08-08
CN114943290A (en) 2022-08-26
