CN114943290A - Biological invasion identification method based on multi-source data fusion analysis - Google Patents


Info

Publication number: CN114943290A (granted as CN114943290B)
Application number: CN202210575412.2A
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: data, picture, text, biological, invasive
Inventor: 陈碧云
Applicant and assignee: Yancheng Teachers University
Legal status: Granted; Active
Priority filings: NL2034214A, NL2034409A


Classifications

    • G06F18/253 Fusion techniques of extracted features
    • G06F16/29 Geographical information databases
    • G06F16/353 Clustering; classification of unstructured textual data into predefined classes
    • G06F18/214 Generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/24 Classification techniques
    • G06F18/2411 Classification based on the proximity to a decision surface, e.g. support vector machines
    • G06F18/2415 Classification based on parametric or probabilistic models
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G06N20/10 Machine learning using kernel methods, e.g. support vector machines [SVM]
    • G06N3/045 Neural networks: combinations of networks
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G06N3/047 Probabilistic or stochastic networks
    • G06N3/09 Supervised learning
    • G06V10/25 Determination of region of interest [ROI] or volume of interest [VOI]
    • G06V10/454 Integrating filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G06V10/764 Image or video recognition using classification, e.g. of video objects
    • G06V10/82 Image or video recognition using neural networks
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change

Abstract

The invention discloses a biological invasion identification method based on multi-source data fusion analysis, which comprises the following steps: acquiring a multi-source data set containing invasive biological data and marking that data, the data set comprising text data, picture data, time data and geographical location data; classifying the text data and outputting a marked text probability matrix; identifying the position of the invasive organism in each picture, determining its boundary and size, and training a marked picture probability matrix; one-hot encoding the time data and constructing a time-space feature matrix from the encoded data and the geographical position data; constructing a multi-feature vector from the text probability matrix, the picture probability matrix and the time-space feature matrix; assigning weights to the multi-feature vector and training a binary classifier with a machine learning algorithm; and inputting the data to be predicted into the binary classifier to obtain the invasive biological data.

Description

Biological invasion identification method based on multi-source data fusion analysis
Technical Field
The invention relates to the technical field of big data artificial intelligence, in particular to a biological invasion identification method based on multi-source data fusion analysis.
Background
With the acceleration of global development and changes in land-use patterns, biological invasion has become a worldwide ecological-safety problem. Studies have shown that the total cost of controlling invasions worldwide reached at least $1.288 trillion between 1970 and 2017, an annual average of about $26.8 billion, with no sign that the growth rate is slowing. Research in the field of biological invasion is still at a preliminary stage; in recent years, against the background of global ecological change, it has developed into a new field combining global change with sustainable ecological management. At present, the main measures for preventing and controlling biological invasion are: establishing a corresponding monitoring system to ascertain the type, quantity, distribution and impact of foreign species; strengthening publicity and education on the harm of biological invasion to raise social awareness of prevention; and actively exploring identification and prevention technologies for invasive foreign species so as to effectively curb the current spread of biological invasion. In short, accurate identification of foreign species is of great importance.
At present, artificial intelligence is becoming a new engine in the field of ecological resources. Research on species identification using artificial intelligence began relatively early; AI outperforms traditional classifiers in identifying plants, animals and specimens, and deep learning in particular is widely applied to species image identification. In plant classification (Lee et al., 2015, 2016), Mohanty et al. (2016) achieved image-based classification of 38 plant diseases using a deep learning method. Carranza-Rojas et al. (2017) classified thousands of species from specimen pictures using convolutional and transfer-learning neural networks. Beyond classifying individual images with a CNN, Taghavi et al. (2018) used an LSTM to classify phenotype and genotype from the features of CNN-extracted time-series images. Norouzzadeh et al. used deep learning to automatically identify animal categories and count individuals from camera-trap image data, realising the monitoring of animal populations, but identification accuracy remains low against complex environmental backgrounds. To address the low accuracy of monitoring-image identification caused by complex field backgrounds, the sounds emitted by animals have also been used as an important data source.
With advances in observation technology, species-monitoring systems have been continuously improved, and the capacity to acquire massive, heterogeneous multi-source data spanning long periods and large spatial scales has grown markedly. Research published in Science in 2017 by Gregory P. Asner and colleagues showed that, by integrating massive, high-precision hyperspectral and lidar data, the plant functional types of the entire Peruvian forest could be mapped and corresponding forest management and protection strategies proposed for each area, breaking through the limitation that structurally complex, highly biodiverse plant communities could not be accurately monitored. It is worth noting that questions of whether data structure, precision and the like are matched arise during the fusion of multi-source data. The acquired monitoring information comprises many different types of data, and how to use such multi-feature data for the rapid identification and intelligent diagnosis of foreign species, and for risk analysis and prediction based on them, is a problem well worth studying.
Disclosure of Invention
Research in this area has only recently been reported. Against this background, a biological invasion identification method based on multi-source data fusion analysis is proposed herein. First, a deep learning method is used to make a probabilistic pre-judgment on the data; then data weights are assigned based on the entropy weight method; finally, an SVM is used to make a comprehensive judgment on the multi-feature data. The invention takes the invasion of Washington State by the Asian giant hornet as an example to analyse and verify the practicality of the algorithm. The results show that the algorithm can be applied to the rapid identification and monitoring of species and can also predict how a species' distribution develops over time, providing a basis for formulating correspondingly reasonable and efficient protection and management measures.
The invention provides a biological invasion identification method based on multi-source data fusion analysis, which comprises the following steps:
acquiring a multi-source data set containing invasive biological data, and marking the invasive biological data; the data set includes: text data, picture data, time data, geographical location data.
And classifying the text data and outputting a text probability matrix with marks.
And identifying the position of an invasive organism in the picture from the picture data, determining the boundary and the size, and training a picture probability matrix with a mark.
And carrying out one-hot coding on the time data, and constructing a time-space characteristic matrix through the coded data and the geographical position data.
Constructing a multi-feature vector according to the text probability matrix, the picture probability matrix and the time-space feature matrix; and carrying out weight distribution on the multi-feature vectors, and training a binary classifier by using a machine learning algorithm.
And inputting the data to be predicted into a binary classifier to obtain the invasive biological data.
Further, classifying the text data specifically includes: removing stop words from the Text data, constructing N-gram characteristics by using Fast-Text, performing N-size sliding window operation on the Text content according to the byte sequence, finally forming a byte fragment sequence with the length of N, taking the generated sequence as a Text characteristic candidate set, screening out important characteristics, and outputting a Text probability matrix with marks by using Soft-Max.
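The N-gram construction described above can be sketched in a few lines; this is a minimal illustration of character n-grams with the '<' prefix and '>' suffix markers, not the patented implementation, and the function name is our assumption:

```python
def char_ngrams(word, n=3):
    """Character n-grams of a word, padded with the '<' prefix and '>'
    suffix markers used in the Fast-Text feature construction."""
    padded = "<" + word + ">"
    # Slide a window of size n over the byte sequence of the word
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("bee"))  # ['<be', 'bee', 'ee>']
```

In practice Fast-Text hashes such fragments into a fixed-size vocabulary and sums their embeddings together with the word's own vector before the Soft-Max layer.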
Further, the training of the labeled picture probability matrix specifically includes: and determining the position of the invasive creature to be identified by the picture data through a picture identification algorithm such as CNN, amplifying the position, determining the boundary and the picture size, and training a picture probability matrix with a mark by using the CNN.
Further, weight distribution is performed on the multi-feature vectors, and a machine learning algorithm is used for training a binary classifier, which specifically comprises: and standardizing the multi-feature vectors, carrying out weight distribution by using an entropy weight method, and training into a binary classifier by using a machine learning algorithm SVM.
Further, inputting the data to be predicted into the binary classifier to obtain the invasive biological data specifically comprises: inputting the data to be predicted and using the SVM for final marking; when the output mark is 1, the data uploaded by the user for that time period and place are judged true, indicating that the invasive species has appeared at that location and should be dealt with promptly.
Further, the biological intrusion identification method based on multi-source data fusion analysis further includes: and predicting the migration or reproduction rules of the future invasive organisms by using a GM model for the time-space characteristic matrix.
Further, the classifier in the binary classifier comprises: random forest, logistic regression, neural network.
Compared with the prior art, the biological invasion identification method based on multi-source data fusion analysis provided by the invention has the following beneficial effects:
The method fuses text data, picture data, time data and geographical position data; it can be applied to the rapid identification and monitoring of species, can also predict how a species' distribution develops over time, and provides a basis for formulating correspondingly reasonable and efficient protection and management measures.
Drawings
FIG. 1 is a Fast-Text flow chart;
FIG. 2 is a detailed flow chart from the fully connected layer to the output layer;
FIG. 3 is a probability distribution graph of 11-100 training sets;
FIG. 4 is a test set recall statistical chart;
FIG. 5 is a test set accuracy statistical chart;
FIG. 6 is a graph of stochastic prediction of a training model;
FIG. 7 is a graph of the variability index for the training model;
FIG. 8 is a schematic diagram of the actual geographic location of the training model;
FIG. 9 is a thermal prediction diagram;
FIG. 10 is an overall flow chart of the biological invasion identification method based on multi-source data fusion analysis.
Detailed Description
The following further describes embodiments of the present invention with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.
Example 1
The invention provides a biological invasion identification method based on multi-source data fusion analysis, which, as shown in the overall flow chart of FIG. 10, comprises the following steps:
acquiring a multi-source data set containing invasive biological data, and marking the invasive biological data; the data set includes: text data, picture data, time data, geographical location data.
The data set is composed of text, pictures, time and geographical positions, which broadens its data range; it is divided into a training set and a test set so that comparison experiments can be carried out, making it easy to verify the technical effect of the algorithm.
Classifying the text data and outputting a text probability matrix with a mark; classifying the text data specifically comprises the following steps: and removing stop words from the Text data, constructing N-gram characteristics by using Fast-Text, performing N-size sliding window operation on the Text content according to the byte sequence, finally forming a byte fragment sequence with the length of N, using the generated sequence as a candidate set, screening out important characteristics, and outputting a Text probability matrix with a mark by using Soft-Max.
The comment data on invading organisms provided by the public can help the laboratory to judge, so the text data have a great influence on determining whether an invading organism is present. To preserve the morphological characteristics inside each word, feature extraction is performed on the word vector of each record: a sliding-window operation of size N is applied to the text content in byte order, finally forming byte-fragment sequences of length N, where '<' denotes a prefix and '>' denotes a suffix. A trigram composed with '<' and '>' can represent a word, and stacking 5 such vectors represents the word vector better still. The discrete variables are converted into continuous vectors through Embedding to form the word vector of the record, W_j = [w_{1j}, w_{2j}, ..., w_{ij}, ..., w_{nj}], where W_j denotes the word vector of the j-th record. FIG. 1 shows the Fast-Text flow chart: the Embedding-processed word vectors serve as input features, and the hidden layer superimposes and averages the word vectors. The negative log-likelihood is used as the loss function:
\[ -\frac{1}{N}\sum_{n=1}^{N} y_n \log\big(f(B A x_n)\big) \]
where N is the number of texts, x_n are the word features of the n-th text and y_n is its label; A and B are weight matrices, A converting the input to the text representation and B performing the linear transform that computes the class; and f is the Soft-Max function used to compute the probability of the final classification. Soft-Max adopts a hierarchical structure based on a Huffman tree to classify the text data and output the probability of each class.
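As a concrete illustration of the flat Soft-Max and the negative log-likelihood loss above (the hierarchical Huffman-tree variant only changes how the probabilities are evaluated), consider this minimal sketch; the function names are ours:

```python
import math

def soft_max(scores):
    """Convert raw class scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mean_nll(prob_rows, labels):
    """Mean negative log-likelihood over N records: -(1/N) * sum log p[y_n]."""
    n = len(prob_rows)
    return -sum(math.log(row[y]) for row, y in zip(prob_rows, labels)) / n

probs = soft_max([2.0, 1.0, 0.1])  # class probabilities for one record
loss = mean_nll([probs], [0])      # the record is labelled with class 0
```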
The expression for the hierarchical Soft-Max, in its standard Huffman-tree form, is

\[ p(\omega_c \mid \omega_t) = \prod_{j=1}^{L(\omega_c)-1} \sigma\Big( [\![\, n(\omega_c, j{+}1) = \operatorname{ch}(n(\omega_c, j)) \,]\!] \cdot v_{n(\omega_c, j)}^{\top} W \Big), \]

where p(\omega_c \mid \omega_t) is the final probability of the text and W denotes the word vector.
The probability matrix of the n records is finally obtained, as shown in Table 1, where T_n denotes the n records, L_k the k classes, and p_ij the probability of class j in the i-th record.

Table 1: Text data probability matrix

        L_1     L_2     ...     L_k
T_1     p_11    p_12    ...     p_1k
T_2     p_21    p_22    ...     p_2k
...     ...     ...     ...     ...
T_n     p_n1    p_n2    ...     p_nk
Identifying the position of an invasive organism in the picture from the picture data, determining the boundary and the size, and training a picture probability matrix with a mark; training a marked picture probability matrix, specifically comprising: determining the position of an invasive organism to be identified by the image data through an image identification algorithm CNN, amplifying the position, determining the boundary and the size of an image, and training an image probability matrix with a mark by utilizing the CNN.
The picture data uploaded by the public likewise strongly influence the laboratory's judgment, so feature extraction from the picture data is very important. The picture data are first pre-processed: data that are not pictures are deleted and suffix names corrected. A CNN convolutional neural network consists mainly of convolutional layers (CONV), pooling layers (POOL) and fully connected layers (FC); it performs image recognition by continuously extracting features, from local to global, through individual filters. Processing a picture with the CNN is divided into four stages: the input layer, the convolution and downsampling layers, the fully connected layer, and the output layer. The input layer takes an RGB colour image; the RGB components are convolved with the convolution-layer weights W to obtain the C layers, which are then downsampled to obtain the S layers. The outputs of these layers, after an activation function, are called Feature-Maps. The fully connected layer expands every element of all Feature-Maps in sequence and arranges them in a row, and the output layer classifies with Soft-Max.
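The convolution and downsampling stages above can be illustrated with a deliberately tiny pure-Python sketch (a real system would use a deep-learning framework; the function names and the toy input are our assumptions):

```python
def conv2d(image, kernel):
    """'Valid' 2D convolution, stride 1: produces a C-layer feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)] for i in range(out_h)]

def max_pool(fmap, size=2):
    """Non-overlapping max pooling: downsampling a C layer to an S layer."""
    h, w = len(fmap) // size, len(fmap[0]) // size
    return [[max(fmap[i * size + a][j * size + b]
                 for a in range(size) for b in range(size))
             for j in range(w)] for i in range(h)]

img = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
c_layer = conv2d(img, [[1, 0], [0, 1]])  # 3x3 feature map
s_layer = max_pool(c_layer)              # 1x1 after 2x2 pooling
```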
FIG. 2 is a detailed flow chart from the fully connected layer to the output layer, in which pictures of the Asian invader are used as input-layer data. Since the data set consists of RGB colour images, three separate 2D kernels are used to scale and greyscale each picture, quickly converting the 3-channel RGB colour image into a 1-channel greyscale image. After repeated convolution, pooling and activation, the features are extracted, and the probability that the Asian invading organism is present is output through the fully connected layer using a Soft-Max function.
Here X (height × width × channel) is the input pixel matrix and Y the output matrix. After convolution and pooling, the multidimensional data are flattened and connected to the fully connected layer, whose output is the class probability computed by the conventional Soft-Max. A T × L vector is obtained in which each value represents a probability for one input sample, yielding the classification probability of the image file, denoted by C.
The resulting probability matrix for the picture data is shown in Table 2, where q_ij denotes the probability of class j in the i-th record.

Table 2: Picture data probability matrix

        L_1     L_2     ...     L_k
T_1     q_11    q_12    ...     q_1k
T_2     q_21    q_22    ...     q_2k
...     ...     ...     ...     ...
T_n     q_n1    q_n2    ...     q_nk
The time data are one-hot encoded, and a time-space feature matrix is constructed from the encoded data and the geographical position data. A multi-feature vector is then constructed from the text probability matrix, the picture probability matrix and the time-space feature matrix; weights are assigned to the multi-feature vector, and a binary classifier is trained with a machine learning algorithm. Specifically, the multi-feature vectors are standardised, weight coefficients are assigned by the entropy weight method, and a binary classifier is trained with the machine learning algorithm SVM (support vector machine); alternative classifiers include random forests, logistic regression and neural networks.
The data to be predicted are then input into the binary classifier to obtain the invasive biological data: the SVM produces the final mark, and when the output mark is 1, the data uploaded by the user for that time period and place are judged true, indicating that the invasive species has appeared at that location and should be dealt with promptly.
The biological invasion identification method based on multi-source data fusion analysis further comprises: predicting the migration or reproduction patterns of future invading organisms by applying a GM model to the time-space feature matrix.
The data set comprises the sighting reports collated by the Washington State Department of Agriculture up to December 2020: a spreadsheet of 4440 sighting reports plus 3305 images uploaded by users. The sighting reports already judged by the laboratory are marked, the mark being 1 when the invading organism is confirmed and 0 otherwise. 70% of all data in the data set are randomly assigned to the training set and the remainder to the test set.
For the sighting reports provided by the public, each report is independent, and the information and feature values are discrete and unordered rather than continuous. One-hot encoding, also known as one-bit effective encoding, can be used to digitise the features: an N-bit status register encodes N states, each state having its own independent register bit, exactly one of which is active at any time.
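A minimal sketch of this one-hot encoding of the discrete time values (the helper name and the month example are our assumptions, not from the patent):

```python
def one_hot(state, states):
    """N-bit status register: one bit per state, exactly one bit active."""
    return [1 if state == s else 0 for s in states]

months = list(range(1, 13))
code = one_hot(12, months)  # a December sighting: only the 12th bit is set
```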
Because invading organisms are foreign species, occurrence events are few; the GM model has good applicability to small amounts of incomplete information, so the variation range of the invading organism is calculated from the time and geographic position of its occurrences. The GM(1,1) model can be expressed as Y = Bu, and the predicted sample range represents the regional division. To guarantee the feasibility of GM(1,1) modelling, the known data must first be verified. Let the original data sequence be x^{(0)} = (x^{(0)}(1), x^{(0)}(2), ..., x^{(0)}(n)) and compute its step ratios \lambda(k) = x^{(0)}(k-1)/x^{(0)}(k). If all step ratios fall within the admissible coverage

\[ \Theta = \left( e^{-2/(n+1)},\; e^{2/(n+1)} \right), \]

the sequence can be modelled with GM(1,1) and grey prediction performed. Otherwise an appropriate transformation is applied to the data, such as the translation y^{(0)}(k) = x^{(0)}(k) + c, k = 1, 2, ..., n.
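The step-ratio admissibility check above can be sketched as follows; the sample series values are purely illustrative:

```python
import math

def step_ratios(series):
    """Step ratios lambda(k) = x(k-1) / x(k) of an original data sequence."""
    return [series[k - 1] / series[k] for k in range(1, len(series))]

def gm11_admissible(series):
    """True if every step ratio lies in (e^(-2/(n+1)), e^(2/(n+1))),
    i.e. the sequence may be modelled with GM(1,1)."""
    n = len(series)
    lo, hi = math.exp(-2 / (n + 1)), math.exp(2 / (n + 1))
    return all(lo < r < hi for r in step_ratios(series))

sightings = [2.87, 3.28, 3.34, 3.62]   # a smooth series passes the test
ok = gm11_admissible(sightings)
shifted = [x + 10 for x in sightings]  # translation y(k) = x(k) + c if it fails
```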
Using this model, the regional range of event occurrence in the different records is obtained, and the trend range is planned, predicted and collated. For each record it is first determined whether its location lies within the range predicted by the GM model: the geographic feature takes the value 1 if it does and 0 otherwise, so the recorded geographic feature is L(location) ∈ {0, 1}. Because the time feature T is uncertain owing to when events occur, one-hot coding of the discrete values effectively distinguishes the different time periods.
First, data features are extracted and standardised according to the four different features (text, picture, time, position). The text and picture data provided by the public are highly authentic but span relatively wide time and geographic ranges and cannot represent specific meanings on their own, so weights must be assigned across the overall values. The assigned weights should sum to 1.
For the extracted text-data probability feature F and picture-data probability feature C, missing values default to 1/k: because the events are mutually independent, the probability of any one of k simultaneous events is only 1/k, and the k probabilities sum to 1. A missing time or geographic location is completed with the average of the records immediately above and below it.
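The neighbor-average completion of a missing time or location field can be sketched as follows (function name illustrative):

```python
def impute_neighbors(values):
    """Fill a missing entry (None) with the mean of the records directly
    above and below it, as done here for absent time / location fields."""
    out = list(values)
    for i, v in enumerate(out):
        if v is None and 0 < i < len(out) - 1:
            out[i] = (out[i - 1] + out[i + 1]) / 2.0
    return out

lats = impute_neighbors([48.80, None, 49.10])   # middle value becomes ~48.95
```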
A multi-feature event is essentially random. The entropy is computed to judge the randomness and disorder of an event and the dispersion of each index: the greater the dispersion of an index, the greater its influence on the comprehensive evaluation.
First, the feature matrix X = (X_11, …, X_Nj) is assembled, and the values in the matrices are Z-score normalized:

z_ij = (x_ij − μ) / σ,

where μ and σ are the mean and standard deviation over the feature vector X.
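A minimal sketch of the Z-score step (the population standard deviation is assumed here, since the text does not say which estimator is used):

```python
import math

def z_score(column):
    """Z-score normalize one feature column: z = (x - mu) / sigma."""
    mu = sum(column) / len(column)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in column) / len(column))
    return [(x - mu) / sigma for x in column]

z = z_score([1.0, 2.0, 3.0, 4.0, 5.0])   # zero mean, unit variance output
```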
The entropy weight of each index is calculated from the information entropy:

H_j = −k Σ_{i=1}^{m} p_ij ln p_ij, with k = 1/ln m,

where H_j is the information entropy of the j-th index and p_ij is the proportion of sample i under index j. Taking k = 1/(ln m) guarantees 0 ≤ H_j ≤ 1. The degree of deviation of each index is then d_j = 1 − H_j, and the corresponding index weight is w_j = d_j / Σ_j d_j. Multiplying the normalized feature matrix by each index weight w_j yields the weighted multi-feature evaluation matrix V.
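The entropy-weight computation can be sketched as follows (the proportions p_ij are taken column-wise over non-negative values; the sample matrix is illustrative):

```python
import math

def entropy_weights(matrix):
    """Entropy weight method: H_j = -(1/ln m) * sum_i p_ij ln p_ij,
    d_j = 1 - H_j, w_j = d_j / sum(d).  `matrix` holds m samples by
    n indices of non-negative values."""
    m, n = len(matrix), len(matrix[0])
    k = 1.0 / math.log(m)
    devs = []
    for j in range(n):
        col = [row[j] for row in matrix]
        total = sum(col)
        h = -k * sum((v / total) * math.log(v / total)
                     for v in col if v > 0)
        devs.append(1.0 - h)             # degree of deviation d_j
    s = sum(devs)
    return [d / s for d in devs]

# second column is far more dispersed, so it receives the larger weight
w = entropy_weights([[0.9, 0.2], [0.8, 0.7], [0.7, 0.1]])
```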
Since many feature vectors are available and the data have already passed a first round of probabilistic prediction and classification, the final multi-feature fusion classification does not require an overly complex algorithm; a conventional SVM classification model is sufficient.
In the SVM classification problem, input data X and learning targets Y are given:

X = {X_1, …, X_N},
Y = {Y_1, …, Y_N}.

Here X comprises the four features F, C, L and T. A brief explanation of FCLT: every sample of the input data contains multiple features, so the samples form a feature space χ: X = [X_1, …, X_N] ∈ χ, while each learning target is a binary variable representing the negative class or the positive class. If a hyperplane exists in the feature space χ in which the input data lie, it serves as a decision boundary,

w^T X + b = 0,

which separates the learning targets into a positive class and a negative class and makes the point-to-plane distance of every sample at least 1:

y_i (w^T X_i + b) ≥ 1,

where the parameters w and b are the normal vector and the intercept of the hyperplane, respectively.
A decision boundary satisfying this condition actually constructs two parallel hyperplanes as separating boundaries to discriminate the class of a sample:

w^T X_i + b ≥ +1 (positive samples),
w^T X_i + b ≤ −1 (negative samples).

All samples on or above the upper interval boundary belong to the positive class, and all samples on or below the lower interval boundary belong to the negative class. The distance d between the two interval boundaries is defined as the margin,

d = 2 / ‖w‖,

and the positive and negative class samples lying exactly on the interval boundaries are the support vectors.
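The margin and the decision rule follow directly from these definitions; in the sketch below, w, b and the sample points are arbitrary illustrative values:

```python
import math

def margin(w):
    """Margin between the two separating hyperplanes: d = 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

def classify(w, b, x):
    """The sign of w^T x + b decides the class of sample x."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

w, b = [3.0, 4.0], -1.0                 # ||w|| = 5, so the margin is 0.4
d = margin(w)
label = classify(w, b, [1.0, 1.0])      # w^T x + b = 6, positive class
```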
Text classification of the data is performed with the Fast-Text tool. The training set is first balanced, 100 probability distributions over the training set are obtained, and the peak value is finally reached in the course of continued training. As shown in FIG. 3, the probability approaches 0.9 or 0.1 from the 10th round onward; when the samples are uneven, Fast-Text helps balance them and makes the judgment of probability events reliable.
The sample data are shuffled before training; for the training results on the Asian invasive organism problem, shown in FIG. 4 and FIG. 5, the recall rate and the accuracy rate are both 94.6%.
In practice, the CNN model is built with the PyTorch framework, with 70% of the data set used for training and 30% for testing. Because the data samples are uneven, simple oversampling is performed. The final model performs well on the test set: several pictures were selected at random and predicted with the trained model, as shown in FIG. 6 (True value is the actual category, Prediction is the model's output; Negative indicates that the picture shows an Asian invasive organism, Positive indicates that it does not).
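The "simple oversampling" step can be sketched as naive random duplication of the minority class (a common reading of the term; the patent does not specify the exact scheme, and the data below are dummies):

```python
import random

def oversample(records, label_of):
    """Naive random oversampling: duplicate minority-class records until
    both classes are the same size (labels assumed binary 0/1)."""
    pos = [r for r in records if label_of(r) == 1]
    neg = [r for r in records if label_of(r) == 0]
    small, large = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    rng = random.Random(0)               # fixed seed for reproducibility
    padded = small + [rng.choice(small) for _ in range(len(large) - len(small))]
    return padded + large

data = [(i, 1) for i in range(3)] + [(i, 0) for i in range(7)]
balanced = oversample(data, lambda r: r[1])
```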
Finally, the metrics of the training model were evaluated, as shown in Table 3 and FIG. 7.
Table 3: index of training model
[Table 3 appears as an image in the original document.]
For the few recorded invasive organisms, the GM model again shows good applicability to small amounts of incomplete information, and the variation range is calculated from the times and geographic positions at which the invasive organisms appeared.
A constant c is taken so that all level ratios of the data columns fall within the admissible coverage; after the level ratios are calculated, the check values of both data columns lie within the standard interval [0.857, 1.166], meaning the data are suitable for GM(1,1) model construction. After this check, the development coefficient a, the grey action quantity b and the posterior error ratio C are calculated for the GM model, as shown in Table 4:
table 4: results of model construction
[Table 4 appears as an image in the original document.]
The posterior error ratio C of both models is less than 0.65, and that of the longitude model is only 0.0468, below 0.35, indicating that the longitude model is particularly good. The latitude and longitude are therefore predicted, and after the prediction the residuals, including the relative errors and the level-ratio deviations, are tested. The maximum relative error of both the longitude and the latitude series is below 0.1, and both level-ratio deviations are also below 0.1, meaning the model fit meets the stricter requirement. The relevant range is plotted according to the geographic locations, as shown in FIG. 8.
After the longitude and latitude are predicted, the distance between two points is calculated from the difference between their longitudes and latitudes, and the goodness of fit of the two models is computed: latitude R² = 71.45%, longitude R² = 95.31%. As is readily seen from FIG. 8, the samples verified as true Asian invasive organisms fall in the latitude range [48.7775, 49.1494] and the longitude range [−123.9431, −122.4186].
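The distance between two predicted points can be computed from their latitude/longitude difference with, for example, the haversine formula (the patent does not name the formula it uses; the corner points below are the range just quoted):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    r = 6371.0                                   # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2 +
         math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# opposite corners of the predicted range quoted in the text
d = haversine_km(48.7775, -123.9431, 49.1494, -122.4186)
```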
For the extracted text-data probability feature F and picture-data probability feature C, missing values default to 1/k: since the events are independent, the probability of any one of k simultaneous events is only 1/k. For the range predicted by the GM model, it is first determined whether a record falls inside that range: the value is 1 if it does and 0 if it does not, so the recorded geographic feature is L(location) ∈ {0, 1}. Because the time feature T carries the uncertainty of event occurrence, one-hot coding is used to effectively distinguish the discrete values of different time periods.
The relevant weight needs to be determined among the characteristics, and the weight distribution of each characteristic is calculated by using an entropy weight method, as shown in table 5:
table 5: different characteristic weight distribution table
Longitude 0.048440
Latitude 0.026739
Text 0.290734
Image 0.258996
LabText 0.157505
Year 0.115133
Month 0.102453
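As a check, the Table 5 weights sum to 1, as the entropy-weight method requires, and a row of the weighted evaluation matrix V is simply the elementwise product of the normalized features and these weights:

```python
# entropy weights from Table 5
WEIGHTS = {
    "Longitude": 0.048440, "Latitude": 0.026739, "Text": 0.290734,
    "Image": 0.258996, "LabText": 0.157505, "Year": 0.115133,
    "Month": 0.102453,
}

def weighted_features(features):
    """Multiply each normalized feature value by its entropy weight to
    build one row of the weighted evaluation matrix V."""
    return {k: features[k] * w for k, w in WEIGHTS.items()}

row = weighted_features({k: 1.0 for k in WEIGHTS})   # all-ones sample
```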
For these several feature values the feature space is linearly inseparable, that is, it contains a hypersurface. A nonlinear function maps the linearly inseparable problem from the original feature space to a higher-dimensional Hilbert space, converting it into a linearly separable problem. A linear regression calculation gives Table 6:
table 6: linear regression calculation table
MAE 0.08905792097395834
MSE 0.4136931156194732
R² -0.3932832269742397
Since R² < 0, the data may not have any linear relationship. For the hyperplane multi-feature problem, the kernel function converges well when a radial basis function kernel is used for classification. SVM multi-feature fusion analysis is then performed; the fused result attains higher accuracy than using the Fast-Text or CNN neural network model singly, and it generalizes better than predicting from only a single datum.
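The radial basis function kernel mentioned here has the standard form K(x, y) = exp(−γ‖x − y‖²); a minimal sketch, where γ is a free parameter and the values are illustrative:

```python
import math

def rbf_kernel(x, y, gamma=0.5):
    """Radial basis function kernel: K(x, y) = exp(-gamma * ||x - y||^2).
    Implicitly maps samples into an infinite-dimensional Hilbert space,
    which is what makes linearly inseparable features separable."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

k_same = rbf_kernel([1.0, 2.0], [1.0, 2.0])    # identical points -> 1.0
k_far = rbf_kernel([0.0, 0.0], [3.0, 4.0])     # distant points -> near 0
```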
Training on 300 groups of data and performing classification evaluation on 521 groups of data gives Table 7:
table 7: classification evaluation table
[Table 7 appears as an image in the original document.]
The diagonal holds the correctly predicted counts; among the 521 records only one invasive-organism record is misjudged, so the records of invasive organisms, and the ranges in which they occur, can be discriminated correctly. The citizen-supplied data not yet judged experimentally are then predicted, and a heat map is drawn from the predicted data and times, as shown in FIG. 9: traces of Asian invasive organisms may appear in the Washington sector during the next half year, and in the short term this risk may not be eliminated.
A comprehensive evaluation of the multi-feature data fusion analysis algorithm, shown in Table 8, indicates that the algorithm combines multi-feature data sources well and judges different events reasonably.
Table 8: comprehensive evaluation of multi-feature data fusion analysis algorithm
MSE 0.0007262164124909223
MAE 0.0007262164124909223
R 2 0.9970754396397927
ACC 0.9992737835875091
Recall 0.9992088607594937
F2 0.9992737396242461
ROC 0.9992088607594937
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; nevertheless, as long as a combination of technical features contains no contradiction, it should be considered within the scope of this specification.

Claims (7)

1. A biological intrusion identification method based on multi-source data fusion analysis is characterized by comprising the following steps:
acquiring a multi-source data set containing invasive biological data, and marking the invasive biological data; the data set includes: text data, picture data, time data, geographical location data;
classifying the text data and outputting a text probability matrix with a mark;
identifying the position of an invasive organism in the picture from the picture data, determining the boundary and the size, and training a picture probability matrix with a mark;
carrying out one-hot encoding on the time data, and constructing a time-space characteristic matrix through the encoded data and the geographical position data;
constructing a multi-feature vector according to the text probability matrix, the picture probability matrix and the time-space feature matrix; carrying out weight distribution on the multi-feature vector, and training a binary classifier by using a machine learning algorithm;
and inputting the data to be predicted into a binary classifier to obtain the invasive biological data.
2. The biological intrusion identification method based on multi-source data fusion analysis according to claim 1, characterized in that:
classifying the text data, specifically including:
removing stop words from the Text data; constructing N-gram features with Fast-Text by sliding a window of size N over the text content in byte order to form byte-fragment sequences of length N; taking the generated sequences as the text feature candidate set and screening out the important features; and outputting a marked text probability matrix with Soft-Max.
3. The biological intrusion identification method based on multi-source data fusion analysis according to claim 1, characterized in that:
the training marked picture probability matrix specifically comprises:
and determining the position of the invasive creature to be identified according to the picture data by using a picture identification algorithm (CNN), amplifying the position, determining the boundary and the picture size, and training a picture probability matrix with a mark by using the CNN.
4. The biological intrusion identification method based on multi-source data fusion analysis according to claim 1, characterized in that:
carrying out weight distribution on the multi-feature vector, and training a binary classifier by using a machine learning algorithm, wherein the method specifically comprises the following steps:
and standardizing the multi-feature vectors, carrying out weight distribution by using an entropy weight method, and training into a binary classifier by using a machine learning algorithm SVM.
5. The biological intrusion identification method based on multi-source data fusion analysis according to claim 1, characterized in that:
inputting data to be predicted into a binary classifier to obtain invasive biological data, which specifically comprises the following steps:
inputting the data needing prediction and using the SVM for final marking; when the output mark is 1, the data uploaded by the user at that time period and place are true, indicating that the invasive species has appeared at the place and should be handled in time.
6. The biological intrusion identification method based on multi-source data fusion analysis according to claim 1, further comprising:
and predicting the migration or reproduction rules of the future invasive organisms by using a GM model for the time-space characteristic matrix.
7. The biological intrusion identification method based on multi-source data fusion analysis according to claim 1, wherein:
the classifier in the binary classifier includes: random forest, logistic regression, neural networks.
CN202210575412.2A 2022-05-25 2022-05-25 Biological intrusion recognition method based on multi-source data fusion analysis Active CN114943290B (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN202210575412.2A CN114943290B (en) 2022-05-25 2022-05-25 Biological intrusion recognition method based on multi-source data fusion analysis
NL2034214A NL2034214A (en) 2022-05-25 2023-02-23 A biological invasion identification method based on multi-source data fusion analysis
NL2034409A NL2034409A (en) 2022-05-25 2023-03-23 A biological invasion identification method based on multi-source data fusion analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210575412.2A CN114943290B (en) 2022-05-25 2022-05-25 Biological intrusion recognition method based on multi-source data fusion analysis

Publications (2)

Publication Number Publication Date
CN114943290A true CN114943290A (en) 2022-08-26
CN114943290B CN114943290B (en) 2023-08-08

Family

ID=82908603

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210575412.2A Active CN114943290B (en) 2022-05-25 2022-05-25 Biological intrusion recognition method based on multi-source data fusion analysis

Country Status (2)

Country Link
CN (1) CN114943290B (en)
NL (2) NL2034214A (en)


Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20110110895A (en) * 2010-04-02 2011-10-10 제주대학교 산학협력단 System for fusing realtime image and context data by using position and time information
US20150248798A1 (en) * 2014-02-28 2015-09-03 Honeywell International Inc. System and method having biometric identification intrusion and access control
CN107832718A (en) * 2017-11-13 2018-03-23 重庆工商大学 Finger vena anti false authentication method and system based on self-encoding encoder
CN109165387A (en) * 2018-09-20 2019-01-08 南京信息工程大学 A kind of Chinese comment sentiment analysis method based on GRU neural network
CN109347863A (en) * 2018-11-21 2019-02-15 成都城电电力工程设计有限公司 A kind of improved immune Network anomalous behaviors detection method
US20190188212A1 (en) * 2016-07-27 2019-06-20 Anomalee Inc. Prioritized detection and classification of clusters of anomalous samples on high-dimensional continuous and mixed discrete/continuous feature spaces
CN109934354A (en) * 2019-03-12 2019-06-25 北京信息科技大学 Abnormal deviation data examination method based on Active Learning
CN111046946A (en) * 2019-12-10 2020-04-21 昆明理工大学 Burma language image text recognition method based on CRNN
US20200342086A1 (en) * 2018-01-19 2020-10-29 Nymi Inc. Live user authentication device, system and method
CN112990262A (en) * 2021-02-08 2021-06-18 内蒙古大学 Integrated solution system for monitoring and intelligent decision of grassland ecological data
CN113343770A (en) * 2021-05-12 2021-09-03 武汉大学 Face anti-counterfeiting method based on feature screening
CN113537355A (en) * 2021-07-19 2021-10-22 金鹏电子信息机器有限公司 Multi-element heterogeneous data semantic fusion method and system for security monitoring
CN113793405A (en) * 2021-09-15 2021-12-14 杭州睿胜软件有限公司 Method, computer system and storage medium for presenting distribution of plants
CN113822233A (en) * 2021-11-22 2021-12-21 青岛杰瑞工控技术有限公司 Method and system for tracking abnormal fishes cultured in deep sea


Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
KONRAD RIECK et al.: "Sally: A Tool for Embedding Strings in Vector Spaces", Journal of Machine Learning Research, vol. 13, no. 104, pages 3247-3251 *
SATHEESH NARAYANASAMI et al.: "Biological feature selection and classification techniques for intrusion detection on BAT", Wireless Personal Communications, vol. 127, pages 1763-1785 *
MENG Wei et al.: "Research on multi-source information fusion technology for marine organism detection and early warning at nuclear power plants", Journal of Dalian Ocean University, vol. 34, no. 06, pages 840-845 *
SHI Guangying: "Research on malicious web page detection technology based on the Active SVM algorithm", China Master's Theses Full-text Database (Information Science and Technology), no. 2014, pages 139-59 *
LI Mei et al.: "Research progress on invasion mechanisms of alien plants", Guangdong Agricultural Sciences, no. 02, pages 93-96 *
LU Qian: "Research on network intrusion detection system based on bionic algorithms", China Master's Theses Full-text Database (Information Science and Technology), no. 2019, pages 139-139 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117109664A (en) * 2023-10-20 2023-11-24 生态环境部华南环境科学研究所(生态环境部生态环境应急研究所) Wetland ecological environment monitoring device and system
CN117109664B (en) * 2023-10-20 2023-12-22 生态环境部华南环境科学研究所(生态环境部生态环境应急研究所) Wetland ecological environment monitoring device and system

Also Published As

Publication number Publication date
NL2034409A (en) 2023-05-19
NL2034214A (en) 2023-05-19
CN114943290B (en) 2023-08-08

Similar Documents

Publication Publication Date Title
Eastwood et al. A framework for the quantitative evaluation of disentangled representations
CN109034264B (en) CSP-CNN model for predicting severity of traffic accident and modeling method thereof
US11263528B2 (en) Neural network, computer readable medium, and methods including a method for training a neural network
CN109302410B (en) Method and system for detecting abnormal behavior of internal user and computer storage medium
CN110929029A (en) Text classification method and system based on graph convolution neural network
Widiyanto et al. Implementation of convolutional neural network method for classification of diseases in tomato leaves
Shi et al. Amur tiger stripes: Individual identification based on deep convolutional neural network
CN113761259A (en) Image processing method and device and computer equipment
US20200364549A1 (en) Predicting optical fiber manufacturing performance using neural network
CN115471739A (en) Cross-domain remote sensing scene classification and retrieval method based on self-supervision contrast learning
CN111477328B (en) Non-contact psychological state prediction method
CN114943290B (en) Biological intrusion recognition method based on multi-source data fusion analysis
CN114547365A (en) Image retrieval method and device
CN114943859A (en) Task correlation metric learning method and device for small sample image classification
Hantak et al. Computer vision for assessing species color pattern variation from web-based community science images
WO2022134104A1 (en) Systems and methods for image-to-video re-identification
CN116304941A (en) Ocean data quality control method and device based on multi-model combination
CN114049966B (en) Food-borne disease outbreak identification method and system based on link prediction
CN116028803A (en) Unbalancing method based on sensitive attribute rebalancing
Ade Students performance prediction using hybrid classifier technique in incremental learning
CN110265151B (en) Learning method based on heterogeneous temporal data in EHR
CN114202671A (en) Image prediction optimization processing method and device
CN109308565B (en) Crowd performance grade identification method and device, storage medium and computer equipment
Taubert et al. Species Prediction based on Environmental Variables using Machine Learning Techniques.
CN112364193A (en) Image retrieval-oriented method for fusing multilayer characteristic deep neural network model

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20220826

Assignee: Yancheng Guzhuo Technology Co.,Ltd.

Assignor: YANCHENG TEACHERS University

Contract record no.: X2024980003605

Denomination of invention: A biological intrusion recognition method based on multi-source data fusion analysis

Granted publication date: 20230808

License type: Common License

Record date: 20240328