CN114943290B - Biological intrusion recognition method based on multi-source data fusion analysis - Google Patents


Info

Publication number
CN114943290B
Authority
CN
China
Prior art keywords
data
picture
text
biological
time
Prior art date
Legal status
Active
Application number
CN202210575412.2A
Other languages
Chinese (zh)
Other versions
CN114943290A (en)
Inventor
陈碧云
Current Assignee
Yancheng Teachers University
Original Assignee
Yancheng Teachers University
Priority date
Filing date
Publication date
Application filed by Yancheng Teachers University
Priority to CN202210575412.2A
Publication of CN114943290A
Priority to NL2034214A
Priority to NL2034409A
Application granted
Publication of CN114943290B


Classifications

    • G06F18/253 — Fusion techniques of extracted features
    • G06F16/29 — Geographical information databases
    • G06F16/353 — Clustering; classification of unstructured textual data into predefined classes
    • G06F18/214 — Generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/24 — Classification techniques
    • G06F18/2411 — Classification based on the proximity to a decision surface, e.g. support vector machines
    • G06F18/2415 — Classification based on parametric or probabilistic models
    • G06F18/243 — Classification techniques relating to the number of classes
    • G06F18/24323 — Tree-organised classifiers
    • G06N20/10 — Machine learning using kernel methods, e.g. support vector machines [SVM]
    • G06N3/045 — Neural networks; combinations of networks
    • G06N3/0464 — Convolutional networks [CNN, ConvNet]
    • G06N3/047 — Probabilistic or stochastic networks
    • G06N3/09 — Supervised learning
    • G06V10/25 — Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V10/454 — Filters integrated into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G06V10/764 — Image or video recognition using classification
    • G06V10/82 — Image or video recognition using neural networks
    • Y02A90/10 — Information and communication technologies [ICT] supporting adaptation to climate change


Abstract

The invention discloses a biological intrusion recognition method based on multi-source data fusion analysis, comprising the following steps: acquire a multi-source data set containing invasive-species records (text data, picture data, time data, and geographic location data) and label the invasive-species records; classify the text data and output a labelled text probability matrix; locate the invading organism in each picture, determine its boundary and size, and train a labelled picture probability matrix; perform one-hot encoding on the time data and construct a spatio-temporal feature matrix from the encoded data and the geographic location data; build a multi-feature vector from the text probability matrix, the picture probability matrix, and the spatio-temporal feature matrix; assign weights to the multi-feature vector and train a binary classifier with a machine learning algorithm; finally, input the data to be predicted into the binary classifier to identify invasive-species records.

Description

Biological intrusion recognition method based on multi-source data fusion analysis
Technical Field
The invention relates to the technical field of big data and artificial intelligence, and in particular to a biological intrusion recognition method based on multi-source data fusion analysis.
Background
With accelerating globalization and changing land-use patterns, biological invasion has become a worldwide ecological security problem. Studies estimate that the total global cost of invasions from 1970 to 2017 reached at least US$1.288 trillion, averaging US$26.8 billion per year, with no sign that the growth rate is slowing. Research on biological invasion is still at an early stage; in recent years, against the backdrop of global ecological change, it has developed into a new field combining global-change science with sustainable ecological management. Current means of preventing and controlling biological invasion mainly include: establishing monitoring systems to determine the species, numbers, distribution, and roles of alien species; strengthening public education on the harm of biological invasion to raise societal awareness; and actively developing technology for identifying and intercepting alien invasive species to curb their spread. In summary, accurate identification of alien species is critical.
Artificial intelligence is becoming a new engine in the field of ecological resources. Research on species identification with AI started early and has surpassed traditional classifiers in identifying plants, animals, and specimens, and deep learning is now widely applied to species image recognition. In plant classification (Lee et al., 2015, 2016), Mohanty et al. (2016) achieved image-based classification of 38 plant diseases with deep learning. Carranza-Rojas et al. (2017) classified thousands of species from specimen pictures using convolutional networks and transfer learning. Beyond classifying individual images with a CNN, Taghavi et al. (2018) used an LSTM to classify phenotype and genotype from CNN features extracted from time-series images. Norouzzadeh et al. automatically identified animal species and counted individuals from camera-trap image data, enabling population monitoring, but recognition accuracy remains low against complex environmental backgrounds. To address the low accuracy of monitoring-image recognition caused by complex field backgrounds, the sounds emitted by animals have also been used as an important data source.
As observation technology advances, species monitoring systems continue to improve, and the capacity to acquire long-term, cross-scale, massive heterogeneous multi-source data has grown markedly. Research by G. P. Asner et al. published in Science performed functional classification of plants across the entire Peruvian forest by integrating massive, high-precision hyperspectral and lidar data, providing region-specific forest management and protection strategies and overcoming the previous inability to accurately monitor structurally complex, highly biodiverse plant communities. Notably, the fusion of multi-source data raises the question of whether data structures, precisions, and other properties match. The monitoring information obtained includes many different types of data; how to use such multi-feature data for rapid identification and intelligent diagnosis of alien species, and for risk analysis and forecasting based on it, is a highly valuable research problem.
Disclosure of Invention
Research in this area has rarely been reported so far. Against this background, a biological intrusion recognition method based on multi-source data fusion analysis is proposed. First, a deep learning method produces probability pre-judgments on the data; next, data weights are assigned with the entropy weight method; finally, an SVM makes the comprehensive judgment on the multi-feature data. Taking the Asian giant hornet invasion of Washington State as an example, the practicality of the algorithm is analyzed and verified. The results show that the algorithm can be applied to rapid identification and monitoring of species and can also predict how a species' distribution develops over time, providing a basis for formulating reasonable and efficient protection and management measures.
The invention provides a biological intrusion recognition method based on multi-source data fusion analysis, which comprises the following steps:
acquiring a multi-source data set containing the invasive biological data and marking the invasive biological data; the dataset comprises: text data, picture data, time data, geographic location data.
And classifying the text data and outputting a text probability matrix with marks.
And identifying the position of the invading organism in the picture according to the picture data, determining the boundary and the size, and training a picture probability matrix with marks.
Performing one-hot encoding on the time data, and constructing a spatio-temporal feature matrix from the encoded data and the geographic location data.
Constructing a multi-feature vector according to the text probability matrix, the picture probability matrix and the time-space feature matrix; and carrying out weight distribution on the multi-feature vectors, and training a binary classifier by using a machine learning algorithm.
Inputting the data to be predicted into a binary classifier to obtain the invasive biological data.
Further, classifying the text data specifically comprises: removing stop words from the text data; constructing N-gram features with Fast-Text by sliding a window of size N over the text content in byte order, forming byte-fragment sequences of length N; taking the generated sequences as the candidate set of text features and screening out the important ones; and outputting a labelled text probability matrix through Soft-Max.
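The sliding-window step above can be sketched in a few lines. The '<' prefix and '>' suffix markers follow the FastText subword convention; the function name and example word are illustrative only:

```python
def char_ngrams(word, n=3):
    """Slide a window of size n over a word wrapped in '<' (prefix)
    and '>' (suffix) markers, producing FastText-style subword
    n-gram features."""
    wrapped = "<" + word + ">"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]

# Example: the trigrams of "wasp"
print(char_ngrams("wasp"))  # ['<wa', 'was', 'asp', 'sp>']
```

In the method described here, the union of such fragments over a record would form the candidate feature set from which the important features are screened.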
Further, training the labelled picture probability matrix specifically comprises: locating the invasive organism to be identified in the picture with an image recognition algorithm such as a CNN, enlarging that region, determining its boundary and the picture size, and training a labelled picture probability matrix with the CNN.
Further, assigning weights to the multi-feature vector and training a binary classifier with a machine learning algorithm specifically comprises: normalizing the multi-feature vector, assigning weights with the entropy weight method, and training a binary classifier with the machine learning algorithm SVM.
Further, inputting the data to be predicted into the binary classifier to obtain invasive-species records specifically comprises: feeding in the data to be predicted and letting the SVM produce the final label; when the output label is 1, the data uploaded by the user at that location is judged genuine, indicating that an invasive species has appeared there and should be dealt with promptly.
Further, the biological intrusion recognition method based on multi-source data fusion analysis also comprises: applying a GM model to the spatio-temporal feature matrix to predict the migration or reproduction patterns of future invading organisms.
Further, the binary classifier may alternatively be implemented with a random forest, logistic regression, or a neural network.
Compared with the prior art, the biological intrusion recognition method based on multi-source data fusion analysis has the following beneficial effects:
the biological intrusion recognition method based on multi-source data fusion analysis fuses text data, picture data, time data and geographic position data, is applied to rapid recognition and monitoring of species, can also pre-judge the change development trend of the species along with time, and provides a basis for formulating corresponding reasonable and efficient protection and management measures.
Drawings
FIG. 1 is a Fast-Text flow diagram;
FIG. 2 is a detailed flow chart from the full connection layer to the output layer;
FIG. 3 is a graph of probability distribution of 11-100 training sets;
FIG. 4 is a statistical plot of test set recall;
FIG. 5 is a chart of test set accuracy statistics;
FIG. 6 is a random predictive graph of a training model;
FIG. 7 is a plot of variability index for a training model;
FIG. 8 is a schematic representation of the actual geographic location of a training model;
FIG. 9 is a thermal prediction map;
fig. 10 is a flowchart of a biological intrusion recognition method based on multi-source data fusion analysis.
Detailed Description
The following describes the embodiments of the present invention further with reference to the accompanying drawings. The following examples are only for more clearly illustrating the technical aspects of the present invention, and are not intended to limit the scope of the present invention.
Example 1
The invention provides a biological intrusion recognition method based on multi-source data fusion analysis, which is shown in an overall flow chart in fig. 10, and comprises the following steps:
acquiring a multi-source data set containing the invasive biological data and marking the invasive biological data; the data set includes: text data, picture data, time data, geographic location data.
The data set consists of text, pictures, time, and geographic positions, which broadens its data coverage; dividing it into a training set and a test set enables comparison experiments that verify the effectiveness of the algorithm.
Classify the text data and output a labelled text probability matrix. Specifically: remove stop words from the text data; construct N-gram features with Fast-Text by sliding a window of size N over the text content in byte order, forming byte-fragment sequences of length N; take the generated sequences as the candidate set, screen out the important features, and output a labelled text probability matrix with Soft-Max.
Comment data about invading organisms provided by the public helps laboratory judgment, so the text data strongly influences the decision of whether an invading organism is present. To preserve the morphological features within each word, feature extraction is performed on each record's word vectors. A sliding window of size N is run over the text content in byte order, forming byte fragments of length N, where '<' marks a prefix and '>' marks a suffix. A word can then be represented by the trigrams enclosed between '<' and '>', and stacking these vectors represents the word vector more faithfully. An Embedding layer converts the discrete variables into continuous vectors, giving the record's word vector $W_j = [w_{1j}, w_{2j}, \ldots, w_{ij}, \ldots, w_{nj}]$, where $W_j$ denotes the word vector of the j-th record. As shown in FIG. 1, the Fast-Text flow chart, the hidden layer takes the Embedding-processed word vectors as input features and averages them. The negative log-likelihood is used as the loss function: $-\frac{1}{N}\sum_{n=1}^{N} y_n \log\!\bigl(f(B A x_n)\bigr)$, where $N$ is the number of texts, $x_n$ the word features of the n-th text, $y_n$ its label, $A$ and $B$ weight matrices ($A$ maps words to the text representation, $B$ linearly transforms it into class scores), and $f$ the Soft-Max function computing the probability of the final class. The text data is classified with a hierarchical structure based on a Huffman tree, and the probability of each category is output.
The hierarchical Soft-Max expression is $p(\omega_{ct}) = \prod_{j=1}^{L(\omega)-1} \sigma\!\left(\pm\, v_{n(\omega,j)}^{\top} W\right)$, the product of sigmoid branch decisions along the Huffman-tree path, where $p(\omega_{ct})$ is the final probability of the text and $W$ represents the word vector.
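The flat Soft-Max used at the output layer can be sketched directly; the hierarchical Huffman-tree variant described above only changes how the probability is computed, not what it represents. A minimal, numerically stable sketch:

```python
import math

def softmax(scores):
    """Numerically stable softmax: shift by the max score, exponentiate,
    and normalize so the class probabilities sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Each record's class scores become one row of the text probability matrix.
probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])
```

Applying this to every record's score vector yields exactly the n × k probability matrix of Table 1.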
Finally, a probability matrix over the n records is obtained. As shown in Table 1, $T_i$ denotes the i-th of n records, $L_j$ the j-th of k categories, and $p_{ij}$ the probability of category j in the i-th record.
Table 1: text data probability matrix
Identify the position of the invading organism in the picture from the picture data, determine its boundary and size, and train a labelled picture probability matrix. Specifically: locate the invasive organism to be identified with the CNN image recognition algorithm, enlarge that region, determine the boundary and picture size, and train the labelled picture probability matrix with the CNN.
The image data uploaded by the public strongly influences laboratory judgment, so image feature extraction is essential. First the picture data is preprocessed: files that are not pictures are deleted and suffix names are corrected. A CNN consists mainly of convolutional layers (CONV), pooling layers (POOL), and fully connected layers (FC); it performs image recognition by extracting features from local to global through successive filters. Picture processing with the CNN proceeds through five stages: the input layer, convolution layers, downsampling layers, the fully connected layer, and the output layer. The input layer takes an RGB color image; the RGB components are convolved with the convolution-layer weights W to produce the C layers, which are then downsampled to produce the S layers. After the activation function, the outputs of these layers are called Feature-Maps. The fully connected layer flattens every element of all Feature-Maps into a single row, and classification is performed at the output layer with Soft-Max.
FIG. 2 shows the specific flow from the fully connected layer to the output layer. Asian invading organisms serve as the input-layer data. Since the data set consists of RGB color images, three separate 2D kernels scale and grayscale the pictures, quickly converting the 3-channel RGB images into 1-channel grayscale. After repeated convolution, pooling, and activation, features are extracted and passed through the fully connected layer, and a Soft-Max function outputs the probability that the picture shows an Asian invading organism.
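The RGB-to-grayscale reduction can be sketched as a per-pixel weighted sum. The patent only states that 3-channel RGB is reduced to 1-channel gray; the BT.601 luminance weights used below (0.299 R + 0.587 G + 0.114 B) are a common conventional choice, assumed here for illustration:

```python
def rgb_to_gray(image):
    """Convert an H x W x 3 nested-list RGB image to a 1-channel
    grayscale image using the common BT.601 luminance weights
    (an assumed choice; the source only specifies 3->1 channel)."""
    return [[0.299 * r + 0.587 * g + 0.114 * b
             for (r, g, b) in row]
            for row in image]

# White and black pixels map to the two ends of the gray scale.
gray = rgb_to_gray([[(255, 255, 255), (0, 0, 0)]])
print(gray)
```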
Here X (height × width × channels) is the input pixel matrix and Y the output matrix. After convolution and pooling, the multi-dimensional data is flattened and connected to the fully connected layer, whose output is the class probability computed by the conventional Soft-Max. A $T \times L$ matrix is obtained in which each value represents a class probability for an input sample. This yields the classification probability of the image file, denoted by C.
Finally, the probability matrix of the picture data is obtained, as shown in Table 2, where $q_{ij}$ represents the probability of category j in the i-th record.
Table 2: picture data probability matrix
Perform one-hot encoding on the time data and construct a spatio-temporal feature matrix from the encoded data and the geographic location data. Construct a multi-feature vector from the text probability matrix, the picture probability matrix, and the spatio-temporal feature matrix; assign weights to the multi-feature vector and train a binary classifier with a machine learning algorithm. Specifically: standardize the multi-feature vector, assign weight coefficients with the entropy weight method, and train a binary classifier with the machine learning algorithm SVM. The binary classifier may alternatively be a random forest, logistic regression, or neural network.
Input the data to be predicted into the binary classifier to obtain the invasive-species records. Specifically: feed in the data to be predicted and let the SVM produce the final label; when the output label is 1, the data uploaded by the user at that location is judged genuine, indicating that an invasive species has appeared there and should be dealt with promptly.
The biological intrusion recognition method based on multi-source data fusion analysis further comprises: applying a GM model to the spatio-temporal feature matrix to predict the migration or reproduction patterns of future invading organisms.
The data set comprises witness reports aggregated by the Washington State Department of Agriculture through December 2020: a spreadsheet of 4,440 witness reports and 3,305 images uploaded by users. Reports already adjudicated by the laboratory are labelled: records confirmed as invasive organisms are marked 1, otherwise 0. 70% of the data is randomly assigned to the training set and the remainder to the test set.
Each witness report provided by the public is independent of the others, and its information and feature values are not continuous but discrete and unordered. Such features can be digitized with one-hot encoding, also known as one-bit-effective encoding: N states are encoded with an N-bit state register, each state having its own register bit, and only one bit is valid at any time.
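The one-hot scheme just described can be sketched as follows (encoding months of the year is an illustrative choice for the time feature, not something the source specifies):

```python
def one_hot(value, categories):
    """Encode a discrete value as an N-bit vector with exactly one
    bit set, matching the one-hot register scheme described above."""
    return [1 if value == c else 0 for c in categories]

# Hypothetical example: encode a sighting month over the 12 months.
months = list(range(1, 13))
print(one_hot(3, months))
```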
Since the invading organism is an alien species, occurrence events may be few; the GM model has good applicability to small amounts of incomplete information, so it is used to estimate the variation range of the invasion's time and geographic location. The GM(1,1) model can be expressed as $Y = B\hat{u}$, and the prediction sample range represents the region division. To guarantee the feasibility of GM(1,1) modelling, the known data must first be verified. Let the original data sequence be $x^{(0)} = (x^{(0)}(1), x^{(0)}(2), \ldots, x^{(0)}(n))$, where $x^{(0)}$ is the raw time series, and compute its level ratios $\lambda(k) = x^{(0)}(k-1)/x^{(0)}(k)$. If all level ratios fall within the admissible band $\left(e^{-2/(n+1)},\; e^{2/(n+1)}\right)$, the sequence can be fitted with a GM(1,1) model and grey prediction can be made. Otherwise the data is transformed appropriately first, e.g. by translation: $y^{(0)}(k) = x^{(0)}(k) + c,\; k = 1, 2, \ldots, n$.
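The feasibility check can be sketched directly. The admissible band $(e^{-2/(n+1)}, e^{2/(n+1)})$ used below is the textbook quasi-smoothness condition for GM(1,1), assumed here because the original coverage expression was garbled:

```python
import math

def level_ratio_check(x):
    """GM(1,1) quasi-smoothness test: every level ratio
    lambda(k) = x[k-1] / x[k] must fall inside the band
    (exp(-2/(n+1)), exp(2/(n+1))); otherwise a translation
    y(k) = x(k) + c should be applied before modelling."""
    n = len(x)
    lo, hi = math.exp(-2.0 / (n + 1)), math.exp(2.0 / (n + 1))
    ratios = [x[k - 1] / x[k] for k in range(1, n)]
    return all(lo < r < hi for r in ratios)

print(level_ratio_check([1.0, 1.1, 1.2, 1.3]))  # a nearly flat series passes
```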
The model yields the regional range of event occurrence across the different records, from which trend ranges are planned, predicted, and collated. For the range predicted by GM, membership is first judged: 1 if the record falls within the range, 0 otherwise, so L (Location) ∈ {0, 1} represents the record's geographic feature value. Because event occurrence times are uncertain, the time feature T is one-hot encoded, which effectively distinguishes different discrete time periods.
First, data features are extracted and standardized according to four different feature types (text, picture, time, and position). Text and picture data provided by the public have high realism, but are relatively broad in time and geographic range and cannot by themselves represent a specific meaning, so the overall values need to be weighted. The assigned weights sum to 1.
For the extracted text-data probability feature F and picture-data probability feature C, missing values default to 1/k: since the events are mutually independent, the probability of each of k events occurring at the same moment is 1/k, and the k probabilities sum to 1. Missing time and geographic-location values are completed with the average of the neighbouring (preceding and following) records.
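The completion rule above (uniform 1/k defaults for probability features, neighbour averaging for time and position) might look like this; the record fields are hypothetical and boundary records are not handled:

```python
def impute(records, k):
    """Fill missing probability features with the uniform default 1/k and
    missing numeric fields (here latitude) with the mean of the neighbouring
    records, following the completion rule described in the text.
    Field names are illustrative; first/last-record gaps are not handled."""
    default = 1.0 / k
    out = []
    for i, r in enumerate(records):
        r = dict(r)
        if r.get("prob") is None:
            r["prob"] = default
        if r.get("lat") is None:
            r["lat"] = (records[i - 1]["lat"] + records[i + 1]["lat"]) / 2.0
        out.append(r)
    return out

rows = [{"prob": 0.9, "lat": 48.9},
        {"prob": None, "lat": None},
        {"prob": 0.8, "lat": 49.1}]
filled = impute(rows, k=4)
```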
For multi-feature events, entropy values are calculated to judge the randomness and disorder of the events and the degree of dispersion of each index; the larger an index's dispersion, the larger its influence on the comprehensive evaluation.
First, the feature vector X = {X_11, …, X_Nj} is Z-score normalized across the matrix: X'_ij = (X_ij − μ)/σ, where μ and σ are the mean and standard deviation of all values in the feature vector X.
The entropy weight of each index is calculated from the information entropy H_j = −k Σ_i p_ij ln p_ij, where H_j denotes the information entropy of the j-th index and p_ij the proportion of sample i under index j.
To ensure 0 ≤ H_j ≤ 1, k = 1/ln(m) is usually taken; the deviation degree of each index is then calculated as d_j = 1 − H_j, and the entropy weight as w_j = d_j / Σ_j d_j.
Multiplying the normalized eigenvector matrix by each index weight w_j gives the weighted multi-feature evaluation matrix V.
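A sketch of the entropy-weight computation; note it feeds column proportions p_ij into the logarithm, a common variant of the method since Z-scores can be negative, and the sample matrix is invented:

```python
import numpy as np

def entropy_weights(X):
    """Entropy-weight method. X is an (m samples x n indices) non-negative
    matrix; returns weights w_j that sum to 1, with larger weights for
    more dispersed (more informative) indices."""
    m, n = X.shape
    P = X / X.sum(axis=0)                 # p_ij, column-wise proportions
    k = 1.0 / np.log(m)                   # ensures 0 <= H_j <= 1
    with np.errstate(divide="ignore", invalid="ignore"):
        logs = np.where(P > 0, np.log(P), 0.0)
    H = -k * (P * logs).sum(axis=0)       # information entropy H_j
    d = 1.0 - H                           # deviation degree d_j
    return d / d.sum()                    # entropy weights w_j

X = np.array([[1.0, 9.0], [2.0, 1.0], [3.0, 9.0], [4.0, 1.0]])
w = entropy_weights(X)
V = (X / np.abs(X).max(axis=0)) * w       # weighted evaluation matrix V
```

The second column is far more dispersed than the first, so it receives the larger weight.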
Because the first round of predictive classification has already produced probability statistics for the data, no overly complex algorithm is needed for the final multi-feature fusion classification even though many feature vectors are provided; a conventional SVM classification model is sufficient.
In the SVM classification problem, the input data X and the learning objective Y are given:
X = {X_1, …, X_N},
Y = {Y_1, …, Y_N}.
Here the input data is X = (F, C, L, T).
A brief explanation for F, C, L, T: since each sample of the input data contains multiple features, the samples constitute a feature space: X = [X_1, …, X_N] ∈ 𝒳,
while the learning targets are binary variables representing the negative class and the positive class. If a hyperplane exists in the feature space containing the input data, it acts as the decision boundary w^T X + b = 0, separating the learning targets into positive and negative classes and making the distance of every sample from the plane satisfy y_i(w^T X_i + b) ≥ 1, where the parameters w and b are the normal vector and the intercept of the hyperplane, respectively.
A decision boundary meeting this condition in fact constructs two parallel hyperplanes as interval boundaries to discriminate the class of each sample:
positive samples: w^T X_i + b ≥ +1,
negative samples: w^T X_i + b ≤ −1.
All samples on or above the upper interval boundary belong to the positive class, and all samples on or below the lower interval boundary belong to the negative class. The distance between the two interval boundaries, d = 2/‖w‖, is defined as the margin.
The positive and negative samples lying exactly on the interval boundaries are the support vectors.
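A minimal illustration of these margin ideas, using sub-gradient descent on the hinge loss as a stand-in for a full SVM solver; the toy data and hyper-parameters are invented:

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=500, lr=0.1):
    """Primal linear SVM via sub-gradient descent on the hinge loss.
    A minimal stand-in for the conventional SVM solver; returns the
    hyperplane normal vector w and intercept b."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in range(n):
            if y[i] * (X[i] @ w + b) < 1:        # margin constraint violated
                w += lr * (y[i] * X[i] - lam * w)
                b += lr * y[i]
            else:                                 # only regularization shrink
                w -= lr * lam * w
    return w, b

# Toy linearly separable data: positives in the upper-right quadrant.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1, 1, -1, -1])
w, b = train_linear_svm(X, y)
margin = 2.0 / np.linalg.norm(w)  # distance between the two interval boundaries
```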
The data is text-classified using the Fast-Text tool. A data-equalization operation on the training set yields 100 training-set probability distributions, and the probabilities converge to a peak over the course of training.
As shown in FIG. 3, the probabilities tend toward 0.9 or 0.1 from roughly the 10th iteration; with uneven samples, Fast-Text helps perform sample equalization and evaluates probability events well.
Training is carried out on shuffled sample data; as shown in FIG. 4 and FIG. 5, both the recall and the accuracy of the trained model on the Asian invasive-organism problem are 94.6%.
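The character-level N-gram features that Fast-Text-style models build by sliding a window of size N over the text (as the claims describe) can be sketched as:

```python
def char_ngrams(text, n=3):
    """Slide a window of size n over the text in byte order, producing the
    byte-fragment sequence used as the text-feature candidate set."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

feats = char_ngrams("hornet", 3)
```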
In practice, the CNN model is built with the PyTorch framework, using 70% of the data set for training and 30% for testing. Because the data samples are uneven, simple oversampling is applied. The final model performs well on the test set: several pictures were selected at random and predicted with the trained model, as shown in FIG. 6 (the true value is the actual class, the prediction is the model's output; negative indicates the picture shows an Asian invasive organism, positive indicates it does not).
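A minimal PyTorch sketch in the spirit of the patent's picture classifier; the layer sizes, input resolution, and class count are assumptions, since the patent does not publish the architecture:

```python
import torch
import torch.nn as nn

class InvasionCNN(nn.Module):
    """Illustrative two-block CNN for binary picture classification.
    Architecture details are assumed, not taken from the patent."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # After two 2x pools, a 64x64 input becomes 32 channels of 16x16.
        self.classifier = nn.Linear(32 * 16 * 16, num_classes)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

model = InvasionCNN()
logits = model(torch.randn(4, 3, 64, 64))  # batch of 4 RGB 64x64 pictures
```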
Finally, the metrics of the training model were evaluated as shown in table 3 and fig. 7.
Table 3: training metrics for models
Because invasion occurrence events are few, the GM model, with its good applicability to small amounts of incomplete information, is used to calculate the variation range of the occurrence time and geographic location of invasive organisms.
A constant c is chosen so that the level ratios of the data column fall within the acceptable coverage; after calculation, the level-ratio test values of both data columns lie in the standard interval [0.857, 1.166], meaning the data is suitable for GM(1,1) model construction. After the data passes the check, the development coefficient a, the grey action quantity b, and the posterior error ratio C of the GM model are calculated, as shown in Table 4:
table 4: results of model construction
The posterior error ratio C of both models is less than 0.65, and the longitude model's value of only 0.0468 is below 0.35, indicating a particularly good longitude model. After predicting longitude and latitude, the residuals are checked, including the relative errors and the level-ratio deviations: for both longitude and latitude, the maximum relative error of the two data groups is less than 0.1, and the maximum level-ratio deviation is likewise less than 0.1, so the model fit meets the higher requirement. The relevant range is then drawn from the geographic locations, as shown in FIG. 8.
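The GM(1,1) fit that produces a, b, and the posterior error ratio C can be sketched as follows; the sample series is a textbook-style example, not the patent's data:

```python
import numpy as np

def gm11(x0):
    """Fit a GM(1,1) model by least squares: returns the development
    coefficient a, the grey action quantity b, and the posterior error
    ratio C (residual std over data std) used to grade the fit."""
    x0 = np.asarray(x0, dtype=float)
    x1 = np.cumsum(x0)                          # accumulated series
    z1 = 0.5 * (x1[1:] + x1[:-1])               # background values
    B = np.column_stack([-z1, np.ones(len(z1))])
    Y = x0[1:]
    (a, b), *_ = np.linalg.lstsq(B, Y, rcond=None)
    # Time-response function, then restore x0_hat by differencing x1_hat.
    k = np.arange(len(x0))
    x1_hat = (x0[0] - b / a) * np.exp(-a * k) + b / a
    x0_hat = np.concatenate([[x0[0]], np.diff(x1_hat)])
    resid = x0 - x0_hat
    C = resid.std() / x0.std()                  # posterior error ratio
    return a, b, C

a, b, C = gm11([2.874, 3.278, 3.337, 3.390, 3.679])
```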
After predicting longitude and latitude, the distance between two points is calculated from the difference between their longitudes and latitudes, and the goodness of fit of the two models is computed: latitude R² = 71.45%, longitude R² = 95.31%. As can readily be seen from FIG. 8, the samples verified as true Asian invasive organisms fall in the latitude range [48.7775, 49.1494] and the longitude range [−123.9431, −122.4186].
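One common way to turn a latitude/longitude difference into a distance is the haversine formula; the patent does not state which formula it uses, so this choice is an assumption. The sample points are the corners of the verified range above:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Diagonal of the verified Washington-area range from the text.
d = haversine_km(48.7775, -123.9431, 49.1494, -122.4186)
```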
For the extracted text-data probability feature F and picture-data probability feature C, missing values default to 1/k, since mutually independent events each occur at a given moment with probability 1/k. For the range predicted by GM, each record is judged against the range: L (Location) = 1 if within the range and 0 otherwise. Because the time characteristic T of an event is uncertain, one-hot encoding is used to distinguish the discrete time periods effectively.
The relevant weights among the features must then be determined; the weight of each feature is calculated with the entropy weight method, as shown in Table 5:
table 5: different characteristic weight distribution table
Longitude 0.048440
Latitude 0.026739
Text 0.290734
Image 0.258996
LabText 0.157505
Year 0.115133
Month 0.102453
For these feature values, because the decision boundary in the feature space is a hyperplane, the linearly inseparable problem is handled by mapping it with a nonlinear function from the original feature space into a higher-dimensional Hilbert space. A linear regression calculation is first performed, yielding Table 6:
table 6: linear regression calculation table
MAE 0.08905792097395834
MSE 0.4136931156194732
R² -0.3932832269742397
Since R² < 0, the data may not have any linear relationship. For such a hyperplane multi-feature problem, an SVM whose kernel function is a radial basis function (RBF) kernel has good convergence. SVM multi-feature fusion analysis is therefore performed; its results are more accurate than the single Fast-Text and CNN neural-network models, and more universal than the common approach of predicting from a single data source.
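The radial basis function kernel that performs the implicit mapping into the higher-dimensional Hilbert space is simply:

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    """RBF kernel K(x, z) = exp(-gamma * ||x - z||^2), which measures
    similarity after an implicit mapping into an infinite-dimensional
    Hilbert space; gamma controls how quickly similarity decays."""
    x, z = np.asarray(x, dtype=float), np.asarray(z, dtype=float)
    return np.exp(-gamma * np.sum((x - z) ** 2))

same = rbf_kernel([1.0, 2.0], [1.0, 2.0])  # identical points -> 1.0
far = rbf_kernel([1.0, 2.0], [5.0, 6.0])   # distant points decay toward 0
```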
Training was performed on 300 groups of data and classification evaluation on 521 groups, yielding Table 7:
table 7: classification evaluation table
The diagonal contains the correct predictions: among the 521 records, an invasive organism is misjudged only once, so the records of invasive organisms and their occurrence ranges can be determined accurately. The witness data not yet experimentally judged is then predicted, and a thermodynamic diagram is drawn from the predicted data and time, as shown in FIG. 9: parts of Washington may still carry traces of Asian invasive organisms in the second half of the year, and this hidden danger cannot be eliminated in the short term.
The multi-feature data fusion analysis algorithm is evaluated comprehensively; as can be seen from Table 8, the algorithm combines multi-feature data sources well and judges different events reasonably.
Table 8: comprehensive evaluation of multi-feature data fusion analysis algorithm
MSE 0.0007262164124909223
MAE 0.0007262164124909223
R² 0.9970754396397927
ACC 0.9992737835875091
Recall 0.9992088607594937
F2 0.9992737396242461
ROC 0.9992088607594937
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; nevertheless, as long as a combination of technical features contains no contradiction, it should be considered within the scope of this description.

Claims (6)

1. A biological intrusion recognition method based on multi-source data fusion analysis, characterized by comprising the following steps:
acquiring a multi-source data set containing the invasive biological data and marking the invasive biological data; the dataset comprises: text data, picture data, time data, geographic location data;
removing stop words from the text data, constructing N-gram features using Fast-Text by performing a sliding-window operation of size N over the text content in byte order to form byte-fragment sequences of length N, taking the generated sequences as the text-feature candidate set, screening out the important features, and outputting a marked text probability matrix using Soft-Max;
identifying the position of an invading organism in the picture according to the picture data, determining the boundary and the size, and training a picture probability matrix with marks;
performing one-hot encoding on the time data, and constructing a time-space feature matrix from the encoded data and the geographic location data;
constructing a multi-feature vector according to the text probability matrix, the picture probability matrix and the time-space feature matrix; performing weight distribution on the multi-feature vectors, and training a binary classifier by using a machine learning algorithm;
inputting the data to be predicted into a binary classifier to obtain the invasive biological data.
2. The method for identifying biological intrusion based on multi-source data fusion analysis according to claim 1, wherein the method comprises the steps of:
the training of the marked picture probability matrix specifically comprises the following steps:
and determining the position of the invasive organism to be identified according to the picture data through a picture identification algorithm CNN, amplifying the position, determining the boundary and the picture size, and training a picture probability matrix with marks by utilizing the CNN.
3. The method for identifying biological intrusion based on multi-source data fusion analysis according to claim 1, wherein the method comprises the steps of:
the multi-feature vector is subjected to weight distribution, and a binary classifier is trained by using a machine learning algorithm, and the method specifically comprises the following steps:
and normalizing the multi-feature vectors, performing weight distribution by using an entropy weight method, and training into a binary classifier by using a machine learning algorithm SVM.
4. The method for identifying biological intrusion based on multi-source data fusion analysis according to claim 1, wherein the method comprises the steps of:
inputting data to be predicted into a binary classifier to obtain invasive biological data, wherein the method specifically comprises the following steps of:
and inputting data to be predicted, using the SVM as a final mark, and when the output mark is 1, representing that the data uploaded by a user at the time and place represented by the output mark is true, representing that invasive species appear at the time, and timely processing the data.
5. The method for biological intrusion identification based on multi-source data fusion analysis of claim 1, further comprising:
and predicting migration or reproduction rules of future invading organisms by using a GM model for the time-space characteristic matrix.
6. The method for identifying biological intrusion based on multi-source data fusion analysis according to claim 1, wherein the method comprises the steps of:
the classifier in the binary classifier comprises: random forest, logistic regression, neural network.
CN202210575412.2A 2022-05-25 2022-05-25 Biological intrusion recognition method based on multi-source data fusion analysis Active CN114943290B (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN202210575412.2A CN114943290B (en) 2022-05-25 2022-05-25 Biological intrusion recognition method based on multi-source data fusion analysis
NL2034214A NL2034214A (en) 2022-05-25 2023-02-23 A biological invasion identification method based on multi-source data fusion analysis
NL2034409A NL2034409A (en) 2022-05-25 2023-03-23 A biological invasion identification method based on multi-source data fusion analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210575412.2A CN114943290B (en) 2022-05-25 2022-05-25 Biological intrusion recognition method based on multi-source data fusion analysis

Publications (2)

Publication Number Publication Date
CN114943290A CN114943290A (en) 2022-08-26
CN114943290B true CN114943290B (en) 2023-08-08

Family

ID=82908603

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210575412.2A Active CN114943290B (en) 2022-05-25 2022-05-25 Biological intrusion recognition method based on multi-source data fusion analysis

Country Status (2)

Country Link
CN (1) CN114943290B (en)
NL (2) NL2034214A (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117109664B * 2023-10-20 2023-12-22 South China Institute of Environmental Science, Ministry of Ecology and Environment (Ecological Environment Emergency Research Institute, MEE) Wetland ecological environment monitoring device and system

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20110110895A (en) * 2010-04-02 2011-10-10 Jeju National University Industry-Academic Cooperation Foundation System for fusing realtime image and context data by using position and time information
CN107832718A (en) * 2017-11-13 2018-03-23 重庆工商大学 Finger vena anti false authentication method and system based on self-encoding encoder
CN109165387A (en) * 2018-09-20 2019-01-08 南京信息工程大学 A kind of Chinese comment sentiment analysis method based on GRU neural network
CN109347863A (en) * 2018-11-21 2019-02-15 成都城电电力工程设计有限公司 A kind of improved immune Network anomalous behaviors detection method
CN109934354A (en) * 2019-03-12 2019-06-25 北京信息科技大学 Abnormal deviation data examination method based on Active Learning
CN111046946A (en) * 2019-12-10 2020-04-21 昆明理工大学 Burma language image text recognition method based on CRNN
CN112990262A (en) * 2021-02-08 2021-06-18 内蒙古大学 Integrated solution system for monitoring and intelligent decision of grassland ecological data
CN113343770A (en) * 2021-05-12 2021-09-03 武汉大学 Face anti-counterfeiting method based on feature screening
CN113537355A (en) * 2021-07-19 2021-10-22 金鹏电子信息机器有限公司 Multi-element heterogeneous data semantic fusion method and system for security monitoring
CN113793405A (en) * 2021-09-15 2021-12-14 杭州睿胜软件有限公司 Method, computer system and storage medium for presenting distribution of plants
CN113822233A (en) * 2021-11-22 2021-12-21 青岛杰瑞工控技术有限公司 Method and system for tracking abnormal fishes cultured in deep sea

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9652915B2 (en) * 2014-02-28 2017-05-16 Honeywell International Inc. System and method having biometric identification intrusion and access control
US10846308B2 (en) * 2016-07-27 2020-11-24 Anomalee Inc. Prioritized detection and classification of clusters of anomalous samples on high-dimensional continuous and mixed discrete/continuous feature spaces
CA2992333C (en) * 2018-01-19 2020-06-02 Nymi Inc. User access authorization system and method, and physiological user sensor and authentication device therefor


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research progress on invasion mechanisms of alien plants; Li Mei et al.; Guangdong Agricultural Sciences (No. 02); 93-96 *

Also Published As

Publication number Publication date
CN114943290A (en) 2022-08-26
NL2034409A (en) 2023-05-19
NL2034214A (en) 2023-05-19


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20220826

Assignee: Yancheng Guzhuo Technology Co.,Ltd.

Assignor: YANCHENG TEACHERS University

Contract record no.: X2024980003605

Denomination of invention: A biological intrusion recognition method based on multi-source data fusion analysis

Granted publication date: 20230808

License type: Common License

Record date: 20240328