CN114943290A - Biological invasion identification method based on multi-source data fusion analysis - Google Patents
- Publication number
- CN114943290A (application number CN202210575412.2A)
- Authority
- CN
- China
- Prior art keywords
- data
- picture
- text
- biological
- invasive
- Prior art date
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F18/253 — Pattern recognition; fusion techniques of extracted features
- G06F16/29 — Geographical information databases
- G06F16/353 — Clustering/classification of unstructured textual data into predefined classes
- G06F18/214 — Generating training patterns; bootstrap methods, e.g. bagging or boosting
- G06F18/24 — Classification techniques
- G06F18/2411 — Classification based on proximity to a decision surface, e.g. support vector machines
- G06F18/2415 — Classification based on parametric or probabilistic models
- G06F18/243 — Classification techniques relating to the number of classes
- G06F18/24323 — Tree-organised classifiers
- G06N20/10 — Machine learning using kernel methods, e.g. support vector machines [SVM]
- G06N3/045 — Neural networks; combinations of networks
- G06N3/0464 — Convolutional networks [CNN, ConvNet]
- G06N3/047 — Probabilistic or stochastic networks
- G06N3/09 — Supervised learning
- G06V10/25 — Determination of region of interest [ROI] or volume of interest [VOI]
- G06V10/454 — Hierarchical filter structures for local feature extraction, e.g. convolutional neural networks
- G06V10/764 — Image or video recognition using classification
- G06V10/82 — Image or video recognition using neural networks
- Y02A90/10 — Information and communication technologies supporting adaptation to climate change
Abstract
The invention discloses a biological invasion identification method based on multi-source data fusion analysis, comprising the following steps: acquiring a multi-source data set containing invasive-species data and labelling those records, the data set comprising text data, picture data, time data and geographical-location data; classifying the text data and outputting a labelled text probability matrix; locating the invasive organism in each picture, determining its boundary and size, and training a labelled picture probability matrix; one-hot encoding the time data and constructing a spatio-temporal feature matrix from the encoded data and the geographical-location data; building a multi-feature vector from the text probability matrix, the picture probability matrix and the spatio-temporal feature matrix; assigning weights to the multi-feature vector and training a binary classifier with a machine-learning algorithm; and feeding data to be predicted into the binary classifier to identify invasive-species records.
Description
Technical Field
The invention relates to the technical field of big data artificial intelligence, in particular to a biological invasion identification method based on multi-source data fusion analysis.
Background
With accelerating global development and changing land-use patterns, biological invasion has become a worldwide ecological-security problem. Studies estimate that the total cost of invasions worldwide reached at least US$1.288 trillion between 1970 and 2017, an annual average of roughly US$26.8 billion, with no sign that the growth rate is slowing. Research in the field of biological invasion is still at an early stage; in recent years, against the backdrop of global ecological change, it has developed into a new field combining global-change science with sustainable ecological management. Current measures for preventing and controlling biological invasion mainly include: establishing monitoring systems to determine the type, number, distribution and impact of alien species; strengthening public education on the harm of biological invasion to raise social awareness of prevention; and actively developing identification and prevention technologies for alien invasive species to curb their spread. In short, accurate identification of alien species is of great importance.
Artificial intelligence is becoming a new engine in the field of ecological resources. Research on species identification using AI began relatively early; AI-based classifiers outperform traditional ones in plant, animal and specimen identification, and deep learning is widely applied to species image recognition. In plant classification (Lee et al., 2015, 2016), Mohanty et al. (2016) achieved image-based classification of 38 plant diseases with deep learning. Carranza-Rojas et al. (2017) classified thousands of species from specimen pictures using convolutional neural networks and transfer learning. Beyond classifying individual images with a CNN, Taghavi et al. (2018) used an LSTM to classify phenotype and genotype from CNN-extracted features of time-series images. Norouzzadeh et al. used deep learning to automatically identify animal species and count individuals in camera-trap images, enabling population monitoring, although recognition accuracy remains low against complex backgrounds. To mitigate the low accuracy of image recognition caused by complex field environments, the sounds animals emit have also been used as an important data source.
With advances in observation technology, species-monitoring systems have improved continuously, and the capacity to acquire massive, heterogeneous multi-source data spanning long periods and large scales has grown markedly. A 2017 study in Science by Gregory P. Asner and colleagues mapped the plant functional types of the entire Peruvian forest by integrating massive, high-precision hyperspectral and lidar data, then proposed forest-management and protection strategies for each area, overcoming the previous inability to accurately monitor structurally complex, highly biodiverse plant communities. Notably, fusing multi-source data raises the question of whether data structures, precision and so on are compatible. The monitoring information collected comprises many different data types; how to use such multi-feature data for rapid identification and intelligent diagnosis of alien species, and for risk analysis and prediction, is a problem well worth studying.
Disclosure of Invention
Research on this topic has only recently been reported. Against this background, a biological invasion identification method based on multi-source data fusion analysis is proposed here. First, a deep-learning method makes a probabilistic pre-judgment on the data; then data weights are assigned by the entropy weight method; finally, an SVM makes a comprehensive judgment on the multi-feature data. The invention takes the Asian giant hornet invasion of Washington State as an example to analyse and verify the practicality of the algorithm. The results show that the algorithm can be applied to rapid identification and monitoring of species, and can also predict how a species will develop over time, providing a basis for formulating reasonable and efficient protection and management measures.
The invention provides a biological invasion identification method based on multi-source data fusion analysis, which comprises the following steps:
acquiring a multi-source data set containing invasive biological data, and marking the invasive biological data; the data set includes: text data, picture data, time data, geographical location data.
And classifying the text data and outputting a text probability matrix with marks.
And identifying the position of an invasive organism in the picture from the picture data, determining the boundary and the size, and training a picture probability matrix with a mark.
And carrying out one-hot coding on the time data, and constructing a time-space characteristic matrix through the coded data and the geographical position data.
Constructing a multi-feature vector according to the text probability matrix, the picture probability matrix and the time-space feature matrix; and carrying out weight distribution on the multi-feature vectors, and training a binary classifier by using a machine learning algorithm.
And inputting the data to be predicted into a binary classifier to obtain the invasive biological data.
Further, classifying the text data specifically comprises: removing stop words from the text data; constructing N-gram features with Fast-Text by running a sliding window of size N over the text content in byte order, yielding byte fragments of length N; taking the generated sequences as the candidate set of text features; screening out the important features; and outputting a labelled text probability matrix with Soft-Max.
Further, training the labelled picture probability matrix specifically comprises: locating the invasive organism to be identified in the picture data with a picture-recognition algorithm such as a CNN, enlarging that region, determining the boundary and picture size, and training a labelled picture probability matrix with the CNN.
Further, assigning weights to the multi-feature vector and training a binary classifier with a machine-learning algorithm specifically comprises: standardising the multi-feature vector, assigning weights by the entropy weight method, and training a binary classifier with the SVM machine-learning algorithm.
Further, inputting data to be predicted into the binary classifier to obtain invasive-species records specifically comprises: feeding in the data to be predicted and using the SVM to produce the final label; when the output label is 1, the data uploaded by the user for that time period and place is judged genuine, indicating that the invasive species has appeared there and should be dealt with promptly.
Further, the biological invasion identification method based on multi-source data fusion analysis also comprises: applying a GM model to the spatio-temporal feature matrix to predict the future migration or reproduction patterns of the invading organism.
Further, alternative classifiers for the binary classification include: random forest, logistic regression, neural network.
Compared with the prior art, the biological invasion identification method based on multi-source data fusion analysis provided by the invention has the following beneficial effects:
the biological intrusion identification method based on multi-source data fusion analysis fuses text data, picture data, time data and geographical position data, is applied to rapid identification and monitoring of species, can also predict the change development trend of the species along with time, and provides a basis for formulating corresponding reasonable and efficient protection and management measures.
Drawings
FIG. 1 is a Fast-Text flow chart;
FIG. 2 is a detailed flow chart from the fully connected layer to the output layer;
FIG. 3 is a probability distribution graph of 11-100 training sets;
FIG. 4 is a test set recall statistical chart;
FIG. 5 is a test set accuracy statistical chart;
FIG. 6 is a graph of stochastic prediction of a training model;
FIG. 7 is a graph of the variability index for the training model;
FIG. 8 is a schematic diagram of the actual geographic location of the training model;
FIG. 9 is a thermal prediction diagram;
fig. 10 is an overall flow chart of the biological intrusion identification method based on multi-source data fusion analysis.
Detailed Description
The following further describes embodiments of the present invention with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.
Example 1
The invention provides a biological invasion identification method based on multi-source data fusion analysis, comprising the following steps, as shown in the overall flow chart of figure 10:
acquiring a multi-source data set containing invasive biological data, and marking the invasive biological data; the data set includes: text data, picture data, time data, geographical location data.
The data set consists of text, pictures, times and geographical locations, which widens the data coverage. It is divided into a training set and a test set so that comparative experiments can be run to verify the technical effect of the algorithm.
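A sketch of the 70/30 random split mentioned above, applied to the 4440 sighting reports of the example data set (the function name and seed are illustrative, not from the patent):

```python
import random

def split_dataset(records, train_frac=0.7, seed=42):
    # Shuffle indices, then take the first 70% as the training set
    # and the remainder as the test set.
    rng = random.Random(seed)
    idx = list(range(len(records)))
    rng.shuffle(idx)
    cut = int(len(records) * train_frac)
    train = [records[i] for i in idx[:cut]]
    test = [records[i] for i in idx[cut:]]
    return train, test

train, test = split_dataset(list(range(4440)))
print(len(train), len(test))  # 3108 1332
```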
The text data is classified and a labelled text probability matrix is output. Classifying the text data specifically comprises: removing stop words from the text data; constructing N-gram features with Fast-Text by running a sliding window of size N over the text content in byte order, yielding byte fragments of length N; taking the generated sequences as the candidate set; screening out the important features; and outputting the labelled text probability matrix with Soft-Max.
Comment data about the invading organism provided by the public helps the laboratory to judge, so the text data strongly influences the decision on whether an invasive organism is present. To preserve the internal morphological characteristics of each word, features are extracted from the word vector of each record: a sliding window of size N is run over the text content in byte order, yielding byte fragments of length N. Here '<' marks a word's prefix and '>' its suffix; the trigrams formed between '<' and '>' can represent a word, and stacking these vectors represents the word vector better (for example, '<where>' yields the 5 trigrams <wh, whe, her, ere, re>). Discrete variables are converted into continuous vectors by an Embedding layer, giving the word vector of record j, W_j = [w_1j, w_2j, ..., w_ij, ..., w_nj]. Figure 1 shows the Fast-Text flow: the Embedding word vectors serve as input features, and the hidden layer superposes and averages the word vectors. The negative log-likelihood is used as the loss function,

-(1/N) Σ_{n=1}^{N} y_n log( f(B A x_n) ),

where N is the number of texts, x_n are the word features of text n, y_n is the label, A and B are weight matrices (A converts the input to a text representation, B linearly transforms it to class scores), and f is the Soft-Max function computing the probability of the final class. Soft-Max adopts a hierarchical structure based on a Huffman tree to classify the text data and output the probability of each class.
The standard expression for the hierarchical Soft-Max is

p(ω_c | ω_t) = Π_{j=1}^{L(ω_c)-1} σ( [[ n(ω_c, j+1) = ch(n(ω_c, j)) ]] · v'_{n(ω_c,j)}ᵀ v_{ω_t} ),

where p(ω_c | ω_t) is the final probability of the text, ω denotes a word vector, n(ω, j) is the j-th node on the Huffman-tree path to ω, ch(n) is a fixed child of node n, [[·]] equals 1 if the condition holds and −1 otherwise, and σ is the sigmoid function.
Finally the probability matrix of the n records is obtained, as shown in Table 1, where T_i denotes the i-th of the n records, L_j the j-th of the k classes, and p_ij the probability of class j in record i.

Table 1: Text data probability matrix

        L_1    L_2    ...   L_k
  T_1   p_11   p_12   ...   p_1k
  T_2   p_21   p_22   ...   p_2k
  ...   ...    ...    ...   ...
  T_n   p_n1   p_n2   ...   p_nk
The position of the invasive organism in each picture is identified from the picture data, its boundary and size are determined, and a labelled picture probability matrix is trained. Specifically: the picture data is passed through a picture-recognition algorithm (CNN) to locate the invasive organism to be identified, the region is enlarged, the boundary and picture size are determined, and the labelled picture probability matrix is trained with the CNN.
Picture data uploaded by the public strongly influences the laboratory's judgment, so feature extraction from the picture data is essential. The picture data is first pre-processed: files that are not pictures are deleted and suffix names are corrected. A convolutional neural network (CNN) mainly comprises convolutional layers (CONV), pooling layers (POOL) and fully connected layers (FC). By applying successive filters, a CNN continuously extracts features from local to global scale to perform image recognition. Processing a picture with the CNN has four stages: input layer, convolution and down-sampling layers, fully connected layer, and output layer. The input layer takes an RGB colour image; the RGB components are convolved with the convolution-layer weights W to give the C layers, which are then down-sampled to give the S layers. The outputs of these layers, after an activation function, are called Feature-Maps. The fully connected layer unrolls every element of all Feature-Maps into a single row, and the output layer classifies with Soft-Max.
Fig. 2 details the flow from the fully connected layer to the output layer. Pictures of the Asian invader serve as input-layer data; since the data set consists of RGB colour images, three separate 2D kernels scale and greyscale each picture, quickly converting the 3-channel RGB image to a 1-channel grey-scale image. After repeated convolution, pooling and activation, the features are extracted, and a Soft-Max function after the fully connected layer outputs the probability that the Asian invasive organism is present.
Here X (height × width × channels) is the input pixel matrix and Y the output matrix. After convolution and pooling, the multidimensional data is flattened and connected to the fully connected layer, whose output is the class probability from a conventional Soft-Max. A T × L vector is obtained in which each value is a probability for the input samples, giving the classification probability of the image file, denoted C.
The resulting probability matrix for the picture data is shown in Table 2, where q_ij is the probability of class j in record i.

Table 2: Picture data probability matrix

        L_1    L_2    ...   L_k
  T_1   q_11   q_12   ...   q_1k
  T_2   q_21   q_22   ...   q_2k
  ...   ...    ...    ...   ...
  T_n   q_n1   q_n2   ...   q_nk
The time data is one-hot encoded, and a spatio-temporal feature matrix is built from the encoded data and the geographical-location data. A multi-feature vector is then constructed from the text probability matrix, the picture probability matrix and the spatio-temporal feature matrix; weights are assigned to the multi-feature vector and a binary classifier is trained with a machine-learning algorithm. Specifically: the multi-feature vector is standardised, weight coefficients are assigned by the entropy weight method, and a binary classifier is trained with the SVM (support vector machine) machine-learning algorithm. Alternative classifiers include random forest, logistic regression and neural networks.
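The entropy weight method used for weight distribution can be sketched as follows (the scores are illustrative; note that a constant feature carries no information and therefore receives essentially zero weight):

```python
import numpy as np

def entropy_weights(X):
    # X: n samples x m features, non-negative scores.
    n, m = X.shape
    P = X / X.sum(axis=0, keepdims=True)       # column-wise proportions
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(P > 0, P * np.log(P), 0.0)
    e = -plogp.sum(axis=0) / np.log(n)         # entropy of each feature
    d = 1.0 - e                                # degree of dispersion
    return d / d.sum()                         # normalised weights

# Toy records: text prob, picture prob, and a constant third feature.
X = np.array([[0.9, 0.2, 0.5],
              [0.8, 0.3, 0.5],
              [0.1, 0.9, 0.5]])
w = entropy_weights(X)
print(w.round(3))
```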
The data to be predicted is input into the binary classifier to obtain the invasive-species records. Specifically: the data needing prediction is fed in, and the SVM (support vector machine) produces the final label; when the output label is 1, the data uploaded by the user for that time period and place is judged genuine, indicating that the invasive species has appeared there and should be dealt with promptly.
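A minimal primal linear SVM trained by sub-gradient descent on the hinge loss can stand in for the SVM stage described above (the fused feature vectors are toy values, and the patent does not specify the kernel or solver, so this is only a sketch):

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1):
    # Linear SVM in primal form, trained with stochastic sub-gradient
    # descent on the regularised hinge loss; y must be in {-1, +1}.
    rng = np.random.default_rng(0)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            if y[i] * (X[i] @ w + b) < 1:      # margin violated
                w += lr * (y[i] * X[i] - lam * w)
                b += lr * y[i]
            else:                               # only regularisation shrink
                w -= lr * lam * w
    return w, b

# Toy fused vectors: [text prob, picture prob, spatio-temporal score]
X = np.array([[0.9, 0.8, 1.0], [0.8, 0.9, 1.0],
              [0.2, 0.1, 0.0], [0.1, 0.3, 0.0]])
y = np.array([1, 1, -1, -1])
w, b = train_linear_svm(X, y)
labels = np.where(X @ w + b >= 0, 1, 0)  # output label 1 = confirmed sighting
print(labels)
```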
The biological invasion identification method based on multi-source data fusion analysis further comprises: applying a GM model to the spatio-temporal feature matrix to predict the future migration or propagation patterns of the invading organism.
The data set comprises sighting reports compiled by the Washington State Department of Agriculture up to December 2020: a spreadsheet of 4440 sighting reports and 3305 images uploaded by users. Reports already adjudicated by the laboratory are labelled: 1 where the invading organism was confirmed, 0 otherwise. 70% of all data is randomly assigned to the training set and the remainder to the test set.
In the sighting reports provided by the public, each report is independent, and the information and feature values are discrete and unordered rather than continuous. One-hot encoding, also known as one-bit effective encoding, can be used to digitize such features: an N-bit status register encodes N states, each state has its own independent register bit, and only one bit is active at any time.
Because invasive organisms are foreign species, occurrence events are rare; the GM model is well suited to small amounts of incomplete information, so the variation range of the invasive organisms is calculated from their occurrence times and geographic positions. The GM(1,1) model can be expressed as Y = Bu, and the predicted sample range represents the region partition. To guarantee the feasibility of GM(1,1) modeling, the known data must first be checked. Let the original data column be x^(0) = (x^(0)(1), x^(0)(2), …, x^(0)(n)), where x^(0) is the original time-data column, and compute the level ratios of the sequence. If all the level ratios fall within the admissible coverage interval, the sequence can be modeled with GM(1,1) and grey prediction can be performed; otherwise, an appropriate transformation is applied to the data, such as the translation y^(0)(k) = x^(0)(k) + c, k = 1, 2, …, n.
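The GM(1,1) construction can be sketched as follows: 1-AGO accumulation, least-squares estimation of the development coefficient a and grey action quantity b, the exponential time response, and IAGO restoration. This is a generic textbook implementation under the usual background-value convention, not the patent's own code; the function name and the sample series are illustrative, and a positive, non-constant series is assumed:

```python
import math

# Generic GM(1,1) grey-forecasting sketch (assumes positive, non-constant x0).
def gm11(x0, steps=1):
    n = len(x0)
    x1 = [sum(x0[:i + 1]) for i in range(n)]              # 1-AGO sequence
    z = [0.5 * (x1[i] + x1[i - 1]) for i in range(1, n)]  # background values
    m = n - 1
    # Normal equations for x0(k) = -a*z(k) + b, k = 2..n
    szz = sum(v * v for v in z)
    sz, sy = sum(z), sum(x0[1:])
    szy = sum(v * y for v, y in zip(z, x0[1:]))
    det = szz * m - sz * sz
    a = (sz * sy - m * szy) / det   # development coefficient
    b = (szz * sy - sz * szy) / det  # grey action quantity

    # Time response of the accumulated series
    def x1_hat(k):
        return (x0[0] - b / a) * math.exp(-a * k) + b / a

    # IAGO: successive differences restore the forecast of the original series
    return [x1_hat(n + i) - x1_hat(n + i - 1) for i in range(steps)]

print(gm11([10, 12, 14.4, 17.28], steps=1))
```

For a roughly geometric sample series the forecast continues the growth trend; the level-ratio check described above should be applied before trusting the fit.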
Using the model, the regional range of event occurrence in the different records is obtained, and the trend range is planned and predicted. For each record, it is first determined whether its location falls within the range predicted by the GM model: the recorded geographic feature is L(Location) = 1 if it does, and L(Location) = 0 otherwise. The time feature T is uncertain because event times vary, so one-hot encoding is used to distinguish the discrete values of the different time periods effectively.
First, data features are extracted and standardized according to the four different feature types (text, picture, time, and position). The text and picture data provided by the public are highly authentic but span wide time and geographic ranges and carry no specific meaning on their own, so the overall values must be weighted. The assigned weights should sum to 1.
For the extracted text probability feature F and picture probability feature C, missing values default to 1/k: because the events are mutually independent, each of k simultaneous events has probability only 1/k, and the k event probabilities sum to 1. Missing time and geographic location values are completed with the average of the two neighboring records.
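The default-value and neighbour-averaging completion described above can be sketched as follows; the record field names are illustrative, not the patent's schema:

```python
# Missing-value completion sketch: probabilities default to 1/k, numeric
# time/location fields take the mean of the two neighbouring records.
def fill_missing(records, k):
    out = [dict(r) for r in records]
    for i, rec in enumerate(out):
        for key in ("text_prob", "pic_prob"):
            if rec.get(key) is None:
                rec[key] = 1.0 / k  # default probability 1/k
        for key in ("lat", "lon"):
            if rec.get(key) is None and 0 < i < len(out) - 1:
                prev, nxt = records[i - 1].get(key), records[i + 1].get(key)
                if prev is not None and nxt is not None:
                    rec[key] = (prev + nxt) / 2.0  # neighbour average
    return out
```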
The multi-feature events are essentially random, so entropy is computed to judge the randomness and disorder of the events and the dispersion degree of each index: the greater an index's dispersion, the greater its influence on the comprehensive evaluation.
First, the feature matrix X = (X_11, …, X_Nj) is assembled, and the values from the various matrices are Z-score standardized: z = (x − μ) / σ, where μ and σ are the mean and standard deviation over the feature vector X.
The entropy weight of each index is then calculated from the information entropy H_j = −k Σ_{i=1}^{m} p_ij ln p_ij, where H_j denotes the information entropy of the j-th index and p_ij the proportion of sample i under index j.
To ensure 0 ≤ H_j ≤ 1, k is taken as 1/ln(m); the degree of divergence of each index is then calculated as d_j = 1 − H_j, and the index weight as w_j = d_j / Σ_j d_j.
The standardized feature matrix is multiplied by the index weights w_j to obtain the weighted multi-feature evaluation matrix V.
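The entropy-weight steps above (proportions p_ij, entropy H_j with k = 1/ln m, divergence d_j = 1 − H_j, normalized weights w_j) can be sketched in a few lines. A minimal pure-Python version; it assumes non-negative index scores, since Z-scored values would first need shifting before being used as proportions:

```python
import math

# Entropy weight method: a uniform column carries no information and gets
# weight ~0; a dispersed column gets a large weight.
def entropy_weights(matrix):
    """matrix: m samples (rows) by n indices (columns) of scores >= 0."""
    m, n = len(matrix), len(matrix[0])
    k = 1.0 / math.log(m)
    d = []
    for j in range(n):
        col = [row[j] for row in matrix]
        total = sum(col)
        p = [v / total for v in col]                       # proportions p_ij
        h_j = -k * sum(pi * math.log(pi) for pi in p if pi > 0)
        d.append(1.0 - h_j)                                # divergence d_j
    s = sum(d)
    return [dj / s for dj in d]                            # weights w_j

print(entropy_weights([[1, 9], [1, 1], [1, 5]]))
```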
Because many feature vectors are available and a first round of predictive classification has already produced probability statistics for the data, no overly complex algorithm is needed for the final multi-feature fusion classification; a conventional SVM classification model suffices.
Input data X and a learning objective Y are given in the SVM classification problem.
X={X 1 ,...,X N },
Y={Y 1 ,...,Y N },
Here, X is F, C, L, and T.
Briefly, regarding F, C, L, and T: each sample of the input data contains these multiple features, which form a feature space χ: X = [X_1, …, X_N] ∈ χ,
while the learning objective is a binary variable representing a negative class and a positive class. If, in the feature space χ containing the input data, there exists a hyperplane serving as the decision boundary, w^T X + b = 0, that separates the learning targets into the positive and negative classes and makes the functional distance of every sample to the plane at least 1, y_i (w^T X_i + b) ≥ 1, then the parameters w and b are the normal vector and the intercept of the hyperplane, respectively.
Decision boundaries satisfying this condition in fact construct two parallel hyperplanes as interval boundaries to discriminate the classes of samples: all samples on or above the upper interval boundary belong to the positive class, and all samples on or below the lower interval boundary belong to the negative class. The distance d between the two interval boundaries is defined as the margin, and the positive and negative class samples lying on the interval boundaries are the support vectors.
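The separating-hyperplane condition can be checked directly. A minimal pure-Python sketch; the weight vector, intercept, and toy samples below are illustrative, not the patent's fitted model:

```python
# Hard-margin check for a linear SVM decision boundary w^T x + b = 0.
def margin_ok(w, b, samples):
    """True if every labeled sample satisfies y * (w . x + b) >= 1."""
    return all(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) >= 1
        for x, y in samples
    )

def margin_width(w):
    """Geometric margin between the two interval boundaries: 2 / ||w||."""
    return 2.0 / sum(wi * wi for wi in w) ** 0.5

points = [([2.0, 2.0], 1), ([0.0, 0.0], -1)]  # toy positive/negative samples
w, b = [0.5, 0.5], -1.0
print(margin_ok(w, b, points), margin_width(w))
```

Both toy points lie exactly on the interval boundaries, so they would be the support vectors of this boundary.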
The data are text-classified with the Fast-Text tool. The training set is first balanced, the probability distributions over 100 training rounds are obtained, and the probabilities finally peak as training continues.
As shown in FIG. 3, the probabilities approach 0.9 or 0.1 from about the 10th round; even with uneven samples, Fast-Text helps balance them and makes the probability judgments reliable.
The sample data were shuffled before training; for the Asian invasive-organism problem, as shown in FIG. 4 and FIG. 5, both the recall rate and the accuracy rate reached 94.6%.
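The sliding-window N-gram feature construction that Fast-Text relies on (also described in claim 2) can be illustrated with a minimal character-level sketch; the function name and sample word are ours:

```python
# N-size sliding window over text, producing the length-n fragment sequence
# used as feature candidates (pure-Python sketch of the claim-2 description).
def char_ngrams(text, n):
    """Return the length-n fragments from sliding an n-wide window over text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("hornet", 3))  # ['hor', 'orn', 'rne', 'net']
```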
In practice, the CNN model is built with the PyTorch framework; 70% of the data set is used for training and 30% for testing. Because the data samples are uneven, simple oversampling is applied. The final trained model performs well on the test set: several pictures were selected at random and predicted with the trained model, as shown in FIG. 6 (True value is the actual category and Prediction is the model's output; Negative indicates the picture shows an Asian invasive organism, Positive that it does not).
Finally, the trained model's metrics were evaluated, as shown in Table 3 and FIG. 7.
Table 3: index of training model
Since invasive organisms are rare, the GM model, which is well suited to small amounts of incomplete information, is used to calculate their variation range from the times and geographic positions of their occurrence.
A constant c is chosen so that the level ratios of the data columns all fall within the acceptable coverage; after calculation, the level-ratio check values of both data columns lie in the standard interval [0.857, 1.166], so the data are suitable for GM(1,1) model construction. After this check, the development coefficient a, the grey action quantity b, and the posterior difference ratio C are calculated for the GM model, as shown in Table 4:
table 4: results of model construction
The posterior difference ratio C of both models is less than 0.65, and that of the longitude model is only 0.0468, below 0.35, indicating a particularly good fit. The longitude and latitude are therefore predicted, and after prediction the residuals, including the relative errors and the level-ratio deviations, are tested: the maximum relative errors of both series are below 0.1, and the level-ratio deviations are likewise below 0.1, meaning the model fit meets the stricter requirement. The relevant range is plotted according to the geographic locations, as shown in FIG. 8.
After the longitude and latitude are predicted, the distance between two points is calculated from the difference of their longitudes and latitudes, and the goodness of fit of the two models is computed: R² is 71.45% for latitude and 95.31% for longitude. As FIG. 8 shows, the samples verified as true Asian invasive organisms fall in the latitude range [48.7775, 49.1494] and the longitude range [−123.9431, −122.4186].
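A distance between two coordinate points can be computed from their latitude/longitude differences; the patent does not specify its exact formula, so the haversine great-circle distance below is one common choice and an assumption on our part. The sample points use the endpoint values of the ranges quoted above:

```python
import math

# Haversine great-circle distance between two lat/lon points, in kilometres.
def haversine_km(lat1, lon1, lat2, lon2):
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = (math.sin(dp / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# Corners of the verified latitude/longitude ranges quoted above
print(round(haversine_km(48.7775, -123.9431, 49.1494, -122.4186), 1))
```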
The relevant weights among the features are determined with the entropy weight method; the resulting weight distribution of each feature is shown in Table 5:
table 5: different characteristic weight distribution table
Longitude | 0.048440 |
Latitude | 0.026739 |
Text | 0.290734 |
Image | 0.258996 |
LabText | 0.157505 |
Year | 0.115133 |
Month | 0.102453 |
For these feature values the feature space is linearly inseparable, i.e., it contains a hypersurface; a nonlinear function can map the linearly inseparable problem from the original feature space to a higher-dimensional Hilbert space, converting it into a linearly separable one. A linear regression calculation first gives Table 6:
table 6: linear regression calculation table
MAE | 0.08905792097395834 |
MSE | 0.4136931156194732 |
R 2 | -0.3932832269742397 |
Since R² < 0, the data may have no linear relationship. For this hyperplane multi-feature problem, classification with a radial basis function kernel converges well. SVM multi-feature fusion analysis is therefore performed; its results are more accurate than using the Fast-Text or CNN neural network models alone, and it generalizes better than predicting from any single datum.
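The radial basis function kernel mentioned above maps samples implicitly into a higher-dimensional space where the classes become separable. A minimal sketch; the gamma value is an illustrative choice, not from the patent:

```python
import math

# Gaussian RBF kernel: K(x, z) = exp(-gamma * ||x - z||^2).
def rbf_kernel(x, z, gamma=0.5):
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * sq_dist)

print(rbf_kernel([1.0, 0.0], [1.0, 0.0]))  # identical points -> 1.0
```

The kernel value decays toward 0 as points move apart, so it acts as a similarity measure between feature vectors.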
table 7: classification evaluation table
The diagonal entries are the correctly predicted counts; among 521 records only one invasive-organism judgment is wrong, so records of invasive organisms and their occurrence ranges can be discriminated correctly. The public data not yet verified by experiment were then predicted, and a heat map was drawn from the predicted data and times, as shown in FIG. 9: traces of Asian invasive organisms may appear in the Washington area during the following half year, and in the short term this risk may not be eliminated.
A comprehensive evaluation of the multi-feature data fusion analysis algorithm, shown in Table 8, indicates that the algorithm combines multi-feature data sources well and judges different events reasonably.
Table 8: comprehensive evaluation of multi-feature data fusion analysis algorithm
MSE | 0.0007262164124909223 |
MAE | 0.0007262164124909223 |
R 2 | 0.9970754396397927 |
ACC | 0.9992737835875091 |
Recall | 0.9992088607594937 |
F2 | 0.9992737396242461 |
ROC | 0.9992088607594937 |
The technical features of the above embodiments can be combined arbitrarily. For brevity, not every possible combination is described, but any combination of these technical features that involves no contradiction should be considered within the scope of this specification.
Claims (7)
1. A biological intrusion identification method based on multi-source data fusion analysis is characterized by comprising the following steps:
acquiring a multi-source data set containing invasive biological data, and marking the invasive biological data; the data set includes: text data, picture data, time data, geographical location data;
classifying the text data and outputting a text probability matrix with a mark;
identifying the position of an invasive organism in the picture from the picture data, determining the boundary and the size, and training a picture probability matrix with a mark;
carrying out one-hot encoding on the time data, and constructing a time-space characteristic matrix through the encoded data and the geographical position data;
constructing a multi-feature vector according to the text probability matrix, the picture probability matrix and the time-space feature matrix; carrying out weight distribution on the multi-feature vector, and training a binary classifier by using a machine learning algorithm;
and inputting the data to be predicted into a binary classifier to obtain the invasive biological data.
2. The biological intrusion identification method based on multi-source data fusion analysis according to claim 1, characterized in that:
classifying the text data, specifically including:
removing stop words from the text data, constructing N-gram features with Fast-Text by performing an N-size sliding-window operation over the text content in byte order to form byte fragment sequences of length N, taking the generated sequences as the text feature candidate set, screening out the important features, and outputting a labeled text probability matrix with Softmax.
3. The biological intrusion identification method based on multi-source data fusion analysis according to claim 1, characterized in that:
the training marked picture probability matrix specifically comprises:
determining the position of the invasive organism to be identified from the picture data with a convolutional neural network (CNN) image-recognition algorithm, enlarging that position, determining the boundary and picture size, and training a labeled picture probability matrix with the CNN.
4. The biological intrusion identification method based on multi-source data fusion analysis according to claim 1, characterized in that:
carrying out weight distribution on the multi-feature vector, and training a binary classifier by using a machine learning algorithm, wherein the method specifically comprises the following steps:
and standardizing the multi-feature vectors, carrying out weight distribution by using an entropy weight method, and training into a binary classifier by using a machine learning algorithm SVM.
5. The biological intrusion identification method based on multi-source data fusion analysis according to claim 1, characterized in that:
inputting data to be predicted into a binary classifier to obtain invasive biological data, which specifically comprises the following steps:
inputting the data to be predicted and using the SVM output as the final label; when the output label is 1 it represents that, for the corresponding time period, the data uploaded by the user at that place are true, i.e., the invasive species has appeared at that location and should be handled promptly.
6. The biological intrusion identification method based on multi-source data fusion analysis according to claim 1, further comprising:
and predicting the migration or reproduction rules of the future invasive organisms by using a GM model for the time-space characteristic matrix.
7. The biological intrusion identification method based on multi-source data fusion analysis according to claim 1, wherein:
the classifier in the binary classifier includes: random forest, logistic regression, neural networks.
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210575412.2A CN114943290B (en) | 2022-05-25 | 2022-05-25 | Biological intrusion recognition method based on multi-source data fusion analysis |
NL2034214A NL2034214A (en) | 2022-05-25 | 2023-02-23 | A biological invasion identification method based on multi-source data fusion analysis |
NL2034409A NL2034409A (en) | 2022-05-25 | 2023-03-23 | A biological invasion identification method based on multi-source data fusion analysis |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210575412.2A CN114943290B (en) | 2022-05-25 | 2022-05-25 | Biological intrusion recognition method based on multi-source data fusion analysis |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114943290A true CN114943290A (en) | 2022-08-26 |
CN114943290B CN114943290B (en) | 2023-08-08 |
Family
ID=82908603
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210575412.2A Active CN114943290B (en) | 2022-05-25 | 2022-05-25 | Biological intrusion recognition method based on multi-source data fusion analysis |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN114943290B (en) |
NL (2) | NL2034214A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117109664A (en) * | 2023-10-20 | 2023-11-24 | 生态环境部华南环境科学研究所(生态环境部生态环境应急研究所) | Wetland ecological environment monitoring device and system |
Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20110110895A (en) * | 2010-04-02 | 2011-10-10 | 제주대학교 산학협력단 | System for fusing realtime image and context data by using position and time information |
US20150248798A1 (en) * | 2014-02-28 | 2015-09-03 | Honeywell International Inc. | System and method having biometric identification intrusion and access control |
CN107832718A (en) * | 2017-11-13 | 2018-03-23 | 重庆工商大学 | Finger vena anti false authentication method and system based on self-encoding encoder |
CN109165387A (en) * | 2018-09-20 | 2019-01-08 | 南京信息工程大学 | A kind of Chinese comment sentiment analysis method based on GRU neural network |
CN109347863A (en) * | 2018-11-21 | 2019-02-15 | 成都城电电力工程设计有限公司 | A kind of improved immune Network anomalous behaviors detection method |
US20190188212A1 (en) * | 2016-07-27 | 2019-06-20 | Anomalee Inc. | Prioritized detection and classification of clusters of anomalous samples on high-dimensional continuous and mixed discrete/continuous feature spaces |
CN109934354A (en) * | 2019-03-12 | 2019-06-25 | 北京信息科技大学 | Abnormal deviation data examination method based on Active Learning |
CN111046946A (en) * | 2019-12-10 | 2020-04-21 | 昆明理工大学 | Burma language image text recognition method based on CRNN |
US20200342086A1 (en) * | 2018-01-19 | 2020-10-29 | Nymi Inc. | Live user authentication device, system and method |
CN112990262A (en) * | 2021-02-08 | 2021-06-18 | 内蒙古大学 | Integrated solution system for monitoring and intelligent decision of grassland ecological data |
CN113343770A (en) * | 2021-05-12 | 2021-09-03 | 武汉大学 | Face anti-counterfeiting method based on feature screening |
CN113537355A (en) * | 2021-07-19 | 2021-10-22 | 金鹏电子信息机器有限公司 | Multi-element heterogeneous data semantic fusion method and system for security monitoring |
CN113793405A (en) * | 2021-09-15 | 2021-12-14 | 杭州睿胜软件有限公司 | Method, computer system and storage medium for presenting distribution of plants |
CN113822233A (en) * | 2021-11-22 | 2021-12-21 | 青岛杰瑞工控技术有限公司 | Method and system for tracking abnormal fishes cultured in deep sea |
Non-Patent Citations (6)
Title |
---|
KONRAD RIECK等: "Sally: A Tool for Embedding Strings in Vector Spaces", 《JOURNAL OF MACHINE LEARNING RESEARCH》, vol. 13, no. 104, pages 3247 - 3251 * |
SATHEESH NARAYANASAMI等: "Biological feature selection and classification techniques for intrusion detection on BAT", 《WIRELESS PERSONAL COMMUNICATIONS》, vol. 127, pages 1763 - 1785 * |
孟威等: "核电厂海洋生物探测预警多源信息融合技术研究", 《大连海洋大学学报》, vol. 34, no. 06, pages 840 - 845 * |
施光莹: "基于Active SVM算法的恶意网页检测技术研究", 《中国优秀硕士学位论文全文数据库 (信息科技辑)》, no. 2014, pages 139 - 59 * |
李梅等: "外来植物入侵机制研究进展", 《广东农业科学》, no. 02, pages 93 - 96 * |
陆倩: "基于仿生算法的网络入侵检测系统研究", 《中国优秀硕士学位论文全文数据库 (信息科技辑)》, no. 2019, pages 139 - 139 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117109664A (en) * | 2023-10-20 | 2023-11-24 | 生态环境部华南环境科学研究所(生态环境部生态环境应急研究所) | Wetland ecological environment monitoring device and system |
CN117109664B (en) * | 2023-10-20 | 2023-12-22 | 生态环境部华南环境科学研究所(生态环境部生态环境应急研究所) | Wetland ecological environment monitoring device and system |
Also Published As
Publication number | Publication date |
---|---|
NL2034409A (en) | 2023-05-19 |
NL2034214A (en) | 2023-05-19 |
CN114943290B (en) | 2023-08-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Eastwood et al. | A framework for the quantitative evaluation of disentangled representations | |
CN109034264B (en) | CSP-CNN model for predicting severity of traffic accident and modeling method thereof | |
US11263528B2 (en) | Neural network, computer readable medium, and methods including a method for training a neural network | |
CN109302410B (en) | Method and system for detecting abnormal behavior of internal user and computer storage medium | |
CN110929029A (en) | Text classification method and system based on graph convolution neural network | |
Widiyanto et al. | Implementation of convolutional neural network method for classification of diseases in tomato leaves | |
Shi et al. | Amur tiger stripes: Individual identification based on deep convolutional neural network | |
CN113761259A (en) | Image processing method and device and computer equipment | |
US20200364549A1 (en) | Predicting optical fiber manufacturing performance using neural network | |
CN115471739A (en) | Cross-domain remote sensing scene classification and retrieval method based on self-supervision contrast learning | |
CN111477328B (en) | Non-contact psychological state prediction method | |
CN114943290B (en) | Biological intrusion recognition method based on multi-source data fusion analysis | |
CN114547365A (en) | Image retrieval method and device | |
CN114943859A (en) | Task correlation metric learning method and device for small sample image classification | |
Hantak et al. | Computer vision for assessing species color pattern variation from web-based community science images | |
WO2022134104A1 (en) | Systems and methods for image-to-video re-identification | |
CN116304941A (en) | Ocean data quality control method and device based on multi-model combination | |
CN114049966B (en) | Food-borne disease outbreak identification method and system based on link prediction | |
CN116028803A (en) | Unbalancing method based on sensitive attribute rebalancing | |
Ade | Students performance prediction using hybrid classifier technique in incremental learning | |
CN110265151B (en) | Learning method based on heterogeneous temporal data in EHR | |
CN114202671A (en) | Image prediction optimization processing method and device | |
CN109308565B (en) | Crowd performance grade identification method and device, storage medium and computer equipment | |
Taubert et al. | Species Prediction based on Environmental Variables using Machine Learning Techniques. | |
CN112364193A (en) | Image retrieval-oriented method for fusing multilayer characteristic deep neural network model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
EE01 | Entry into force of recordation of patent licensing contract |
Application publication date: 20220826 Assignee: Yancheng Guzhuo Technology Co.,Ltd. Assignor: YANCHENG TEACHERS University Contract record no.: X2024980003605 Denomination of invention: A biological intrusion recognition method based on multi-source data fusion analysis Granted publication date: 20230808 License type: Common License Record date: 20240328 |