CN111680843B - Chinese medicinal material survival area prediction method and system based on depth SVDD model - Google Patents


Publication number
CN111680843B
Authority
CN
China
Prior art keywords
data
chinese medicinal
traditional chinese
model
medicinal materials
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010537578.6A
Other languages
Chinese (zh)
Other versions
CN111680843A (en)
Inventor
李巧勤
蔡茁
刘勇国
杨尚明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202010537578.6A priority Critical patent/CN111680843B/en
Publication of CN111680843A publication Critical patent/CN111680843A/en
Application granted granted Critical
Publication of CN111680843B publication Critical patent/CN111680843B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/04Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2135Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
    • G06F18/21355Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis nonlinear criteria, e.g. embedding a manifold in a Euclidean space
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/02Agriculture; Fishing; Forestry; Mining

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Strategic Management (AREA)
  • Human Resources & Organizations (AREA)
  • Economics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Business, Economics & Management (AREA)
  • Tourism & Hospitality (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Marketing (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Marine Sciences & Fisheries (AREA)
  • Development Economics (AREA)
  • Mining & Mineral Resources (AREA)
  • Evolutionary Computation (AREA)
  • Animal Husbandry (AREA)
  • Primary Health Care (AREA)
  • Agronomy & Crop Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Game Theory and Decision Science (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Machine Translation (AREA)
  • Medicines Containing Plant Substances (AREA)

Abstract

The invention discloses a method and system for predicting the survival area of Chinese medicinal materials based on a deep SVDD model. The method comprises the following steps: collecting ecological factor data of the traditional Chinese medicinal materials, and generating pseudo-nonexistent sample data of the traditional Chinese medicinal materials using a MaxEnt model; preprocessing the collected ecological factor data to obtain ecological factor preprocessing data; constructing a prediction model of the Chinese medicinal material survival area from the ecological factor preprocessing data, the pseudo-nonexistent sample data and the SVDD model; and feeding the test points of the traditional Chinese medicinal materials to be predicted into the prediction model for judgment to obtain the survival area of the traditional Chinese medicinal materials to be predicted. The parameters of the deep SVDD model are optimized with SGD and its variants, so the computational complexity of the model scales linearly with the amount of training data and the method scales well to large data sets; the survival area is obtained by judging the distance between each test point and the optimal hypersphere, which improves the accuracy of the predicted survival area.

Description

Chinese medicinal material survival suitability area prediction method and system based on depth SVDD model
Technical Field
The invention relates to the field of development and utilization of traditional Chinese medicine resources, in particular to a traditional Chinese medicine survival suitability area prediction method and system based on a deep SVDD model.
Background
The development and utilization of traditional Chinese medicine resources, and the protection of their sustainable development, are very important to research on traditional Chinese medicine resources in China. These resources face many problems. For example, blind expansion of cultivation areas seriously affects the quality and the producing areas of traditional Chinese medicines, the effective components of introduced medicinal materials differ markedly from the standards of the Chinese Pharmacopoeia, and the sustainable development of traditional Chinese medicines is seriously restricted. To expand the introduction areas of Chinese medicinal materials more scientifically, research on the ecological suitability of the medicinal materials needs to be strengthened: the ecological factors that shape the medicinal materials, such as light, temperature, moisture, terrain and soil, need to be identified, and the introduction, cultivation and zoning management of Chinese medicinal materials need to be improved, so as to use environmental resources fully and reasonably, protect Chinese medicinal materials and achieve their sustainable development.
At present, most studies on predicting the distribution of suitable areas for medicinal materials use a maximum entropy model together with existing distribution data and ecological environment data to predict the distribution pattern, shifts and so on of the suitable areas. In the prior art, a MaxEnt ecological niche model is combined with GIS technology: based on distribution data for 214 Platycodon grandiflorum sample points, the contribution rate of the ecological factors is analyzed with the jackknife method, and the main ecological factors and habitat characteristics influencing the growth of Platycodon grandiflorum are explored, so that the suitable growth areas of Platycodon grandiflorum are zoned nationwide; the prediction accuracy index AUC (Area Under the Curve) reaches 0.922.
However, the MaxEnt model is a complex machine learning algorithm that is sensitive to sampling bias and prone to overfitting, and its transferability is good only under a low threshold; in addition, running the MaxEnt model with default parameters affects the accuracy of the prediction result.
Disclosure of Invention
The technical problems to be solved by the invention are that conventional methods for predicting the survival area of Chinese medicinal materials use the MaxEnt model, which is prone to overfitting caused by sampling bias, and that running the MaxEnt model with default parameters affects the accuracy of the prediction result. The invention provides a method and system for predicting the survival area of Chinese medicinal materials based on a deep SVDD model to solve these problems.
The invention is realized by the following technical scheme:
A traditional Chinese medicine survival area prediction method based on a deep SVDD model comprises the following steps:
S1: collecting ecological factor data of the traditional Chinese medicinal materials, and generating pseudo-nonexistent sample data of the traditional Chinese medicinal materials by adopting a MaxEnt model;
the pseudo-nonexistent sample data are the non-survival areas of the Chinese medicinal material obtained through the MaxEnt model;
S2: preprocessing the collected ecological factor data of the traditional Chinese medicinal materials to obtain ecological factor preprocessing data;
S3: constructing a prediction model of the Chinese medicinal material survival area according to the ecological factor preprocessing data, the pseudo-nonexistent sample data and the SVDD model;
S4: putting the test points of the traditional Chinese medicinal materials to be predicted into the prediction model of the survival area of the traditional Chinese medicinal materials for judgment to obtain the survival area of the traditional Chinese medicinal materials to be predicted.
The invention provides a Chinese medicinal material survival area prediction model based on deep support vector data description, namely a deep SVDD model. Because the ecological environment factors come in different data formats, the data must be preprocessed, that is, converted to a unified form based on the t-SNE algorithm. The deep support vector data description model then maps the converted data nonlinearly to a high-dimensional feature space, an optimal hypersphere is sought in that feature space, and the parameters of the deep SVDD model are optimized using SGD and its variants.
In long-standing research on traditional Chinese medicinal material resources, the state provides big-data storage and management of the ecological factor data of traditional Chinese medicinal materials, and these data are authoritative and authentic. The method therefore acquires sample distribution data of the Chinese medicinal materials through the Chinese plant specimen museum and the national specimen platform. Because the sample points are widely distributed and contain repeated data, valid data are screened by methods such as data cleaning to obtain the longitude and latitude of each sample point. The environmental factor data of the traditional Chinese medicinal materials are acquired by consulting the relevant literature on environmental factors; the environmental factor data comprise climate factors, terrain factors and soil factors. Longitude and latitude mapping is carried out through the national basic geographic information system network to obtain the ecological environment distribution of each region of China.
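The data cleaning step for sample points can be sketched as follows. This is a minimal illustration, not the procedure used in the invention: the function name `clean_occurrences`, the rounding precision and the coordinate range check are all assumptions introduced here.

```python
from typing import Iterable, List, Tuple

Coordinate = Tuple[float, float]  # (longitude, latitude) in decimal degrees


def clean_occurrences(records: Iterable[Coordinate], precision: int = 4) -> List[Coordinate]:
    """Drop invalid and repeated sample points, keeping first occurrences in order.

    `precision` (decimal places used to decide that two points coincide)
    is a hypothetical parameter chosen for this sketch.
    """
    seen = set()
    cleaned = []
    for lon, lat in records:
        # Discard coordinates outside the valid geographic range.
        if not (-180.0 <= lon <= 180.0 and -90.0 <= lat <= 90.0):
            continue
        key = (round(lon, precision), round(lat, precision))
        if key in seen:
            continue  # repeated sample point
        seen.add(key)
        cleaned.append(key)
    return cleaned


raw = [(104.06, 30.67), (104.06, 30.67), (200.0, 30.0), (116.40, 39.90)]
print(clean_occurrences(raw))
```

Keeping the first occurrence preserves the original record order, which makes the cleaned list easy to join back to the specimen records.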
To increase the reliability of the model, the model needs to be trained on both real presence sample data and absence sample data, where absence means that a given medicinal material does not grow in a given place. Because no real absence data are available, the invention uses the MaxEnt model, optimized by parameter tuning, to construct pseudo-nonexistent data.
Because the input data are multi-source and heterogeneous (for example, soil texture is text, while temperature, precipitation and the like are numerical), the words in the text are first converted into word vectors using the word vector model Word2vec to obtain a feature representation of the text data. Because processing in the high-dimensional word vector space is inefficient, the t-SNE algorithm is used to map the high-dimensional word vector space into a two-dimensional space, so that two words with similar meanings remain close after mapping and words with distant meanings remain far apart.
Further, the main ecological factor data for judging the habitability area of the traditional Chinese medicine comprise sample distribution data, environmental factor data and map data.
Further, generating the pseudo-nonexistent sample data in S1 includes:
generating a survival area value result for the traditional Chinese medicinal material using the MaxEnt model;
the output of the MaxEnt model lies between 0 and 1 and gives, for each grid cell (which can be regarded as a pixel in the map), a survival suitability index; the higher the index, the more suitable the cell is for the survival of the traditional Chinese medicinal material. Grid cells whose survival suitability index is above a certain threshold are regarded as survival areas; once the survival areas are selected, their longitudes and latitudes are removed from the map, leaving only the non-survival areas;
rejecting the values greater than or equal to the threshold in the survival area value result of the traditional Chinese medicinal material to obtain the non-survival area;
because the model simulates best when the number of presence points equals the number of pseudo-nonexistent points, the same number of pseudo-nonexistent points as there are points in the ecological factor data of the traditional Chinese medicinal material is selected from the non-survival area to obtain the pseudo-nonexistent sample data of the traditional Chinese medicinal material.
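The thresholding and sampling steps above can be sketched as follows. The suitability grid is assumed to be already available as a mapping from grid cell to index; the name `sample_pseudo_absences`, the threshold value 0.5 and the fixed random seed are assumptions for illustration, since the text only speaks of "a certain threshold".

```python
import random
from typing import Dict, List, Tuple

Cell = Tuple[float, float]  # (longitude, latitude) of a grid cell


def sample_pseudo_absences(
    suitability: Dict[Cell, float],
    n_presence: int,
    threshold: float = 0.5,  # assumed value, not given in the text
    seed: int = 0,
) -> List[Cell]:
    """Pick as many pseudo-nonexistent points as there are presence points."""
    # Reject cells at or above the threshold; only non-survival cells remain.
    unsuitable = sorted(cell for cell, score in suitability.items() if score < threshold)
    if len(unsuitable) < n_presence:
        raise ValueError("not enough non-survival cells to match the presence count")
    rng = random.Random(seed)
    # Sampling without replacement keeps the pseudo-nonexistent points distinct.
    return rng.sample(unsuitable, n_presence)


grid = {
    (100.0, 30.0): 0.91, (100.5, 30.0): 0.12, (101.0, 30.0): 0.48,
    (101.5, 30.0): 0.77, (102.0, 30.0): 0.05, (102.5, 30.0): 0.33,
}
absences = sample_pseudo_absences(grid, n_presence=2)
print(absences)
```

Matching the sample count to the number of presence points follows the balance condition stated above; a seeded generator is used only so that the sketch is reproducible.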
Further, the preprocessing process of S2 includes:
converting the ecological factor data into high-dimensional word vectors using the word vector model Word2vec;
and mapping the high-dimensional word vectors into two-dimensional word vectors using the t-SNE algorithm.
Further, the t-SNE algorithm is as follows:
so that similar objects have a higher probability of being selected and dissimilar objects a lower probability, the similarity between objects is expressed by converting the Euclidean distance into a conditional probability; that is, a probability distribution is constructed over the high-dimensional objects, and the similarity between different data is expressed as:
$$p_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)}$$
wherein $p_{j|i}$ represents the similarity between different data in the high-dimensional space, $x_i$ and $x_j$ are any two different data among the $N$ data $x_1, x_2, \dots, x_N$, the parameter $\sigma_i$ represents the variance of the Gaussian distribution centered at $x_i$, and $\|\cdot\|$ represents the two-norm; the invention only concerns the similarity between two different points, and therefore sets $p_{i|i} = 0$;
because the vectors in the high-dimensional space need to be mapped to the low-dimensional space, and the probability distribution of the same objects in the low-dimensional space should be as similar as possible to that in the high-dimensional space, a probability distribution over the objects is also constructed in the low-dimensional space, and the similarity between different data is expressed as:
$$q_{j|i} = \frac{\exp\left(-\|y_i - y_j\|^2\right)}{\sum_{k \neq i} \exp\left(-\|y_i - y_k\|^2\right)}$$
wherein $q_{j|i}$ represents the similarity between different data in the low-dimensional space, and $y_i$ and $y_j$ represent the two-dimensional data in the low-dimensional space; the Gaussian distribution is assumed to have a variance of $1/\sqrt{2}$. For the same reason, $q_{i|i} = 0$.
The joint probability distributions $P$ and $Q$ of the high-dimensional and low-dimensional spaces are constructed separately, so that for any $i$ and $j$, $p_{ij} = p_{ji}$ and $q_{ij} = q_{ji}$:
Figure BDA0002537544200000034
Figure BDA0002537544200000035
Wherein p isi,jRepresenting the joint probability, q, between any two data in a high-dimensional spacei,jRepresenting the joint probability between any two data in a low dimensional space.
The similarity of the joint probability distributions of the high-dimensional space and the low-dimensional space is measured using the KL divergence:
$$C = KL(P \,\|\, Q) = \sum_{i} \sum_{j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$
where $C$ represents the similarity of the joint probability distributions of the high-dimensional space and the low-dimensional space, $P$ represents the joint probability of the high-dimensional space, and $Q$ represents the joint probability of the low-dimensional space.
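The probabilities and the KL cost described above can be illustrated with a small pure-Python sketch. Several points are assumptions made here and not statements of the invention: a single `sigma` is used for every point (practical t-SNE implementations tune one per point), the joint probabilities are formed by the common symmetrization $(p_{j|i} + p_{i|j}) / 2N$, and all function names are hypothetical.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def conditional_probs(points: Sequence[Point], sigma: float) -> List[List[float]]:
    """Row i holds the conditional similarities to every j; the diagonal is 0."""
    n = len(points)
    rows = []
    for i in range(n):
        weights = [
            0.0 if j == i
            else math.exp(-sq_dist(points[i], points[j]) / (2.0 * sigma ** 2))
            for j in range(n)
        ]
        total = sum(weights)
        rows.append([w / total for w in weights])
    return rows


def symmetrize(cond: List[List[float]]) -> List[List[float]]:
    """Assumed symmetrization: joint_ij = (cond_ij + cond_ji) / (2N)."""
    n = len(cond)
    return [[(cond[i][j] + cond[j][i]) / (2.0 * n) for j in range(n)] for i in range(n)]


def kl_divergence(p: List[List[float]], q: List[List[float]], eps: float = 1e-12) -> float:
    """C = sum_ij p_ij * log(p_ij / q_ij), skipping zero entries of P."""
    return sum(
        pij * math.log(pij / max(qij, eps))
        for prow, qrow in zip(p, q)
        for pij, qij in zip(prow, qrow)
        if pij > 0.0
    )


high = [[0.0, 0.0, 1.0], [0.1, 0.0, 0.9], [5.0, 5.0, 5.0]]  # toy high-dimensional data
low = [[0.0, 0.0], [0.2, 0.1], [3.0, 3.0]]                  # toy two-dimensional embedding
P = symmetrize(conditional_probs(high, sigma=1.0))
# sigma = 1/sqrt(2) makes the exponent reduce to the plain squared distance.
Q = symmetrize(conditional_probs(low, sigma=1.0 / math.sqrt(2.0)))
cost = kl_divergence(P, Q)
print(f"KL(P || Q) = {cost:.6f}")
```

A full t-SNE run would then move the low-dimensional points by gradient descent to reduce this cost; the sketch only evaluates it for a fixed embedding.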
Further, the building process of the prediction model of the Chinese herbal medicine survival area in S3 is as follows:
the SVDD model adopts a fully connected network $\phi(\cdot; W)$ to map the ecological factor preprocessing data to a high-dimensional feature space;
an optimal hypersphere is then found in the high-dimensional feature space, such that the pseudo-nonexistent sample data lie outside the optimal hypersphere and the sample distribution data in the ecological factor data lie inside the optimal hypersphere.
Further, suppose the network has $L$ layers, indexed by $\ell \in \{1, \dots, L\}$. For an arbitrary input $x$, with $z^{0} = x$, the output of the $\ell$-th layer is:
$$z^{\ell} = \sigma^{\ell}\left(W^{\ell} \cdot z^{\ell-1}\right)$$
where $\cdot$ represents a linear operation (e.g., matrix multiplication), $\sigma^{\ell}$ is the activation function of the $\ell$-th layer, and $W^{\ell}$ is the weight of the $\ell$-th layer.
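The layer-by-layer mapping described above can be sketched as a forward pass in which each layer applies a linear operation followed by an activation function. The leaky ReLU activation, the absence of bias terms and the helper names are assumptions chosen for this illustration; the text only states that each layer has a weight and an activation function.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, z: Vector) -> Vector:
    """Linear operation of one layer: matrix-vector product."""
    return [sum(wij * zj for wij, zj in zip(row, z)) for row in w]


def leaky_relu(v: Vector, slope: float = 0.01) -> Vector:
    """Assumed activation function; the text does not name one."""
    return [x if x > 0 else slope * x for x in v]


def forward(x: Vector, weights: List[Matrix]) -> Vector:
    """Apply the layers in order: each output is activation(W applied to previous output)."""
    z = list(x)
    for w in weights:
        z = leaky_relu(matvec(w, z))
    return z


weights = [
    [[1.0, -1.0], [0.5, 0.5], [-1.0, 0.0]],  # layer 1: 2 inputs -> 3 units
    [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]],      # layer 2: 3 units -> 2 outputs
]
print(forward([2.0, 1.0], weights))
```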
The objective function is as follows:
Figure BDA00025375442000000410
Figure BDA00025375442000000415
The first term in equation (10) is the mean taken over the summed squared-radius terms, subject to the constraint that the distance from each network representation $\phi(x_i; W)$ to the center $c$ is less than the sum of the squared radius and the slack variable, where $n$ represents the number of samples. The second term is a weight decay regularizer using the L2 norm, where $\lambda$ is the weight decay coefficient and $\lambda > 0$; $\xi_i$ is a slack variable satisfying $\xi_i \ge 0$, and $\|\cdot\|_F$ is the Frobenius norm. The objective therefore seeks a minimum-volume hypersphere centered at $c$, shrinking the sphere radius by minimizing the average deviation of all data representations from the center.
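To make the roles of the radius, the slack variables and the weight decay concrete, the following sketch evaluates a soft-boundary objective of the kind used in published Deep SVDD work. It is not a reproduction of equation (10), whose image is not available here: the slack is taken as $\max(0, \|z_i - c\|^2 - R^2)$, the trade-off constant is fixed to 1, only presence representations are handled, and every name is hypothetical.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def frobenius_sq(w: Matrix) -> float:
    """Squared Frobenius norm of one weight matrix."""
    return sum(v * v for row in w for v in row)


def soft_boundary_objective(
    reps: List[Sequence[float]],  # network representations of the samples
    center: Sequence[float],      # hypersphere center c
    radius: float,                # hypersphere radius R
    weights: List[Matrix],        # one weight matrix per layer
    lam: float,                   # weight decay coefficient, lam > 0
) -> float:
    # Slack of sample i: how far its squared distance exceeds R^2 (0 if inside).
    slacks = [max(0.0, sq_dist(z, center) - radius ** 2) for z in reps]
    data_term = radius ** 2 + sum(slacks) / len(slacks)
    decay_term = 0.5 * lam * sum(frobenius_sq(w) for w in weights)
    return data_term + decay_term


reps = [[0.0, 0.0], [3.0, 4.0]]
value = soft_boundary_objective(reps, center=[0.0, 0.0], radius=1.0,
                                weights=[[[1.0, 2.0], [3.0, 4.0]]], lam=0.1)
print(value)
```

In training, such a value would be minimized over the network weights and the radius with SGD or one of its variants, processing the samples in mini-batches.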
To minimize equation (10), Lagrange multipliers $\alpha_i$ and $\beta_i$ are introduced, and the Lagrangian function is constructed as follows:
Figure BDA00025375442000000413
s.t. $\alpha_i \ge 0$, $\beta_i \ge 0$
Taking the derivatives with respect to $R$, $c$ and $\xi$ yields:
Figure BDA0002537544200000051
Combining equation (11) and equation (12) gives:
Figure BDA0002537544200000052
the radius and center formula of the optimal hypersphere is:
Figure BDA0002537544200000053
Figure BDA0002537544200000054
wherein $c$ represents the center point, and
Figure BDA0002537544200000055
$n$ represents the number of samples,
Figure BDA00025375442000000511
represents each connected network, $\alpha_i$ and $\alpha_j$ represent the Lagrange multipliers,
Figure BDA0002537544200000058
represents the inner product,
Figure BDA0002537544200000059
represents a support vector, and
Figure BDA00025375442000000510
Further, in S4, a test point of a Chinese herbal medicine to be predicted in any of the ecological factor data is selected;
calculating the distance between the test point of the traditional Chinese medicinal material to be predicted and the optimal hypersphere center point:
Figure BDA0002537544200000057
wherein $x'$ represents the test point, $s(x')$ represents the distance between the test point and the center point of the optimal hypersphere,
Figure BDA0002537544200000056
represents the hyper-parameters of the SVDD model;
judging whether $s(x')$ is larger than the radius of the optimal hypersphere, that is, the hypersphere with the minimum volume: when $s(x')$ is larger than the radius of the optimal hypersphere, the test point belongs to a non-survival area;
when $s(x')$ is less than or equal to the radius of the optimal hypersphere, the test point belongs to a survival area;
repeating the above operation for the test points of all the traditional Chinese medicinal materials to be predicted gives all the survival areas of the traditional Chinese medicinal materials to be predicted.
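The decision rule above can be sketched as a comparison between a distance and the radius. The sketch assumes that each test point has already been mapped into the feature space by the trained network and that the score is the plain Euclidean distance to the center; the function names and the labels "survival" and "non-survival" are hypothetical.

```python
import math
from typing import Dict, Sequence


def distance_to_center(z: Sequence[float], center: Sequence[float]) -> float:
    """Euclidean distance from a mapped test point to the hypersphere center."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(z, center)))


def classify(z: Sequence[float], center: Sequence[float], radius: float) -> str:
    """Inside or on the hypersphere: survival area; outside: non-survival area."""
    return "survival" if distance_to_center(z, center) <= radius else "non-survival"


def classify_all(points: Dict[str, Sequence[float]], center: Sequence[float], radius: float) -> Dict[str, str]:
    """Apply the rule to every test point to obtain the full set of labels."""
    return {name: classify(z, center, radius) for name, z in points.items()}


mapped = {"site_a": [0.3, 0.4], "site_b": [3.0, 4.0]}
print(classify_all(mapped, center=[0.0, 0.0], radius=1.0))
```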
A Chinese medicinal material survival area prediction system based on a deep SVDD model comprises:
the acquisition module, which is used for acquiring ecological factor data of the traditional Chinese medicinal materials and generating pseudo-nonexistent sample data of the traditional Chinese medicinal materials;
the pseudo-nonexistent sample data are the non-survival areas of the Chinese medicinal material obtained through a MaxEnt model;
the preprocessing module, which is used for preprocessing the collected ecological factor data of the traditional Chinese medicinal materials to obtain ecological factor preprocessing data;
the prediction model generation module, which is used for constructing a Chinese medicinal material survival area prediction model according to the ecological factor preprocessing data, the pseudo-nonexistent sample data and the SVDD model;
and the prediction module, which is used for predicting the survival area of the Chinese medicinal material to be predicted.
Further, the prediction process of the prediction module: selecting a test point of the traditional Chinese medicinal material to be predicted in any ecological factor data;
calculating the distance between the test point of the traditional Chinese medicinal material to be predicted and the optimal hypersphere center point in the traditional Chinese medicinal material survival area prediction model:
Figure BDA0002537544200000062
wherein $x'$ represents a test point, $s(x')$ represents the distance between the test point and the center point of the optimal hypersphere of the Chinese medicinal material survival area prediction model,
Figure BDA0002537544200000061
represents the hyper-parameters of the SVDD model;
judging whether $s(x')$ is larger than the radius of the optimal hypersphere, that is, the hypersphere with the minimum volume: when $s(x')$ is larger than the radius of the optimal hypersphere, the test point belongs to a non-survival area;
when $s(x')$ is less than or equal to the radius of the optimal hypersphere, the test point belongs to a survival area; repeating the above operation for the test points of all the traditional Chinese medicinal materials to be predicted gives all the survival areas of the traditional Chinese medicinal materials to be predicted.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. In the Chinese medicinal material survival area prediction method and system based on the deep SVDD model, the SVDD model is trained on the Chinese medicinal material survival area data, and the parameters of the deep SVDD model are optimized using SGD and its variants, so that the computational complexity scales linearly with the number of training batches and the method scales well to large data sets;
2. In the Chinese medicinal material survival area prediction method and system based on the deep SVDD model, the SVDD model describes all the data as vectors, so sampling bias and overfitting are avoided; the survival area is obtained by judging the distance between each test point and the optimal hypersphere, which improves the accuracy of the predicted survival area of the Chinese medicinal materials.
Drawings
The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention. In the drawings:
FIG. 1 is a flow diagram of the overall process of the present invention;
FIG. 2 is a schematic diagram of the system of the present invention;
FIG. 3 is a diagram illustrating the SVDD model operation according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to examples and accompanying drawings, and the exemplary embodiments and descriptions thereof are only used for explaining the present invention and are not meant to limit the present invention.
Example 1
As shown in FIG. 1, a method for predicting the Chinese medicinal material survival area based on a deep SVDD model includes:
S1: collecting ecological factor data of the traditional Chinese medicinal materials, and generating pseudo-nonexistent sample data of the traditional Chinese medicinal materials by adopting a MaxEnt model;
the pseudo-nonexistent sample data are the non-survival areas of the Chinese medicinal material obtained through the MaxEnt model;
S2: preprocessing the collected ecological factor data of the traditional Chinese medicinal materials to obtain ecological factor preprocessing data;
S3: constructing a prediction model of the Chinese medicinal material survival area according to the ecological factor preprocessing data, the pseudo-nonexistent sample data and the SVDD model;
S4: putting the test points of the traditional Chinese medicinal materials to be predicted into the prediction model of the survival area of the traditional Chinese medicinal materials for judgment to obtain the survival area of the traditional Chinese medicinal materials to be predicted.
The ecological factor data includes sample distribution data, environmental factor data, and map data.
The step of generating the pseudo-nonexistence sample data in S1 includes:
generating a survival area value result for the traditional Chinese medicinal material using the MaxEnt model;
rejecting the values greater than or equal to the threshold in the survival area value result of the traditional Chinese medicinal material to obtain the non-survival area;
and selecting from the non-survival area the same number of pseudo-nonexistent points as there are points in the ecological factor data of the traditional Chinese medicinal material, to obtain the pseudo-nonexistent sample data of the traditional Chinese medicinal material.
The preprocessing process of S2 includes:
converting the ecological factor data into high-dimensional word vectors using the word vector model Word2vec;
and mapping the high-dimensional word vectors into two-dimensional word vectors using the t-SNE algorithm.
The t-SNE algorithm is as follows:
a probability distribution is constructed over the high-dimensional objects, and the similarity between different data is expressed as:
$$p_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)}$$
wherein $p_{j|i}$ represents the similarity between different data in the high-dimensional space, $x_i$ and $x_j$ are any two different data among the $N$ data $x_1, x_2, \dots, x_N$, the parameter $\sigma_i$ represents the variance of the Gaussian distribution centered at $x_i$, and $\|\cdot\|$ represents the two-norm;
a probability distribution over the objects is then constructed in the low-dimensional space, and the similarity between different data is expressed as:
$$q_{j|i} = \frac{\exp\left(-\|y_i - y_j\|^2\right)}{\sum_{k \neq i} \exp\left(-\|y_i - y_k\|^2\right)}$$
wherein $q_{j|i}$ represents the similarity between different data in the low-dimensional space, and $y_i$ and $y_j$ represent the two-dimensional data in the low-dimensional space.
The joint probability distributions $P$ and $Q$ of the high-dimensional and low-dimensional spaces are constructed separately, so that for any $i$ and $j$, $p_{ij} = p_{ji}$ and $q_{ij} = q_{ji}$:
Figure BDA0002537544200000081
Figure BDA0002537544200000082
Wherein p isi,jRepresenting the joint probability, q, between any two data in a high-dimensional spacei,jRepresenting the joint probability between any two data in a low dimensional space.
The similarity of the joint probability distributions of the high-dimensional space and the low-dimensional space is measured using the KL divergence:
$$C = KL(P \,\|\, Q) = \sum_{i} \sum_{j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$
where $C$ represents the similarity of the joint probability distributions of the high-dimensional space and the low-dimensional space, $P$ represents the joint probability of the high-dimensional space, and $Q$ represents the joint probability of the low-dimensional space.
The construction process of the Chinese medicinal material survival area prediction model in the S3 is as follows:
the SVDD model adopts a fully connected network $\phi(\cdot; W)$ to map the ecological factor preprocessing data to a high-dimensional feature space;
an optimal hypersphere is then found in the high-dimensional feature space, such that the pseudo-nonexistent sample data lie outside the optimal hypersphere and the sample distribution data in the ecological factor data lie inside the optimal hypersphere.
The radius and center formula of the optimal hypersphere:
Figure BDA0002537544200000085
Figure BDA0002537544200000086
where c represents the center point, and
Figure BDA0002537544200000087
n represents the number of samples,
Figure BDA00025375442000000811
represents the fully connected network, $\alpha_i$ and $\alpha_j$ represent Lagrange multipliers,
Figure BDA0002537544200000088
represents an inner product,
Figure BDA0002537544200000089
represents a support vector, and
Figure BDA00025375442000000810
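The radius and center formulas are only present as images. As a hedged sketch, assuming they take the classical SVDD form (center as a multiplier-weighted sum of mapped samples, radius as the distance from the center to a support vector, expanded into inner products), the computation could look like this. The function name, the array `Phi` of already-mapped samples, and the multipliers `alpha` passed in as given values are assumptions for illustration; solving for the multipliers is not shown.

```python
import numpy as np

def center_and_radius(Phi, alpha, s):
    """Assumed classical SVDD quantities.

    Phi   : (n, d) array, row i is the network output for sample x_i
    alpha : (n,) Lagrange multipliers, assumed non-negative and summing to 1
    s     : index of a support vector lying on the hypersphere boundary
    """
    G = Phi @ Phi.T                                  # inner products <phi_i, phi_j>
    c = alpha @ Phi                                  # c = sum_i alpha_i * phi(x_i)
    # R^2 = <phi_s, phi_s> - 2 sum_i alpha_i <phi_i, phi_s>
    #       + sum_i sum_j alpha_i alpha_j <phi_i, phi_j>
    r2 = G[s, s] - 2.0 * (alpha @ G[:, s]) + alpha @ G @ alpha
    return c, float(np.sqrt(max(r2, 0.0)))
```

The inner-product expansion equals the squared Euclidean distance between the support vector and the center, which gives a quick consistency check on any implementation.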
In S4, a test point of the traditional Chinese medicinal material to be predicted is selected from any of the ecological factor data;
the distance between the test point of the traditional Chinese medicinal material to be predicted and the center of the optimal hypersphere is calculated:
Figure BDA0002537544200000093
where $x'$ represents the test point, $s(x')$ represents the distance between the test point and the center of the optimal hypersphere, and
Figure BDA0002537544200000094
represents the hyper-parameters of the SVDD model;
$s(x')$ is compared with the radius of the optimal (minimum-volume) hypersphere: when $s(x')$ is greater than the radius of the optimal hypersphere, the test point belongs to a non-suitable area;
when $s(x')$ is less than or equal to the radius of the optimal hypersphere, the test point belongs to a survival area;
the above operation is performed on all test points of the traditional Chinese medicinal material to be predicted to obtain all survival areas of the traditional Chinese medicinal material to be predicted.
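The decision rule above can be sketched as follows, assuming the test points have already been passed through the trained network and that $s(x')$ is the Euclidean distance to the hypersphere center. The function name and argument names are hypothetical.

```python
import numpy as np

def classify_test_points(Phi_test, center, radius):
    """Return distances s(x') and a boolean mask that is True for survival areas.

    Phi_test : (m, d) array of mapped test points (assumed network outputs)
    center   : (d,) center of the optimal hypersphere
    radius   : radius of the optimal hypersphere
    """
    s = np.linalg.norm(Phi_test - center, axis=1)
    is_survival_area = s <= radius       # s(x') > radius means non-suitable area
    return s, is_survival_area
```

Applying the function to every grid cell of a study region yields the full survival area as the set of cells where the mask is True.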
As shown in Fig. 2, a Chinese medicinal material survival area prediction system based on a deep SVDD model includes:
an acquisition module, used for collecting ecological factor data of the traditional Chinese medicinal materials and generating pseudo-nonexistent sample data of the traditional Chinese medicinal materials;
the pseudo-nonexistent sample data are non-suitable areas of the traditional Chinese medicinal materials obtained through a MaxEnt model;
a preprocessing module, used for preprocessing the collected ecological factor data of the traditional Chinese medicinal materials to obtain ecological factor preprocessing data;
a prediction model generation module, used for constructing the Chinese medicinal material survival area prediction model according to the ecological factor preprocessing data, the pseudo-nonexistent sample data and the SVDD model;
and a prediction module, used for predicting the survival area of the traditional Chinese medicinal material to be predicted.
Further, the prediction process of the prediction module is as follows: a test point of the traditional Chinese medicinal material to be predicted is selected from any of the ecological factor data;
the distance between the test point of the traditional Chinese medicinal material to be predicted and the center of the optimal hypersphere in the Chinese medicinal material survival area prediction model is calculated:
Figure BDA0002537544200000092
where $x'$ represents the test point, $s(x')$ represents the distance between the test point and the center of the optimal hypersphere of the Chinese medicinal material survival area prediction model, and
Figure BDA0002537544200000091
represents the hyper-parameters of the SVDD model;
$s(x')$ is compared with the radius of the optimal (minimum-volume) hypersphere: when $s(x')$ is greater than the radius of the optimal hypersphere, the test point belongs to a non-suitable area;
when $s(x')$ is less than or equal to the radius of the optimal hypersphere, the test point belongs to a survival area; the above operation is performed on all test points of the traditional Chinese medicinal material to be predicted to obtain all survival areas of the traditional Chinese medicinal material to be predicted.
Example 2
As shown in Fig. 3, on the basis of embodiment 1 and in view of the growing demand for Salvia miltiorrhiza, the present invention takes Salvia miltiorrhiza as the research object and obtains a total of 120 sample distribution records of Salvia miltiorrhiza presence points. A total of 26 environmental factors are selected, as shown in Table 1: 19 climate factors, 3 terrain factors and 4 soil factors. There are 120 pseudo-nonexistent sample records.
The 240 Salvia miltiorrhiza samples are used to verify the effectiveness of the model, with the training set and the test set accounting for 80% and 20%, respectively. The learning rate is set to 0.0001. The number of training epochs is set to 150; in one epoch, the procedure of embodiment 1 is run on the data so that the entire training set passes through the whole network once. The batch size is set to 20 and the weight decay factor is set to 5e-07.
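To make the arithmetic behind these settings concrete, the following sketch stores the stated values in a small configuration object and derives the split sizes and number of parameter updates. The class and property names are hypothetical, and rounding the last partial batch up is an assumption, since the text does not say how the final batch is handled.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    n_samples: int = 240          # 120 presence + 120 pseudo-nonexistent samples
    train_fraction: float = 0.8
    learning_rate: float = 1e-4
    epochs: int = 150
    batch_size: int = 20
    weight_decay: float = 5e-7

    @property
    def n_train(self) -> int:
        return round(self.n_samples * self.train_fraction)

    @property
    def n_test(self) -> int:
        return self.n_samples - self.n_train

    @property
    def steps_per_epoch(self) -> int:
        # Assumption: a final partial batch still counts as one step.
        return math.ceil(self.n_train / self.batch_size)

    @property
    def total_steps(self) -> int:
        return self.steps_per_epoch * self.epochs

cfg = TrainConfig()
print(cfg.n_train, cfg.n_test, cfg.steps_per_epoch, cfg.total_steps)
```

With the stated values this gives 192 training samples, 48 test samples, 10 batches per epoch and 1500 updates in total.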
With the AUC value as the evaluation index, the AUC value of this example is 0.997, while the AUC value of the MaxEnt model is 0.899.
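The text does not say how the AUC was computed. As an illustration only, AUC can be obtained from its rank interpretation: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The function name is hypothetical, and treating the distance $s(x')$ as the score with pseudo-nonexistent samples as the positive class is an assumption.

```python
import numpy as np

def auc_score(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return float(wins / (pos.size * neg.size))

# Toy example: label 1 = pseudo-nonexistent sample, score = distance to the center.
print(auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```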
Table 1. Ecological environmental factors and distribution list of Chinese medicinal materials
Figure BDA0002537544200000101
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (7)

1. A Chinese medicinal material survival area prediction method based on a deep SVDD model, characterized by comprising the following steps:
S1: collecting ecological factor data of the traditional Chinese medicinal materials, and generating pseudo-nonexistent sample data of the traditional Chinese medicinal materials by using a MaxEnt model;
the pseudo-nonexistent sample data are non-suitable areas of the traditional Chinese medicinal materials obtained through the MaxEnt model;
S2: preprocessing the collected ecological factor data of the traditional Chinese medicinal materials to obtain ecological factor preprocessing data;
S3: constructing a Chinese medicinal material survival area prediction model according to the ecological factor preprocessing data, the pseudo-nonexistent sample data and the SVDD model;
the Chinese medicinal material survival area prediction model in S3 is constructed as follows:
the SVDD model uses a fully connected network
Figure FDA0003558342190000011
to map the ecological factor preprocessing data into a high-dimensional feature space;
an optimal hypersphere is found in the high-dimensional feature space, such that the pseudo-nonexistent sample data lie outside the optimal hypersphere and the sample distribution data in the ecological factor data lie inside the optimal hypersphere;
the radius and center of the optimal hypersphere are given by:
Figure FDA0003558342190000012
Figure FDA0003558342190000013
wherein c represents the center point, and
Figure FDA0003558342190000014
n represents the number of samples,
Figure FDA0003558342190000015
represents the fully connected network, $\alpha_i$ and $\alpha_j$ represent Lagrange multipliers,
Figure FDA0003558342190000016
represents an inner product,
Figure FDA0003558342190000017
and
Figure FDA0003558342190000018
both represent inner products,
Figure FDA0003558342190000019
represents a support vector, and
Figure FDA00035583421900000110
S4: and putting the test points of the traditional Chinese medicinal materials to be predicted into the prediction model of the survival area of the traditional Chinese medicinal materials for judgment to obtain the survival area of the traditional Chinese medicinal materials to be predicted.
2. The method as claimed in claim 1, wherein the ecological factor data includes sample distribution data, environmental factor data and map data.
3. The Chinese medicinal material survival area prediction method based on the deep SVDD model as claimed in claim 1, wherein the step of generating the pseudo-nonexistent sample data in S1 comprises:
generating a numerical survival area result for the traditional Chinese medicinal material by using a MaxEnt model;
eliminating the values greater than or equal to a threshold from the numerical survival area result of the traditional Chinese medicinal material to obtain a non-suitable area;
and selecting, from the non-suitable area, the same number of pseudo-nonexistent points as there are ecological factor data of the traditional Chinese medicinal material, to obtain the pseudo-nonexistent sample data of the traditional Chinese medicinal material.
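The three steps of this claim can be illustrated with a short sketch, assuming the MaxEnt output is available as a 2-D grid of suitability values. The function name, the grid representation, uniform random selection and the fixed seed are all assumptions; the claim does not specify how points are chosen within the non-suitable area.

```python
import numpy as np

def sample_pseudo_nonexistent(suitability, threshold, n_presence, seed=0):
    """Pick as many non-suitable grid cells as there are presence records.

    suitability : 2-D array of MaxEnt output values (hypothetical input format)
    threshold   : cells with value >= threshold are eliminated
    n_presence  : number of presence records to match
    """
    suitability = np.asarray(suitability, dtype=float)
    candidates = np.argwhere(suitability < threshold)   # the non-suitable area
    if len(candidates) < n_presence:
        raise ValueError("not enough non-suitable cells to sample from")
    rng = np.random.default_rng(seed)
    chosen = rng.choice(len(candidates), size=n_presence, replace=False)
    return candidates[chosen]                           # (row, col) index pairs
```

In embodiment 2 this would correspond to drawing 120 points to match the 120 presence records.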
4. The Chinese medicinal material survival area prediction method based on the deep SVDD model as claimed in claim 1, wherein the preprocessing process of S2 comprises:
converting the ecological factor data into high-dimensional word vectors by using the word vector model Word2vec;
and mapping the high-dimensional word vectors into two-dimensional word vectors by using the t-SNE algorithm.
5. The Chinese medicinal material survival area prediction method based on the deep SVDD model according to claim 4, wherein the t-SNE algorithm comprises:
constructing a probability distribution over the high-dimensional objects, wherein the similarity between different data is expressed as:
Figure FDA0003558342190000021
wherein $p_{j|i}$ represents the similarity between different data in the high-dimensional space, $x_i$ and $x_j$ are any two different data among the N-dimensional data $x_1, x_2, \ldots, x_N$, the parameter $\sigma_i$ represents the variance of the Gaussian distribution centered at $x_i$, $\|\cdot\|$ represents the two-norm operation, and $x_k$ represents the datum with subscript k among the N-dimensional data $x_1, x_2, \ldots, x_N$;
constructing a probability distribution for the high-dimensional objects in the low-dimensional space, wherein the similarity between different data is expressed as:
Figure FDA0003558342190000022
wherein $q_{j|i}$ represents the similarity between different data in the low-dimensional space, $y_i$ and $y_j$ represent the two-dimensional data $y_1, y_2$ in the low-dimensional space, and $y_k$ represents the two-dimensional datum with subscript k in the low-dimensional space;
constructing joint probability distributions P and Q for the high-dimensional and low-dimensional spaces, respectively, such that for any i and j, $p_{i|j} = p_{j|i}$ and $q_{i|j} = q_{j|i}$:
Figure FDA0003558342190000023
Figure FDA0003558342190000024
wherein $p_{i,j}$ represents the joint probability between any two data in the high-dimensional space, $q_{i,j}$ represents the joint probability between any two data in the low-dimensional space, and $y_l$ represents the two-dimensional datum with subscript l in the low-dimensional space;
and measuring the similarity between the joint probability distributions of the high-dimensional and low-dimensional spaces with the KL divergence:
Figure FDA0003558342190000025
wherein C represents the similarity of the joint probability distribution of the high-dimensional space and the low-dimensional space, P represents the joint probability of the high-dimensional space, and Q represents the joint probability of the low-dimensional space.
6. The Chinese medicinal material survival area prediction method based on the deep SVDD model as claimed in claim 1, wherein in S4, a test point of the traditional Chinese medicinal material to be predicted is selected from any of the ecological factor data;
the distance between the test point of the traditional Chinese medicinal material to be predicted and the center of the optimal hypersphere is calculated:
Figure FDA0003558342190000031
wherein $x'$ represents the test point, $s(x')$ represents the distance between the test point and the center of the optimal hypersphere, and
Figure FDA0003558342190000032
represents the hyper-parameters of the SVDD model;
$s(x')$ is compared with the radius of the optimal (minimum-volume) hypersphere: when $s(x')$ is greater than the radius of the optimal hypersphere, the test point belongs to a non-suitable area;
when $s(x')$ is less than or equal to the radius of the optimal hypersphere, the test point belongs to a survival area;
and the above operation is performed on all test points of the traditional Chinese medicinal material to be predicted to obtain all survival areas of the traditional Chinese medicinal material to be predicted.
7. A Chinese medicinal material survival area prediction system based on a deep SVDD model, characterized by comprising:
an acquisition module, used for collecting ecological factor data of the traditional Chinese medicinal materials and generating pseudo-nonexistent sample data of the traditional Chinese medicinal materials;
the pseudo-nonexistent sample data are non-suitable areas of the traditional Chinese medicinal materials obtained through a MaxEnt model;
a preprocessing module, used for preprocessing the collected ecological factor data of the traditional Chinese medicinal materials to obtain ecological factor preprocessing data;
a prediction model generation module, used for constructing a Chinese medicinal material survival area prediction model according to the ecological factor preprocessing data, the pseudo-nonexistent sample data and the SVDD model;
a prediction module, used for predicting the survival area of the traditional Chinese medicinal material to be predicted;
the prediction process of the prediction module is as follows: a test point of the traditional Chinese medicinal material to be predicted is selected from any of the ecological factor data;
the distance between the test point of the traditional Chinese medicinal material to be predicted and the center of the optimal hypersphere in the Chinese medicinal material survival area prediction model is calculated:
Figure FDA0003558342190000033
wherein $x'$ represents the test point, $s(x')$ represents the distance between the test point and the center of the optimal hypersphere of the Chinese medicinal material survival area prediction model, and
Figure FDA0003558342190000034
represents the hyper-parameters of the SVDD model;
$s(x')$ is compared with the radius of the optimal (minimum-volume) hypersphere: when $s(x')$ is greater than the radius of the optimal hypersphere, the test point belongs to a non-suitable area;
when $s(x')$ is less than or equal to the radius of the optimal hypersphere, the test point belongs to a survival area; and the above operation is performed on all test points of the traditional Chinese medicinal material to be predicted to obtain all survival areas of the traditional Chinese medicinal material to be predicted.
CN202010537578.6A 2020-06-12 2020-06-12 Chinese medicinal material survival area prediction method and system based on depth SVDD model Active CN111680843B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010537578.6A CN111680843B (en) 2020-06-12 2020-06-12 Chinese medicinal material survival area prediction method and system based on depth SVDD model

Publications (2)

Publication Number Publication Date
CN111680843A CN111680843A (en) 2020-09-18
CN111680843B true CN111680843B (en) 2022-06-28

Family

ID=72435523

Country Status (1)

Country Link
CN (1) CN111680843B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113095674A (en) * 2021-04-12 2021-07-09 云南省林业调查规划院 Analysis method for potential habitat of Yunnan key protection wild plant based on MaxEnt and GIS

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101398417A (en) * 2008-10-29 2009-04-01 中国药科大学 Universal method for rapid detection and structural identification for Chinese traditional medicine complex component
CN102521480A (en) * 2011-11-15 2012-06-27 中国医学科学院药用植物研究所 Method for selecting new producing area of Chinese medical herb
CN103345588A (en) * 2013-07-18 2013-10-09 成都中医药大学 Method for calculating number of wild traditional Chinese medicine potential resources
CN106372460A (en) * 2016-08-24 2017-02-01 成都旅美科技有限公司 Environment analysis-based biological distribution determination apparatus
CN106845699A (en) * 2017-01-05 2017-06-13 南昌大学 A kind of method for predicting oil tea normal region
CN106961973A (en) * 2017-03-30 2017-07-21 杨友仁 The method that pulse family Chinese medicine is sowed on a large scale is realized using intelligent bulb technology
CN107403057A (en) * 2016-05-20 2017-11-28 中国中医科学院中药研究所 A kind of Chinese medicine Quality Regionalization model based on maximum informational entropy and improved independence weight coefficient
CN110222343A (en) * 2019-06-13 2019-09-10 电子科技大学 A kind of Chinese medicine plant resource name entity recognition method
CN110348060A (en) * 2019-06-13 2019-10-18 中国测绘科学研究院 A kind of snow leopard Habitat suitability evaluation method and device
CN111178631A (en) * 2019-12-30 2020-05-19 广州地理研究所 Method and system for predicting water lettuce invasion distribution area

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Predicting the Potential Distribution Patterns of the Rare Plant Gymnocarpos Przewalskii Under Present and Future Climate Change;Ma Songmei等;《2011 International Conference on Consumer Electronics, Communications and Networks (CECNet)》;20110516;1513-1515 *
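Design and Implementation of a GIS-Based Suitability Analysis System for Chinese Medicinal Material Producing Areas;孙成忠 et al.;《世界科学技术-中医药现代化》;20060331;Vol. 8 (No. 3);112-117 *
Study on Suitability Distribution Zoning of Platycodon (Jiegeng) Based on MaxEnt and GIS Technology;董光 et al.;《中药材》;20190131;Vol. 42 (No. 1);66-70 *
Prediction and Analysis of the Suitable Areas in China of the Tobacco Beetle, a Pest of Codonopsis (Dangshen), Based on the Maxent Model;侯沁文 et al.;《长治学院学报》;20200415(No. 02);176-183 *
Study on Ecological Suitability Zoning of Pseudostellaria (Taizishen) in Shandong Based on Ecological Factors;边丽华 et al.;《山东农业科学》;20180228(No. 02);68-75 *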

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant