CN111680843B - Chinese medicinal material survival area prediction method and system based on deep SVDD model
- Publication number: CN111680843B
- Application number: CN202010537578.6A
- Authority: CN (China)
- Prior art keywords: data, Chinese medicinal, traditional Chinese, model, medicinal materials
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/04—Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
- G06F18/2135—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
- G06F18/21355—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis nonlinear criteria, e.g. embedding a manifold in a Euclidean space
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/02—Agriculture; Fishing; Forestry; Mining
Abstract
The invention discloses a method and system for predicting the survival area of Chinese medicinal materials based on a deep SVDD model. The method comprises: collecting ecological factor data of the Chinese medicinal materials, and generating pseudo-absence (pseudo-nonexistent) sample data of the Chinese medicinal materials with a MaxEnt model; preprocessing the collected ecological factor data to obtain preprocessed ecological factor data; constructing a survival area prediction model from the preprocessed ecological factor data, the pseudo-absence sample data, and the SVDD model; and feeding the test points of the Chinese medicinal material to be predicted into the prediction model to determine its survival area. The parameters of the deep SVDD model are optimized with SGD and its variants, so the computational complexity of the model scales linearly with the amount of training data and the method scales well to large data sets. The survival area is determined from the distance between each test point and the optimal hypersphere, which improves the accuracy of the survival area prediction.
Description
Technical Field
The invention relates to the field of development and utilization of traditional Chinese medicine resources, and in particular to a method and system for predicting the survival area of Chinese medicinal materials based on a deep SVDD model.
Background
The development, utilization, and sustainable protection of traditional Chinese medicine resources are central to research on these resources in China, which currently face many problems. For example, blindly expanding the cultivation area seriously affects the quality and producing regions of traditional Chinese medicines, the active ingredients of introduced medicinal materials differ markedly from the standards of the Chinese Pharmacopoeia, and the sustainable development of traditional Chinese medicine is severely restricted. To expand the introduction areas of Chinese medicinal materials more scientifically, research on the ecological suitability of the medicinal materials needs to be strengthened: the ecological factors that shape the medicinal materials, such as light, temperature, moisture, terrain, and soil, must be identified, and introduction, cultivation, and zoning management must be improved, so that environmental resources are used fully and reasonably, the medicinal materials are protected, and their sustainable development is achieved.
At present, most studies on predicting the distribution of survival areas of medicinal materials use a maximum entropy model together with existing distribution data and ecological environment data to predict the distribution pattern and shifts of the survival area. One prior-art approach combines the MaxEnt ecological niche model with GIS technology: based on distribution data for 214 Platycodon grandiflorum sample points, it analyzes the contribution rate of each ecological factor with the jackknife method and explores the main ecological factors and habitat characteristics that influence the growth of Platycodon grandiflorum. On that basis it zones the suitable growth areas of Platycodon grandiflorum nationwide, and the prediction accuracy metric AUC (area under the curve) reaches 0.922.
However, the MaxEnt model is a complex machine learning algorithm that is sensitive to sampling bias and prone to overfitting, and its transferability is good only at low thresholds; in addition, running the MaxEnt model with default parameters degrades the accuracy of the prediction.
Disclosure of Invention
The technical problem to be solved is that existing methods for predicting the survival area of Chinese medicinal materials rely on the MaxEnt model, which is prone to overfitting caused by sampling bias and whose default parameters degrade the accuracy of the prediction. The invention provides a method and system for predicting the survival area of Chinese medicinal materials based on a deep SVDD model to solve these problems.
The invention is realized by the following technical scheme:
A method for predicting the survival area of Chinese medicinal materials based on a deep SVDD model comprises the following steps:
S1: collecting ecological factor data of the Chinese medicinal materials, and generating pseudo-absence sample data of the Chinese medicinal materials with a MaxEnt model;
the pseudo-absence sample data are points in the non-survival area of the Chinese medicinal materials obtained from the MaxEnt model;
S2: preprocessing the collected ecological factor data of the Chinese medicinal materials to obtain preprocessed ecological factor data;
S3: constructing a survival area prediction model for the Chinese medicinal materials from the preprocessed ecological factor data, the pseudo-absence sample data, and the SVDD model;
S4: feeding the test points of the Chinese medicinal material to be predicted into the survival area prediction model for judgment to obtain the survival area of the Chinese medicinal material to be predicted.
The invention provides a survival area prediction model for Chinese medicinal materials based on deep support vector data description, i.e., a deep SVDD model. Because the ecological environment factors come in different data formats, the data must first be preprocessed: the data are converted to a unified representation based on the t-SNE algorithm. The deep support vector data description model then maps the converted data nonlinearly to a high-dimensional feature space, searches for an optimal hypersphere in that feature space, and optimizes the parameters of the deep SVDD model with SGD and its variants.
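The SGD optimization mentioned above can be illustrated with a minimal sketch. It is a deliberate simplification and not the patent's model: it assumes a single linear layer without bias in place of the fully connected network, uses only presence samples (a one-class objective, mean squared distance to a fixed center plus L2 weight decay), and fixes the center `c` to the mean initial representation. All names (`forward`, `train`, `lam`, the toy `data`) are hypothetical.

```python
import random

def forward(W, x):
    """Linear map z = W x (no bias, no activation): a stand-in for the network."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def mean_sq_dist(W, data, c):
    """Mean squared distance of all representations to the center c."""
    return sum(
        sum((z - ck) ** 2 for z, ck in zip(forward(W, x), c)) for x in data
    ) / len(data)

def train(data, out_dim=2, lr=0.1, lam=1e-3, epochs=200, seed=0):
    rng = random.Random(seed)
    in_dim = len(data[0])
    W = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(out_dim)]
    # Assumption: the center is fixed to the mean of the initial representations.
    outs = [forward(W, x) for x in data]
    c = [sum(o[k] for o in outs) / len(outs) for k in range(out_dim)]
    history = [mean_sq_dist(W, data, c)]
    for _ in range(epochs):
        order = list(range(len(data)))
        rng.shuffle(order)
        for i in order:  # one SGD step per sample
            x = data[i]
            z = forward(W, x)
            for k in range(out_dim):
                err = z[k] - c[k]
                for d in range(in_dim):
                    # Gradient of ||W x - c||^2 + (lam / 2) * ||W||_F^2 w.r.t. W[k][d]
                    W[k][d] -= lr * (2.0 * err * x[d] + lam * W[k][d])
        history.append(mean_sq_dist(W, data, c))
    return W, c, history

# Toy presence samples (hypothetical, already numeric and scaled).
data = [(t, 1.0 - t, 0.5) for t in (0.1, 0.3, 0.5, 0.7, 0.9)]
W, c, history = train(data)
print(history[0], history[-1])
```

Each update touches one sample at a time, which is why the cost of an epoch grows linearly with the amount of training data.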
In long-term research on traditional Chinese medicinal material resources, the state has provided large-scale storage and management of ecological factor data, which is therefore authoritative and authentic. The method accordingly obtains sample distribution data of the Chinese medicinal materials from the Chinese plant specimen museum and the national specimen platform. Because the sample points are widely distributed and contain duplicate records, valid data are screened by methods such as data cleaning to obtain the longitude and latitude of each sample point. The environmental factor data of the Chinese medicinal materials are obtained by consulting the relevant literature and comprise climate factors, terrain factors, and soil factors. Longitude and latitude are mapped through the national basic geographic information system network to obtain the ecological environment distribution of each region of China.
To increase the reliability of the model, it must be trained on both real presence samples and absence samples, where an absence sample indicates that a given medicinal material does not grow at a given location. Because no real absence data are available, the invention uses a MaxEnt model with tuned parameters to construct pseudo-absence data.
Because the input data are heterogeneous and come from multiple sources (for example, soil texture is text, whereas temperature and precipitation are numerical), the words in the text are first converted into word vectors with the word vector model Word2vec to obtain a feature representation of the text data. Because processing high-dimensional word vectors is inefficient, the t-SNE algorithm then maps the high-dimensional word vector space to a two-dimensional space, so that two words with similar meanings remain close after mapping and words with distant meanings remain far apart.
Further, the main ecological factor data used to determine the survival area of the Chinese medicinal materials comprise sample distribution data, environmental factor data, and map data.
Further, generating the pseudo-absence sample data in S1 includes:
generating a survival area value result for the Chinese medicinal material with the MaxEnt model;
the output of the MaxEnt model lies between 0 and 1 and gives a survival suitability index for each grid cell, where a grid cell can be regarded as a pixel of the map; the higher the index, the more suitable the cell is for the Chinese medicinal material. Grid cells whose index is at or above a certain threshold are regarded as the survival area; once the survival area has been selected, its longitudes and latitudes are removed from the map so that only the non-survival area remains;
removing the values greater than or equal to the threshold from the survival area value result to obtain the non-survival area;
because the model performs best when the number of presence points equals the number of pseudo-absence points, selecting from the non-survival area as many pseudo-absence points as there are presence points in the ecological factor data to obtain the pseudo-absence sample data of the Chinese medicinal material.
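The selection of pseudo-absence points described above can be sketched as follows. This is only an illustration under stated assumptions: the suitability grid is a hypothetical dictionary of MaxEnt-style indices keyed by (longitude, latitude), the threshold of 0.5 is an arbitrary example value (the text does not state one), and points are drawn uniformly at random.

```python
import random

def select_pseudo_absences(suitability, n_presence, threshold=0.5, seed=0):
    """Pick as many non-survival cells as there are presence points.

    suitability: dict mapping (lon, lat) -> suitability index in [0, 1]
                 (hypothetical stand-in for the MaxEnt output grid).
    """
    # Remove every cell whose index is >= threshold (the survival area).
    non_survival = sorted(cell for cell, s in suitability.items() if s < threshold)
    if len(non_survival) < n_presence:
        raise ValueError("not enough non-survival cells to match the presence points")
    rng = random.Random(seed)
    return rng.sample(non_survival, n_presence)

# Toy grid with made-up values.
grid = {
    (103.5, 30.5): 0.91,
    (104.0, 30.5): 0.12,
    (104.5, 30.5): 0.40,
    (105.0, 30.5): 0.77,
    (105.5, 30.5): 0.05,
}
pseudo_absences = select_pseudo_absences(grid, n_presence=2, threshold=0.5)
print(pseudo_absences)
```

Raising an error when too few non-survival cells remain makes the equal-count requirement explicit instead of silently returning an unbalanced sample.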
Further, the preprocessing in S2 includes:
converting the ecological factor data into high-dimensional word vectors with the word vector model Word2vec;
and mapping the high-dimensional word vectors to two-dimensional vectors with the t-SNE algorithm.
Further, the t-SNE algorithm proceeds as follows:
so that similar objects are selected with higher probability and dissimilar objects with lower probability, the similarity between objects is expressed by converting the Euclidean distance into a conditional probability; that is, a probability distribution over the high-dimensional objects is constructed, in which the similarity between different data points is expressed as:
where $p_{j|i}$ denotes the similarity between two data points in the high-dimensional space; $x_i$ and $x_j$ are any two distinct points among the $N$ high-dimensional data points $x_1, x_2, \ldots, x_N$; the parameter $\sigma_i$ is the variance of the Gaussian distribution centered at $x_i$; and $\|\cdot\|$ denotes the two-norm. Because the invention is concerned only with the similarity between two different points, $p_{i|i} = 0$;
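The formula itself did not survive in this text. A reconstruction consistent with the definitions above, assuming the standard SNE conditional probability, would be:

```latex
p_{j|i} = \frac{\exp\!\left(-\lVert x_i - x_j\rVert^{2} / 2\sigma_i^{2}\right)}
               {\sum_{k \neq i} \exp\!\left(-\lVert x_i - x_k\rVert^{2} / 2\sigma_i^{2}\right)},
\qquad p_{i|i} = 0 .
```

The denominator normalizes over all points other than $x_i$, so the values $p_{j|i}$ sum to 1 for each fixed $i$.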
Because the vectors in the high-dimensional space must be mapped to the low-dimensional space, and the probability distribution of each object in the low-dimensional space should be as similar as possible to its distribution in the high-dimensional space, a corresponding probability distribution is constructed in the low-dimensional space, in which the similarity between different data points is expressed as:
where $q_{j|i}$ denotes the similarity between two data points in the low-dimensional space, and $y_i$ and $y_j$ are two-dimensional data points in the low-dimensional space. The Gaussian distribution in the low-dimensional space is assumed to have a fixed variance; for the same reason as above, $q_{i|i} = 0$.
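The low-dimensional formula is also missing from this text. Assuming the standard SNE form, in which the Gaussian width is fixed so that the factor $2\sigma^2$ equals 1 (the SNE paper states this as a variance of $1/\sqrt{2}$), a reconstruction would be:

```latex
q_{j|i} = \frac{\exp\!\left(-\lVert y_i - y_j\rVert^{2}\right)}
               {\sum_{k \neq i} \exp\!\left(-\lVert y_i - y_k\rVert^{2}\right)},
\qquad q_{i|i} = 0 .
```

Note that the published t-SNE algorithm replaces this Gaussian kernel with a Student-t kernel in the low-dimensional space; the text here describes the Gaussian variant.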
The joint probability distributions $P$ and $Q$ of the high-dimensional and low-dimensional spaces are then constructed so that, for any $i$ and $j$, $p_{i|j} = p_{j|i}$ and $q_{i|j} = q_{j|i}$,
where $p_{i,j}$ denotes the joint probability between any two data points in the high-dimensional space and $q_{i,j}$ denotes the joint probability between any two data points in the low-dimensional space.
The KL divergence is used to measure the similarity between the joint probability distributions of the high-dimensional and low-dimensional spaces, giving:
where $C$ denotes the similarity between the joint probability distributions of the high-dimensional and low-dimensional spaces, $P$ denotes the joint probability distribution of the high-dimensional space, and $Q$ denotes the joint probability distribution of the low-dimensional space.
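The cost referred to above is presumably the usual $C = KL(P \,\|\, Q) = \sum_i \sum_j p_{i,j} \log (p_{i,j} / q_{i,j})$. The sketch below computes it on toy data. It assumes the common symmetrization $p_{i,j} = (p_{j|i} + p_{i|j}) / 2N$ for both spaces and a single shared Gaussian width per space; these choices, the function names, and the data are illustrative and not taken from the patent.

```python
import math

def conditional(points, sigma):
    """Row i holds the conditional similarities p_{j|i} (Gaussian centered at point i)."""
    n = len(points)
    rows = []
    for i in range(n):
        weights = [
            0.0 if j == i
            else math.exp(-math.dist(points[i], points[j]) ** 2 / (2 * sigma ** 2))
            for j in range(n)
        ]
        total = sum(weights)
        rows.append([w / total for w in weights])
    return rows

def joint(cond):
    """Symmetrize: p_ij = (p_{j|i} + p_{i|j}) / (2N), so that p_ij = p_ji."""
    n = len(cond)
    return [[(cond[i][j] + cond[j][i]) / (2 * n) for j in range(n)] for i in range(n)]

def kl_divergence(P, Q, eps=1e-300):
    """C = sum_ij p_ij * log(p_ij / q_ij); zero entries of P contribute nothing."""
    total = 0.0
    for row_p, row_q in zip(P, Q):
        for p, q in zip(row_p, row_q):
            if p > 0.0:
                total += p * math.log(p / max(q, eps))
    return total

# Two clusters in 3-D and two candidate 2-D layouts (toy data).
high = [(0, 0, 0), (0, 1, 0), (5, 5, 5), (5, 6, 5)]
low_good = [(0, 0), (0, 1), (3, 3), (3, 4)]   # neighbours stay together
low_bad = [(0, 0), (3, 3), (0, 1), (3, 4)]    # neighbours are split apart

P = joint(conditional(high, sigma=1.0))
kl_good = kl_divergence(P, joint(conditional(low_good, sigma=2 ** -0.5)))
kl_bad = kl_divergence(P, joint(conditional(low_bad, sigma=2 ** -0.5)))
print(kl_good, kl_bad)
```

A layout that keeps high-dimensional neighbours close yields a smaller divergence than one that separates them, which is the property the mapping optimizes for.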
Further, the survival area prediction model for the Chinese medicinal materials in S3 is constructed as follows:
the SVDD model uses a fully connected network to map the preprocessed ecological factor data to a high-dimensional feature space;
an optimal hypersphere is then found in the high-dimensional feature space such that the pseudo-absence sample data lie outside the optimal hypersphere and the sample distribution data in the ecological factor data lie inside it,
where the operator denotes a linear operation (e.g., matrix multiplication), and each layer of the network has its own activation function and its own weights.
The objective function is as follows:
The first term in equation (10) involves the squared radius and a mean over the $n$ samples, under the constraint that the squared distance from each network representation to the center is at most the squared radius plus the slack variable, where $n$ is the number of samples. The second term is an L2 weight-decay regularizer, where $\lambda > 0$ is the weight-decay coefficient, $\xi_i$ is a non-negative slack variable, and $\|\cdot\|_F$ is the Frobenius norm. The objective therefore seeks a minimum-volume hypersphere centered at $c$, shrinking the radius by minimizing the mean deviation of all data representations from the center.
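Equation (10) is missing from this text. One form consistent with the description above, offered only as a plausible reconstruction, is a soft-boundary objective in which $\phi(x_i; W)$ (an assumed symbol) denotes the network representation of sample $x_i$ and $W^{l}$ (also assumed) the weights of layer $l$:

```latex
\min_{R,\,W,\,\xi}\; R^{2} + \frac{1}{n}\sum_{i=1}^{n}\xi_i
  + \frac{\lambda}{2}\sum_{l}\lVert W^{l}\rVert_F^{2}
\quad\text{s.t.}\quad
\lVert \phi(x_i; W) - c\rVert^{2} \le R^{2} + \xi_i,\qquad \xi_i \ge 0 .
```

The published soft-boundary Deep SVDD objective weights the slack term by $1/(\nu n)$ with a trade-off parameter $\nu$; whether the patent uses such a factor, and how it enforces that pseudo-absence samples fall outside the sphere, cannot be determined from the surviving text.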
To minimize equation (10), the Lagrange multipliers $\alpha_i$ and $\beta_i$ are introduced and the Lagrangian is constructed as follows:
s.t. $\alpha_i \ge 0$, $\beta_i \ge 0$
Taking the derivatives with respect to $R$, $c$, and $\xi$ yields:
Combining equation (11) and equation (12) gives:
The radius and center of the optimal hypersphere are given by:
where $c$ is the center point, $n$ is the number of samples, $\alpha_i$ and $\alpha_j$ are Lagrange multipliers, and the remaining symbols denote the fully connected network representation of each sample, the inner product, and a support vector.
Further, in S4, a test point of the Chinese medicinal material to be predicted is selected from the ecological factor data;
the distance between the test point and the center of the optimal hypersphere is calculated:
where $x'$ is the test point, $s(x')$ is the distance between the test point and the center of the optimal hypersphere, and the remaining symbol denotes the hyper-parameters of the SVDD model;
$s(x')$ is compared with the radius of the optimal (minimum-volume) hypersphere: when $s(x')$ is greater than the radius, the test point lies in the non-survival area;
when $s(x')$ is less than or equal to the radius, the test point lies in the survival area;
this operation is repeated for every test point of the Chinese medicinal material to be predicted to obtain its complete survival area.
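The decision rule above can be sketched in a few lines. The sketch assumes the test points have already been mapped into the feature space by the trained network, so `embedded_points`, `center`, and `radius` are hypothetical inputs with made-up values.

```python
import math

def classify_test_points(embedded_points, center, radius):
    """Label each test point by comparing its distance s(x') to the radius."""
    labels = {}
    for name, z in embedded_points.items():
        s = math.dist(z, center)  # s(x'): distance to the hypersphere center
        labels[name] = "survival area" if s <= radius else "non-survival area"
    return labels

# Toy feature-space coordinates for three candidate locations.
embedded_points = {
    "site_a": (0.2, -0.1),
    "site_b": (1.5, 2.0),
    "site_c": (-0.4, 0.3),
}
labels = classify_test_points(embedded_points, center=(0.0, 0.0), radius=1.0)
print(labels)
```

Points exactly on the boundary count as survival area, matching the "less than or equal to" condition in the text.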
A system for predicting the survival area of Chinese medicinal materials based on a deep SVDD model comprises:
an acquisition module, used for collecting ecological factor data of the Chinese medicinal materials and generating pseudo-absence sample data of the Chinese medicinal materials;
the pseudo-absence sample data are points in the non-survival area of the Chinese medicinal materials obtained from a MaxEnt model;
a preprocessing module, used for preprocessing the collected ecological factor data of the Chinese medicinal materials to obtain preprocessed ecological factor data;
a prediction model generation module, used for constructing a survival area prediction model for the Chinese medicinal materials from the preprocessed ecological factor data, the pseudo-absence sample data, and the SVDD model;
and a prediction module, used for predicting the survival area of the Chinese medicinal material to be predicted.
Further, the prediction module operates as follows: a test point of the Chinese medicinal material to be predicted is selected from the ecological factor data;
the distance between the test point and the center of the optimal hypersphere in the survival area prediction model is calculated:
where $x'$ is the test point, $s(x')$ is the distance between the test point and the center of the optimal hypersphere of the survival area prediction model, and the remaining symbol denotes the hyper-parameters of the SVDD model;
$s(x')$ is compared with the radius of the optimal (minimum-volume) hypersphere: when $s(x')$ is greater than the radius, the test point lies in the non-survival area;
when $s(x')$ is less than or equal to the radius, the test point lies in the survival area; this operation is repeated for every test point of the Chinese medicinal material to be predicted to obtain its complete survival area.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. In the method and system for predicting the survival area of Chinese medicinal materials based on the deep SVDD model, the SVDD model is trained on the survival area data of the Chinese medicinal materials and the parameters of the deep SVDD model are optimized with SGD and its variants, so the computational complexity scales linearly with the number of training batches and the approach scales well to large data sets;
2. In the method and system for predicting the survival area of Chinese medicinal materials based on the deep SVDD model, because the SVDD model performs a vector description of all the data, sampling bias and overfitting are avoided; the survival area is determined from the distance between each test point and the optimal hypersphere, which improves the accuracy of the survival area prediction.
Drawings
The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention. In the drawings:
FIG. 1 is a flow diagram of the overall process of the present invention;
FIG. 2 is a schematic diagram of the system of the present invention;
FIG. 3 is a diagram illustrating the SVDD model operation according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to examples and accompanying drawings, and the exemplary embodiments and descriptions thereof are only used for explaining the present invention and are not meant to limit the present invention.
Example 1
As shown in FIG. 1, a method for predicting the survival area of Chinese medicinal materials based on a deep SVDD model includes:
S1: collecting ecological factor data of the Chinese medicinal materials, and generating pseudo-absence sample data of the Chinese medicinal materials with a MaxEnt model;
the pseudo-absence sample data are points in the non-survival area of the Chinese medicinal materials obtained from the MaxEnt model;
S2: preprocessing the collected ecological factor data of the Chinese medicinal materials to obtain preprocessed ecological factor data;
S3: constructing a survival area prediction model for the Chinese medicinal materials from the preprocessed ecological factor data, the pseudo-absence sample data, and the SVDD model;
S4: feeding the test points of the Chinese medicinal material to be predicted into the survival area prediction model for judgment to obtain the survival area of the Chinese medicinal material to be predicted.
The ecological factor data includes sample distribution data, environmental factor data, and map data.
Generating the pseudo-absence sample data in S1 includes:
generating a survival area value result for the Chinese medicinal material with the MaxEnt model;
removing the values greater than or equal to the threshold from the survival area value result to obtain the non-survival area;
and selecting from the non-survival area as many pseudo-absence points as there are presence points in the ecological factor data to obtain the pseudo-absence sample data of the Chinese medicinal material.
The preprocessing in S2 includes:
converting the ecological factor data into high-dimensional word vectors with the word vector model Word2vec;
and mapping the high-dimensional word vectors to two-dimensional vectors with the t-SNE algorithm.
The t-SNE algorithm proceeds as follows:
a probability distribution over the high-dimensional objects is constructed, in which the similarity between different data points is expressed as:
where $p_{j|i}$ denotes the similarity between two data points in the high-dimensional space; $x_i$ and $x_j$ are any two distinct points among the $N$ high-dimensional data points $x_1, x_2, \ldots, x_N$; the parameter $\sigma_i$ is the variance of the Gaussian distribution centered at $x_i$; and $\|\cdot\|$ denotes the two-norm;
a corresponding probability distribution is constructed in the low-dimensional space, in which the similarity between different data points is expressed as:
where $q_{j|i}$ denotes the similarity between two data points in the low-dimensional space, and $y_i$ and $y_j$ are two-dimensional data points in the low-dimensional space;
the joint probability distributions $P$ and $Q$ of the high-dimensional and low-dimensional spaces are then constructed so that, for any $i$ and $j$, $p_{i|j} = p_{j|i}$ and $q_{i|j} = q_{j|i}$,
where $p_{i,j}$ denotes the joint probability between any two data points in the high-dimensional space and $q_{i,j}$ denotes the joint probability between any two data points in the low-dimensional space.
The KL divergence is used to measure the similarity between the joint probability distributions of the high-dimensional and low-dimensional spaces, giving:
where $C$ denotes the similarity between the joint probability distributions of the high-dimensional and low-dimensional spaces, $P$ denotes the joint probability distribution of the high-dimensional space, and $Q$ denotes the joint probability distribution of the low-dimensional space.
The construction process of the Chinese medicinal material survival area prediction model in S3 is as follows:
the SVDD model uses a fully connected network to map the ecological factor preprocessing data to a high-dimensional feature space;
an optimal hypersphere is found in the high-dimensional feature space, such that the pseudo-nonexistent sample data lie outside the optimal hypersphere and the sample distribution data in the ecological factor data lie inside the optimal hypersphere.
The radius and center of the optimal hypersphere are given by:
wherein $c$ represents the center point, $n$ represents the number of samples, $\alpha_i$ and $\alpha_j$ represent Lagrange multipliers, and the remaining terms represent the mapping of the fully connected network, the inner product, and a support vector.
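As a hedged illustration of how a center and radius can be obtained from Lagrange multipliers and inner products, the sketch below assumes the classical SVDD dual form: the center is c = Σ α_i φ(x_i) with Σ α_i = 1, and the squared radius is the squared distance from a support vector to c, expanded into inner products. The array `Z` (rows are network outputs φ(x_i)), the vector `alpha` and the index `s` are hypothetical names, and the formulas are an assumption rather than a reproduction of the source's equations.

```python
import numpy as np

def svdd_center(Z, alpha):
    """Center c = sum_i alpha_i * phi(x_i); Z[i] holds phi(x_i)."""
    return alpha @ Z

def svdd_radius_sq(Z, alpha, s):
    """Squared radius evaluated at the support vector with index s.

    R^2 = <z_s, z_s> - 2 * sum_i alpha_i <z_s, z_i>
          + sum_i sum_j alpha_i alpha_j <z_i, z_j>
    """
    G = Z @ Z.T  # Gram matrix of inner products <phi(x_i), phi(x_j)>
    return float(G[s, s] - 2.0 * alpha @ G[s] + alpha @ G @ alpha)
```

When the multipliers sum to one, the value returned by `svdd_radius_sq` equals the squared Euclidean distance between `Z[s]` and `svdd_center(Z, alpha)`, which is a convenient consistency check.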
In S4, a test point of the traditional Chinese medicinal material to be predicted is selected from any of the ecological factor data;
the distance between the test point and the center point of the optimal hypersphere is calculated:
wherein $x'$ represents the test point, $s(x')$ represents the distance between the test point and the center point of the optimal hypersphere, and the remaining symbol represents the hyper-parameters of the SVDD model;
it is judged whether $s(x')$ is greater than the radius of the minimum-volume (optimal) hypersphere; when $s(x')$ is greater than the radius of the optimal hypersphere, the test point belongs to a non-adaptive area;
when $s(x')$ is less than or equal to the radius of the optimal hypersphere, the test point belongs to a survival area;
the above operation is performed on all test points of the traditional Chinese medicinal material to be predicted to obtain all of its survival areas.
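The decision rule above can be sketched as follows. This is a minimal illustration that assumes $s(x')$ is the Euclidean distance between the mapped test point and the center; if the source's $s(x')$ is a squared distance, the comparison would be made against the squared radius instead. The function names and the arrays `Z_test`, `c` and `radius` are hypothetical.

```python
import numpy as np

def distance_to_center(z, c):
    """s(x'): Euclidean distance between the mapped test point z and the center c."""
    return float(np.linalg.norm(np.asarray(z, dtype=float) - np.asarray(c, dtype=float)))

def classify_points(Z_test, c, radius):
    """Return a boolean array: True = survival area, False = non-adaptive area."""
    diff = np.asarray(Z_test, dtype=float) - np.asarray(c, dtype=float)
    d = np.linalg.norm(diff, axis=1)
    return d <= radius  # points on the sphere surface count as survival area
```

Applying `classify_points` to the mapped test points of every candidate location yields the full set of predicted survival areas in one call.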
As shown in Fig. 2, a Chinese medicinal material survival area prediction system based on a deep SVDD model comprises:
an acquisition module, used for collecting ecological factor data of the traditional Chinese medicinal material and generating pseudo-nonexistent sample data of the traditional Chinese medicinal material;
the pseudo-nonexistent sample data are non-adaptive areas of the traditional Chinese medicinal material obtained through a MaxEnt model;
a preprocessing module, used for preprocessing the collected ecological factor data of the traditional Chinese medicinal material to obtain ecological factor preprocessing data;
a prediction model generation module, used for constructing a Chinese medicinal material survival area prediction model from the ecological factor preprocessing data, the pseudo-nonexistent sample data and the SVDD model;
and a prediction module, used for predicting the survival area of the traditional Chinese medicinal material to be predicted.
Further, the prediction process of the prediction module is as follows: a test point of the traditional Chinese medicinal material to be predicted is selected from any of the ecological factor data;
the distance between the test point and the center point of the optimal hypersphere in the Chinese medicinal material survival area prediction model is calculated:
wherein $x'$ represents the test point, $s(x')$ represents the distance between the test point and the center point of the optimal hypersphere of the Chinese medicinal material survival area prediction model, and the remaining symbol represents the hyper-parameters of the SVDD model;
it is judged whether $s(x')$ is greater than the radius of the minimum-volume (optimal) hypersphere; when $s(x')$ is greater than the radius of the optimal hypersphere, the test point belongs to a non-adaptive area;
when $s(x')$ is less than or equal to the radius of the optimal hypersphere, the test point belongs to a survival area; the above operation is performed on all test points of the traditional Chinese medicinal material to be predicted to obtain all of its survival areas.
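The acquisition module's generation of pseudo-nonexistent samples (discard locations whose MaxEnt value reaches the threshold, then draw as many points from the remaining non-adaptive area as there are presence records) can be sketched as below. This is an illustration only: the function name, the flat `suitability` array standing in for a MaxEnt output, the `coords` array and the fixed random seed are all assumptions, and no MaxEnt software is called.

```python
import numpy as np

def sample_pseudo_absence(suitability, coords, threshold, n_presence, seed=0):
    """Drop cells with suitability >= threshold, then draw n_presence of the rest."""
    suitability = np.asarray(suitability, dtype=float)
    coords = np.asarray(coords)
    candidates = np.flatnonzero(suitability < threshold)  # non-adaptive cells
    if candidates.size < n_presence:
        raise ValueError("not enough non-adaptive cells to sample from")
    rng = np.random.default_rng(seed)
    chosen = rng.choice(candidates, size=n_presence, replace=False)
    return coords[chosen]
```

Sampling without replacement keeps the pseudo-nonexistent points distinct, so the resulting set has the same size as the presence data.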
Example 2
As shown in Fig. 3, on the basis of Example 1 and in view of the increasing demand for Salvia miltiorrhiza, this example takes Salvia miltiorrhiza as the research object and obtains a total of 120 sample distribution records of Salvia miltiorrhiza presence points; a total of 26 environmental factors are selected, as shown in Table 1, comprising 19 climate factors, 3 terrain factors and 4 soil factors; the number of pseudo-nonexistent samples is 120.
The effectiveness of the model is verified using 240 Salvia miltiorrhiza samples, of which the training set and the test set account for 80% and 20%, respectively; the learning rate is set to 0.0001; the number of training epochs is set to 150, where in one epoch the procedure of Example 1 is run so that the entire training set passes through the whole network once; the batch size is set to 20, and the weight decay factor is set to 5e-07.
Using the AUC value as the evaluation index, the AUC value of this example is 0.997, whereas the AUC value of the MaxEnt model is 0.899.
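The arithmetic implied by these settings can be checked with a short sketch. The dictionary keys and helper names are hypothetical, and the sketch assumes that a final partial batch is still processed in each epoch, which the source does not state.

```python
import math

# Values taken from the example; key names are illustrative only.
config = {
    "n_samples": 240,
    "train_fraction": 0.8,
    "learning_rate": 1e-4,
    "epochs": 150,
    "batch_size": 20,
    "weight_decay": 5e-7,
}

def split_sizes(n_samples, train_fraction):
    """Return (train size, test size) for the given split fraction."""
    n_train = round(n_samples * train_fraction)
    return n_train, n_samples - n_train

def batches_per_epoch(n_train, batch_size):
    """Number of batches in one epoch, counting a final partial batch."""
    return math.ceil(n_train / batch_size)

n_train, n_test = split_sizes(config["n_samples"], config["train_fraction"])
steps = batches_per_epoch(n_train, config["batch_size"])
total_updates = steps * config["epochs"]
```

With these numbers the split is 192 training and 48 test samples, giving 10 batches per epoch (nine full batches and one of 12) and 1500 parameter updates over 150 epochs.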
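For reference, AUC can be computed directly from labels and scores as the probability that a randomly chosen presence sample scores higher than a randomly chosen pseudo-nonexistent sample, with ties counted as one half. The sketch below is a generic implementation with a hypothetical function name; it is not the evaluation code behind the figures reported above.

```python
import numpy as np

def auc_score(labels, scores):
    """Pairwise AUC: fraction of (positive, negative) pairs ranked correctly."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("both classes are required to compute AUC")
    diff = pos[:, None] - neg[None, :]  # every positive against every negative
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return float(wins / (pos.size * neg.size))
```

For a hypersphere model, one reasonable choice of score is the negative distance to the center, so that points deeper inside the sphere rank as more suitable.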
TABLE 1 ecological environmental factors and distribution list of Chinese medicinal materials
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.
Claims (7)
1. A Chinese medicinal material survival area prediction method based on a deep SVDD model, characterized by comprising the following steps:
S1: collecting ecological factor data of the traditional Chinese medicinal material, and generating pseudo-nonexistent sample data of the traditional Chinese medicinal material using a MaxEnt model;
the pseudo-nonexistent sample data are non-adaptive areas of the traditional Chinese medicinal material obtained through the MaxEnt model;
S2: preprocessing the collected ecological factor data of the traditional Chinese medicinal material to obtain ecological factor preprocessing data;
S3: constructing a Chinese medicinal material survival area prediction model from the ecological factor preprocessing data, the pseudo-nonexistent sample data and the SVDD model;
the construction process of the Chinese medicinal material survival area prediction model in S3 is as follows:
the SVDD model uses a fully connected network to map the ecological factor preprocessing data to a high-dimensional feature space;
an optimal hypersphere is found in the high-dimensional feature space, such that the pseudo-nonexistent sample data lie outside the optimal hypersphere and the sample distribution data in the ecological factor data lie inside the optimal hypersphere;
the radius and center of the optimal hypersphere are given by:
wherein $c$ represents the center point, $n$ represents the number of samples, $\alpha_i$ and $\alpha_j$ represent Lagrange multipliers, and the remaining terms represent the mapping of the fully connected network, the inner products, and a support vector;
S4: inputting the test points of the traditional Chinese medicinal material to be predicted into the Chinese medicinal material survival area prediction model for judgment, to obtain the survival area of the traditional Chinese medicinal material to be predicted.
2. The method as claimed in claim 1, wherein the ecological factor data includes sample distribution data, environmental factor data and map data.
3. The Chinese medicinal material survival area prediction method based on the deep SVDD model as claimed in claim 1, wherein generating the pseudo-nonexistent sample data in S1 comprises:
generating a survival area value result for the traditional Chinese medicinal material using the MaxEnt model;
removing the values greater than or equal to a threshold from the survival area value result of the traditional Chinese medicinal material to obtain the non-adaptive area;
and selecting from the non-adaptive area the same number of pseudo-nonexistent points as there are ecological factor data of the traditional Chinese medicinal material, to obtain the pseudo-nonexistent sample data of the traditional Chinese medicinal material.
4. The Chinese medicinal material survival area prediction method based on the deep SVDD model as claimed in claim 1, wherein the preprocessing in S2 comprises:
converting the ecological factor data into high-dimensional word vectors using the word vector model Word2vec;
and mapping the high-dimensional word vectors into two-dimensional word vectors using the t-SNE algorithm.
5. The Chinese medicinal material survival area prediction method based on the deep SVDD model as claimed in claim 4, wherein the t-SNE algorithm comprises:
constructing a probability distribution over the high-dimensional objects, wherein the similarity between different data is expressed as:
$$p_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)}$$
wherein $p_{j|i}$ represents the similarity between different data in the high-dimensional space, $x_i$ and $x_j$ are any two different data among the N-dimensional data $x_1, x_2, \ldots, x_N$, the parameter $\sigma_i$ represents the variance of the Gaussian distribution centered on $x_i$, $\|\cdot\|$ represents the two-norm operation, and $x_k$ represents the datum with subscript $k$ among the N-dimensional data $x_1, x_2, \ldots, x_N$;
and constructing a probability distribution for the high-dimensional objects in the low-dimensional space, wherein the similarity between different data is expressed as:
wherein $q_{j|i}$ represents the similarity between different data in the low-dimensional space, $y_i$ and $y_j$ represent the two-dimensional data $y_1, y_2$ in the low-dimensional space, and $y_k$ represents the two-dimensional datum with subscript $k$ in the low-dimensional space;
constructing the joint probability distributions P and Q of the high-dimensional and low-dimensional spaces, respectively, such that for any i and j, $p_{i|j} = p_{j|i}$ and $q_{i|j} = q_{j|i}$;
wherein $p_{i,j}$ represents the joint probability between any two data in the high-dimensional space, $q_{i,j}$ represents the joint probability between any two data in the low-dimensional space, and $y_l$ represents the two-dimensional datum with subscript $l$ in the low-dimensional space;
and measuring the similarity between the joint probability distributions of the high-dimensional and low-dimensional spaces using the KL divergence:
$$C = KL(P \,\|\, Q) = \sum_i \sum_j p_{i,j} \log \frac{p_{i,j}}{q_{i,j}}$$
wherein C represents the similarity of the joint probability distributions of the high-dimensional space and the low-dimensional space, P represents the joint probability of the high-dimensional space, and Q represents the joint probability of the low-dimensional space.
6. The Chinese medicinal material survival area prediction method based on the deep SVDD model as claimed in claim 1, wherein in S4 a test point of the traditional Chinese medicinal material to be predicted is selected from any of the ecological factor data;
the distance between the test point and the center point of the optimal hypersphere is calculated:
wherein $x'$ represents the test point, $s(x')$ represents the distance between the test point and the center point of the optimal hypersphere, and the remaining symbol represents the hyper-parameters of the SVDD model;
it is judged whether $s(x')$ is greater than the radius of the minimum-volume (optimal) hypersphere; when $s(x')$ is greater than the radius of the optimal hypersphere, the test point belongs to a non-adaptive area;
when $s(x')$ is less than or equal to the radius of the optimal hypersphere, the test point belongs to a survival area;
and the above operation is performed on all test points of the traditional Chinese medicinal material to be predicted to obtain all of its survival areas.
7. A Chinese medicinal material survival area prediction system based on a deep SVDD model, characterized by comprising:
an acquisition module, used for collecting ecological factor data of the traditional Chinese medicinal material and generating pseudo-nonexistent sample data of the traditional Chinese medicinal material;
the pseudo-nonexistent sample data are non-adaptive areas of the traditional Chinese medicinal material obtained through a MaxEnt model;
a preprocessing module, used for preprocessing the collected ecological factor data of the traditional Chinese medicinal material to obtain ecological factor preprocessing data;
a prediction model generation module, used for constructing a Chinese medicinal material survival area prediction model from the ecological factor preprocessing data, the pseudo-nonexistent sample data and the SVDD model;
a prediction module, used for predicting the survival area of the traditional Chinese medicinal material to be predicted;
the prediction process of the prediction module is as follows: a test point of the traditional Chinese medicinal material to be predicted is selected from any of the ecological factor data;
the distance between the test point and the center point of the optimal hypersphere in the Chinese medicinal material survival area prediction model is calculated:
wherein $x'$ represents the test point, $s(x')$ represents the distance between the test point and the center point of the optimal hypersphere of the Chinese medicinal material survival area prediction model, and the remaining symbol represents the hyper-parameters of the SVDD model;
it is judged whether $s(x')$ is greater than the radius of the minimum-volume (optimal) hypersphere; when $s(x')$ is greater than the radius of the optimal hypersphere, the test point belongs to a non-adaptive area;
when $s(x')$ is less than or equal to the radius of the optimal hypersphere, the test point belongs to a survival area; and the above operation is performed on all test points of the traditional Chinese medicinal material to be predicted to obtain all of its survival areas.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010537578.6A CN111680843B (en) | 2020-06-12 | 2020-06-12 | Chinese medicinal material survival area prediction method and system based on depth SVDD model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111680843A (en) | 2020-09-18 |
CN111680843B (en) | 2022-06-28 |
Family
ID=72435523
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010537578.6A Active CN111680843B (en) | 2020-06-12 | 2020-06-12 | Chinese medicinal material survival area prediction method and system based on depth SVDD model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111680843B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113095674A (en) * | 2021-04-12 | 2021-07-09 | 云南省林业调查规划院 | Analysis method for potential habitat of Yunnan key protection wild plant based on MaxEnt and GIS |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101398417A (en) * | 2008-10-29 | 2009-04-01 | 中国药科大学 | Universal method for rapid detection and structural identification for Chinese traditional medicine complex component |
CN102521480A (en) * | 2011-11-15 | 2012-06-27 | 中国医学科学院药用植物研究所 | Method for selecting new producing area of Chinese medical herb |
CN103345588A (en) * | 2013-07-18 | 2013-10-09 | 成都中医药大学 | Method for calculating number of wild traditional Chinese medicine potential resources |
CN106372460A (en) * | 2016-08-24 | 2017-02-01 | 成都旅美科技有限公司 | Environment analysis-based biological distribution determination apparatus |
CN106845699A (en) * | 2017-01-05 | 2017-06-13 | 南昌大学 | A kind of method for predicting oil tea normal region |
CN106961973A (en) * | 2017-03-30 | 2017-07-21 | 杨友仁 | The method that pulse family Chinese medicine is sowed on a large scale is realized using intelligent bulb technology |
CN107403057A (en) * | 2016-05-20 | 2017-11-28 | 中国中医科学院中药研究所 | A kind of Chinese medicine Quality Regionalization model based on maximum informational entropy and improved independence weight coefficient |
CN110222343A (en) * | 2019-06-13 | 2019-09-10 | 电子科技大学 | A kind of Chinese medicine plant resource name entity recognition method |
CN110348060A (en) * | 2019-06-13 | 2019-10-18 | 中国测绘科学研究院 | A kind of snow leopard Habitat suitability evaluation method and device |
CN111178631A (en) * | 2019-12-30 | 2020-05-19 | 广州地理研究所 | Method and system for predicting water lettuce invasion distribution area |
Non-Patent Citations (5)
Title |
---|
Predicting the Potential Distribution Patterns of the Rare Plant Gymnocarpos Przewalskii Under Present and Future Climate Change;Ma Songmei等;《2011 International Conference on Consumer Electronics, Communications and Networks (CECNet)》;20110516;1513-1515 * |
基于GIS的中药材产地适宜性分析系统的设计与实现;孙成忠 等;《世界科学技术-中医药现代化》;20060331;第8卷(第3期);112-117 * |
基于MaxEnt和GIS技术的桔梗适宜性分布区划研究;董光 等;《中药材》;20190131;第42卷(第1期);66-70 * |
基于Maxent模型对党参害虫烟草甲在中国的适生区预测分析;侯沁文等;《长治学院学报》;20200415(第02期);176-183 * |
基于生态因子的山东太子参生态适宜区划研究;边丽华等;《山东农业科学》;20180228(第02期);68-75 * |
Also Published As
Publication number | Publication date |
---|---|
CN111680843A (en) | 2020-09-18 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||