CN111680843B - Prediction method and system of Chinese herbal medicine suitable area based on deep SVDD model - Google Patents

Prediction method and system of Chinese herbal medicine suitable area based on deep SVDD model

Info

Publication number
CN111680843B
CN111680843B CN202010537578.6A CN202010537578A CN111680843B CN 111680843 B CN111680843 B CN 111680843B CN 202010537578 A CN202010537578 A CN 202010537578A CN 111680843 B CN111680843 B CN 111680843B
Authority
CN
China
Prior art keywords
data
chinese medicinal
medicinal materials
model
dimensional space
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010537578.6A
Other languages
Chinese (zh)
Other versions
CN111680843A (en)
Inventor
李巧勤
蔡茁
刘勇国
杨尚明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN202010537578.6A
Publication of CN111680843A
Application granted
Publication of CN111680843B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2135 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
    • G06F18/21355 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis nonlinear criteria, e.g. embedding a manifold in a Euclidean space
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/02 Agriculture; Fishing; Forestry; Mining

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Strategic Management (AREA)
  • Human Resources & Organizations (AREA)
  • Economics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Business, Economics & Management (AREA)
  • Tourism & Hospitality (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Marketing (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Marine Sciences & Fisheries (AREA)
  • Development Economics (AREA)
  • Mining & Mineral Resources (AREA)
  • Evolutionary Computation (AREA)
  • Animal Husbandry (AREA)
  • Primary Health Care (AREA)
  • Agronomy & Crop Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Game Theory and Decision Science (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Machine Translation (AREA)
  • Medicines Containing Plant Substances (AREA)

Abstract

The invention discloses a method and system for predicting suitable areas for Chinese medicinal materials based on a deep SVDD model, comprising the following steps: collecting ecological factor data of the Chinese medicinal material, and generating pseudo-absence sample data for it using a MaxEnt model; preprocessing the collected ecological factor data to obtain preprocessed ecological factor data; constructing a suitable-area prediction model from the preprocessed ecological factor data, the pseudo-absence sample data and the SVDD model; and feeding the test points of the medicinal material to be predicted into the prediction model to obtain its suitable area. The parameters of the deep SVDD model are optimized with SGD and its variants; because the computational complexity of the model scales linearly with the number of training batches, the method scales well to large datasets. The suitable area is obtained by comparing the distance between each test point and the optimal hypersphere, which improves the accuracy of the suitable-area prediction.

Description

Method and system for predicting suitable growth areas of Chinese medicinal materials based on a deep SVDD model

Technical field

The invention relates to the field of development and utilization of traditional Chinese medicine resources, and in particular to a method and system for predicting suitable growth areas of Chinese medicinal materials based on a deep SVDD model.

Background

Developing and using traditional Chinese medicine resources while protecting their sustainable development is very important for research on these resources in China, because they face many problems. For example, blindly expanding cultivation areas not only seriously affects the quality and producing regions of Chinese medicinal materials, but also causes the active ingredients of introduced medicinal materials to differ markedly from the standards of the Chinese Pharmacopoeia, which seriously restricts the sustainable development of Chinese medicinal materials. To expand the introduction areas more scientifically, it is necessary to strengthen research on the ecological suitability of medicinal materials, identify the ecological factors that shape them, such as light, temperature, moisture, terrain and soil, and strengthen introduction, cultivation and zoning management, so as to make full and rational use of environmental resources, protect traditional Chinese medicine resources and achieve their sustainable development.

At present, most research on predicting the distribution of suitable areas for medicinal materials uses the maximum entropy model together with existing distribution records and ecological environment data to predict the distribution pattern of suitable areas and how it changes. One prior-art approach combines the MaxEnt niche model with GIS technology: based on distribution data from 214 Platycodon grandiflorum sample points, it analyzes the contribution rate of each ecological factor with the jackknife method and explores the main ecological factors affecting the growth of Platycodon grandiflorum and the characteristics of its suitable habitat, thereby zoning the areas suitable for its growth nationwide, with the prediction accuracy index AUC (Area Under Curve) reaching 0.922.

However, the MaxEnt model is a complex machine learning algorithm that is sensitive to sampling bias and prone to overfitting; its transferability is good only at low thresholds; and running the MaxEnt model with default parameters affects the accuracy of the prediction results.

Summary of the invention

The technical problem to be solved by the present invention is that existing methods for predicting suitable areas for Chinese medicinal materials use the MaxEnt model, which leads to sampling bias, is prone to overfitting, and, when run with default parameters, affects the accuracy of the prediction results. The purpose is to provide a method and system for predicting suitable areas for Chinese medicinal materials based on a deep SVDD model that solve these problems.

The present invention is achieved through the following technical solutions:

A method for predicting suitable areas for Chinese medicinal materials based on a deep SVDD model, comprising:

S1: collecting ecological factor data of the Chinese medicinal material, and generating pseudo-absence point sample data for it using the MaxEnt model;

The pseudo-absence point sample data are the unsuitable areas for the Chinese medicinal material obtained with the MaxEnt model;

S2: preprocessing the collected ecological factor data to obtain preprocessed ecological factor data;

S3: constructing a suitable-area prediction model from the preprocessed ecological factor data, the pseudo-absence point sample data and the SVDD model;

S4: feeding the test points of the Chinese medicinal material to be predicted into the suitable-area prediction model for judgment, to obtain the suitable area of the medicinal material to be predicted.

The present invention proposes a suitable-area prediction model for Chinese medicinal materials based on deep support vector data description (deep SVDD). Because the ecological environment factors come in different data formats, the data must be preprocessed, that is, converted to a unified form using the t-SNE algorithm. The deep support vector data description model then maps the converted data nonlinearly into a high-dimensional feature space and finds the optimal hypersphere in that space. The parameters of the deep SVDD model are optimized with SGD and its variants; because the computational complexity scales linearly with the number of training batches, the method scales well to large datasets.

In long-standing research on Chinese medicinal material resources, ecological factor data for Chinese medicinal materials have been stored and managed as big data at the national level, and these data are authoritative and authentic. The present invention therefore obtains sample distribution data for Chinese herbal medicines from the Chinese Herbarium and the National Specimen Platform. Because the sample presence points are widely distributed and contain duplicates, valid data must be screened by data cleaning and similar methods to obtain the longitude and latitude of each sample point. Environmental factor data are collected by consulting the literature on the environmental factors of the relevant medicinal materials; the environmental factor data include climate, terrain and soil factors. Longitude and latitude are mapped through the National Fundamental Geographic Information System website to obtain the distribution of ecological environments across the regions of China.
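The cleaning step described above can be sketched as follows. This is a minimal illustration, not the patent's procedure: the record layout (dictionaries with `lon` and `lat` keys), the function name `clean_sample_points`, and the four-decimal rounding used to detect duplicates are all assumptions made for the example.

```python
def clean_sample_points(records, ndigits=4):
    """Keep one record per coordinate pair and drop unusable coordinates.

    `records` is a list of dicts with hypothetical "lon" and "lat" keys.
    `ndigits` is an assumed tolerance for treating two points as duplicates.
    """
    seen = set()
    cleaned = []
    for rec in records:
        lon, lat = rec.get("lon"), rec.get("lat")
        # Skip records with missing coordinates.
        if lon is None or lat is None:
            continue
        # Skip coordinates outside the valid longitude/latitude ranges.
        if not (-180.0 <= lon <= 180.0 and -90.0 <= lat <= 90.0):
            continue
        key = (round(lon, ndigits), round(lat, ndigits))
        # Skip exact (rounded) duplicates of a point already kept.
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(rec)
    return cleaned


# Illustrative, made-up coordinates.
raw = [
    {"lon": 104.1, "lat": 30.6},
    {"lon": 104.1, "lat": 30.6},   # duplicate
    {"lon": None, "lat": 30.0},    # missing longitude
    {"lon": 200.0, "lat": 10.0},   # out of range
    {"lon": 116.4, "lat": 39.9},
]
points = clean_sample_points(raw)
```

Rounding before comparison is one simple way to merge records that differ only by transcription noise; a real pipeline might instead deduplicate per grid cell.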

To make the model more reliable, it should be trained on both true presence data and true absence data, that is, samples from places where a given medicinal material certainly does not grow. Because no true absence data are available, the present invention constructs pseudo-absence data using a MaxEnt model optimized by parameter tuning.

Because the input data are multi-source and heterogeneous (for example, soil texture is text, whereas temperature and precipitation are numeric), the present invention first uses the Word2vec word-vector model to convert the words in the text into word vectors, giving a feature representation of the text data. Because processing in the high-dimensional word-vector space is inefficient, the t-SNE algorithm is then used to map the high-dimensional word-vector space into a two-dimensional space, so that two words with similar meanings remain close after the mapping and words with distant meanings remain far apart.

Further, the main ecological factor data used to judge the suitable area of a Chinese medicinal material include sample distribution data, environmental factor data and map data.

Further, the step of generating the pseudo-absence point sample data in S1 includes:

using the MaxEnt model to generate numerical suitability results for the Chinese medicinal material;

The MaxEnt model outputs a value between 0 and 1 for each grid cell, which can be regarded as a pixel of the map. This value is the cell's suitability index: the higher the value, the more suitable the cell is for the medicinal material. The present invention treats grid cells whose suitability index is at or above a certain threshold as the suitable area; once selected, the coordinates of the suitable area are removed from the map, leaving only the unsuitable area.

removing the values greater than or equal to the threshold from the numerical suitability results to obtain the unsuitable area;

Because the model performs best when the numbers of presence points and pseudo-absence points are equal, the present invention selects from the unsuitable area the same number of pseudo-absence points as there are ecological factor data for the medicinal material, giving the pseudo-absence point sample data.
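The thresholding and sampling steps above can be illustrated with a short sketch. It assumes the MaxEnt output is already available as a mapping from grid-cell coordinates to suitability values; the function name, the default threshold of 0.5 and the fixed random seed are hypothetical choices, since the text only speaks of "a certain threshold".

```python
import random


def sample_pseudo_absences(suitability, n_presence, threshold=0.5, seed=0):
    """Draw pseudo-absence cells from the unsuitable part of a suitability grid.

    `suitability` maps (lon, lat) grid cells to a MaxEnt-style score in [0, 1].
    Cells scoring at or above `threshold` are treated as suitable and removed.
    `threshold` and `seed` are assumed values for illustration only.
    """
    # Keep only cells strictly below the threshold (the unsuitable area).
    unsuitable = sorted(c for c, score in suitability.items() if score < threshold)
    if len(unsuitable) < n_presence:
        raise ValueError("fewer unsuitable cells than presence points")
    # Sample as many pseudo-absence points as there are presence points.
    rng = random.Random(seed)
    return rng.sample(unsuitable, n_presence)


# Made-up grid with four cells; two fall below the threshold.
grid = {
    (100.0, 30.0): 0.9,
    (101.0, 30.0): 0.2,
    (102.0, 30.0): 0.5,
    (103.0, 30.0): 0.1,
}
pseudo_absences = sample_pseudo_absences(grid, n_presence=2)
```

Sampling without replacement keeps the pseudo-absence points distinct, and matching `n_presence` follows the equal-count rule stated in the text.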

Further, the preprocessing in S2 includes:

converting the ecological factor data into high-dimensional word vectors using the Word2vec word-vector model;

mapping the high-dimensional word vectors to two-dimensional word vectors using the t-SNE algorithm.

Further, in the t-SNE algorithm:

So that similar objects are selected with higher probability and dissimilar objects with lower probability, the similarity between objects is expressed by converting Euclidean distances into conditional probabilities, that is, by constructing a probability distribution over pairs of high-dimensional objects. The similarity between different data points is:

Figure BDA0002537544200000031

where p_{j|i} denotes the similarity between different data points in the high-dimensional space; x_i and x_j are any two distinct points among the data x_1, x_2, …, x_N; the parameter σ_i denotes the variance of the Gaussian distribution centered at x_i; and ||·|| denotes the two-norm. The present invention is concerned only with the similarity between distinct points, so p_{i|i} = 0 is set;
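The equation image is not reproduced in this text, so the sketch below falls back on the standard SNE conditional probability, p_{j|i} = exp(-||x_i - x_j||^2 / (2 σ_i^2)) / Σ_{k≠i} exp(-||x_i - x_k||^2 / (2 σ_i^2)), which matches the definitions given here (Gaussian centered at x_i, two-norm, p_{i|i} = 0). Whether the patent's formula is identical is an assumption, and the function name and fixed σ_i values are hypothetical; in practice σ_i is usually found by a perplexity search.

```python
import math


def conditional_probabilities(points, sigmas):
    """Return P where P[i][j] is p_{j|i} under the standard SNE formula.

    `points` is a list of equal-length coordinate tuples.
    `sigmas[i]` is the Gaussian bandwidth for point i (assumed given here).
    """
    n = len(points)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        weights = []
        for k in range(n):
            if k == i:
                # p_{i|i} is defined to be zero.
                weights.append(0.0)
                continue
            d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[k]))
            weights.append(math.exp(-d2 / (2.0 * sigmas[i] ** 2)))
        total = sum(weights)
        for j in range(n):
            P[i][j] = weights[j] / total
    return P


# Three made-up points on a line; the middle one is close to the first.
P = conditional_probabilities(
    [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)],
    [1.0, 1.0, 1.0],
)
```

Each row of `P` sums to one, and nearer neighbours receive more probability mass, which is the behaviour the paragraph above describes.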

Because the vectors of the high-dimensional space must be mapped to a low-dimensional space, and the probability distribution of the same objects in the low-dimensional space should be as similar as possible to that in the high-dimensional space, a probability distribution must also be constructed for these objects in the low-dimensional space. The similarity between different data points is:

Figure BDA0002537544200000032

where q_{j|i} denotes the similarity between different data points in the low-dimensional space, and y_i and y_j denote the two-dimensional data y_1, y_2 in the low-dimensional space; the Gaussian distribution is assumed to have variance [Figure BDA0002537544200000033]; likewise, q_{i|i} = 0.

The joint probability distributions P and Q of the high-dimensional and low-dimensional spaces are constructed such that, for any i and j, p_{i,j} = p_{j,i} and q_{i,j} = q_{j,i}:

Figure BDA0002537544200000034

Figure BDA0002537544200000035

where p_{i,j} denotes the joint probability of any two data points in the high-dimensional space, and q_{i,j} denotes the joint probability of any two data points in the low-dimensional space.

The KL divergence is used to measure the similarity between the joint probability distributions of the high-dimensional and low-dimensional spaces, giving:

Figure BDA0002537544200000041

where C denotes the similarity between the joint probability distributions of the high-dimensional and low-dimensional spaces, P denotes the joint probability distribution of the high-dimensional space, and Q denotes that of the low-dimensional space.
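As an illustration of the cost C, the sketch below evaluates the usual KL divergence between two joint distributions, C = Σ_i Σ_j p_{i,j} log(p_{i,j} / q_{i,j}). The equation image is not reproduced here, so this standard form is an assumption; the function name and the small floor `eps` that guards against division by zero are hypothetical.

```python
import math


def kl_divergence(P, Q, eps=1e-12):
    """Sum p_ij * log(p_ij / q_ij) over all entries with p_ij > 0.

    `P` and `Q` are square matrices (lists of lists) of joint probabilities.
    `eps` is an assumed floor that keeps the logarithm finite when q_ij is 0.
    """
    total = 0.0
    for p_row, q_row in zip(P, Q):
        for p, q in zip(p_row, q_row):
            if p > 0.0:
                total += p * math.log(p / max(q, eps))
    return total


# Two made-up symmetric-support joint distributions over two points.
P_joint = [[0.0, 0.5], [0.5, 0.0]]
Q_joint = [[0.0, 0.3], [0.7, 0.0]]
cost_same = kl_divergence(P_joint, P_joint)
cost_diff = kl_divergence(P_joint, Q_joint)
```

The cost is zero when the two distributions coincide and positive otherwise, so minimizing it pulls the low-dimensional layout toward the high-dimensional neighbourhood structure.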

Further, the suitable-area prediction model for Chinese medicinal materials in S3 is constructed as follows:

The SVDD model uses a fully connected network [Figure BDA0002537544200000042] to map the preprocessed ecological factor data into a high-dimensional feature space;

The optimal hypersphere is found in the high-dimensional feature space, such that the pseudo-absence point sample data lie outside the surface of the optimal hypersphere and the sample distribution data among the ecological factor data lie inside it.

Further, suppose [Figure BDA0002537544200000043]; for any input [Figure BDA00025375442000000416], the output of layer [Figure BDA0002537544200000044] is:

Figure BDA0002537544200000045

where · denotes a linear operation (for example, matrix multiplication), [Figure BDA0002537544200000046] is the activation function of layer [Figure BDA0002537544200000047], and [Figure BDA0002537544200000048] is the weight of layer [Figure BDA0002537544200000049].
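A layer-by-layer forward pass of the kind described (a linear operation followed by a per-layer activation) might look like the sketch below. The layer formula itself is an image that is not reproduced here, so the bias-free form z_l = σ_l(W_l · z_{l-1}) is an assumption, as are the function names and the toy weights.

```python
def relu(v):
    """Rectified linear activation for a single value."""
    return v if v > 0.0 else 0.0


def identity(v):
    """Pass-through activation, used here for the output layer."""
    return v


def forward(x, weights, activations):
    """Apply each layer as activation(W . z), starting from the input x.

    `weights` is a list of matrices (lists of rows), one per layer.
    `activations` is a list of scalar functions, one per layer.
    No bias terms are used; this is an assumption of the sketch.
    """
    z = list(x)
    for W, act in zip(weights, activations):
        z = [act(sum(w * v for w, v in zip(row, z))) for row in W]
    return z


# Toy two-layer network: 2 inputs -> 3 hidden units -> 1 output.
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W2 = [[1.0, 1.0, 1.0]]
representation = forward([1.0, -2.0], [W1, W2], [relu, identity])
```

The returned vector is the network representation of the input, which is the quantity later compared against the hypersphere center.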

The objective function is as follows:

Figure BDA00025375442000000410

Figure BDA00025375442000000415

In formula (10), the first term is the mean of the summed squared radius terms, subject to the constraint that the squared distance from each network representation [Figure BDA00025375442000000414] to the center [Figure BDA00025375442000000411] is less than the sum of the squared radius and the slack variable, where n is the number of samples. The second term is a weight decay that uses L2 regularization, where λ is the weight-decay coefficient with λ > 0, ξ_i is a slack variable satisfying [Figure BDA00025375442000000412], and ||·||_F is the Frobenius norm. The objective can therefore be seen as finding a minimum-volume hypersphere centered at c, shrinking the sphere radius by minimizing the mean deviation of all data representations from the center.
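The sketch below evaluates one possible reading of such an objective: R^2 plus the mean slack plus an L2 weight decay, with each slack set to the smallest value that satisfies the distance constraint, ξ_i = max(0, ||z_i - c||^2 - R^2). The exact formula (10) is an image that is not reproduced here, so the way the terms are combined, the trade-off constant `C`, the factor 1/2 on the weight decay and the function name are all assumptions.

```python
def soft_boundary_objective(reps, center, radius, weights, lam, C=1.0):
    """Evaluate R^2 + (C / n) * sum(xi_i) + (lam / 2) * sum ||W||_F^2.

    `reps` are network representations, `center` is c, `radius` is R.
    `weights` is a list of weight matrices (lists of rows).
    `C` and the 1/2 factor are assumed, not taken from the patent.
    """
    n = len(reps)
    slack_sum = 0.0
    for z in reps:
        d2 = sum((a - b) ** 2 for a, b in zip(z, center))
        # Smallest slack satisfying: d2 <= R^2 + xi, xi >= 0.
        slack_sum += max(0.0, d2 - radius ** 2)
    # Squared Frobenius norm summed over all layers.
    decay = sum(w ** 2 for W in weights for row in W for w in row)
    return radius ** 2 + C * slack_sum / n + 0.5 * lam * decay


# Toy values: one point inside the sphere, one outside.
value = soft_boundary_objective(
    reps=[(0.0, 0.0), (2.0, 0.0)],
    center=(0.0, 0.0),
    radius=1.0,
    weights=[[[1.0, 2.0]]],
    lam=0.1,
)
```

Points inside the sphere contribute no slack, so only representations that fall outside the current radius push the objective up.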

To minimize formula (10), Lagrange multipliers α_i and β_i are introduced and the Lagrangian function is constructed as follows:

Figure BDA00025375442000000413

s.t. α_i ≥ 0, β_i ≥ 0

Taking the derivatives with respect to R, c and ξ gives:

Figure BDA0002537544200000051

Combining formula (11) and formula (12) gives:

Figure BDA0002537544200000052

The radius and center of the optimal hypersphere are then given by:

Figure BDA0002537544200000053

Figure BDA0002537544200000054

where c denotes the center point, and [Figure BDA0002537544200000055]; n is the number of samples; [Figure BDA00025375442000000511] denotes each connected network; α_i and α_j are Lagrange multipliers; [Figure BDA0002537544200000058] denotes the inner product; [Figure BDA0002537544200000059] denotes a support vector, and [Figure BDA00025375442000000510]
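For orientation, the classical SVDD dual yields a center and radius of the following form, which is consistent with the symbols listed above (Lagrange multipliers, inner products, a support vector). The patent's own formulas are images that are not reproduced here, so this is the textbook result written under assumed notation: φ(x_i) for the network representation of sample x_i and x_s for a support vector.

```latex
% Classical SVDD center: a weighted combination of the mapped samples.
c = \sum_{i=1}^{n} \alpha_i \, \phi(x_i), \qquad \sum_{i=1}^{n} \alpha_i = 1

% Squared radius: squared distance from the center to a support vector x_s.
R^2 = \lVert \phi(x_s) - c \rVert^2
    = \langle \phi(x_s), \phi(x_s) \rangle
      - 2 \sum_{i=1}^{n} \alpha_i \, \langle \phi(x_i), \phi(x_s) \rangle
      + \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j \, \langle \phi(x_i), \phi(x_j) \rangle
```

The second line follows by expanding the squared norm and substituting the expression for c.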

Further, in S4, a test point of the Chinese medicinal material to be predicted is selected from any of the ecological factor data;

The distance between the test point and the center of the optimal hypersphere is calculated:

Figure BDA0002537544200000057

where x′ denotes the test point, s(x′) denotes the distance between the test point and the center of the optimal hypersphere, and [Figure BDA0002537544200000056] denotes the hyperparameters of the SVDD model;

It is determined whether s(x′) is greater than the radius of the minimum-volume hypersphere: when s(x′) is greater than the radius of the optimal hypersphere, the test point lies in an unsuitable area;

When s(x′) is less than or equal to the radius of the optimal hypersphere, the test point lies in a suitable area;

Performing the above operations on all test points of the Chinese medicinal material to be predicted gives all of its suitable areas.
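The decision rule above reduces to comparing a distance with a radius, as in this minimal sketch. It assumes the test points have already been passed through the trained network, so `reps` holds their representations; the function name and the label strings are hypothetical.

```python
import math


def classify_test_points(reps, center, radius):
    """Label each representation by its distance s to the hypersphere center.

    s <= radius  -> "suitable"   (inside or on the hypersphere)
    s >  radius  -> "unsuitable" (outside the hypersphere)
    """
    labels = []
    for z in reps:
        s = math.sqrt(sum((a - b) ** 2 for a, b in zip(z, center)))
        labels.append("suitable" if s <= radius else "unsuitable")
    return labels


# Made-up representations: at the center, on the boundary, and far outside.
labels = classify_test_points(
    reps=[(0.0, 0.0), (1.0, 0.0), (3.0, 4.0)],
    center=(0.0, 0.0),
    radius=1.0,
)
```

A point exactly on the boundary counts as suitable, matching the "less than or equal to" condition in the text.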

A system for predicting suitable areas for Chinese medicinal materials based on a deep SVDD model, comprising:

an acquisition module, configured to collect ecological factor data of the Chinese medicinal material and generate pseudo-absence point sample data for it;

The pseudo-absence point sample data are the unsuitable areas for the Chinese medicinal material obtained with the MaxEnt model;

a preprocessing module, configured to preprocess the collected ecological factor data to obtain preprocessed ecological factor data;

a prediction model generation module, configured to construct the suitable-area prediction model from the preprocessed ecological factor data, the pseudo-absence point sample data and the SVDD model;

a prediction module, configured to predict and obtain the suitable area of the Chinese medicinal material to be predicted.

Further, the prediction process of the prediction module is: selecting a test point of the Chinese medicinal material to be predicted from any of the ecological factor data;

calculating the distance between the test point and the center of the optimal hypersphere in the suitable-area prediction model:

Figure BDA0002537544200000062

where x′ denotes the test point, s(x′) denotes the distance between the test point and the center of the optimal hypersphere of the suitable-area prediction model, and [Figure BDA0002537544200000061] denotes the hyperparameters of the SVDD model;

determining whether s(x′) is greater than the radius of the minimum-volume hypersphere: when s(x′) is greater than the radius of the optimal hypersphere, the test point lies in an unsuitable area;

when s(x′) is less than or equal to the radius of the optimal hypersphere, the test point lies in a suitable area; performing the above operations on all test points of the Chinese medicinal material to be predicted gives all of its suitable areas.

Compared with the prior art, the present invention has the following advantages and beneficial effects:

1. The method and system of the present invention train the suitable-area model with the SVDD model and optimize the parameters of the deep SVDD model with SGD and its variants; because the computational complexity scales linearly with the number of training batches, the approach scales well to large datasets;

2. Because the SVDD model describes all data with vectors, the method and system of the present invention do not produce sampling bias or overfitting; the suitable area is obtained by comparing the distance between each test point and the optimal hypersphere, which improves the accuracy of the suitable-area prediction.

Description of drawings

The accompanying drawings described here provide a further understanding of the embodiments of the present invention and form part of the present application; they do not limit the embodiments of the present invention. In the drawings:

Fig. 1 is a flow chart of the overall method of the present invention;

Fig. 2 is a schematic diagram of the system structure of the present invention;

Fig. 3 is a schematic diagram of the operation of the SVDD model of the present invention.

具体实施方式Detailed ways

为使本发明的目的、技术方案和优点更加清楚明白,下面结合实施例和附图,对本发明作进一步的详细说明,本发明的示意性实施方式及其说明仅用于解释本发明,并不作为对本发明的限定。In order to make the purpose, technical solutions and advantages of the present invention clearer, the present invention will be further described in detail below with reference to the embodiments and the accompanying drawings. as a limitation of the present invention.

实施例1Example 1

As shown in Fig. 1, a method for predicting suitable areas for Chinese medicinal materials based on a deep SVDD model comprises:

S1: collecting ecological factor data of the Chinese medicinal material, and generating pseudo-absence point sample data for the Chinese medicinal material with the MaxEnt model;

the pseudo-absence point sample data represent the unsuitable area of the Chinese medicinal material as obtained with the MaxEnt model;

S2: preprocessing the collected ecological factor data to obtain preprocessed ecological factor data;

S3: constructing a suitable-area prediction model for the Chinese medicinal material from the preprocessed ecological factor data, the pseudo-absence point sample data and the SVDD model;

S4: feeding the test points of the Chinese medicinal material to be predicted into the suitable-area prediction model for judgment, to obtain the suitable area of the Chinese medicinal material to be predicted.

The ecological factor data comprise sample distribution data, environmental factor data and map data.

The step of generating the pseudo-absence point sample data in S1 comprises:

generating a numerical suitability result for the Chinese medicinal material with the MaxEnt model;

removing from that numerical result the values greater than or equal to a threshold, to obtain the unsuitable area;

selecting from the unsuitable area the same number of pseudo-absence points as there are ecological factor data records for the Chinese medicinal material, to obtain the pseudo-absence point sample data.
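The three steps above can be sketched in a few lines. This is a minimal illustration rather than the patented implementation: the function name, the 1-D array of per-cell suitability scores, the example threshold of 0.2 and the use of uniform random sampling without replacement are all assumptions, since the text only says that points are "selected" from the unsuitable area.

```python
import numpy as np

def sample_pseudo_absences(suitability, threshold, n_points, seed=0):
    """Return indices of grid cells drawn from the unsuitable area.

    suitability: 1-D array of MaxEnt suitability scores, one per grid cell.
    threshold:   cells scoring >= threshold are discarded as suitable.
    n_points:    number of pseudo-absence points to draw
                 (the number of presence records).
    Uniform sampling without replacement is an assumption of this sketch.
    """
    suitability = np.asarray(suitability, dtype=float)
    # Keep only cells strictly below the threshold (the unsuitable area).
    unsuitable = np.flatnonzero(suitability < threshold)
    if unsuitable.size < n_points:
        raise ValueError("not enough unsuitable cells to sample from")
    rng = np.random.default_rng(seed)
    return rng.choice(unsuitable, size=n_points, replace=False)

# Hypothetical usage: 1000 grid cells, 120 presence records.
scores = np.linspace(0.0, 1.0, 1000)
pseudo_absence_idx = sample_pseudo_absences(scores, threshold=0.2, n_points=120)
```

Drawing exactly as many pseudo-absence points as presence records keeps the two groups balanced, which matches the 120 / 120 split reported in Embodiment 2.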

The preprocessing in S2 comprises:

converting the ecological factor data into high-dimensional word vectors with the Word2vec word-vector model;

mapping the high-dimensional word vectors to two-dimensional word vectors with the t-SNE algorithm.

The t-SNE algorithm:

constructing a probability distribution over pairs of high-dimensional objects, where the similarity between different data is expressed as:

Figure BDA0002537544200000071

where p_{j|i} denotes the similarity between different data in the high-dimensional space; x_i and x_j are any two distinct data among the N-dimensional data x_1, x_2, …, x_N; the parameter σ_i denotes the variance of the Gaussian distribution centered on x_i; and ‖·‖ denotes the 2-norm;

constructing a probability distribution for the high-dimensional objects in the low-dimensional space, where the similarity between different data is expressed as:

Figure BDA0002537544200000072

where q_{j|i} denotes the similarity between different data in the low-dimensional space, and y_i and y_j denote the two-dimensional data y_1, y_2 in the low-dimensional space;

constructing the joint probability distributions P and Q of the high-dimensional space and the low-dimensional space respectively, such that for any i and j, p_{ij} = p_{ji} and q_{ij} = q_{ji}:

Figure BDA0002537544200000081

Figure BDA0002537544200000082

where p_{ij} denotes the joint probability between any two data in the high-dimensional space, and q_{ij} denotes the joint probability between any two data in the low-dimensional space.
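The formula images themselves are not reproduced in this text. For orientation, the symbols defined above match the standard t-SNE definitions, which are written out here as an assumption about what the figures contain and not as a transcription of them. In the standard formulation σ_i² plays the role of the variance, and the indices k and l run over all data points.

```latex
% Standard t-SNE definitions (assumed to correspond to the four formula images)
p_{j|i} = \frac{\exp\!\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\!\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}

q_{j|i} = \frac{\exp\!\left(-\lVert y_i - y_j \rVert^2\right)}
               {\sum_{k \neq i} \exp\!\left(-\lVert y_i - y_k \rVert^2\right)}

p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}

q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}
```

With these definitions both joint distributions are symmetric, p_{ij} = p_{ji} and q_{ij} = q_{ji}, and each sums to 1 over all pairs.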

The similarity of the joint probability distributions of the high-dimensional and low-dimensional spaces is measured with the KL divergence, giving:

Figure BDA0002537544200000083

where C denotes the similarity of the joint probability distributions of the high-dimensional and low-dimensional spaces, P denotes the joint probability of the high-dimensional space, and Q denotes the joint probability of the low-dimensional space.
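To make the cost concrete, the following NumPy sketch evaluates the joint distributions and the KL cost for a given embedding, using the standard t-SNE form C = Σ_i Σ_j p_{ij} log(p_{ij} / q_{ij}). It is an illustration under stated assumptions: the function names are hypothetical, a single fixed `sigma` replaces the per-point σ_i that t-SNE normally finds by a perplexity search, and no gradient step is shown.

```python
import numpy as np

def squared_distances(points):
    """Pairwise squared Euclidean distances between the rows of `points`."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sum(diff ** 2, axis=-1)

def joint_p(x, sigma=1.0):
    """Symmetrized high-dimensional affinities p_ij (fixed sigma is an assumption)."""
    n = x.shape[0]
    logits = -squared_distances(x) / (2.0 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)         # exclude k == i from each row
    cond = np.exp(logits)
    cond /= cond.sum(axis=1, keepdims=True)   # row i now holds p_{j|i}
    return (cond + cond.T) / (2.0 * n)

def joint_q(y):
    """Low-dimensional affinities q_ij with a Student-t kernel."""
    kernel = 1.0 / (1.0 + squared_distances(y))
    np.fill_diagonal(kernel, 0.0)             # exclude pairs with k == l
    return kernel / kernel.sum()

def kl_cost(p, q, eps=1e-12):
    """C = sum_ij p_ij * log(p_ij / q_ij), skipping zero entries of P."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / (q[mask] + eps))))
```

t-SNE moves the low-dimensional points y_i to reduce this cost, so a smaller C means the two-dimensional layout preserves the pairwise similarities of the high-dimensional word vectors more faithfully.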

The construction process of the suitable-area prediction model in S3:

the SVDD model uses a fully connected network [Figure BDA0002537544200000084] to map the preprocessed ecological factor data into a high-dimensional feature space;

the optimal hypersphere is found in the high-dimensional feature space, such that the pseudo-absence point sample data lie outside the surface of the optimal hypersphere and the sample distribution data in the ecological factor data lie inside the optimal hypersphere.

The radius and center of the optimal hypersphere are given by:

Figure BDA0002537544200000085

Figure BDA0002537544200000086

where c denotes the center, with [Figure BDA0002537544200000087]; n denotes the number of samples; [Figure BDA00025375442000000811] denotes each connected network; α_i and α_j denote Lagrange multipliers; [Figure BDA0002537544200000088] denotes the inner product; and [Figure BDA0002537544200000089] denotes a support vector, with [Figure BDA00025375442000000810].
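The formula images for the radius and center are not available in this text. The quantities named above (Lagrange multipliers, inner products of mapped samples, a support vector) are the ingredients of the classic SVDD dual solution, in which the center is c = Σ_i α_i φ(x_i) and the squared radius is the squared distance from c to any boundary support vector. The sketch below computes both under that assumption; the function name and arguments are hypothetical, and obtaining the multipliers themselves (by solving the dual problem or by SGD training of the network) is outside its scope.

```python
import numpy as np

def svdd_center_and_radius(features, alphas, support_index):
    """Center and radius of the hypersphere from classic SVDD dual quantities.

    features:      (n, d) array; row i is the mapped sample phi(x_i).
    alphas:        (n,) Lagrange multipliers, assumed non-negative and summing to 1.
    support_index: index k of a support vector lying on the sphere boundary.
    """
    features = np.asarray(features, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    gram = features @ features.T          # inner products <phi(x_i), phi(x_j)>
    center = alphas @ features            # c = sum_i alpha_i * phi(x_i)
    k = support_index
    # R^2 = <phi_k, phi_k> - 2 * sum_i alpha_i <phi_i, phi_k>
    #       + sum_ij alpha_i alpha_j <phi_i, phi_j>
    radius_sq = gram[k, k] - 2.0 * (alphas @ gram[:, k]) + alphas @ gram @ alphas
    return center, float(np.sqrt(max(radius_sq, 0.0)))

# Toy check: two mapped samples at (1, 0) and (-1, 0) with equal multipliers.
center, radius = svdd_center_and_radius([[1.0, 0.0], [-1.0, 0.0]], [0.5, 0.5], 0)
```

Writing R² purely in terms of inner products is what lets the same expression be used with any feature map, including the output of the fully connected network.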

In S4, any test point of the Chinese medicinal material to be predicted is selected from the ecological factor data;

the distance between the test point and the center of the optimal hypersphere is calculated:

Figure BDA0002537544200000093

where x′ denotes the test point, s(x′) denotes the distance between the test point and the center of the optimal hypersphere, and [Figure BDA0002537544200000094] denotes the hyperparameters of the SVDD model;

it is determined whether s(x′) exceeds the radius of the minimum-volume hypersphere: when s(x′) is greater than the radius of the optimal hypersphere, the test point belongs to the unsuitable area;

when s(x′) is less than or equal to the radius of the optimal hypersphere, the test point belongs to the suitable area;

the above operations are performed on all test points of the Chinese medicinal material to be predicted, to obtain all suitable areas of that material.
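The decision rule of S4 reduces to a distance comparison once the test points have been mapped by the trained network. The sketch below assumes that s(x′) is the Euclidean distance in feature space and that the mapped points are already available as an array; if the formula image defines s(x′) as a squared distance, the comparison should be made against the squared radius instead. The function name is hypothetical.

```python
import numpy as np

def classify_test_points(mapped_points, center, radius):
    """Label each mapped test point as suitable (True) or unsuitable (False).

    mapped_points: (m, d) array of network outputs for the test points.
    center:        (d,) center of the optimal hypersphere.
    radius:        radius of the optimal hypersphere.
    Returns the boolean labels and the distances s(x').
    """
    mapped_points = np.atleast_2d(np.asarray(mapped_points, dtype=float))
    center = np.asarray(center, dtype=float)
    distances = np.linalg.norm(mapped_points - center, axis=1)
    # Points on or inside the sphere are suitable; points outside are not.
    return distances <= radius, distances

# Toy usage with a unit sphere centered at the origin.
labels, dists = classify_test_points([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]],
                                     center=[0.0, 0.0], radius=1.0)
```

Collecting the test points whose label is true yields the full set of suitable areas described in the last step.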

As shown in Fig. 2, a system for predicting suitable areas for Chinese medicinal materials based on a deep SVDD model comprises:

an acquisition module, configured to collect ecological factor data of the Chinese medicinal material and to generate pseudo-absence point sample data for the Chinese medicinal material;

the pseudo-absence point sample data represent the unsuitable area of the Chinese medicinal material as obtained with the MaxEnt model;

a preprocessing module, configured to preprocess the collected ecological factor data to obtain preprocessed ecological factor data;

a prediction model generation module, configured to construct the suitable-area prediction model from the preprocessed ecological factor data, the pseudo-absence point sample data and the SVDD model;

a prediction module, configured to predict and obtain the suitable area of the Chinese medicinal material to be predicted.
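One way to picture how the four modules fit together is as a small pipeline in which each module is a replaceable callable. This is only a structural sketch: the class name, the field names and the stub functions in the usage example are hypothetical and stand in for the real acquisition, preprocessing and model-building logic.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

@dataclass
class SuitableAreaSystem:
    """Chains the four modules; every callable is a hypothetical stand-in."""
    # Acquisition module: returns (ecological factor data, pseudo-absence samples).
    collect: Callable[[], Tuple[Any, Any]]
    # Preprocessing module: turns raw ecological factor data into model input.
    preprocess: Callable[[Any], Any]
    # Model generation module: returns a predicate "is this point suitable?".
    build_model: Callable[[Any, Any], Callable[[Any], bool]]

    def run(self, test_points: List[Any]) -> List[Any]:
        """Prediction module: keep the test points judged suitable."""
        data, pseudo_absences = self.collect()
        model = self.build_model(self.preprocess(data), pseudo_absences)
        return [point for point in test_points if model(point)]

# Toy usage with stub modules.
system = SuitableAreaSystem(
    collect=lambda: ([1.0, 2.0, 3.0], [10.0]),
    preprocess=lambda data: [value * 2 for value in data],
    build_model=lambda data, absences: (lambda point: point <= max(data)),
)
suitable = system.run([1.0, 6.0, 7.0])
```

Keeping the modules behind plain callables mirrors the separation in Fig. 2 and lets each stage be tested on its own.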

Further, the prediction process of the prediction module is as follows: any test point of the Chinese medicinal material to be predicted is selected from the ecological factor data;

the distance between the test point and the center of the optimal hypersphere in the suitable-area prediction model is calculated:

Figure BDA0002537544200000092

where x′ denotes the test point, s(x′) denotes the distance between the test point and the center of the optimal hypersphere of the suitable-area prediction model, and [Figure BDA0002537544200000091] denotes the hyperparameters of the SVDD model;

it is determined whether s(x′) exceeds the radius of the minimum-volume hypersphere: when s(x′) is greater than the radius of the optimal hypersphere, the test point belongs to the unsuitable area;

when s(x′) is less than or equal to the radius of the optimal hypersphere, the test point belongs to the suitable area; the above operations are performed on all test points of the Chinese medicinal material to be predicted, to obtain all suitable areas of that material.

Embodiment 2

As shown in Fig. 3, building on Embodiment 1 and in view of the steadily rising demand for Salvia miltiorrhiza (danshen), the present invention takes Salvia miltiorrhiza as the study object. A total of 120 presence-point sample distribution records were obtained for Salvia miltiorrhiza, and 26 environmental factors were selected, as listed in Table 1 (ecological environment factors and distribution of Chinese medicinal materials): 19 climate factors, 3 terrain factors and 4 soil factors. The pseudo-absence point sample data comprise 120 records.

The validity of the model was verified with the 240 Salvia miltiorrhiza sample records, split 80% / 20% into a training set and a test set. The learning rate was set to 0.0001 and the number of training epochs to 150; within one epoch, the procedure of Embodiment 1 is applied to the above data so that the whole training set passes through the entire network once. The batch size was set to 20 and the weight decay coefficient to 5e-07.
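The reported settings can be collected in a small configuration object, which also makes the implied quantities explicit. The class and property names are hypothetical; the numeric defaults are the values stated in this embodiment, and the derived counts follow from them by arithmetic.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainingConfig:
    """Training settings reported for the Salvia miltiorrhiza experiment."""
    n_samples: int = 240          # 120 presence + 120 pseudo-absence records
    train_fraction: float = 0.8   # 80% training, 20% test
    learning_rate: float = 1e-4
    epochs: int = 150
    batch_size: int = 20
    weight_decay: float = 5e-7

    @property
    def n_train(self) -> int:
        return round(self.n_samples * self.train_fraction)

    @property
    def n_test(self) -> int:
        return self.n_samples - self.n_train

    @property
    def batches_per_epoch(self) -> int:
        # The last batch of an epoch may be smaller than batch_size.
        return math.ceil(self.n_train / self.batch_size)

    @property
    def total_steps(self) -> int:
        return self.batches_per_epoch * self.epochs

config = TrainingConfig()
```

With these values the training set holds 192 records and the test set 48, so each epoch consists of 10 mini-batches (nine of 20 records and one of 12) and the full run performs 1500 parameter updates.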

With the AUC value as the evaluation metric, this embodiment achieves an AUC of 0.997, compared with an AUC of 0.899 for the MaxEnt model.
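For reference, AUC can be computed directly from its rank interpretation: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below is a generic illustration and not the evaluation code behind the reported figures; the function name is hypothetical, and for an SVDD model a natural score (an assumption here) is the negative distance to the hypersphere center, so that points closer to the center rank higher.

```python
import numpy as np

def auc_score(labels, scores):
    """AUC as P(score of a positive > score of a negative); ties count 1/2."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("need at least one positive and one negative sample")
    # Compare every positive score with every negative score.
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((wins + 0.5 * ties) / (pos.size * neg.size))

# Toy usage: presence points labelled 1, pseudo-absence points labelled 0.
example_auc = auc_score([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

In the toy example three of the four positive/negative pairs are ordered correctly, giving an AUC of 0.75; a value of 1.0 would mean every presence point outranks every pseudo-absence point.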

Table 1. Ecological environment factors and distribution of Chinese medicinal materials

Figure BDA0002537544200000101

The specific embodiments described above explain the objectives, technical solutions and beneficial effects of the present invention in further detail. It should be understood that they are only specific embodiments of the present invention and are not intended to limit its scope of protection; any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (7)

1. A method for predicting suitable areas for Chinese medicinal materials based on a deep SVDD model, characterized by comprising:
S1: collecting ecological factor data of the Chinese medicinal material, and generating pseudo-absence point sample data for the Chinese medicinal material with the MaxEnt model;
the pseudo-absence point sample data represent the unsuitable area of the Chinese medicinal material as obtained with the MaxEnt model;
S2: preprocessing the collected ecological factor data to obtain preprocessed ecological factor data;
S3: constructing a suitable-area prediction model for the Chinese medicinal material from the preprocessed ecological factor data, the pseudo-absence point sample data and the SVDD model;
the construction process of the suitable-area prediction model in S3:
the SVDD model uses a fully connected network [Figure FDA0003558342190000011] to map the preprocessed ecological factor data into a high-dimensional feature space;
the optimal hypersphere is found in the high-dimensional feature space, such that the pseudo-absence point sample data lie outside the surface of the optimal hypersphere and the sample distribution data in the ecological factor data lie inside the optimal hypersphere;
the radius and center of the optimal hypersphere are given by:
Figure FDA0003558342190000012
Figure FDA0003558342190000013
where c denotes the center, with [Figure FDA0003558342190000014]; n denotes the number of samples; [Figure FDA0003558342190000015] denotes each connected network; α_i and α_j denote Lagrange multipliers; [Figure FDA0003558342190000016] denotes the inner product; [Figure FDA0003558342190000017] and [Figure FDA0003558342190000018] both denote inner products; and [Figure FDA0003558342190000019] denotes a support vector, with [Figure FDA00035583421900000110];
S4: feeding the test points of the Chinese medicinal material to be predicted into the suitable-area prediction model for judgment, to obtain the suitable area of the Chinese medicinal material to be predicted.
2. The method for predicting suitable areas for Chinese medicinal materials based on a deep SVDD model according to claim 1, characterized in that the ecological factor data comprise sample distribution data, environmental factor data and map data.
3. The method for predicting suitable areas for Chinese medicinal materials based on a deep SVDD model according to claim 1, characterized in that the step of generating the pseudo-absence point sample data in S1 comprises:
generating a numerical suitability result for the Chinese medicinal material with the MaxEnt model;
removing from that numerical result the values greater than or equal to a threshold, to obtain the unsuitable area;
selecting from the unsuitable area the same number of pseudo-absence points as there are ecological factor data records for the Chinese medicinal material, to obtain the pseudo-absence point sample data.
4. The method for predicting suitable areas for Chinese medicinal materials based on a deep SVDD model according to claim 1, characterized in that the preprocessing in S2 comprises:
converting the ecological factor data into high-dimensional word vectors with the Word2vec word-vector model;
mapping the high-dimensional word vectors to two-dimensional word vectors with the t-SNE algorithm.
5. The method for predicting suitable areas for Chinese medicinal materials based on a deep SVDD model according to claim 4, characterized in that the t-SNE algorithm comprises:
constructing a probability distribution over pairs of high-dimensional objects, where the similarity between different data is expressed as:
Figure FDA0003558342190000021
where p_{j|i} denotes the similarity between different data in the high-dimensional space; x_i and x_j are any two distinct data among the N-dimensional data x_1, x_2, …, x_N; the parameter σ_i denotes the variance of the Gaussian distribution centered on x_i; ‖·‖ denotes the 2-norm; and x_k denotes the datum with subscript k among the N-dimensional data x_1, x_2, …, x_N;
constructing a probability distribution for the high-dimensional objects in the low-dimensional space, where the similarity between different data is expressed as:
Figure FDA0003558342190000022
where q_{j|i} denotes the similarity between different data in the low-dimensional space; y_i and y_j denote the two-dimensional data y_1, y_2 in the low-dimensional space; and y_k denotes the two-dimensional datum with subscript k in the low-dimensional space;
constructing the joint probability distributions P and Q of the high-dimensional space and the low-dimensional space respectively, such that for any i and j, p_{ij} = p_{ji} and q_{ij} = q_{ji};
Figure FDA0003558342190000023
Figure FDA0003558342190000024
where p_{ij} denotes the joint probability between any two data in the high-dimensional space, q_{ij} denotes the joint probability between any two data in the low-dimensional space, and y_l denotes the two-dimensional datum with subscript l in the low-dimensional space;
measuring the similarity of the joint probability distributions of the high-dimensional and low-dimensional spaces with the KL divergence, giving:
Figure FDA0003558342190000025
where C denotes the similarity of the joint probability distributions of the high-dimensional and low-dimensional spaces, P denotes the joint probability of the high-dimensional space, and Q denotes the joint probability of the low-dimensional space.
6. The method for predicting suitable areas for Chinese medicinal materials based on a deep SVDD model according to claim 1, characterized in that in S4 any test point of the Chinese medicinal material to be predicted is selected from the ecological factor data;
the distance between the test point and the center of the optimal hypersphere is calculated:
Figure FDA0003558342190000031
where x′ denotes the test point, s(x′) denotes the distance between the test point and the center of the optimal hypersphere, and [Figure FDA0003558342190000032] denotes the hyperparameters of the SVDD model;
it is determined whether s(x′) exceeds the radius of the minimum-volume hypersphere: when s(x′) is greater than the radius of the optimal hypersphere, the test point belongs to the unsuitable area;
when s(x′) is less than or equal to the radius of the optimal hypersphere, the test point belongs to the suitable area;
the above operations are performed on all test points of the Chinese medicinal material to be predicted, to obtain all suitable areas of that material.
7. A system for predicting suitable areas for Chinese medicinal materials based on a deep SVDD model, characterized by comprising:
an acquisition module, configured to collect ecological factor data of the Chinese medicinal material and to generate pseudo-absence point sample data for the Chinese medicinal material;
the pseudo-absence point sample data represent the unsuitable area of the Chinese medicinal material as obtained with the MaxEnt model;
a preprocessing module, configured to preprocess the collected ecological factor data to obtain preprocessed ecological factor data;
a prediction model generation module, configured to construct the suitable-area prediction model from the preprocessed ecological factor data, the pseudo-absence point sample data and the SVDD model;
a prediction module, configured to predict and obtain the suitable area of the Chinese medicinal material to be predicted;
the prediction process of the prediction module: any test point of the Chinese medicinal material to be predicted is selected from the ecological factor data;
the distance between the test point and the center of the optimal hypersphere in the suitable-area prediction model is calculated:
Figure FDA0003558342190000033
where x′ denotes the test point, s(x′) denotes the distance between the test point and the center of the optimal hypersphere of the suitable-area prediction model, and [Figure FDA0003558342190000034] denotes the hyperparameters of the SVDD model;
it is determined whether s(x′) exceeds the radius of the minimum-volume hypersphere: when s(x′) is greater than the radius of the optimal hypersphere, the test point belongs to the unsuitable area;
when s(x′) is less than or equal to the radius of the optimal hypersphere, the test point belongs to the suitable area; the above operations are performed on all test points of the Chinese medicinal material to be predicted, to obtain all suitable areas of that material.
CN202010537578.6A 2020-06-12 2020-06-12 Prediction method and system of Chinese herbal medicine suitable area based on deep SVDD model Active CN111680843B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010537578.6A CN111680843B (en) 2020-06-12 2020-06-12 Prediction method and system of Chinese herbal medicine suitable area based on deep SVDD model

Publications (2)

Publication Number Publication Date
CN111680843A CN111680843A (en) 2020-09-18
CN111680843B true CN111680843B (en) 2022-06-28

Family

ID=72435523

Country Status (1)

Country Link
CN (1) CN111680843B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant