CN111680843B

CN111680843B - Prediction method and system of Chinese herbal medicine suitable area based on deep SVDD model

Info

Publication number: CN111680843B
Application number: CN202010537578.6A
Authority: CN
Inventors: 李巧勤; 蔡茁; 刘勇国; 杨尚明
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2020-06-12
Filing date: 2020-06-12
Publication date: 2022-06-28
Anticipated expiration: 2040-06-12
Also published as: CN111680843A

Abstract

The invention discloses a method and system for predicting the suitable growth area of Chinese medicinal materials based on a deep SVDD model. The method comprises: collecting ecological factor data of a Chinese medicinal material and generating pseudo-absence sample data for it with a MaxEnt model; preprocessing the collected ecological factor data to obtain preprocessed ecological factor data; constructing a suitable-area prediction model from the preprocessed data, the pseudo-absence sample data and the SVDD model; and feeding the test points of the medicinal material to be predicted into the prediction model to obtain its suitable area. The parameters of the deep SVDD model are optimized with SGD and its variants, whose computational complexity scales linearly with the amount of training data, so the method scales well to large data sets. The suitable area is determined from the distance between each test point and the optimal hypersphere, which improves the accuracy of the suitable-area prediction.

Description

Prediction method and system of suitable growth area of Chinese medicinal materials based on deep SVDD model

Technical field

The invention relates to the development and utilization of traditional Chinese medicine resources, and in particular to a method and system for predicting the suitable growth area of Chinese medicinal materials based on a deep SVDD model.

Background

The development, utilization and sustainable protection of traditional Chinese medicine resources are central to research on these resources in China, which face many problems. For example, blindly expanding cultivation areas not only seriously affects the quality and producing regions of Chinese medicinal materials, but also causes the active ingredients of introduced medicinal materials to deviate markedly from the standards of the Chinese Pharmacopoeia, which severely restricts the sustainable development of Chinese medicinal materials. To expand introduction areas more scientifically, it is necessary to strengthen research on the ecological suitability of medicinal materials, identify the ecological factors that shape them (such as light, temperature, moisture, terrain and soil), and strengthen introduction, cultivation and zoning management, so as to use environmental resources fully and rationally, protect traditional Chinese medicine resources and achieve sustainable development.

At present, most studies on predicting the distribution of suitable areas for medicinal materials use the maximum entropy model together with existing distribution records and ecological environment data to predict the distribution pattern of suitable areas and how it changes. One prior-art approach combines the MaxEnt niche model with GIS technology: based on distribution data for 214 Platycodon grandiflorum sample points, the contribution of each ecological factor is analyzed with the jackknife method, and the main ecological factors and habitat characteristics affecting the growth of Platycodon grandiflorum are explored, so as to zone its suitable growth regions nationwide. The prediction accuracy metric AUC (Area Under Curve) reached 0.922.

However, the MaxEnt model is a complex machine learning algorithm that is sensitive to sampling bias and prone to overfitting; its transferability is good only at low thresholds; and running it with default parameters affects the accuracy of the prediction results.

Summary of the invention

The technical problem to be solved by the invention is that existing methods for predicting the suitable area of Chinese medicinal materials rely on the MaxEnt model, which suffers from sampling bias, is prone to overfitting, and loses prediction accuracy when run with default parameters. The invention aims to provide a method and system for predicting the suitable area of Chinese medicinal materials based on a deep SVDD model that solve these problems.

The invention is achieved through the following technical solution:

A method for predicting the suitable area of Chinese medicinal materials based on a deep SVDD model, comprising:

S1: collecting ecological factor data of a Chinese medicinal material, and generating pseudo-absence sample data for it with a MaxEnt model;

the pseudo-absence sample data being drawn from the unsuitable area of the medicinal material obtained from the MaxEnt model;

S2: preprocessing the collected ecological factor data to obtain preprocessed ecological factor data;

S3: constructing a suitable-area prediction model from the preprocessed ecological factor data, the pseudo-absence sample data and the SVDD model;

S4: feeding the test points of the medicinal material to be predicted into the suitable-area prediction model for judgment, to obtain the suitable area of the medicinal material to be predicted.

The invention proposes a suitable-area prediction model for Chinese medicinal materials based on deep support vector data description (deep SVDD). Because the ecological environment factors come in different data formats, the data must first be preprocessed, that is, converted to a uniform representation using the t-SNE algorithm. The deep SVDD model then maps the converted data into a high-dimensional feature space through a nonlinear mapping and finds the optimal hypersphere in that space. The parameters of the deep SVDD model are optimized with SGD and its variants; because their computational complexity scales linearly with the number of training batches, the approach scales well to large data sets.

In long-standing research on Chinese medicinal material resources, ecological factor data are stored and managed at national level as big data, and are authoritative and reliable. The invention therefore collects sample distribution data for Chinese herbal medicines from the Chinese herbarium and the national specimen platform. Because the presence points are widely distributed and contain duplicates, valid records are selected through data cleaning and similar methods to obtain the longitude and latitude of each sample point. Environmental factor data, comprising climate, terrain and soil factors, are collected by consulting the literature on environmental factors of the relevant medicinal materials. Longitude and latitude are mapped through the National Fundamental Geographic Information System website to obtain the distribution of ecological environments across China.

To make the model more reliable, it must be trained on both true presence samples and absence samples (locations where a given medicinal material definitely does not grow). Because no true absence data are available, the invention constructs pseudo-absence data with a MaxEnt model optimized by parameter tuning.

Because the input data are multi-source and heterogeneous (for example, soil texture is text, whereas temperature and precipitation are numeric), the invention first uses the Word2vec word vector model to convert the words in the text into word vectors, obtaining a feature representation of the text data. Because processing in the high-dimensional word vector space is inefficient, the t-SNE algorithm maps the high-dimensional word vector space to a two-dimensional space, so that two words with similar meanings remain close after mapping and words with distant meanings remain far apart.

Further, the main ecological factor data used to judge the suitable area of a Chinese medicinal material include sample distribution data, environmental factor data and map data.

Further, the step of generating the pseudo-absence sample data in S1 comprises:

using the MaxEnt model to generate numerical suitability results for the medicinal material;

the MaxEnt output lies between 0 and 1 and represents the suitability index of each grid cell, which can be regarded as a pixel of the map; the higher the value, the more suitable the cell is for the medicinal material. The invention treats grid cells whose suitability index is at or above a given threshold as the suitable area; once the suitable area has been selected, its coordinates are removed from the map, leaving only the unsuitable area;

removing from the suitability results the values greater than or equal to the threshold to obtain the unsuitable area;

because the model performs best when the numbers of presence points and pseudo-absence points are equal, selecting from the unsuitable area the same number of pseudo-absence points as there are presence points in the ecological factor data, to obtain the pseudo-absence sample data of the medicinal material.
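The pseudo-absence step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the suitability grid is represented as a dictionary keyed by (longitude, latitude), the threshold value 0.5 is a placeholder (the text does not state the threshold used), and `sample_pseudo_absences` is a hypothetical helper name.

```python
import random

def sample_pseudo_absences(suitability, n_presence, threshold=0.5, seed=0):
    """Pick pseudo-absence cells from a MaxEnt-style suitability grid.

    suitability: dict mapping (lon, lat) -> suitability index in [0, 1]
    n_presence:  number of presence points to match
    threshold:   cells with index >= threshold count as suitable (assumed value)
    """
    # Drop the suitable cells; what remains is the unsuitable area.
    unsuitable = sorted(cell for cell, s in suitability.items() if s < threshold)
    if len(unsuitable) < n_presence:
        raise ValueError("not enough unsuitable cells to match the presence points")
    # Draw as many pseudo-absence points as there are presence points.
    return random.Random(seed).sample(unsuitable, n_presence)

# Toy 4x4 grid with a made-up suitability surface
grid = {(100 + i, 30 + j): (i + j) / 6 for i in range(4) for j in range(4)}
pseudo_absences = sample_pseudo_absences(grid, n_presence=3, threshold=0.5)
```

Sampling uniformly at random from the unsuitable cells is one simple choice; the text only requires that the count match the number of presence points.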

Further, the preprocessing in S2 comprises:

using the Word2vec word vector model to convert the ecological factor data into word vectors in a high-dimensional space;

using the t-SNE algorithm to map the high-dimensional word vectors to two-dimensional word vectors.

Further, the t-SNE algorithm proceeds as follows.

So that similar objects are selected with higher probability and dissimilar objects with lower probability, the similarity between objects is expressed by converting Euclidean distances into conditional probabilities, that is, by constructing a probability distribution over pairs of high-dimensional objects. The similarity between different data points is:

$$p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j\rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i}\exp\left(-\lVert x_i - x_k\rVert^2 / 2\sigma_i^2\right)}$$

where $p_{j|i}$ is the similarity between different data points in the high-dimensional space, $x_i$ and $x_j$ are any two distinct points among the $N$ high-dimensional data points $x_1, x_2, \ldots, x_N$, the parameter $\sigma_i$ is the variance of the Gaussian distribution centered on $x_i$, and $\lVert\cdot\rVert$ is the two-norm. The invention is only concerned with the similarity between two different points, so $p_{i|i} = 0$.

Because the high-dimensional vectors must be mapped to a low-dimensional space, and the probability distribution of the corresponding objects in the low-dimensional space should be as similar as possible to that in the high-dimensional space, a probability distribution is also constructed in the low-dimensional space. The similarity between different data points is:

$$q_{j|i} = \frac{\exp\left(-\lVert y_i - y_j\rVert^2\right)}{\sum_{k \neq i}\exp\left(-\lVert y_i - y_k\rVert^2\right)}$$

where $q_{j|i}$ is the similarity between different data points in the low-dimensional space, and $y_i$ and $y_j$ are the two-dimensional points corresponding to $x_i$ and $x_j$; the Gaussian distribution is assumed to have variance $1/\sqrt{2}$. Similarly, $q_{i|i} = 0$.

The joint probability distributions $P$ and $Q$ of the high-dimensional and low-dimensional spaces are then constructed so that, for any $i$ and $j$, $p_{ij} = p_{ji}$ and $q_{ij} = q_{ji}$:

$$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}, \qquad q_{ij} = \frac{\left(1 + \lVert y_i - y_j\rVert^2\right)^{-1}}{\sum_{k \neq l}\left(1 + \lVert y_k - y_l\rVert^2\right)^{-1}}$$

where $p_{ij}$ is the joint probability between any two data points in the high-dimensional space, and $q_{ij}$ is the joint probability between any two data points in the low-dimensional space.

The similarity of the two joint distributions is measured with the KL divergence:

$$C = KL(P \,\|\, Q) = \sum_{i}\sum_{j} p_{ij}\log\frac{p_{ij}}{q_{ij}}$$

where $C$ is the similarity between the joint probability distributions of the high-dimensional and low-dimensional spaces, $P$ is the joint probability distribution of the high-dimensional space, and $Q$ is the joint probability distribution of the low-dimensional space.
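To make the quantities in this passage concrete, the following Python sketch evaluates them on toy data. It assumes the standard t-SNE definitions: Gaussian conditional similarities in the high-dimensional space, symmetrized joint probabilities, Student-t joint probabilities in the two-dimensional map, and the KL divergence as cost. The function names, the fixed `sigmas` and the toy points are illustrative choices; a real pipeline would tune each sigma to a target perplexity and minimize the cost by gradient descent.

```python
import math

def conditional_p(points, sigmas):
    """Return p where p[i][j] = p_{j|i}, using a Gaussian centred on points[i]."""
    n = len(points)
    p = []
    for i in range(n):
        # Unnormalised Gaussian similarities; the diagonal is forced to 0 (p_{i|i} = 0).
        w = [0.0 if k == i else
             math.exp(-math.dist(points[i], points[k]) ** 2 / (2 * sigmas[i] ** 2))
             for k in range(n)]
        total = sum(w)
        p.append([v / total for v in w])
    return p

def joint_p(cond):
    """Symmetrise the conditionals: p_ij = (p_{j|i} + p_{i|j}) / (2N)."""
    n = len(cond)
    return [[(cond[i][j] + cond[j][i]) / (2 * n) for j in range(n)] for i in range(n)]

def joint_q(embedding):
    """Student-t similarities in the low-dimensional map, normalised over all pairs."""
    n = len(embedding)
    w = [[0.0 if i == j else 1.0 / (1.0 + math.dist(embedding[i], embedding[j]) ** 2)
          for j in range(n)] for i in range(n)]
    total = sum(sum(row) for row in w)
    return [[v / total for v in row] for row in w]

def kl_divergence(P, Q):
    """C = sum_ij p_ij * log(p_ij / q_ij), skipping zero-probability pairs."""
    return sum(p * math.log(p / q)
               for row_p, row_q in zip(P, Q)
               for p, q in zip(row_p, row_q) if p > 0)

# Hypothetical toy data: four 3-D points and a candidate 2-D layout for them
high = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (3.0, 3.0, 3.0)]
low = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (4.0, 4.0)]
P = joint_p(conditional_p(high, sigmas=[1.0] * 4))
Q = joint_q(low)
cost = kl_divergence(P, Q)
```

A lower `cost` means the two-dimensional layout preserves the neighborhood structure of the original points more faithfully.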

Further, the suitable-area prediction model in S3 is constructed as follows:

the SVDD model uses a fully connected network $\phi(\cdot; W)$ to map the preprocessed ecological factor data to a high-dimensional feature space;

the optimal hypersphere is found in the high-dimensional feature space, such that the pseudo-absence sample data lie outside the surface of the optimal hypersphere and the sample distribution data in the ecological factor data lie inside it.

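Further, suppose the network has $L$ layers with weights $W = \{W^1, \ldots, W^L\}$. For any input $x$, the output of the $\ell$-th layer is:

$$z^{\ell}(x) = \sigma^{\ell}\left(W^{\ell} \cdot z^{\ell-1}(x)\right), \qquad z^{0}(x) = x$$

where $\cdot$ denotes a linear operation (for example, matrix multiplication), $\sigma^{\ell}$ is the activation function of layer $\ell$, and $W^{\ell}$ is the weight of layer $\ell$.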
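The layer-by-layer mapping can be illustrated with a small forward pass in plain Python. This is a sketch under assumptions: each layer applies a weight matrix followed by an activation, with no bias term, and `tanh` is used only as a placeholder activation (the text does not name one). `dense_forward` and the toy weights are hypothetical.

```python
import math

def dense_forward(x, weights, activation=math.tanh):
    """Return the final-layer output for input x.

    Each layer computes activation(W . z), where z is the previous layer's
    output and the first z is x itself. Every W is a list of rows.
    """
    z = list(x)
    for W in weights:
        z = [activation(sum(w * v for w, v in zip(row, z))) for row in W]
    return z

# Toy two-layer network: 2 inputs -> 3 hidden units -> 1 output
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W2 = [[1.0, -1.0, 0.0]]
feature = dense_forward([0.5, -0.5], [W1, W2])
```

The output of the last layer plays the role of the network representation of the input in the feature space where the hypersphere is fitted.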

The objective function is:

$$\min_{R,\,W,\,\xi}\; R^2 + \frac{1}{n}\sum_{i=1}^{n}\xi_i + \frac{\lambda}{2}\sum_{\ell=1}^{L}\lVert W^{\ell}\rVert_F^2 \quad \text{s.t.}\quad \lVert\phi(x_i; W) - c\rVert^2 \le R^2 + \xi_i,\;\; \xi_i \ge 0,\;\; i = 1, \ldots, n \tag{10}$$

The first term of formula (10) combines the squared radius with the mean of the slack variables, subject to the squared distance from each network representation $\phi(x_i; W)$ to the center $c$ being no greater than the sum of the squared radius and the slack variable, where $n$ is the number of samples. The second term is a weight decay term using L2 regularization, where $\lambda$ is the weight decay coefficient with $\lambda > 0$, $\xi_i$ is a slack variable satisfying $\xi_i \ge 0$, and $\lVert\cdot\rVert_F$ is the Frobenius norm. The problem can therefore be seen as finding a minimum-volume hypersphere centered at $c$, shrinking the sphere radius by minimizing the mean deviation of all data representations from the center.
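As a rough numerical illustration, the sketch below evaluates a soft-boundary objective of this kind for fixed network outputs. It assumes one common reading of the objective, namely R² plus the mean slack plus (λ/2) times the sum of squared Frobenius norms of the layer weights, with each slack taken as max(0, squared distance minus R²). The function name and the toy numbers are hypothetical.

```python
def soft_boundary_objective(features, center, radius, weights, lam):
    """Evaluate R^2 + mean(slack) + (lam / 2) * sum of squared Frobenius norms.

    features: network representations of the training samples
    center:   hypersphere centre c
    radius:   hypersphere radius R
    weights:  list of layer weight matrices (lists of rows)
    lam:      weight decay coefficient (lambda > 0)
    """
    n = len(features)
    sq_dists = [sum((f - c) ** 2 for f, c in zip(feat, center)) for feat in features]
    # The smallest slack satisfying dist^2 <= R^2 + slack and slack >= 0
    slacks = [max(0.0, d - radius ** 2) for d in sq_dists]
    frob_sq = sum(w * w for W in weights for row in W for w in row)
    return radius ** 2 + sum(slacks) / n + 0.5 * lam * frob_sq

# Toy values: three mapped samples, one 2x2 weight matrix
value = soft_boundary_objective(
    features=[(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)],
    center=(0.0, 0.0),
    radius=1.0,
    weights=[[[1.0, 2.0], [3.0, 4.0]]],
    lam=0.1,
)
```

Here only the third sample lies outside the unit sphere, so it alone contributes a nonzero slack.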

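To minimize formula (10), Lagrange multipliers $\alpha_i$ and $\beta_i$ are introduced and the Lagrangian is constructed as follows:

$$\mathcal{L} = R^2 + \frac{1}{n}\sum_{i=1}^{n}\xi_i + \frac{\lambda}{2}\sum_{\ell=1}^{L}\lVert W^{\ell}\rVert_F^2 - \sum_{i=1}^{n}\alpha_i\left(R^2 + \xi_i - \lVert\phi(x_i; W) - c\rVert^2\right) - \sum_{i=1}^{n}\beta_i\xi_i \quad \text{s.t.}\quad \alpha_i \ge 0,\; \beta_i \ge 0 \tag{11}$$

Setting the derivatives with respect to $R$, $c$ and $\xi$ to zero gives:

$$\sum_{i=1}^{n}\alpha_i = 1, \qquad c = \sum_{i=1}^{n}\alpha_i\,\phi(x_i; W), \qquad \alpha_i + \beta_i = \frac{1}{n} \tag{12}$$

Combining formula (11) and formula (12) gives:

$$\max_{\alpha}\; \sum_{i=1}^{n}\alpha_i\langle\phi(x_i; W), \phi(x_i; W)\rangle - \sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j\langle\phi(x_i; W), \phi(x_j; W)\rangle + \frac{\lambda}{2}\sum_{\ell=1}^{L}\lVert W^{\ell}\rVert_F^2 \quad \text{s.t.}\quad 0 \le \alpha_i \le \frac{1}{n},\; \sum_{i=1}^{n}\alpha_i = 1$$

The radius and center of the optimal hypersphere are then:

$$c = \sum_{i=1}^{n}\alpha_i\,\phi(x_i; W), \qquad R^2 = \langle\phi(x_{sv}; W), \phi(x_{sv}; W)\rangle - 2\sum_{i=1}^{n}\alpha_i\langle\phi(x_{sv}; W), \phi(x_i; W)\rangle + \sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j\langle\phi(x_i; W), \phi(x_j; W)\rangle$$

where $c$ is the center point, $n$ is the number of samples, $\phi(x_i; W)$ is the network representation of each sample, $\alpha_i$ and $\alpha_j$ are Lagrange multipliers, $\langle\cdot,\cdot\rangle$ denotes the inner product, and $x_{sv}$ is a support vector, with $0 < \alpha_{sv} < 1/n$.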

Further, in S4, any test point of the medicinal material to be predicted is selected from the ecological factor data;

the distance between the test point and the center of the optimal hypersphere is computed:

$$s(x') = \lVert\phi(x'; W^{*}) - c\rVert$$

where $x'$ is the test point, $s(x')$ is the distance between the test point and the center of the optimal hypersphere, and $W^{*}$ denotes the hyperparameters of the SVDD model, that is, the trained network weights;

it is judged whether $s(x')$ is greater than the radius of the minimum-volume hypersphere: when $s(x')$ is greater than the radius of the optimal hypersphere, the test point belongs to the unsuitable area;

when $s(x')$ is less than or equal to the radius of the optimal hypersphere, the test point belongs to the suitable area;

the above operation is performed on all test points of the medicinal material to be predicted to obtain all of its suitable areas.
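The decision rule in S4 amounts to a distance comparison, which the following Python sketch makes explicit. It assumes the test points have already been mapped into the feature space by the trained network, so `features` holds those representations; `classify_points` is a hypothetical name.

```python
import math

def classify_points(features, center, radius):
    """Label each mapped test point: True = suitable area, False = unsuitable area.

    A point is suitable when its distance to the hypersphere centre is less
    than or equal to the radius, and unsuitable when it is greater.
    """
    return [math.dist(f, center) <= radius for f in features]

# Toy example: unit hypersphere centred at the origin
labels = classify_points(
    features=[(0.5, 0.5), (1.0, 0.0), (2.0, 0.0)],
    center=(0.0, 0.0),
    radius=1.0,
)
```

Points exactly on the surface count as suitable, matching the "less than or equal to" condition in the text.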

A system for predicting the suitable area of Chinese medicinal materials based on a deep SVDD model, comprising:

an acquisition module, for collecting ecological factor data of a Chinese medicinal material and generating its pseudo-absence sample data;

a preprocessing module, for preprocessing the collected ecological factor data to obtain preprocessed ecological factor data;

a prediction model generation module, for constructing the suitable-area prediction model from the preprocessed ecological factor data, the pseudo-absence sample data and the SVDD model;

a prediction module, for predicting and obtaining the suitable area of the medicinal material to be predicted.

Further, the prediction process of the prediction module is as follows: any test point of the medicinal material to be predicted is selected from the ecological factor data;

the distance between the test point and the center of the optimal hypersphere in the suitable-area prediction model is computed:

$$s(x') = \lVert\phi(x'; W^{*}) - c\rVert$$

where $x'$ is the test point, $s(x')$ is the distance between the test point and the center of the optimal hypersphere of the suitable-area prediction model, and $W^{*}$ denotes the hyperparameters of the SVDD model, that is, the trained network weights;

when $s(x')$ is less than or equal to the radius of the optimal hypersphere, the test point belongs to the suitable area; the above operation is performed on all test points of the medicinal material to be predicted to obtain all of its suitable areas.

Compared with the prior art, the invention has the following advantages and beneficial effects:

1. The method and system of the invention train the suitable-area model with the SVDD model and optimize the parameters of the deep SVDD model with SGD and its variants; because their computational complexity scales linearly with the number of training batches, the approach scales well to large data sets;

2. Because the SVDD model describes all of the data with vectors, the method and system of the invention do not introduce sampling bias or overfitting; the suitable area is obtained by judging the distance between each test point and the optimal hypersphere, which improves the accuracy of the suitable-area prediction.

Description of drawings

The accompanying drawings described here provide a further understanding of the embodiments of the invention and form part of this application; they do not limit the embodiments of the invention. In the drawings:

Fig. 1 is a flow chart of the overall method of the invention;

Fig. 2 is a schematic diagram of the system structure of the invention;

Fig. 3 is a schematic diagram of the operation of the SVDD model of the invention.

Detailed description

To make the purpose, technical solutions and advantages of the invention clearer, the invention is described in further detail below with reference to the embodiments and the accompanying drawings. The illustrative embodiments and their descriptions serve only to explain the invention and do not limit it.

Example 1

As shown in Fig. 1, a method for predicting the suitable area of Chinese medicinal materials based on a deep SVDD model comprises steps S1 to S4 described above, in which:

the ecological factor data include sample distribution data, environmental factor data and map data.

The step of generating the pseudo-absence sample data in S1 includes:

selecting from the unsuitable area the same number of pseudo-absence points as there are presence points in the ecological factor data, to obtain the pseudo-absence sample data of the medicinal material.

The preprocessing in S2 uses the t-SNE algorithm, in which:

a probability distribution over pairs of high-dimensional objects is constructed, with the similarity between different data points expressed as:

$$p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j\rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i}\exp\left(-\lVert x_i - x_k\rVert^2 / 2\sigma_i^2\right)}$$

where $p_{j|i}$ is the similarity between different data points in the high-dimensional space, $x_i$ and $x_j$ are any two distinct points among the $N$ high-dimensional data points $x_1, x_2, \ldots, x_N$, the parameter $\sigma_i$ is the variance of the Gaussian distribution centered on $x_i$, and $\lVert\cdot\rVert$ is the two-norm;

a probability distribution for the same objects is constructed in the low-dimensional space, with the similarity between different data points expressed as:

$$q_{j|i} = \frac{\exp\left(-\lVert y_i - y_j\rVert^2\right)}{\sum_{k \neq i}\exp\left(-\lVert y_i - y_k\rVert^2\right)}$$

where $q_{j|i}$ is the similarity between different data points in the low-dimensional space, and $y_i$ and $y_j$ are the two-dimensional points corresponding to $x_i$ and $x_j$.

The suitable-area prediction model in S3 is constructed as follows:

the SVDD model uses a fully connected network $\phi(\cdot; W)$ to map the preprocessed ecological factor data to a high-dimensional feature space.

The radius and center of the optimal hypersphere are given by:

$$c = \sum_{i=1}^{n}\alpha_i\,\phi(x_i; W), \qquad R^2 = \langle\phi(x_{sv}; W), \phi(x_{sv}; W)\rangle - 2\sum_{i=1}^{n}\alpha_i\langle\phi(x_{sv}; W), \phi(x_i; W)\rangle + \sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j\langle\phi(x_i; W), \phi(x_j; W)\rangle$$

where $c$ is the center point, $n$ is the number of samples, $\phi(x_i; W)$ is the network representation of each sample, $\alpha_i$ and $\alpha_j$ are Lagrange multipliers, $\langle\cdot,\cdot\rangle$ denotes the inner product, and $x_{sv}$ is a support vector, with $0 < \alpha_{sv} < 1/n$.

In S4, any test point of the medicinal material to be predicted is selected from the ecological factor data, and its distance to the center of the optimal hypersphere is computed as $s(x') = \lVert\phi(x'; W^{*}) - c\rVert$, where $W^{*}$ denotes the hyperparameters of the SVDD model.

As shown in Fig. 2, a system for predicting the suitable area of Chinese medicinal materials based on a deep SVDD model comprises the acquisition module, the preprocessing module, the prediction model generation module and the prediction module described above.

Example 2

As shown in Fig. 3, building on Example 1 and in view of the steadily rising demand for Salvia miltiorrhiza (Danshen), the invention takes Salvia miltiorrhiza as the study object. A total of 120 presence records were obtained. A total of 26 environmental factors were selected, as listed in Table 1, comprising 19 climate factors, 3 terrain factors and 4 soil factors. The pseudo-absence sample data comprise 120 records.

The 240 Salvia miltiorrhiza samples were used to validate the model, with 80% assigned to the training set and 20% to the test set. The learning rate was set to 0.0001 and the number of training epochs to 150; within one epoch, following the procedure of Example 1, the entire training set passes through the whole network once. The batch size was set to 20 and the weight decay coefficient to 5e-07.
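The training settings above imply a simple schedule, worked out in the Python sketch below. Two points are assumptions rather than statements from the text: the split is taken as a plain 80/20 partition of the 240 samples, and the final, smaller batch of each epoch is kept rather than dropped.

```python
import math

n_samples = 240        # 120 presence records + 120 pseudo-absence records
train_fraction = 0.8
batch_size = 20
epochs = 150

n_train = round(n_samples * train_fraction)           # samples used for training
n_test = n_samples - n_train                          # samples held out for testing
batches_per_epoch = math.ceil(n_train / batch_size)   # last batch may be smaller
last_batch_size = n_train - (batches_per_epoch - 1) * batch_size
total_updates = batches_per_epoch * epochs            # SGD steps over the whole run
```

Under these assumptions there are 192 training samples and 48 test samples, 10 batches per epoch (the last holding 12 samples), and 1500 parameter updates in total.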

Using AUC as the evaluation metric, this embodiment achieves an AUC of 0.997, compared with 0.899 for the MaxEnt model.
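For readers unfamiliar with the metric, AUC can be computed directly from its pairwise definition: the probability that a randomly chosen presence sample receives a higher suitability score than a randomly chosen absence sample. The sketch below is illustrative only; it assumes the negated distance to the hypersphere center is used as the suitability score, and the distances are made-up numbers, not data from this embodiment.

```python
def auc(scores_pos, scores_neg):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    wins = sum((p > q) + 0.5 * (p == q) for p in scores_pos for q in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical distances to the hypersphere centre (smaller = more suitable)
presence_dist = [0.2, 0.4, 0.9]
absence_dist = [0.8, 1.5, 2.0]

# Negate distances so that a higher score means more suitable
example_auc = auc([-d for d in presence_dist], [-d for d in absence_dist])
```

In this toy case eight of the nine presence/absence pairs are ordered correctly, so the AUC is 8/9.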

Table 1. Ecological environment factors of Chinese medicinal materials and their distribution

The specific embodiments described above explain the objectives, technical solutions and beneficial effects of the invention in further detail. It should be understood that the above are only specific embodiments of the invention and do not limit its scope of protection. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention falls within its scope of protection.

Claims

1. A method for predicting the suitable area of Chinese medicinal materials based on a deep SVDD model, characterized in that it comprises:

S1: collecting ecological factor data of a Chinese medicinal material, and generating pseudo-absence sample data for it with a MaxEnt model;

the pseudo-absence sample data being drawn from the unsuitable area of the medicinal material obtained from the MaxEnt model;

S2: preprocessing the collected ecological factor data to obtain preprocessed ecological factor data;

S3: constructing a suitable-area prediction model from the preprocessed ecological factor data, the pseudo-absence sample data and the SVDD model;

wherein the suitable-area prediction model in S3 is constructed as follows:

the SVDD model uses a fully connected network $\phi(\cdot; W)$ to map the preprocessed ecological factor data to a high-dimensional feature space;

the optimal hypersphere is found in the high-dimensional feature space, such that the pseudo-absence sample data lie outside the surface of the optimal hypersphere and the sample distribution data in the ecological factor data lie inside the optimal hypersphere;

the radius and center of the optimal hypersphere are given by:

$$c = \sum_{i=1}^{n}\alpha_i\,\phi(x_i; W), \qquad R^2 = \langle\phi(x_{sv}; W), \phi(x_{sv}; W)\rangle - 2\sum_{i=1}^{n}\alpha_i\langle\phi(x_{sv}; W), \phi(x_i; W)\rangle + \sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j\langle\phi(x_i; W), \phi(x_j; W)\rangle$$

where $c$ is the center point, $n$ is the number of samples, $\phi(x_i; W)$ is the network representation of each sample, $\alpha_i$ and $\alpha_j$ are Lagrange multipliers, $\langle\phi(x_{sv}; W), \phi(x_{sv}; W)\rangle$, $\langle\phi(x_{sv}; W), \phi(x_i; W)\rangle$ and $\langle\phi(x_i; W), \phi(x_j; W)\rangle$ all denote inner products, and $x_{sv}$ is a support vector, with $0 < \alpha_{sv} < 1/n$;

S4: feeding the test points of the medicinal material to be predicted into the suitable-area prediction model for judgment, to obtain the suitable area of the medicinal material to be predicted.

2. The method for predicting the suitable area of Chinese medicinal materials based on a deep SVDD model according to claim 1, wherein the ecological factor data include sample distribution data, environmental factor data and map data.

3. The method for predicting the suitable area of Chinese medicinal materials based on a deep SVDD model according to claim 1, wherein the step of generating the pseudo-absence sample data in S1 comprises:

using the MaxEnt model to generate numerical suitability results for the medicinal material;

removing from the suitability results the values greater than or equal to the threshold to obtain the unsuitable area;

selecting from the unsuitable area the same number of pseudo-absence points as there are presence points in the ecological factor data, to obtain the pseudo-absence sample data of the medicinal material.

4. The method for predicting the suitable growth area of Chinese medicinal materials based on the deep SVDD model according to claim 1, wherein the preprocessing process of the S2 comprises:

The word vector model Word2vec is used to convert the ecological factor data into a high-dimensional space word vector;

Use the t-SNE algorithm to map high-dimensional spatial word vectors to two-dimensional spatial word vectors.

5. The method for predicting suitable areas for Chinese medicinal materials based on a deep SVDD model according to claim 4, wherein the t-SNE algorithm comprises:

Construct a probability distribution between high-dimensional objects, and the similarity between different data is expressed as:

where p_{j|i} represents the similarity between different data in the high-dimensional space; x_i and x_j are any two different data among the N-dimensional data x_1, x_2, ..., x_N; the parameter σ_i represents the variance of the Gaussian distribution centered on x_i; || || denotes the two-norm operation; and x_k is the datum with index k among x_1, x_2, ..., x_N;

Construct a probability distribution in the low-dimensional space for the high-dimensional objects, and the similarity between different data is expressed as:

where q_{j|i} represents the similarity between different data in the low-dimensional space; y_i and y_j are two-dimensional data in the low-dimensional space; and y_k is the two-dimensional datum with index k in the low-dimensional space;

Construct the joint probability distributions P and Q of the high-dimensional space and the low-dimensional space respectively, so that for any i and j, p_{i|j} = p_{j|i} and q_{i|j} = q_{j|i};

where p_{i,j} represents the joint probability between any two data in the high-dimensional space, q_{i,j} represents the joint probability between any two data in the low-dimensional space, and y_l is the two-dimensional datum with index l in the low-dimensional space;
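For orientation, the standard t-SNE definitions that match the symbols described in this claim can be written as follows. This is a sketch under assumptions, not the claim's own equations: the Gaussian conditional forms, the symmetrization of p_{i,j} with a 2N divisor, and the Student-t form of q_{i,j} are taken from the usual t-SNE formulation.

```latex
p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
q_{j|i} = \frac{\exp\left(-\lVert y_i - y_j \rVert^2\right)}
               {\sum_{k \neq i} \exp\left(-\lVert y_i - y_k \rVert^2\right)}

p_{i,j} = \frac{p_{j|i} + p_{i|j}}{2N},
\qquad
q_{i,j} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
               {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}
```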

Using KL divergence to measure the similarity of the joint probability distribution of high-dimensional space and low-dimensional space, we get:

Among them, C represents the similarity of the joint probability distribution of high-dimensional space and low-dimensional space, P represents the joint probability of high-dimensional space, and Q represents the joint probability of low-dimensional space.
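The cost in this step is usually written as C = KL(P || Q) = Σ_i Σ_j p_{i,j} log(p_{i,j} / q_{i,j}). The sketch below computes it in plain Python for a toy data set. It is an illustration under assumptions, not the patented procedure: the helper names are hypothetical, a single fixed `sigma` replaces the per-point σ_i that t-SNE normally tunes through a perplexity search, and the joint distributions use the standard t-SNE forms (symmetrized Gaussian for P, Student-t for Q).

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def joint_p(xs, sigma=1.0):
    """Symmetrized high-dimensional joint probabilities p_{i,j}."""
    n = len(xs)
    cond = []  # cond[i][j] holds the conditional p_{j|i}
    for i in range(n):
        w = [math.exp(-sq_dist(xs[i], xs[k]) / (2 * sigma ** 2)) if k != i else 0.0
             for k in range(n)]
        total = sum(w)
        cond.append([v / total for v in w])
    return [[(cond[i][j] + cond[j][i]) / (2 * n) for j in range(n)]
            for i in range(n)]


def joint_q(ys):
    """Low-dimensional joint probabilities q_{i,j} with a Student-t kernel."""
    n = len(ys)
    w = [[1.0 / (1.0 + sq_dist(ys[i], ys[j])) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    total = sum(sum(row) for row in w)
    return [[v / total for v in row] for row in w]


def kl_cost(P, Q):
    """KL divergence C = sum_ij p_ij * log(p_ij / q_ij); zero-p terms are skipped."""
    return sum(p * math.log(p / q)
               for row_p, row_q in zip(P, Q)
               for p, q in zip(row_p, row_q) if p > 0.0)


# Two tight clusters in 3-D, and two candidate 2-D layouts (made-up data).
xs = [(0, 0, 0), (0, 1, 0), (5, 5, 5), (5, 6, 5)]
ys_good = [(0, 0), (0, 1), (5, 5), (5, 6)]   # keeps neighbours together
ys_bad = [(0, 0), (5, 5), (0, 1), (5, 6)]    # splits neighbours apart

P = joint_p(xs)
cost_good = kl_cost(P, joint_q(ys_good))
cost_bad = kl_cost(P, joint_q(ys_bad))
```

A layout that preserves the neighbourhood structure of the high-dimensional data yields a lower cost; t-SNE minimizes this cost over the low-dimensional coordinates by gradient descent.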

6. The method for predicting suitable areas for Chinese medicinal materials based on a deep SVDD model according to claim 1, wherein in S4, a test point of the Chinese medicinal material to be predicted is selected from any of the ecological factor data;

Calculate the distance between the test point of the Chinese medicinal material to be predicted and the center point of the optimal hypersphere:

where x' represents the test point, s(x') represents the distance between the test point and the center point of the optimal hypersphere, and [symbol not reproduced] represents the hyperparameters of the SVDD model;

Judge whether s(x') is greater than the radius of the hypersphere with the smallest volume: when s(x') is greater than the radius of the optimal hypersphere, the test point is an unsuitable area;

When s(x') is less than or equal to the radius of the optimal hypersphere, the test point is a suitable area;

The above operation is performed on all the test points of the Chinese herbal medicine to be predicted to obtain all the suitable areas of the Chinese herbal medicine to be predicted.
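The decision rule of this claim can be sketched as follows. This is a minimal illustration under assumptions: `embed` is a hypothetical stand-in for the trained deep SVDD network, `center` and `radius` are assumed to come from the trained model, and s(x') is taken to be the Euclidean distance between the embedded test point and the center.

```python
import math


def distance_to_center(z, center):
    """Euclidean distance between an embedded point z and the sphere center."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(z, center)))


def classify_points(points, embed, center, radius):
    """Label each test point by comparing s(x') with the sphere radius.

    s(x') <= radius -> "suitable"; s(x') > radius -> "unsuitable".
    """
    labels = []
    for x in points:
        s = distance_to_center(embed(x), center)
        labels.append("suitable" if s <= radius else "unsuitable")
    return labels


def identity_embed(x):
    """Hypothetical placeholder for the trained network mapping."""
    return x


test_points = [(0.1, 0.2), (1.0, 0.0), (2.0, 2.0)]
labels = classify_points(test_points, identity_embed,
                         center=(0.0, 0.0), radius=1.0)
```

A point lying exactly on the boundary is labelled suitable, which follows the "less than or equal to" condition in the claim.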

7. A system for predicting suitable areas for Chinese medicinal materials based on a deep SVDD model, characterized by comprising:

The collection module is used to collect ecological factor data of Chinese medicinal materials and generate pseudo-absence point sample data of Chinese medicinal materials;

The preprocessing module is used for preprocessing the collected ecological factor data of Chinese medicinal materials to obtain ecological factor preprocessing data;

The prediction model generation module is used for constructing the prediction model of the suitable area for Chinese medicinal materials according to the ecological factor preprocessing data, the pseudo-absence point sample data and the SVDD model;

The prediction module is used to predict and obtain the suitable area of the Chinese herbal medicine to be predicted;

The prediction process of the prediction module: select the test points of the Chinese medicinal materials to be predicted in any of the ecological factor data;

Calculate the distance between the test point of the Chinese medicinal material to be predicted and the optimal hypersphere center point in the prediction model of the Chinese medicinal material suitable area:

where x' represents the test point, s(x') represents the distance between the test point and the center point of the optimal hypersphere in the prediction model of the suitable area for Chinese medicinal materials, and [symbol not reproduced] represents the hyperparameters of the SVDD model;

When s(x') is less than or equal to the radius of the optimal hypersphere, the test point is a suitable area; the above operation is performed on all the test points of the Chinese herbal medicine to be predicted to obtain all the suitable areas of the Chinese herbal medicine to be predicted.