CN112308117A - Homogeneous crowd identification method based on double-index particle swarm algorithm

Info

Publication number: CN112308117A
Application number: CN202011075681.XA
Authority: CN
Inventors: 胡晓敏; 李瑞珠; 李敏
Original assignee: Guangdong University of Technology
Current assignee: Guangdong University of Technology
Priority date: 2020-10-09
Filing date: 2020-10-09
Publication date: 2021-02-02

Abstract

The invention provides a homogeneous crowd identification method based on a dual-index particle swarm algorithm, addressing the inability of single-index clustering algorithms to analyze a public health service platform comprehensively. The method comprises: collecting information on the platform's user population as a user information data set; obtaining two initial fitness values from the user information data set through a clustering algorithm; and iterating with the two initial fitness values as fitness functions to obtain a clustering result and thereby the homogeneous crowd data. By evaluating fitness with two indices, the method guides the particle swarm clustering on the basis of both index results, removes the evaluation bias of a single index, broadens the otherwise narrow applicability of a single internal index, saves labor and time, and supports comprehensive analysis of complex and diverse population information.

Description

Homogeneous crowd identification method based on dual-index particle swarm algorithm

Technical Field

The invention relates to the field of swarm intelligence and evolutionary computation, and in particular to a homogeneous crowd identification method based on a dual-index particle swarm algorithm.

Background

At present, many clustering algorithms have been applied to homogeneous crowd identification, both in China and abroad, and each has known drawbacks: the k-means algorithm is sensitive to the initial centers and to the choice of K; grid-based methods lack accuracy; multiple regression methods are overly sensitive to the data; and density-based methods have poor noise resistance and depend heavily on the choice of neighborhood radius.

In addition, internal indices are biased by design, which limits the expressive power of any single index. Consequently, methods that use a single index as the fitness evaluation for optimization, such as genetic algorithms combined with clustering or differential evolution algorithms combined with clustering, also produce one-sided results.

For public health service workers, it is extremely important to optimize the service platform for its various user groups so that public health services meet people's needs as fully as possible, and this optimization must continue as the platform is used. The platform's users are diverse and differ by age group, living environment, and so on. In the past, platform optimization relied mainly on online questionnaires and offline field surveys to understand the users, or on single-index clustering algorithms to analyze the user data. These approaches consume considerable labor and time, and given the complexity and diversity of the population information, the resulting analysis is not comprehensive enough.

Summary of the Invention

To address the inability of single-index clustering algorithms to analyze a public health service platform comprehensively, the present invention proposes a homogeneous crowd identification method based on a dual-index particle swarm algorithm. It performs cluster analysis of homogeneous crowds using a dual-index fitness evaluation, improving both the efficiency and the comprehensiveness of platform optimization.

The technical solution adopted by the present invention to solve the above technical problem is a homogeneous crowd identification method based on a dual-index particle swarm algorithm, comprising:

collecting information on the user population of a public health service platform as a user information data set;

obtaining two initial fitness values from the user information data set through a clustering algorithm;

iterating with the two initial fitness values as fitness functions to obtain a clustering result and obtain the homogeneous crowd data.

The step of "collecting information on the user population of a public health service platform as a user information data set" comprises:

converting all information on the users of the public health service platform into numbers to obtain a data set;

converting the data set into a readable file format;

removing useless attribute columns from the data set to obtain a processed data set;

standardizing the processed data set to obtain the user information data set.

The "readable file format" includes the csv format and/or the bat format.

The "information on the users of the public health service platform" includes nationality, place of residence, and age.

The two "initial fitness values" are Fitness1 (CH) and Fitness2 (S_Dbw).

Fitness1 (CH) is obtained as follows:

Index formula:

$$CH(K) = \frac{\operatorname{tr}(B)/(K-1)}{\operatorname{tr}(W)/(N-K)}$$

where

$$\operatorname{tr}(W) = \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - m_k \rVert^2$$

is the trace of the within-class dispersion matrix, and

$$\operatorname{tr}(B) = \sum_{k=1}^{K} n_k \lVert m_k - m \rVert^2$$

is the trace of the between-class dispersion matrix. Here m is the mean vector of the whole data set, m_k and n_k are the mean vector and the size of class C_k, N is the number of samples, and K is the number of classes.
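
As a rough illustration of how Fitness1 could be computed, the following pure-Python sketch evaluates the CH index for a labelled data set. The function name `calinski_harabasz`, the list-of-tuples data layout, and the handling of the zero-dispersion case are assumptions made for this example, not details taken from the method itself.

```python
def calinski_harabasz(points, labels):
    """CH = [tr(B) / (K - 1)] / [tr(W) / (N - K)] for a labelled data set.

    points: list of equal-length numeric tuples (hypothetical layout).
    labels: one class label per point.
    """
    n, dim = len(points), len(points[0])

    # Group the samples by class label.
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    k = len(clusters)
    if k < 2 or n <= k:
        raise ValueError("CH needs 2 <= K < N")

    # Mean vector m of the whole data set.
    overall = [sum(p[d] for p in points) / n for d in range(dim)]

    tr_w = 0.0  # within-class dispersion: squared distances to class means
    tr_b = 0.0  # between-class dispersion: weighted squared distances to m
    for members in clusters.values():
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        for p in members:
            tr_w += sum((p[d] - centroid[d]) ** 2 for d in range(dim))
        tr_b += len(members) * sum((centroid[d] - overall[d]) ** 2 for d in range(dim))

    if tr_w == 0.0:
        # Assumption: treat perfectly compact classes as the best possible score.
        return float("inf")
    return (tr_b / (k - 1)) / (tr_w / (n - k))


data = [(0, 0), (0, 2), (10, 0), (10, 2)]
good = calinski_harabasz(data, [0, 0, 1, 1])  # two tight pairs
bad = calinski_harabasz(data, [0, 1, 0, 1])   # pairs mixed together
```

On these four points the natural partition scores 50 and the mixed partition scores 0.08, so a higher CH value marks the better clustering.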

Fitness2 (S_Dbw) is obtained as follows:

Index formula: S_Dbw(c) = Scat(c) + Dens_bw(c)

where:

Dens_bw(c) evaluates the relationship between the density of two classes taken together and the density of each class on its own;

density(u) is the number of points in the neighborhood of u, with stdev as the comparison threshold;

stdev is the average deviation of the clusters in the data set;

Scat(c) is the average scattering of the classes.
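
For reference, the components of S_Dbw are commonly written as below. This is a sketch based on the standard formulation of the index by Halkidi and Vazirgiannis; it is assumed, not confirmed, to match the formulas intended here. The symbols v_i (center of class i), u_ij (midpoint of the segment joining v_i and v_j), σ(·) (per-dimension variance vector), S (the whole data set), d(x, u) (Euclidean distance), and n_ij (number of points in classes i and j together) are not defined in the text above and are introduced only for this illustration.

```latex
\begin{aligned}
\mathrm{Scat}(c) &= \frac{1}{c}\sum_{i=1}^{c}
  \frac{\lVert\sigma(v_i)\rVert}{\lVert\sigma(S)\rVert},\\[4pt]
\mathit{stdev} &= \frac{1}{c}\sqrt{\sum_{i=1}^{c}\lVert\sigma(v_i)\rVert},\\[4pt]
\mathrm{density}(u) &= \sum_{l=1}^{n_{ij}} f(x_l,u),\qquad
  f(x,u)=\begin{cases}
    0 & \text{if } d(x,u) > \mathit{stdev}\\
    1 & \text{otherwise}
  \end{cases},\\[4pt]
\mathrm{Dens\_bw}(c) &= \frac{1}{c(c-1)}\sum_{i=1}^{c}
  \sum_{\substack{j=1\\ j\neq i}}^{c}
  \frac{\mathrm{density}(u_{ij})}
       {\max\{\mathrm{density}(v_i),\,\mathrm{density}(v_j)\}}.
\end{aligned}
```

Under this formulation, lower S_Dbw values indicate more compact and better separated classes.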

The process of "iterating with the two initial fitness values as fitness functions to obtain a clustering result and obtain the homogeneous crowd data" is:

a. Initialize the parameters of the particle swarm algorithm;

b. Compute the fitness values Fitness1 and Fitness2 of each particle from the CH and S_Dbw index formulas, respectively;

c. For each particle, compare Fitness1 and Fitness2 at its current position with the two fitness values of its historical best position; if both fitness values at the current position are higher, replace the historical best position with the current position; otherwise leave it unchanged;

d. For each particle, compare Fitness1 and Fitness2 at its current position with the two fitness values of the global best position; if both fitness values at the current position are higher, replace the global best position with the current position; otherwise leave it unchanged;

e. Update the position and velocity of each particle:

Particle velocity update formula:

$$v_{id}^{k+1} = w\,v_{id}^{k} + c_1 r_1 \left(pbest_{id} - x_{id}^{k}\right) + c_2 r_2 \left(gbest_{d} - x_{id}^{k}\right)$$

Particle position update formula:

$$x_{id}^{k+1} = x_{id}^{k} + v_{id}^{k+1}$$

where $v_{id}^{k}$ is the d-th component of the velocity vector of particle i at the k-th iteration; $x_{id}^{k}$ is the d-th component of the position vector of particle i at the k-th iteration; $c_1$ and $c_2$ are acceleration constants; $r_1$ and $r_2$ are two random parameters in the range [0, 1]; and $w$ is the inertia weight;

f. If the termination condition is not met, return to step b; if it is met, the algorithm ends and the global best position is the global optimal solution;

Finally, the clustering result is output, yielding accurate homogeneous crowd data.
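
The acceptance rule in steps c and d can be sketched as follows. The text requires both fitness values to be higher before a best position is replaced; because S_Dbw is ordinarily a lower-is-better index, this sketch assumes Fitness2 has been oriented so that higher is better (for example by negating S_Dbw). That orientation, along with the names `dominates_both` and `update_bests` and the tuple layout of the fitness pair, is an assumption of this example.

```python
def dominates_both(candidate, incumbent):
    """True only when the candidate is strictly higher on BOTH fitness values."""
    return candidate[0] > incumbent[0] and candidate[1] > incumbent[1]


def update_bests(positions, fitnesses, pbest, pbest_fit, gbest, gbest_fit):
    """Apply steps c and d once for every particle.

    fitnesses[i] is a (Fitness1, Fitness2) pair for particle i, both assumed
    to be oriented so that higher is better. pbest and pbest_fit are updated
    in place; the (possibly new) global best is returned.
    """
    for i, (pos, fit) in enumerate(zip(positions, fitnesses)):
        # Step c: replace the particle's historical best only on a double win.
        if dominates_both(fit, pbest_fit[i]):
            pbest[i], pbest_fit[i] = list(pos), fit
        # Step d: the same rule against the global best.
        if dominates_both(fit, gbest_fit):
            gbest, gbest_fit = list(pos), fit
    return gbest, gbest_fit


positions = [[1.0], [2.0]]
fitnesses = [(5.0, -0.2), (3.0, -0.1)]   # (CH, -S_Dbw), hypothetical values
pbest = [[0.0], [0.0]]
pbest_fit = [(4.0, -0.3), (2.5, -0.5)]
gbest, gbest_fit = update_bests(positions, fitnesses, pbest, pbest_fit,
                                [0.0], (4.0, -0.3))
```

In this run the second particle improves its own best but not the global best, because its Fitness1 is lower than that of the first particle; a position that wins on only one index is never accepted.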

The beneficial effects of the present invention are as follows. By evaluating fitness with two indices, the invention guides the optimization of the particle swarm clustering on the basis of both index results, which removes the evaluation bias of a single index and broadens the otherwise narrow applicability of internal indices. Compared with a single index, dual-index fitness evaluation retains the population data more completely during clustering and does the most to ensure that the particle-swarm-based clustering result is well separated between classes and compact within classes. When the population data has many attributes, the two indices yield a more inclusive partition that groups the most similar samples together, saving labor and time and allowing complex and diverse population information to be analyzed comprehensively.

Description of the Drawings

FIG. 1 is a flowchart of the extraction of user population information in the present invention.

FIG. 2 is a diagram of the encoding used by the algorithm of the present invention.

FIG. 3 is a flowchart of the dual-index particle swarm algorithm of the present invention.

Detailed Description

The present invention is further described below with reference to the accompanying drawings.

As shown in FIGS. 1 to 3, the homogeneous crowd identification method based on the dual-index particle swarm algorithm according to the present invention proceeds as follows.

The specific steps are:

First, all information on the users of the public health service platform is converted into numbers, for example nationality, place of residence, age, body mass index (BMI), and activities of daily living (ADL) score. The data set is then converted into a file format that the program can read, such as csv or bat, and loaded, reading in all attribute values of the population information. The attributes are screened and useless attribute columns are removed, leaving N samples with D attributes each. Finally, all data are standardized to obtain the user information data set. In the implementation, obj[i].dataX[j] denotes the j-th attribute of the i-th sample.
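
A minimal sketch of this preprocessing, assuming the records are already loaded as dictionaries. The text does not say how categories are turned into numbers or which standardization is used, so the integer coding of text values and the z-score scaling below are assumptions, as are the function name `encode_and_standardize` and the sample field names.

```python
import statistics


def encode_and_standardize(records, drop=()):
    """Turn a list of dict records into z-scored numeric rows.

    Returns (columns, rows) where rows[i][j] plays the role of
    obj[i].dataX[j]: the j-th attribute of the i-th sample.
    """
    # Attribute screening: keep every column not listed in `drop`.
    columns = [c for c in records[0] if c not in drop]

    # Convert everything to numbers. Text values get an integer code per
    # column (assumption: this imposes an arbitrary order on categories).
    codebooks = {}
    rows = []
    for rec in records:
        row = []
        for c in columns:
            value = rec[c]
            if isinstance(value, str):
                book = codebooks.setdefault(c, {})
                value = book.setdefault(value, len(book))
            row.append(float(value))
        rows.append(row)

    # Standardize each column to zero mean and unit variance (z-score).
    for j in range(len(columns)):
        col = [row[j] for row in rows]
        mean, sd = statistics.fmean(col), statistics.pstdev(col)
        for row in rows:
            row[j] = (row[j] - mean) / sd if sd > 0 else 0.0
    return columns, rows


records = [
    {"id": "u1", "nationality": "CN", "age": 30, "bmi": 22.0},
    {"id": "u2", "nationality": "US", "age": 50, "bmi": 26.0},
    {"id": "u3", "nationality": "CN", "age": 40, "bmi": 24.0},
]
columns, rows = encode_and_standardize(records, drop=("id",))
```

Here N = 3 and D = 3 after the identifier column is dropped. Integer codes are the simplest reading of "converted into numbers"; for nominal attributes such as nationality, a one-hot encoding may distort distances less.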

After all of the above steps, the clustering program encodes the individuals of the population and initializes the population. It obtains the initial clusters using the nearest multiple prototypes (NMP) nearest-neighbor clustering rule, computes the initial center of each cluster using the Euclidean distance formula, and then obtains the initial fitness values Fitness1 (CH) and Fitness2 (S_Dbw):
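
The nearest-prototype assignment and the center computation might look like the sketch below. It assumes that each particle encodes one prototype vector per class; the actual encoding is the one shown in FIG. 2, which is not reproduced here. The names `assign_nearest` and `cluster_centers` are hypothetical.

```python
import math


def assign_nearest(points, prototypes):
    """Label each sample with the index of its nearest prototype (Euclidean)."""
    labels = []
    for p in points:
        dists = [math.dist(p, q) for q in prototypes]
        labels.append(dists.index(min(dists)))
    return labels


def cluster_centers(points, labels, k):
    """Mean vector of each of the k classes; None for a class with no members."""
    dim = len(points[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p, lab in zip(points, labels):
        counts[lab] += 1
        for d in range(dim):
            sums[lab][d] += p[d]
    return [
        [s / counts[j] for s in sums[j]] if counts[j] else None
        for j in range(k)
    ]


points = [(0, 0), (0, 2), (10, 0), (10, 2)]
prototypes = [(1, 1), (9, 1)]            # decoded from one hypothetical particle
labels = assign_nearest(points, prototypes)
centers = cluster_centers(points, labels, k=2)
```

The labels and centers produced this way are what the two index formulas need as input when a particle's fitness is evaluated.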

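① CH (Calinski-Harabasz):

$$CH(K) = \frac{\operatorname{tr}(B)/(K-1)}{\operatorname{tr}(W)/(N-K)}$$

where

$$\operatorname{tr}(W) = \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - m_k \rVert^2$$

is the trace of the within-class dispersion matrix, that is, the sum of the squared distances from each data point to the centroid of its class, and

$$\operatorname{tr}(B) = \sum_{k=1}^{K} n_k \lVert m_k - m \rVert^2$$

is the trace of the between-class dispersion matrix (a matrix composed of vector distance values). Here m is the mean vector of the whole data set, m_k and n_k are the mean vector and the size of class C_k, N is the number of samples, and K is the number of classes.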

② S_Dbw:

S_Dbw(c) = Scat(c) + Dens_bw(c)

Dens_bw(c) evaluates the relationship between the density of two classes taken together and the density of each class on its own. The clustering is better when the density of the two classes together is significantly lower than the density of each individual class.

density(u) is the number of points in the neighborhood of u, with stdev as the comparison threshold.

stdev is the average deviation of the classes in the data set.
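
A compact pure-Python sketch of S_Dbw follows. It uses the standard formulation of the index (Scat as the average ratio of class variance norm to data set variance norm, and Dens_bw computed at the midpoints between class centers), which is assumed rather than confirmed to be the variant intended here. The function names and the choice to skip class pairs whose centers have zero density are assumptions of this example.

```python
import math


def _mean(rows):
    return [sum(r[d] for r in rows) / len(rows) for d in range(len(rows[0]))]


def _variance_norm(rows, center):
    """Euclidean norm of the per-dimension variance vector around `center`."""
    var = [sum((r[d] - center[d]) ** 2 for r in rows) / len(rows)
           for d in range(len(center))]
    return math.sqrt(sum(v * v for v in var))


def s_dbw(points, labels):
    """S_Dbw(c) = Scat(c) + Dens_bw(c); lower values mean better clustering."""
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    clusters = list(groups.values())
    c = len(clusters)
    if c < 2:
        raise ValueError("S_Dbw needs at least two classes")

    centers = [_mean(g) for g in clusters]
    sigma = [_variance_norm(g, v) for g, v in zip(clusters, centers)]

    # Scat: average class scattering relative to the whole data set.
    scat = sum(sigma) / (c * _variance_norm(points, _mean(points)))

    # stdev: average deviation of the classes, used as the density radius.
    stdev = math.sqrt(sum(sigma)) / c

    def density(u, rows):
        return sum(1 for r in rows if math.dist(r, u) <= stdev)

    total = 0.0
    for i in range(c):
        for j in range(c):
            if i == j:
                continue
            both = clusters[i] + clusters[j]
            mid = [(a + b) / 2 for a, b in zip(centers[i], centers[j])]
            denom = max(density(centers[i], both), density(centers[j], both))
            if denom:  # assumption: a pair with empty center neighborhoods adds 0
                total += density(mid, both) / denom
    return scat + total / (c * (c - 1))


data = [(0, 0), (0, 2), (10, 0), (10, 2)]
good = s_dbw(data, [0, 0, 1, 1])
bad = s_dbw(data, [0, 1, 0, 1])
```

The natural two-pair partition yields a much smaller value than the mixed one, which is the opposite direction from CH, where larger is better.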

These two internal indices are computed as the fitness functions, and the following iterative process is then entered:

while (the iteration count is less than the preset value) do

Position of particle i: x_i = (x_i1, x_i2, ..., x_iD);

Velocity of particle i: v_i = (v_i1, v_i2, ..., v_iD);

Best position visited by particle i: pbest_i = (p_i1, p_i2, ..., p_iD), with corresponding fitness values bFitness1 and bFitness2;

Best position visited by the whole swarm: gbest = (g_1, g_2, ..., g_D), with corresponding fitness values gFitness1 and gFitness2;

Each dimension of the position is limited to the interval [X_min,d, X_max,d];

Each dimension of the velocity is limited to the interval [-V_max,d, V_max,d];

b. Compute the fitness values Fitness1 and Fitness2 of each particle from the CH and S_Dbw index formulas;

c. For each particle, compare the two fitness values at its current position with the two fitness values of its historical best position (pbest); if both fitness values at the current position are higher, replace the historical best position with the current position; otherwise leave it unchanged;

d. For each particle, compare the two fitness values at its current position with the two fitness values of the global best position (gbest); if both fitness values at the current position are higher, replace the global best position with the current position; otherwise leave it unchanged;

e. Update the position and velocity of each particle according to the formulas:

Particle velocity update formula:

$$v_{id}^{k+1} = w\,v_{id}^{k} + c_1 r_1 \left(pbest_{id} - x_{id}^{k}\right) + c_2 r_2 \left(gbest_{d} - x_{id}^{k}\right)$$

Particle position update formula:

$$x_{id}^{k+1} = x_{id}^{k} + v_{id}^{k+1}$$

where $v_{id}^{k}$ is the d-th component of the velocity vector of particle i at the k-th iteration; $x_{id}^{k}$ is the d-th component of the position vector of particle i at the k-th iteration; $c_1$ and $c_2$ are acceleration constants that set the maximum learning step size; $r_1$ and $r_2$ are two random parameters in the range [0, 1] that add randomness to the search; and $w$ is the inertia weight, a non-negative number that adjusts the extent of the search over the solution space;

f. If the termination condition is not met, return to step b; if it is met, the algorithm ends and the global best position (gbest) is the global optimal solution.

end while

Finally, the clustering result is output, yielding accurate homogeneous crowd data.
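
Step e, together with the per-dimension limits on position and velocity, can be sketched as below. The update is the standard inertia-weight particle swarm rule, assumed here to be the intended formula; the function names, the argument order, and the use of Python's `random` module are choices made for this example.

```python
import random


def clamp(value, low, high):
    """Limit value to the closed interval [low, high]."""
    return max(low, min(high, value))


def move_particle(x, v, pbest, gbest, w, c1, c2, x_min, x_max, v_max, rng=random):
    """Return the new (position, velocity) of one particle.

    Velocity: w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x),
    clamped to [-v_max[d], v_max[d]]; position: x + new velocity,
    clamped to [x_min[d], x_max[d]].
    """
    new_x, new_v = [], []
    for d in range(len(x)):
        r1, r2 = rng.random(), rng.random()  # fresh random factors per dimension
        vd = (w * v[d]
              + c1 * r1 * (pbest[d] - x[d])
              + c2 * r2 * (gbest[d] - x[d]))
        vd = clamp(vd, -v_max[d], v_max[d])
        new_v.append(vd)
        new_x.append(clamp(x[d] + vd, x_min[d], x_max[d]))
    return new_x, new_v


# A fast particle near the upper bound: both limits take effect.
fast_x, fast_v = move_particle(
    [0.0, 9.5], [5.0, 5.0], pbest=[1.0, 10.0], gbest=[2.0, 10.0],
    w=0.7, c1=2.0, c2=2.0,
    x_min=[0.0, 0.0], x_max=[10.0, 10.0], v_max=[1.0, 1.0],
)

# A particle sitting on both best positions keeps only its inertia term.
calm_x, calm_v = move_particle(
    [1.0], [0.5], pbest=[1.0], gbest=[1.0],
    w=0.5, c1=2.0, c2=2.0, x_min=[0.0], x_max=[10.0], v_max=[1.0],
)
```

Clamping the velocity before the position update keeps a single step from exceeding V_max, and clamping the position afterwards keeps every decoded prototype inside the range of the standardized data.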

Note that, besides the CH and S_Dbw indices used here, other internal indices may be chosen as the fitness values according to the characteristics of the indices and the purpose of the analysis. Likewise, clustering based on a genetic algorithm or on a differential evolution algorithm may be used in place of the particle-swarm-based clustering scheme.

The above is only a specific embodiment of the present invention, and the scope of protection of the invention is not limited to it. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the invention falls within its scope of protection. The scope of protection of the invention is therefore defined by the claims.

Claims

1. A homogeneous crowd identification method based on a dual-index particle swarm algorithm, characterized by comprising the following steps:

collecting information on the user population of a public health service platform as a user information data set;

obtaining two initial fitness values from the user information data set through a clustering algorithm;

and iterating with the two initial fitness values as fitness functions to obtain a clustering result and obtain homogeneous crowd data.

2. The homogeneous crowd identification method based on the dual-index particle swarm algorithm according to claim 1, wherein collecting information on the user population of the public health service platform as the user information data set comprises:

converting all information on the users of the public health service platform into numbers to obtain a data set;

converting the data set into a readable file format;

removing useless attribute columns from the data set to obtain a processed data set;

and standardizing the processed data set to obtain the user information data set.

3. The homogeneous crowd identification method based on the dual-index particle swarm algorithm according to claim 2, wherein the "readable file format" comprises the csv format or the bat format.

4. The homogeneous crowd identification method based on the dual-index particle swarm algorithm according to claim 2, wherein the information on the users of the public health service platform comprises: nationality, place of residence, and age.

5. The homogeneous crowd identification method based on the dual-index particle swarm algorithm according to claim 1, wherein the two "initial fitness values" are Fitness1 (CH) and Fitness2 (S_Dbw).

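6. The homogeneous crowd identification method based on the dual-index particle swarm algorithm according to claim 5, wherein Fitness1 (CH) is obtained by the index formula:

$$CH(K) = \frac{\operatorname{tr}(B)/(K-1)}{\operatorname{tr}(W)/(N-K)}$$

where $\operatorname{tr}(W) = \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - m_k \rVert^2$ is the trace of the within-class dispersion matrix;

$\operatorname{tr}(B) = \sum_{k=1}^{K} n_k \lVert m_k - m \rVert^2$ is the trace of the between-class dispersion matrix; m is the mean vector of the whole data set; m_k and n_k are the mean vector and the size of class C_k; N is the number of samples; and K is the number of classes.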

7. The homogeneous crowd identification method based on the dual-index particle swarm algorithm according to claim 5, wherein Fitness2 (S_Dbw) is obtained by the index formula:

S_Dbw(c) = Scat(c) + Dens_bw(c)

wherein:

Dens_bw(c) evaluates the relationship between the density of two classes taken together and the density of each class on its own;

density(u) is the number of points in the neighborhood of u, with stdev as the comparison threshold;

stdev is the average deviation of the classes in the data set;

Scat(c) is the average scattering of the classes.

8. The homogeneous crowd identification method based on the dual-index particle swarm algorithm according to claim 5, wherein the process of iterating with the two initial fitness values as fitness functions to obtain a clustering result and obtain the homogeneous crowd data is as follows:

a. initializing the parameters of the particle swarm algorithm;

b. computing the fitness values Fitness1 and Fitness2 of each particle from the CH and S_Dbw index formulas, respectively;

c. for each particle, comparing Fitness1 and Fitness2 at its current position with the two fitness values of its historical best position; if both fitness values at the current position are higher, replacing the historical best position with the current position; otherwise not updating;

d. for each particle, comparing Fitness1 and Fitness2 at its current position with the two fitness values of the global best position; if both fitness values at the current position are higher, replacing the global best position with the current position; otherwise not updating;

e. updating the position and velocity of each particle:

particle velocity update formula:

$$v_{id}^{k+1} = w\,v_{id}^{k} + c_1 r_1 \left(pbest_{id} - x_{id}^{k}\right) + c_2 r_2 \left(gbest_{d} - x_{id}^{k}\right)$$

particle position update formula:

$$x_{id}^{k+1} = x_{id}^{k} + v_{id}^{k+1}$$

wherein $v_{id}^{k}$ is the d-th component of the velocity vector of particle i at the k-th iteration; $x_{id}^{k}$ is the d-th component of the position vector of particle i at the k-th iteration; $c_1$ and $c_2$ are acceleration constants; $r_1$ and $r_2$ are two random parameters in the range [0, 1]; and $w$ is the inertia weight;

f. if the termination condition is not met, returning to step b; if the termination condition is met, ending the algorithm, the global best position being the global optimal solution;

and finally outputting the clustering result to obtain accurate homogeneous crowd data.