CN111460161A - Unsupervised text topic-related gene extraction method for unbalanced large datasets
- Publication number: CN111460161A
- Application number: CN202010255801.8A
- Authority: CN (China)
- Prior art keywords: sample, feature, category, matrix, sample set
- Prior art date: 2020-04-02
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
Abstract
The invention discloses an unsupervised text topic-related gene extraction method for unbalanced large datasets. Factor analysis and the density-peak algorithm are used to obtain the clusters of a high-dimensional sample set and to label the unlabeled samples; average local density and information entropy are used to improve the feature selection method based on the CHI ($\chi^2$) statistical matrix, strengthening the feature expressiveness of low-density, small-sample clusters; and a negentropy-based fast fixed-point algorithm analyzes the higher-order statistical correlations among multi-dimensional data, extracting independent latent topic feature genes and removing higher-order redundancy between components. The method needs no large-scale labeled samples for training and effectively avoids predefining sample category relationships and feature structures, while also avoiding the distortion that over-sampling or under-sampling would impose on the category distribution of the original unbalanced dataset. Revising the feature category structure improves the performance of the CHI statistical selection method, and effective feature dimensionality reduction is achieved while preserving the discriminative power of the sample set.
Description
Technical Field
The invention belongs to the technical field of data interpretation and topic discovery within natural language processing, and specifically relates to an unsupervised text topic-related gene extraction method for unbalanced large datasets.
Background Art
As society enters the era of "big data", people acquire ever more information through web pages, microblogs, forums and similar channels, while having ever less time to read and organize it. Efficiently and accurately analyzing the topics of information has therefore become an effective means of understanding big data and discovering its value, with applications spanning Internet public-opinion monitoring and early warning, filtering of harmful network content, and sentiment analysis. Processing data in these fields often means facing large volumes of high-dimensional data with redundant or irrelevant features, which greatly reduces the efficiency and performance of learning algorithms. Feature extraction, a crucial link in machine learning and data mining, therefore directly affects the efficiency and accuracy of model construction and analysis.
At present, feature extraction can be divided into supervised and unsupervised approaches according to the category information available. In text content analysis, whichever approach is taken, a Vector Space Model is needed to represent each text as a vector over a set of feature words, which inevitably raises two problems in practice:
① The distribution of sample categories (clusters) in the dataset is unbalanced, yet the metric functions used to evaluate feature-subset quality, whether independence-based correlation and similarity analysis, distance-based Euclidean and Mahalanobis distance, or the most widely used entropy-based mutual information and information gain, all assume that the sample categories (clusters) are identically or similarly distributed. As a result, most selected features come from the "large classes" that dominate in count (density), with few or none from the non-dominant "small classes". The supposedly most discriminative feature subset then fails to reflect the true information of the whole sample space, degrading the performance of subsequent learning methods on practical problems;
② "Big data" makes the objects to be processed ever more complex, and data dimensionality grows explosively. Ultra-high-dimensional datasets imply not only huge memory requirements but also high computational cost. In these high-dimensional feature spaces, strong correlations among the many feature points introduce large amounts of redundancy and even noise, so the generalization ability of features selected by traditional methods deteriorates sharply, and the "empty space" phenomenon of high-dimensional data makes multivariate density estimation very difficult. It thus becomes ever more important to extract the essential characteristics of things from complex surface information, that is, to find the mutually independent hidden latent information, remove higher-order redundancy, extract complete and independent topic-related gene data, and improve the generalization ability of feature items.
Summary of the Invention
The technical problem to be solved by the present invention is to address the above deficiencies of the prior art by providing an unsupervised text topic-related gene extraction method for unbalanced large datasets, one that effectively avoids predefining sample category relationships and feature structures and that overcomes the distortion of the original unbalanced dataset's category distribution caused by over-sampling or under-sampling.
The present invention adopts the following technical solution:
An unsupervised text topic-related gene extraction method for unbalanced large datasets comprises the following steps:
S1. Apply factor analysis to reduce the dimensionality of the high-dimensional samples in the unlabeled sample set, and output the feature index matrix of the sample set;
S2. For each sample expressed by the common factors, compute its local density and its distance to points of higher local density, draw a decision graph, and apply the clustering-by-fast-search-and-find-of-density-peaks algorithm to perform exploratory clustering on the dimensionality-reduced sample set, obtaining $C$ cluster partitions of the $n$ samples, and output the cluster partition of the sample set;
S3. Improve the $\chi^2$ statistic with information entropy and average local density, and build a sample feature distribution matrix based on the weighted $\chi^2$ statistic: weight the features and sample categories in the sample set's $\chi^2$ statistic, use the weighted $\chi^2$ statistic to construct a new statistical matrix representing the weighted probability distribution of features across different categories and within the same category, and perform feature selection to obtain the feature subset $T=\{t_1,t_2,\ldots,t_p\}$;
S4. Use the negentropy-based fast fixed-point algorithm to analyze the higher-order statistical correlations among the data in the multi-dimensional feature subset $T=\{t_1,t_2,\ldots,t_p\}$, extract independent feature genes, and remove higher-order redundancy between components.
Specifically, step S1 comprises:
S101. Let the sample set $X$ contain $n$ samples $x_1,x_2,\ldots,x_n$, each sample $x_i$ consisting of $m$ feature indicators, written $X=(x_{ij})_{n\times m}=(X_1,X_2,\ldots,X_m)$. Apply the KMO test to the correlation between samples; if the KMO statistic is greater than 0.5, go to step S102, otherwise go to step S106;
S102. Compute the eigenvalues and eigenvectors of the covariance matrix $\Sigma=(h_{ij})_{m\times m}$ of $X_1,X_2,\ldots,X_m$, and determine the number of common factors from the percentage of the total eigenvalue sum accounted for by the leading eigenvalues;
S103. Compute the factor loading matrix; if the loadings of each factor on the different feature indicators show no obvious differences, go to step S104, otherwise go to step S105;
S104. Rotate the factor loading matrix using the orthogonal rotation method;
S105. Evaluate the loadings of each feature indicator on the corresponding common factors in the factor loading matrix, retaining the maximum loading value;
S106. Output the feature index matrix of the sample set $X$.
Further, in step S106, each sample $x_i$ consists of $u$ feature index factors, giving a finite sample set $X_\Delta$; the feature index matrix of the $n$ samples is $X^*=(x^*_{ij})_{n\times u}$, where $x^*_{ij}$ denotes the $j$-th feature index factor of the $i$-th sample, $i=1,2,\ldots,n$; $j=1,2,\ldots,u$.
Specifically, step S2 comprises:
S201. Use the adjusted cosine similarity between samples to define the distance variable $d_{ij}$, computing the similarity $\mathrm{Sim}(i,j)$ between any two data points $x^*_i$ and $x^*_j$;
S202. Choose a suitable cutoff distance, and use it to compute the local density $\rho_i$ of every data point in $X^*$ and the distance $\delta_i$ from that point to points of higher local density;
S203. From the local densities of all sample points and their distances to points of higher local density, draw the decision graph with $\rho_i$ on the horizontal axis and $\delta_i$ on the vertical axis;
S204. Use the decision graph to mark the cluster centers and noise points of the sample set $X^*$;
S205. Assign the remaining points to obtain $C$ cluster partitions of the $n$ samples, and output the cluster partition of the sample set as the basis for the next stage of analysis.
Further, in step S201, the similarity $\mathrm{Sim}(i,j)$ between samples $x^*_i$ and $x^*_j$ is defined as their adjusted cosine similarity, where $i,j=1,2,\ldots,n$ and $u$ is the number of attributes of an object; the cutoff distance $d_c$ is chosen so that, taking data point $x_i$ as center and $d_c$ as radius, the cumulative count $\rho_i$ satisfies $|X|\times 2\%$.
Specifically, step S3 comprises:
S301. Weight the sample set's $\chi^2$ statistic using the information entropy of features and sample categories (clusters);
S302. Use the weighted $\chi^2$ statistic to construct a new statistical matrix $K$, whose rows and columns represent the weighted probability distributions of features across different categories (clusters) and within the same category (cluster), respectively;
S303. Select each row $t_i$ of the statistical matrix $K$ in turn, and find the maximum and minimum values in each row;
S304. Convert $t_i$ into the corresponding membership degrees $\mu_{ij}$ and construct a new category vector $b_i$, whose components $b_{ij}$ are the membership degrees $\mu_{ij}$ of $t_i$ arranged in descending order;
S305. Compute the total contribution of feature $t_i$ to all categories;
S306. Compute the cumulative variance contribution rate;
S307. Repeat steps S303 to S306; when the cumulative variance contribution rate reaches the required threshold, obtain the feature subset $T=\{t_1,t_2,\ldots,t_p\}$.
Further, in step S301, the feature $t$ and the sample category $c_i$ are weighted within the $\chi^2$ statistic, the weighted statistic being denoted $W\chi^2(t,c_i)$; the weight is defined as the information entropy of feature $t$ with respect to sample category $c_i$. Here $p(t|c_i)$ is the probability of feature $t$ occurring in sample category $c_i$, $p(c_i)$ is the probability of category $c_i$, $p(t,c_i)$ is the probability of feature $t$ occurring in category $c_i$, the average local density of the sample points in category $c_i$ is computed over $c_i.\mathrm{rep}$, the sample points of cluster $c_i$, and $C=\{c_1,c_2,\ldots,c_k\}$ denotes the set of sample categories.
Further, in step S302, the statistical matrix $K$ is formed from the weighted values $W\chi^2(t_i,c_j)$, its rows and columns representing the weighted probability distributions of features across different categories and within the same category, respectively.
Specifically, step S4 comprises:
S401. Center the feature subset $T=\{t_1,t_2,\ldots,t_p\}$ so that its mean is 0;
S402. Whiten the centered feature subset to obtain $z$;
S403. Choose the number $m$ of independent components to estimate, and set $i=1$;
S404. Choose an initial vector $w_i$ of unit norm (it may be chosen at random);
S405. Update $w_i \leftarrow E\{z\,g(w_i^{\mathrm T}z)\}-E\{g'(w_i^{\mathrm T}z)\}\,w_i$, where the function $g$ is the derivative of a non-quadratic function $G$;
S406. Normalize $w_i$: $w_i \leftarrow w_i/\|w_i\|$;
S407. If not yet converged, return to step S405;
S408. Set $i\leftarrow i+1$; if $i\le m$, return to step S404.
Compared with the prior art, the present invention has at least the following beneficial effects:
The unsupervised text topic-related gene extraction method for unbalanced large datasets of the present invention needs no large-scale labeled samples for training and effectively avoids predefining sample category relationships and feature structures, which makes it more practical: most samples obtained by crawling carry no category labels, so traditional supervised topic-discovery methods are hard to apply effectively. Being based on unsupervised feature extraction, the present invention has no such limitation. It also overcomes the distortion of the original unbalanced dataset's category distribution caused by over-sampling or under-sampling: by revising the feature category structure, it accurately reflects the true information of the sample space and generalizes better on unbalanced large datasets. Furthermore, the invention achieves effective feature dimensionality reduction while preserving the discriminative power of the sample set, further reducing noise-word interference, weakening the "empty space" phenomenon of high-dimensional data, and lowering the uncertainty of sample analysis.
Further, factor analysis is used to find the optimal low-dimensional basis describing the original high-dimensional vector space, making it feasible for the density-peak algorithm to quickly find sample clusters in large-scale datasets.
Further, a density-peak clustering algorithm guided by the neighborhood similarity of sample points is used to cluster and automatically label the unlabeled text set.
Further, average local density and information entropy are introduced into the definition of feature-item weights to construct a matrix of how well feature items discriminate between sample categories (clusters), eliminating the defects of traditional methods when selecting features on unbalanced sample sets.
Further, independent component analysis (ICA) is adopted to analyze the higher-order correlations among multi-dimensional statistical data and find mutually independent hidden information components, so that on unbalanced large datasets an optimal feature subset that comprehensively and truly reflects the text topic information can be selected more accurately, improving text classification and recognition performance.
In summary, the present invention focuses on unsupervised text feature extraction, studying how to select a stable subset of text topic-related genes with strong generalization ability, thereby reducing the feature dimensionality of the vector space, enhancing the category (cluster) representation ability of feature words, and improving classification and recognition.
The technical solution of the present invention is described in further detail below with reference to the accompanying drawings and embodiments.
Brief Description of the Drawings
Fig. 1 is the overall flow chart of the unsupervised text topic-related gene extraction method for unbalanced large datasets of the present invention;
Fig. 2 is a flow chart of the sample feature analysis process;
Fig. 3 is a flow chart of the sample clustering process;
Fig. 4 is a flow chart of the feature selection process;
Fig. 5 is a flow chart of the topic gene extraction process;
Fig. 6 shows the normalized mutual information values of the algorithms under different numbers of selected features, where (a) is the normalized mutual information (%) of each algorithm on the Sohu News (SogouCS) 20151022 corpus and (b) is the normalized mutual information (%) of each algorithm on the Reuters-21578 corpus.
Detailed Description of the Embodiments
The present invention provides an unsupervised text topic-related gene extraction method for unbalanced large datasets. Factor analysis and the density-peak algorithm are used to obtain the clusters of a high-dimensional sample set and thereby label the unlabeled samples; average local density and information entropy are used to improve the CHI-statistical-matrix-based feature selection method, strengthening the feature expressiveness of low-density, small-sample clusters; and the negentropy-based fast fixed-point algorithm (FastICA) analyzes the higher-order statistical correlations among multi-dimensional data to extract independent latent topic feature genes and remove higher-order redundancy between components. The method neither requires large-scale labeled samples for training, effectively avoiding predefined sample category relationships and feature structures, nor distorts the category distribution of the original unbalanced dataset as over-sampling or under-sampling would. Revising the feature category structure greatly improves the performance of CHI statistical selection, and effective feature dimensionality reduction is achieved while preserving the discriminative power of the sample set.
Referring to Fig. 1, the unsupervised text topic-related gene extraction method for unbalanced large datasets of the present invention comprises the following steps:
S1. Apply factor analysis to reduce the dimensionality of the high-dimensional samples in the unlabeled sample set, and output the feature index matrix of the sample set.
Factor analysis is performed on the original feature variables of the sample set, and a small number of "abstract" variables (the common factors) are selected to replace them, reducing both the correlation among sample features and the dimensionality. The procedure is shown in Fig. 2:
S101. Apply the KMO test to the correlation between samples; if the KMO statistic is greater than 0.5, go to step S102, otherwise go to step S106.
Let the sample set $X$ contain $n$ samples $x_1,x_2,\ldots,x_n$, each sample $x_i$ consisting of $m$ feature indicators, written $X=(x_{ij})_{n\times m}=(X_1,X_2,\ldots,X_m)$.
The KMO (Kaiser-Meyer-Olkin) test judges the degree of correlation among $X_1,X_2,\ldots,X_m$ to determine whether factor analysis is warranted. The closer the KMO statistic is to 0, the weaker the correlation among $X_1,X_2,\ldots,X_m$; the closer it is to 1, the stronger the correlation.
In general, factor analysis is of practical value when the KMO statistic exceeds 0.5.
S102. Compute the eigenvalues and eigenvectors of the covariance matrix $\Sigma=(h_{ij})_{m\times m}$ of $X_1,X_2,\ldots,X_m$, and determine the number of common factors from the percentage of the total eigenvalue sum accounted for by the leading eigenvalues.
From the characteristic equation $|\Sigma-\lambda I|=0$, the eigenvalues of the covariance matrix are $\lambda_1\ge\lambda_2\ge\cdots\ge\lambda_p\ge 0$, with corresponding unit eigenvectors $T_1,T_2,\ldots,T_p$.
Following the usual practice, the first $u$ eigenvalues and eigenvectors are taken such that their eigenvalue sum accounts for more than 85% of the total, which determines the number of common factors.
S103. Compute the factor loading matrix; if the loadings of each factor on the different feature indicators show no obvious differences, go to step S104, otherwise go to step S105.
The factor loading matrix is computed from the eigenvalues and eigenvectors of $\Sigma$ as $A=\left(\sqrt{\lambda_1}T_1,\sqrt{\lambda_2}T_2,\ldots,\sqrt{\lambda_u}T_u\right)$.
S104. Rotate the factor loading matrix using the orthogonal rotation method.
If the loadings of each factor on the different feature indicators show no obvious differences, the factor loading matrix must be rotated; the orthogonal rotation method is usually used, yielding the rotated factor loading matrix $A'$.
Applying the operation $b_{ip}=\max\{b_{i1},b_{i2},\ldots,b_{iu}\}$, $i=1,2,\ldots,m$, $p\in\{1,2,\ldots,u\}$ to the row vectors of the rotated matrix $A'$ retains, for each feature indicator $X_i$, its maximum loading value $b_{ip}$ among the $u$ factors, giving the matrix
$A^*=(b'_{ij})_{m\times u}$,
where $i=1,2,\ldots,m$; $j=1,2,\ldots,u$.
S105. Evaluate the loadings of each feature indicator on the corresponding common factors in the factor loading matrix, retaining the maximum loading value.
S106. Output the feature index matrix of the sample set as the basis for the next stage of analysis.
Through the above operations the sample set $X$ is simplified to a finite sample set $X_\Delta$ of $n$ samples, each sample $x_i$ consisting of $u$ feature index factors, from which the feature index matrix of the $n$ samples is constructed as $X^*=(x^*_{ij})_{n\times u}$, where $x^*_{ij}$ denotes the $j$-th feature index factor of the $i$-th sample, $i=1,2,\ldots,n$; $j=1,2,\ldots,u$.
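As an illustration of steps S101 to S106, the following Python sketch computes the KMO statistic from partial correlations, keeps enough leading eigenvalues to cover 85% of the total, and returns the samples in factor coordinates together with the loading matrix. It is a minimal sketch under the description above, not the patented implementation: the rotation of step S104 and the max-loading screening of step S105 are omitted, and all function and variable names are illustrative assumptions.

```python
import numpy as np

def kmo_statistic(X):
    """Kaiser-Meyer-Olkin measure of sampling adequacy (step S101)."""
    R = np.corrcoef(X, rowvar=False)            # correlation matrix of the indicators
    R_inv = np.linalg.pinv(R)
    d = np.sqrt(np.diag(R_inv))
    P = -R_inv / np.outer(d, d)                 # partial correlation matrix
    np.fill_diagonal(P, 0.0)
    np.fill_diagonal(R, 0.0)
    return (R ** 2).sum() / ((R ** 2).sum() + (P ** 2).sum())

def factor_index_matrix(X, var_ratio=0.85):
    """Steps S102-S106: keep enough common factors to explain `var_ratio`
    of the total variance; return factor scores and the loading matrix."""
    if kmo_statistic(X) <= 0.5:                 # S101: factor analysis not warranted
        return X, None
    Sigma = np.cov(X, rowvar=False)             # S102: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    order = np.argsort(eigvals)[::-1]           # eigenvalues in descending order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    u = int(np.searchsorted(np.cumsum(eigvals) / eigvals.sum(), var_ratio)) + 1
    A = eigvecs[:, :u] * np.sqrt(eigvals[:u])   # S103: factor loading matrix
    X_star = (X - X.mean(axis=0)) @ eigvecs[:, :u]   # S106: feature index matrix
    return X_star, A
```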
S2. Apply the clustering-by-fast-search-and-find-of-density-peaks algorithm to perform exploratory clustering on the dimensionality-reduced sample set.
For each sample expressed by the common factors, its local density and its distance to points of higher local density are computed, the decision graph is drawn, and the cluster partition of the sample set is generated. The procedure is shown in Fig. 3:
S201. Compute the similarity between any two row vectors of the sample set's feature index matrix.
The similarity between samples is used to define the distance variable $d_{ij}$: the similarity $\mathrm{Sim}(i,j)$ between any two data points $x^*_i$ and $x^*_j$ is their adjusted cosine similarity, where $i,j=1,2,\ldots,n$ and $u$ is the number of attributes of an object; the cutoff distance $d_c$ is chosen so that, taking data point $x_i$ as center and $d_c$ as radius, the cumulative count $\rho_i$ satisfies $|X|\times 2\%$.
S202. Choose a suitable cutoff distance, and use it to compute the local density $\rho_i$ of every data point in $X^*$ and the distance $\delta_i$ from that point to points of higher local density.
The local density of data point $x_i$ is $\rho_i=\sum_{j\ne i}\chi(d_{ij}-d_c)$, with $\chi(x)=1$ for $x<0$ and $\chi(x)=0$ otherwise, and the distance from $x_i$ to the nearest data point $x_j$ of higher local density is $\delta_i=\min_{j:\rho_j>\rho_i}d_{ij}$, where $d_{ij}$ is the distance between distinct data points and $d_c$ is the cutoff distance (a hyperparameter).
S203. From the local densities of all sample points and their distances to points of higher local density, draw the decision graph with $\rho_i$ on the horizontal axis and $\delta_i$ on the vertical axis.
S204. Use the decision graph to mark the cluster centers and noise points of the sample set $X^*$.
Compute $\rho_i$ and $\delta_i$ for every data point $x_i$; after sorting, mark the $C$ data points with the largest $\rho_i$ and $\delta_i$ as cluster centers, and assign each remaining data point to the cluster of its nearest neighbor of higher density.
S205. Assign the remaining points to obtain $C$ cluster partitions of the $n$ samples, and output the cluster partition of the sample set as the basis for the next stage of analysis.
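The clustering of steps S201 to S205 can be condensed into the sketch below, which assumes a precomputed distance matrix D (for instance derived from the adjusted cosine similarity above). Instead of reading cluster centers off the decision graph by hand, it takes the C points with the largest ρ·δ products; all identifiers are illustrative, and the 2% rule for d_c follows the description.

```python
import numpy as np

def density_peak_clusters(D, n_clusters, pct=0.02):
    """Steps S201-S205 on a symmetric distance matrix D of shape (n, n)."""
    n = D.shape[0]
    dc = np.sort(D.ravel())[int(pct * n * n)]        # S202: cutoff distance d_c
    rho = (D < dc).sum(axis=1) - 1                   # local density rho_i
    delta = np.zeros(n)                              # distance to nearest denser point
    nearest_denser = np.full(n, -1)
    for i in range(n):
        denser = np.where(rho > rho[i])[0]
        if denser.size == 0:                         # the globally densest point
            delta[i] = D[i].max()
        else:
            j = denser[np.argmin(D[i, denser])]
            delta[i], nearest_denser[i] = D[i, j], j
    # S203/S204: in place of the visual decision graph, take the points with
    # the largest rho*delta as centres (the densest point typically ranks on top)
    centers = np.argsort(rho * delta)[::-1][:n_clusters]
    labels = np.full(n, -1)
    labels[centers] = np.arange(n_clusters)
    # S205: assign remaining points, in decreasing density, to the cluster of
    # their nearest neighbour of higher density
    for i in np.argsort(rho)[::-1]:
        if labels[i] == -1:
            labels[i] = labels[nearest_denser[i]]
    return labels, rho, delta
```

Plotting delta against rho for all points reproduces the decision graph of step S203, on which points with high δ but low ρ stand out as noise candidates.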
S3. Improve the $\chi^2$ statistic with information entropy and average local density, and build a sample feature distribution matrix based on the weighted $\chi^2$ statistic. The features and sample categories (clusters) in the sample set's $\chi^2$ statistic are weighted, the weight being defined as the information entropy of the feature and the sample category (cluster); the weighted $\chi^2$ statistic is used to construct a new statistical matrix representing the weighted probability distribution of features across different categories (clusters) and within the same category (cluster), on which feature selection is then based.
The procedure is shown in Fig. 4:
S301. Weight the sample set's $\chi^2$ statistic using the information entropy of features and sample categories (clusters).
In the $\chi^2$ statistic, the feature $t$ and the sample category (cluster) $c_i$ are weighted, and the weighted statistic is denoted $W\chi^2(t,c_i)$; the weight is defined as the information entropy of feature $t$ with respect to sample category (cluster) $c_i$.
Here $p(t|c_i)$ is the probability of feature $t$ occurring in sample category (cluster) $c_i$, $p(c_i)$ is the probability of category (cluster) $c_i$, $p(t,c_i)$ is the probability of feature $t$ occurring in category (cluster) $c_i$, the average local density of the sample points in category (cluster) $c_i$ is computed over $c_i.\mathrm{rep}$, the sample points of cluster $c_i$, and $C=\{c_1,c_2,\ldots,c_k\}$ denotes the set of sample categories (clusters).
S302. Use the weighted $\chi^2$ statistic to construct a new statistical matrix $K$ whose entries are the weighted values $W\chi^2(t_i,c_j)$, the rows and columns representing the weighted probability distributions of features across different categories (clusters) and within the same category (cluster), respectively.
S303. Select each row $t_i$ of the statistical matrix $K$ in turn, and find the maximum and minimum values in each row.
S304. Convert $t_i$ into the corresponding membership degrees $\mu_{ij}$ and construct a new category (cluster) vector $b_i$, whose components $b_{ij}$ are the membership degrees $\mu_{ij}$ of $t_i$ arranged in descending order.
S305. Compute the total contribution provided by feature $t_i$ to all categories.
S306. Compute the cumulative variance contribution rate.
S307. Repeat steps S303 to S306; when the cumulative variance contribution rate reaches the required threshold, obtain the feature subset $T=\{t_1,t_2,\ldots,t_p\}$.
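Because the exact weighted expression for $W\chi^2(t,c_i)$ is not reproduced in this text, the sketch below is only one plausible instantiation of steps S301 and S302: it computes the standard per-class $\chi^2$ score of every feature, weights it with an entropy term, and divides by the average local density of the class, so that low-density, small-sample clusters are strengthened as the description requires. The exact combination used by the invention may differ.

```python
import numpy as np

def weighted_chi2_matrix(F, labels, rho_avg):
    """Assumed sketch of steps S301-S302.

    F       : binary document-feature matrix, shape (n_docs, n_feats)
    labels  : cluster label of each document, values 0..k-1
    rho_avg : average local density of each cluster, shape (k,)
    Returns the statistical matrix K of shape (n_feats, k).
    """
    n, m = F.shape
    k = int(labels.max()) + 1
    K = np.zeros((m, k))
    for c in range(k):
        in_c = labels == c
        A = F[in_c].sum(axis=0)              # t present, class c
        B = F[~in_c].sum(axis=0)             # t present, other classes
        C_ = in_c.sum() - A                  # t absent, class c
        D_ = (~in_c).sum() - B               # t absent, other classes
        chi2 = n * (A * D_ - C_ * B) ** 2 / (
            (A + C_) * (B + D_) * (A + B) * (C_ + D_) + 1e-12)
        p_tc = (A + 1e-12) / n               # p(t, c)
        entropy_w = -p_tc * np.log2(p_tc)    # entropy-style weight (assumption)
        K[:, c] = chi2 * entropy_w / rho_avg[c]   # boost low-density clusters
    return K
```

Rows of K can then be converted to membership degrees and accumulated until the cumulative contribution threshold of step S307 is met.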
S4. Use the negentropy-based fast fixed-point algorithm (FastICA) to analyze the higher-order statistical correlations among the data in the multi-dimensional feature subset $T=\{t_1,t_2,\ldots,t_p\}$, extract independent feature genes, and remove higher-order redundancy between components.
Referring to Fig. 5, the topic-related gene extraction procedure is as follows:
S401. Center the feature subset $T=\{t_1,t_2,\ldots,t_p\}$ so that its mean is 0;
S402. Whiten the centered feature subset to obtain $z$;
S403. Choose the number $m$ of independent components to estimate, and set $i=1$;
S404. Choose an initial vector $w_i$ of unit norm (it may be chosen at random);
S405. Update $w_i \leftarrow E\{z\,g(w_i^{\mathrm T}z)\}-E\{g'(w_i^{\mathrm T}z)\}\,w_i$, where the function $g$ is the derivative of a non-quadratic function $G$;
S406. Normalize $w_i$: $w_i \leftarrow w_i/\|w_i\|$;
S407. If not yet converged, return to step S405;
S408. Set $i\leftarrow i+1$; if $i\le m$, return to step S404.
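A self-contained sketch of the fixed-point loop in steps S401 to S408 follows, assuming the common contrast function $G(y)=\log\cosh(y)$, so that $g=\tanh$ and $g'=1-\tanh^2$. The Gram-Schmidt deflation line is a standard addition for estimating several components that the steps above leave implicit; all identifiers are illustrative.

```python
import numpy as np

def fastica_negentropy(T, m, max_iter=200, tol=1e-6):
    """Steps S401-S408 on a feature subset T of shape (p, n):
    p selected features (rows), n samples (columns)."""
    X = T - T.mean(axis=1, keepdims=True)            # S401: centre each feature
    d, E = np.linalg.eigh(np.cov(X))                 # S402: whitening transform
    Z = (E / np.sqrt(d)).T @ X                       # whitened data z
    p = Z.shape[0]
    W = np.zeros((m, p))
    for i in range(m):                               # S403/S408: one unit at a time
        w = np.random.randn(p)                       # S404: random initial vector
        w /= np.linalg.norm(w)
        for _ in range(max_iter):
            wx = w @ Z
            g, g_prime = np.tanh(wx), 1.0 - np.tanh(wx) ** 2
            w_new = (Z * g).mean(axis=1) - g_prime.mean() * w   # S405: update
            w_new -= W[:i].T @ (W[:i] @ w_new)       # deflation (not in S404-S408)
            w_new /= np.linalg.norm(w_new)           # S406: normalise w_i
            converged = abs(abs(w_new @ w) - 1.0) < tol   # S407: convergence test
            w = w_new
            if converged:
                break
        W[i] = w
    return W @ Z                                     # independent components
```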
Extracting topic-related genes is a key step in preprocessing industrial and social big data. By discovering mutually independent hidden information components, an optimal feature subset that comprehensively and truly reflects the text topic information can be selected more accurately from unbalanced large datasets, significantly improving the recognition performance of classifiers.
To make the objectives, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, embodiments of the present invention. The components of the embodiments, as generally described and illustrated in the drawings, may be arranged and designed in many different configurations. The following detailed description of the embodiments provided in the drawings is therefore not intended to limit the scope of the claimed invention but merely represents selected embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
To further test the practical effect of the present invention, simulation experiments were conducted on two corpora, Sohu News (SogouCS) 20151022 and Reuters-21578, using the k-means clustering algorithm; the effectiveness of each algorithm is measured by the normalized mutual information between its clustering result and the original category information. Since k-means requires the number of clusters to be given explicitly, to reduce the influence of the choice of K, the number of clusters K for the method of the present invention and for the comparison methods is set to the number of categories in each data label, i.e., 20, 10, 12. Figs. 6(a) and 6(b) show the normalized mutual information values of each algorithm under different numbers of selected features.
As Figs. 6(a) and 6(b) show, the unsupervised text topic-related gene extraction method for unbalanced large datasets proposed by the present invention has clear advantages over the other four algorithms; in both figures the method of the present invention quickly reaches a good result with only a small number of features, so unsupervised feature selection with the proposed algorithm performs better than ordinary unsupervised feature-selection algorithms.
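The evaluation protocol just described can be reproduced with scikit-learn along the following lines; the function name and its arguments are illustrative, and k is fixed to the number of labelled categories as in the experiments.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def evaluate_feature_subset(X_selected, true_labels, k, seed=0):
    """Cluster the selected features with k-means and score the result
    against the original category labels by normalized mutual information."""
    pred = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X_selected)
    return normalized_mutual_info_score(true_labels, pred)
```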
In summary, the unsupervised text topic-related gene extraction method for unbalanced large datasets of the present invention requires no large-scale labeled samples for training, avoids predefined category relationships and related features, and overcomes the weak model generalization caused by unbalanced sample category distributions. Building on the clustering-by-fast-search-and-find-of-density-peaks method for text, a text feature distribution matrix of weighted $\chi^2$ statistics is constructed using information entropy, avoiding the changes that over-sampling or under-sampling would impose on the category distribution of the original unbalanced dataset; revising the feature category distribution greatly improves the performance of CHI statistical selection. Finally, the negentropy-based fast fixed-point algorithm (FastICA) extracts independent hidden information components from the multi-dimensional data; the generalization of the resulting feature subset is superior to RSR, FSFC, UFS-MI and RUFS, achieving feature dimensionality reduction while preserving the discriminative power of the dataset.
Moreover, for the industrial and social big data brought by informatization, feature dimensionality reduction is a key preprocessing step; the idea of text topic-related gene extraction proposed by the present invention will play an ever more important role in these big-data fields, and how to better adapt it to the data-processing needs of these fields is future research work.
The above content merely illustrates the technical idea of the present invention and cannot limit its protection scope; any change made on the basis of the technical solution according to the technical idea proposed by the present invention falls within the protection scope of the claims of the present invention.
Claims (9)
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010255801.8A | 2020-04-02 | 2020-04-02 | Unsupervised text topic-related gene extraction method for unbalanced large datasets |
Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN111460161A | 2020-07-28 |
Family

ID=71684436

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202010255801.8A | Unsupervised text topic-related gene extraction method for unbalanced large datasets | 2020-04-02 | 2020-04-02 |

Country Status (1)

| Country | Link |
|---|---|
| CN (1) | CN111460161A (en), filed 2020-04-02, status Pending |
Cited By (8)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN112182164A * | 2020-10-16 | 2021-01-05 | 上海明略人工智能(集团)有限公司 | High-dimensional data feature processing method and system |
| CN112182164B | 2020-10-16 | 2024-02-23 | 上海明略人工智能(集团)有限公司 | High-dimensional data feature processing method and system |
| CN112907035A * | 2021-01-27 | 2021-06-04 | 厦门卫星定位应用股份有限公司 | K-means-based transportation subject credit rating method and device |
| CN112907035B | 2021-01-27 | 2022-08-05 | 厦门卫星定位应用股份有限公司 | K-means-based transportation subject credit rating method and device |
| CN114124536A * | 2021-11-24 | 2022-03-01 | 四川九洲电器集团有限责任公司 | Multi-station detection signal tracing method |
| CN115952432A * | 2022-12-21 | 2023-04-11 | 四川大学华西医院 | Unsupervised clustering method based on diabetes data |
| CN115952432B | 2022-12-21 | 2024-03-12 | 四川大学华西医院 | Unsupervised clustering method based on diabetes data |
| CN119377407A * | 2024-12-25 | 2025-01-28 | 深圳市迪博企业风险管理技术有限公司 | A method for automatic screening of large model training data |
Similar Documents

| Publication | Title |
|---|---|
| WO2022126810A1 | Text clustering method |
| CN111460161A | Unsupervised text topic-related gene extraction method for unbalanced large datasets |
| Li et al. | Using discriminant analysis for multi-class classification: an experimental investigation |
| CN104915386B | A kind of short text clustering method based on deep semantic feature learning |
| CN109189925A | Term vector model based on mutual information and based on the file classification method of CNN |
| CN102663100A | Two-stage hybrid particle swarm optimization clustering method |
| Seng et al. | Big feature data analytics: Split and combine linear discriminant analysis (SC-LDA) for integration towards decision making analytics |
| Malekipirbazari et al. | Performance comparison of feature selection and extraction methods with random instance selection |
| Gothai et al. | Map-reduce based distance weighted k-nearest neighbor machine learning algorithm for big data applications |
| Wang et al. | Distance variance score: an efficient feature selection method in text classification |
| CN110334777A | A weighted multi-view unsupervised attribute selection method |
| CN107154923A | A kind of network inbreak detection method based on the very fast learning machine of multilayer |
| CN115410199A | Image content retrieval method, device, equipment and storage medium |
| CN110348497B | Text representation method constructed based on WT-GloVe word vector |
| CN114281994B | A text clustering integration method and system based on three-layer weighted model |
| CN113626604B | Web Page Text Classification System Based on Maximum Spacing Criterion |
| Wang et al. | An improved K_means algorithm for document clustering based on knowledge graphs |
| Zhu | Classification Research Based on Quantitative Expansion of Short Text Feature Correlation |
| CN105760471B | Two-class text classification method based on combined convex linear perceptron |
| Mehrotra et al. | To identify the usage of clustering techniques for improving search result of a website |
| Balafar et al. | Active learning for constrained document clustering with uncertainty region |
| CN111382273A | Text classification method based on feature selection of attraction factors |
| Yu et al. | Improved Logistic Regression Algorithm Based on Kernel Density Estimation for Multi-Classification with Non-Equilibrium Samples |
| CN111914108A | A Discrete Supervised Cross-modal Hash Retrieval Method Based on Semantic Preservation |
| Jing-Ming et al. | Unsupervised Text Topic-Related Gene Extraction for Large Unbalanced Datasets |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | RJ01 | Rejection of invention patent application after publication | Application publication date: 20200728 |