CN111460161A - Unsupervised text theme related gene extraction method for unbalanced big data set - Google Patents
- Publication number
- CN111460161A (application CN202010255801.8A)
- Authority
- CN
- China
- Prior art keywords
- sample
- characteristic
- matrix
- class
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
Abstract
The invention discloses an unsupervised text topic-related gene extraction method for unbalanced large data sets. Factor analysis and a density peak algorithm obtain the clusters of a high-dimensional sample set and label the unlabeled samples; average local density and information entropy improve the CHI-statistic-matrix-based feature selection method, strengthening the feature representation of low-density, small-sample clusters; and a negentropy-based fast fixed-point algorithm analyzes the higher-order statistical correlations among the multidimensional data, extracting independent latent topic feature genes and removing the higher-order redundancy between components. No large-scale labeled samples are needed for training, predefining sample-class relations and feature structures is effectively avoided, and the influence of over-sampling or under-sampling methods on the class distribution of the original unbalanced data set is overcome. Correcting the feature-class structure improves the performance of CHI-statistic feature selection, and effective feature dimension reduction is achieved while the discriminative power of the sample set is preserved.
Description
Technical Field
The invention belongs to the technical field of data interpretation and topic discovery in natural language processing, and in particular relates to an unsupervised text topic-related gene extraction method for unbalanced large data sets.
Background
As society enters the "big data" era, people obtain ever more information through web pages, microblogs, forums, and similar channels, while having ever less time to read and organize it. Efficient, accurate analysis of the topics in this information has therefore become an effective means of understanding big data and discovering its value, with applications covering Internet public-opinion monitoring and early warning, harmful-content filtering, sentiment analysis, and many other areas. Processing data in these fields often means facing large volumes of high-dimensional data with redundant or irrelevant features, which sharply reduces the efficiency and performance of learning algorithms. Feature extraction is therefore a crucial link in machine learning and data mining, directly affecting the efficiency and accuracy of model construction and analysis.
Currently, feature extraction methods can be classified as supervised or unsupervised according to whether category information is available. In text content analysis, whichever type is adopted, a Vector Space Model must be used to represent each text as a vector over a certain number of feature words, so two problems inevitably arise in practical application:
① The distribution of the sample categories (clusters) in the data set is unbalanced. The measurement functions used to evaluate feature-subset quality, whether independence-based correlation and similarity analysis, distance-based Euclidean and Mahalanobis distance, or even the currently most widely used information-entropy methods such as mutual information and information gain, all adopt the consistency assumption that sample categories (clusters) are identically or similarly distributed in the data set. As a result, most of the selected features come from the "major" classes whose count (density) dominates, and few or none come from the "minor" classes, so even the feature subset with the highest discrimination cannot accurately reflect the true information of the whole sample space, and the performance of subsequent learning methods on practical problems is reduced;
② The objects to be processed are increasingly complex, and the data dimensionality grows explosively. Facing an ultra-high-dimensional data set means not only a huge memory requirement but also a high computational cost. In these high-dimensional feature spaces, strong correlations among many feature points introduce substantial redundancy and even noise, so the generalization ability of the feature items selected by traditional methods deteriorates sharply; the "empty space" phenomenon of high-dimensional data spaces also makes multivariate density estimation very difficult.
Disclosure of Invention
The technical problem to be solved by the present invention is to provide an unsupervised text topic-related gene extraction method for large unbalanced data sets that effectively avoids predefining sample-class relations and feature structures, and that overcomes the influence of over-sampling or under-sampling methods on the class distribution of the original unbalanced data set.
The invention adopts the following technical scheme:
The unsupervised text topic-related gene extraction method for unbalanced large data sets comprises the following steps:
S1, performing dimensionality reduction on the high-dimensional samples in the unlabeled sample set by factor analysis, and outputting the characteristic index matrix of the sample set;
S2, analyzing, for each sample expressed by the common factors, its local density and its distance to points of higher local density, drawing a decision graph, performing exploratory clustering on the reduced-dimension sample set with the fast search and find of density peaks algorithm, obtaining the C cluster partitions of the n samples, and outputting the cluster partitions of the sample set;
S3, improving the χ² statistic with information entropy and average local density, and constructing a sample feature distribution matrix based on the weighted χ² statistic: the features and sample classes of the sample set's χ² statistics are weighted, a new statistical matrix representing the weighted probability distributions of the features across different classes and within the same class is constructed from the weighted χ² statistics, and feature selection yields the feature subset T = {t1, t2, …, tp};
S4, analyzing the higher-order statistical correlations among the data in the multidimensional feature subset T = {t1, t2, …, tp} with the negentropy-based fast fixed-point algorithm, extracting independent feature genes, and completing the removal of higher-order redundancy between components.
Specifically, step S1 includes:
S101, letting the sample set X contain n samples x1, x2, …, xn, where each sample xi consists of m characteristic indexes, written X = (xij)n×m = (X1, X2, …, Xm); performing the KMO test on the degree of correlation between samples, jumping to step S102 when the KMO statistic is greater than 0.5, otherwise jumping to step S106;
S102, calculating the covariance matrix Σ = (hij)m×m of X1, X2, …, Xm, and determining the number of common factors from the percentage of the sum of the leading characteristic roots in the sum of all characteristic roots;
S103, calculating the factor load matrix, jumping to step S104 when the loads of each factor on the different characteristic indexes do not differ obviously, otherwise jumping to step S105;
S104, rotating the factor load matrix by an orthogonal rotation method;
S105, evaluating the loads of the characteristic indexes on their corresponding common factors in the factor load matrix and retaining the maximum load value;
S106, outputting the characteristic index matrix of the sample set X.
Further, in step S106, each sample xi is composed of u characteristic index factors, forming a finite sample set XΔ; the characteristic index matrix of the n samples is specifically XΔ = (xΔij)n×u, where xΔij denotes the j-th characteristic index factor of the i-th sample, i = 1, 2, …, n; j = 1, 2, …, u.
Specifically, step S2 includes:
S201, computing the similarity Sim(i, j) between any two data points with the adjusted cosine similarity and defining from it the distance variable dij;
S202, selecting a suitable truncation distance and computing, for every data point in X*, its local density ρi and its distance δi to the nearest point of higher local density;
S203, from the local densities of all sample points and their distances to points of higher local density, drawing the decision graph with ρ on the horizontal axis and δ on the vertical axis;
S205, assigning the remaining points to obtain the C cluster partitions of the n samples, and outputting the cluster partitions of the sample set as the basis of the next analysis;
where i, j = 1, 2, …, n and u is the number of attributes of an object; the truncation distance dc defines a circle centered at data point xi with radius dc, and dc is chosen so that the cumulative count ρi of points inside satisfies |X| × 2%.
Specifically, step S3 includes:
S301, weighting the χ² statistics of the sample set with the information entropy values of the features and sample classes (clusters);
S302, using the weighted χ² statistics to build a new statistical matrix K, whose rows and columns express the weighted probability distributions of the features across different classes (clusters) and within the same class (cluster), respectively;
S304, converting each ti through its membership function into the corresponding membership degrees μij, and constructing a new category vector bi whose entries bij are the membership degrees μij of ti arranged in descending order;
S305, calculating the sum of the contributions that feature ti provides to each class;
S306, calculating the cumulative variance contribution rate;
S307, repeatedly executing steps S303 to S306 until the cumulative variance contribution rate reaches the required level, obtaining the feature subset T = {t1, t2, …, tp}.
Further, in step S301, in the χ² statistic, the feature t and the sample class ci are weighted; the weighted χ² statistic is written Wχ²(t, ci), with the weight defined as the information entropy value of the feature t and the sample class ci. Here p(t|ci) is the probability that feature t occurs in sample class ci, p(ci) is the probability that sample class ci occurs, p(t, ci) is the probability that feature t occurs together with sample class ci, the average local density of the sample points in class ci is defined as the mean of the local densities ρ of the points in cluster ci, and C = {c1, c2, …, ck} denotes the set of sample classes.
Further, in step S302, the statistical matrix K collects the weighted χ² statistic of every feature-class pair; its rows and columns represent the weighted probability distributions of the features across different categories and within the same category, respectively.
Specifically, step S4 includes:
S401, centering the feature subset T = {t1, t2, …, tp} so that its mean value is 0;
S403, selecting the number m of independent components to estimate, and setting i = 1;
S404, selecting an initial (randomly selectable) vector wi with unit norm;
S406, normalizing wi: wi ← wi/||wi||;
S407, if wi has not yet converged, returning to step S405;
S408, letting i ← i + 1; if i ≤ m, returning to step S404.
Compared with the prior art, the invention has at least the following beneficial effects:
the unsupervised text theme related gene extraction method oriented to the unbalanced large data set does not need to adopt large-scale labeled samples for training, can effectively avoid predefining the class relation and the characteristic structure of the samples, and has more practical value: most samples obtained by crawling means are not labeled with categories, so that the traditional supervised topic discovery method is difficult to implement effectively. The invention is based on an unsupervised feature extraction method, and has no limitation; the method overcomes the influence of an over-sampling or under-sampling method on the class distribution of the original unbalanced data set. The real information in the sample space is accurately reflected by correcting the characteristic class structure, and the method has stronger generalization in the face of an unbalanced large data set; the invention realizes effective characteristic dimension reduction under the condition of keeping the identification capability of the sample set, further reduces noise word interference, weakens the phenomenon of 'empty space' of a high-dimensional data space, and reduces uncertainty in sample analysis.
Furthermore, factor analysis finds an optimal low-dimensional basis describing the original high-dimensional vector space, making it feasible for the density peak algorithm to quickly find the sample clusters of a large-scale data set.
Further, the density peak clustering algorithm is guided by the neighborhood similarity of the sample points to realize clustering and automatic labeling of the unlabeled text set.
Furthermore, introducing the average local density and the information entropy into the feature-item weight definition builds a discrimination matrix of the feature items over the sample categories (clusters), remedying the shortcomings of traditional feature selection on unbalanced sample sets.
Furthermore, the Independent Component Analysis (ICA) method analyzes the higher-order correlations among the multidimensional statistical data, finds mutually independent latent information components, accurately selects from the unbalanced large data set an optimal feature subset that comprehensively and truly reflects the text topic information, and improves text classification and recognition performance.
In conclusion, the invention focuses on unsupervised text feature extraction and studies how to select a stable, strongly generalizing subset of text topic-related genes, thereby reducing the feature dimension of the vector space, strengthening the category (cluster) representation power of the feature words, and improving classification and recognition.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
FIG. 1 is a general flow chart of the unsupervised text topic-related gene extraction method for an unbalanced large data set according to the present invention;
FIG. 2 is a flow chart of a sample feature analysis process;
FIG. 3 is a flow chart of a sample clustering process;
FIG. 4 is a flow chart of a feature selection process;
FIG. 5 is a flow chart of the subject gene extraction process;
FIG. 6 is a diagram of the normalized mutual information values of the algorithms under the different feature numbers selected in the present invention, where (a) is the normalized mutual information (%) of each algorithm on the Sohu news data (SogouCS) 20151022 corpus, and (b) is the normalized mutual information (%) of each algorithm on the Reuters-21578 corpus.
Detailed Description
The invention provides an unsupervised text topic-related gene extraction method for unbalanced large data sets. Factor analysis and the density peak algorithm obtain the clusters of a high-dimensional sample set and label the unlabeled samples; average local density and information entropy improve the CHI-statistic-matrix-based feature selection method, strengthening the feature representation of low-density, small-sample clusters; and the negentropy-based fast fixed-point algorithm (FastICA) analyzes the higher-order statistical correlations among the multidimensional data, extracting independent latent topic feature genes and removing the higher-order redundancy between components. The method needs no large-scale labeled samples for training, effectively avoids predefining sample-class relations and feature structures, and overcomes the influence of over-sampling or under-sampling on the class distribution of the original unbalanced data set. Correcting the feature-class structure greatly improves the performance of CHI-statistic feature selection, and effective feature dimension reduction is achieved while the discriminative power of the sample set is preserved.
Referring to fig. 1, the unsupervised text topic-related gene extraction method for an unbalanced big data set of the present invention includes the following steps:
S1, performing dimensionality reduction on the high-dimensional samples in the unlabeled sample set by factor analysis, and outputting the characteristic index matrix of the sample set;
Factor analysis is performed on the original characteristic variables of the sample set, and a few "abstract" variables (the common factors) are selected to replace them, reducing the correlation among sample features and their dimensionality. The specific flow is shown in fig. 2:
S101, performing the KMO test on the degree of correlation between samples, jumping to S102 when the KMO statistic is greater than 0.5, otherwise jumping to S106;
Let the sample set X contain n samples x1, x2, …, xn, where each sample xi consists of m characteristic indexes, written X = (xij)n×m = (X1, X2, …, Xm);
The KMO (Kaiser-Meyer-Olkin) test determines the degree of correlation among X1, X2, …, Xm and thus the necessity of factor analysis: the closer the KMO statistic is to 0, the weaker the correlation among X1, X2, …, Xm; the closer it is to 1, the stronger the correlation.
Typically, the KMO statistic is greater than 0.5, and performing factor analysis is of practical significance.
S102, calculating the covariance matrix Σ = (hij)m×m of X1, X2, …, Xm, and determining the number of common factors from the percentage of the sum of the leading characteristic roots in the sum of all characteristic roots;
From the characteristic equation |Σ − λI| = 0 of Σ, the characteristic roots of the covariance matrix are λ1 ≥ λ2 ≥ … ≥ λp ≥ 0, with corresponding unit characteristic vectors T1, T2, …, Tp;
Following the usual treatment in practical problems, the first u characteristic roots and characteristic vectors are taken so that the sum of these u characteristic roots accounts for more than 85% of the sum of all characteristic roots, which determines the number of common factors;
S103, calculating the factor load matrix, jumping to step S104 when the loads of each factor on the different characteristic indexes do not differ obviously, otherwise jumping to step S105;
The factor load matrix is calculated from the characteristic roots and characteristic vectors of Σ as A = (√λ1 T1, √λ2 T2, …, √λu Tu);
S104, rotating the factor load matrix by an orthogonal rotation method;
If the loads of each factor on the different characteristic indexes do not differ obviously, the factor load matrix needs to be rotated; an orthogonal rotation method is generally adopted, giving the rotated factor load matrix A' = (bij)m×u;
On the row vectors of the rotated factor load matrix A', the operation bip = Max{bi1, bi2, …, biu}, i = 1, 2, …, m, p ∈ {1, 2, …, u}, is carried out, retaining for each characteristic index Xi the maximum load value bip among its u factors and yielding the matrix A* = (b'ij)m×u, where i = 1, 2, …, m; j = 1, 2, …, u;
S105, evaluating the loads of the characteristic indexes on their corresponding common factors in the factor load matrix and retaining the maximum load value;
S106, outputting the characteristic index matrix of the sample set as the basis of the next analysis.
Through the above operations the sample set X is simplified to contain n samples, each sample xi composed of u characteristic index factors, giving the finite sample set XΔ; the characteristic index matrix of the n samples is thus constructed as XΔ = (xΔij)n×u, where xΔij denotes the j-th characteristic index factor of the i-th sample, i = 1, 2, …, n; j = 1, 2, …, u.
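The S101-S106 flow can be sketched numerically. The following is a minimal illustration under stated simplifications, not the patent's implementation: it keeps the 85% cumulative characteristic-root rule of S102 and builds loadings and factor scores from the eigendecomposition of the covariance matrix, while omitting the KMO test (S101) and the orthogonal rotation (S104); all names are illustrative.

```python
import numpy as np

def common_factor_reduce(X, threshold=0.85):
    """Sketch of step S1: u is the smallest count of leading characteristic
    roots of the covariance matrix whose cumulative share exceeds
    `threshold` (the 85% rule of S102)."""
    Xc = X - X.mean(axis=0)                   # center the characteristic indexes
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(eigvals)[::-1]         # characteristic roots, descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratio = np.cumsum(eigvals) / eigvals.sum()
    u = int(np.searchsorted(ratio, threshold)) + 1
    loadings = eigvecs[:, :u] * np.sqrt(eigvals[:u])  # factor load matrix A
    scores = Xc @ eigvecs[:, :u]              # n x u characteristic index matrix
    return scores, loadings, u

# 100 samples whose 6 indexes are driven by 2 latent factors plus small noise
rng = np.random.default_rng(0)
F = rng.normal(size=(100, 2))
X = F @ rng.normal(size=(2, 6)) + 0.05 * rng.normal(size=(100, 6))
scores, loadings, u = common_factor_reduce(X)
```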
S2, carrying out exploratory clustering on the reduced-dimension sample set with the fast search and find of density peaks algorithm;
The local density of each sample expressed by the common factors, and its distance to points of higher local density, are analyzed; a decision graph is drawn and the cluster partition of the sample set is generated. The specific flow is shown in fig. 3:
S201, calculating the similarity between any two row vectors of the characteristic index matrix of the sample set;
The similarity between samples is computed with the adjusted cosine similarity, and the distance variable dij is defined from the similarity Sim(i, j) between any two data points.
Here i, j = 1, 2, …, n and u is the number of attributes of an object; the truncation distance dc defines a circle centered at data point xi with radius dc, and dc is chosen so that the cumulative count ρi of points inside satisfies |X| × 2%.
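The adjusted cosine similarity of S201 can be sketched as follows, under the assumption that "adjusted" means centering each attribute column by its mean before taking the cosine (the common definition); the patent's own formula is carried in its figures, so this is an illustration only.

```python
import numpy as np

def adjusted_cosine_sim(X):
    """Pairwise adjusted cosine similarity between the rows of X: every
    attribute (column) is centered by its mean, then the rows are
    normalized and their dot products taken."""
    Xc = X - X.mean(axis=0)                   # center each attribute column
    norms = np.linalg.norm(Xc, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                   # guard against all-zero rows
    U = Xc / norms
    return U @ U.T                            # Sim(i, j) matrix

X = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [3.0, 1.0, 0.0]])
S = adjusted_cosine_sim(X)
```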
S202, selecting a suitable truncation distance and computing, for every data point in X*, its local density ρi and its distance δi to the nearest point of higher local density;
Let data point xi have local density ρi = Σj χ(dij − dc), where χ(x) = 1 if x < 0 and 0 otherwise, and let δi be the distance from xi to the closest data point xj whose local density is greater than its own, δi = min{dij : ρj > ρi}; here dij is the distance between data points and dc is the truncation distance (a hyper-parameter).
S203, from the local densities of all sample points and their distances to points of higher local density, drawing the decision graph with ρ on the horizontal axis and δ on the vertical axis;
For every data point xi, ρi and δi are calculated; the C points with the largest sorted ρi and δi values are marked as cluster centers, and each remaining data point is assigned to the cluster of its nearest neighbor of higher density.
S205, assigning the remaining points to obtain the C cluster partitions of the n samples, and outputting the cluster partitions of the sample set as the basis of the next analysis.
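Steps S201-S205 can be sketched as below. This is a hedged illustration of the density peaks procedure, not the patent's code: plain Euclidean distance stands in for the adjusted cosine similarity of S201, and dc is taken as the 2% quantile of the pairwise distances, in the spirit of the |X| × 2% rule in the text.

```python
import numpy as np

def density_peak_cluster(X, n_clusters, dc_quantile=0.02):
    """Fast search and find of density peaks (step S2), simplified."""
    n = X.shape[0]
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    dc = np.quantile(d[np.triu_indices(n, k=1)], dc_quantile)
    rho = (d < dc).sum(axis=1) - 1            # local density, cutoff kernel
    delta = np.zeros(n)                       # distance to nearest denser point
    nearest_denser = np.full(n, -1)
    order = np.argsort(rho)[::-1]             # decreasing density
    for rank, i in enumerate(order):
        if rank == 0:                         # densest point: delta is its max
            delta[i] = d[i].max()             # distance, so it is always a center
            continue
        denser = order[:rank]
        j = denser[np.argmin(d[i, denser])]
        delta[i], nearest_denser[i] = d[i, j], j
    # decision-graph corners: the n_clusters points with largest rho * delta
    centers = np.argsort(rho * delta)[::-1][:n_clusters]
    labels = np.full(n, -1)
    labels[centers] = np.arange(n_clusters)
    for i in order:                           # S205: assign remaining points to
        if labels[i] == -1:                   # their nearest denser neighbour
            labels[i] = labels[nearest_denser[i]]
    return labels

# two well-separated blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (40, 2)), rng.normal(5, 0.3, (40, 2))])
labels = density_peak_cluster(X, n_clusters=2)
```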
S3, improving the χ² statistic with information entropy and average local density and constructing a sample feature distribution matrix based on the weighted χ² statistic: the features and sample categories (clusters) of the sample set's χ² statistics are weighted, with the weight defined as the information entropy value of the feature and the sample category (cluster); a new statistical matrix is constructed from the weighted χ² statistics to represent the weighted probability distributions of the features across different categories (clusters) and within the same category (cluster), and feature selection is carried out on this basis;
the specific flow is shown in fig. 4:
S301, weighting the χ² statistics of the sample set with the information entropy values of the features and sample classes (clusters);
In the χ² statistic, the feature t and the sample class (cluster) ci are weighted; the weighted χ² statistic is written Wχ²(t, ci), with the weight defined as the information entropy value of the feature t and the sample class (cluster) ci.
Here p(t|ci) is the probability that feature t occurs in sample class (cluster) ci, p(ci) is the probability that sample class (cluster) ci occurs, p(t, ci) is the probability that feature t occurs together with sample class (cluster) ci, the average local density of the sample points in class (cluster) ci is defined as the mean of the local densities ρ of the points in cluster ci, and C = {c1, c2, …, ck} denotes the set of sample classes (clusters).
S302, building a new statistical matrix K from the weighted χ² statistics, with the rows and columns of K expressing the weighted probability distributions of the features across different classes (clusters) and within the same class (cluster), respectively;
The statistical matrix K constructed from the weighted χ² statistics collects the weighted χ² value of every feature-class (cluster) pair, its rows and columns representing the weighted probability distributions of the features across different categories (clusters) and within the same category (cluster), respectively.
S304, converting each ti through its membership function into the corresponding membership degrees μij, and constructing a new category (cluster) vector bi whose entries bij are the membership degrees μij of ti arranged in descending order;
S305, calculating the sum of the contributions that feature ti provides to each class (cluster);
S306, calculating the cumulative variance contribution rate;
S307, repeatedly executing steps S303 to S306 until the cumulative variance contribution rate reaches the required level, obtaining the feature subset T = {t1, t2, …, tp}.
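The weighted-χ² scoring of S301-S302 can be sketched as follows. The entropy-style weight used here, -p(t, c)·log p(t|c) scaled by the cluster's average local density, is an illustrative placeholder: the patent's exact weighting formula is carried in its figures, so only the overall shape (entropy weight × average local density × χ²) follows the text.

```python
import numpy as np

def chi2_stat(A, B, C, D):
    """Standard chi-square statistic from the feature/class contingency
    counts: A/C = docs inside the class with/without the feature,
    B/D = docs outside the class with/without it."""
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - B * C) ** 2 / denom if denom else 0.0

def weighted_chi2_matrix(presence, labels, rho, eps=1e-12):
    """Sketch of step S3: `presence` is an n x p 0/1 matrix (feature t_j
    occurs in document i), `labels` the S2 cluster labels, `rho` the S2
    local densities; returns the p x k matrix K of weighted scores."""
    classes = np.unique(labels)
    n, p = presence.shape
    K = np.zeros((p, len(classes)))
    for ci, c in enumerate(classes):
        in_c = labels == c
        rho_bar = rho[in_c].mean()            # average local density of class c_i
        for tj in range(p):
            t = presence[:, tj].astype(bool)
            A = int((t & in_c).sum())
            B = int((t & ~in_c).sum())
            C = int((~t & in_c).sum())
            D = int((~t & ~in_c).sum())
            w = -(A / n) * np.log(A / in_c.sum() + eps)  # placeholder entropy weight
            K[tj, ci] = w * rho_bar * chi2_stat(A, B, C, D)
    return K

# toy data: t_0 marks 80% of class 0, t_1 occurs uniformly everywhere
labels = np.array([0] * 20 + [1] * 20)
presence = np.zeros((40, 2), dtype=int)
presence[:16, 0] = 1
presence[::2, 1] = 1
rho = np.ones(40)
K = weighted_chi2_matrix(presence, labels, rho)
```

As expected, the class-indicative feature t_0 scores high for class 0, while the uniformly distributed t_1 receives zero χ² and hence zero weighted score.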
S4, analyzing the higher-order statistical correlations among the data in the multidimensional feature subset T = {t1, t2, …, tp} with the negentropy-based fast fixed-point algorithm (FastICA), extracting independent feature genes, and completing the removal of higher-order redundancy between components.
Referring to fig. 5, the flow of extracting the topic-related genes is as follows:
S401, centering T = {t1, t2, …, tp} so that its mean value is 0;
S403, selecting the number m of independent components to estimate, and setting i = 1;
S404, selecting an initial (randomly selectable) vector wi with unit norm;
S406, normalizing wi: wi ← wi/||wi||;
S407, if wi has not yet converged, returning to step S405;
S408, letting i ← i + 1; if i ≤ m, returning to step S404.
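Steps S401-S408 correspond to the classical deflation-based FastICA iteration. The sketch below assumes that the sub-steps omitted from the extracted text (S402 and S405) are whitening and the negentropy fixed-point update with the tanh nonlinearity, the usual choices; it is an illustration, not the patent's implementation.

```python
import numpy as np

def fastica_negentropy(T, m, max_iter=200, tol=1e-6, seed=0):
    """Negentropy-based fast fixed-point ICA (step S4), deflation scheme."""
    rng = np.random.default_rng(seed)
    Tc = T - T.mean(axis=1, keepdims=True)           # S401: center to mean 0
    evals, evecs = np.linalg.eigh(np.cov(Tc))
    Z = (evecs @ np.diag(evals ** -0.5) @ evecs.T) @ Tc  # whitening (assumed S402)
    W = np.zeros((m, Z.shape[0]))
    for i in range(m):                               # S403: i = 1..m
        w = rng.normal(size=Z.shape[0])
        w /= np.linalg.norm(w)                       # S404: random unit-norm start
        for _ in range(max_iter):
            wx = w @ Z
            g, gp = np.tanh(wx), 1.0 - np.tanh(wx) ** 2
            w_new = (Z * g).mean(axis=1) - gp.mean() * w  # fixed point (assumed S405)
            w_new -= W[:i].T @ (W[:i] @ w_new)       # decorrelate vs earlier rows
            w_new /= np.linalg.norm(w_new)           # S406: w <- w / ||w||
            converged = abs(abs(w_new @ w) - 1.0) < tol
            w = w_new
            if converged:                            # S407: loop until converged
                break
        W[i] = w                                     # S408: next component, i <= m
    return W @ Z                                     # independent components

# demo: un-mix two independent non-Gaussian sources from two observed mixtures
rng = np.random.default_rng(2)
S = np.vstack([np.sign(rng.normal(size=2000)),       # sub-Gaussian source
               rng.laplace(size=2000)])              # super-Gaussian source
T = rng.normal(size=(2, 2)) @ S                      # unknown mixing matrix
components = fastica_negentropy(T, m=2)
```

Each recovered component should correlate strongly (up to sign and order) with one of the original sources, which is how the removal of higher-order redundancy shows itself in practice.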
Extraction of topic-related genes is a key step in preprocessing industrial and social big data: by finding mutually independent latent information components, the optimal feature subset that comprehensively and truly reflects the text topic information is accurately selected from the unbalanced large data set, markedly improving the recognition performance of the classifier.
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
To further test the practical effect of the method, two corpora were selected for the simulation experiments: the Sogou news data (SogouCS) 20151022 and Reuters-21578. A k-means clustering algorithm was adopted, and the normalized mutual information between each algorithm's clustering result and the original category information was analyzed to measure the effectiveness of the algorithms. Because k-means requires the number of clusters to be given explicitly, to reduce the influence of the choice of K on the method, the cluster number K for the proposed method and the comparison methods is set to the number of categories in each data set's labels, namely 20, 10 and 12. Figs. 6(a) and (b) show the normalized mutual information values of the algorithms under different numbers of selected features.
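The evaluation criterion used above, normalized mutual information between the clustering output and the true category labels, can be sketched as follows. This is the standard NMI with square-root normalization; the patent does not specify which normalization variant was used.

```python
import numpy as np
from collections import Counter

def normalized_mutual_info(labels_a, labels_b):
    """NMI between two labelings: I(A;B) / sqrt(H(A) * H(B))."""
    n = len(labels_a)

    def entropy(labels):
        counts = np.array(list(Counter(labels).values()), dtype=float)
        p = counts / n
        return -np.sum(p * np.log(p))

    mi = 0.0
    for a, na in Counter(labels_a).items():
        for b, nb in Counter(labels_b).items():
            # joint count of points labeled a in A and b in B
            nab = sum(1 for x, y in zip(labels_a, labels_b) if x == a and y == b)
            if nab > 0:
                mi += (nab / n) * np.log(n * nab / (na * nb))
    ha, hb = entropy(labels_a), entropy(labels_b)
    return mi / np.sqrt(ha * hb) if ha > 0 and hb > 0 else 0.0
```

A clustering that matches the labels up to a renaming of clusters scores 1; an uninformative clustering scores 0.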
As can be seen from Figs. 6(a) and (b), the unsupervised text topic-related gene extraction method for unbalanced big data sets has obvious advantages over the other four algorithms, and the proposed method quickly achieves good results even when the number of features is small; the unsupervised feature selection performed by the proposed algorithm therefore outperforms general unsupervised feature selection algorithms.
In conclusion, the unsupervised text topic-related gene extraction method for large unbalanced data sets needs no large-scale labeling of samples for training, avoids predefining class relations and related characteristics, and solves the problem of poor model generalization caused by unbalanced sample class distribution. On the basis of a text clustering method that rapidly searches for and finds density peaks, a text feature distribution matrix of weighted χ2 statistics is constructed using information entropy; this avoids changing the category distribution of the original unbalanced data set, as oversampling or under-sampling methods would, and greatly improves the performance of the CHI statistical selection method by correcting the feature category distribution. Finally, a negentropy-based fast fixed-point algorithm (FastICA) is adopted to extract independent implicit information components among the multi-dimensional data; the generalization performance of the resulting feature subset is superior to RSR, FSFC, UFS-MI and RUFS, and feature dimensionality reduction is achieved while the discriminative capability of the data set is preserved.
In addition, feature dimensionality reduction for the industrial and social big data brought about by informatization is a key preprocessing step, and the text topic-related gene extraction idea proposed by the invention plays an increasingly important role in these fields; how to better adapt it to the data processing requirements of these fields is research work to be carried out in the future.
The above-mentioned contents are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereby, and any modification made on the basis of the technical idea of the present invention falls within the protection scope of the claims of the present invention.
Claims (9)
1. The unsupervised text theme related gene extraction method for the unbalanced large data set is characterized by comprising the following steps of:
S1, performing dimensionality reduction on the high-dimensional samples in the unlabeled sample set by adopting factor analysis, and outputting the characteristic index matrix of the sample set;
S2, for each sample expressed by the common factors, analyzing the local density and the distance to the nearest point with higher local density, drawing a decision graph, performing exploratory clustering on the reduced-dimension sample set with a fast search and density peak discovery algorithm to obtain C cluster partitions of the n samples, and outputting the cluster partitions of the sample set;
S3, improving the χ2 statistic by utilizing information entropy and average local density, and constructing a sample feature distribution matrix based on the weighted χ2 statistic; the features in the χ2 statistic of the sample set are weighted together with the sample classes, a new statistical matrix representing the weighted probability distribution of the features in different classes and within the same class is constructed from the weighted χ2 statistic, and feature selection is performed to obtain the feature subset T = {t1, t2, …, tp};
S4, analyzing the high-order statistical correlation among the data in the multi-dimensional feature subset T = {t1, t2, …, tp} by using a negentropy-based fast fixed-point algorithm, extracting independent characteristic genes, and removing high-order redundancy among the components.
2. The unsupervised text topic-related gene extraction method for the unbalanced big data set as recited in claim 1, wherein the step S1 specifically comprises:
S101, let the sample set X contain n samples x1, x2, …, xn, each sample xi consisting of m characteristic indexes, denoted X = (xij)n×m = (X1, X2, …, Xm); performing the KMO test on the degree of correlation between the samples, jumping to step S102 when the KMO statistic is larger than 0.5, and otherwise jumping to step S106;
S102, calculating the covariance matrix Σ = (hij)m×m of the sample set X1, X2, …, Xm, and determining the number of common factors according to the percentage of the sum of the leading characteristic roots in the sum of all characteristic roots;
S103, calculating the factor load matrix, jumping to step S104 when the loads of each factor on the different characteristic indexes are not obviously different, and otherwise jumping to step S105;
S104, rotating the factor load matrix by an orthogonal rotation method;
S105, evaluating the load of each characteristic index on the corresponding common factors in the factor load matrix, and retaining the maximum load value;
S106, outputting the characteristic index matrix of the sample set X.
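The factor-count decision in step S102 (retain common factors until their characteristic roots account for a given percentage of the total) can be sketched as follows. The 85% cutoff is an assumed illustrative value; the patent only refers to "the percentage of the sum of the characteristic roots".

```python
import numpy as np

def choose_common_factors(X, threshold=0.85):
    """Step S102 sketch: choose the number of common factors so that the
    leading characteristic roots (eigenvalues) of the covariance matrix
    account for at least `threshold` of the total variance."""
    cov = np.cov(X, rowvar=False)                    # Sigma = (h_ij)_{m x m}
    roots = np.sort(np.linalg.eigvalsh(cov))[::-1]   # characteristic roots, descending
    ratio = np.cumsum(roots) / roots.sum()           # cumulative contribution
    return int(np.searchsorted(ratio, threshold) + 1)
```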
3. The unsupervised text topic related gene extraction method of claim 2, wherein in step S106, each sample xi consists of u characteristic index factors, forming a finite sample set XΔ whose n samples give the characteristic index matrix X*, specifically:
wherein xij* represents the j-th characteristic index factor of the i-th sample, i = 1, 2, …, n; j = 1, 2, …, u.
4. The unsupervised text topic-related gene extraction method for the unbalanced big data set as recited in claim 1, wherein the step S2 specifically comprises:
S201, using the adjusted cosine similarity to calculate the similarity between samples, defining a variable dij, and calculating the similarity Sim(i, j) between any two data points;
S202, selecting a proper cutoff distance, and calculating for each data point in X* its local density ρ and its distance δ to the nearest point with higher local density;
S203, drawing a decision graph with the local density ρ of all sample points on the horizontal axis and the distance δ from the nearest point of higher local density on the vertical axis;
S205, assigning the remaining points to obtain the C cluster partitions of the n samples, and outputting the cluster partitions of the sample set as the basis for the next analysis.
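Steps S202–S205 can be sketched as the classic density-peaks procedure over a precomputed distance matrix. Selecting centers as the points with the largest ρ·δ is one common reading of the decision graph; the patent (like the original algorithm) leaves center selection to visual inspection, and the elided step S204 is assumed here to be that selection.

```python
import numpy as np

def density_peaks(D, dc, n_centers):
    """Density-peaks clustering sketch on a distance matrix D.

    rho_i: number of points within the cutoff distance dc (S202).
    delta_i: distance to the nearest point of higher density (S202).
    Remaining points inherit the cluster of their nearest
    higher-density neighbor (S205).
    """
    n = D.shape[0]
    rho = (D < dc).sum(axis=1) - 1           # exclude the point itself
    delta = np.zeros(n)
    nearest_higher = np.full(n, -1)
    order = np.argsort(-rho)                 # densest point first
    delta[order[0]] = D[order[0]].max()      # convention for the densest point
    for k in range(1, n):
        i = order[k]
        higher = order[:k]
        j = higher[np.argmin(D[i, higher])]
        delta[i] = D[i, j]
        nearest_higher[i] = j
    # Decision-graph reading: centers have both large rho and large delta.
    # Assumes the globally densest point is among the centers, which holds
    # for well-separated clusters.
    centers = np.argsort(-(rho * delta))[:n_centers]
    labels = np.full(n, -1)
    labels[centers] = np.arange(n_centers)
    for i in order:                          # assign in decreasing density
        if labels[i] < 0:
            labels[i] = labels[nearest_higher[i]]
    return labels
```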
5. The unsupervised text topic related gene extraction method of claim 4, wherein in step S201, the similarity Sim(i, j) between samples in the unbalanced big data set is defined as follows:
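The formula itself is not reproduced in this text. As an illustration only, one common form of the adjusted cosine similarity (centering each feature before taking the cosine) can be sketched as follows; this is an assumed standard form, not necessarily the claimed definition.

```python
import numpy as np

def adjusted_cosine_sim(X):
    """Pairwise adjusted cosine similarity for rows of X (n_samples x n_features).

    Subtracting each feature's mean removes per-dimension offsets before
    the cosine is taken -- an assumed standard form of the measure.
    """
    Xc = X - X.mean(axis=0, keepdims=True)   # center each feature
    norms = np.linalg.norm(Xc, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                  # guard against zero vectors
    U = Xc / norms
    return U @ U.T                           # Sim(i, j) matrix
```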
6. The unsupervised text topic-related gene extraction method for the unbalanced big data set as recited in claim 1, wherein the step S3 specifically comprises:
S301, weighting the χ2 statistic of the sample set by utilizing the information entropy values of the features and the sample classes (clusters);
S302, establishing a new statistical matrix K from the weighted χ2 statistic, the rows and columns of K respectively representing the weighted probability distribution of the features in different classes (clusters) and within the same class (cluster);
S304, converting ti into the corresponding membership degrees μij and constructing a new category vector, where bij is the membership μij of ti arranged in descending order;
S305, calculating the sum of the contributions that feature ti provides to each class;
S306, calculating the cumulative variance contribution rate;
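The construction of the statistical matrix in S301–S302 can be sketched as an entropy-weighted chi-square score per (feature, class) pair. The claimed weighting also involves the average local density of sample points; that term is omitted below, so this is a simplified illustration, not the claimed statistic.

```python
import numpy as np

def weighted_chi2_matrix(presence, labels):
    """Entropy-weighted chi-square scores K[t, c] (simplified S301-S302 sketch).

    presence: (n_docs, n_features) boolean matrix of feature occurrence.
    labels:   cluster index of each document.
    """
    n, p = presence.shape
    classes = np.unique(labels)
    K = np.zeros((p, len(classes)))
    for j, c in enumerate(classes):
        in_c = labels == c
        for t in range(p):
            a = np.sum(presence[:, t] & in_c)    # t present, in class c
            b = np.sum(presence[:, t] & ~in_c)   # t present, outside c
            cc = np.sum(~presence[:, t] & in_c)  # t absent, in class c
            d = np.sum(~presence[:, t] & ~in_c)  # t absent, outside c
            denom = (a + cc) * (b + d) * (a + b) * (cc + d)
            chi2 = n * (a * d - b * cc) ** 2 / denom if denom else 0.0
            # entropy-style weight from p(t | c) -- an assumed simplification
            p_tc = a / in_c.sum() if in_c.sum() else 0.0
            w = -p_tc * np.log(p_tc) if 0 < p_tc < 1 else 0.0
            K[t, j] = w * chi2
    return K
```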
7. The unsupervised text topic related gene extraction method of claim 6, wherein in step S301, in the χ2 statistic, the feature t and the sample class ci are weighted; the weighted χ2 statistic is defined as Wχ2(t, ci), and the weight is defined as the information entropy value of the feature t and the sample class ci, specifically:
wherein p(t|ci) is the probability that the feature t occurs in the sample class ci, p(ci) is the probability that the sample class ci occurs, p(t, ci) is the probability that the feature t occurs in the sample class ci, the average local density of the sample points in the sample class ci is used in the weight, and C = {c1, c2, …, ck} represents the sample class set; ciRep denotes the sample points in the cluster ci.
9. The unsupervised text topic-related gene extraction method for the unbalanced big data set as recited in claim 1, wherein the step S4 specifically comprises:
S401, centering the feature subset T = {t1, t2, …, tp} so that its mean value is 0;
S403, selecting the number m of independent components to be estimated, and setting i = 1;
S404, selecting an initialization vector wi (which may be chosen randomly) with unit norm;
S406, normalizing wi: wi ← wi/||wi||;
S407, if wi has not yet converged, returning to step S405;
S408, letting i ← i + 1; if i ≤ m, returning to step S404.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010255801.8A CN111460161A (en) | 2020-04-02 | 2020-04-02 | Unsupervised text theme related gene extraction method for unbalanced big data set |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111460161A true CN111460161A (en) | 2020-07-28 |
Family
ID=71684436
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010255801.8A Pending CN111460161A (en) | 2020-04-02 | 2020-04-02 | Unsupervised text theme related gene extraction method for unbalanced big data set |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111460161A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112182164A (en) * | 2020-10-16 | 2021-01-05 | 上海明略人工智能(集团)有限公司 | High-dimensional data feature processing method and system |
CN112907035A (en) * | 2021-01-27 | 2021-06-04 | 厦门卫星定位应用股份有限公司 | K-means-based transportation subject credit rating method and device |
CN114124536A (en) * | 2021-11-24 | 2022-03-01 | 四川九洲电器集团有限责任公司 | Multi-station detection signal tracing method |
CN115952432A (en) * | 2022-12-21 | 2023-04-11 | 四川大学华西医院 | Unsupervised clustering method based on diabetes data |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20200728 |