CN111460161A - Unsupervised text theme related gene extraction method for unbalanced big data set - Google Patents


Info

Publication number
CN111460161A
Authority
CN
China
Prior art keywords
sample
characteristic
matrix
class
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010255801.8A
Other languages
Chinese (zh)
Inventor
孙晶涛
李敬明
陈彦萍
张秋余
王忠民
孙韩林
温福喜
何继光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xi'an University of Posts and Telecommunications
Original Assignee
Xi'an University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xi'an University of Posts and Telecommunications
Priority to CN202010255801.8A
Publication of CN111460161A
Legal status: Pending


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an unsupervised text topic-related gene extraction method for unbalanced large data sets. Factor analysis and the density peak algorithm are used to obtain the clusters of a high-dimensional sample set and to label the unlabeled samples; the feature selection method based on the CHI statistical matrix is improved with average local density and information entropy, strengthening the feature expression of low-density and small-sample clusters; and a negentropy-based fast fixed-point algorithm analyzes the high-order statistical correlations among the multi-dimensional data, extracts independent implicit topic feature genes, and removes the high-order redundancy among the components. No large-scale labeled samples are required for training, so predefining sample class relations and feature structures is effectively avoided, as is the influence of over-sampling or under-sampling methods on the class distribution of the original unbalanced data set. Correcting the feature class structure improves the performance of the CHI statistical selection method, and effective feature dimension reduction is achieved while the discriminative capability of the sample set is retained.

Description

Unsupervised text theme related gene extraction method for unbalanced big data set
Technical Field
The invention belongs to the technical field of data interpretation and topic discovery in natural language processing, and particularly relates to an unsupervised text topic-related gene extraction method for unbalanced large data sets.
Background
As society steps into the era of "big data", people obtain more and more information through web pages, microblogs, forums and similar channels, but have less and less time to read and organize it. Efficient, accurate analysis of the topics of this information has therefore become an effective means of realizing big-data understanding and value discovery, with applications covering Internet public-opinion monitoring and early warning, harmful-information filtering, sentiment analysis and many other areas. Data in these fields typically arrive as large volumes of high-dimensional samples with redundant or irrelevant features, which greatly reduce the efficiency and performance of learning algorithms; feature extraction is therefore a crucial link in machine learning and data mining that directly affects the efficiency and accuracy of model construction and analysis.
Currently, feature extraction can be classified as supervised or unsupervised according to the category information used. In text content analysis, whichever kind is adopted, a Vector Space Model is required to represent each text as a vector in a space formed by a certain number of feature words, so two problems inevitably arise in practical applications:
① The distribution of sample categories (clusters) in the data set is unbalanced. The measurement functions used to evaluate the quality of feature subsets — whether correlation and similarity analysis based on independence, Euclidean or Mahalanobis distance, or even the currently most widely applied mutual information and information gain based on information entropy — all assume that the sample categories (clusters) are identically or similarly distributed in the data set. As a result, most of the selected features come from the "large classes" that dominate in number (density), and few or none come from the "small classes". The selected feature subset with the highest discrimination then fails to accurately reflect the real information in the whole sample space, degrading the performance of subsequent learning methods on practical problems;
② The objects to be processed are becoming increasingly complex, and the data dimensionality grows explosively. Ultra-high-dimensional data sets imply not only huge memory requirements but also high computational cost. In these high-dimensional feature spaces, many feature points are strongly correlated, introducing a great deal of redundancy and even noise, so that the generalization ability of feature items selected by traditional methods deteriorates sharply; the "empty space" phenomenon of high-dimensional data spaces also makes multivariate density estimation very difficult.
Disclosure of Invention
The technical problem to be solved by the present invention is to provide an unsupervised text topic-related gene extraction method for unbalanced large data sets that effectively avoids predefining sample class relations and feature structures, and overcomes the influence of over-sampling or under-sampling methods on the class distribution of the original unbalanced data set.
The invention adopts the following technical scheme:
the unsupervised text theme related gene extraction method for the unbalanced large data set comprises the following steps of:
s1, performing dimensionality reduction on high-dimensional samples in the unlabeled sample set by adopting factor analysis, and outputting a characteristic index matrix of the sample set;
S2, analyzing, for each sample expressed by the common factors, the local density and the distance to points of higher local density, drawing a decision graph, performing exploratory clustering on the reduced-dimension sample set with the fast-search-and-density-peak-discovery algorithm, obtaining the C cluster partitions of the n samples, and outputting the cluster partition of the sample set;
S3, improving the χ² statistic with information entropy and average local density and constructing a sample feature distribution matrix based on the weighted χ² statistic: the features and sample classes of the sample set are weighted within the χ² statistic, a new statistical matrix representing the weighted probability distributions of the features across different classes and within the same class is constructed from the weighted χ² statistic, and feature selection is performed on it to obtain the feature subset T = {t_1, t_2, …, t_p};
S4, analyzing the high-order statistical correlations among the data in the multi-dimensional feature subset T = {t_1, t_2, …, t_p} with a negentropy-based fast fixed-point algorithm, extracting independent feature genes, and removing the high-order redundancy among the components.
Specifically, step S1 specifically includes:
S101, let the sample set X contain n samples x_1, x_2, …, x_n, each sample x_i consisting of m characteristic indexes, denoted X = (x_ij)_{n×m} = (X_1, X_2, …, X_m); a KMO test is performed on the degree of correlation between the samples; when the KMO statistic is greater than 0.5, jump to step S102, otherwise jump to step S106;
S102, calculating the covariance matrix Σ = (h_ij)_{m×m} of X_1, X_2, …, X_m, and determining the number of common factors from the percentage of the sum of the leading characteristic roots in the sum of all characteristic roots;
s103, calculating a factor load matrix, and jumping to the step S104 when the load of each factor on different characteristic indexes is not obviously different, or jumping to the step S105;
s104, rotating the factor load matrix by adopting an orthogonal rotation method;
s105, evaluating the load of the characteristic indexes in the factor load matrix in the corresponding common factors, and reserving a maximum load value;
and S106, outputting a characteristic index matrix of the sample set X.
Further, in step S106, each sample x_i is reduced to u characteristic index factors, forming the finite sample set X^Δ; the characteristic index matrix X* of the n samples is specifically:

X* = (x*_ij)_{n×u}, i = 1, 2, …, n; j = 1, 2, …, u,

where x*_ij denotes the j-th characteristic index factor of the i-th sample.
Specifically, step S2 comprises:
S201, computing the similarity Sim(i, j) between any two data points x*_i and x*_j with the adjusted cosine similarity, and defining the distance variable d_ij from it;
S202, selecting a suitable truncation distance d_c and computing, for any data point x*_i in X*, the local density ρ_i and the distance δ_i from the point to the nearest point of higher local density;
S203, drawing the decision graph from the local densities of all sample points and their distances to points of higher local density, with ρ on the horizontal axis and δ on the vertical axis;
S204, marking the cluster center points and noise points of the sample set in the decision graph according to ρ_i and δ_i;
S205, assigning the remaining points to obtain the C cluster partitions of the n samples, and outputting the cluster partition of the sample set as the basis of the next analysis.
Further, in step S201, the similarity Sim(i, j) between samples x*_i and x*_j is defined as follows:

Sim(i, j) = Σ_{k=1}^{u} (x*_ik − x̄*_i)(x*_jk − x̄*_j) / ( sqrt(Σ_{k=1}^{u} (x*_ik − x̄*_i)²) · sqrt(Σ_{k=1}^{u} (x*_jk − x̄*_j)²) ),

where i, j = 1, 2, …, n, x̄*_i is the mean of the u attribute values of sample x*_i, and u is the number of attributes of the object; the truncation distance d_c is the radius of the circle centered at data point x_i, chosen so that the cumulative count ρ_i satisfies |X| × 2%.
Specifically, step S3 comprises:
S301, weighting the χ² statistic of the sample set with the information entropy values of the features and sample classes (clusters);
S302, building a new statistical matrix K from the weighted χ² statistic, whose rows and columns represent the weighted probability distributions of the features across different classes (clusters) and within the same class (cluster), respectively;
S303, selecting each row t_i of the statistical matrix K in turn, and finding the maximum and minimum values in each row;
S304, converting t_i into the corresponding membership degrees μ_ij and constructing a new class vector B_i whose components b_ij are the membership degrees μ_ij of t_i arranged in descending order;
S305, calculating the sum of the contributions provided by feature t_i to each class;
S306, calculating the cumulative variance contribution rate;
S307, repeating steps S303 to S306 until the cumulative variance contribution rate reaches the preset threshold, obtaining the feature subset T = {t_1, t_2, …, t_p}.
Further, in step S301, within the χ² statistic, the feature t and the sample class c_i are weighted; the weighted χ² statistic is denoted Wχ²(t, c_i), and the weight is defined as the information entropy value of the feature t and the sample class c_i, computed from p(t|c_i), the probability that feature t occurs within sample class c_i, p(c_i), the probability that sample class c_i occurs, and p(t, c_i), the probability that feature t occurs jointly with sample class c_i; ρ̄(c_i) is the average local density of the sample points in class c_i, and C = {c_1, c_2, …, c_k} denotes the set of sample classes. ρ̄(c_i) is defined as

ρ̄(c_i) = (1 / |Rep(c_i)|) · Σ_{x_j ∈ Rep(c_i)} ρ_j,

where Rep(c_i) denotes the sample points in cluster c_i.
Further, in step S302, the statistical matrix K is expressed as:

K = ( Wχ²(t_i, c_j) ), i = 1, 2, …, m; j = 1, 2, …, k,

where the rows and columns represent the weighted probability distributions of the features across different classes and within the same class, respectively.
Specifically, step S4 comprises:
S401, centering the feature subset T = {t_1, t_2, …, t_p} so that its mean is 0;
S402, whitening the centered feature subset T̄ to obtain z;
S403, selecting the number m of independent components to estimate, and setting i = 1;
S404, selecting an initialization vector w_i (which may be chosen randomly) with unit norm;
S405, updating w_i ← E{z g(w_i^T z)} − E{g′(w_i^T z)} w_i, where the function g is the derivative of a non-quadratic function G;
S406, normalizing w_i: w_i ← w_i / ‖w_i‖;
S407, if not yet converged, returning to step S405;
S408, letting i ← i + 1; if i ≤ m, returning to step S404.
Compared with the prior art, the invention has at least the following beneficial effects:
the unsupervised text theme related gene extraction method oriented to the unbalanced large data set does not need to adopt large-scale labeled samples for training, can effectively avoid predefining the class relation and the characteristic structure of the samples, and has more practical value: most samples obtained by crawling means are not labeled with categories, so that the traditional supervised topic discovery method is difficult to implement effectively. The invention is based on an unsupervised feature extraction method, and has no limitation; the method overcomes the influence of an over-sampling or under-sampling method on the class distribution of the original unbalanced data set. The real information in the sample space is accurately reflected by correcting the characteristic class structure, and the method has stronger generalization in the face of an unbalanced large data set; the invention realizes effective characteristic dimension reduction under the condition of keeping the identification capability of the sample set, further reduces noise word interference, weakens the phenomenon of 'empty space' of a high-dimensional data space, and reduces uncertainty in sample analysis.
Furthermore, factor analysis finds an optimal low-dimensional basis describing the original high-dimensional vector space, making it feasible for the density peak algorithm to quickly find the sample clusters of a large-scale data set.
Further, the density peak clustering algorithm is guided by the neighborhood similarity of the sample points to cluster and automatically label the unlabeled text set.
Furthermore, introducing the average local density and information entropy into the feature item weight definition builds a discrimination matrix of feature items over the sample categories (clusters), remedying the shortcomings of traditional feature selection on unbalanced sample sets.
Furthermore, Independent Component Analysis (ICA) is adopted to analyze the high-order correlations among the multi-dimensional statistical data and find mutually independent implicit information components, so that an optimal feature subset comprehensively and truly reflecting the text topic information is accurately selected from the unbalanced large data set, improving text classification and identification performance.
In conclusion, the invention focuses on unsupervised text feature extraction and studies how to select a stable subset of text topic-related genes with strong generalization ability, thereby reducing the feature dimension of the vector space, enhancing the category (cluster) representation ability of the feature words, and improving classification and identification.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
FIG. 1 is a general flow chart of the unsupervised text topic-related gene extraction method for an unbalanced large data set according to the present invention;
FIG. 2 is a flow chart of a sample feature analysis process;
FIG. 3 is a flow chart of a sample clustering process;
FIG. 4 is a flow chart of a feature selection process;
FIG. 5 is a flow chart of the subject gene extraction process;
fig. 6 shows the normalized mutual information values of the algorithms under different numbers of selected features, where (a) is the normalized mutual information (%) of each algorithm on the Sogou news data (SogouCS) 20151022 corpus, and (b) is the normalized mutual information (%) of each algorithm on the Reuter-21578 corpus.
Detailed Description
The invention provides an unsupervised text topic-related gene extraction method for unbalanced large data sets. Factor analysis and the density peak algorithm obtain the clusters of a high-dimensional sample set and label the unlabeled samples; the feature selection method based on the CHI statistical matrix is improved with average local density and information entropy to strengthen the feature expression of low-density and small-sample clusters; and a negentropy-based fast fixed-point algorithm (FastICA) analyzes the high-order statistical correlations among the multi-dimensional data to extract independent implicit topic feature genes and remove the high-order redundancy among the components. The method needs no large-scale labeled samples for training and effectively avoids predefining sample class relations and feature structures; it also overcomes the influence of over-sampling or under-sampling methods on the class distribution of the original unbalanced data set. Correcting the feature class structure greatly improves the performance of the CHI statistical selection method, and effective feature dimension reduction is achieved while the discriminative capability of the sample set is retained.
Referring to fig. 1, the unsupervised text topic-related gene extraction method for an unbalanced big data set of the present invention includes the following steps:
s1, performing dimensionality reduction on high-dimensional samples in the unlabeled sample set by adopting factor analysis, and outputting a characteristic index matrix of the sample set;
Factor analysis is performed on the original feature variables of the sample set, and a few "abstract" variables (i.e., common factors) are selected to replace them, reducing the correlation among sample features and the dimensionality. The specific flow is shown in fig. 2:
s101, performing KMO test on the correlation degree between samples, and jumping to S102 when the KMO statistic is larger than 0.5, otherwise, jumping to S106;
Let the sample set X contain n samples x_1, x_2, …, x_n, each sample x_i consisting of m characteristic indexes, denoted X = (x_ij)_{n×m} = (X_1, X_2, …, X_m);
The KMO (Kaiser-Meyer-Olkin) test determines the degree of correlation among X_1, X_2, …, X_m and hence the necessity of performing factor analysis. The closer the KMO statistic is to 0, the weaker the correlation among X_1, X_2, …, X_m; the closer it is to 1, the stronger the correlation.
Typically, when the KMO statistic is greater than 0.5, performing factor analysis has practical significance.
S102, calculating a sample set X1,X2,…,XmCovariance matrix ∑ ═ hij)m×mDetermining the number of the common factors according to the percentage of the sum of the characteristic roots to the sum of all the characteristic roots;
from the characteristic equation | Σ - λ I | ═ 0 of Σ, the characteristic root of the covariance matrix can be found as λ1≥λ2≥…≥λpNot less than 0, corresponding unit characterThe vector is T1,T2,…,Tp
In addition, according to the processing principle in the practical problem, the first u characteristic roots and the characteristic vectors are taken, so that the sum of the characteristic roots of the u characteristic roots and the characteristic vectors accounts for more than 85% of the sum of all the characteristic roots, and the number of the common factors is determined;
S103, calculating the factor load matrix, and jumping to step S104 when the loads of each factor on the different characteristic indexes do not differ obviously, otherwise jumping to step S105;
The factor load matrix is calculated from the characteristic roots and characteristic vectors of Σ as:

A = ( sqrt(λ_1) T_1, sqrt(λ_2) T_2, …, sqrt(λ_u) T_u ) = (a_ij)_{m×u}.

S104, rotating the factor load matrix with an orthogonal rotation method;
If the loads of each factor on the different characteristic indexes do not differ obviously, the factor load matrix needs to be rotated; an orthogonal rotation method is generally adopted, giving the rotated factor load matrix A′ = (b_ij)_{m×u}.
The operation b_ip = Max{b_i1, b_i2, …, b_iu}, i = 1, 2, …, m, p ∈ {1, 2, …, u} is applied to the row vectors of the rotated factor load matrix A′, retaining for each characteristic index X_i the maximum load value b_ip among its u factors, which gives the matrix:

A* = (b′_ij)_{m×u}, where b′_ij = b_ij when j is the maximum-load factor p of row i, and 0 otherwise, i = 1, 2, …, m; j = 1, 2, …, u;
S105, evaluating the loads of the characteristic indexes on the corresponding common factors in the factor load matrix, and retaining the maximum load value;
S106, outputting the characteristic index matrix of the sample set as the basis of the next analysis.
Through the above operations, the sample set X is simplified to a finite sample set X^Δ containing n samples, each sample x_i consisting of u characteristic index factors; the characteristic index matrix of the n samples is thus constructed as:

X* = (x*_ij)_{n×u},

where x*_ij denotes the j-th characteristic index factor of the i-th sample, i = 1, 2, …, n; j = 1, 2, …, u.
S2, performing exploratory clustering on the reduced-dimension sample set with the fast-search-and-density-peak-discovery algorithm;
For each sample expressed by the common factors, the local density and the distance to points of higher local density are analyzed, the decision graph is drawn, and the cluster partition of the sample set is generated. The specific flow is shown in fig. 3:
S201, calculating the similarity between any two row vectors of the characteristic index matrix of the sample set;
The adjusted cosine similarity is used to compute the similarity Sim(i, j) between any two data points x*_i and x*_j and to define the distance variable d_ij. The similarity Sim(i, j) between samples x*_i and x*_j is defined as follows:

Sim(i, j) = Σ_{k=1}^{u} (x*_ik − x̄*_i)(x*_jk − x̄*_j) / ( sqrt(Σ_{k=1}^{u} (x*_ik − x̄*_i)²) · sqrt(Σ_{k=1}^{u} (x*_jk − x̄*_j)²) ),

where i, j = 1, 2, …, n, x̄*_i is the mean of the u attribute values of sample x*_i, and u is the number of attributes of the object; the truncation distance d_c is the radius of the circle centered at data point x_i, chosen so that the cumulative count ρ_i satisfies |X| × 2%.
S202, selecting a suitable truncation distance and computing, for any data point x*_i in X*, the local density ρ_i and the distance δ_i from the point to the nearest point of higher local density;
Let the local density of data point x_i be

ρ_i = Σ_{j≠i} χ(d_ij − d_c), with χ(x) = 1 if x < 0 and χ(x) = 0 otherwise,

and let the distance from data point x_i to the closest data point x_j whose local density is greater than its own be

δ_i = min_{j: ρ_j > ρ_i} d_ij,

where d_ij is the distance between the data points and d_c is the truncation distance (a hyper-parameter).
S203, drawing the decision graph from the local densities of all sample points and their distances to points of higher local density, with ρ on the horizontal axis and δ on the vertical axis;
S204, marking the cluster center points and noise points of the sample set in the decision graph according to ρ_i and δ_i;
ρ_i and δ_i are computed for every data point x_i; the C points whose ρ_i and δ_i are largest after sorting are labeled as cluster centers, and every remaining data point is assigned to the cluster of its nearest neighbor among the points denser than itself.
S205, assigning the remaining points yields the C cluster partitions of the n samples, and the cluster partition of the sample set is output as the basis of the next analysis.
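The flow of S201–S205 can be sketched compactly in numpy. The 2% rule for the truncation distance and the adjusted cosine similarity follow the text; treating the decision graph as a top-C selection on the product ρ·δ, and omitting explicit noise-point marking, are simplifying assumptions for illustration.

```python
import numpy as np

def adjusted_cosine_distance(X):
    """S201: d_ij = 1 - Sim(i, j), each sample centered by its own row mean."""
    Xc = X - X.mean(axis=1, keepdims=True)
    norm = np.linalg.norm(Xc, axis=1, keepdims=True)
    return 1.0 - (Xc @ Xc.T) / (norm * norm.T + 1e-12)

def density_peak_cluster(X, n_centers):
    n = X.shape[0]
    d = adjusted_cosine_distance(X)
    dc = np.quantile(d[np.triu_indices(n, k=1)], 0.02)  # S202: 2% truncation distance
    rho = (d < dc).sum(axis=1) - 1                      # local density rho_i
    delta = np.zeros(n)
    nearest_denser = np.full(n, -1)
    order = np.argsort(rho)[::-1]                       # decreasing density
    for rank, i in enumerate(order):
        if rank == 0:
            delta[i] = d[i].max()                       # convention for the densest point
            continue
        denser = order[:rank]                           # points with higher rho
        j = denser[np.argmin(d[i, denser])]
        delta[i], nearest_denser[i] = d[i, j], j
    centers = np.argsort(rho * delta)[::-1][:n_centers] # S203-S204: decision-graph peaks
    labels = np.full(n, -1)
    labels[centers] = np.arange(n_centers)
    for i in order:                                     # S205: assign remaining points
        if labels[i] < 0:
            j = nearest_denser[i]
            labels[i] = labels[j] if j >= 0 else 0
    return labels, rho

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    c1, c2 = np.zeros(8), np.zeros(8)
    c1[0], c2[1] = 3.0, 3.0                             # two clusters along different axes
    X = np.vstack([c1 + rng.normal(0, .3, (80, 8)), c2 + rng.normal(0, .3, (20, 8))])
    labels, _ = density_peak_cluster(X, n_centers=2)
    print(np.bincount(labels))                          # unbalanced cluster sizes, e.g. [80 20]
```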
S3, improving the χ² statistic with information entropy and average local density and constructing a sample feature distribution matrix based on the weighted χ² statistic: the features and sample categories (clusters) of the sample set are weighted within the χ² statistic, with the weight defined as the information entropy value of the feature and the sample category (cluster); a new statistical matrix representing the weighted probability distributions of the features across different categories (clusters) and within the same category (cluster) is constructed from the weighted χ² statistic, and feature selection is performed on this basis;
The specific flow is shown in fig. 4:
S301, weighting the χ² statistic of the sample set with the information entropy values of the features and sample categories (clusters);
Within the χ² statistic, the feature t and the sample class (cluster) c_i are weighted; the weighted χ² statistic is denoted Wχ²(t, c_i), and the weight is defined as the information entropy value of the feature t and the sample class (cluster) c_i, computed from p(t|c_i), the probability that feature t occurs within sample class (cluster) c_i, p(c_i), the probability that sample class (cluster) c_i occurs, and p(t, c_i), the probability that feature t occurs jointly with sample class (cluster) c_i; ρ̄(c_i) is the average local density of the sample points in class (cluster) c_i, and C = {c_1, c_2, …, c_k} denotes the set of sample classes (clusters).
ρ̄(c_i) is defined as

ρ̄(c_i) = (1 / |Rep(c_i)|) · Σ_{x_j ∈ Rep(c_i)} ρ_j,

where Rep(c_i) denotes the sample points in cluster c_i.
S302, building a new statistical matrix K from the weighted χ² statistic, the rows and columns of which represent the weighted probability distributions of the features across different categories (clusters) and within the same category (cluster), respectively;
The weighted χ² statistics are assembled into the statistical matrix K, expressed as

K = ( Wχ²(t_i, c_j) ), i = 1, 2, …, m; j = 1, 2, …, k,

where the rows and columns represent the weighted probability distributions of the features across different categories (clusters) and within the same category (cluster), respectively.
S303, selecting each row t_i of the statistical matrix K in turn, and finding the maximum and minimum values in each row;
S304, converting t_i into the corresponding membership degrees μ_ij and constructing a new category (cluster) vector B_i whose components b_ij are the membership degrees μ_ij of t_i arranged in descending order;
S305, calculating the sum of the contributions provided by feature t_i to each class;
S306, calculating the cumulative variance contribution rate;
S307, repeating steps S303 to S306 until the cumulative variance contribution rate reaches the preset threshold, whereupon the feature subset T = {t_1, t_2, …, t_p} is obtained.
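The exact forms of Wχ² and of the contribution quantities in S304–S306 sit in formula images that did not survive extraction, so the sketch below is one plausible reading rather than the patent's definition: the classical CHI-square statistic per (feature, cluster) pair, weighted by an entropy term and the cluster's average local density, followed by a min-max membership mapping and a cumulative-contribution cutoff. All names, the membership mapping, and the 0.85 threshold are assumptions.

```python
import numpy as np

def weighted_chi2_matrix(X_bin, labels, rho):
    """Statistical matrix K of step S302 (one plausible reading).
    X_bin: (n, m) binary feature-occurrence matrix; labels: cluster ids from
    the density-peak step; rho: per-sample local densities."""
    n, m = X_bin.shape
    classes = np.unique(labels)
    K = np.zeros((m, len(classes)))
    for j, c in enumerate(classes):
        in_c = labels == c
        avg_rho = rho[in_c].mean()                   # average local density of cluster c_i
        for t in range(m):
            A = float((X_bin[in_c, t] > 0).sum())    # t present, class c
            B = float((X_bin[~in_c, t] > 0).sum())   # t present, other classes
            C_ = in_c.sum() - A                      # t absent, class c
            D = (~in_c).sum() - B                    # t absent, other classes
            denom = (A + C_) * (B + D) * (A + B) * (C_ + D)
            chi2 = n * (A * D - C_ * B) ** 2 / denom if denom else 0.0
            p_tc = A / n                             # p(t, c_i)
            H = -p_tc * np.log2(p_tc) if p_tc > 0 else 0.0  # entropy weight (assumed form)
            K[t, j] = H * avg_rho * chi2
    return K

def select_features(K, ccr=0.85):
    """S303-S307: min-max memberships per row, per-feature contribution sums,
    and the smallest prefix whose cumulative contribution rate reaches ccr."""
    span = K.max(axis=1) - K.min(axis=1)
    mu = (K - K.min(axis=1, keepdims=True)) / np.where(span > 0, span, 1.0)[:, None]
    contrib = mu.sum(axis=1)                         # S305: contribution to all classes
    order = np.argsort(contrib)[::-1]
    cum = np.cumsum(contrib[order]) / contrib.sum()  # S306: cumulative contribution rate
    p = int(np.searchsorted(cum, ccr)) + 1
    return order[:p]                                 # indices of T = {t_1, ..., t_p}
```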
S4, analyzing the high-order statistical correlations among the data in the multi-dimensional feature subset T = {t_1, t_2, …, t_p} with the negentropy-based fast fixed-point algorithm (FastICA), extracting independent feature genes, and removing the high-order redundancy among the components.
Referring to fig. 5, the process of extracting the topic-related genes is as follows:
S401, centering the feature subset T = {t_1, t_2, …, t_p} so that its mean is 0;
S402, whitening the centered feature subset T̄ to obtain z;
S403, selecting the number m of independent components to estimate, and setting i = 1;
S404, selecting an initialization vector w_i (which may be chosen randomly) with unit norm;
S405, updating w_i ← E{z g(w_i^T z)} − E{g′(w_i^T z)} w_i, where the function g is the derivative of a non-quadratic function G;
S406, normalizing w_i: w_i ← w_i / ‖w_i‖;
S407, if not yet converged, returning to step S405;
S408, letting i ← i + 1; if i ≤ m, returning to step S404.
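Steps S401–S408 are the standard deflation form of negentropy-based FastICA; a minimal numpy sketch follows, assuming g = tanh (the derivative of the non-quadratic G(u) = log cosh u). The Gram-Schmidt decorrelation against components already found is the usual addition that keeps successive w_i from converging to the same component; it is implied by, but not spelled out in, the steps above.

```python
import numpy as np

def fastica_deflation(T, m, max_iter=200, tol=1e-6, seed=0):
    """Negentropy-based fast fixed-point ICA (S401-S408), deflation scheme.
    T: (p, N) matrix whose rows are the p selected features over N samples."""
    rng = np.random.default_rng(seed)
    X = T - T.mean(axis=1, keepdims=True)            # S401: center, mean 0
    lam, E = np.linalg.eigh(np.cov(X))               # S402: whiten via eigendecomposition
    z = (E / np.sqrt(lam + 1e-12)).T @ X             # whitened data, E{z z^T} = I
    W = np.zeros((m, z.shape[0]))
    for i in range(m):                               # S403/S408: one component at a time
        w = rng.normal(size=z.shape[0])              # S404: random start...
        w /= np.linalg.norm(w)                       # ...with unit norm
        for _ in range(max_iter):
            wz = w @ z
            g = np.tanh(wz)                          # g = G', with G(u) = log cosh u
            w_new = (z * g).mean(axis=1) - (1 - g ** 2).mean() * w   # S405 update
            w_new -= W[:i].T @ (W[:i] @ w_new)       # deflation: decorrelate from found w's
            w_new /= np.linalg.norm(w_new)           # S406: normalize
            converged = abs(abs(w_new @ w) - 1.0) < tol   # S407: sign-insensitive test
            w = w_new
            if converged:
                break
        W[i] = w
    return W @ z                                     # independent topic feature genes
```

In practice m is chosen no larger than the whitened dimension p; the rows of the returned matrix are the estimated independent components.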
Topic-related gene extraction is a key step in preprocessing industrial and social big data: by finding mutually independent implicit information components, the optimal feature subset that comprehensively and truly reflects the text topic information is accurately selected from the unbalanced large data set, significantly improving the recognition performance of the classifier.
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
To further test the practical effect of the method, two corpora, the Sogou news data (SogouCS) 20151022 and Reuter-21578, are selected for the simulation experiments; the k-means clustering algorithm is applied, and the normalized mutual information between each algorithm's clustering result and the original category information is analyzed to measure the effectiveness of the algorithms. Because k-means requires the number of clusters to be given explicitly, and to reduce the influence of the choice of K on the method, the cluster number K of the proposed and comparison methods is set to the number of categories contained in each data label, namely 20, 10 and 12. Fig. 6(a) and (b) show the normalized mutual information values of the algorithms for different numbers of selected features.
Graphs (a) and (b) show that the unsupervised text topic-related gene extraction method for unbalanced large data sets has obvious advantages over the other four algorithms, and that the proposed method quickly achieves good results with a small number of features; unsupervised feature selection with the proposed algorithm therefore performs better than general unsupervised feature selection algorithms.
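The evaluation protocol of this experiment — k-means on the selected features, scored by normalized mutual information against the original category labels — can be reproduced with scikit-learn. The corpus loading and preprocessing are not specified here, so the call at the bottom is only an illustration with assumed variable names:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def nmi_for_selection(X_selected, true_labels, k, seed=0):
    """Cluster samples restricted to the selected features with k-means and
    score the result against the original category labels, as in Fig. 6."""
    pred = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X_selected)
    return normalized_mutual_info_score(true_labels, pred)

# illustrative only; X_sel and y would come from the corpus preprocessing,
# with k fixed to the number of label categories (e.g. k = 20 for SogouCS).
# score = nmi_for_selection(X_sel, y, k=20)
```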
In conclusion, the unsupervised text topic-related gene extraction method for unbalanced large data sets needs no large-scale labeled samples for training, avoids predefining class relations and related features, and solves the poor model generalization caused by unbalanced sample class distributions. On top of the text clustering method based on fast search and density peak discovery, a weighted χ² text feature distribution matrix is constructed with information entropy, avoiding the changes to the class distribution of the original unbalanced data set caused by over-sampling or under-sampling methods and greatly improving the performance of the CHI statistical selection method by correcting the feature class distribution. Finally, the negentropy-based fast fixed-point algorithm (FastICA) extracts independent implicit information components among the multi-dimensional data; the generalization performance of the resulting feature subset is superior to RSR, FSFC, UFS-MI and RUFS, and feature dimensionality reduction is achieved while the identification capability of the data set is retained.
In addition, feature dimensionality reduction is a key step in preprocessing the industrial and social big data brought by informatization, and the text topic-related gene extraction idea proposed here can play an even greater role in these fields; adapting it to the data processing needs of these fields is research work to be carried out in the future.
The above-mentioned contents are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereby, and any modification made on the basis of the technical idea of the present invention falls within the protection scope of the claims of the present invention.

Claims (9)

1. An unsupervised text topic-related gene extraction method for unbalanced large data sets, characterized by comprising the following steps:
s1, performing dimensionality reduction on high-dimensional samples in the unlabeled sample set by adopting factor analysis, and outputting a characteristic index matrix of the sample set;
S2, analyzing, for each sample expressed by the common factors, the local density and the distance to points of higher local density, drawing a decision graph, performing exploratory clustering on the reduced-dimension sample set with the fast-search-and-density-peak-discovery algorithm, obtaining the C cluster partitions of the n samples, and outputting the cluster partition of the sample set;
S3, improving the χ² statistic with information entropy and average local density and constructing a sample feature distribution matrix based on the weighted χ² statistic: the features and sample classes of the sample set are weighted within the χ² statistic, a new statistical matrix representing the weighted probability distributions of the features across different classes and within the same class is constructed from the weighted χ² statistic, and feature selection is performed on it to obtain the feature subset T = {t_1, t_2, …, t_p};
S4, analyzing the high-order statistical correlations among the data in the multi-dimensional feature subset T = {t_1, t_2, …, t_p} with a negentropy-based fast fixed-point algorithm, extracting independent feature genes, and removing the high-order redundancy among the components.
2. The unsupervised text topic-related gene extraction method for the unbalanced big data set as recited in claim 1, wherein the step S1 specifically comprises:
S101, let the sample set X contain n samples x_1, x_2, …, x_n, each sample x_i consisting of m characteristic indexes, denoted X = (x_ij)_{n×m} = (X_1, X_2, …, X_m); a KMO test is performed on the degree of correlation between the samples; when the KMO statistic is greater than 0.5, jump to step S102, otherwise jump to step S106;
S102, calculating the covariance matrix Σ = (h_ij)_{m×m} of X_1, X_2, …, X_m, and determining the number of common factors from the percentage of the sum of the leading characteristic roots in the sum of all characteristic roots;
s103, calculating a factor load matrix, and jumping to the step S104 when the load of each factor on different characteristic indexes is not obviously different, or jumping to the step S105;
s104, rotating the factor load matrix by adopting an orthogonal rotation method;
s105, evaluating the load of the characteristic indexes in the factor load matrix in the corresponding common factors, and reserving a maximum load value;
and S106, outputting a characteristic index matrix of the sample set X.
3. The unsupervised text topic-related gene extraction method for the unbalanced big data set as recited in claim 2, wherein in step S106 each sample x_i is reduced to u characteristic index factors, forming the finite sample set X^Δ, and the characteristic index matrix X* of the n samples is specifically:

X* = (x*_ij)_{n×u}, i = 1, 2, …, n; j = 1, 2, …, u,

where x*_ij denotes the j-th characteristic index factor of the i-th sample.
4. The unsupervised text topic-related gene extraction method for the unbalanced big data set as recited in claim 1, wherein step S2 specifically comprises:
S201, computing the similarity Sim(i, j) between any two data points x*_i and x*_j with the adjusted cosine similarity, and defining the distance variable d_ij from it;
S202, selecting a suitable truncation distance d_c and computing, for any data point x*_i in X*, the local density ρ_i and the distance δ_i from the point to the nearest point of higher local density;
S203, drawing the decision graph from the local densities of all sample points and their distances to points of higher local density, with ρ on the horizontal axis and δ on the vertical axis;
S204, marking the cluster center points and noise points of the sample set in the decision graph according to ρ_i and δ_i;
S205, assigning the remaining points to obtain the C cluster partitions of the n samples, and outputting the cluster partition of the sample set as the basis of the next analysis.
5. The unsupervised text topic-related gene extraction method for the unbalanced big data set as recited in claim 4, wherein in step S201 the similarity Sim(i, j) between samples x*_i and x*_j is defined as follows:

Sim(i, j) = Σ_{k=1}^{u} (x*_ik − x̄*_i)(x*_jk − x̄*_j) / ( sqrt(Σ_{k=1}^{u} (x*_ik − x̄*_i)²) · sqrt(Σ_{k=1}^{u} (x*_jk − x̄*_j)²) ),

where i, j = 1, 2, …, n, x̄*_i is the mean of the u attribute values of sample x*_i, and u is the number of attributes of the object; the truncation distance d_c is the radius of the circle centered at data point x_i, chosen so that the cumulative count ρ_i satisfies |X| × 2%.
6. The unsupervised text topic-related gene extraction method for the unbalanced big data set as recited in claim 1, wherein step S3 specifically comprises:
S301, weighting the χ² statistic of the sample set with the information entropy values of the features and sample categories (clusters);
S302, building a new statistical matrix K from the weighted χ² statistic, whose rows and columns represent the weighted probability distributions of the features across different categories (clusters) and within the same category (cluster), respectively;
S303, selecting each row t_i of the statistical matrix K in turn, and finding the maximum and minimum values in each row;
S304, converting t_i into the corresponding membership degrees μ_ij and constructing a new category vector B_i whose components b_ij are the membership degrees μ_ij of t_i arranged in descending order;
S305, calculating the sum of the contributions provided by feature t_i to each class;
S306, calculating the cumulative variance contribution rate;
S307, repeating steps S303 to S306 until the cumulative variance contribution rate reaches the preset threshold, obtaining the feature subset T = {t_1, t_2, …, t_p}.
7. The unsupervised text topic-related gene extraction method for the unbalanced big data set as recited in claim 6, wherein in step S301, within the χ² statistic, the feature t and the sample class c_i are weighted; the weighted χ² statistic is denoted Wχ²(t, c_i), and the weight is defined as the information entropy value of the feature t and the sample class c_i, computed from p(t|c_i), the probability that feature t occurs within sample class c_i, p(c_i), the probability that sample class c_i occurs, and p(t, c_i), the probability that feature t occurs jointly with sample class c_i; ρ̄(c_i) is the average local density of the sample points in class c_i, and C = {c_1, c_2, …, c_k} denotes the set of sample classes; ρ̄(c_i) is defined as

ρ̄(c_i) = (1 / |Rep(c_i)|) · Σ_{x_j ∈ Rep(c_i)} ρ_j,

where Rep(c_i) denotes the sample points in cluster c_i.
8. The unsupervised text topic-related gene extraction method for the unbalanced big data set as recited in claim 6, wherein in step S302 the statistical matrix K is expressed as:

K = ( Wχ²(t_i, c_j) ), i = 1, 2, …, m; j = 1, 2, …, k,

where the rows and columns represent the weighted probability distributions of the features across different categories and within the same category, respectively.
9. The unsupervised text topic-related gene extraction method for the unbalanced big data set as recited in claim 1, wherein step S4 specifically comprises:
S401, centering the feature subset T = {t_1, t_2, …, t_p} so that its mean is 0;
S402, whitening the centered feature subset T̄ to obtain z;
S403, selecting the number m of independent components to estimate, and setting i = 1;
S404, selecting an initialization vector w_i (which may be chosen randomly) with unit norm;
S405, updating w_i ← E{z g(w_i^T z)} − E{g′(w_i^T z)} w_i, where the function g is the derivative of a non-quadratic function G;
S406, normalizing w_i: w_i ← w_i / ‖w_i‖;
S407, if not yet converged, returning to step S405;
S408, letting i ← i + 1; if i ≤ m, returning to step S404.
CN202010255801.8A 2020-04-02 2020-04-02 Unsupervised text theme related gene extraction method for unbalanced big data set Pending CN111460161A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010255801.8A CN111460161A (en) 2020-04-02 2020-04-02 Unsupervised text theme related gene extraction method for unbalanced big data set


Publications (1)

Publication Number Publication Date
CN111460161A (en) 2020-07-28

Family

ID=71684436

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010255801.8A Pending CN111460161A (en) 2020-04-02 2020-04-02 Unsupervised text theme related gene extraction method for unbalanced big data set

Country Status (1)

Country Link
CN (1) CN111460161A (en)


Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112182164A (en) * 2020-10-16 2021-01-05 上海明略人工智能(集团)有限公司 High-dimensional data feature processing method and system
CN112182164B (en) * 2020-10-16 2024-02-23 上海明略人工智能(集团)有限公司 High-dimensional data feature processing method and system
CN112907035A (en) * 2021-01-27 2021-06-04 厦门卫星定位应用股份有限公司 K-means-based transportation subject credit rating method and device
CN112907035B (en) * 2021-01-27 2022-08-05 厦门卫星定位应用股份有限公司 K-means-based transportation subject credit rating method and device
CN114124536A (en) * 2021-11-24 2022-03-01 四川九洲电器集团有限责任公司 Multi-station detection signal tracing method
CN115952432A (en) * 2022-12-21 2023-04-11 四川大学华西医院 Unsupervised clustering method based on diabetes data
CN115952432B (en) * 2022-12-21 2024-03-12 四川大学华西医院 Unsupervised clustering method based on diabetes data

Similar Documents

Publication Publication Date Title
CN111460161A (en) Unsupervised text theme related gene extraction method for unbalanced big data set
Marsili Dissecting financial markets: sectors and states
Redmond et al. A method for initialising the K-means clustering algorithm using kd-trees
Fayyad Knowledge discovery in databases: An overview
WO2022126810A1 (en) Text clustering method
CN111144106A (en) Two-stage text feature selection method under unbalanced data set
CN112270596A (en) Risk control system and method based on user portrait construction
CN113688906A (en) Customer segmentation method and system based on quantum K-means algorithm
Rahman et al. An efficient approach for selecting initial centroid and outlier detection of data clustering
Mehrotra et al. To identify the usage of clustering techniques for improving search result of a website
CN115186138A (en) Comparison method and terminal for power distribution network data
Akyol Clustering hotels and analyzing the importance of their features by machine learning techniques
CN115098674A (en) Method for generating confrontation network generation data based on cloud ERP supply chain ecosphere
Sundari et al. A study of various text mining techniques
CN112884028A (en) System resource adjusting method, device and equipment
Peleja et al. Text Categorization: A comparison of classifiers, feature selection metrics and document representation
CN111382273A (en) Text classification method based on feature selection of attraction factors
CN114281994B (en) Text clustering integration method and system based on three-layer weighting model
Luo et al. A comparison of som based document categorization systems
Jing-Ming et al. Unsupervised Text Topic-Related Gene Extraction for Large Unbalanced Datasets
Gupta et al. A detailed Study of different Clustering Algorithms in Data Mining
CN117932072B (en) Text classification method based on feature vector sparsity
CN113688229B (en) Text recommendation method, system, storage medium and equipment
Ibitoye et al. Customer Churn Predictive Analytics using Relative Churn Fuzzy Feature-Weight Model in Telecoms
CN118378180B (en) Financial big data analysis method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200728