CN114492566A - Weight-adjustable high-dimensional data dimension reduction method and system

Info

Publication number: CN114492566A (application CN202111557901.7A)
Authority: CN (China)
Prior art keywords: data, dimensional, weight, matrix, dimensional space
Legal status: Pending
Original language: Chinese (zh)
Inventors: 杨旭东, 张树巍, 刘焰明, 张庆明
Current and original assignee: Southwest University of Science and Technology
Priority and filing date: 2021-12-20; publication date: 2022-05-13


Classifications

    • G06F 18/213: Feature extraction, e.g. by transforming the feature space; summarisation; mappings, e.g. subspace methods
    • G06F 18/22: Matching criteria, e.g. proximity measures
    • G06F 18/23213: Non-hierarchical clustering techniques using statistics or function optimisation, e.g. modelling of probability density functions, with a fixed number of clusters, e.g. k-means clustering
    • G06N 3/006: Artificial life, i.e. computing arrangements simulating life, based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]


Abstract

The invention discloses a weight-adjustable dimension reduction method and system for high-dimensional data, in the technical field of data dimension reduction. The dimension reduction method comprises the following steps: Step 1, extracting the data; Step 2, obtaining an attribute weight matrix; Step 3, calculating the weighted Euclidean point-pair distances; Step 4, calculating the high-dimensional-space joint probabilities; and Step 5, obtaining the low-dimensional-space point distribution. The invention addresses the low dimension-reduction accuracy, large errors and related problems of the prior art.

Description

Weight-adjustable high-dimensional data dimension reduction method and system
Technical Field
The invention relates to the technical field of data dimension reduction, in particular to a weight-adjustable high-dimensional data dimension reduction method and system.
Background
Human society is entering the big-data era. With the rapid development of computer and information technology, industries across society are being digitised, and ever more data are generated and stored. How to convert complex high-dimensional data into low-dimensional data that can be conveniently observed and further used is an important and pressing problem. Most current dimension reduction methods are classified as linear or nonlinear and are chiefly represented by PCA, MDS, t-SNE and the like. t-SNE measures the similarity between point pairs in the high-dimensional and low-dimensional spaces through conditional probabilities and uses the KL divergence as the objective function, so the low-dimensional space preserves a good embedding. Because the t-SNE algorithm adopts a Gaussian kernel function when calculating the similarity of high-dimensional point pairs, it must calculate the Euclidean distance between point pairs. Owing to the characteristics of the data and the differences between attributes, the contribution of each attribute to that distance is not equally important, so a Gaussian kernel built on the plain Euclidean distance cannot fully reflect the true probability structure of the high-dimensional space. Dimension reduction on this basis is therefore not ideal, it is difficult to reduce dimensions accurately and flexibly according to the characteristics of the data, and the effect weakens as the complexity of the data grows.
Several dimension reduction and clustering methods for high-dimensional complex data exist. In the patent "Re-recognition method for confusing digital handwriting" (Chinese patent publication No. CN109034021A, published 2018.12.18), the distance calculation for the original t-SNE high-dimensional points is group-weighted, reducing the dimension-reduction error and improving the re-recognition accuracy. However, the differences among the attributes of the original data are not considered; only the already-calculated Euclidean distances are group-weighted, which remains limited when analysing multi-attribute data. In the patent "A score clustering analysis method based on t-SNE" (Chinese patent publication No. CN111625576A, published 2020.09.04), the t-SNE algorithm is applied directly to reduce the dimension of high-dimensional student score data. Although the visual experimental results show an effect on that data, the data have strong attribute characteristics, the correlations among attributes are not considered, and quantitative comparison indices for the experimental results are lacking.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a weight-adjustable high-dimensional data dimension reduction method and system, solving the problems of low dimension-reduction accuracy, large errors and the like in the prior art.
The technical scheme adopted by the invention for solving the problems is as follows:
a weight-adjustable high-dimensional data dimension reduction method comprises the following steps:
Step 1, extracting data: extract n pieces of m-dimensional high-dimensional data to form an n × m data matrix X;

$$X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1m} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nm} \end{bmatrix}$$

where $x_{ik}$ is the element in row $i$, column $k$ of the high-dimensional data; $n > 2$ and $m > 3$ are positive integers; $i$ is a positive integer with $1 \le i \le n$; and $k$ is a positive integer with $1 \le k \le m$;
Step 2, obtaining an attribute weight matrix: perform the attribute weight calculation on the data matrix X to obtain the attribute weight matrix weight;

$$\mathrm{weight} = [\,wc_1 \;\cdots\; wc_i \;\cdots\; wc_m\,]$$

where $wc_i$ is the attribute weight of the $i$-th column of data in the data matrix X;
Step 3, calculating the weighted Euclidean point-pair distances: substitute weight into the high-dimensional-space point-pair Euclidean distance formula to obtain the attribute-weighted Euclidean point-pair distance matrix D between all point pairs;

$$D = \begin{bmatrix} d_{11} & \cdots & d_{1n} \\ \vdots & \ddots & \vdots \\ d_{n1} & \cdots & d_{nn} \end{bmatrix}$$

where

$$d_{ij} = \sqrt{\sum_{k=1}^{m} wc_k\,(x_{ik} - x_{jk})^2}$$

$d_{ij}$ is the weighted Euclidean distance in the high-dimensional space between the $i$-th and $j$-th rows of data in the data matrix X, the high-dimensional space being a space of dimension greater than 3; $x_{ik}$ is the element in row $i$, column $k$, and $x_{jk}$ the element in row $j$, column $k$, of the data matrix X;
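As an illustration of Step 3, a minimal numpy sketch of the attribute-weighted distance matrix follows; the function name and array shapes are illustrative, not part of the patent:

```python
import numpy as np

def weighted_distance_matrix(X: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """Attribute-weighted Euclidean point-pair distances.

    X      : (n, m) standardised data matrix
    weight : (m,) attribute weights wc_1 ... wc_m
    Returns the (n, n) matrix D with D[i, j] = sqrt(sum_k wc_k * (x_ik - x_jk)^2).
    """
    diff = X[:, None, :] - X[None, :, :]          # (n, n, m) pairwise differences
    return np.sqrt((weight * diff ** 2).sum(axis=-1))
```

The broadcasted form keeps an (n, n, m) intermediate in memory; for large n, a chunked loop over rows would be the usual trade of time for memory.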
Step 4, calculating the high-dimensional-space joint probabilities: from the attribute-weighted Euclidean point-pair distance matrix D, continue to calculate the joint probabilities $p_{ij}$ of the high-dimensional space of the data matrix X;

Step 5, obtaining the low-dimensional-space point distribution: calculate the low-dimensional-space joint probabilities $q_{ij}$, adopt the KL divergence as the objective function, and continue to calculate the low-dimensional-space similarity until the value of the KL divergence converges, obtaining the distribution of the low-dimensional-space points; the low-dimensional space is a space of dimension less than or equal to 3.
As a preferred technical scheme, in Step 2 the attribute weights are calculated with the SVD weighting method and the Critic weighting method respectively; the two weight vectors are denoted $\mathrm{weight}_a$ and $\mathrm{weight}_b$. $\mathrm{weight}_a$ and $\mathrm{weight}_b$ are then used as the initial positions of the two sample points of a particle swarm algorithm, and the attribute weights corresponding to the global optimal solution are calculated.
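The patent names the SVD and Critic weighting methods without spelling out their formulas; the sketch below therefore uses the standard CRITIC construction (column dispersion times accumulated conflict with the other columns) and one plausible SVD-based scoring, both stated as assumptions rather than as the patent's definitions:

```python
import numpy as np

def critic_weights(X: np.ndarray) -> np.ndarray:
    """CRITIC weights (standard form): each column's standard deviation times
    its total conflict, the sum of (1 - correlation), with the other columns."""
    std = X.std(axis=0, ddof=1)
    corr = np.corrcoef(X, rowvar=False)           # (m, m) attribute correlations
    info = std * (1.0 - corr).sum(axis=0)
    return info / info.sum()

def svd_weights(X: np.ndarray) -> np.ndarray:
    """An assumed SVD-based weighting: score each attribute by the
    singular-value-weighted magnitude of its right-singular-vector loadings."""
    _, s, vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
    score = (s[:, None] * np.abs(vt)).sum(axis=0)  # (m,) attribute scores
    return score / score.sum()
```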
As a preferred embodiment, in Step 4, based on the set perplexity value, a binary search is used to find the optimal standard deviation $\sigma_i$ centred on the $i$-th row of data in the data matrix X and the optimal standard deviation $\sigma_j$ centred on the $j$-th row of data; the conditional probabilities $p_{i|j}$ and $p_{j|i}$ of the high-dimensional data matrix are calculated, and then the joint probability $p_{ij}$ of the high-dimensional space; the calculation formulas are as follows:

$$p_{j|i} = \frac{\exp\!\big(-d_{ij}^2 / 2\sigma_i^2\big)}{\sum_{k \neq i} \exp\!\big(-d_{ik}^2 / 2\sigma_i^2\big)}$$

$$p_{i|j} = \frac{\exp\!\big(-d_{ij}^2 / 2\sigma_j^2\big)}{\sum_{k \neq j} \exp\!\big(-d_{kj}^2 / 2\sigma_j^2\big)}$$

$$p_{ij} = \frac{p_{i|j} + p_{j|i}}{2n}$$

where $k$ is a positive integer with $1 \le k \le n$.
As a preferred technical solution, the condition for stopping the binary search is as follows: the absolute value of the difference between the set perplexity and the currently calculated perplexity is less than 0.0001, or the number of binary searches exceeds 50.
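A hedged sketch of the binary search for one bandwidth $\sigma_i$ under the stopping rule above (tolerance 0.0001, at most 50 splits); the starting value and search bounds are illustrative assumptions:

```python
import numpy as np

def sigma_for_row(d_i: np.ndarray, i: int, target_perplexity: float,
                  tol: float = 1e-4, max_iter: int = 50) -> float:
    """Binary-search the bandwidth sigma_i so that the perplexity 2**H of
    row i's conditional distribution matches the set value."""
    lo, hi = 0.0, np.inf
    sigma = 1.0                                    # assumed starting bandwidth
    for _ in range(max_iter):
        p = np.exp(-d_i ** 2 / (2.0 * sigma ** 2))
        p[i] = 0.0                                 # a point is not its own neighbour
        p /= p.sum()
        h = -(p[p > 0] * np.log2(p[p > 0])).sum()  # Shannon entropy, in bits
        perp = 2.0 ** h
        if abs(perp - target_perplexity) < tol:    # the 0.0001 criterion
            break
        if perp > target_perplexity:               # too flat: shrink sigma
            hi = sigma
            sigma = (lo + hi) / 2.0
        else:                                      # too peaked: grow sigma
            lo = sigma
            sigma = sigma * 2.0 if np.isinf(hi) else (lo + hi) / 2.0
    return sigma
```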
As a preferred technical solution, in Step 5, with the KL divergence as the objective function, the positions of all points in the low-dimensional space are continuously updated by a gradient descent method, and the conditional and joint probabilities of the low-dimensional space and the new value of the KL divergence are recalculated until the value of the KL divergence function C converges, obtaining the distribution of the low-dimensional-space points;

$$q_{ij} = \frac{\big(1 + \lVert y_i - y_j \rVert^2\big)^{-1}}{\sum_{k \neq l}\big(1 + \lVert y_k - y_l \rVert^2\big)^{-1}}$$

$$C = KL(P \,\|\, Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}$$

$$\frac{\partial C}{\partial y_i} = 4 \sum_j \big(p_{ij} - q_{ij}\big)\big(y_i - y_j\big)\big(1 + \lVert y_i - y_j \rVert^2\big)^{-1}$$

where Y is a randomly initialised n × 2 low-dimensional matrix following a t-distribution, $y_i$ denotes the $i$-th point of that matrix, $y_{i1}$ is the abscissa of the $i$-th point, and $y_{i2}$ is its ordinate.
As a preferred technical solution, in Step 5, a momentum parameter is further introduced to change the positions of the low-dimensional-space points, and the joint probabilities of the low-dimensional space are calculated iteratively until the value of the KL divergence function converges, obtaining the distribution of the low-dimensional-space points.
As a preferred technical solution, in Step 5, the criterion for convergence of the KL divergence function is whether the difference between the KL divergence value of the current iteration and that of the previous iteration is less than 0.005. If it is less than 0.005, the low-dimensional dimension-reduction result matrix is output; otherwise, the positions of the low-dimensional-space points continue to be updated iteratively until the difference is less than 0.005.
As a preferred technical solution, in Step 5, the calculation formula for obtaining the distribution of the low-dimensional-space points is:

$$y^{(u)} = y^{(u-1)} + \eta \,\frac{\partial C}{\partial y} + \alpha(u)\big(y^{(u-1)} - y^{(u-2)}\big)$$

where $y^{(u)}$ is the updated low-dimensional two-dimensional matrix, $y^{(u-1)}$ is the low-dimensional two-dimensional matrix generated by the previous iteration, $\eta$ is the step size, $\alpha(u)$ is the learning rate, and $\alpha(u)\big(y^{(u-1)} - y^{(u-2)}\big)$ is the momentum gradient, with $\big(y^{(u-1)} - y^{(u-2)}\big)$ defaulting to 0 at the first iteration.
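Putting the preferred Step 5 together, the sketch below runs the momentum gradient descent on KL(P‖Q) with the 0.005 convergence test; the learning rate and the momentum schedule (0.5, then 0.8) are common t-SNE defaults assumed here, not values stated in the patent:

```python
import numpy as np

def optimise_embedding(P: np.ndarray, n_iter: int = 1000, eta: float = 200.0,
                       tol: float = 5e-3, seed: int = 0) -> np.ndarray:
    """Gradient descent with momentum on C = KL(P || Q) for an n x 2 embedding."""
    rng = np.random.default_rng(seed)
    n = P.shape[0]
    Y = rng.standard_normal((n, 2)) * 1e-4         # random initialisation
    Y_prev = Y.copy()
    kl_prev = np.inf
    for u in range(n_iter):
        diff = Y[:, None, :] - Y[None, :, :]       # (n, n, 2) pairwise y_i - y_j
        num = 1.0 / (1.0 + (diff ** 2).sum(-1))    # Student-t kernel, 1 dof
        np.fill_diagonal(num, 0.0)
        Q = np.maximum(num / num.sum(), 1e-12)     # low-dimensional q_ij
        # gradient: 4 * sum_j (p_ij - q_ij)(y_i - y_j)(1 + ||y_i - y_j||^2)^-1
        grad = 4.0 * ((P - Q)[:, :, None] * diff * num[:, :, None]).sum(axis=1)
        alpha = 0.5 if u < 250 else 0.8            # assumed momentum coefficient
        Y, Y_prev = Y - eta * grad + alpha * (Y - Y_prev), Y
        kl = (P * np.log(np.maximum(P, 1e-12) / Q)).sum()
        if abs(kl_prev - kl) < tol:                # the patent's 0.005 criterion
            break
        kl_prev = kl
    return Y
```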
As a preferred embodiment, in Step 5, the dimension-reduction effect is evaluated using the silhouette coefficient, the DBI, the CH score and/or the KL divergence value.
A weight-adjustable high-dimensional data dimension reduction system, applying the above weight-adjustable high-dimensional data dimension reduction method, comprises the following modules connected in sequence (a combined sketch follows the list):

Data extraction module: extracts n pieces of m-dimensional high-dimensional data to form an n × m data matrix X;

$$X = \begin{bmatrix} x_{11} & \cdots & x_{1m} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{nm} \end{bmatrix}$$

where $x_{ik}$ is the element in row $i$, column $k$ of the high-dimensional data; $n > 2$ and $m > 3$ are positive integers; $1 \le i \le n$ and $1 \le k \le m$;

Attribute weight matrix acquisition module: performs the attribute weight calculation on the data matrix X to obtain the attribute weight matrix weight;

$$\mathrm{weight} = [\,wc_1 \;\cdots\; wc_i \;\cdots\; wc_m\,]$$

where $wc_i$ is the attribute weight of the $i$-th column of data in the data matrix X;

Weighted Euclidean point-pair distance module: substitutes weight into the high-dimensional-space point-pair Euclidean distance formula to obtain the attribute-weighted Euclidean point-pair distance matrix D;

$$D = \begin{bmatrix} d_{11} & \cdots & d_{1n} \\ \vdots & \ddots & \vdots \\ d_{n1} & \cdots & d_{nn} \end{bmatrix}, \qquad d_{ij} = \sqrt{\sum_{k=1}^{m} wc_k\,(x_{ik} - x_{jk})^2}$$

where $d_{ij}$ is the weighted Euclidean distance in the high-dimensional space (a space of dimension greater than 3) between the $i$-th and $j$-th rows of the data matrix X; $x_{ik}$ and $x_{jk}$ are the elements in rows $i$ and $j$, column $k$, of X;

High-dimensional-space joint probability module: calculates, from the attribute-weighted Euclidean point-pair distance matrix D, the joint probabilities $p_{ij}$ of the high-dimensional space of the data matrix X;

Low-dimensional-space point distribution module: calculates the low-dimensional joint probabilities $q_{ij}$, adopts the KL divergence as the objective function, and iterates the low-dimensional similarity calculation until the value of the KL divergence function converges, obtaining the distribution of the low-dimensional-space points; the low-dimensional space is a space of dimension less than or equal to 3.
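For orientation, a sketch wiring the five modules together, reusing the helper sketches given earlier in this document (critic_weights, weighted_distance_matrix, sigma_for_row, optimise_embedding); the perplexity default of 30 is an assumption:

```python
import numpy as np

def reduce_dimension(X: np.ndarray, perplexity: float = 30.0) -> np.ndarray:
    """Chain the modules: weights -> weighted distances -> p_ij -> embedding."""
    n = X.shape[0]
    weight = critic_weights(X)                    # attribute weight module
    D = weighted_distance_matrix(X, weight)       # weighted distance module
    # high-dimensional joint probability module
    P_cond = np.zeros((n, n))
    for i in range(n):
        s = sigma_for_row(D[i], i, perplexity)
        p = np.exp(-D[i] ** 2 / (2.0 * s ** 2))
        p[i] = 0.0
        P_cond[i] = p / p.sum()
    P = (P_cond + P_cond.T) / (2.0 * n)           # symmetrised p_ij
    return optimise_embedding(P)                  # low-dimensional module
```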
Compared with the prior art, the invention has the following beneficial effects:

(1) the invention improves the high-dimensional point-pair distance of the t-SNE algorithm through attribute weights; when the attribute weights are calculated, they can be solved by a specific weighting algorithm, and different attributes can also be assigned different weights according to the user's dimension-reduction requirements, realising user-defined weight assignment and dimension reduction;

(2) the method reflects more distinctly the differences of the data both in the original high-dimensional space and in the low-dimensional embedding after dimension reduction, solves the dimension-reduction problem of high-dimensional complex data, and improves the clustering effect and the dimension-reduction accuracy;

(3) under the same conditions, the KL divergence value is reduced, the silhouette coefficient is better, the Davies-Bouldin index is better, and the Calinski-Harabasz score is higher.
Drawings
FIG. 1 is a diagram illustrating the steps of a method for reducing dimension of high-dimensional data with adjustable weight according to the present invention;
FIG. 2 is an overall framework of the dimension reduction method of the present invention;
FIG. 3 is a flow chart of a dimension reduction method according to the present invention;
FIG. 4 compares the visual dimension-reduction results of PCA (top left), MDS (top right), t-SNE (bottom left) and the dimension reduction method of the present invention (bottom right) in the case of 2000 sets of data;

FIG. 5 compares the silhouette-coefficient index of PCA, MDS, t-SNE and the dimension reduction method of the present invention in the case of 2000 sets of data;

FIG. 6 compares the Davies-Bouldin index of PCA, MDS, t-SNE and the dimension reduction method of the present invention in the case of 2000 sets of data;

FIG. 7 compares the Calinski-Harabasz score index of PCA, MDS, t-SNE and the dimension reduction method of the present invention in the case of 2000 sets of data;

FIG. 8 compares the KL divergence values of t-SNE and the dimension reduction method of the present invention in the case of 2000 sets of data at the 1000th iteration;

FIG. 9 shows the change of the KL divergence value as the number of iterations increases, for t-SNE and the dimension reduction method of the present invention in the case of 2000 sets of data.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited to these examples.
Example 1
As shown in fig. 1 to 9, the present embodiment takes the complex high-dimensional medical record data as an example.
The method comprises the following steps:
Step 1: preprocess the complex high-dimensional medical record data: n pieces of data, each comprising m attributes, form the experimental data set; the n pieces of m-dimensional data are then standardised to form the experimental data matrix X;

$$X = \begin{bmatrix} x_{11} & \cdots & x_{1m} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{nm} \end{bmatrix}$$

where $x_{ik}$ is the element in row $i$, column $k$ of the high-dimensional data; $n > 2$ and $m > 3$ are positive integers; $1 \le i \le n$ and $1 \le k \le m$;
Step 2: perform the weight calculations on the n × m data matrix with the SVD (singular value decomposition) and Critic weighting methods respectively;
Step 3: substitute the weights calculated in Step 2 into the high-dimensional-space point-pair Euclidean distance formula, multiplying the squared difference of each component by the corresponding weight, to obtain the weighted point-pair Euclidean distance matrix D;

$$D = \begin{bmatrix} d_{11} & \cdots & d_{1n} \\ \vdots & \ddots & \vdots \\ d_{n1} & \cdots & d_{nn} \end{bmatrix}, \qquad d_{ij} = \sqrt{\sum_{k=1}^{m} wc_k\,(x_{ik} - x_{jk})^2}$$

where $d_{ij}$ is the weighted Euclidean distance in the high-dimensional space (a space of dimension greater than 3) between the $i$-th and $j$-th rows of the data matrix X; $x_{ik}$ and $x_{jk}$ are the elements in rows $i$ and $j$, column $k$, of X;
Step 4: based on the set perplexity value, use a binary search to find the optimal standard deviation $\sigma_i$ centred on the $i$-th row of data in the data matrix X and the optimal standard deviation $\sigma_j$ centred on the $j$-th row of data; calculate the conditional probabilities $p_{i|j}$ and $p_{j|i}$ of the high-dimensional data matrix, and then the joint probability $p_{ij}$ of the high-dimensional space; the calculation formulas are as follows:

$$p_{j|i} = \frac{\exp\!\big(-d_{ij}^2 / 2\sigma_i^2\big)}{\sum_{k \neq i} \exp\!\big(-d_{ik}^2 / 2\sigma_i^2\big)}$$

$$p_{i|j} = \frac{\exp\!\big(-d_{ij}^2 / 2\sigma_j^2\big)}{\sum_{k \neq j} \exp\!\big(-d_{kj}^2 / 2\sigma_j^2\big)}$$

$$p_{ij} = \frac{p_{i|j} + p_{j|i}}{2n}$$

where $k$ is a positive integer with $1 \le k \le n$.
Step 5: randomly generate an n × 2 normally distributed matrix and calculate for it the low-dimensional conditional probabilities and the low-dimensional joint probabilities $q_{ij}$; adopt the KL divergence as the objective function; iterate the similarity calculation between the high- and low-dimensional spaces by a gradient descent method, introducing a momentum parameter to change the positions of the low-dimensional points and recalculating the low-dimensional joint probabilities $q_{ij}$, until the value of the KL divergence function is essentially unchanged, obtaining the distribution of the low-dimensional-space points;

$$q_{ij} = \frac{\big(1 + \lVert y_i - y_j \rVert^2\big)^{-1}}{\sum_{k \neq l}\big(1 + \lVert y_k - y_l \rVert^2\big)^{-1}}$$

objective function, the KL divergence:

$$C = KL(P \,\|\, Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}$$

gradient formula:

$$\frac{\partial C}{\partial y_i} = 4 \sum_j \big(p_{ij} - q_{ij}\big)\big(y_i - y_j\big)\big(1 + \lVert y_i - y_j \rVert^2\big)^{-1}$$

position update formula of the low-dimensional-space points:

$$y^{(u)} = y^{(u-1)} + \eta \,\frac{\partial C}{\partial y} + \alpha(u)\big(y^{(u-1)} - y^{(u-2)}\big)$$
Preferably, the two weight vectors obtained in Step 2 are used as sample points in an m-dimensional space; with the KL divergence as the fitness function, the positions and movement velocities of the points are continuously updated through the particle swarm algorithm, the fitness function is continuously evaluated at the point positions, and the globally optimal position is obtained. That position is taken as the weight vector, the weighted Euclidean distances are recalculated as in Step 3, and Step 5 is repeated to obtain the optimal dimension-reduction effect.

Velocity update formula:

$$v_i = \omega v_i + c_1 \cdot \mathrm{rand}() \cdot (pbest_i - x_i) + c_2 \cdot \mathrm{rand}() \cdot (gbest_i - x_i)$$

Position update formula:

$$x_i = x_i + v_i$$
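A minimal sketch of this two-particle PSO refinement; the inertia factor, learning factors and iteration count are illustrative defaults, and `fitness` stands for whichever criterion is minimised (the KL divergence in this embodiment; embodiment 2 instead uses the silhouette coefficient, which one would negate):

```python
import numpy as np

def pso_refine_weights(p1, p2, fitness, n_iter=100,
                       omega=0.7, c1=1.5, c2=1.5, seed=0):
    """Two-particle PSO over the m-dimensional weight space, seeded with the
    SVD and Critic weight vectors; returns the global best position."""
    rng = np.random.default_rng(seed)
    pos = np.stack([np.asarray(p1, float), np.asarray(p2, float)])  # (2, m)
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    pbest_val = np.array([fitness(p) for p in pos])
    g = pbest[pbest_val.argmin()].copy()           # global best position
    for _ in range(n_iter):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = omega * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (g - pos)
        pos = pos + vel
        val = np.array([fitness(p) for p in pos])
        improved = val < pbest_val                 # update local optima
        pbest[improved], pbest_val[improved] = pos[improved], val[improved]
        g = pbest[pbest_val.argmin()].copy()
    return g
```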
example 2
As shown in fig. 1 to 9, as a further optimization of embodiment 1, this embodiment includes all the technical features of embodiment 1, and in addition, this embodiment further includes the following technical features:
the present embodiment takes the complex high-dimensional medical record data as an example.
Firstly, selecting n pieces of complex high-dimensional case data, forming a data matrix X of n X m because each piece of data has m attributes, and then carrying out standardization processing on the matrix;
Figure BDA0003419644680000103
wherein x isikThe ith row and the kth column of the high-dimensional data, n is more than 2 and is a positive integer, m is more than 3 and is a positive integer, i is a positive integer and is more than or equal to 1 and less than or equal to n, k is a positive integer and is more than or equal to 1 and less than or equal to m;
xi=[xi1 xi2 … xik … xim]。
secondly, carrying out attribute weight calculation on the formed data matrix X by two methods of SVD (singular value decomposition) and Critic weight method to obtain weightaAnd weightbTwo weights;
weighta=[wa1 … wai … wam]
weightb=[wb1 … wbi … wbm]
wherein wai、wbiAttribute weight values of ith row data in the data matrix X are calculated by SVD and Critic weight value methods respectively;
thirdly, the weight calculated in the second stepaAnd weightbInitial positions p of two sample points as Particle Swarm Optimization (PSO)1And p2In this case, the search space dimension and the number of particles of the particle algorithm are m (m is p)1And p2I.e. representing how many attributes there are) and 2. By setting the number of iterations by itself, the initial velocity viInertia factor ω, learning factor c1And c2Will beThe contour coefficient is used as a fitness function, iteration is started, the fitness of each sample point is evaluated, and the local optimum value pbest of each sample point is updatediAnd global optimum gbestiBy continuously varying the speed v of the sample pointsiAnd position piAnd returning the position corresponding to the global optimal value at the moment as a new weight until the iteration number meets the set maximum iteration number requirement, wherein the calculation formula is as follows:
vi=ω×vi+c1×rand()×(pbesti-pi)+c2×rand()×(gbesti-pi (1);
xi=pi+vi (2);
vifor the updated velocity value, v, (left 1 in equation 9)iIs (right 2 in equation 9) the original velocity value, ω is the inertia factor, c1And c2As a learning factor, pbestiFor the locally optimal solution of the ith particle, gbestiIs a globally optimal solution, and rand () generates an interval [0, 1%]Random number between, xi(left 1 in equation 10) is the position after the particle update, xi(Right 1 in equation 10) is the home position, pbesti
Indicating the local optimum, gbest, of the ith particle calculated from the fitness functioniRepresents the global optimum value calculated by the ith particle according to the fitness function.
Fourthly, the attribute weight obtained by calculation in the third step is substituted into a high-dimensional space point pair Euclidean distance calculation formula (3) to obtain a square matrix D of the attribute weighted Euclidean point pair distance between each point pair;
Figure BDA0003419644680000121
Figure BDA0003419644680000122
fifthly, weighting according to the calculated attributesThe Euclidean point pair distance square matrix D is continuously calculated through the Gaussian kernel function to obtain the conditional probability p of the high-dimensional space of the data matrix Xi|j
Figure BDA0003419644680000123
Sixth, it sets a confusion PyFinding the optimal σj. In the fourth step pi|jThe perplexity P is calculated by the formula (5)xHandle PxAnd PyPerforming difference operation, performing dichotomy iteration, and updating sigmajAnd PxUp to PxAnd PyIs less than the minimum limit value of 0.00001, the iteration is stopped, the current sigma isjI.e. the optimal value, and obtains the optimal p of the high-dimensional spacei|j
Figure BDA0003419644680000124
pi|jCalculating the high-dimensional space conditional probability for the fifth step;
seventhly, according to the calculated conditional probability of the high-dimensional space of the data matrix X, calculating the joint probability p of the high-dimensional spaceij
Figure BDA0003419644680000125
Figure BDA0003419644680000126
Wherein p isi|jIs a matrix. Therefore, only p is needed herei|jTransposing to obtain pj|i
Eighthly, randomly initializing a low-dimensional space matrix Y of n x 2 according with t distribution, simultaneously calculating the Euclidean distance of a low-dimensional space point pair for the Y matrix, and calculating the low-dimensional space through a t distribution probability density function with the degree of freedom of 1Joint probability qij
Figure BDA0003419644680000131
Wherein y isiDenotes the ith point, where yi1Is the abscissa of the ith point, yi2Is the ordinate of the ith point.
Figure BDA0003419644680000132
Ninthly, take the KL divergence as the objective function and minimise it so that the similarities of the high-dimensional and low-dimensional point pairs are as close as possible; the objective-function KL divergence is shown in formula (8):

$$C = KL(P \,\|\, Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}} \qquad (8)$$

where $p_{ij}$ is the high-dimensional joint probability and $q_{ij}$ is the low-dimensional joint probability.

Tenthly, find the minimum of the objective-function KL divergence by a gradient descent method; the gradient formula is:

$$\frac{\partial C}{\partial y_i} = 4 \sum_j \big(p_{ij} - q_{ij}\big)\big(y_i - y_j\big)\big(1 + \lVert y_i - y_j \rVert^2\big)^{-1} \qquad (9)$$

Eleventhly, to speed up the search and avoid falling into a locally optimal solution, a momentum parameter is added when updating the low-dimensional matrix Y, as shown in formula (10). The updated low-dimensional matrix Y continues to be refined through formulas (7), (8), (9) and (10); when the set number of iterations is reached, the iteration stops, giving a relatively accurate low-dimensional two-dimensional matrix after dimension reduction:

$$y^{(u)} = y^{(u-1)} + \eta \,\frac{\partial C}{\partial y} + \alpha(u)\big(y^{(u-1)} - y^{(u-2)}\big) \qquad (10)$$

where $y^{(u)}$ is the updated two-dimensional matrix, $y^{(u-1)}$ is the two-dimensional matrix generated by the previous iteration, $\eta$ is the step size, $\alpha(u)$ is the learning rate, and $\alpha(u)\big(y^{(u-1)} - y^{(u-2)}\big)$ is the momentum gradient used to strengthen the effect of the gradient-descent algorithm; at the first iteration $\big(y^{(u-1)} - y^{(u-2)}\big)$ defaults to 0.

Twelfthly, the relatively accurate low-dimensional two-dimensional matrix obtained in the eleventh step is the actual output of reducing the high-dimensional medical record data to two dimensions with the improved original t-SNE algorithm. The complex high-dimensional medical record data are unlabelled, so the invention proceeds as follows: the k-means algorithm first clusters the low-dimensional two-dimensional matrix obtained in the eleventh step, and the silhouette coefficient, the DBI, the CH score and the KL divergence value then evaluate the clustering effect of the reduced data, reflecting the dimension-reduction effect of the invention indirectly.
Thirteenth, the invention provides the improvement of the distance of the t-SNE algorithm high-dimensional space point through the attribute weight, when calculating the attribute weight, the invention can not only solve through a specific weight algorithm, but also distribute different weights of different attributes according to the requirement of the user on dimension reduction, thereby achieving the self-defined weight distribution and dimension reduction.
The high-dimensional attribute weights are calculated by the SVD and Critic weighting methods and used as the initial sample points of the PSO (particle swarm optimisation); with the silhouette coefficient as the fitness function, the sample-point positions and velocities are iterated continually to obtain the optimal attribute-weight assignment for the data. The optimal weights are introduced into the high-dimensional Euclidean distance calculation to obtain the attribute-weighted Euclidean distance, which serves as the distance inside the high-dimensional Gaussian kernel function; dimension reduction then proceeds on the basis of the t-SNE algorithm. The method therefore reflects more distinctly the differences of the data in the original high-dimensional space and in the low-dimensional embedding after dimension reduction, solves the dimension-reduction problem of high-dimensional complex data, improves the clustering effect, and improves the dimension-reduction accuracy.
Comparing accuracy and clustering indices with several dimension reduction algorithms such as PCA, MDS and t-SNE, the improved t-SNE algorithm of the invention, under the same conditions, reduces the KL divergence value by 52.2% relative to t-SNE; its silhouette coefficient is 27.6% to 98.6% better than PCA, MDS and t-SNE; its Davies-Bouldin index is 31.1% to 45.7% better than PCA, MDS and t-SNE; and its Calinski-Harabasz score is 2 to 5 times higher than PCA, MDS and t-SNE.
As described above, the present invention can be preferably realized.
All features disclosed in all embodiments of the present specification, or all methods or process steps implicitly disclosed, may be combined and/or expanded, or substituted, in any way, except for mutually exclusive features and/or steps.
The foregoing is only a preferred embodiment of the present invention, and the present invention is not limited thereto in any way, and any simple modification, equivalent replacement and improvement made to the above embodiment within the spirit and principle of the present invention still fall within the protection scope of the present invention.

Claims (10)

1. A weight-adjustable high-dimensional data dimension reduction method, characterised by comprising the following steps:

Step 1, extracting data: extract n pieces of m-dimensional high-dimensional data to form an n × m data matrix X;

$$X = \begin{bmatrix} x_{11} & \cdots & x_{1m} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{nm} \end{bmatrix}$$

where $x_{ik}$ is the element in row $i$, column $k$ of the high-dimensional data; $n > 2$ and $m > 3$ are positive integers; $i$ is a positive integer with $1 \le i \le n$; and $k$ is a positive integer with $1 \le k \le m$;

Step 2, obtaining an attribute weight matrix: perform the attribute weight calculation on the data matrix X to obtain the attribute weight matrix weight;

$$\mathrm{weight} = [\,wc_1 \;\cdots\; wc_i \;\cdots\; wc_m\,]$$

where $wc_i$ is the attribute weight of the $i$-th column of data in the data matrix X;

Step 3, calculating the weighted Euclidean point-pair distances: substitute weight into the high-dimensional-space point-pair Euclidean distance formula to obtain the attribute-weighted Euclidean point-pair distance matrix D between all point pairs;

$$D = \begin{bmatrix} d_{11} & \cdots & d_{1n} \\ \vdots & \ddots & \vdots \\ d_{n1} & \cdots & d_{nn} \end{bmatrix}, \qquad d_{ij} = \sqrt{\sum_{k=1}^{m} wc_k\,(x_{ik} - x_{jk})^2}$$

where $d_{ij}$ is the weighted Euclidean distance in the high-dimensional space between the $i$-th and $j$-th rows of data in the data matrix X, the high-dimensional space being a space of dimension greater than 3; $x_{ik}$ and $x_{jk}$ are the elements in rows $i$ and $j$, column $k$, of the data matrix X;

Step 4, calculating the high-dimensional-space joint probabilities: from the attribute-weighted Euclidean point-pair distance matrix D, continue to calculate the joint probabilities $p_{ij}$ of the high-dimensional space of the data matrix X;

Step 5, obtaining the low-dimensional-space point distribution: calculate the low-dimensional joint probabilities $q_{ij}$, adopt the KL divergence as the objective function, and continue to calculate the low-dimensional-space similarity until the value of the KL divergence function converges, obtaining the distribution of the low-dimensional-space points; the low-dimensional space is a space of dimension less than or equal to 3.
2. The method as claimed in claim 1, wherein in Step 2 the attribute weights are calculated with the SVD weighting method and the Critic weighting method respectively, the two weight vectors being denoted $\mathrm{weight}_a$ and $\mathrm{weight}_b$; $\mathrm{weight}_a$ and $\mathrm{weight}_b$ are then used as the initial positions of the two sample points of a particle swarm algorithm, and the attribute weights corresponding to the global optimal solution are calculated.
3. The method of claim 2, wherein in Step 4, based on the set perplexity value, a binary search is used to find the optimal standard deviation $\sigma_i$ centred on the $i$-th row of data in the data matrix X and the optimal standard deviation $\sigma_j$ centred on the $j$-th row of data; the conditional probabilities $p_{i|j}$ and $p_{j|i}$ of the high-dimensional data matrix are calculated, and then the joint probability $p_{ij}$ of the high-dimensional space; the calculation formulas are as follows:

$$p_{j|i} = \frac{\exp\!\big(-d_{ij}^2 / 2\sigma_i^2\big)}{\sum_{k \neq i} \exp\!\big(-d_{ik}^2 / 2\sigma_i^2\big)}$$

$$p_{i|j} = \frac{\exp\!\big(-d_{ij}^2 / 2\sigma_j^2\big)}{\sum_{k \neq j} \exp\!\big(-d_{kj}^2 / 2\sigma_j^2\big)}$$

$$p_{ij} = \frac{p_{i|j} + p_{j|i}}{2n}$$

where $k$ is a positive integer with $1 \le k \le n$.
4. The method of claim 3, wherein the condition for stopping the binary search is: the absolute value of the difference between the set perplexity and the currently calculated perplexity is less than 0.0001, or the number of binary searches exceeds 50.
5. The method according to claim 4, wherein in Step 5, with the KL divergence as the objective function, the positions of all points in the low-dimensional space are continuously updated by a gradient descent method, and the conditional and joint probabilities of the low-dimensional space and the new value of the KL divergence are recalculated until the value of the KL divergence function C converges, obtaining the distribution of the low-dimensional-space points;

$$q_{ij} = \frac{\big(1 + \lVert y_i - y_j \rVert^2\big)^{-1}}{\sum_{k \neq l}\big(1 + \lVert y_k - y_l \rVert^2\big)^{-1}}$$

$$C = KL(P \,\|\, Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}$$

$$\frac{\partial C}{\partial y_i} = 4 \sum_j \big(p_{ij} - q_{ij}\big)\big(y_i - y_j\big)\big(1 + \lVert y_i - y_j \rVert^2\big)^{-1}$$

where Y is a randomly initialised n × 2 low-dimensional matrix following a t-distribution, $y_i$ denotes the $i$-th point of that matrix, $y_{i1}$ is the abscissa of the $i$-th point, and $y_{i2}$ is its ordinate.
6. The method of claim 5, wherein in Step 5 a momentum parameter is further introduced to change the positions of the low-dimensional-space points, and the joint probabilities of the low-dimensional space are calculated iteratively until the value of the KL divergence function converges, obtaining the distribution of the low-dimensional-space points.
7. The method of claim 6, wherein in Step 5 the criterion for convergence of the KL divergence function is whether the difference between the KL divergence value of the current iteration and that of the previous iteration is less than 0.005; if it is less than 0.005, the low-dimensional dimension-reduction result matrix is output; otherwise, the positions of the low-dimensional-space points continue to be updated iteratively until the difference is less than 0.005.
8. The method of claim 7, wherein in Step 5 the calculation formula for obtaining the distribution of the low-dimensional-space points is:

$$y^{(u)} = y^{(u-1)} + \eta \,\frac{\partial C}{\partial y} + \alpha(u)\big(y^{(u-1)} - y^{(u-2)}\big)$$

where $y^{(u)}$ is the updated low-dimensional two-dimensional matrix, $y^{(u-1)}$ is the low-dimensional two-dimensional matrix generated by the previous iteration, $\eta$ is the step size, $\alpha(u)$ is the learning rate, and $\alpha(u)\big(y^{(u-1)} - y^{(u-2)}\big)$ is the momentum gradient, with $\big(y^{(u-1)} - y^{(u-2)}\big)$ defaulting to 0 at the first iteration.
9. The weight-adjustable high-dimensional data dimension reduction method according to any one of claims 1 to 8, wherein in Step 5 the dimension-reduction effect is evaluated using the silhouette coefficient, the DBI, the CH score and/or the KL divergence value.
10. A weight-adjustable high-dimensional data dimension reduction system, applied to the weight-adjustable high-dimensional data dimension reduction method of any one of claims 1 to 9, comprising the following modules connected in sequence:

Data extraction module: extracts n pieces of m-dimensional high-dimensional data to form an n × m data matrix X;

$$X = \begin{bmatrix} x_{11} & \cdots & x_{1m} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{nm} \end{bmatrix}$$

where $x_{ik}$ is the element in row $i$, column $k$ of the high-dimensional data; $n > 2$ and $m > 3$ are positive integers; $1 \le i \le n$ and $1 \le k \le m$;

Attribute weight matrix acquisition module: performs the attribute weight calculation on the data matrix X to obtain the attribute weight matrix weight;

$$\mathrm{weight} = [\,wc_1 \;\cdots\; wc_i \;\cdots\; wc_m\,]$$

where $wc_i$ is the attribute weight of the $i$-th column of data in the data matrix X;

Weighted Euclidean point-pair distance module: substitutes weight into the high-dimensional-space point-pair Euclidean distance formula to obtain the attribute-weighted Euclidean point-pair distance matrix D;

$$D = \begin{bmatrix} d_{11} & \cdots & d_{1n} \\ \vdots & \ddots & \vdots \\ d_{n1} & \cdots & d_{nn} \end{bmatrix}, \qquad d_{ij} = \sqrt{\sum_{k=1}^{m} wc_k\,(x_{ik} - x_{jk})^2}$$

where $d_{ij}$ is the weighted Euclidean distance in the high-dimensional space (a space of dimension greater than 3) between the $i$-th and $j$-th rows of the data matrix X; $x_{ik}$ and $x_{jk}$ are the elements in rows $i$ and $j$, column $k$, of X;

High-dimensional-space joint probability module: calculates, from the attribute-weighted Euclidean point-pair distance matrix D, the joint probabilities $p_{ij}$ of the high-dimensional space of the data matrix X;

Low-dimensional-space point distribution module: calculates the low-dimensional joint probabilities $q_{ij}$, adopts the KL divergence as the objective function, and iterates the low-dimensional similarity calculation until the value of the KL divergence function converges, obtaining the distribution of the low-dimensional-space points; the low-dimensional space is a space of dimension less than or equal to 3.
Priority Application (1)

CN202111557901.7A: "Weight-adjustable high-dimensional data dimension reduction method and system", priority and filing date 2021-12-20, filed by Southwest University of Science and Technology.

Publication (1)

CN114492566A: published 2022-05-13; status pending. Family ID: 81493950. Country: CN (China).

Cited By (3)

* Cited by examiner, † Cited by third party

    • CN116743961A (published 2023-09-12; assignee 中国铁塔股份有限公司安徽省分公司): Visual intelligent analysis system of high altitude monitoring *
    • CN117176011A (published 2023-12-05; assignee 南通威尔电机有限公司): Parameter intelligent adjusting method and system for permanent magnet synchronous submersible motor *
    • CN117176011B (granted 2024-02-13; assignee 南通威尔电机有限公司): Parameter intelligent adjusting method and system for permanent magnet synchronous submersible motor *


Legal Events

    • PB01: Publication
    • SE01: Entry into force of request for substantive examination