CN111476100B - Data processing method, device and storage medium based on principal component analysis - Google Patents
Data processing method, device and storage medium based on principal component analysis
- Publication number
- CN111476100B (application CN202010155934.8A)
- Authority
- CN
- China
- Prior art keywords
- data
- features
- sample data
- feature
- correlation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/172—Classification, e.g. identification
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
- G06F18/2135—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
Abstract
Embodiments of the present invention relate to the field of software defect prediction and disclose a data processing method, a device, and a computer-readable storage medium based on principal component analysis. The method includes: performing dimensionality reduction on initial sample data to obtain sample data of a preset dimension; obtaining a plurality of features of the sample data and computing the correlation between each feature and a preset category, where the preset category is one of a plurality of categories to which the sample data belongs; and removing, from the plurality of features, the features whose correlation is below a preset correlation, and taking the remaining features as the discriminative features of the sample data. The data processing method, device, and computer-readable storage medium based on principal component analysis provided by the present invention can remove redundant features from the sample data and obtain highly discriminative sample data, thereby improving prediction efficiency.
Description
Technical Field
Embodiments of the present invention relate to the field of data processing, and in particular to a data processing method, a device, and a computer-readable storage medium based on principal component analysis.
Background
Information entropy is a measure of the amount of information required to eliminate uncertainty, that is, the amount of information an unknown event may contain. An event or a system (more precisely, a random variable) carries a certain degree of uncertainty. Some random variables are highly uncertain; to eliminate that uncertainty, a great deal of information must be introduced, and the measure of that information is expressed by "information entropy". The more information that must be introduced to eliminate the uncertainty, the higher the information entropy, and vice versa. If a situation is highly certain, almost no information needs to be introduced, so its information entropy is very low. According to the information entropy formula given by Shannon, for any random variable X the information entropy, in bits, is defined as: H(X) = -∑_{x∈X} P(x) log P(x). The more uniform the probabilities of the random outcomes in a system, the larger the information entropy, and the smaller otherwise.
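For illustration (this sketch is not part of the patent text), Shannon's definition can be estimated from the empirical distribution of a discrete sequence; the function and variable names below are illustrative only:

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy H(X) = -sum_x P(x) * log2(P(x)), in bits,
    estimated from the empirical distribution of `values`."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

# A uniform variable carries maximal entropy; a constant one carries none.
print(entropy([0, 1, 2, 3]))  # 2.0 bits
print(entropy([1, 1, 1, 1]))  # -0.0 bits
```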
The inventors found that the prior art has at least the following problem: analyzing the features of sample data according to the above formula leaves many redundant features, so that a model trained with that sample data has low prediction efficiency.
Summary of the Invention
The purpose of the embodiments of the present invention is to provide a data processing method, a device, and a computer-readable storage medium based on principal component analysis that can remove redundant features from sample data and obtain highly discriminative sample data, thereby improving prediction efficiency.
To solve the above technical problem, an embodiment of the present invention provides a data processing method based on principal component analysis, including:
performing dimensionality reduction on initial sample data to obtain sample data of a preset dimension; obtaining a plurality of features of the sample data and computing the correlation between each feature and a preset category, where the preset category is one of a plurality of categories to which the sample data belongs; and removing, from the plurality of features, the features whose correlation is below a preset correlation, and taking the remaining features as the discriminative features of the sample data.
An embodiment of the present invention further provides a data processing device based on principal component analysis, including: at least one processor; and a memory communicatively connected to the at least one processor, where the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the above data processing method based on principal component analysis.
An embodiment of the present invention further provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the above data processing method based on principal component analysis.
Compared with the prior art, the embodiments of the present invention perform dimensionality reduction on the initial sample data to obtain sample data of a preset dimension, which facilitates the computation in subsequent steps and reduces their computational load, thereby improving the efficiency of the data processing method. By obtaining a plurality of features of the sample data and computing the correlation between each feature and a preset category, and since the preset category is one of the plurality of categories to which the sample data belongs, it can be determined from these correlations which of the features are redundant. By removing the features whose correlation is below the preset correlation and taking the remaining features as the discriminative features of the sample data, highly discriminative sample data is obtained, so that a model trained with that sample data computes faster, thereby improving prediction efficiency.
In addition, after removing the features whose correlation is below the preset correlation, the method further includes: sorting the remaining features in descending order of correlation; dividing the sorted remaining features into N feature segments, each containing M features, where N and M are both integers greater than 1; and determining whether there is a feature segment in which all M features are greater than a preset threshold, and if such a segment exists, removing the feature with the smallest similarity from that segment.
In addition, performing dimensionality reduction on the initial sample data specifically includes: converting the initial sample data into a data matrix; computing the covariance matrix of the data matrix and performing eigendecomposition on the covariance matrix to obtain its eigenvalues and the eigenvectors corresponding to those eigenvalues; and obtaining a projection matrix from the eigenvalues and eigenvectors and reducing the dimension of the initial sample data to the dimension corresponding to the projection matrix.
In addition, obtaining the projection matrix from the eigenvalues and the eigenvectors specifically includes: arranging the eigenvectors as rows of a matrix from top to bottom, where an eigenvector with a larger eigenvalue is placed in an earlier row of the matrix; and taking the first k rows to form the projection matrix, where k is an integer greater than 1.
In addition, before computing the covariance matrix of the data matrix, the method further includes: zero-centering each row of the data matrix; computing the covariance matrix of the data matrix then specifically includes computing the covariance matrix of the zero-centered data matrix.
In addition, the correlation between a feature and the preset category is computed by the following formula: Si = [X^T×Y + X'^T×Y + X^T×Y' + X'^T×Y'] + [2×IG(X|L) - (H(X) + H(L))] + [2×IG(Y|L) - (H(Y) + H(L))], where Si is the similarity; X and Y are two different features of the sample data; X' is the representation of X in a different dimension and Y' is the representation of Y in a different dimension; L is the preset category; [X^T×Y + X'^T×Y + X^T×Y' + X'^T×Y'] expresses the discriminative correlation between the representations of X and of Y in different dimensions; and [2×IG(X|L) - (H(X) + H(L))] + [2×IG(Y|L) - (H(Y) + H(L))] expresses the correlation of X and of Y, respectively, with the preset category.
In addition, the correlation between a feature and the preset category may also be computed by the following formula: Si = [X^T×Y + X'^T×Y + X^T×Y' + X'^T×Y'] + λ×[2×IG(X|L) - (H(X) + H(L))] + [2×IG(Y|L) - (H(Y) + H(L))], where Si is the similarity; X and Y are two different features of the sample data; X' is the representation of X in a different dimension and Y' is the representation of Y in a different dimension; L is the preset category; [X^T×Y + X'^T×Y + X^T×Y' + X'^T×Y'] expresses the discriminative correlation between the representations of X and of Y in different dimensions; [2×IG(X|L) - (H(X) + H(L))] + [2×IG(Y|L) - (H(Y) + H(L))] expresses the correlation of X and of Y, respectively, with the preset category; and λ is a balance constant.
In addition, the initial sample data is image sample data.
Brief Description of the Drawings
One or more embodiments are illustrated by the figures in the corresponding drawings. These illustrations do not limit the embodiments; elements with the same reference numerals in the drawings denote similar elements, and unless otherwise stated, the figures in the drawings are not drawn to scale.
Figure 1 is a flowchart of the data processing method based on principal component analysis according to the first embodiment of the present invention;
Figure 2 is a flowchart of the data processing method based on principal component analysis according to the second embodiment of the present invention;
Figure 3 is a schematic structural diagram of the data processing device based on principal component analysis according to the third embodiment of the present invention.
Detailed Description
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the embodiments of the present invention are described in detail below with reference to the accompanying drawings. Those of ordinary skill in the art will understand, however, that many technical details are given in the embodiments to help the reader better understand the present invention; the technical solution claimed by the present invention can nevertheless be implemented even without these technical details and with various changes and modifications based on the following embodiments.
The first embodiment of the present invention relates to a data processing method based on principal component analysis. The specific flow, shown in Figure 1, includes:
S101: Perform dimensionality reduction on the initial sample data to obtain sample data of a preset dimension.
Specifically, before the dimensionality reduction is performed, the initial sample data to be processed is obtained in advance. In this embodiment, performing dimensionality reduction on the initial sample data specifically includes: converting the initial sample data into a data matrix; computing the covariance matrix of the data matrix and performing eigendecomposition on it to obtain the eigenvalues of the covariance matrix and the eigenvectors corresponding to those eigenvalues; and obtaining a projection matrix from the eigenvalues and eigenvectors and reducing the dimension of the initial sample data to the dimension corresponding to the projection matrix. Obtaining the projection matrix from the eigenvalues and eigenvectors specifically includes: arranging the eigenvectors as rows of a matrix from top to bottom, where an eigenvector with a larger eigenvalue is placed in an earlier row; and taking the first k rows to form the projection matrix, where k is an integer greater than 1. For example, if the matrix of row-arranged eigenvectors has 8 rows, there are 8 eigenvectors in total; if the eigenvalue of a certain eigenvector is the largest among the eigenvalues of these 8 eigenvectors, that eigenvector occupies the first row of the matrix, and so on.
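As a concrete reading of these steps, the following minimal NumPy sketch assumes, as in the worked example later in this description, that samples are stored one per column of the data matrix with one attribute per row; it is illustrative, not the patented implementation:

```python
import numpy as np

def pca_reduce(X, k):
    """Reduce samples (one per column of X, one attribute per row) to k
    dimensions: zero-center each row, form the covariance matrix,
    eigendecompose it, stack the top-k eigenvectors as the rows of the
    projection matrix P, and project the data."""
    Xc = X - X.mean(axis=1, keepdims=True)   # zero-mean each attribute row
    C = np.cov(Xc)                           # covariance across attributes
    eigvals, eigvecs = np.linalg.eigh(C)     # eigh: C is symmetric
    order = np.argsort(eigvals)[::-1]        # largest eigenvalue first
    P = eigvecs[:, order[:k]].T              # top-k eigenvectors as rows
    return P @ Xc, P

# e.g. 8 attributes ("8 eigenvectors" in the example above), 100 samples
X = np.random.rand(8, 100)
X_reduced, P = pca_reduce(X, k=3)
print(X_reduced.shape)  # (3, 100)
```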
It is worth mentioning that, to reduce the error of the data matrix and prevent noisy data in it from affecting the final analysis result, before computing the covariance matrix of the data matrix the method further includes: zero-centering each row of the data matrix; computing the covariance matrix of the data matrix then specifically includes computing the covariance matrix of the zero-centered data matrix.
It can be understood that this embodiment reduces the dimension of the initial sample data by the PCA method. It should be noted that the prior art usually uses multidimensional scaling (MDS) to reduce the dimensionality of data samples. MDS is a dimensionality reduction method that mines the hidden structural information in data by analyzing similar data; usually, the similarity measure is the Euclidean distance. The purpose of the MDS algorithm is therefore to map the data samples into a low-dimensional space while preserving the distances between them as much as possible, thereby reducing the dimension of the samples. MDS is a classic method that in theory preserves Euclidean distances, and it was first used mainly for data visualization. Since the low-dimensional representation obtained by MDS is centered at the origin, it can also be said to preserve inner products; that is, inner products in the low-dimensional space approximate distances in the high-dimensional space. In classical MDS, the distance in the high-dimensional space is generally the Euclidean distance. Multidimensional scaling (MDS) and principal component analysis (PCA) are both dimensionality reduction techniques, but they differ in the direction of optimization. The input of PCA is the original vectors of an n-dimensional space, and the data is projected onto the projection directions with the largest covariance, so the characteristics of the data are largely preserved during the reduction. The input of MDS is the pairwise distances between points, and its output is a projection of the points, with distances preserved, into two or three dimensions.
In short, PCA minimizes the sample dimension while preserving the covariance of the data, whereas MDS minimizes the sample dimension while preserving the distances between data points. When the data covariance and the Euclidean distances between the high-dimensional data points agree, the two are the same; when the distance measure differs, the two methods differ. MDS clearly has its limitations, which PCA, as an alternative, can compensate for, and PCA has a wider range of applications. Moreover, the input of PCA is the original vectors of an n-dimensional space, so compared with MDS it simplifies the algorithm on the input side and reduces its complexity. Most importantly, the PCA method is very widely used for dimensionality reduction and preprocessing of data in software defect work, and its results are also better than those of MDS.
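For comparison only, here is a minimal sketch of classical MDS (the double-centering construction alluded to above, in which low-dimensional inner products approximate the high-dimensional distances); this is the textbook algorithm, not part of the claimed method:

```python
import numpy as np

def classical_mds(D, k):
    """Classical MDS: embed n points in k dimensions from an n x n
    pairwise Euclidean distance matrix D. Double centering,
    B = -0.5 * J @ D^2 @ J, recovers the Gram (inner-product) matrix of
    the origin-centered points; its top-k eigenpairs give the embedding."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n             # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                     # Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:k]
    scale = np.sqrt(np.maximum(eigvals[order], 0))  # clip tiny negatives
    return eigvecs[:, order] * scale                # n x k embedding

# Example: 5 random 10-D points embedded in 2-D from their distances alone
pts = np.random.rand(5, 10)
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
print(classical_mds(D, k=2).shape)  # (5, 2)
```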
For ease of understanding, the algorithmic process of the PCA method is explained in detail below:
Suppose there are N image training samples in total, denoted x_k ∈ X (k = 1, ..., N), where X is the training sample set. The training samples fall into c classes, with N_i training samples in class i, and the column vector obtained by unrolling the image matrix of each data item has dimension n. The average sample of all image training samples is given by: x̄ = (1/N) ∑_{k=1}^{N} x_k.
The average sample of the i-th class (i = 1, ..., c) of training samples is: x̄_i = (1/N_i) ∑_{x_k ∈ class i} x_k.
The specific process of the principal component analysis method is as follows. First, the database is read in, and each two-dimensional data image that is read is unrolled into a one-dimensional vector; for each class of image samples, a certain number of images can be selected according to a generated random matrix to form the training sample set, and the rest form the test sample set. The next step is to compute the generating matrix of the K-L orthogonal transformation. The generating matrix can be represented by the total scatter matrix S_T of the training samples or by the between-class scatter matrix S_B of the training samples; the scatter matrix is generated from the training set and is here represented by the total scatter matrix S_T, defined as: S_T = ∑_{k=1}^{N} (x_k - x̄)(x_k - x̄)^T.
The generating matrix Σ can then be expressed as: Σ = S_T S_T^T.
Next, eigendecomposition is performed: the eigenvalues and eigenvectors of the generating matrix Σ are computed, the eigenvalues are sorted from largest to smallest, and the m largest eigenvalues and their corresponding eigenvectors are retained, thereby obtaining the projection matrix from the high-dimensional space to the low-dimensional space and constructing the feature subspace. In other words, the PCA method using the K-L transform aims to find a set of optimal projection vectors satisfying the criterion function: J(w) = w^T S_T w, maximized over unit vectors w.
The next step is to find the optimal projection vector, namely the unit vector w that maximizes the above criterion function. Its physical meaning is that, in the direction represented by the projection vector w, the overall dispersion of the feature vectors obtained by projecting the image vectors is maximal; that is, the distance between each sample of the image data and the average sample of all training samples is maximal. The optimal projection vector computed above is exactly the unit eigenvector corresponding to the largest eigenvalue of the total scatter matrix S_T. When the number of sample classes is large, a single optimal projection direction is not sufficient to fully represent the features of all image samples. It is therefore necessary to find a group of optimal projection vectors w_1, w_2, ..., w_m that both maximize the criterion function and satisfy the orthonormality condition. The optimal projection matrix is then represented by this group of optimal projection vectors, that is, P = [w_1, w_2, ..., w_m].
Next, the training samples and the test samples are each projected into the feature subspace obtained above; after each data image is projected into this feature subspace, it corresponds to a point in the subspace. Conversely, any point in the feature subspace also corresponds to some data image. The points obtained by projecting data images into these feature subspaces are called "eigenfaces". As the name suggests, the "eigenface" method denotes the method of performing data recognition via the K-L orthogonal transform.
Finally, all test image samples that have been transformed into the feature subspace by the above vector projection are compared with the training image samples to determine the class to which the data image sample to be recognized belongs. This is the classification of the test samples, for which a suitable classifier and dissimilarity test formula must be chosen.
S102: Obtain a plurality of features of the sample data, and compute the correlation between each feature and the preset category.
Specifically, in this embodiment the correlation between a feature and the preset category can be computed by the following formula: Si = [X^T×Y + X'^T×Y + X^T×Y' + X'^T×Y'] + [2×IG(X|L) - (H(X) + H(L))] + [2×IG(Y|L) - (H(Y) + H(L))], where Si is the similarity; X and Y are two different features of the sample data; X' is the representation of X in a different dimension and Y' is the representation of Y in a different dimension; L is the preset category; [X^T×Y + X'^T×Y + X^T×Y' + X'^T×Y'] expresses the discriminative correlation between the representations of X and of Y in different dimensions; and [2×IG(X|L) - (H(X) + H(L))] + [2×IG(Y|L) - (H(Y) + H(L))] expresses the correlation of X and of Y, respectively, with the preset category.
It is worth mentioning that, to balance the computed terms of the first part and the second part, a balance parameter λ needs to be added. Therefore, this embodiment can also compute the correlation between a feature and the preset category by the following formula:
Si = [X^T×Y + X'^T×Y + X^T×Y' + X'^T×Y'] + λ×[2×IG(X|L) - (H(X) + H(L))] + [2×IG(Y|L) - (H(Y) + H(L))], where Si is the similarity; X and Y are two different features of the sample data; X' is the representation of X in a different dimension and Y' is the representation of Y in a different dimension; L is the preset category; [X^T×Y + X'^T×Y + X^T×Y' + X'^T×Y'] expresses the discriminative correlation between the representations of X and of Y in different dimensions; [2×IG(X|L) - (H(X) + H(L))] + [2×IG(Y|L) - (H(Y) + H(L))] expresses the correlation of X and of Y, respectively, with the preset category; and λ is a balance constant.
It can be understood that, compared with the information entropy formula of the prior art, the formula in this embodiment extends the original computation, which considered only the features of a sample, so that the discriminative features selected between any two different features of the same sample have higher class relevance. The first bracketed term expresses the discriminative correlation between any two different features of the same sample and their representations in different dimensions; through the constraint of this computation, highly correlated features can be obtained more intuitively. The original method cannot do this intuitively: it considers only the correlation of sample features while ignoring the more important feature correlations that exist between them. This computation makes full use of the relations between different features of the same sample, including related features and redundant features. Clearly, the larger the value of the first bracketed term, the higher the correlation between the sample features, and the related features can be obtained; conversely, the smaller that value, the higher the redundancy between the sample features, and the redundant features can be effectively removed, making the sample features more discriminative. The last two bracketed terms express the similarity between each of the two different features of the same sample and the category variable; likewise, a larger value indicates a higher similarity between the feature and the category, that is, a higher correlation, and vice versa: a smaller value indicates a lower similarity between the feature and the category, that is, a lower correlation.
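To make the formula concrete, the following is a hedged Python sketch of Si. The entropy and information-gain estimators, the reading IG(X|L) = H(X) - H(X|L), and the assumption that the features are discretized (e.g. binned) numeric vectors are all illustrative choices that this description leaves open:

```python
import numpy as np
from collections import Counter
from math import log2

def H(v):
    """Empirical Shannon entropy of a discrete sequence, in bits."""
    n = len(v)
    return -sum((c / n) * log2(c / n) for c in Counter(v).values())

def IG(x, l):
    """Information gain IG(X|L) = H(X) - H(X|L) (one common reading)."""
    h_cond = sum((np.sum(l == c) / len(l)) * H(x[l == c]) for c in np.unique(l))
    return H(x) - h_cond

def Si(X, Xp, Y, Yp, L, lam=1.0):
    """Si per the formula above; lam weights the X-relevance bracket,
    exactly where the description places the balance constant."""
    dcorr = X @ Y + Xp @ Y + X @ Yp + Xp @ Yp   # discriminative correlation
    rel_x = 2 * IG(X, L) - (H(X) + H(L))        # relevance of X to class L
    rel_y = 2 * IG(Y, L) - (H(Y) + H(L))        # relevance of Y to class L
    return dcorr + lam * rel_x + rel_y
```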
S103: Remove, from the plurality of features, the features whose correlation is below the preset correlation, and take the remaining features as the discriminative features of the sample data.
Specifically, the magnitude of the preset correlation can be set according to actual needs, and this embodiment does not specifically limit it. This embodiment analyzes the sample data based on principal components. The basic idea of principal component analysis is to extract the main features of a high-dimensional data space while retaining most of the information of the original high-dimensional data, so that the high-dimensional data can be processed in a lower-dimensional feature space. The K-L transform is the basis of principal component analysis. It is an optimal orthogonal transform based on the statistical characteristics of the target; its purpose is to find a linear projection transform that makes the new feature components orthogonal or uncorrelated and, to concentrate the energy of the data further, requires that the feature components reconstructed after projection have the smallest error, in the minimum mean-square sense, with respect to the original input samples. A low-dimensional approximate representation of the original samples is thus obtained, which compresses the original data better. Using the K-L transform for data recognition, the classic eigenface method (Eigenfaces) was proposed, forming the basis of subspace learning methods. In short, a set of eigenface images is obtained by principal component analysis from the input training images; given any data image, each data image can then be represented linearly by this set of eigenface images, namely by a weighted linear combination of the eigenface images obtained by principal component analysis.
The essence of principal component analysis is to compute the covariance matrix and diagonalize it. It can be assumed that all data images lie in a linear low-dimensional space and that all data images are linearly separable in that space; the principal component analysis method is then applied to data feature recognition. The specific approach is to perform the K-L transform to obtain a new set of orthogonal bases from the high-dimensional image input space, filter the resulting orthogonal bases under certain conditions according to a certain method, discard some redundant vectors, and retain the vectors with strong feature discrimination to generate a low-dimensional data subspace, namely the eigenface subspace of the data. The key to reducing the dimension of the input data space with principal component analysis is to find the projection that best represents the original data, "denoising" and eliminating "redundant" dimensions, so that the most important features of the original input data are not lost while the dimension is reduced. In the covariance matrix, only the dimensions with relatively large energy (eigenvalues) need to be selected, and the relatively small remainder is discarded; in this way, the important feature information in the input image data is retained, while the parts that do not help data recognition are discarded.
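One common way to select the retained dimensions by their energy (eigenvalues), as described above, is a cumulative explained-variance threshold; the 95% figure in this sketch is an illustrative choice, not one fixed by this description:

```python
import numpy as np

def choose_k(eigvals, energy=0.95):
    """Smallest k such that the top-k eigenvalues carry `energy`
    of the total eigenvalue mass."""
    vals = np.sort(eigvals)[::-1]
    ratio = np.cumsum(vals) / np.sum(vals)
    return int(np.searchsorted(ratio, energy) + 1)

print(choose_k(np.array([6.0, 2.5, 1.0, 0.3, 0.2])))  # -> 3
```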
For ease of understanding, a concrete example of how this embodiment processes the sample data is given below:
Input: training sample set X = [X1, X2, ..., Xc], where Xi = (F1, F2, ..., Fm, L), k < m, i = 1...m.
Dimension for PCA dimensionality reduction: k
Correlation threshold (preset correlation): β
1) Arrange the original data into a data matrix by columns, unrolling each read-in two-dimensional data image into a one-dimensional vector.
2) Zero-center each row of the data matrix (each row representing one attribute field), i.e., subtract the mean of that row.
3) Compute the covariance matrix.
4) Perform eigendecomposition to obtain the eigenvalues and corresponding eigenvectors of the covariance matrix.
5) Arrange the eigenvectors as matrix rows from top to bottom in order of decreasing eigenvalue, and take the first k rows to form the sample projection matrix.
6) Reduce the data to the dimension corresponding to the projection matrix, i.e. k; X' = PX is the data reduced to k dimensions. The reduced sample set is X' = [X'1, X'2, ..., X'c], where X'i = (F1, F2, ..., Fk, L), k < m, i = 1...m.
7) For i = 1 to k and j = 1 to k (i ≠ j), loop and compute Si = ISU(Fi, Fi', Fj, Fj', L).
8) Sort the Si values from largest to smallest.
9) Take the first g features of the sequence as the features of the new samples, obtaining the sample set X'' = [X''1, X''2, ..., X''c], where X''i = (F1, F2, ..., Fg, L), g < k, i = 1...m.
10) Perform correlation analysis on each pair of features from back to front, remove the designated features of pairs whose correlation exceeds β, and obtain the final sample set Y.
Output: sample set Y.
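A hedged sketch of steps 7) to 9) above, reusing the Si function sketched earlier as the ISU score. Since the listing leaves the aggregation from pairwise scores to per-feature scores implicit, summing each feature's score over its partners is assumed here:

```python
import numpy as np

def rank_features(F, Fp, L, g, lam=1.0):
    """F holds the k reduced features as rows (F[i] is feature Fi) and Fp
    their alternate-dimension representations. Score every ordered pair
    with Si, aggregate per feature by summation (an assumption), sort in
    descending order, and keep the indices of the top g features."""
    k = F.shape[0]
    score = np.zeros(k)
    for i in range(k):
        for j in range(k):
            if i != j:
                score[i] += Si(F[i], Fp[i], F[j], Fp[j], L, lam)
    keep = np.argsort(score)[::-1][:g]
    return keep, score
```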
Compared with the prior art, the embodiments of the present invention perform dimensionality reduction on the initial sample data to obtain sample data of a preset dimension, which facilitates the computation in subsequent steps and reduces their computational load, thereby improving the efficiency of the data processing method. By obtaining a plurality of features of the sample data and computing the correlation between each feature and a preset category, and since the preset category is one of the plurality of categories to which the sample data belongs, it can be determined from these correlations which of the features are redundant. By removing the features whose correlation is below the preset correlation and taking the remaining features as the discriminative features of the sample data, highly discriminative sample data is obtained, so that a model trained with that sample data computes faster, thereby improving prediction efficiency.
The second embodiment of the present invention relates to a data processing method based on principal component analysis. The second embodiment is a further improvement on the basis of the first embodiment, the specific improvement being as follows: in the second embodiment, after removing the features whose correlation is below the preset correlation, the method further includes: sorting the remaining features in descending order of correlation; dividing the sorted remaining features into N feature segments, each containing M features, where N and M are both integers greater than 1; and determining whether there is a feature segment in which all M features are greater than a preset threshold, and if such a segment exists, removing the feature with the smallest similarity from that segment. In this way, the redundant features in the sample data can be reduced further, so that the prediction efficiency is further improved.
The specific flow of this embodiment, shown in Figure 2, includes:
S201: Perform dimensionality reduction on the initial sample data to obtain sample data of a preset dimension.
S202: Obtain a plurality of features of the sample data, and compute the correlation between each feature and the preset category.
S203: Remove, from the plurality of features, the features whose correlation is below the preset correlation.
S204: Sort the features remaining after the removal in descending order of correlation, and divide the sorted remaining features into N feature segments.
S205: Determine whether there is a feature segment in which all M features are greater than the preset threshold, and if such a segment exists, remove the feature with the smallest similarity from that segment.
Regarding the above steps S204 to S205: specifically, a threshold correlation method is used to remove the redundant features of the sample data. The threshold correlation method identifies redundant features by the correlation between features; since nonlinear relations exist in actual software metrics, ISU is still chosen here to compute the correlation between a pair of features. The threshold correlation method uses the preset β (namely the preset threshold) as the critical value of correlation: after the features whose correlation is below the preset correlation have been removed, correlation analysis is performed on the remaining features from back to front, and for every pair of features whose correlation exceeds the critical value, the later-ranked feature is removed from the sample set, and so on. The reason the correlation analysis proceeds from back to front is that, after the low-correlation features are removed, the remaining features become increasingly discriminative when scanned from back to front; hence, analyzing from back to front, whenever two features with a correlation greater than β are encountered, the less discriminative feature can be removed first, so that the more discriminative feature is retained.
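The back-to-front pass described above can be sketched as follows, assuming isu(a, b) is the pairwise ISU correlation measure and that the features arrive already sorted with the most discriminative first:

```python
def threshold_filter(features, isu, beta):
    """Scan from the least to the most discriminative feature; drop a
    feature if it correlates above beta with any feature ranked ahead of
    it, so the more discriminative member of each pair is retained."""
    kept = list(features)  # assumed sorted: most discriminative first
    for i in range(len(kept) - 1, 0, -1):
        if any(isu(kept[i], kept[j]) > beta for j in range(i)):
            del kept[i]
    return kept
```

Deleting from the tail keeps the earlier indices stable, so a single descending scan suffices.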
S206: Take the remaining features as the discriminative features of the sample data.
Steps S201 to S203 and step S206 of this embodiment are similar to steps S101 to S103 of the first embodiment; to avoid repetition, they are not described again here.
Compared with the prior art, the embodiments of the present invention perform dimensionality reduction on the initial sample data to obtain sample data of a preset dimension, which facilitates the computation in subsequent steps and reduces their computational load, thereby improving the efficiency of the data processing method. By obtaining a plurality of features of the sample data and computing the correlation between each feature and a preset category, and since the preset category is one of the plurality of categories to which the sample data belongs, it can be determined from these correlations which of the features are redundant. By removing the features whose correlation is below the preset correlation and taking the remaining features as the discriminative features of the sample data, highly discriminative sample data is obtained, so that a model trained with that sample data computes faster, thereby improving prediction efficiency.
The third embodiment of the present invention relates to a data processing device based on principal component analysis, as shown in Figure 3, including:
at least one processor 301; and
a memory 302 communicatively connected to the at least one processor 301, where
the memory 302 stores instructions executable by the at least one processor 301, and the instructions are executed by the at least one processor 301 to enable the at least one processor 301 to perform the above data processing method based on principal component analysis.
The memory 302 and the processor 301 are connected by a bus. The bus may comprise any number of interconnected buses and bridges, and links the various circuits of the one or more processors 301 and the memory 302 together. The bus may also link various other circuits, such as peripherals, voltage regulators, and power management circuits, which are well known in the art and therefore are not described further herein. A bus interface provides the interface between the bus and a transceiver. The transceiver may be a single element or multiple elements, such as multiple receivers and transmitters, providing a unit for communicating with various other devices over a transmission medium. Data processed by the processor 301 is transmitted over a wireless medium through an antenna; further, the antenna also receives data and passes the data to the processor 301.
The processor 301 is responsible for managing the bus and the usual processing, and may also provide various functions, including timing, peripheral interfaces, voltage regulation, power management, and other control functions. The memory 302 may be used to store data used by the processor 301 when performing operations.
The fourth embodiment of the present invention relates to a computer-readable storage medium storing a computer program. The computer program, when executed by a processor, implements the above method embodiments.
That is, those skilled in the art will understand that all or some of the steps of the methods in the above embodiments can be accomplished by instructing the relevant hardware through a program. The program is stored in a storage medium and includes several instructions to cause a device (which may be a microcontroller, a chip, etc.) or a processor to execute all or some of the steps of the methods described in the embodiments of this application. The aforementioned storage media include various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
Those of ordinary skill in the art will understand that the above embodiments are specific examples of implementing the present invention, and that in practical applications various changes in form and detail can be made to them without departing from the spirit and scope of the present invention.
Claims (8)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010155934.8A CN111476100B (en) | 2020-03-09 | 2020-03-09 | Data processing method, device and storage medium based on principal component analysis |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010155934.8A CN111476100B (en) | 2020-03-09 | 2020-03-09 | Data processing method, device and storage medium based on principal component analysis |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111476100A CN111476100A (en) | 2020-07-31 |
CN111476100B true CN111476100B (en) | 2023-11-14 |
Family
ID=71748104
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010155934.8A Active CN111476100B (en) | 2020-03-09 | 2020-03-09 | Data processing method, device and storage medium based on principal component analysis |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111476100B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111914954B (en) * | 2020-09-14 | 2024-08-13 | 中移(杭州)信息技术有限公司 | Data analysis method, device and storage medium |
CN112528893A (en) * | 2020-12-15 | 2021-03-19 | 南京中兴力维软件有限公司 | Abnormal state identification method and device and computer readable storage medium |
CN113177879B (en) * | 2021-04-30 | 2024-12-06 | 北京百度网讯科技有限公司 | Image processing method, device, electronic device and storage medium |
CN115730592A (en) * | 2022-11-30 | 2023-03-03 | 贵州电网有限责任公司信息中心 | Power grid redundant data elimination method, device, equipment and storage medium |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7760917B2 (en) * | 2005-05-09 | 2010-07-20 | Like.Com | Computer-implemented method for performing similarity searches |
WO2010144259A1 (en) * | 2009-06-09 | 2010-12-16 | Arizona Board Of Regents Acting For And On Behalf Of Arizona State University | Ultra-low dimensional representation for face recognition under varying expressions |
CN103839041B (en) * | 2012-11-27 | 2017-07-18 | 腾讯科技(深圳)有限公司 | The recognition methods of client features and device |
- 2020-03-09 CN CN202010155934.8A patent/CN111476100B/en active Active
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101021897A (en) * | 2006-12-27 | 2007-08-22 | 中山大学 | Two-dimensional linear discrimination human face analysis identificating method based on interblock correlation |
CN103020640A (en) * | 2012-11-28 | 2013-04-03 | 金陵科技学院 | Facial image dimensionality reduction classification method based on two-dimensional principal component analysis |
CN103942572A (en) * | 2014-05-07 | 2014-07-23 | 中国标准化研究院 | Method and device for extracting facial expression features based on bidirectional compressed data space dimension reduction |
CN105138972A (en) * | 2015-08-11 | 2015-12-09 | 北京天诚盛业科技有限公司 | Face authentication method and device |
CN106845397A (en) * | 2017-01-18 | 2017-06-13 | 湘潭大学 | A kind of confirming face method based on measuring similarity |
CN109784668A (en) * | 2018-12-21 | 2019-05-21 | 国网江苏省电力有限公司南京供电分公司 | A kind of sample characteristics dimension-reduction treatment method for electric power monitoring system unusual checking |
CN109981335A (en) * | 2019-01-28 | 2019-07-05 | 重庆邮电大学 | The feature selection approach of combined class uneven traffic classification |
CN109978023A (en) * | 2019-03-11 | 2019-07-05 | 南京邮电大学 | Feature selection approach and computer storage medium towards higher-dimension big data analysis |
Non-Patent Citations (3)
Title |
---|
A new algorithm of face detection based on differential images and PCA in color image; Yan Xu et al.; 2009 2nd IEEE International Conference on Computer Science and Information Technology; pp. 172-176 *
Image feature dimensionality reduction based on FPCA and the ReliefF algorithm; Qi Yingchun et al.; Journal of Jilin University (Science Edition), No. 5; pp. 153-158 *
Research on data dimensionality reduction algorithms based on feature selection; Yu Dalong; China Master's Theses Full-text Database, Information Science and Technology series, No. 8; pp. I138-317 *
Also Published As
Publication number | Publication date |
---|---|
CN111476100A (en) | 2020-07-31 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |