CN111476100B - Data processing method, device and storage medium based on principal component analysis - Google Patents

Data processing method, device and storage medium based on principal component analysis Download PDF

Info

Publication number
CN111476100B
CN111476100B CN202010155934.8A
Authority
CN
China
Prior art keywords
data
features
sample data
feature
correlation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010155934.8A
Other languages
Chinese (zh)
Other versions
CN111476100A (en
Inventor
奚晓钰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Migu Cultural Technology Co Ltd
China Mobile Communications Group Co Ltd
Original Assignee
Migu Cultural Technology Co Ltd
China Mobile Communications Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Migu Cultural Technology Co Ltd, China Mobile Communications Group Co Ltd filed Critical Migu Cultural Technology Co Ltd
Priority to CN202010155934.8A priority Critical patent/CN111476100B/en
Publication of CN111476100A publication Critical patent/CN111476100A/en
Application granted granted Critical
Publication of CN111476100B publication Critical patent/CN111476100B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2135Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present application relate to the field of software defect prediction and disclose a data processing method and apparatus based on principal component analysis, and a computer-readable storage medium. The method includes: performing dimension-reduction processing on initial sample data to obtain sample data of a preset dimension; acquiring a plurality of features of the sample data and calculating the correlation of each feature with a preset category, the preset category being one of a plurality of categories of the sample data; and removing, from the plurality of features, the features whose correlation is smaller than a preset correlation, and taking the remaining features as the identifying features of the sample data. The data processing method and apparatus and the computer-readable storage medium based on principal component analysis can remove redundant features from sample data to obtain sample data with high discrimination, thereby improving prediction efficiency.

Description

Data processing method, device and storage medium based on principal component analysis
Technical Field
The embodiment of the application relates to the field of data processing, in particular to a data processing method and device based on principal component analysis and a computer readable storage medium.
Background
Information entropy is a measure of the amount of information needed to eliminate uncertainty, i.e., of how much information an unknown event may contain. An event or a system — more precisely, a random variable — carries a certain amount of uncertainty. For some random variables the uncertainty is high, and a great deal of information must be introduced to eliminate it; the measure of that information is expressed as "entropy". The more information that must be introduced to eliminate the uncertainty, the higher the information entropy, and vice versa: when the certainty is high, little information needs to be introduced, so the information entropy is low. According to the information entropy formula given by Shannon, for any random variable X the information entropy, in bits, is defined as: H(X) = -Σ_{x∈X} P(x)·log₂ P(x). The more nearly equal the probabilities of the various outcomes in a system, the greater the information entropy, and vice versa.
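As an illustration only, the formula above can be sketched in a few lines of Python; the function name shannon_entropy and the example distributions are ours, not part of the embodiment:

```python
import numpy as np

def shannon_entropy(probabilities):
    """H(X) = -sum over x of P(x) * log2(P(x)), in bits."""
    p = np.asarray(probabilities, dtype=float)
    p = p[p > 0]  # zero-probability outcomes contribute nothing (log2(0) is undefined)
    return float(-np.sum(p * np.log2(p)))

# The more equal the probabilities, the greater the entropy:
print(shannon_entropy([0.5, 0.5]))   # fair coin -> 1.0 bit (maximum for two outcomes)
print(shannon_entropy([0.9, 0.1]))   # biased coin -> about 0.47 bits
```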
The inventor has found at least the following problem in the prior art: when the features of sample data are analyzed according to this formula, many redundant features are obtained, and a model trained on such sample data predicts inefficiently.
Disclosure of Invention
An object of an embodiment of the present application is to provide a data processing method, apparatus, and computer-readable storage medium based on principal component analysis, which can remove redundant features in sample data to obtain sample data with high discrimination, thereby improving prediction efficiency.
In order to solve the above technical problems, an embodiment of the present application provides a data processing method based on principal component analysis, including:
performing dimension reduction processing on the initial sample data to obtain sample data with preset dimensions; acquiring a plurality of characteristics of the sample data, and calculating the correlation degree of each characteristic and a preset category, wherein the preset category is one category of a plurality of categories of the sample data; and removing the features with the correlation degree smaller than the preset correlation degree from the plurality of features, and taking the remaining features as identification features of the sample data.
The embodiment of the application also provides a data processing device based on principal component analysis, which comprises: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the principal component analysis-based data processing method described above.
The embodiment of the application also provides a computer readable storage medium storing a computer program which, when executed by a processor, implements the data processing method based on principal component analysis.
Compared with the prior art, the embodiments of the present application perform dimension-reduction processing on the initial sample data to obtain sample data of a preset dimension, which facilitates the calculations of the subsequent steps, reduces their computational load, and improves the efficiency of the data processing method; by acquiring a plurality of features of the sample data and calculating the correlation of each feature with a preset category, the preset category being one of a plurality of categories of the sample data, it can be determined from the correlation which of the plurality of features of the sample data are redundant; and by removing, from the plurality of features, the features whose correlation is smaller than a preset correlation and taking the remaining features as the identifying features of the sample data, sample data with high discrimination can be obtained, which increases the computation speed of a model trained on the sample data and thereby improves prediction efficiency.
In addition, after removing the features whose correlation is smaller than the preset correlation from the plurality of features, the method further includes: ranking the remaining features in order of correlation from high to low; dividing the ranked remaining features into N feature segments, each feature segment including M features, where N and M are integers greater than 1; and determining whether there is a feature segment in which the correlation of all M features is greater than a preset threshold and, when such a feature segment is determined to exist, removing the feature with the smallest correlation from that feature segment.
In addition, performing the dimension-reduction processing on the initial sample data specifically includes: converting the initial sample data into a data matrix; calculating a covariance matrix of the data matrix and performing eigendecomposition on the covariance matrix to obtain the eigenvalues of the covariance matrix and the eigenvectors corresponding to the eigenvalues; and obtaining a projection matrix from the eigenvalues and eigenvectors, and reducing the initial sample data to the dimension corresponding to the projection matrix.
In addition, obtaining the projection matrix from the eigenvalues and eigenvectors specifically includes: arranging the eigenvectors as rows of a matrix from top to bottom, such that an eigenvector with a larger corresponding eigenvalue is placed nearer the top of the matrix; and taking the first k rows to form the projection matrix, where k is an integer greater than 1.
In addition, before calculating the covariance matrix of the data matrix, the method further includes: zero-mean processing each row of the data matrix. Calculating the covariance matrix of the data matrix then specifically includes: calculating the covariance matrix of the zero-meaned data matrix.
In addition, the correlation of a feature with the preset category is calculated by the following formula: Si = [Xᵀ×Y + X`ᵀ×Y + Xᵀ×Y` + X`ᵀ×Y`] + [2×IG(X|L) - (H(X) + H(L))] + [2×IG(Y|L) - (H(Y) + H(L))]; where Si is the correlation; X and Y are two different features of the sample data; X` is the representation of X in a different dimension and Y` is the representation of Y in a different dimension; L is the preset category; [Xᵀ×Y + X`ᵀ×Y + Xᵀ×Y` + X`ᵀ×Y`] is the identification correlation between the representations of X and Y in their different dimensions; and [2×IG(X|L) - (H(X) + H(L))] + [2×IG(Y|L) - (H(Y) + H(L))] represents the correlation of X and Y, respectively, with the preset category.
In addition, the correlation of a feature with the preset category may also be calculated by the following formula: Si = [Xᵀ×Y + X`ᵀ×Y + Xᵀ×Y` + X`ᵀ×Y`] + λ×[2×IG(X|L) - (H(X) + H(L))] + [2×IG(Y|L) - (H(Y) + H(L))]; where Si is the correlation; X and Y are two different features of the sample data; X` is the representation of X in a different dimension and Y` is the representation of Y in a different dimension; L is the preset category; [Xᵀ×Y + X`ᵀ×Y + Xᵀ×Y` + X`ᵀ×Y`] is the identification correlation between the representations of X and Y in their different dimensions; [2×IG(X|L) - (H(X) + H(L))] + [2×IG(Y|L) - (H(Y) + H(L))] represents the correlation of X and Y, respectively, with the preset category; and λ is the balance constant.
In addition, the initial sample data is image sample data.
Drawings
One or more embodiments are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which elements having the same reference numerals represent like elements, and in which, unless otherwise stated, the figures are not drawn to scale.
FIG. 1 is a flow chart of a data processing method based on principal component analysis provided according to a first embodiment of the present application;
FIG. 2 is a flow chart of a data processing method based on principal component analysis according to a second embodiment of the present application;
fig. 3 is a schematic structural view of a data processing apparatus based on principal component analysis according to a third embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions, and advantages of the embodiments of the present application clearer, the embodiments of the present application are described in detail below with reference to the accompanying drawings. Those of ordinary skill in the art will appreciate, however, that numerous technical details are set forth in the various embodiments in order to provide a better understanding of the present application, and that the claimed application may nevertheless be practiced without these technical details, and with various changes and modifications based on the following embodiments.
A first embodiment of the present application relates to a data processing method based on principal component analysis; the specific flow is shown in fig. 1 and includes the following steps:
s101: and performing dimension reduction processing on the initial sample data to obtain sample data with preset dimensions.
Specifically, the initial sample data to be processed is obtained in advance, before the dimension-reduction processing is performed on it. In this embodiment, performing the dimension-reduction processing on the initial sample data specifically includes: converting the initial sample data into a data matrix; calculating a covariance matrix of the data matrix and performing eigendecomposition on it to obtain the eigenvalues of the covariance matrix and their corresponding eigenvectors; and obtaining a projection matrix from the eigenvalues and eigenvectors, and reducing the initial sample data to the dimension corresponding to the projection matrix. Obtaining the projection matrix from the eigenvalues and eigenvectors specifically includes: arranging the eigenvectors as rows of a matrix from top to bottom, such that an eigenvector with a larger corresponding eigenvalue is placed nearer the top of the matrix; and taking the first k rows to form the projection matrix, where k is an integer greater than 1. For example, if the matrix formed by arranging the eigenvectors in rows from top to bottom has 8 rows, there are 8 eigenvectors in total; if the eigenvalue corresponding to a certain eigenvector is the largest of the 8, that eigenvector is placed in the first row of the matrix, and so on.
It should be noted that, in order to reduce the error of the data matrix and prevent noise data in the data matrix from influencing the final analysis result, the method further includes, before calculating the covariance matrix of the data matrix: zero-mean processing each row of the data matrix. Calculating the covariance matrix of the data matrix then specifically includes: calculating the covariance matrix of the zero-meaned data matrix.
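A minimal Python sketch of the dimension-reduction steps just described (convert to a data matrix, zero-mean each row, covariance, eigendecomposition, first k eigenvector rows as the projection matrix, X` = PX); the function name pca_reduce and the toy data are illustrative, not taken from the embodiment:

```python
import numpy as np

def pca_reduce(data_matrix, k):
    """Reduce a (features x samples) data matrix to k dimensions."""
    X = np.asarray(data_matrix, dtype=float)
    X = X - X.mean(axis=1, keepdims=True)   # zero-mean each row (attribute field)
    C = (X @ X.T) / X.shape[1]              # covariance matrix of the zero-meaned data
    eigvals, eigvecs = np.linalg.eigh(C)    # symmetric matrix -> real eigenpairs
    order = np.argsort(eigvals)[::-1]       # largest eigenvalue first (top row of the matrix)
    P = eigvecs[:, order[:k]].T             # first k eigenvectors as rows: projection matrix
    return P @ X, P                         # dimension-reduced data X` = PX

# e.g. 50 samples with 8 attributes reduced to 3 dimensions
rng = np.random.default_rng(0)
reduced, P = pca_reduce(rng.normal(size=(8, 50)), k=3)
print(reduced.shape)  # (3, 50)
```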
It will be appreciated that this embodiment reduces the dimensionality of the initial sample data by the PCA method. It should be noted that, in the prior art, multidimensional scaling (MDS) is generally used to reduce the dimension of data samples. MDS is a dimension-reduction method that mines hidden structural information in data by analyzing the similarities between data points; typically, the similarity measure is the Euclidean distance. The purpose of the MDS algorithm is thus to map data samples into a low-dimensional space while preserving the distances between samples as far as possible, thereby reducing the dimension of the samples. MDS is the classical method for preserving Euclidean distance in theory, and was earliest used mainly for data visualization. Since the low-dimensional representation obtained by MDS is centered at the origin, it can equally be said to preserve inner products — that is, distances in the high-dimensional space are approximated by inner products in the low-dimensional space. Classical MDS methods generally use the Euclidean distance for distances in the high-dimensional space. Multidimensional scaling (MDS) and principal component analysis (PCA) are both data dimension-reduction techniques, but they differ in what they optimize. The input to PCA is a set of original vectors in an n-dimensional space; PCA projects the data onto the projection directions of greatest covariance, so that the characteristics of the data are substantially preserved during the dimension reduction. The input to MDS is the pairwise distances between points, and its output is a distance-preserving projection of the points into two or three dimensions.
Briefly, PCA minimizes the sample dimension while preserving the covariance of the data; MDS minimizes the sample dimension while preserving the distances between data points. If the data covariance and the distances between the high-dimensional data points are both Euclidean, the two methods are identical; if the distance measures differ, the two methods differ. MDS clearly has limitations, which PCA can compensate for as an alternative method with a wider range of application. Because the input to PCA is the original vectors of the n-dimensional space, the algorithm is simpler on the input side than MDS and has lower complexity. Most importantly, for data reduction and preprocessing in the field of software defects, the PCA method is very widely applicable and performs better than MDS.
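For comparison only, the classical (Torgerson) MDS construction referred to above, which embeds points from their pairwise Euclidean distances while preserving inner products, can be sketched as follows; this is the generic textbook procedure, not a method of the present application:

```python
import numpy as np

def classical_mds(D, dims=2):
    """Embed points from a pairwise Euclidean distance matrix D into `dims` dimensions."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centering matrix (output is centered at the origin)
    B = -0.5 * J @ (D ** 2) @ J               # double-centered Gram (inner-product) matrix
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:dims]  # keep the largest eigenvalues
    scale = np.sqrt(np.maximum(eigvals[order], 0.0))
    return eigvecs[:, order] * scale          # low-dimensional, distance-preserving coordinates
```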
For ease of understanding, the algorithmic process of the PCA method is explained in detail below:
let a total of N image training samples, simply denoted as x k E X (k=1, once again, N), X is the data set of the training samples, training samples are c classes, each class has N respectively i And (3) training samples, and expanding an image matrix of each piece of data to obtain a column vector dimension n. The average sample of all image training samples is represented by the following formula:
the average samples of the i (i=1, …, c) th class of training samples are expressed as follows:
the main component analysis method comprises the following specific processes: firstly, a database is read in, each read-in two-dimensional data image data is unfolded into a one-dimensional vector, each type of image sample can select a certain number of images to form a training sample set according to a generated random matrix, and the rest of images form a test sample set. Next, a generator matrix of the K-L orthogonal transform is computed, which may be derived from the overall divergence matrix S of the training samples T Representing, also by the inter-class divergence matrix S of the training samples B To represent, the divergence matrix is generated from the training set, here the overall divergence matrix S T Representation, defined as:
the generator matrix Σ may be expressed as: Σ=s T S T T
Then, the eigenvalue decomposition is carried out, the eigenvalue and eigenvector of the generating matrix sigma are calculated, the eigenvalues are orderly sequenced from big to small, the first m largest eigenvalues and eigenvectors corresponding to the m eigenvalues are reserved, so that a projection matrix projected from a high-dimensional space to a low-dimensional space is obtained, and the eigenvoice subspace is constructed. That is, the PCA method using K-L transformation aims to find a set of optimal projection vectors, satisfying a criterion function:
the next step is to find the best projection vector, i.e. the unit vector w that maximizes the above criterion function, which has the physical meaning: the overall dispersion degree of the feature vector obtained after the projection of the image vector is maximum in the direction indicated by the projection vector w, that is, the distance between each sample of the image data and the average sample of the overall training sample is maximum. Because the best projection vector calculated is the overall divergence matrix S T A unit feature vector corresponding to the maximum feature value of (a). In the case of a large number of sample typesIn case, only a single optimal projection direction is insufficient for fully characterizing all image samples. Thus, there is a need to find a set of criteria functions that can maximizeOptimal projection vector group w capable of meeting standard orthogonal condition 1 ,w 2 ,...,w m . The best projection matrix is represented by the best projection vector set, i.e., p= [ w ] 1 ,w 2 ,...,w m ]。
Then, the training samples and the test samples are each projected into the feature subspace obtained above; after projection, each data image corresponds to a point in the subspace. Conversely, any point in the feature subspace corresponds to a data image, and the point obtained by projecting a data image into the feature subspace is called an eigenface. As the name implies, the "eigenface" method is a method of data recognition based on the K-L orthogonal transform.
Finally, all the test image samples that have been transformed into the feature subspace by vector projection are compared with the training image samples to determine the class of the data image sample to be recognized; that is, the test samples are classified, for which a suitable classifier and dissimilarity test formula are selected.
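The projection and comparison of the last two paragraphs might be sketched as follows, assuming flattened image column vectors and, as the "suitable classifier and dissimilarity formula", an illustrative nearest-neighbour rule under Euclidean distance; the function name eigenface_classify is ours:

```python
import numpy as np

def eigenface_classify(train_vecs, train_labels, test_vecs, m):
    """Project training and test images into a top-m eigenface subspace and
    classify each test image by its nearest training sample."""
    mean = train_vecs.mean(axis=1, keepdims=True)
    A = train_vecs - mean                             # centered training images (columns)
    C = (A @ A.T) / A.shape[1]                        # total scatter / covariance
    eigvals, eigvecs = np.linalg.eigh(C)
    W = eigvecs[:, np.argsort(eigvals)[::-1][:m]]     # top-m projection basis
    train_proj = W.T @ A                              # each image becomes a point in the subspace
    test_proj = W.T @ (test_vecs - mean)
    labels = []
    for j in range(test_proj.shape[1]):
        d = np.linalg.norm(train_proj - test_proj[:, [j]], axis=0)
        labels.append(train_labels[int(np.argmin(d))])  # nearest-neighbour class
    return labels
```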
S102: and acquiring a plurality of characteristics of the sample data, and calculating the correlation degree of each characteristic and a preset category.
Specifically, the correlation between a feature and the preset category can be calculated by the following formula: Si = [Xᵀ×Y + X`ᵀ×Y + Xᵀ×Y` + X`ᵀ×Y`] + [2×IG(X|L) - (H(X) + H(L))] + [2×IG(Y|L) - (H(Y) + H(L))]; where Si is the correlation; X and Y are two different features of the sample data; X` is the representation of X in a different dimension and Y` is the representation of Y in a different dimension; L is the preset category; [Xᵀ×Y + X`ᵀ×Y + Xᵀ×Y` + X`ᵀ×Y`] is the identification correlation between the representations of X and Y in their different dimensions; and [2×IG(X|L) - (H(X) + H(L))] + [2×IG(Y|L) - (H(Y) + H(L))] represents the correlation of X and Y, respectively, with the preset category.
It should be noted that, in order to balance the calculation terms of the first portion and the second portion, a balance parameter λ needs to be added; this embodiment may therefore also calculate the correlation between a feature and the preset category by the following formula:
Si = [Xᵀ×Y + X`ᵀ×Y + Xᵀ×Y` + X`ᵀ×Y`] + λ×[2×IG(X|L) - (H(X) + H(L))] + [2×IG(Y|L) - (H(Y) + H(L))]; where Si is the correlation; X and Y are two different features of the sample data; X` is the representation of X in a different dimension and Y` is the representation of Y in a different dimension; L is the preset category; [Xᵀ×Y + X`ᵀ×Y + Xᵀ×Y` + X`ᵀ×Y`] is the identification correlation between the representations of X and Y in their different dimensions; [2×IG(X|L) - (H(X) + H(L))] + [2×IG(Y|L) - (H(Y) + H(L))] represents the correlation of X and Y, respectively, with the preset category; and λ is the balance constant.
It can be understood that, compared with the information-entropy formula of the prior art, the formula of this embodiment extends the original method, which computed over the features of a single sample only, so that the correlation used for selecting identifying features between any two different features of the same sample is higher. The first square bracket represents the identification correlation between any two different features and between their representations of the same sample in different dimensions; through the constraint of this term, the more intuitive, more highly correlated features can be obtained, whereas the original method cannot obtain highly correlated features intuitively by calculation: it considers only the correlation of the sample features themselves and ignores the more important correlations among them. The present calculation makes full use of the relations between different features of the same sample, including correlated features and redundant features. Evidently, the larger the value computed by the first square bracket, the higher the correlation among the sample features; conversely, the smaller that value, the higher the redundancy among the sample features, and the redundant features can then be removed effectively, leaving sample features with higher discrimination. The latter two bracketed terms represent the similarity between each of the two different features of the same sample and the class variable; likewise, a larger value indicates that the feature is more similar to the class, i.e., more highly correlated, and a smaller value indicates that it is less similar, i.e., less correlated.
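One possible reading of the Si formula (the λ variant) is sketched below. The helper names entropy, info_gain, and si_correlation are ours; it is assumed that IG(·|L) denotes the information gain of a feature with respect to the class labels, that the entropy terms are computed on discretized copies of the features (Xd, Yd, here plain Python sequences), and that the first bracket is evaluated on the numeric vectors X, Y, X`, Y`:

```python
import numpy as np
from collections import Counter

def entropy(values):
    """H(V) over the empirical distribution of a discrete sequence, in bits."""
    n = len(values)
    return -sum((c / n) * np.log2(c / n) for c in Counter(values).values())

def info_gain(feature, label):
    """IG(feature|L) = H(feature) - H(feature | label) on discrete data."""
    n, cond = len(label), 0.0
    for lv, c in Counter(label).items():
        subset = [f for f, l in zip(feature, label) if l == lv]
        cond += (c / n) * entropy(subset)
    return entropy(feature) - cond

def si_correlation(X, Y, Xp, Yp, Xd, Yd, L, lam=1.0):
    """Si = [identification correlation of X, Y and their other-dimension
    representations Xp, Yp] + lam*[2*IG(X|L) - (H(X)+H(L))] + [2*IG(Y|L) - (H(Y)+H(L))],
    with lam applied to the X term as in the embodiment's formula."""
    ident = float(X @ Y + Xp @ Y + X @ Yp + Xp @ Yp)  # X^T x Y etc. for 1-D vectors
    rel_x = 2 * info_gain(Xd, L) - (entropy(Xd) + entropy(L))
    rel_y = 2 * info_gain(Yd, L) - (entropy(Yd) + entropy(L))
    return ident + lam * rel_x + rel_y
```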
S103: and removing the features with the correlation degree smaller than the preset correlation degree from the plurality of features, and taking the remaining features as identification features of the sample data.
Specifically, the magnitude of the preset correlation may be set according to actual requirements and is not specifically limited in this embodiment. This embodiment performs principal component analysis on the sample data. The basic idea of principal component analysis is to extract the main features of a high-dimensional data space while keeping most of the information of the original high-dimensional data, so that the high-dimensional data can be processed in a lower-dimensional feature space. The K-L transform is the basis of principal component analysis. It is an optimal orthogonal transform based on the statistical characteristics of the target. Its aim is to find a linear projection transform such that the new feature components are orthogonal or uncorrelated; and, in order that the energy of the data be more concentrated, the error, in the least-mean-square sense, between the feature components reconstructed after projection and the original input sample is required to be minimal. A low-dimensional approximation of the original sample is thus obtained, and the original data can be compressed well. Applying the K-L transform to data recognition yields the classic eigenfaces method, which forms the basis of subspace learning methods. In short, a group of eigenface images is obtained by principal component analysis from the input training images; given any data image, it can be represented linearly by this group of eigenface images — that is, as a weighted linear combination of the eigenfaces obtained by principal component analysis.
The essence of principal component analysis is to compute and diagonalize the covariance matrix. It may be assumed that all the data images lie in a linear low-dimensional space in which they are linearly separable; the principal component analysis method is then used for data feature recognition. The key to reducing the spatial dimension of the input data with principal component analysis is to find the projection that best represents the original data, so that the dimension is reduced — "de-noising" and eliminating redundant dimensions — without losing the most important characteristics of the original input. In the covariance matrix, only the dimensions with relatively large energy (eigenvalues) need to be selected, and the relatively low remainder discarded; the important features of the input image data are thereby retained, and the other parts, which do not help data recognition, are discarded.
For ease of understanding, the following specifically exemplifies how sample data is processed in this embodiment:
input: training sample set: x= [ X1, X2, ], xc ], where xi= (F1, F2, ], fm, L), k < m, i=1.
Dimension of dimension reduction of PCA data: k (k)
Correlation threshold (preset correlation): beta
1) The original data is arranged into a data matrix by columns, each two-dimensional data image read in being unrolled into a one-dimensional vector.
2) Each row of the data matrix (representing one attribute field) is zero-meaned, i.e., the mean of that row is subtracted.
3) The covariance matrix is computed.
4) Eigenvalue decomposition is performed to obtain the eigenvalues and corresponding eigenvectors of the covariance matrix.
5) The eigenvectors are arranged into a matrix from top to bottom by the size of their corresponding eigenvalues, and the first k rows are taken to form the sample projection matrix P.
6) The data is reduced to the dimension corresponding to the projection matrix, i.e., to k dimensions: X` = PX, which takes the data from m dimensions down to k dimensions. The resulting dimension-reduced sample set is denoted X` = [x`_1, x`_2, …, x`_c], where x`_i = (F_1, F_2, …, F_k, L), k < m, i = 1, …, c.
7) For i = 1 to k and j = 1 to k (i ≠ j), in a loop, compute Si = ISU(F_i, F_i`, F_j`, L).
8) The Si values are sorted from largest to smallest.
9) The first g features in this order are taken as the features of the new samples, giving the sample set X`` = [x``_1, x``_2, …, x``_c], where x``_i = (F_1, F_2, …, F_g, L), g < k, i = 1, …, c.
10) Correlation analysis is performed on each pair of features from back to front, and the designated features whose correlation is greater than β are removed, yielding the final sample set Y.
Output: sample set Y.
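Steps 7) to 10) above, following the PCA reduction already sketched, might look roughly as follows; the per-feature aggregation of Si scores and the corr_fn stand-in for the ISU measure (with Pearson correlation for the pairwise back-to-front sweep) are our interpretation of the listing, not a verbatim implementation:

```python
import numpy as np

def select_identifying_features(X_reduced, labels, g, beta, corr_fn):
    """X_reduced: (k features x samples) PCA output; corr_fn(fi, fj, labels)
    is a stand-in for the ISU-based Si measure."""
    k = X_reduced.shape[0]
    scores = np.zeros(k)
    for i in range(k):                      # step 7): score every ordered feature pair
        for j in range(k):
            if i != j:
                scores[i] += corr_fn(X_reduced[i], X_reduced[j], labels)
    keep = list(np.argsort(scores)[::-1][:g])   # steps 8)-9): keep the g best features
    a = len(keep) - 1
    while a > 0:                            # step 10): back-to-front redundancy sweep
        redundant = any(
            abs(np.corrcoef(X_reduced[keep[a]], X_reduced[keep[b]])[0, 1]) > beta
            for b in range(a)
        )
        if redundant:
            keep.pop(a)                     # drop the lower-ranked (less discriminative) feature
        a -= 1
    return X_reduced[keep], keep            # final sample features Y and their indices
```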
Compared with the prior art, the embodiments of the present application perform dimension-reduction processing on the initial sample data to obtain sample data of a preset dimension, which facilitates the calculations of the subsequent steps, reduces their computational load, and improves the efficiency of the data processing method; by acquiring a plurality of features of the sample data and calculating the correlation of each feature with a preset category, the preset category being one of a plurality of categories of the sample data, it can be determined from the correlation which of the plurality of features of the sample data are redundant; and by removing, from the plurality of features, the features whose correlation is smaller than a preset correlation and taking the remaining features as the identifying features of the sample data, sample data with high discrimination can be obtained, which increases the computation speed of a model trained on the sample data and thereby improves prediction efficiency.
A second embodiment of the present application relates to a data processing method based on principal component analysis. The second embodiment is a further improvement on the first embodiment, the specific improvement being that, in the second embodiment, after the features whose correlation is smaller than the preset correlation are removed from the plurality of features, the method further includes: ranking the remaining features in order of correlation from high to low; dividing the ranked remaining features into N feature segments, each feature segment including M features, where N and M are integers greater than 1; and determining whether there is a feature segment in which the correlation of all M features is greater than a preset threshold and, when such a feature segment is determined to exist, removing the feature with the smallest correlation from that feature segment. In this way, redundant features in the sample data can be further reduced, further improving prediction efficiency.
The specific flow of this embodiment is shown in fig. 2, and includes:
s201: and performing dimension reduction processing on the initial sample data to obtain sample data with preset dimensions.
S202: and acquiring a plurality of characteristics of the sample data, and calculating the correlation degree of each characteristic and a preset category.
S203: and removing the characteristics with the correlation degree smaller than the preset correlation degree from the characteristics.
S204: and sorting the plurality of features with the removed features with the correlation less than the preset correlation according to the sequence from high correlation to low correlation, and dividing the rest sorted features into N feature segments.
S205: judging whether M feature segments with the features larger than a preset threshold exist or not, and removing the feature with the minimum similarity in the feature segments when judging the existence of the feature segments.
For the above steps S204 to S205: specifically, the redundant features of the sample data are removed using a threshold correlation method. The threshold correlation method identifies redundant features by means of the correlation between features; since nonlinear relations exist in actual software measurement, ISU is still chosen to calculate the correlation between a pair of features. The threshold correlation method uses the preset β (i.e., the preset threshold) as the critical value of correlation: after the features whose correlation is smaller than the preset correlation have been removed from the plurality of features, correlation analysis is performed on the remaining features from back to front, and every pair of features whose correlation is greater than the critical value is removed from the sample set, and so on. The correlation analysis proceeds from back to front because, once the low-correlation features have been removed, the discrimination of the features increases toward the front of the ordering; analysing from back to front therefore means that, whenever two features whose correlation is greater than β are encountered, the feature with the smaller discrimination is removed preferentially, so that the feature with the greater discrimination is retained.
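A sketch of the segmented pruning of steps S204 to S205, under the assumption that each feature carries a precomputed class-correlation score; the function name prune_feature_segments is ours:

```python
import numpy as np

def prune_feature_segments(features, correlations, M, threshold):
    """features: list of feature ids; correlations: their class-correlation
    scores. Sort high-to-low, split into segments of M, and in any segment
    where all M correlations exceed the threshold, drop the weakest feature."""
    order = np.argsort(correlations)[::-1]          # rank features high-to-low (S204)
    kept = []
    for start in range(0, len(order), M):           # divide into N segments of M features
        segment = list(order[start:start + M])
        if len(segment) == M and all(correlations[i] > threshold for i in segment):
            weakest = min(segment, key=lambda i: correlations[i])
            segment.remove(weakest)                 # remove the minimum-correlation feature (S205)
        kept.extend(segment)
    return [features[i] for i in kept]
```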
S206: the remaining features are taken as identifying features of the sample data.
Steps S201 to S203 and S206 of the present embodiment are similar to steps S101 to S103 of the first embodiment, and are not repeated here.
Compared with the prior art, the embodiment of the application obtains the sample data with the preset dimension by performing the dimension reduction processing on the initial sample data, so that the calculation of the subsequent step is facilitated, the operand of the subsequent step is reduced, and the efficiency of the data processing method is improved; by acquiring a plurality of characteristics of the sample data and calculating the correlation degree of each characteristic and a preset category, the preset category is one category of a plurality of categories of the sample data, and in this way, which characteristics of the plurality of characteristics of the sample data are redundant characteristics can be known according to the similarity; by removing the features with the correlation degree smaller than the preset correlation degree from the plurality of features and taking the remaining features as the identification features of the sample data, the sample data with high identification can be obtained, so that the budget speed of a training model using the sample data is increased, and the prediction efficiency is improved.
A third embodiment of the present application relates to a data processing apparatus based on principal component analysis, as shown in fig. 3, including:
at least one processor 301; and,
a memory 302 communicatively coupled to the at least one processor 301; wherein,
the memory 302 stores instructions executable by the at least one processor 301, the instructions being executable by the at least one processor 301 to enable the at least one processor 301 to perform the above-described principal component analysis based data processing method.
Here the memory 302 and the processor 301 are connected by a bus. The bus may comprise any number of interconnected buses and bridges, linking together the various circuits of the one or more processors 301 and the memory 302. The bus may also connect various other circuits, such as peripherals, voltage regulators, and power management circuits; these are well known in the art and are therefore not described further herein. A bus interface provides an interface between the bus and a transceiver. The transceiver may be one element or a plurality of elements, such as a plurality of receivers and transmitters, providing a means for communicating with various other apparatus over a transmission medium. Data processed by the processor 301 is transmitted over a wireless medium via an antenna, which also receives incoming data and forwards it to the processor 301.
The processor 301 is responsible for managing the bus and general processing and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. And memory 302 may be used to store data used by processor 301 in performing operations.
A fourth embodiment of the present application relates to a computer-readable storage medium storing a computer program. The computer program implements the above-described method embodiments when executed by a processor.
That is, those skilled in the art will understand that all or part of the steps of the methods of the embodiments described above may be implemented by a program stored in a storage medium, the program including several instructions for causing a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to perform all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes any medium capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
It will be understood by those of ordinary skill in the art that the foregoing embodiments are specific examples of carrying out the application and that various changes in form and details may be made therein without departing from the spirit and scope of the application.

Claims (8)

1. A data processing method based on principal component analysis, comprising:
performing dimension-reduction processing on initial sample data to obtain sample data of a preset dimension, wherein the initial sample data is image sample data and the sample data are eigenface images;
acquiring a plurality of characteristics of the sample data, and calculating the correlation degree of each characteristic and a preset category, wherein the preset category is one category of a plurality of categories of the sample data;
removing the features with the correlation degree smaller than the preset correlation degree from the plurality of features, and taking the remaining features as identification features of the sample data;
the step of performing dimension reduction processing on the initial sample data to obtain sample data with preset dimensions comprises the following steps:
performing data feature recognition processing on the image sample data based on a principal component analysis method to obtain each eigenface image, wherein each eigenface image is used to linearly represent any one of the image sample data;
after removing the feature with the correlation degree smaller than the preset correlation degree, the method further includes:
ranking the remaining features in the order of the correlation from high to low;
dividing the sequenced residual features into N feature segments, wherein each feature segment comprises M features, and N and M are each an integer greater than 1;
judging whether there is a feature segment in which the correlation degree of the M features is greater than a preset threshold value, and removing the feature with the smallest correlation degree from the feature segment when it is judged that such a feature segment exists.
2. The method for processing data based on principal component analysis according to claim 1, wherein the performing the dimension reduction processing on the initial sample data specifically comprises:
converting the initial sample data into a data matrix;
calculating a covariance matrix of the data matrix, and performing eigendecomposition on the covariance matrix to obtain eigenvalues of the covariance matrix and eigenvectors corresponding to the eigenvalues;
and obtaining a projection matrix according to the eigenvalues and the eigenvectors, and reducing the dimension of the initial sample data to the dimension corresponding to the projection matrix.
3. The method for processing data based on principal component analysis according to claim 2, wherein the obtaining a projection matrix according to the eigenvalues and the eigenvectors specifically includes:
arranging the eigenvectors as rows of a matrix from top to bottom, wherein an eigenvector with a larger corresponding eigenvalue is positioned nearer the top of the matrix;
and taking the first k rows to form the projection matrix, wherein k is an integer greater than 1.
4. A data processing method based on principal component analysis according to claim 2 or 3, further comprising, before calculating the covariance matrix of the data matrix:
zero-mean processing each row of the data matrix;
the calculating the covariance matrix of the data matrix specifically comprises the following steps:
and calculating a covariance matrix of the data matrix after zero-mean processing.
5. The principal component analysis-based data processing method according to claim 1, wherein the correlation of features with the preset categories is calculated by the following formula:
Si = [Xᵀ×Y + X`ᵀ×Y + Xᵀ×Y` + X`ᵀ×Y`] + [2×IG(X|L) - (H(X) + H(L))] + [2×IG(Y|L) - (H(Y) + H(L))];
wherein Si is the degree of correlation; X and Y are two different features of the sample data; X` is the representation of X in a different dimension and Y` is the representation of Y in a different dimension; L is the preset category; [Xᵀ×Y + X`ᵀ×Y + Xᵀ×Y` + X`ᵀ×Y`] is the identification correlation between the representations of X and Y in their different dimensions; and [2×IG(X|L) - (H(X) + H(L))] + [2×IG(Y|L) - (H(Y) + H(L))] represents the correlation of X and Y, respectively, with the preset category.
6. The principal component analysis-based data processing method according to claim 1, wherein the correlation of features with the preset categories is calculated by the following formula:
Si = [Xᵀ×Y + X`ᵀ×Y + Xᵀ×Y` + X`ᵀ×Y`] + λ×[2×IG(X|L) - (H(X) + H(L))] + [2×IG(Y|L) - (H(Y) + H(L))];
wherein Si is the degree of correlation; X and Y are two different features of the sample data; X` is the representation of X in a different dimension and Y` is the representation of Y in a different dimension; L is the preset category; [Xᵀ×Y + X`ᵀ×Y + Xᵀ×Y` + X`ᵀ×Y`] is the identification correlation between the representations of X and Y in their different dimensions; [2×IG(X|L) - (H(X) + H(L))] + [2×IG(Y|L) - (H(Y) + H(L))] represents the correlation of X and Y, respectively, with the preset category; and λ is the balance constant.
7. A data processing apparatus based on principal component analysis, comprising: at least one processor; and,
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the principal component analysis-based data processing method of any one of claims 1 to 6.
8. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the principal component analysis-based data processing method according to any one of claims 1 to 6.
CN202010155934.8A 2020-03-09 2020-03-09 Data processing method, device and storage medium based on principal component analysis Active CN111476100B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010155934.8A CN111476100B (en) 2020-03-09 2020-03-09 Data processing method, device and storage medium based on principal component analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010155934.8A CN111476100B (en) 2020-03-09 2020-03-09 Data processing method, device and storage medium based on principal component analysis

Publications (2)

Publication Number Publication Date
CN111476100A CN111476100A (en) 2020-07-31
CN111476100B true CN111476100B (en) 2023-11-14

Family

ID=71748104

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010155934.8A Active CN111476100B (en) 2020-03-09 2020-03-09 Data processing method, device and storage medium based on principal component analysis

Country Status (1)

Country Link
CN (1) CN111476100B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111914954B (en) * 2020-09-14 2024-08-13 中移(杭州)信息技术有限公司 Data analysis method, device and storage medium
CN112528893A (en) * 2020-12-15 2021-03-19 南京中兴力维软件有限公司 Abnormal state identification method and device and computer readable storage medium
CN113177879A (en) * 2021-04-30 2021-07-27 北京百度网讯科技有限公司 Image processing method, image processing apparatus, electronic device, and storage medium
CN115730592A (en) * 2022-11-30 2023-03-03 贵州电网有限责任公司信息中心 Power grid redundant data elimination method, device, equipment and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101021897A (en) * 2006-12-27 2007-08-22 中山大学 Two-dimensional linear discrimination human face analysis identificating method based on interblock correlation
CN103020640A (en) * 2012-11-28 2013-04-03 金陵科技学院 Facial image dimensionality reduction classification method based on two-dimensional principal component analysis
CN103942572A (en) * 2014-05-07 2014-07-23 中国标准化研究院 Method and device for extracting facial expression features based on bidirectional compressed data space dimension reduction
CN105138972A (en) * 2015-08-11 2015-12-09 北京天诚盛业科技有限公司 Face authentication method and device
CN106845397A (en) * 2017-01-18 2017-06-13 湘潭大学 A kind of confirming face method based on measuring similarity
CN109784668A (en) * 2018-12-21 2019-05-21 国网江苏省电力有限公司南京供电分公司 A kind of sample characteristics dimension-reduction treatment method for electric power monitoring system unusual checking
CN109978023A (en) * 2019-03-11 2019-07-05 南京邮电大学 Feature selection approach and computer storage medium towards higher-dimension big data analysis
CN109981335A (en) * 2019-01-28 2019-07-05 重庆邮电大学 The feature selection approach of combined class uneven traffic classification

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7760917B2 (en) * 2005-05-09 2010-07-20 Like.Com Computer-implemented method for performing similarity searches
US8842891B2 (en) * 2009-06-09 2014-09-23 Arizona Board Of Regents On Behalf Of Arizona State University Ultra-low dimensional representation for face recognition under varying expressions
CN103839041B (en) * 2012-11-27 2017-07-18 腾讯科技(深圳)有限公司 The recognition methods of client features and device

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101021897A (en) * 2006-12-27 2007-08-22 中山大学 Two-dimensional linear discrimination human face analysis identificating method based on interblock correlation
CN103020640A (en) * 2012-11-28 2013-04-03 金陵科技学院 Facial image dimensionality reduction classification method based on two-dimensional principal component analysis
CN103942572A (en) * 2014-05-07 2014-07-23 中国标准化研究院 Method and device for extracting facial expression features based on bidirectional compressed data space dimension reduction
CN105138972A (en) * 2015-08-11 2015-12-09 北京天诚盛业科技有限公司 Face authentication method and device
CN106845397A (en) * 2017-01-18 2017-06-13 湘潭大学 A kind of confirming face method based on measuring similarity
CN109784668A (en) * 2018-12-21 2019-05-21 国网江苏省电力有限公司南京供电分公司 A kind of sample characteristics dimension-reduction treatment method for electric power monitoring system unusual checking
CN109981335A (en) * 2019-01-28 2019-07-05 重庆邮电大学 The feature selection approach of combined class uneven traffic classification
CN109978023A (en) * 2019-03-11 2019-07-05 南京邮电大学 Feature selection approach and computer storage medium towards higher-dimension big data analysis

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
A new algorithm of face detection based on differential images and PCA in color image; Yan Xu et al.; 2009 2nd IEEE International Conference on Computer Science and Information Technology; pp. 172-176 *
Image feature dimensionality reduction based on FPCA and ReliefF algorithms; Qi Yingchun et al.; Journal of Jilin University (Science Edition), No. 5; pp. 153-158 *
Research on data dimensionality-reduction algorithms based on feature selection; Yu Dalong; China Masters' Theses Full-text Database, Information Science and Technology Series, No. 8; I138-317 *

Also Published As

Publication number Publication date
CN111476100A (en) 2020-07-31

Similar Documents

Publication Publication Date Title
CN111476100B (en) Data processing method, device and storage medium based on principal component analysis
Rainforth et al. Canonical correlation forests
US11294624B2 (en) System and method for clustering data
Alzate et al. Multiway spectral clustering with out-of-sample extensions through weighted kernel PCA
Landgrebe et al. Efficient multiclass ROC approximation by decomposition via confusion matrix perturbation analysis
US8538173B2 (en) Computer readable medium, apparatus, and method for adding identification information indicating content of a target image using decision trees generated from a learning image
Denton Kernel-density-based clustering of time series subsequences using a continuous random-walk noise model
Shrivastava et al. Learning discriminative dictionaries with partially labeled data
US9842279B2 (en) Data processing method for learning discriminator, and data processing apparatus therefor
US20220179912A1 (en) Search device, search method and learning model search system
CN106599856A (en) Combined face detection, positioning and identification method
CN109034238A (en) A kind of clustering method based on comentropy
US9576222B2 (en) Image retrieval apparatus, image retrieval method, and recording medium
JP2014228995A (en) Image feature learning device, image feature learning method and program
Sivasankar et al. Feature reduction in clinical data classification using augmented genetic algorithm
US20200279148A1 (en) Material structure analysis method and material structure analyzer
Ibrahim et al. On feature selection methods for accurate classification and analysis of emphysema ct images
CN110941542A (en) Sequence integration high-dimensional data anomaly detection system and method based on elastic network
Wu et al. Discriminant Tensor Dictionary Learning with Neighbor Uncorrelation for Image Set Based Classification.
CN111914954B (en) Data analysis method, device and storage medium
Marini et al. Feature Selection for Enhanced Spectral Shape Comparison.
Bharathi et al. The significance of feature selection techniques in machine learning
McInerney et al. On using sift descriptors for image parameter evaluation
Vengatesan et al. FAST Clustering Algorithm for Maximizing the Feature Selection in High Dimensional Data
CN109978066A (en) Quick Spectral Clustering based on multi-Scale Data structure

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant