CN115017125A - Data processing method and device for improving KNN method
Abstract
The invention discloses a data processing method and device for improving the KNN method, relates to the technical field of data processing, and solves the technical problem of limited data processing capability. The technical scheme comprises the following steps: step one, acquiring data information from database information, and performing dimensionality reduction on the acquired data information to obtain low-dimensional data information; step two, processing the dimensionality-reduced data information through an improved KNN algorithm model; step three, evaluating the processed data information through an improved error evaluation function; and step four, applying and sharing the data information, and performing remote data information processing and data sharing on the acquired data information. Through data dimensionality reduction, data preprocessing, data mining, and error analysis and processing, the invention greatly improves the data information processing capability.
Description
Technical Field
The present invention relates to the field of data processing, and more particularly, to a data processing method and apparatus for improving a KNN method.
Background
Data processing is a basic link of systems engineering and automatic control, and it runs through every field of social production and social life. The development of data processing technology, and the breadth and depth of its application, have greatly influenced the progress of human society. Data is a representation of facts, concepts or instructions that can be processed by manual or automated means. Once data is interpreted and given a certain meaning, it becomes information. Data processing is the collection, storage, retrieval, processing, transformation, and transmission of data. Its basic purpose is to extract and derive data that is valuable and meaningful to certain users from large, possibly chaotic and unintelligible masses of data.
In the prior art, data information is generally processed by data statistics methods, which improve the data processing capability to a certain extent. However, when data information is analyzed and calculated, classification and data information processing are difficult to realize, the overall data information processing capability is poor, and the data information processing method lags behind.
Disclosure of Invention
Aiming at the above technical defects, the invention discloses a data processing method and device for improving the KNN method, which greatly improve the data information processing capability through data dimensionality reduction, data preprocessing, data mining, and error analysis and processing.
In order to realize the technical effects, the invention adopts the following technical scheme:
a data processing method for improving a KNN method, comprising the steps of:
step one, acquiring data information from database information, and performing dimensionality reduction processing on the acquired data information to acquire low-dimensionality data information;
step two, carrying out data information processing on the data information after the dimensionality reduction through an improved KNN algorithm model, wherein the improved KNN algorithm model comprises a data preprocessing step, a data layering step, a data KNN algorithm calculating step and a convolution fault diagnosis step;
step three, evaluating the processed data information through an improved error evaluation function;
and step four, applying and sharing the data information, and performing remote data information processing and data sharing on the acquired data information.
As a further technical scheme of the invention, the dimension reduction processing method comprises the following steps:
(S11) dimension reduction is realized by reconstructing matrix data information, and the number of reconstructed matrix data, data dimension and time delay are set;
(S12) solving the distribution probability of different element libraries through an average mutual information method, and analyzing data characteristics through a correlation algorithm model;
(S13) the dimensionality of the data information is calculated through the false nearest neighbor method, and different data classifications are selected by comparing the dimensionalities of different data information; the sequences in two different dimensions are compared across different elements of the database information through a feature pair measurement method, according to formula (1):
R(n) = ‖X_{u+1}(n) − X_{u+1}(r)‖ / ‖X_u(n) − X_u(r)‖  (1)
in formula (1), R represents the data dimension criterion and n represents the vector index; X_u(n) represents the matrix data information before reconstruction and X_{u+1}(n) represents the reconstructed matrix data information; X_u(r) and X_{u+1}(r) represent the relationship of false adjacent points among the reconstruction matrix data, r indexing the data information added after reconstruction; u is the optimal dimensionality of the reconstructed matrix data information; after reconstruction, the difference between the element data dimensionality of the reconstruction matrix and the data dimensionality after dimensionality reduction is greater than 10;
and (S14) performing dimension reduction judgment, outputting data information when the dimension reduction data information meets the current requirement, and performing dimension reduction calculation again when the dimension reduction data information does not meet the current requirement.
As a further technical scheme of the invention, the data layering is differential layering, and the differential layering method comprises the following steps:
dividing the data attributes into different categories according to their number and type, and arranging the attribute data quantity in order from the top layer to the bottom layer, from least to most;
calculating the distance between different data attributes: assuming that a piece of data information in the data set is x and the data attribute categories are C1, C2, C3 and C4, then the distances from the data information x to the attribute categories C1, C2, C3 and C4 are d1, d2, d3 and d4 respectively;
carrying out differential calculation on the calculated data information with different data attributes: when dj = min{d1, d2, d3, d4}, where j is a constant index, the data information x is divided into the j-th class.
As a further technical scheme of the invention, the data KNN algorithm comprises the following steps:
(S21) selecting a big data information test set, and selecting a test big data information vector set according to different data attributes;
(S22) training a big data information test set to construct an n-layer tree form through hierarchical classification; data search of the big data information test set is realized through an optimal search algorithm;
(S23) sequentially calculating the text similarity between each piece of big data information in the big data information test set and the training sets of the layer-1 to layer-n big data information test sets;
the Euclidean distance is calculated by formula (2):
d(x, C¹_j) = √( Σ_{k=1}^{m} (x_k − C¹_{jk})² )  (2)
in formula (2), x = (x_1, x_2, ..., x_m) represents the feature vector of the test information in the big data information test set; C¹_j is the test set center vector of class j at layer 1, j denoting a class of big data information; m is the dimension of the feature vector of the big data information test set; x_k is the k-th dimension of the big data information test set vector; and C¹_{jk} is the k-th dimension of the layer-1 class-j big data information test set vector;
(S24) selecting, according to the text similarity, the K texts most similar to the test text from the training text set;
(S25) among the K neighbors of the test text x, calculating the weight of each class in turn by the weight formula
W(x, C_j) = Σ_{d_i ∈ KNN(x)} sim(x, d_i) · y(d_i, C_j)
where d_i is the data information, namely the feature vector of the test information of the i-th piece of the big data information test set; sim(x, d_i) is the Jaccard similarity coefficient, used as the similarity calculation formula; y(d_i, C_j) is the similarity membership value, which is 1 or 0: if d_i belongs to class C_j the function value is 1, otherwise 0;
(S26) sorting the calculated weights and comparing the sorted weights differentially: when ΔW_1 > ε, where ΔW_1 is the differential value between the largest and second-largest sorted weights and ε is the set threshold differential value of the big data information test set, the test text belongs to class 1, and when comparing similarity at the second layer only the subclasses of class 1 in the second layer need to be compared; if ΔW_1 ≤ ε, the judgment continues: when there exists t such that ΔW_t > ε, the test text belongs to one of classes 1 to t, and when comparing the second layer only the subclasses of those t classes in the second layer need to be compared; if ΔW_t ≤ ε, the judgment continues further; wherein ΔW represents the differential values of adjacent sorted weights and ΔW_t represents the differential value of the distance values of the t-class big data information test set.
As a further technical scheme of the invention, the convolution fault diagnosis method comprises the following steps:
the fault diagnosis architecture is constructed by combining the expanded causal convolution with a residual block, as shown in formula (3):
O = x + F(x)  (3)
in formula (3), O is the output variable of the output layer of the convolution fault diagnosis model, x represents the input variable of the output layer of the sub-fault diagnosis model, and F(x) represents the residual mapping of deep learning; a set exit (dropout) layer is added after the weight layer, and the expanded causal convolution function F(t) is defined by formula (4):
F(t) = (x ∗_d f)(t) = Σ_{i=0}^{s−1} f(i) · x(t − d·i)  (4)
in formula (4), f is the filter; s is the hierarchy of the neural network (the filter size); x represents the input time series information; d is the cavity parameter, i.e. the cavity interval size; and ∗_d represents the hole convolution operator;
the evaluation formula of the fault diagnosis architecture is formula (5); in formula (5), the mean value of the big data information fault evaluation indexes is taken over the prediction duration T, together with the evaluation duration period parameter of the predictive big data information fault architecture, the various hyper-parameters of the deep learning model, the evaluation index θ of the fault diagnosis architecture, and the parameters of the big data information fault diagnosis architecture evaluation indexes; these parameters are subjected to information overlapping by establishing an orthogonalized evaluation matrix, and the iterative process of mutual influence among different information is given by formula (6):
in formula (6), α represents the mutual overlapping function of the big data information fault evaluation indexes and β represents the mutually influencing iterative process between the big data information; according to the iterative formula between the big data information fault evaluation indexes, an algorithm program is established for the matrix of formula (6), namely formula (7):
in formula (7), the matrix represents the big data information fault evaluation orthogonalization safety matrix and μ represents the editing parameter of the orthogonalization matrix; the various big data information fault evaluation index data are then applied to the data information intelligent prediction platform through the Schmidt formula, and the best evaluation effect obtained by online testing is output as formula (8):
in formula (8), the checked evaluation index effect of each item of data information is given together with m, the number of big data information architecture nodes, and the variable value of the number of big data information architecture nodes; after the effect of the index is judged and evaluated, the weight is calculated by the weight formula (9):
in formula (9), the result represents the weight of the big data information fault evaluation index.
As a further technical solution of the present invention, the improved error evaluation function is given by formula (10), which spans n groups of data, where y_i denotes a big data information test sample and ŷ_i denotes the corresponding big data information failure prediction sample.
A data processing apparatus for improving a KNN method, comprising:
the data acquisition module is used for acquiring data information from the database information and performing dimensionality reduction processing on the acquired data information to acquire low-dimensional data information;
the data processing module is used for processing the dimensionality-reduced data information through the improved KNN algorithm model;
the data evaluation module is used for evaluating the processed data information through an improved error evaluation function;
the data sharing module is used for applying and sharing data information, and performing remote data information processing and data sharing on the acquired data information;
the data processing module is respectively connected with the data acquisition module, the data evaluation module and the data sharing module.
The invention has the following positive beneficial effects:
the invention obtains the data information from the database information, and performs the dimension reduction processing on the obtained data information to obtain the low-dimensional data information; carrying out data information processing on the data information subjected to dimensionality reduction by improving a KNN algorithm model, wherein the improved KNN algorithm model comprises a data preprocessing step, a data layering step, a data KNN algorithm calculating step and a convolution fault diagnosis step; evaluating the processed data information through an improved error evaluation function; and data information application and sharing are carried out, and remote data information processing and data sharing are carried out on the acquired data information.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without inventive exercise, wherein:
FIG. 1 is a schematic flow diagram of the present invention;
FIG. 2 is a diagram of a first embodiment of a dimension reduction processing model according to the present invention;
FIG. 3 is a diagram of a second embodiment of the dimension reduction processing model according to the present invention;
FIG. 4 is a schematic structural diagram of a differential layer model according to a first embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a differential layer model according to a second embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a differential layer model according to a third embodiment of the present invention;
FIG. 7 is a schematic diagram of a convolution fault diagnosis model according to the present invention;
FIG. 8 is a comparative illustration of the experimental results of the present invention.
Detailed Description
The preferred embodiments of the present invention will be described below with reference to the accompanying drawings, and it should be understood that the embodiments described herein are merely for the purpose of illustrating and explaining the present invention and are not intended to limit the present invention.
Example (1) method
As shown in fig. 1, a data processing method for improving the KNN method includes the following steps:
step one, acquiring data information from database information, and performing dimensionality reduction processing on the acquired data information to acquire low-dimensionality data information;
step two, carrying out data information processing on the data information after dimensionality reduction through an improved KNN algorithm model, wherein the improved KNN algorithm model comprises a data preprocessing step, a data layering step, a data KNN algorithm calculating step and a convolution fault diagnosis step;
step three, evaluating the processed data information through an improved error evaluation function;
and step four, applying and sharing the data information, and performing remote data information processing and data sharing on the acquired data information.
The full name of KNN is K Nearest Neighbors, a name that already shows the value of K is of great importance. The principle of KNN is that, when predicting a new value x, its class is determined according to the classes of the K nearest points.
In the above embodiment, the dimension reduction processing method includes the following steps:
(S11) dimension reduction is realized by reconstructing matrix data information, and the number of reconstructed matrix data, data dimension and time delay are set;
(S12) solving the distribution probability of different element libraries through an average mutual information method, and analyzing data characteristics through a correlation algorithm model;
(S13) the dimensionality of the data information is calculated through the false nearest neighbor method, and different data classifications are selected by comparing the dimensionalities of different data information; the sequences in two different dimensions are compared across different elements of the database information through a feature pair measurement method, according to formula (1):
R(n) = ‖X_{u+1}(n) − X_{u+1}(r)‖ / ‖X_u(n) − X_u(r)‖  (1)
in formula (1), R represents the data dimension criterion and n represents the vector index; X_u(n) represents the matrix data information before reconstruction and X_{u+1}(n) represents the reconstructed matrix data information; X_u(r) and X_{u+1}(r) represent the relationship of false adjacent points among the reconstruction matrix data, r indexing the data information added after reconstruction; u is the optimal dimensionality of the reconstructed matrix data information; after reconstruction, the difference between the element data dimensionality of the reconstruction matrix and the data dimensionality after dimensionality reduction is greater than 10;
and (S14) performing dimension reduction judgment, outputting data information when the dimension reduction data information meets the current requirement, and performing dimension reduction calculation again when the dimension reduction data information does not meet the current requirement.
In a specific embodiment, dimensionality reduction is an operation of converting high-dimensional data into low-dimensional data, which can improve the computing capability over the data information. In a particular embodiment, one matrix may be reshaped into a new matrix of a different size by means of the MATLAB function reshape, while retaining its original data. Given a matrix represented by a two-dimensional array and two positive integers representing the desired number of rows and columns of the reconstructed matrix, the reconstructed matrix is filled with all elements of the original matrix in the same row-traversal order. If the reshape operation with the given parameters is feasible and reasonable, the new reshaped matrix is output; otherwise, the original matrix is output.
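As an illustration of the reshape-based reconstruction just described, a minimal MATLAB sketch follows; the matrix contents and the target size are assumed example values:
% a minimal sketch of the reshape-based matrix reconstruction described above
A = [1 2 3 4; 5 6 7 8]; % original 2 x 4 matrix (assumed example data)
r = 4; c = 2; % requested rows and columns of the reconstruction
if numel(A) == r*c % reshape with the given parameters is feasible
    B = reshape(A.', c, r).'; % refill all elements in row-traversal order
else
    B = A; % otherwise output the original matrix
end
disp(B)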
In a particular embodiment, the average mutual information represents, as a whole, the amount of information that one random variable Y gives about another random variable X in data processing. Let H(X) represent the uncertainty about the input variable X before the output symbol is received, and H(X|Y) represent the average remaining uncertainty about the input variable X after the output symbol is received. The difference between the two represents the amount of information obtained by the receiving end, i.e., the average mutual information. Channel transmission thus removes some uncertainty and yields certain information, and the average mutual information represents the amount of information about the input X obtained on average per received output symbol.
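Written out, the relationship described above is the standard average mutual information identity, with I(X; Y) denoting the average mutual information:
I(X; Y) = H(X) − H(X|Y)
that is, the information obtained by the receiving end equals the prior uncertainty H(X) minus the remaining uncertainty H(X|Y) after the output symbols are received.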
In a specific embodiment, the support degree represents the probability of occurrence in the population; the larger the total number of transactions, the smaller the minimum support should be set, so as to guarantee that frequent item sets can exist. The fewer the frequent item sets, the more the minimum support should be adjusted. First, the items that do not meet the minimum support are deleted to construct the data set, and the data set is scanned once; then the screened data sets are sorted and a tree is constructed whose root node is NULL; finally the data set is inserted into the tree.
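A minimal MATLAB sketch of the minimum-support screening step follows; the transactions and the support threshold are assumed example values:
% a minimal sketch of minimum-support screening before tree construction
transactions = {{'a','b','c'}, {'a','c'}, {'b','d'}, {'a','c','d'}}; % assumed data
minSupport = 2; % assumed minimum support count
items = unique([transactions{:}]); % candidate items
counts = zeros(size(items));
for i = 1:numel(items) % count the occurrences of each item
    for t = 1:numel(transactions)
        counts(i) = counts(i) + any(strcmp(transactions{t}, items{i}));
    end
end
frequent = items(counts >= minSupport); % delete items below the minimum support
[~, order] = sort(counts(counts >= minSupport), 'descend');
frequent = frequent(order) % descending-support order for tree insertion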
In this embodiment, on the basis of the false-neighbor concept, a method for simultaneously determining a proper embedding dimension and time delay can be provided, so that the input of a radial basis function neural network can be determined; the radial basis function neural network is then used for learning and prediction. A chaotic time series is the projection of the trajectory of high-dimensional phase-space chaotic motion onto a one-dimensional space, and the trajectory of the chaotic motion is distorted in the projection process. When two points that are not adjacent in the high-dimensional phase space are projected onto the one-dimensional axis, they may appear to be adjacent; such points are false adjacent points, and they are the reason a chaotic time series appears irregular. Reconstructing the phase space means recovering the trajectory of the chaotic motion from the chaotic time series: as the embedding dimension m increases, the trajectory is gradually unfolded and the false adjacent points are gradually eliminated, so that the trajectory of the chaotic motion is recovered.
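A minimal MATLAB sketch of this false-neighbor test follows (the observed series, the delay tau and the ratio threshold are assumed example values); the fraction of false neighbors falls as the embedding dimension m grows:
% a minimal sketch of the false-neighbor test for choosing the embedding dimension
x = sin(0.2*(1:500)) + 0.01*randn(1,500); % assumed observed time series
tau = 1; Rtol = 10; % assumed delay and distance-ratio threshold
for m = 1:6
    N = numel(x) - m*tau;
    Xm = zeros(N, m); Xm1 = zeros(N, m+1); % m- and (m+1)-dimensional phase vectors
    for n = 1:N
        Xm(n,:) = x(n : tau : n+(m-1)*tau);
        Xm1(n,:) = x(n : tau : n+m*tau);
    end
    falseCnt = 0;
    for n = 1:N
        d = sqrt(sum((Xm - Xm(n,:)).^2, 2)); % distances in dimension m
        d(n) = inf;
        [dmin, r] = min(d); % nearest neighbor r of point n
        if norm(Xm1(n,:) - Xm1(r,:)) / dmin > Rtol
            falseCnt = falseCnt + 1; % neighbor separates when unfolded: false neighbor
        end
    end
    fprintf('m = %d, false-neighbor fraction = %.3f\n', m, falseCnt/N);
end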
As shown in FIGS. 2-6, in FIG. 2 the data attribute category a represents one kind of data attribute, whose subordinate classifications categorize the data information, wherein a11-a32 represent the various pieces of data information in the subordinate classified data information. The data attribute category b in FIG. 3 represents a data attribute other than a, whose subordinate classifications represent data attributes different from the data information a, and b11-b32 represent the pieces of data information in the subordinate classified data information whose attribute differs from that of the data information a. In other words, a and b are different types of data information.
In the above embodiment, the data hierarchy is a differential hierarchy, and the differential hierarchy method includes:
dividing the data attributes into different categories according to their number and type, and arranging the attribute data quantity in order from the top layer to the bottom layer, from least to most;
calculating the distance between different data attributes: assuming that a piece of data information in the data set is x and the data attribute categories are C1, C2, C3 and C4, then the distances from the data information x to the attribute categories C1, C2, C3 and C4 are d1, d2, d3 and d4 respectively;
carrying out differential calculation on the calculated data information with different data attributes: when dj = min{d1, d2, d3, d4}, where j is a constant index, the data information x is divided into the j-th class.
In a specific embodiment, by dividing different data attributes, a user can acquire data information with different attributes from a large amount of data information, and the data processing capability over the acquired data information is improved in a distributed computing manner. Through differential calculation, the acquired data information can be correctly classified, so that the division of different module information is realized and the data processing capability is improved.
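A minimal MATLAB sketch of this distance-based division follows; the attribute category centers and the sample are assumed example values:
% a minimal sketch of dividing data information x by its nearest attribute category
centers = [0 0; 5 5; 0 5; 5 0]; % assumed centers of attribute categories C1..C4
x = [4.2 4.7]; % assumed data information x
d = sqrt(sum((centers - x).^2, 2)); % distances d1..d4 from x to each category
[~, j] = min(d); % index of the smallest distance
fprintf('data information x is divided into class %d\n', j);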
In the above embodiment, the data KNN algorithm includes the steps of:
(S21) selecting a big data information test set, and selecting a test big data information vector set according to different data attributes;
(S22) training a big data information test set to construct an n-layer tree form through hierarchical classification; data search of the big data information test set is realized through an optimal search algorithm;
(S23) sequentially calculating the text similarity between each piece of big data information in the big data information test set and the training sets of the layer-1 to layer-n big data information test sets;
the Euclidean distance is calculated by formula (2):
d(x, C¹_j) = √( Σ_{k=1}^{m} (x_k − C¹_{jk})² )  (2)
in formula (2), x = (x_1, x_2, ..., x_m) represents the feature vector of the test information in the big data information test set; C¹_j is the test set center vector of class j at layer 1, j denoting a class of big data information; m is the dimension of the feature vector of the big data information test set; x_k is the k-th dimension of the big data information test set vector; and C¹_{jk} is the k-th dimension of the layer-1 class-j big data information test set vector;
(S24) selecting, according to the text similarity, the K texts most similar to the test text from the training text set;
(S25) among the K neighbors of the test text x, calculating the weight of each class in turn by the weight formula
W(x, C_j) = Σ_{d_i ∈ KNN(x)} sim(x, d_i) · y(d_i, C_j)
where d_i is the data information, namely the feature vector of the test information of the i-th piece of the big data information test set; sim(x, d_i) is the Jaccard similarity coefficient, used as the similarity calculation formula; y(d_i, C_j) is the similarity membership value, which is 1 or 0: if d_i belongs to class C_j the function value is 1, otherwise 0;
(S26) sorting the calculated weights and comparing the sorted weights differentially: when ΔW_1 > ε, where ΔW_1 is the differential value between the largest and second-largest sorted weights and ε is the set threshold differential value of the big data information test set, the test text belongs to class 1, and when comparing similarity at the second layer only the subclasses of class 1 in the second layer need to be compared; if ΔW_1 ≤ ε, the judgment continues: when there exists t such that ΔW_t > ε, the test text belongs to one of classes 1 to t, and when comparing the second layer only the subclasses of those t classes in the second layer need to be compared; if ΔW_t ≤ ε, the judgment continues further; wherein ΔW represents the differential values of adjacent sorted weights and ΔW_t represents the differential value of the distance values of the t-class big data information test set.
KNN (K-Nearest Neighbor) is one of the simplest machine learning algorithms; it can be used for classification and regression and is a supervised learning algorithm. Its core idea is that if most of the K most similar samples to a given sample in the feature space (i.e., its nearest neighbors in the feature space) belong to a certain class, then the sample also belongs to that class and takes on the characteristics of that class. KNN classifies by measuring the distance between different feature values, and the classification decision depends only on the class of the nearest sample or samples. K is typically an integer no greater than 20, and in the KNN algorithm the selected neighbors are all objects that have already been correctly classified. In a particular embodiment, the result of the KNN algorithm depends largely on the choice of K. When the KNN algorithm is used for regression, the attributes of a sample are obtained by finding its k nearest neighbors and assigning the average of these neighbors' attributes to the sample. A more useful approach is to give the neighbors at different distances different weights on the sample, for example weights inversely proportional to distance.
In a further embodiment, the distances between the test data and the respective training data are calculated; the distances are sorted in increasing order; the K points with the smallest distances are selected; the frequency of occurrence of the categories of the first K points is determined; and the category with the highest frequency among the first K points is returned as the predicted classification of the test data.
In a further embodiment, a smaller value of k is selected first, and an appropriate final value is then selected by cross-validation. When k is smaller, prediction is performed using samples in a smaller neighborhood, so the training error is reduced, but the model becomes so complex that it over-fits. When k is larger, prediction is performed using samples in a larger neighborhood, the training error increases, the model becomes simpler, and under-fitting is easily caused. Therefore, in a specific embodiment, a proper value of k needs to be selected to improve the data processing capability.
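The distance/sort/vote procedure described above can be sketched in a few MATLAB lines; the training samples, labels, test point and k are assumed example values:
% a minimal sketch of the KNN distance/sort/vote procedure
train = [1 1; 1 2; 6 6; 7 7; 6 7]; % assumed training feature vectors
labels = [1; 1; 2; 2; 2]; % assumed classes of the training vectors
x = [6.5 6.8]; k = 3; % assumed test sample and neighbor count
d = sqrt(sum((train - x).^2, 2)); % Euclidean distances to all training data
[~, idx] = sort(d, 'ascend'); % sort by increasing distance
nearest = labels(idx(1:k)); % classes of the k nearest points
pred = mode(nearest) % most frequent class is the prediction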
An exemplary MATLAB code segment in the data processing method of the improved KNN method (the cluster-center assignment used for data layering) is as follows:
load data.txt;
a = data(1:30, 1:4); % first thirty groups of the first class
aa = data(31:50, 1:4); % last twenty groups of the first class
b = data(51:80, 1:4); % first thirty groups of the second class
bb = data(81:100, 1:4); % last twenty groups of the second class
c = data(101:130, 1:4); % first thirty groups of the third class
cc = data(131:150, 1:4); % last twenty groups of the third class
train = cat(1, a, b, c); % compose the training samples (90 x 4)
test = cat(1, aa, bb, cc); % compose the test samples (60 x 4)
C = 3; % number of cluster centers, C = 3
z1 = train(1, :);
z2 = train(45, :);
z3 = train(90, :); % initial cluster centers z1, z2, z3
m = 0; t = 0; % convergence flag and number of iteration steps
while m == 0
    samp1 = []; samp2 = []; samp3 = []; % define empty samples: first class samp1, second class samp2, third class samp3
    n1 = 1; n2 = 1; n3 = 1;
    t = t + 1;
    for i = 1:90
        if (pdist([train(i,:); z1]) < pdist([train(i,:); z2])) && (pdist([train(i,:); z1]) < pdist([train(i,:); z3]))
            % if the training sample is closer to center z1 than to z2 and z3, assign it to samp1
            samp1(n1, :) = train(i, :);
            n1 = n1 + 1;
        elseif (pdist([train(i,:); z2]) < pdist([train(i,:); z1])) && (pdist([train(i,:); z2]) < pdist([train(i,:); z3]))
            % if the training sample is closer to center z2 than to z1 and z3, assign it to samp2
            samp2(n2, :) = train(i, :);
            n2 = n2 + 1;
        else % otherwise assign it to samp3
            samp3(n3, :) = train(i, :);
            n3 = n3 + 1;
        end
    end
    % update the cluster centers and stop when they no longer change
    z1new = mean(samp1, 1); z2new = mean(samp2, 1); z3new = mean(samp3, 1);
    if isequal(z1new, z1) && isequal(z2new, z2) && isequal(z3new, z3)
        m = 1; % converged
    end
    z1 = z1new; z2 = z2new; z3 = z3new;
end
As shown in FIG. 7, the input nodes carry the input data information, the hidden nodes carry the hidden node data information and the attributes of the hidden layer, the function nodes represent the function data information in the calculation process of the big data information test set, and the output-layer nodes carry the training data information of the data output layer;
in the above embodiment, the convolution fault diagnosis method includes the following steps:
the fault diagnosis architecture is constructed by combining the expanded causal convolution with a residual block, in which Dropout is a regularization technique that removes some random outputs of the convolution sub-fault diagnosis model architecture layer; the number of neurons to discard is given by a dropout rate between 0 and 1, which is the probability that the layer output is discarded; the receptive field of the convolution fault diagnosis model also depends on the number of layers of residual blocks, e.g. with kernel size k_s = 3, spreading factors d = 1, 2, 4 and number of residual block stacks n = 1, the receptive field size would be 3 × 4 × 1 = 12. The residual block is shown in formula (3):
O = x + F(x)  (3)
in formula (3), O is the output variable of the output layer of the convolution fault diagnosis model, x represents the input variable of the output layer of the sub-fault diagnosis model, and F(x) represents the residual mapping of deep learning; a set exit (dropout) layer is added after the weight layer, and the expanded causal convolution function F(t) is defined by formula (4):
F(t) = (x ∗_d f)(t) = Σ_{i=0}^{s−1} f(i) · x(t − d·i)  (4)
in formula (4), f is the filter; s is the hierarchy of the neural network (the filter size); x represents the input time series information; d is the cavity parameter, i.e. the cavity interval size; and ∗_d represents the hole convolution operator;
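A minimal numerical MATLAB sketch of the expanded causal convolution of formula (4) follows; the input series, the filter and the cavity parameter are assumed example values:
% a minimal sketch of the expanded (dilated) causal convolution of formula (4)
x = sin(0.1*(1:100)); % assumed input time series information
f = [0.25 0.5 0.25]; % assumed filter of size s = 3
d = 2; % assumed cavity (dilation) parameter
F = zeros(size(x));
for t = 1:numel(x)
    for i = 0:numel(f)-1
        if t - d*i >= 1 % causal: only current and past samples enter
            F(t) = F(t) + f(i+1) * x(t - d*i);
        end
    end
end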
the evaluation formula of the fault diagnosis architecture is formula (5); in formula (5), the mean value of the big data information fault evaluation indexes is taken over the prediction duration T, together with the evaluation duration period parameter of the predictive big data information fault architecture, the various hyper-parameters of the deep learning model, the evaluation index θ of the fault diagnosis architecture, and the parameters of the big data information fault diagnosis architecture evaluation indexes; these parameters are subjected to information overlapping by establishing an orthogonalized evaluation matrix, and the iterative process of mutual influence among different information is given by formula (6):
in formula (6), α represents the mutual overlapping function of the big data information fault evaluation indexes and β represents the mutually influencing iterative process between the big data information; according to the iterative formula between the big data information fault evaluation indexes, an algorithm program is established for the matrix of formula (6), namely formula (7):
in formula (7), the matrix represents the big data information fault evaluation orthogonalization safety matrix and μ represents the editing parameter of the orthogonalization matrix; the various big data information fault evaluation index data are then applied to the data information intelligent prediction platform through the Schmidt formula, and the best evaluation effect obtained by online testing is output as formula (8):
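By way of illustration, the Schmidt (Gram-Schmidt) orthogonalization invoked above can be sketched in MATLAB as follows; the evaluation-index matrix is an assumed example:
% a minimal sketch of Gram-Schmidt orthogonalization of an evaluation matrix
A = [4 1 2; 2 3 1; 1 1 5]; % assumed columns = evaluation index vectors
Q = zeros(size(A));
for j = 1:size(A, 2)
    v = A(:, j);
    for i = 1:j-1
        v = v - (Q(:,i)' * A(:,j)) * Q(:,i); % remove overlap with earlier columns
    end
    Q(:, j) = v / norm(v); % orthonormalized evaluation vector
end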
in formula (8), the checked evaluation index effect of each item of data information is given together with m, the number of big data information architecture nodes, and the variable value of the number of big data information architecture nodes; after the effect of the index is judged and evaluated, the weight is calculated by the weight formula (9):
in formula (9), the result represents the weight of the big data information fault evaluation index.
The hyper-parameters of the convolution fault diagnosis model are iterated by establishing the algorithm model, the fault evaluation index of the big data information is calculated from the iteration data, and optimization is performed through the orthogonalization matrix, so that the optimal parameter evaluation result is obtained and the algorithm performance of the convolution fault diagnosis model system is improved.
The invention applies a novel Time Convolution Network (TCN, the convolution fault diagnosis model) deep learning model to the intelligent prediction of scheduling big data information faults.
Formula (10) spans n groups of data, where y_i denotes a big data information test sample and ŷ_i denotes the corresponding big data information failure prediction sample.
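By way of illustration only, an error evaluation function of this kind can take a mean-squared form over the n sample pairs (this concrete form is an assumption, not reproduced from formula (10)):
E = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)²
where y_i is the i-th big data information test sample and ŷ_i the corresponding failure prediction sample.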
In order to verify the technical effect of the invention, scheme 1 is taken to be a decision tree classification method and scheme 2 a k-means classification method, and these two methods are respectively adopted to verify and compare against the scheme of the invention.
The corresponding experimental results obtained by continuous training are shown in table 1, and the comparative graph obtained by simulation software is shown in fig. 8.
TABLE 1 error accuracy comparison schematic table of different methods
As can be seen from the above, in the data analysis accuracy test the result of the method of the present invention is significantly higher than the accuracy of scheme 1 and scheme 2: the data analysis accuracy of the method is above 80% and reaches 96% at most, with small fluctuation and relatively stable behavior. Scheme 1 and scheme 2 fluctuate over a large range in the data analysis accuracy test and their accuracy is extremely unstable, so compared with the method disclosed by the invention they show great defects; the method therefore achieves high data analysis accuracy.
Example (2) apparatus
A data processing apparatus for improving a KNN method, comprising:
the data acquisition module is used for acquiring data information from the database information and performing dimensionality reduction processing on the acquired data information to acquire low-dimensional data information;
the data processing module is used for processing the dimensionality-reduced data information through the improved KNN algorithm model;
the data evaluation module is used for evaluating the processed data information through an improved error evaluation function;
the data sharing module is used for applying and sharing data information, and performing remote data information processing and data sharing on the acquired data information;
the data processing module is respectively connected with the data acquisition module, the data evaluation module and the data sharing module.
Although specific embodiments of the present invention have been described above, it will be understood by those skilled in the art that these specific embodiments are merely illustrative and that various omissions, substitutions and changes in the form of the detail of the methods and systems described above may be made by those skilled in the art without departing from the spirit and scope of the invention. For example, it is within the scope of the present invention to combine the steps of the above-described methods to perform substantially the same function in substantially the same way to achieve substantially the same result. Accordingly, the scope of the invention is to be limited only by the following claims.
Claims (7)
1. A data processing method for improving a KNN method is characterized in that: the method comprises the following steps:
step one, acquiring data information from database information, and performing dimensionality reduction processing on the acquired data information to acquire low-dimensionality data information;
step two, carrying out data information processing on the data information after dimensionality reduction through an improved KNN algorithm model, wherein the improved KNN algorithm model comprises a data preprocessing step, a data layering step, a data KNN algorithm calculating step and a convolution fault diagnosis step;
step three, evaluating the processed data information through an improved error evaluation function;
and step four, applying and sharing the data information, and performing remote data information processing and data sharing on the acquired data information.
2. The data processing method for improving a KNN method as claimed in claim 1, wherein: the dimension reduction processing method comprises the following steps:
(S11) dimension reduction is realized by reconstructing matrix data information, and the number of reconstructed matrix data, data dimension and time delay are set;
(S12) solving the distribution probability of different element libraries through an average mutual information method, and analyzing data characteristics through a correlation algorithm model;
(S13) the dimensionality of the data information is calculated through the false nearest neighbor method, and different data classifications are selected by comparing the dimensionalities of different data information; the sequences in two different dimensions are compared across different elements of the database information through a feature pair measurement method, according to formula (1):
R(n) = ‖X_{u+1}(n) − X_{u+1}(r)‖ / ‖X_u(n) − X_u(r)‖  (1)
in formula (1), R represents the data dimension criterion and n represents the vector index; X_u(n) represents the matrix data information before reconstruction and X_{u+1}(n) represents the reconstructed matrix data information; X_u(r) and X_{u+1}(r) represent the relationship of false adjacent points among the reconstruction matrix data, r indexing the data information added after reconstruction; u is the optimal dimensionality of the reconstructed matrix data information; after reconstruction, the difference between the element data dimensionality of the reconstruction matrix and the data dimensionality after dimensionality reduction is greater than 10;
and (S14) performing dimension reduction judgment, outputting data information when the dimension reduction data information meets the current requirement, and performing dimension reduction calculation again when the dimension reduction data information does not meet the current requirement.
3. The data processing method for improving a KNN method as claimed in claim 1, wherein: the data layering is differential layering, and the differential layering method comprises the following steps:
dividing the data attributes into different categories according to their number and type, and arranging the attribute data quantity in order from the top layer to the bottom layer, from least to most;
calculating the distance between different data attributes: assuming that a piece of data information in the data set is x and the data attribute categories are C1, C2, C3 and C4, then the distances from the data information x to the attribute categories C1, C2, C3 and C4 are d1, d2, d3 and d4 respectively;
carrying out differential calculation on the calculated data information with different data attributes: when dj = min{d1, d2, d3, d4}, where j is a constant index, the data information x is divided into the j-th class.
4. The data processing method for improving a KNN method as claimed in claim 1, wherein: the data KNN algorithm comprises the following steps:
(S21) selecting a big data information test set, and selecting a test big data information vector set according to different data attributes;
(S22) training a big data information test set to construct an n-layer tree form through hierarchical classification; data search of the big data information test set is realized through an optimal search algorithm;
(S23) sequentially calculating the text similarity between each piece of big data information in the big data information test set and the training sets of the layer-1 to layer-n big data information test sets;
the Euclidean distance is calculated by formula (2):
d(x, C¹_j) = √( Σ_{k=1}^{m} (x_k − C¹_{jk})² )  (2)
in formula (2), x = (x_1, x_2, ..., x_m) represents the feature vector of the test information in the big data information test set; C¹_j is the test set center vector of class j at layer 1, j denoting a class of big data information; m is the dimension of the feature vector of the big data information test set; x_k is the k-th dimension of the big data information test set vector; and C¹_{jk} is the k-th dimension of the layer-1 class-j big data information test set vector;
(S24) selecting, according to the text similarity, the K texts most similar to the test text from the training text set;
(S25) among the K neighbors of the test text x, calculating the weight of each class in turn by the weight formula
W(x, C_j) = Σ_{d_i ∈ KNN(x)} sim(x, d_i) · y(d_i, C_j)
where d_i is the data information, namely the feature vector of the test information of the i-th piece of the big data information test set; sim(x, d_i) is the Jaccard similarity coefficient, used as the similarity calculation formula; y(d_i, C_j) is the similarity membership value, which is 1 or 0: if d_i belongs to class C_j the function value is 1, otherwise 0;
(S26) sorting the calculated weights and comparing the sorted weights differentially: when ΔW_1 > ε, where ΔW_1 is the differential value between the largest and second-largest sorted weights and ε is the set threshold differential value of the big data information test set, the test text belongs to class 1, and when comparing similarity at the second layer only the subclasses of class 1 in the second layer need to be compared; if ΔW_1 ≤ ε, the judgment continues: when there exists t such that ΔW_t > ε, the test text belongs to one of classes 1 to t, and when comparing the second layer only the subclasses of those t classes in the second layer need to be compared; if ΔW_t ≤ ε, the judgment continues further; wherein ΔW represents the differential values of adjacent sorted weights and ΔW_t represents the differential value of the distance values of the t-class big data information test set.
5. The data processing method for improving a KNN method as claimed in claim 1, wherein: the convolution fault diagnosis method comprises the following steps:
the fault diagnosis architecture is constructed by combining the expanded causal convolution with a residual block, as shown in formula (3):
O = x + F(x)  (3)
in formula (3), O is the output variable of the output layer of the convolution fault diagnosis model, x represents the input variable of the output layer of the sub-fault diagnosis model, and F(x) represents the residual mapping of deep learning; a set exit (dropout) layer is added after the weight layer, and the expanded causal convolution function F(t) is defined by formula (4):
F(t) = (x ∗_d f)(t) = Σ_{i=0}^{s−1} f(i) · x(t − d·i)  (4)
in formula (4), f is the filter; s is the hierarchy of the neural network (the filter size); x represents the input time series information; d is the cavity parameter, i.e. the cavity interval size; and ∗_d represents the hole convolution operator;
the evaluation formula of the fault diagnosis architecture is formula (5); in formula (5), the mean value of the big data information fault evaluation indexes is taken over the prediction duration T, together with the evaluation duration period parameter of the predictive big data information fault architecture, the various hyper-parameters of the deep learning model, the evaluation index θ of the fault diagnosis architecture, and the parameters of the big data information fault diagnosis architecture evaluation indexes; these parameters are subjected to information overlapping by establishing an orthogonalized evaluation matrix, and the iterative process of mutual influence among different information is given by formula (6):
in formula (6), α represents the mutual overlapping function of the big data information fault evaluation indexes and β represents the mutually influencing iterative process between the big data information; according to the iterative formula between the big data information fault evaluation indexes, an algorithm program is established for the matrix of formula (6), namely formula (7):
in formula (7), the matrix represents the big data information fault evaluation orthogonalization safety matrix and μ represents the editing parameter of the orthogonalization matrix; the various big data information fault evaluation index data are then applied to the data information intelligent prediction platform through the Schmidt formula, and the best evaluation effect obtained by online testing is output as formula (8):
in formula (8), the checked evaluation index effect of each item of data information is given together with m, the number of big data information architecture nodes, and the variable value of the number of big data information architecture nodes; after the effect of the index is judged and evaluated, the weight of the big data information fault evaluation index is calculated by the weight formula (9).
6. The data processing method for improving a KNN method as claimed in claim 1, characterized in that: the improved error evaluation function is given by formula (10).
7. A data processing apparatus for improving a KNN method, comprising:
the data acquisition module is used for acquiring data information from the database information and performing dimensionality reduction processing on the acquired data information to acquire low-dimensional data information;
the data processing module is used for processing the dimensionality-reduced data information through the improved KNN algorithm model;
the data evaluation module is used for evaluating the processed data information through an improved error evaluation function;
the data sharing module is used for applying and sharing data information, and performing remote data information processing and data sharing on the acquired data information;
the data processing module is respectively connected with the data acquisition module, the data evaluation module and the data sharing module.