CN113420804A - Data processing method, device, network equipment and storage medium - Google Patents


Info

Publication number
CN113420804A
CN113420804A (application CN202110678862.XA)
Authority
CN
China
Prior art keywords
shortest
data
node
tree
bifurcation
Prior art date
Legal status
Pending
Application number
CN202110678862.XA
Other languages
Chinese (zh)
Inventor
郑忠斌
王朝栋
彭新
Current Assignee
Industrial Internet Innovation Center Shanghai Co ltd
Original Assignee
Industrial Internet Innovation Center Shanghai Co ltd
Priority date
Filing date
Publication date
Application filed by Industrial Internet Innovation Center Shanghai Co ltd filed Critical Industrial Internet Innovation Center Shanghai Co ltd
Priority to CN202110678862.XA priority Critical patent/CN113420804A/en
Publication of CN113420804A publication Critical patent/CN113420804A/en
Priority to PCT/CN2022/099638 priority patent/WO2022262869A1/en


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2323Non-hierarchical techniques based on graph theory, e.g. minimum spanning trees [MST] or graph cuts
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Discrete Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention relates to the technical field of communications and discloses a data processing method comprising the following steps: acquiring a target data set, performing rough clustering on the target data set with a shortest bifurcation tree rough clustering algorithm, and forming a plurality of shortest bifurcation trees according to the rough clustering result; pruning and merging the shortest bifurcation trees with a threshold pruning algorithm based on a rough clustering neighborhood information system to obtain simplified shortest bifurcation trees; and calculating the abnormal degree of the data objects in the simplified shortest bifurcation trees with an abnormal value detection algorithm for balanced fusion of the data's local multi-feature factors, then determining and removing the abnormal data values in the target data set according to the abnormal degree. The embodiment of the invention also discloses a data processing apparatus, a network device, and a storage medium. The disclosed data processing method, apparatus, network device, and storage medium can eliminate abnormal data values in the original data and improve the efficiency of data analysis and the accuracy of decision-making.

Description

Data processing method, device, network equipment and storage medium
Technical Field
The present invention relates to the field of communications technologies, and in particular, to a data processing method, an apparatus, a network device, and a storage medium.
Background
When an enterprise makes a decision, analyzing the data first allows the decision to be made more scientifically and accurately. On one hand, however, with the development of information technology enterprises generate more and more data, so an enterprise that analyzes its data often faces a very large amount of it when making decisions; on the other hand, most enterprises still rely on experience or traditional data analysis means, and when a large amount of data is analyzed by such means to uncover latent rules or changes in the data, the analysis efficiency is low and subjective differences make the analysis result insufficiently accurate, which in turn affects decision accuracy. In particular, if abnormal data values exist in the original data and are not removed during data analysis, the analysis may acquire an irreversible deviation, seriously affecting the accuracy of the analysis result and causing large decision errors.
Disclosure of Invention
The embodiment of the invention aims to provide a data processing method, a data processing device, network equipment and a storage medium, which can eliminate abnormal data values in original data and improve the efficiency of data analysis and the accuracy of decision.
In order to solve the above technical problem, an embodiment of the present invention provides a data processing method, including: acquiring a target data set, carrying out rough clustering on the target data set by adopting a shortest bifurcation tree rough clustering algorithm, and forming a plurality of shortest bifurcation trees according to a rough clustering result; pruning and combining the shortest bifurcation trees by adopting a threshold pruning algorithm based on a rough clustering neighborhood information system to obtain the simplified shortest bifurcation trees; and calculating the abnormal degree of the data object in the simplified shortest bifurcation tree by adopting an abnormal value detection algorithm of the balanced fusion data local multi-feature factors, and determining and removing the abnormal data value in the target data set according to the abnormal degree.
An embodiment of the present invention further provides a data processing apparatus, including: the clustering module is used for acquiring a target data set, carrying out rough clustering on the target data set by adopting a shortest bifurcation tree rough clustering algorithm, and forming a plurality of shortest bifurcation trees according to a rough clustering result; the processing module is used for pruning and combining the shortest bifurcation trees by adopting a threshold pruning algorithm based on a rough clustering neighborhood information system to obtain the simplified shortest bifurcation trees; and the determining module is used for calculating the abnormal degree of the data object in the simplified shortest bifurcation tree by adopting an abnormal value detection algorithm of the balanced fusion data local multi-feature factor, and determining and removing the abnormal data value in the target data set according to the abnormal degree.
An embodiment of the present invention further provides a network device, including: at least one processor; and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to execute the data processing method.
Embodiments of the present invention also provide a computer-readable storage medium storing a computer program, which when executed by a processor implements the data processing method described above.
Compared with the related technology, the embodiment of the invention adopts the shortest bifurcation tree rough clustering algorithm to carry out rough clustering on a target data set to form a plurality of shortest bifurcation trees, then adopts the threshold pruning algorithm of a rough clustering neighborhood information system to carry out pruning and merging on the shortest bifurcation trees, then calculates the abnormal degree of the data object in the shortest bifurcation trees by using the abnormal value detection algorithm of the balanced fusion data local multi-feature factors, and determines and eliminates the abnormal data value according to the abnormal degree of the data object. Because the data of the target data set is automatically analyzed by adopting the algorithm, the data analysis efficiency can be improved; meanwhile, due to the abnormal value detection algorithm of the balanced fusion data local multi-feature factor, local relative proximity is introduced into the standard local abnormal factor to replace local reachable density of the data object, the calculation ratio of neighborhood dispersion degree and distance is adjusted to a calculation mode suitable for rough clustering, and variation coefficient representation intra-class dispersion degree is introduced, so that the abnormal degree of the data object can be accurately and quantitatively analyzed, abnormal data values in original data (namely a target data set) are determined and removed according to the abnormal degree, and the accuracy of analysis results and decision is improved.
In addition, after the abnormal degree of the data objects in the simplified shortest bifurcation tree is calculated with the abnormal value detection algorithm for balanced fusion of the data's local multi-feature factors, and the abnormal data values in the target data set are determined and removed according to the abnormal degree, the method further comprises the following step: performing dimensionality reduction on the target data set with an improved sparse autoencoder, wherein the improved sparse autoencoder adopts a sparse rule operator instead of the KL relative entropy as the sparsity constraint term and adopts the L2 norm as the regular term. Replacing the KL relative entropy with a sparse rule operator as the sparsity constraint term improves the sparsity performance of the algorithm; adopting the L2 norm as the regular term balances the polynomial component weights and improves the sparse autoencoder's ability to prevent overfitting when processing data; meanwhile, performing dimension reduction with the improved sparse autoencoder on the data after abnormal value detection reduces data redundancy and improves the simplicity and reliability of the data.
In addition, the improved sparse autoencoder is adopted to reduce the dimension of the target data set, and the method comprises the following steps: the following objective loss function is constructed from the improved sparse autoencoder:
J_spare(W, b) = J(W, b) + λ1 · Σ_{j=1}^{s2} |a_j| + λ2 · ||W||₂², where a_j is the activation of hidden-layer neuron j (the L1 sparse rule operator term replaces the KL relative entropy, and the L2 norm of the weights is the regular term)
wherein λ1 is the sparse penalty term weight, λ2 is the weight attenuation coefficient, s2 represents the number of hidden-layer neurons, W represents the weight coefficients of the neural network, b represents the bias terms of the neural network, j represents the index of a neuron, J(W, b) represents the initial loss function term of the sparse autoencoder, and J_spare(W, b) represents the objective loss function of the improved sparse autoencoder; the dimension of the target data set is then reduced according to the target loss function.
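The shape of this objective (initial reconstruction loss, an L1 sparse-rule penalty over the hidden activations in place of KL divergence, and an L2 weight-decay regular term) can be sketched in NumPy as follows; function and variable names are illustrative, not from the patent, and the exact reconstruction term is an assumption:

```python
import numpy as np

def sparse_ae_loss(X, X_hat, A_hidden, W, lam1=1e-3, lam2=1e-4):
    """Sketch of J_spare(W, b) = J(W, b) + L1 sparsity + L2 weight decay.

    X, X_hat : (n, d) inputs and their reconstructions
    A_hidden : (n, s2) hidden-layer activations (s2 hidden neurons)
    W        : flattened weight coefficients of the network
    """
    j_wb = 0.5 * np.mean(np.sum((X_hat - X) ** 2, axis=1))  # initial loss J(W, b)
    sparsity = lam1 * np.sum(np.abs(A_hidden))              # L1 sparse rule operator
    weight_decay = lam2 * np.sum(W ** 2)                    # L2 norm regular term
    return j_wb + sparsity + weight_decay
```

With a perfect reconstruction the first term vanishes and only the sparsity and weight-decay penalties remain, which is what drives the encoder toward sparse, small-weight representations.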
In addition, calculating the abnormal degree of the data objects in the simplified shortest bifurcation tree with the abnormal value detection algorithm for balanced fusion of the data's local multi-feature factors, and determining and removing the abnormal data values in the target data set according to the abnormal degree, comprises the following steps: according to T_i-stand = T_i + |min(T_i)|, standardizing the data in the simplified shortest bifurcation tree; according to
[equation image in source: N_dis(x), the node-distance measure]
Calculating the distance between the nodes in the same shortest bifurcation tree, wherein N_dis(x) is the calculated distance among the nodes of the shortest bifurcation tree, x is a specified data object, x_i are the other data objects in the shortest bifurcation tree class, k represents the number of data objects in the shortest bifurcation tree class, and exp(1) denotes the constant e (e raised to the power 1); calculating the coefficient of variation of the data in the shortest bifurcation tree according to the following formulas respectively:
N_mean(T) = (1/k) Σ_{q=1}^{k} x_q
N_std(T) = sqrt( (1/k) Σ_{q=1}^{k} (x_q − N_mean(T))² )
N_cv(T) = N_std(T) / N_mean(T)
wherein T represents the sum of the distances of all nodes in any shortest bifurcation tree cluster class, i represents the index label of T, x_q represents the distance of each node in the shortest bifurcation tree, k represents the number of nodes contained in the cluster class, j is the number of the shortest bifurcation tree, N_std(T) is the standard deviation of the class, N_mean(T) is the mean value of the class, and N_cv(T) is the coefficient of variation; according to
[equation image in source: local relative proximity (LRP) of a data object]
calculating the local relative proximity of the data objects in the class; calculating, according to the local relative proximity,
[equation image in source: the MDILAF score]
taking the MDILAF as the abnormal degree of the data object, and determining and eliminating the abnormal data values in the target data set according to the abnormal degree, wherein N_x is the shortest bifurcation tree class of data object x and |N(x)| is the sum of the distances of all the remaining data objects in the class. Because the abnormal value detection algorithm for balanced fusion of the data's local multi-feature factors introduces Local Relative Proximity (LRP) into the standard Local Outlier Factor (LOF) in place of the data object's Local Reachable Density (LRD), adjusts the calculation ratio of neighborhood dispersion degree to distance into a calculation mode suited to rough clustering, and introduces the coefficient of variation to represent the intra-class dispersion degree, the abnormal degree of a data object can be accurately and quantitatively analyzed and the identified abnormal data values removed.
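The standardization step and the class statistics named above (mean, standard deviation, and coefficient of variation of the node distances) can be sketched as follows. The patent's exact N_dis, LRP, and MDILAF formulas appear only as images in the source, so this covers just the standard statistics, with illustrative names:

```python
import numpy as np

def shift_standardize(t):
    """T_i-stand = T_i + |min(T_i)|: shift all distances to be non-negative."""
    t = np.asarray(t, dtype=float)
    return t + abs(t.min())

def class_dispersion(node_distances):
    """Mean N_mean(T), standard deviation N_std(T), and coefficient of
    variation N_cv(T) of the node distances x_q within one SFT cluster."""
    t = np.asarray(node_distances, dtype=float)
    n_mean = t.mean()
    n_std = t.std()  # population standard deviation over the k nodes
    return n_mean, n_std, n_std / n_mean
```

The coefficient of variation is scale-free, which is why the text uses it to compare dispersion across clusters whose distance sums differ in magnitude.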
In addition, performing rough clustering on the target data set with the shortest bifurcation tree rough clustering algorithm and forming a plurality of shortest bifurcation trees according to the rough clustering result comprises the following steps: determining a source node in the target data set; searching for the nearest node of the source node and taking that nearest node as the primary node; then, starting with the primary node as the current parent node, cyclically searching for the descendant node set with the current parent node as the starting point and the adaptive node spacing as the neighborhood search radius; if a new node exists within the neighborhood search radius, taking the new node as the current parent node and continuing to search for the descendant node set with the adaptive node spacing as the neighborhood search radius, until no new node exists within the neighborhood search radius; then ending the search, storing all nodes and node distances from the source node to the last-generation node, and forming a shortest bifurcation tree from all nodes from the source node to the last-generation node, wherein the node distances are the sets of distances between same-level nodes and their descendant nodes, and the adaptive node spacing is: Dist = arg min(Euclidean_dist(last-gener_i, next-gener_j)), where last-gener_i is a parent node in two adjacent generations and next-gener_j is a descendant node in two adjacent generations.
In addition, pruning and merging the shortest bifurcation trees with the threshold pruning algorithm based on the rough clustering neighborhood information system to obtain the simplified shortest bifurcation tree comprises the following steps: merging the branches containing shared nodes into one shortest bifurcation tree structure according to the attributes of each data object in the shortest bifurcation tree, and cutting off the completely intersected branches in the shortest bifurcation tree to obtain the simplified shortest bifurcation tree. By pruning the completely intersected branches in the rough clustering's shortest bifurcation tree and merging the branches containing shared nodes, the data structure of the shortest bifurcation tree can be further simplified, facilitating further processing of subsequent data.
In addition, after combining the branches containing shared nodes into one shortest bifurcation tree structure and cutting off the completely intersected branches in the shortest bifurcation tree, the method further comprises the following steps: calculating the median and average of the sums of the distances of each data object in the shortest bifurcation tree according to the Dist attribute of each data object, and cutting off the branches whose score is less than or equal to a deviation threshold according to a deviation threshold formula, wherein the deviation threshold formula is: DEV = average + (average − median), where DEV is the deviation threshold, average is the mean, and median is the median. Cutting out the weak-weight branch clusters whose scores are below the deviation threshold with the deviation threshold formula further simplifies the data structure of the shortest bifurcation tree and facilitates further processing of subsequent data.
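Reading the translated threshold formula as DEV = average + (average − median) over the per-branch Dist sums, the pruning step might look like this sketch; the names, the dict-based data layout, and that reading of the garbled formula are all assumptions:

```python
import numpy as np

def deviation_threshold(dist_sums):
    """DEV = average + (average - median) of the per-branch distance sums.

    Interpretation of the translated source formula (an assumption): when the
    mean exceeds the median, the threshold rises above the mean, so only
    clearly heavy branches survive.
    """
    mean = float(np.mean(dist_sums))
    median = float(np.median(dist_sums))
    return mean + (mean - median)

def prune_weak_branches(branch_scores):
    """Drop branches whose score is <= DEV (the weak-weight branch clusters)."""
    dev = deviation_threshold(list(branch_scores.values()))
    return {name: s for name, s in branch_scores.items() if s > dev}
```

For scores {1, 2, 10} the threshold is 13/3 + (13/3 − 2) ≈ 6.67, so only the branch scoring 10 is retained.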
Drawings
One or more embodiments are illustrated by the corresponding figures in the drawings, which are not meant to be limiting.
FIG. 1 is a schematic flow chart of a data processing method according to a first embodiment of the present invention;
FIG. 2 is a schematic diagram of the algorithm process of the shortest bifurcation tree rough clustering algorithm in the data processing method according to the first embodiment of the present invention;
fig. 3 is an exemplary diagram of a search result of a primary node in the data processing method according to the first embodiment of the present invention;
FIG. 4 is a schematic diagram of the process of processing a shortest bifurcation tree with the threshold pruning algorithm of the rough clustering neighborhood information system in the data processing method according to the first embodiment of the present invention;
FIG. 5 is a schematic flowchart of an abnormal value detection algorithm using a local multi-feature factor of balanced fusion data in the data processing method according to the first embodiment of the present invention;
FIG. 6 is a schematic diagram of a network mechanism of an improved sparse autoencoder of the data processing method provided by the first embodiment of the present invention;
FIG. 7 is a flowchart illustrating a data processing method according to a first embodiment of the present invention;
fig. 8 is a schematic block diagram of a data processing apparatus according to a second embodiment of the present invention;
fig. 9 is a schematic structural diagram of a network device according to a third embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions, and advantages of the present invention clearer, embodiments of the present invention will be described in detail below with reference to the accompanying drawings. Those of ordinary skill in the art will appreciate that numerous technical details are set forth in the embodiments to provide a better understanding of the present application; the technical solution claimed in the present application can, however, be implemented without these technical details and with various changes and modifications based on the following embodiments.
The first embodiment of the invention relates to a data processing method, wherein a plurality of shortest bifurcation trees are formed by carrying out rough clustering on a target data set by adopting a shortest bifurcation tree rough clustering algorithm, then the shortest bifurcation trees are pruned and merged by adopting a threshold pruning algorithm of a rough clustering neighborhood information system, the abnormal degree of a data object in the shortest bifurcation trees is calculated by utilizing an abnormal value detection algorithm of balanced fusion data local multi-feature factors, and an abnormal data value is determined and removed according to the abnormal degree of the data object. Because the data of the target data set is automatically analyzed by adopting the algorithm, the data analysis efficiency can be improved; meanwhile, due to the abnormal value detection algorithm of the balanced fusion data local multi-feature factor, local relative proximity is introduced into the standard local abnormal factor to replace local reachable density of the data object, the calculation ratio of neighborhood dispersion degree and distance is adjusted to a calculation mode suitable for rough clustering, and variation coefficient representation intra-class dispersion degree is introduced, so that the abnormal degree of the data object can be accurately and quantitatively analyzed, abnormal data values in original data (namely a target data set) are determined and removed according to the abnormal degree, and the accuracy of analysis results and decision is improved.
It should be noted that the execution body of the data processing method provided by the embodiment of the present invention may be a server, which may be implemented as a single server or as a server cluster composed of multiple servers; the following description takes a server as an example.
A specific flow of the data processing method provided by the embodiment of the present invention is shown in fig. 1, and includes the following steps:
s101: and acquiring a target data set, carrying out rough clustering on the target data set by adopting a shortest bifurcation tree rough clustering algorithm, and forming a plurality of shortest bifurcation trees according to a rough clustering result.
The target data set may be real-time data or offline data, for example, offline data of an enterprise, and when the target data set is real-time data, the target data set refers to data at a certain time.
Specifically, S101 may include: determining a source node in the target data set, searching for the nearest node of the source node, and taking that nearest node as the primary node; then, starting with the primary node as the current parent node, cyclically searching for the descendant node set with the current parent node as the starting point and the adaptive node spacing as the neighborhood search radius; if a new node exists within the neighborhood search radius, taking the new node as the current parent node and continuing to search for the descendant node set with the adaptive node spacing as the neighborhood search radius, until no new node exists within the neighborhood search radius; then ending the search, storing all nodes and node distances from the source node to the last-generation node, and forming a shortest bifurcation tree from all nodes from the source node to the last-generation node, wherein the node distances are the sets of distances between same-level nodes and their descendant nodes, and the adaptive node spacing is: Dist = arg min(Euclidean_dist(last-gener_i, next-gener_j)), where last-gener_i is a parent node in two adjacent generations, next-gener_j is a descendant node in two adjacent generations, and Euclidean_dist represents the Euclidean distance.
Please refer to fig. 2, which is a schematic diagram of an algorithm process of the shortest bifurcation tree rough clustering algorithm in the data processing method according to the embodiment of the present invention, and a specific process is described as an example below:
1. The server side collects the offline data of an enterprise as the target data set. All data objects in the target data set are assumed to be abnormal values and regarded as source nodes, so the number of data objects in the offline data equals the assumed number of source nodes; meanwhile, the relevant attributes (loca, value) of each source node are stored, where loca is the position of the data object and value is its data value.
2. When the global search is carried out, the global search strategy mainly comprises calculating the adaptive node spacing and determining adjacent nodes. Taking the search, starting from a source node, for the data sets of the next two generations of nodes as an example:
2-1. With an arbitrary source node x_i as the starting point, all data objects are traversed to determine the node with the closest distance as the primary node of the source node, namely x_i → x_i1, and it is necessary to ensure that the source node has only one primary node.
2-2. The distance |x_i, x_i1| between the source node and the primary node is calculated with a distance formula (e.g. the Euclidean distance), so the primary node contains three attributes (loca, value, |x_i, x_i1|). The Dist of two adjacent generations belonging to the same level is then calculated, and with the current level's next-gener_j node as the center and Dist as the neighborhood search radius, the next level is searched. Continuing the search with this idea, every node formed other than the source node contains three attributes defined as: x_j(loca, value, Dist), where the Dist attribute value represents the set of distances between the current node and the descendant nodes of the same hierarchy.
2-3. Taking the primary node as the starting point of the next level and the distance |x_i, x_i1| (the adaptive node spacing) as the neighborhood search radius, the set of next-generation child nodes of the primary node is searched; the search result is shown in fig. 3. The descendant node set of the primary node is not limited in the number of nodes, and all data objects within the neighborhood search radius belong to its child nodes, subject to the uniqueness principle. The uniqueness principle means that between two adjacent generations in the same layer, next-gener_j can only be generated by searching from last-gener_i; the mapping relationship can be one-to-one or one-to-many, but the data of two generations cannot intersect, that is: last-gener_i → next-gener_j and
last-gener_i ∩ next-gener_j = ∅.
3. Searching proceeds layer by layer according to the search strategy of 2-3, finally forming the shortest bifurcation tree (shortest forking tree, SFT), which is identified as one group of the rough clustering and comprises the complete node data set of the source node and its descendants together with the corresponding node-distance sets. The rough cluster formed, the shortest bifurcation tree, includes two types of data: first, the source node and the descendant node set searched from it; second, the set of distances between corresponding nodes of all adjacent generations forming the shortest bifurcation tree.
Since the outlier in the target dataset has the characteristics of low density and large distance of surrounding data objects in the neighborhood, the dispersion between the local outlier and the adjacent point is large. Assuming that the independent abnormal value is used as a source node, and the distance (namely, the self-adaptive node spacing) between different levels is used as a neighborhood searching radius, adjacent points of the independent abnormal value are gradually searched to form a complete tree structure and are identified as a rough category, so that the aim of dividing data into different clusters can be fulfilled.
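A minimal sketch of the level-by-level growth described above, assuming Euclidean distance and simplifying the adaptive spacing to the minimum distance between the two newest generations; names and the simplifications are assumptions, not the patent's exact procedure:

```python
import numpy as np

def grow_sft(points, source_idx):
    """Grow one shortest bifurcation tree (SFT) from an assumed source node.

    points : (n, d) array of data objects; source_idx : index of the source.
    The primary node is the globally nearest point to the source; each later
    generation collects all unassigned points within the adaptive radius
    Dist = min Euclidean distance between the two previous generations.
    Returns the tree as a list of generations (lists of point indices).
    """
    unassigned = set(range(len(points))) - {source_idx}
    # primary node: nearest neighbor of the source (exactly one)
    dists = {j: np.linalg.norm(points[source_idx] - points[j]) for j in unassigned}
    primary = min(dists, key=dists.get)
    unassigned.remove(primary)
    tree = [[source_idx], [primary]]
    radius = dists[primary]
    while unassigned:
        parents = tree[-1]
        children = sorted(j for j in unassigned
                          if any(np.linalg.norm(points[p] - points[j]) <= radius
                                 for p in parents))
        if not children:          # no new node inside the search radius: stop
            break
        unassigned -= set(children)
        # adaptive spacing: minimum distance between the two newest generations
        radius = min(np.linalg.norm(points[p] - points[j])
                     for p in parents for j in children)
        tree.append(children)
    return tree
```

On collinear points 0, 1, 2, 10 with source 0, the tree absorbs 0 → 1 → 2 and stops, leaving the distant point 10 for another cluster, which mirrors how outliers end up isolated in their own SFT.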
S102: and pruning and combining the shortest bifurcation tree by adopting a threshold pruning algorithm based on a rough clustering neighborhood information system to obtain the simplified shortest bifurcation tree.
Please refer to fig. 4, which is a schematic diagram illustrating a process of processing a shortest bifurcation tree by using a threshold pruning algorithm of a rough clustering neighborhood information system in the data processing method according to the embodiment of the present invention.
Specifically, S102 may include: merging the branches containing shared nodes into one shortest bifurcation tree structure according to the attributes of each data object in the shortest bifurcation tree, and cutting off the completely intersected branches in the shortest bifurcation tree to obtain the simplified shortest bifurcation tree.
The method comprises the following specific steps:
1. Any shortest bifurcation tree formed in S101 is extracted, and the completely intersected branches are cut from it; that is, supposing there exist two different branches T_i and T_j with |T_i| ≥ |T_j|, the pruning condition is:
T_i ∩ T_j = T_j (that is, T_j is completely contained in T_i);
at this time, T_j is cut off, retaining only T_i.
2. Shared-node branch clustering: assuming there exist two different branches T_1 and T_2 with |T_1| ≥ |T_2|, the condition for shared-node clustering is that T_2 contains nodes of T_1; when it is met, the two branches are merged into T_1.
In a specific example, after the branches containing shared nodes are combined into one shortest bifurcation tree structure and the completely intersected branches in the shortest bifurcation tree are cut off to obtain the simplified shortest bifurcation tree, the method further includes: calculating the median and average of the sums of the distances of each data object in the shortest bifurcation tree according to the Dist attribute of each data object, and cutting off the branches whose score is less than or equal to a deviation threshold according to the deviation threshold formula, wherein the deviation threshold formula is: DEV = average + (average − median), where DEV is the deviation threshold, average is the mean, and median is the median; cutting off the branches whose score is less than or equal to the deviation threshold means: branches whose data object Dist values are less than or equal to the deviation threshold are pruned.
It should be understood that pruning branches with scores less than or equal to the deviation threshold according to the deviation threshold formula refers to pruning the weakly weighted branch cluster class with scores lower than the deviation threshold in the shortest bifurcation tree according to the deviation threshold formula.
By pruning the completely intersected branches in the rough-clustering shortest bifurcation tree and merging the branches containing shared nodes, the data structure of the shortest bifurcation tree can be further simplified, which facilitates subsequent data processing.
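The deviation-threshold step above can be sketched in Python. The dict-of-branch-scores representation and both function names are illustrative assumptions; the formula DEV = mean + |mean − median| follows the text:

```python
import statistics


def deviation_threshold(scores):
    """DEV = mean + |mean - median| over the branch scores."""
    mean = statistics.mean(scores)
    median = statistics.median(scores)
    return mean + abs(mean - median)


def prune_weak_branches(branch_scores):
    """Drop weakly weighted branches whose score is <= the deviation threshold."""
    dev = deviation_threshold(list(branch_scores.values()))
    return {name: s for name, s in branch_scores.items() if s > dev}
```

For example, with branch scores {"a": 10.0, "b": 12.0, "c": 2.0} the mean is 8.0 and the median 10.0, so DEV = 10.0 and only branch "b" survives.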
S103: and calculating the abnormal degree of the data object in the simplified shortest bifurcation tree by adopting an abnormal value detection algorithm of the balanced fusion data local multi-feature factors, and determining and removing the abnormal data value in the target data set according to the abnormal degree.
In a specific example, S103 may include: standardizing the data in the simplified shortest bifurcation tree according to T_{i-stand} = T_i + |min(T_i)|; then calculating the distance between the nodes in the same shortest bifurcation tree according to:
[N_dis(x) formula, shown only as an image in the original publication]
where N_dis(x) is the calculated inter-node distance of the shortest bifurcation tree, x is a specified data object, x_i denotes the other data objects in the shortest bifurcation tree class, k represents the number of data objects in the class, and exp(1) represents the constant with base e and exponent 1; and then calculating the coefficient of variation of the data in the shortest bifurcation tree according to the following formulas:
Nstd(Tj) = sqrt( (1/k) · Σ_{q=1}^{k} (x_q − Nmean(Tj))² )
Nmean(Tj) = (1/k) · Σ_{q=1}^{k} x_q
Ncv(Tj) = Nstd(Tj) / Nmean(Tj)
wherein T represents the sum of the distances of all nodes in any shortest bifurcation tree cluster class, i represents the index label of T, x_q represents the distance of each node in the shortest bifurcation tree, k represents the number of nodes contained in the cluster class, j is the number of the shortest bifurcation tree, Nstd(T) is the standard deviation of the class, Nmean(T) is the mean of the class, and Ncv(T) is the coefficient of variation. The local relative proximity of the data objects in the class is then calculated according to:
[local relative proximity (LRP) formula, shown only as an image in the original publication]
and from the local relative proximity the degree of abnormality is calculated according to:
[MDILAF formula, shown only as an image in the original publication]
MDILAF is taken as the degree of abnormality of the data object, and the abnormal data values in the target data set are determined and eliminated according to the degree of abnormality, wherein Nx is the shortest bifurcation tree class of the data object x, and |N(x)| is the sum of the distances of all the remaining data objects in the class.
Fig. 5 is a schematic flowchart of the abnormal value detection algorithm of the balanced fusion data local multi-feature factors in the data processing method provided by the embodiment of the present invention.
Because the abnormal value detection algorithm of the balanced fusion data local multi-feature factor introduces local relative proximity (LRP) into the standard local outlier factor (LOF) to replace the local reachability density (LRD) of the data object, adjusts the calculation ratio of neighborhood dispersion degree to distance into a calculation mode suitable for rough clustering, and introduces the coefficient of variation to characterize the intra-class dispersion degree, the degree of abnormality of a data object can be accurately and quantitatively analyzed, and the data objects judged abnormal (namely the abnormal data values) are eliminated.
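Since the exact N_dis, LRP and MDILAF formulas are published only as images, the sketch below illustrates just the coefficient-of-variation idea on one cluster class; the flagging rule and all names are assumptions for illustration, not the patented score:

```python
import math


def coefficient_of_variation(distances):
    """Population CV = std / mean of the node distances in one cluster class."""
    k = len(distances)
    mean = sum(distances) / k
    std = math.sqrt(sum((x - mean) ** 2 for x in distances) / k)
    return std / mean


def flag_dispersed_objects(distances, factor=1.0):
    """Illustrative rule: flag objects whose distance exceeds mean*(1 + factor*CV)."""
    mean = sum(distances) / len(distances)
    limit = mean * (1.0 + factor * coefficient_of_variation(distances))
    return [i for i, x in enumerate(distances) if x > limit]
```

A tight class yields CV = 0 and nothing is flagged; a class with one far-away object flags that object's index.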
In a specific example, after S103, the method further includes: performing dimensionality reduction on the target data set using an improved sparse autoencoder, wherein the improved sparse autoencoder uses a sparse rule operator in place of the KL relative entropy as the sparsity constraint term, and uses the L2 norm as the regularization term.
Specifically, the sparse autoencoder uses the activation of the hidden-layer neurons: a_j(X) represents the degree of activation of hidden neuron j of the autoencoder neural network given input X. The average activation of hidden neuron j of the sparse autoencoder is defined as:
ρ̂_j = (1/m) · Σ_{i=1}^{m} a_j(x^{(i)})
where m is the number of training samples and the index value j labels each neuron position.
The loss function of the original sparse autoencoder is generally expressed as the mean square error, with a KL-divergence term added on that basis as the sparsity constraint. The specific formulas are:
J(W, b) = (1/m) · Σ_{i=1}^{m} (1/2) · ‖h_{W,b}(x^{(i)}) − x^{(i)}‖²
KL(ρ ‖ ρ̂_j) = ρ · log(ρ / ρ̂_j) + (1 − ρ) · log((1 − ρ) / (1 − ρ̂_j))
J_sparse(W, b) = J(W, b) + β · Σ_{j=1}^{s2} KL(ρ ‖ ρ̂_j)
where β is the penalty factor, Σ_{j=1}^{s2} KL(ρ ‖ ρ̂_j) is the penalty term, and ρ is the target sparsity. The update mechanism of the sparse autoencoder is as follows:
[sparse autoencoder parameter update rule, shown only as an image in the original publication; it involves the derivative of the layer output z in the neural network]
When the improved sparse autoencoder is used to reduce the dimensionality of the target data set, the method may specifically include: constructing the following objective loss function from the improved sparse autoencoder:
Jspare(W, b) = J(W, b) + λ1 · Σ_{j=1}^{s2} |ρ̂_j| + λ2 · ‖W‖²
wherein λ1 is the sparse penalty term weight, λ2 is the weight attenuation coefficient, s2 represents the number of hidden-layer neurons, W represents the neural network weight coefficients, b represents the neural network bias term, j represents the index of a neuron in the range [1, s2], J(W, b) represents the initial loss function term of the sparse autoencoder, and Jspare(W, b) represents the objective loss function of the improved sparse autoencoder; the dimensionality of the target data set is then reduced according to the objective loss function.
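Because the patent publishes the exact objective only as an image, the following pure-Python sketch is an assumed reading of it: reconstruction error plus an L1 penalty on the average hidden activations (replacing the KL term) plus L2 weight decay. All names are illustrative:

```python
def improved_sparse_loss(X, X_hat, hidden_acts, W, lam1=1e-3, lam2=1e-4):
    """J(W,b) + lam1 * sum_j |rho_hat_j| + lam2 * ||W||^2 -- an assumed reading."""
    m = len(X)
    # J(W, b): mean over samples of 0.5 * squared reconstruction error
    recon = sum(
        0.5 * sum((xh - x) ** 2 for x, xh in zip(row, row_hat))
        for row, row_hat in zip(X, X_hat)
    ) / m
    # average activation rho_hat_j of each hidden neuron over the samples
    n, s2 = len(hidden_acts), len(hidden_acts[0])
    rho_hat = [sum(acts[j] for acts in hidden_acts) / n for j in range(s2)]
    sparsity = lam1 * sum(abs(r) for r in rho_hat)              # L1 replaces KL
    weight_decay = lam2 * sum(w * w for row in W for w in row)  # L2 regular term
    return recon + sparsity + weight_decay
```

The L1 term drives average activations toward zero without the numerical issues KL divergence has when ρ̂_j approaches 0 or 1, which is one plausible motivation for the substitution described above.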
Specifically, according to the constructed objective loss function, the neural network parameter update mechanism becomes:
[updated parameter update rule, shown only as an image in the original publication]
Reference may be made to fig. 6, which is a schematic diagram of the network mechanism of the improved sparse autoencoder in the data processing method according to the embodiment of the present invention.
Fig. 7 is a flowchart illustrating a data processing method according to an embodiment of the present invention.
By replacing the KL relative entropy with a sparse rule operator as the sparsity constraint term, the sparsity performance of the algorithm can be improved; by using the L2 norm as the regularization term, the polynomial component weights can be balanced, improving the sparse autoencoder's ability to prevent overfitting when processing data. Meanwhile, performing data dimensionality reduction with the improved sparse autoencoder on the data after outlier detection can reduce data redundancy and improve the simplicity and reliability of the data.
The data processing method provided by the embodiment of the invention uses the shortest bifurcation tree rough clustering algorithm to roughly cluster a target data set into a plurality of shortest bifurcation trees, then uses the threshold pruning algorithm of the rough clustering neighborhood information system to prune and merge the shortest bifurcation trees, then uses the abnormal value detection algorithm of the balanced fusion data local multi-feature factors to calculate the degree of abnormality of the data objects in the shortest bifurcation trees, and determines and eliminates the abnormal data values according to the degree of abnormality of the data objects. Because the data of the target data set is analyzed automatically by these algorithms, the efficiency of data analysis can be improved. Meanwhile, because the abnormal value detection algorithm of the balanced fusion data local multi-feature factor introduces local relative proximity into the standard local outlier factor to replace the local reachability density of the data object, adjusts the calculation ratio of neighborhood dispersion degree to distance into a calculation mode suitable for rough clustering, and introduces the coefficient of variation to characterize the intra-class dispersion degree, the degree of abnormality of a data object can be accurately and quantitatively analyzed, and the abnormal data values in the original data (namely the target data set) are determined and eliminated according to the degree of abnormality, improving the accuracy of analysis results and decisions.
The steps of the above methods are divided only for clarity of description; in implementation they may be combined into one step, or some steps may be split into multiple steps, and as long as the same logical relationship is included, they are within the protection scope of this patent; adding insignificant modifications to the algorithms or processes, or introducing insignificant design changes without changing the core design of the algorithms or processes, is also within the protection scope of this patent.
A second embodiment of the present invention relates to a data processing apparatus 200, as shown in fig. 8, including: the clustering module 201, the processing module 202 and the determining module 203, the functions of each module are described in detail as follows:
the clustering module 201 is configured to obtain a target data set, perform rough clustering on the target data set by using a shortest bifurcation tree rough clustering algorithm, and form a plurality of shortest bifurcation trees according to a rough clustering result;
the processing module 202 is configured to prune and merge the shortest bifurcation trees by using a threshold pruning algorithm based on a rough clustering neighborhood information system to obtain a reduced shortest bifurcation tree;
the determining module 203 is configured to calculate an abnormal degree of the data object in the reduced shortest bifurcation tree by using an abnormal value detection algorithm for the balanced fusion data local multi-feature factor, and determine and remove an abnormal data value in the target data set according to the abnormal degree.
Further, the data processing apparatus 200 provided by the embodiment of the present invention further includes a dimension reduction module, where the dimension reduction module is configured to: perform dimensionality reduction on the target data set using an improved sparse autoencoder, wherein the improved sparse autoencoder uses a sparse rule operator in place of the KL relative entropy as the sparsity constraint term, and uses the L2 norm as the regularization term.
Further, the dimension reduction module is further configured to:
the following objective loss function is constructed from the improved sparse autoencoder:
Jspare(W, b) = J(W, b) + λ1 · Σ_{j=1}^{s2} |ρ̂_j| + λ2 · ‖W‖²
wherein said λ1 is the sparse penalty term weight, said λ2 is the weight attenuation coefficient, s2 represents the number of hidden-layer neurons, W represents the neural network weight coefficients, b represents the neural network bias term, j represents the index of a neuron, J(W, b) represents the initial loss function term of the sparse autoencoder, and Jspare(W, b) represents the objective loss function of the improved sparse autoencoder;
and reducing the dimension of the target data set according to the target loss function.
Further, the determining module 203 is specifically configured to:
according to T_{i-stand} = T_i + |min(T_i)|, standardizing the data in the simplified shortest bifurcation tree;
according to
[N_dis(x) formula, shown only as an image in the original publication]
calculating the distance between the nodes in the same shortest bifurcation tree, wherein N_dis(x) is the calculated inter-node distance of the shortest bifurcation tree, x is a specified data object, x_i denotes the other data objects in the shortest bifurcation tree class, k represents the number of data objects in the class, and exp(1) represents the constant with base e and exponent 1;
calculating the variation coefficient of the data in the shortest bifurcation tree according to the following formula respectively:
Nstd(Tj) = sqrt( (1/k) · Σ_{q=1}^{k} (x_q − Nmean(Tj))² )
Nmean(Tj) = (1/k) · Σ_{q=1}^{k} x_q
Ncv(Tj) = Nstd(Tj) / Nmean(Tj)
wherein the T represents the sum of the distances of all nodes in any shortest bifurcation tree cluster class, i represents the index label of T, x_q represents the distance of each node in the shortest bifurcation tree, k represents the number of nodes contained in the cluster class, Nstd(T) is the standard deviation of the class, j denotes the number of the shortest bifurcation tree, Nmean(T) is the mean of the class, and Ncv(T) is the coefficient of variation;
according to
[local relative proximity (LRP) formula, shown only as an image in the original publication]
Calculating local relative proximity of data objects in the class;
calculating according to local relative proximity
[MDILAF formula, shown only as an image in the original publication]
Taking the MDILAF as the degree of abnormality of a data object, and determining and eliminating the abnormal data values in the target data set according to the degree of abnormality, wherein Nx is the shortest bifurcation tree class of the data object x, and |N(x)| is the sum of the distances of all the remaining data objects in the class.
Further, the clustering module 201 is specifically configured to:
determining a source node in the target dataset;
searching the nearest node of the source node, and taking the nearest node of the source node as a primary node;
starting with the initial generation node as the current parent node, cyclically executing a search for the descendant node set taking the current parent node as the starting point and the adaptive node spacing as the neighborhood search radius; if a new node exists within the neighborhood search radius, taking the new node as the current parent node and continuing to search for the descendant node set with the adaptive node spacing as the neighborhood search radius, until no new node exists within the neighborhood search radius; then ending the search, storing all nodes and node distances from the source node to the last generation node, and forming a shortest bifurcation tree from all the nodes from the source node to the last generation node, wherein the node distances are the set of distances between nodes of the same level and their descendant nodes, and the adaptive node spacing is: Dist = arg min(last-gener_i, next-gener_j), wherein last-gener_i is a parent node in two adjacent generations and next-gener_j is a descendant node in two adjacent generations.
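The search loop described above can be sketched roughly as follows, using 1-D points and a fixed search radius for brevity; in the patent the radius is the adaptive node spacing Dist, and the function and variable names here are illustrative assumptions:

```python
def grow_shortest_bifurcation_tree(points, source_idx, radius):
    """Chain from the source node to successive nearest nodes within `radius`."""
    unvisited = set(range(len(points))) - {source_idx}
    order, node_dists = [source_idx], []
    current = source_idx
    while True:
        in_radius = [
            (abs(points[i] - points[current]), i)
            for i in unvisited
            if abs(points[i] - points[current]) <= radius
        ]
        if not in_radius:
            break  # no new node within the neighborhood search radius
        d, nxt = min(in_radius)
        order.append(nxt)
        node_dists.append(d)
        unvisited.remove(nxt)
        current = nxt  # the new node becomes the current parent node
    return order, node_dists
```

For points [0.0, 1.0, 2.0, 10.0] with source 0 and radius 1.5, the tree grows 0 → 1 → 2 and stops, since node 3 lies outside the search radius.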
Further, the processing module 202 is specifically configured to:
and combining the branches containing the shared nodes into a shortest branched tree structure according to the attribute of each data object in the shortest branched tree, and cutting off the completely intersected branches in the shortest branched tree to obtain the simplified shortest branched tree.
Further, the processing module 202 is further configured to:
according to the Dist attribute of each data object in the shortest bifurcation tree, calculating the median and the average of the sum of the distances of each data object in the shortest bifurcation tree, and pruning the branches whose score is less than or equal to a deviation threshold according to the deviation threshold formula DEV = mean + |mean − median|, wherein the DEV is the deviation threshold, the mean is the average, and the median is the median.
It should be understood that this embodiment is a device embodiment corresponding to the first embodiment, and that this embodiment can be implemented in cooperation with the first embodiment. The related technical details mentioned in the first embodiment are still valid in this embodiment, and are not described herein again in order to reduce repetition. Accordingly, the related-art details mentioned in the present embodiment can also be applied to the first embodiment.
It should be noted that each module referred to in this embodiment is a logical module, and in practical applications, one logical unit may be one physical unit, may be a part of one physical unit, and may be implemented by a combination of multiple physical units. In addition, in order to highlight the innovative part of the present invention, elements that are not so closely related to solving the technical problems proposed by the present invention are not introduced in the present embodiment, but this does not indicate that other elements are not present in the present embodiment.
A third embodiment of the present invention relates to a network device, as shown in fig. 9, including at least one processor 301; and a memory 302 communicatively coupled to the at least one processor 301; the memory 302 stores instructions executable by the at least one processor 301, and the instructions are executed by the at least one processor 301, so that the at least one processor 301 can execute the data processing method.
Where the memory 302 and the processor 301 are coupled in a bus, the bus may comprise any number of interconnected buses and bridges, the buses coupling one or more of the various circuits of the processor 301 and the memory 302. The bus may also connect various other circuits such as peripherals, voltage regulators, power management circuits, and the like, which are well known in the art, and therefore, will not be described any further herein. A bus interface provides an interface between the bus and the transceiver. The transceiver may be one element or a plurality of elements, such as a plurality of receivers and transmitters, providing a means for communicating with various other apparatus over a transmission medium. The data processed by the processor 301 is transmitted over a wireless medium through an antenna, which further receives the data and transmits the data to the processor 301.
The processor 301 is responsible for managing the bus and general processing and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. And memory 302 may be used to store data used by processor 301 in performing operations.
A fourth embodiment of the present invention relates to a computer-readable storage medium storing a computer program. The computer program realizes the above-described method embodiments when executed by a processor.
That is, those skilled in the art can understand that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing the relevant hardware, where the program is stored in a storage medium and includes several instructions to cause a device (which may be a single-chip microcomputer, a chip, etc.) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
It will be understood by those of ordinary skill in the art that the foregoing embodiments are specific embodiments for practicing the invention, and that various changes in form and details may be made therein without departing from the spirit and scope of the invention in practice.

Claims (10)

1. A data processing method, comprising:
acquiring a target data set, carrying out rough clustering on the target data set by adopting a shortest bifurcation tree rough clustering algorithm, and forming a plurality of shortest bifurcation trees according to a rough clustering result;
pruning and combining the shortest bifurcation tree by adopting a threshold pruning algorithm based on a rough clustering neighborhood information system to obtain a simplified shortest bifurcation tree;
and calculating the abnormal degree of the data object in the simplified shortest bifurcation tree by adopting an abnormal value detection algorithm of the balanced fusion data local multi-feature factor, and determining and removing the abnormal data value in the target data set according to the abnormal degree.
2. The data processing method according to claim 1, wherein after the computing the degree of abnormality of the data object in the reduced shortest bifurcation tree by the abnormal value detection algorithm using the balanced fusion data local multi-feature factor, and determining and eliminating the abnormal data value in the target data set according to the degree of abnormality of the data object, the method further comprises:
and adopting an improved sparse autoencoder to perform dimensionality reduction on the target data set, wherein the improved sparse autoencoder adopts a sparse rule operator to replace KL relative entropy as a sparsity constraint item, and adopts an L2 norm as a regular item.
3. The data processing method of claim 2, wherein the dimensionality reduction of the target data set using the improved sparse auto-encoder comprises:
the following objective loss function is constructed from the improved sparse autoencoder:
Jspare(W, b) = J(W, b) + λ1 · Σ_{j=1}^{s2} |ρ̂_j| + λ2 · ‖W‖²
wherein said λ1 is the sparse penalty term weight, said λ2 is the weight attenuation coefficient, s2 represents the number of hidden-layer neurons, W represents the neural network weight coefficients, b represents the neural network bias term, j represents the index of a neuron, J(W, b) represents the initial loss function term of the sparse autoencoder, and Jspare(W, b) represents the objective loss function of the improved sparse autoencoder;
and reducing the dimension of the target data set according to the target loss function.
4. The data processing method according to claim 1, wherein the calculating the degree of abnormality of the data objects in the reduced shortest bifurcation tree by using an abnormal value detection algorithm for equalizing and fusing local multi-feature factors of data, and determining and removing abnormal data values in the target data set according to the degree of abnormality of the data objects comprises:
according to T_{i-stand} = T_i + |min(T_i)|, standardizing the data in the simplified shortest bifurcation tree;
according to
[N_dis(x) formula, shown only as an image in the original publication]
calculating the distance between the nodes in the same shortest bifurcation tree, wherein N_dis(x) is the calculated inter-node distance of the shortest bifurcation tree, x is a specified data object, x_i denotes the other data objects in the shortest bifurcation tree class, k represents the number of data objects in the class, and exp(1) represents the constant with base e and exponent 1;
calculating the variation coefficient of the data in the shortest bifurcation tree according to the following formula respectively:
Nstd(Tj) = sqrt( (1/k) · Σ_{q=1}^{k} (x_q − Nmean(Tj))² )
Nmean(Tj) = (1/k) · Σ_{q=1}^{k} x_q
Ncv(Tj) = Nstd(Tj) / Nmean(Tj)
wherein the T represents the sum of the distances of all nodes in any shortest bifurcation tree cluster class, i represents the index label of T, x_q represents the distance of each node in the shortest bifurcation tree, k represents the number of nodes contained in the cluster class, Nstd(T) is the standard deviation of the class, j represents the number of the shortest bifurcation tree, Nmean(T) is the mean of the class, and Ncv(T) is the coefficient of variation;
according to
[local relative proximity (LRP) formula, shown only as an image in the original publication]
Calculating local relative proximity of data objects in the class;
calculating according to local relative proximity
[MDILAF formula, shown only as an image in the original publication]
Taking the MDILAF as the degree of abnormality of a data object, and determining and eliminating the abnormal data values in the target data set according to the degree of abnormality, wherein Nx is the shortest bifurcation tree class of the data object x, and the |N(x)| is the sum of the distances of all the remaining data objects in the class.
5. The data processing method of claim 1, wherein performing rough clustering on the target data set by using the shortest bifurcation tree rough clustering algorithm and forming a plurality of shortest bifurcation trees according to the rough clustering result comprises:
determining a source node in the target dataset;
searching the nearest node of the source node, and taking the nearest node of the source node as a primary node;
starting with the initial generation node as the current parent node, cyclically executing a search for the descendant node set taking the current parent node as the starting point and the adaptive node spacing as the neighborhood search radius; if a new node exists within the neighborhood search radius, taking the new node as the current parent node and continuing to search for the descendant node set with the adaptive node spacing as the neighborhood search radius, until no new node exists within the neighborhood search radius; then ending the search, storing all nodes and node distances from the source node to the last generation node, and forming a shortest bifurcation tree from all the nodes from the source node to the last generation node, wherein the node distances are the set of distances between nodes of the same level and their descendant nodes, and the adaptive node spacing is: Dist = arg min(last-gener_i, next-gener_j), wherein last-gener_i is a parent node in two adjacent generations and next-gener_j is a descendant node in two adjacent generations.
6. The data processing method according to any one of claims 1 to 5, wherein the pruning and merging the shortest bifurcation tree by using a threshold pruning algorithm based on a rough clustering neighborhood information system to obtain a reduced shortest bifurcation tree comprises:
and combining the branches containing the shared nodes into a shortest branched tree structure according to the attribute of each data object in the shortest branched tree, and cutting off the completely intersected branches in the shortest branched tree to obtain the simplified shortest branched tree.
7. The data processing method of claim 4, wherein after combining the branches including the shared node into a shortest bifurcated tree structure and pruning completely intersected branches in the shortest bifurcated tree, the method further comprises:
according to the Dist attribute of each data object in the shortest bifurcation tree, calculating the median and the average of the sum of the distances of each data object in the shortest bifurcation tree, and pruning the branches whose score is less than or equal to a deviation threshold according to the deviation threshold formula DEV = mean + |mean − median|, wherein the DEV is the deviation threshold, the mean is the average, and the median is the median.
8. A data processing apparatus, comprising:
the clustering module is used for acquiring a target data set, carrying out rough clustering on the target data set by adopting a shortest bifurcation tree rough clustering algorithm, and forming a plurality of shortest bifurcation trees according to a rough clustering result;
the processing module is used for pruning and combining the shortest bifurcation tree by adopting a threshold pruning algorithm based on a rough clustering neighborhood information system to obtain a simplified shortest bifurcation tree;
and the determining module is used for calculating the abnormal degree of the data object in the simplified shortest bifurcation tree by adopting an abnormal value detection algorithm of the balanced fusion data local multi-feature factor, and determining and removing the abnormal data value in the target data set according to the abnormal degree.
9. A network device, comprising:
at least one processor; and the number of the first and second groups,
a memory communicatively coupled to the at least one processor; wherein the content of the first and second substances,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the data processing method of any one of claims 1 to 7.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the data processing method of any one of claims 1 to 7.
CN202110678862.XA 2021-06-18 2021-06-18 Data processing method, device, network equipment and storage medium Pending CN113420804A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110678862.XA CN113420804A (en) 2021-06-18 2021-06-18 Data processing method, device, network equipment and storage medium
PCT/CN2022/099638 WO2022262869A1 (en) 2021-06-18 2022-06-17 Data processing method and apparatus, network device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110678862.XA CN113420804A (en) 2021-06-18 2021-06-18 Data processing method, device, network equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113420804A true CN113420804A (en) 2021-09-21

Family

ID=77789079

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110678862.XA Pending CN113420804A (en) 2021-06-18 2021-06-18 Data processing method, device, network equipment and storage medium

Country Status (2)

Country Link
CN (1) CN113420804A (en)
WO (1) WO2022262869A1 (en)


Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117272216B (en) * 2023-11-22 2024-02-09 中国建材检验认证集团湖南有限公司 Data analysis method for automatic flow monitoring station and manual water gauge observation station
CN117370331B (en) * 2023-12-08 2024-02-20 河北建投水务投资有限公司 Method and device for cleaning total water consumption data of cell, terminal equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030061213A1 (en) * 2001-07-31 2003-03-27 International Business Machines Corporation Method for building space-splitting decision tree
CN111985837A (en) * 2020-08-31 2020-11-24 平安医疗健康管理股份有限公司 Risk analysis method, device and equipment based on hierarchical clustering and storage medium
CN112800148A (en) * 2021-02-04 2021-05-14 国网福建省电力有限公司 Scattered pollutant enterprise research and judgment method based on clustering feature tree and outlier quantization

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111444247B (en) * 2020-06-17 2023-10-17 北京必示科技有限公司 Root cause positioning method, root cause positioning device and storage medium based on KPI (key performance indicator)
CN113420804A (en) * 2021-06-18 2021-09-21 工业互联网创新中心(上海)有限公司 Data processing method, device, network equipment and storage medium

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022262869A1 (en) * 2021-06-18 2022-12-22 工业互联网创新中心(上海)有限公司 Data processing method and apparatus, network device, and storage medium
CN114742178A (en) * 2022-06-10 2022-07-12 航天亮丽电气有限责任公司 Method for non-invasive pressure plate state monitoring through MEMS six-axis sensor
CN114742178B (en) * 2022-06-10 2022-11-08 航天亮丽电气有限责任公司 Method for non-invasive pressure plate state monitoring through MEMS six-axis sensor
CN115202661A (en) * 2022-09-15 2022-10-18 深圳大学 Hybrid generation method with hierarchical structure layout and related equipment
CN115202661B (en) * 2022-09-15 2022-11-29 深圳大学 Hybrid generation method with hierarchical structure layout and related equipment

Also Published As

Publication number Publication date
WO2022262869A1 (en) 2022-12-22

Similar Documents

Publication Publication Date Title
CN113420804A (en) Data processing method, device, network equipment and storage medium
US20210049512A1 (en) Explainers for machine learning classifiers
CN112529153B (en) BERT model fine tuning method and device based on convolutional neural network
CN110428137B (en) Updating method and device of risk prevention and control strategy
WO2020228378A1 (en) Method and device for determining database configuration parameters
US10592634B1 (en) Systems and methods for automatic handling of engineering design parameter violations
CN112765477A (en) Information processing method and device, information recommendation method and device, electronic equipment and storage medium
CN110310114A (en) Object classification method, device, server and storage medium
Mitra et al. Feature selection using structural similarity
Perez-Godoy et al. CO2RBFN: an evolutionary cooperative–competitive RBFN design algorithm for classification problems
Anderson et al. The rankability of data
WO2008156595A1 (en) Hybrid method for simulation optimization
KR20210066545A (en) Electronic device, method, and computer readable medium for simulation of semiconductor device
Yan et al. A clustering algorithm for multi-modal heterogeneous big data with abnormal data
CN113642727B (en) Training method of neural network model and processing method and device of multimedia information
CN114185761A (en) Log collection method, device and equipment
Tembine Mean field stochastic games: Convergence, Q/H-learning and optimality
CN113204642A (en) Text clustering method and device, storage medium and electronic equipment
US11315036B2 (en) Prediction for time series data using a space partitioning data structure
US20150134307A1 (en) Creating understandable models for numerous modeling tasks
CN110084376B (en) Method and device for automatically separating data into boxes
Stotz et al. Incremental graph matching for situation awareness
US11275816B2 (en) Selection of Pauli strings for Variational Quantum Eigensolver
CN115412401B (en) Method and device for training virtual network embedding model and virtual network embedding
CN110995384A (en) Broadcast master control fault trend prejudging method based on machine learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination