WO2022262869A1 - Data processing method and apparatus, network device, and storage medium - Google Patents

Data processing method and apparatus, network device, and storage medium Download PDF

Info

Publication number
WO2022262869A1
WO2022262869A1 · PCT/CN2022/099638 · CN2022099638W
Authority
WO
WIPO (PCT)
Prior art keywords
shortest
tree
data
node
forked
Prior art date
Application number
PCT/CN2022/099638
Other languages
French (fr)
Chinese (zh)
Inventor
郑忠斌
王朝栋
彭新
Original Assignee
工业互联网创新中心(上海)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 工业互联网创新中心(上海)有限公司
Publication of WO2022262869A1

Links

Images

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/23: Clustering techniques
    • G06F 18/232: Non-hierarchical techniques
    • G06F 18/2323: Non-hierarchical techniques based on graph theory, e.g. minimum spanning trees [MST] or graph cuts
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/24: Classification techniques
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/25: Fusion techniques
    • G06F 18/253: Fusion techniques of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods

Definitions

  • the present application relates to the technical field of communications, and in particular to a data processing method, device, network equipment, and storage medium.
  • Some embodiments of the present application provide a data processing method, device, network device, and storage medium.
  • The embodiment of the present application provides a data processing method, including: obtaining a target data set, performing rough clustering on the target data set with the shortest forked tree rough clustering algorithm, and forming multiple shortest forked trees according to the rough clustering result; pruning and merging the shortest forked trees with a threshold pruning algorithm based on a rough clustering neighborhood information system to obtain simplified shortest forked trees; and calculating the abnormality degree of data objects in the simplified shortest forked trees with an outlier detection algorithm that balances and fuses local multi-feature factors of the data, and determining and eliminating abnormal data values in the target data set according to the abnormality degree.
  • The embodiment of the present application also provides a data processing device, including: a clustering module, configured to obtain a target data set, perform rough clustering on it with the shortest forked tree rough clustering algorithm, and form multiple shortest forked trees according to the rough clustering result; a processing module, configured to prune and merge the shortest forked trees with a threshold pruning algorithm based on a rough clustering neighborhood information system to obtain simplified shortest forked trees; and a determination module, configured to calculate the abnormality degree of data objects in the simplified shortest forked trees with the outlier detection algorithm that balances and fuses local multi-feature factors of the data, and to determine and eliminate abnormal data values in the target data set according to the abnormality degree.
  • The embodiment of the present application also provides a network device, including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can execute the above data processing method.
  • Embodiments of the present application also provide a computer-readable storage medium storing a computer program, which implements the above data processing method when executed by a processor.
  • FIG. 1 is a schematic flowchart of the data processing method provided in the first embodiment of the present application;
  • FIG. 2 is a schematic diagram of the algorithm process of the shortest forked tree rough clustering algorithm in the data processing method provided in the first embodiment of the present application;
  • FIG. 3 is an example diagram of the search results of a first-generation node in the data processing method provided in the first embodiment of the present application;
  • FIG. 4 is a schematic diagram of the process of processing the shortest forked tree with the threshold pruning algorithm of the rough clustering neighborhood information system in the data processing method provided in the first embodiment of the present application;
  • FIG. 5 is a schematic flowchart of the outlier detection algorithm that balances and fuses local multi-feature factors of the data in the data processing method provided in the first embodiment of the present application;
  • FIG. 6 is a schematic diagram of the network mechanism of the improved sparse autoencoder in the data processing method provided in the first embodiment of the present application;
  • FIG. 7 is an example flowchart of the data processing method provided in the first embodiment of the present application;
  • FIG. 8 is a schematic diagram of the module structure of the data processing device provided in the second embodiment of the present application;
  • FIG. 9 is a schematic structural diagram of the network device provided in the third embodiment of the present application.
  • The first embodiment of the present application relates to a data processing method: the shortest forked tree rough clustering algorithm is used to perform rough clustering on a target data set to form multiple shortest forked trees; a threshold pruning algorithm based on the rough clustering neighborhood information system then prunes and merges the shortest forked trees; finally, an outlier detection algorithm that balances and fuses local multi-feature factors of the data calculates the abnormality degree of the data objects in the shortest forked trees, and abnormal data values are determined and eliminated according to the abnormality degree of the data objects.
  • Abnormal data values in the original data can be eliminated to improve the efficiency of data analysis and the accuracy of decision-making.
  • Because an algorithm automatically analyzes the data of the target data set, the efficiency of data analysis can be improved. At the same time, the outlier detection algorithm that balances and fuses local multi-feature factors of the data introduces the local relative proximity into the standard local outlier factor to replace the local reachability density of data objects, adjusts the ratio of neighborhood dispersion to distance calculation to a form suitable for rough clustering, and introduces the coefficient of variation to characterize the dispersion within a class. The abnormality degree of data objects can therefore be analyzed accurately and quantitatively, so abnormal data values in the original data (i.e., the target data set) can be accurately determined and eliminated, improving the accuracy of analysis results and decisions.
  • the execution subject of the data processing method provided in the embodiments of the present application may be a server, wherein the server may be implemented by a single server or a server cluster composed of multiple servers, and the following uses the server as an example for illustration.
  • S101 Obtain the target data set, perform rough clustering on the target data set by using the shortest fork tree rough clustering algorithm, and form multiple shortest fork trees according to the rough clustering results.
  • The target data set may be real-time data or offline data, such as the offline data of an enterprise; when the target data set is real-time data, it refers to the data at a certain moment.
  • S101 may include: determining the source nodes in the target data set, searching for the node nearest to each source node, and taking that nearest node as the first-generation node; then, starting with the first-generation node as the current parent node, cyclically searching for the descendant node set with the current parent node as the starting point and the adaptive node spacing as the neighborhood search radius; if a new node exists within the neighborhood search radius, taking the new node as the current parent node and continuing to search the descendant node set with the adaptive node spacing as the neighborhood search radius until no new node exists within the radius; ending the search and storing all nodes and node distances from the source node to the last-generation node; and forming the shortest forked tree from all nodes between the source node and the last-generation node, where the node distance is the set of spacings between nodes of the same level and their descendant nodes, and the adaptive node spacing is Dist = arg min(Euclidean_dist(last-gener_i, next-gener_j)), in which last-gener_i is the parent in two adjacent generations, next-gener_j is the child, Euclidean_dist() denotes the Euclidean distance, and i and j denote the node indexes of the parent and child respectively.
  • FIG. 2 is a schematic diagram of the algorithm process of the shortest bifurcated tree rough clustering algorithm in the data processing method provided by the embodiment of the present application.
  • the following uses a specific process as an example to illustrate:
  • 1. The server collects the offline data of the enterprise as the target data set, assumes that all data objects in the target data set are outliers and identifies them as source nodes (the number of data objects in the offline data equals the assumed number of source nodes), and stores the relevant attributes (loca, value) of each source node, where loca is the location of the data object and value is the data value.
  • 2. Globally search the descendant node sets; the global search strategy mainly includes calculating the adaptive node spacing and determining adjacent nodes. Take searching the next two generations of node data sets from a source node as an example:
  • The first-generation node contains three attributes (loca, value, |x_i, x_i1|). Taking the first-generation node as the starting point of the next level and the spacing |x_i, x_i1| as the neighborhood search radius, the child node set of the first-generation node is searched; the search result is shown in FIG. 3. The descendant node set of the first-generation node is not limited by the number of nodes, and all data objects within the neighborhood search radius belong to its child nodes, but the uniqueness principle must be followed: between two adjacent generations of the same level, next-gener_j can only be produced by searching from last-gener_i; the mapping may be one-to-one or one-to-many, but the data of the two generations cannot overlap, that is, last-gener_i → next-gener_j with an empty intersection between the two generations.
  • The shortest forked tree formed by rough clustering includes two types of data: one is the source node and its searched set of descendant nodes; the other is the set of spacings between corresponding nodes of all adjacent generations that form the shortest forked tree.
  • Because outliers in the target data set have low surrounding data object density and large spacing within their neighborhood, the dispersion between a local outlier and its adjacent points is large. Assuming each independent outlier as a source node and using the distance between different levels (i.e., the adaptive node spacing) as the neighborhood search radius, its adjacent points are searched step by step to form a complete tree structure that is identified as a rough class, which achieves the purpose of dividing the data into different clusters.
  • S102 Use a threshold pruning algorithm based on the rough clustering neighborhood information system to prune and merge the shortest forked tree to obtain a simplified shortest forked tree.
  • FIG. 4 is a schematic diagram of the process of processing the shortest forked tree by using the threshold pruning algorithm of the rough clustering neighborhood information system in the data processing method provided by the embodiment of the present application.
  • S102 may include: according to the attributes of each data object in the shortest forked trees, combining the branches containing shared nodes into one shortest forked tree structure, and pruning the branches that completely intersect in the shortest forked tree, to obtain the simplified shortest forked tree.
  • Pruning branches whose scores are less than or equal to the deviation threshold means: pruning branches whose data objects' Dist values are less than or equal to the deviation threshold.
  • Pruning branches whose scores are less than or equal to the deviation threshold according to the deviation threshold formula refers to pruning, according to that formula, the weak-weight branch clusters in the shortest forked tree whose scores are lower than the deviation threshold.
  • S103 Calculate the abnormality degree of the data object in the streamlined shortest bifurcation tree by using the outlier detection algorithm of the local multi-characteristic factors of the balanced fusion data, and determine and eliminate the abnormal data value in the target data set according to the abnormality degree.
  • S103 may include: numerically processing the data in the simplified shortest forked tree, i.e., a data standardization step, where T_o denotes a simplified shortest forked tree branch and T_o-stand denotes the T_o branch after numerical processing; calculating the distance between the nodes of the same shortest forked tree according to N_dis(x), where N_dis(x) is the computed distance between the nodes of the shortest forked tree, x is the specified data object, x_i are the other data objects in that shortest forked tree class, K denotes the number of data objects in the class, and exp(1) denotes the constant with base e and exponent 1; and calculating the coefficient of variation of the data in each shortest forked tree, where T_i denotes the sum of the distances of all nodes in any shortest forked tree cluster class, x_c denotes the distance of each node in the shortest forked tree corresponding to T_i, α denotes the number of nodes contained in the cluster class, β is the number of shortest forked trees, N_std(T_i) is the standard deviation of the shortest forked tree cluster class, N_mean(T_i) is the mean of the class, and N_cv(T_i) is the coefficient of variation.
  • FIG. 5 is a schematic flowchart of the outlier detection algorithm that balances and fuses local multi-feature factors of the data in the data processing method provided by the embodiment of the present application.
  • In this outlier detection algorithm, the local relative proximity (Local Relative Proximity, LRP) is introduced into the standard local outlier factor (Local Outlier Factor, LOF) to replace the local reachability density (Local Reachability Density, LRD) of data objects, the ratio of neighborhood dispersion to distance calculation is adjusted to a form suitable for rough clustering, and the coefficient of variation is introduced to characterize the dispersion within a class; therefore the abnormality degree of data objects can be analyzed accurately and quantitatively and the data objects judged abnormal (i.e., abnormal data values) can be eliminated.
  • After S103, the method may further include: using an improved sparse autoencoder to reduce the dimensionality of the target data set, where the improved sparse autoencoder uses a sparse regularization operator instead of the KL relative entropy as the sparsity constraint term and uses the L2 norm as the regularization term.
  • The sparse autoencoder imposes a sparsity constraint by adding a neuron-activity term in the hidden layer, which represents the activation of hidden neuron j-ac of the autoencoder neural network given the input X. The average activation of hidden neuron j-ac of the sparse autoencoder is defined accordingly, where the index value j-ac denotes the position label of each neuron, H denotes the number of neurons in the input layer, and h denotes the index of each neuron in the input layer.
  • The loss function of the original sparse autoencoder is generally expressed as the mean squared error, with the KL divergence added on this basis as a sparsity constraint; in the corresponding formulas, f'(z_q) denotes the derivative of the output layer z in the neural network and q denotes the number of neurons in the output layer.
  • The improved sparse autoencoder constructs a target loss function in which λ1 is the weight of the sparse penalty term, λ2 is the weight decay coefficient, S_2 denotes the number of hidden-layer neurons, W_s denotes the weight coefficients of all hidden-layer neurons in the neural network, b denotes the bias term of the neural network, s denotes the index of a hidden-layer neuron in the range [1, S_2], J(W, b) denotes the initial loss function term of the sparse autoencoder, J_sparse(W, b) denotes the target loss function of the improved sparse autoencoder, y denotes the true value, and h_{w,h}(x) denotes the predicted value of the neural network for input x.
  • FIG. 6 is a schematic diagram of a network mechanism of an improved sparse autoencoder of the data processing method provided in the embodiment of the present application.
  • FIG. 7 is an example flow diagram of a data processing method provided in an embodiment of the present application.
  • Replacing the KL relative entropy with the sparse regularization operator can improve the performance of the algorithm; using the L2 norm as the regularization term can balance the weights of the polynomial components and prevent the sparse autoencoder from overfitting when processing data.
  • using an improved sparse autoencoder to reduce the data dimensionality of the data that has been detected by outliers can reduce data redundancy and improve the simplicity and reliability of data.
  • The data processing method provided by the embodiment of the present application uses the shortest forked tree rough clustering algorithm to perform rough clustering on the target data set and form multiple shortest forked trees, then uses the threshold pruning algorithm of the rough clustering neighborhood information system to prune and merge the shortest forked trees, and then uses the outlier detection algorithm that balances and fuses local multi-feature factors of the data to calculate the abnormality degree of the data objects in the shortest forked trees and to determine and eliminate abnormal data values according to the abnormality degree of the data objects.
  • Because an algorithm automatically analyzes the data of the target data set, the efficiency of data analysis can be improved; at the same time, the outlier detection algorithm introduces the local relative proximity into the standard local outlier factor to replace the local reachability density of data objects, adjusts the ratio of neighborhood dispersion to distance calculation to a form suitable for rough clustering, and introduces the coefficient of variation to characterize the dispersion within a class, so the abnormality degree of data objects can be analyzed accurately and quantitatively; abnormal data values in the original data (i.e., the target data set) can therefore be accurately determined and eliminated, improving the accuracy of analysis results and decisions.
  • the second embodiment of the present application relates to a data processing device 200, as shown in FIG. 8 , comprising: a clustering module 201, a processing module 202, and a determination module 203.
  • the clustering module 201 is used to obtain the target data set, perform rough clustering on the target data set by using the shortest fork tree rough clustering algorithm, and form multiple shortest fork trees according to the rough clustering results;
  • the processing module 202 is configured to use a threshold pruning algorithm based on a rough clustering neighborhood information system to prune and merge the shortest forked tree to obtain a simplified shortest forked tree;
  • The determination module 203 is configured to calculate the abnormality degree of the data objects in the simplified shortest forked tree with the outlier detection algorithm that balances and fuses local multi-feature factors of the data, and to determine and eliminate abnormal data values in the target data set according to the abnormality degree.
  • The data processing device 200 provided in the embodiment of the present application further includes a dimensionality reduction module, where the dimensionality reduction module is configured to use an improved sparse autoencoder to reduce the dimensionality of the target data set, and the improved sparse autoencoder uses the sparse regularization operator instead of the KL relative entropy as the sparsity constraint term and uses the L2 norm as the regularization term.
  • The dimensionality reduction module is further configured to construct the target loss function of the improved sparse autoencoder, where λ1 is the weight of the sparse penalty term, λ2 is the weight decay coefficient, S_2 denotes the number of hidden-layer neurons, W_s denotes the weight coefficients of all hidden-layer neurons in the neural network, b denotes the bias term of the neural network, s denotes the index of a hidden-layer neuron in the range [1, S_2], J(W, b) denotes the initial loss function term of the sparse autoencoder, and J_sparse(W, b) denotes the target loss function of the improved sparse autoencoder.
  • The determination module 203 is specifically configured to: numerically process the data in the simplified shortest forked tree; calculate the distance N_dis(x) between the nodes of the same shortest forked tree, where x is the specified data object, x_i are the other data objects in the shortest forked tree class, K denotes the number of data objects in the class, and exp(1) denotes the constant with base e and exponent 1; calculate the coefficient of variation of the data in each shortest forked tree, where T_i denotes the sum of the distances of all nodes in any shortest forked tree cluster class, x_c denotes the distance of each node in the shortest forked tree corresponding to T_i, α denotes the number of nodes contained in the cluster class, β is the number of shortest forked trees, N_std(T_i) is the standard deviation of the shortest forked tree cluster class, N_mean(T_i) is the mean of the class, and N_cv(T_i) is the coefficient of variation; and take MDILAF as the abnormality degree of the data object and determine and eliminate abnormal data values in the target data set according to the abnormality degree, where LRP(x_i) is the local relative proximity among the data in the class other than x, N(x) is the shortest forked tree of the data object x, and |N(x)| is the sum of the distances of all the other data objects in the class.
  • The clustering module 201 is specifically configured to: combine the branches containing shared nodes into one shortest forked tree structure and prune the branches that completely intersect in the shortest forked tree, to obtain the simplified shortest forked tree.
  • The processing module 202 is further configured such that pruning branches whose scores are less than or equal to the deviation threshold refers to pruning branches whose data objects' Dist values are less than or equal to the deviation threshold.
  • this embodiment is a device embodiment corresponding to the first embodiment, and this embodiment can be implemented in cooperation with the first embodiment.
  • the relevant technical details mentioned in the first embodiment are still valid in this embodiment, and will not be repeated here in order to reduce repetition.
  • the relevant technical details mentioned in this implementation manner can also be applied in the first implementation manner.
  • modules involved in this embodiment are logical modules.
  • A logical unit may be a physical unit, a part of a physical unit, or a combination of multiple physical units.
  • units that are not closely related to solving the technical problems proposed in the present application are not introduced in this embodiment, but this does not mean that there are no other units in this embodiment.
  • The third embodiment of the present application relates to a network device. As shown in FIG. 9, it includes at least one processor 301 and a memory 302 communicatively connected to the at least one processor 301; the memory 302 stores instructions executable by the at least one processor 301, and the instructions are executed by the at least one processor 301 so that the at least one processor 301 can execute the above data processing method.
  • the memory 302 and the processor 301 are connected by a bus, and the bus may include any number of interconnected buses and bridges, and the bus connects one or more processors 301 and various circuits of the memory 302 together.
  • the bus may also connect together various other circuits such as peripherals, voltage regulators, and power management circuits, all of which are well known in the art and therefore will not be further described herein.
  • the bus interface provides an interface between the bus and the transceivers.
  • a transceiver may be a single element or multiple elements, such as multiple receivers and transmitters, providing means for communicating with various other devices over a transmission medium.
  • the data processed by the processor 301 is transmitted on the wireless medium through the antenna, and further, the antenna also receives the data and transmits the data to the processor 301 .
  • the processor 301 is responsible for managing the bus and general processing, and may also provide various functions including timing, peripheral interface, voltage regulation, power management and other control functions. And the memory 302 may be used to store data used by the processor 301 when performing operations.
  • the fourth embodiment of the present application relates to a computer-readable storage medium storing a computer program.
  • When the computer program is executed by a processor, the above method embodiments are implemented.
  • The program is stored in a storage medium and includes several instructions for making a device (which may be a single-chip microcomputer, a chip, or the like) or a processor execute all or part of the steps of the methods described in the embodiments of the present application.
  • The aforementioned storage media include various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Discrete Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A data processing method in the technical field of communications, comprising: acquiring a target data set, performing rough clustering on the target data set by using a shortest bifurcation tree rough clustering algorithm, and forming a plurality of shortest bifurcation trees according to a rough clustering result; pruning and merging the shortest bifurcation trees by using a threshold pruning algorithm based on a rough clustering neighborhood information system to obtain a simplified shortest bifurcation tree; and calculating abnormality of a data object in the simplified shortest bifurcation tree by using an abnormal value detection algorithm for local multi-feature factors of balanced fusion data, and determining and removing an abnormal data value in the target data set according to the abnormality. Further provided are a data processing apparatus, a network device, and a storage medium.

Description

Data processing method, apparatus, network device, and storage medium
Cross Reference
This application is filed based on the Chinese patent application No. 202110678862.X, filed on June 18, 2021, and claims priority to that Chinese patent application, the entire content of which is incorporated herein by reference.
Technical Field
The present application relates to the technical field of communications, and in particular to a data processing method, apparatus, network device, and storage medium.
Background
When an enterprise makes decisions, analyzing the data first can make the decisions more scientific and accurate. However, on the one hand, with the development of information technology, enterprises generate more and more data, so data analysis for decision-making often has to deal with a large amount of data; on the other hand, most enterprises still rely on experience or on traditional data analysis methods, and when such methods are used to analyze large amounts of data to discover underlying laws or changes, the analysis is inefficient and the results are not accurate enough because of subjective differences, which affects the accuracy of decisions. In particular, if abnormal data values exist in the original data and are not eliminated during data analysis, irreversible deviations may appear in the analysis, seriously affecting the accuracy of the analysis results and leading to major decision-making mistakes.
Summary
Some embodiments of the present application provide a data processing method, apparatus, network device, and storage medium.
To solve the above technical problem, an embodiment of the present application provides a data processing method, including: obtaining a target data set, performing rough clustering on the target data set with the shortest forked tree rough clustering algorithm, and forming multiple shortest forked trees according to the rough clustering result; pruning and merging the shortest forked trees with a threshold pruning algorithm based on a rough clustering neighborhood information system to obtain simplified shortest forked trees; and calculating the abnormality degree of data objects in the simplified shortest forked trees with an outlier detection algorithm that balances and fuses local multi-feature factors of the data, and determining and eliminating abnormal data values in the target data set according to the abnormality degree.
An embodiment of the present application also provides a data processing device, including: a clustering module, configured to obtain a target data set, perform rough clustering on the target data set with the shortest forked tree rough clustering algorithm, and form multiple shortest forked trees according to the rough clustering result; a processing module, configured to prune and merge the shortest forked trees with a threshold pruning algorithm based on a rough clustering neighborhood information system to obtain simplified shortest forked trees; and a determination module, configured to calculate the abnormality degree of data objects in the simplified shortest forked trees with the outlier detection algorithm that balances and fuses local multi-feature factors of the data, and to determine and eliminate abnormal data values in the target data set according to the abnormality degree.
An embodiment of the present application also provides a network device, including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can execute the above data processing method.
An embodiment of the present application also provides a computer-readable storage medium storing a computer program, which implements the above data processing method when executed by a processor.
Brief Description of the Drawings
One or more embodiments are exemplified by the figures in the corresponding drawings; these exemplary descriptions do not constitute a limitation on the embodiments.
FIG. 1 is a schematic flowchart of the data processing method provided in the first embodiment of the present application;
FIG. 2 is a schematic diagram of the algorithm process of the shortest forked tree rough clustering algorithm in the data processing method provided in the first embodiment of the present application;
FIG. 3 is an example diagram of the search results of a first-generation node in the data processing method provided in the first embodiment of the present application;
FIG. 4 is a schematic diagram of the process of processing the shortest forked tree with the threshold pruning algorithm of the rough clustering neighborhood information system in the data processing method provided in the first embodiment of the present application;
FIG. 5 is a schematic flowchart of the outlier detection algorithm that balances and fuses local multi-feature factors of the data in the data processing method provided in the first embodiment of the present application;
FIG. 6 is a schematic diagram of the network mechanism of the improved sparse autoencoder in the data processing method provided in the first embodiment of the present application;
FIG. 7 is an example flowchart of the data processing method provided in the first embodiment of the present application;
FIG. 8 is a schematic diagram of the module structure of the data processing device provided in the second embodiment of the present application;
FIG. 9 is a schematic structural diagram of the network device provided in the third embodiment of the present application.
Detailed Description
To make the objectives, technical solutions, and advantages of the present application clearer, the embodiments of the present application are described in detail below with reference to the accompanying drawings. However, those of ordinary skill in the art can understand that many technical details are given in each embodiment so that readers can better understand the present application; even without these technical details and the various changes and modifications based on the following embodiments, the technical solutions claimed in the present application can still be realized.
The first embodiment of the present application relates to a data processing method: the shortest forked tree rough clustering algorithm is used to perform rough clustering on a target data set to form multiple shortest forked trees; a threshold pruning algorithm based on the rough clustering neighborhood information system then prunes and merges the shortest forked trees; finally, an outlier detection algorithm that balances and fuses local multi-feature factors of the data calculates the abnormality degree of the data objects in the shortest forked trees, and abnormal data values are determined and eliminated according to the abnormality degree of the data objects. Abnormal data values in the original data can thus be eliminated, improving the efficiency of data analysis and the accuracy of decision-making. Because an algorithm automatically analyzes the data of the target data set, the efficiency of data analysis can be improved; at the same time, the outlier detection algorithm introduces the local relative proximity into the standard local outlier factor to replace the local reachability density of data objects, adjusts the ratio of neighborhood dispersion to distance calculation to a form suitable for rough clustering, and introduces the coefficient of variation to characterize the dispersion within a class, so the abnormality degree of data objects can be analyzed accurately and quantitatively; abnormal data values in the original data (i.e., the target data set) can therefore be accurately determined and eliminated, improving the accuracy of analysis results and decisions.
It should be noted that the data processing method provided in the embodiments of the present application may be executed by a server, where the server may be implemented by a single server or by a server cluster composed of multiple servers; the server is taken as an example in the following description.
The specific flow of the data processing method provided in the embodiment of the present application is shown in FIG. 1 and includes the following steps:
S101: Obtain a target data set, perform rough clustering on the target data set with the shortest forked tree rough clustering algorithm, and form multiple shortest forked trees according to the rough clustering result.
The target data set may be real-time data or offline data, such as the offline data of an enterprise; when the target data set is real-time data, it refers to the data at a certain moment.
Specifically, S101 may include: determining the source nodes in the target data set, searching for the node nearest to each source node, and taking that nearest node as the first-generation node; then, starting with the first-generation node as the current parent node, cyclically searching for the descendant node set with the current parent node as the starting point and the adaptive node spacing as the neighborhood search radius; if a new node exists within the neighborhood search radius, taking the new node as the current parent node and continuing to search the descendant node set with the adaptive node spacing as the neighborhood search radius until no new node exists within the radius; ending the search and storing all nodes and node distances from the source node to the last-generation node; and forming the shortest forked tree from all nodes between the source node and the last-generation node. Here the node distance is the set of spacings between nodes of the same level and their descendant nodes, and the adaptive node spacing is Dist = arg min(Euclidean_dist(last-gener_i, next-gener_j)), where last-gener_i is the parent in two adjacent generations, next-gener_j is the child, Euclidean_dist() denotes the Euclidean distance, and i and j denote the node indexes of the parent and child respectively.
Please refer to FIG. 2, which is a schematic diagram of the algorithm process of the shortest forked tree rough clustering algorithm in the data processing method provided by the embodiment of the present application. A specific process is described below as an example:
1. The server collects the offline data of the enterprise as the target data set, assumes that all data objects in the target data set are outliers and identifies them as source nodes (the number of data objects in the offline data equals the assumed number of source nodes), and stores the relevant attributes (loca, value) of each source node, where loca is the location of the data object and value is the data value.
2. Globally search the descendant node sets. During the global search, the global search strategy mainly includes calculating the adaptive node spacing and determining adjacent nodes. Take searching the next two generations of node data sets from a source node as an example:
2-1. Starting from any source node x_i, traverse all data objects and take the node nearest to it as the first-generation node of the source node, i.e., x_i → x_i1, ensuring that each source node has only one first-generation node.
2-2. Compute the spacing |x_i, x_i1| between the source node and its first-generation node, where the spacing may be computed with a distance formula (e.g., the Euclidean distance); the first-generation node then contains three attributes (loca, value, |x_i, x_i1|). The two adjacent generations for which Dist is computed belong to the same level; the next level is screened by taking the current level's next-gener_j node as the center and Dist as the neighborhood search radius to search for the next level's last-gener_i, and the search continues in this way. Every node other than the source node contains three attributes, defined as x_j = (loca, value, Dist), where the Dist attribute value is the set of spacings between the current node and its descendant nodes at the same level, i.e., the adaptive node spacing.
2-3. Taking the first-generation node as the starting point of the next level and the spacing |x_i, x_i1| (i.e., the spacing between the source node and the first-generation node) as the neighborhood search radius, search for the set of next-generation child nodes of the first-generation node; the search result is shown in FIG. 3. The descendant node set of the first-generation node is not limited by the number of nodes, and all data objects within the neighborhood search radius belong to its child nodes, but the uniqueness principle must be followed: between two adjacent generations of the same level, next-gener_j can only be produced by searching from last-gener_i; the mapping may be one-to-one or one-to-many, but the data of the two generations cannot overlap, that is, last-gener_i → next-gener_j with an empty intersection between the two generations (the intersection condition is given as formula image PCTCN2022099638-appb-000001 in the original filing).
3. Search level by level according to the search strategy of 2-3 above, finally forming a shortest forked tree (the Shortest Forked Tree, SFT) that is identified as one group of rough clusters; it includes the source node and the data set of all its descendant nodes together with the corresponding set of node distances. The shortest forked tree formed by rough clustering includes two types of data: one is the source node and its searched set of descendant nodes; the other is the set of spacings between corresponding nodes of all adjacent generations that form the shortest forked tree.
Because outliers in the target data set have low surrounding data object density and large spacing within their neighborhood, the dispersion between a local outlier and its adjacent points is large. Assuming each independent outlier as a source node and using the distance between different levels (i.e., the adaptive node spacing) as the neighborhood search radius, its adjacent points are searched step by step to form a complete tree structure that is identified as a rough class, which achieves the purpose of dividing the data into different clusters.
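The following is a minimal Python sketch of the search just described, assuming numeric feature vectors and the Euclidean distance; the function and variable names (build_sft, radius, and so on) are illustrative and not taken from the patent, and the handling of the adaptive node spacing between generations is one plausible reading of the text above.

```python
# Hedged sketch of growing one shortest forked tree (SFT) from a source node.
import numpy as np

def euclidean(a, b):
    return float(np.linalg.norm(a - b))

def build_sft(data, source_idx):
    """Grow one SFT: nearest neighbour first, then generation-by-generation
    neighbourhood search with the adaptive node spacing as the search radius."""
    unassigned = set(range(len(data))) - {source_idx}
    tree_nodes = [source_idx]
    node_dists = []

    # the first-generation node is the single nearest neighbour of the source node
    first = min(unassigned, key=lambda j: euclidean(data[source_idx], data[j]))
    radius = euclidean(data[source_idx], data[first])   # |x_i, x_i1|
    unassigned.discard(first)
    tree_nodes.append(first)
    node_dists.append(radius)

    parents = [first]
    while parents and unassigned:
        children, child_dists = [], []
        for p in parents:
            # every unassigned object within the neighbourhood search radius becomes a child
            for j in sorted(unassigned):
                d = euclidean(data[p], data[j])
                if d <= radius:
                    unassigned.discard(j)      # uniqueness: generations never overlap
                    children.append(j)
                    child_dists.append(d)
        if not children:
            break                              # no new node within the radius: stop
        radius = min(child_dists)              # adaptive node spacing for the next level
        tree_nodes.extend(children)
        node_dists.extend(child_dists)
        parents = children
    return tree_nodes, node_dists

# toy usage: one tree grown from one assumed source node
points = np.array([[0.0, 0.0], [0.2, 0.1], [0.3, 0.2], [5.0, 5.0]])
print(build_sft(points, source_idx=0))
```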
S102: Prune and merge the shortest forked trees with a threshold pruning algorithm based on the rough clustering neighborhood information system to obtain the simplified shortest forked trees.
Please refer to FIG. 4, which is a schematic diagram of the process of processing the shortest forked tree with the threshold pruning algorithm of the rough clustering neighborhood information system in the data processing method provided by the embodiment of the present application.
Specifically, S102 may include: according to the attributes of each data object in the shortest forked trees, combining the branches containing shared nodes into one shortest forked tree structure, and pruning the branches that completely intersect in the shortest forked tree, to obtain the simplified shortest forked tree.
The specific steps are as follows:
1. Extract any shortest forked tree formed in S101 and prune its completely intersecting branches: assume that there are two different branches T_i and T_j with |T_i| ≥ |T_j|; the pruning condition that must be satisfied (given as formula image PCTCN2022099638-appb-000002 in the original filing) requires the two branches to completely intersect, i.e., T_j is entirely contained in T_i, in which case T_j is pruned and only T_i is kept.
2. Shared-node branch clustering: assume that there are two different branches T_1 and T_2 with |T_1| ≥ |T_2|; shared-node clustering can be performed if T_2 contains a node of T_1, in which case the two branches are merged into T_1.
In a specific example, after combining the branches containing shared nodes into one shortest forked tree structure and pruning the completely intersecting branches in the shortest forked tree to obtain the simplified shortest forked tree, the method further includes: according to the Dist attribute of each data object in the shortest forked tree, calculating the median and the mean of the sums of the distances of the data objects in the shortest forked tree, and pruning branches whose scores are less than or equal to the deviation threshold according to the deviation threshold formula, where the deviation threshold formula is DEV = mean + |mean - median|, DEV is the deviation threshold, mean is the average value, and median is the median; pruning branches whose scores are less than or equal to the deviation threshold according to the deviation threshold formula means pruning branches whose data objects' Dist values are less than or equal to the deviation threshold.
It should be understood that pruning branches whose scores are less than or equal to the deviation threshold according to the deviation threshold formula means pruning, according to that formula, the weak-weight branch clusters in the shortest forked tree whose scores are lower than the deviation threshold.
Pruning the weak-weight branch clusters whose scores are lower than the deviation threshold with the deviation threshold formula, i.e., pruning the data objects and completely intersecting branches in the roughly clustered shortest forked tree and merging the branches containing shared nodes, can further streamline the data structure of the shortest forked tree and facilitate the further processing of subsequent data.
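A minimal Python sketch of this pruning-and-merging step follows; it assumes each branch is represented as a set of node indices plus the Dist values of its data objects, and the helper names (prune_and_merge and so on) are illustrative rather than from the patent. The deviation threshold follows DEV = mean + |mean - median| as given above; for example, with made-up branch scores [2.0, 2.5, 3.0, 9.5], mean = 4.25, median = 2.75, DEV = 5.75, so only the branch scoring 9.5 survives.

```python
# Hedged sketch of S102: prune contained branches, merge shared-node branches,
# then drop weak-weight branches whose Dist-sum score is <= the deviation threshold.
import statistics

def prune_and_merge(branches):
    """branches: list of dicts like {"nodes": set of node indices, "dists": list of Dist values}."""
    # 1. prune completely intersecting branches: drop a branch fully contained in a larger one
    kept = []
    for b in sorted(branches, key=lambda br: len(br["nodes"]), reverse=True):
        if not any(b["nodes"] <= k["nodes"] for k in kept):
            kept.append({"nodes": set(b["nodes"]), "dists": list(b["dists"])})

    # 2. merge branches that share at least one node into the branch already kept
    merged = []
    for b in kept:
        for m in merged:
            if m["nodes"] & b["nodes"]:
                m["nodes"] |= b["nodes"]
                m["dists"] += b["dists"]
                break
        else:
            merged.append(b)

    # 3. threshold pruning: DEV = mean + |mean - median| over the branch scores
    scores = [sum(m["dists"]) for m in merged]
    mean, median = statistics.mean(scores), statistics.median(scores)
    dev = mean + abs(mean - median)
    return [m for m, s in zip(merged, scores) if s > dev]

# toy usage with made-up branches
example = [
    {"nodes": {0, 1, 2}, "dists": [1.0, 1.0]},
    {"nodes": {1, 2}, "dists": [1.0]},          # fully contained in the first branch -> pruned
    {"nodes": {2, 7}, "dists": [0.5]},          # shares node 2 -> merged into the first branch
    {"nodes": {8, 9}, "dists": [6.0, 3.5]},     # large Dist sum -> kept after thresholding
]
print(prune_and_merge(example))
```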
S103: Calculate the abnormality degree of the data objects in the simplified shortest forked tree with the outlier detection algorithm that balances and fuses local multi-feature factors of the data, and determine and eliminate abnormal data values in the target data set according to the abnormality degree.
In a specific example, S103 may include: numerically processing the data in the simplified shortest forked tree according to T_o-stand = T_o + |min(T_o)|, i.e., a data standardization step, where T_o denotes a simplified shortest forked tree branch and T_o-stand denotes the T_o branch after numerical processing; calculating the distance between the nodes of the same shortest forked tree according to N_dis(x) (the formula is given as image PCTCN2022099638-appb-000003 in the original filing), where N_dis(x) is the computed distance between the nodes of the shortest forked tree, x is the specified data object, x_i are the other data objects in that shortest forked tree class, K denotes the number of data objects in the class, and exp(1) denotes the constant with base e and exponent 1; and calculating the coefficient of variation of the data in each shortest forked tree according to the formulas given as images PCTCN2022099638-appb-000004, PCTCN2022099638-appb-000005, and PCTCN2022099638-appb-000006 in the original filing, where T_i denotes the sum of the distances of all nodes in any shortest forked tree cluster class, x_c denotes the distance of each node in the shortest forked tree corresponding to T_i, α denotes the number of nodes contained in the cluster class, β is the number of shortest forked trees, N_std(T_i) is the standard deviation of the shortest forked tree cluster class, N_mean(T_i) is the mean of the class, and N_cv(T_i) is the coefficient of variation. The local relative proximity of the data objects in the class is then calculated (formula image PCTCN2022099638-appb-000007), and from the local relative proximity the MDILAF is calculated (formula image PCTCN2022099638-appb-000008); MDILAF is taken as the abnormality degree of the data object, and abnormal data values in the target data set are determined and eliminated according to the abnormality degree, where LRP(x_i) is the local relative proximity among the data in the class other than x, N(x) is the shortest forked tree of the data object x, and |N(x)| is the sum of the distances of all the other data objects in the class.
For the above flow, please refer to FIG. 5, which is a schematic flowchart of the outlier detection algorithm that balances and fuses local multi-feature factors of the data in the data processing method provided by the embodiment of the present application.
Because the outlier detection algorithm that balances and fuses local multi-feature factors of the data introduces the local relative proximity (Local Relative Proximity, LRP) into the standard local outlier factor (Local Outlier Factor, LOF) to replace the local reachability density (Local Reachability Density, LRD) of data objects, adjusts the ratio of neighborhood dispersion to distance calculation to a form suitable for rough clustering, and introduces the coefficient of variation to characterize the dispersion within a class, the abnormality degree of data objects can be analyzed accurately and quantitatively and the data objects judged abnormal (i.e., abnormal data values) can be eliminated.
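The exact formulas for N_dis, the coefficient of variation, LRP, and MDILAF are provided only as images in the filing; the following Python sketch therefore only illustrates the per-cluster dispersion statistics the text relies on, assuming the conventional definition of the coefficient of variation (standard deviation divided by mean) over the node distances of one shortest forked tree. The function name cluster_dispersion is illustrative.

```python
# Hedged sketch: dispersion statistics of one simplified shortest-forked-tree cluster,
# assuming the conventional definition N_cv = N_std / N_mean over the node distances x_c.
# The patent's exact formulas are given only as images in the original filing.
import numpy as np

def cluster_dispersion(node_dists):
    """node_dists: the Dist values x_c of one shortest forked tree cluster class."""
    x = np.asarray(node_dists, dtype=float)
    n_mean = x.mean()                            # class mean
    n_std = x.std()                              # class standard deviation
    n_cv = n_std / n_mean if n_mean else 0.0     # coefficient of variation: within-class dispersion
    return n_mean, n_std, n_cv

# a larger coefficient of variation suggests stronger dispersion within the class
print(cluster_dispersion([0.40, 0.45, 0.50, 2.00]))
```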
In a specific example, after S103 the method further includes: using an improved sparse autoencoder to reduce the dimensionality of the target data set, where the improved sparse autoencoder uses a sparse rule operator in place of the KL relative entropy as the sparsity constraint term and uses the L2 norm as the regularization term.
Specifically, the sparse autoencoder imposes a sparsity constraint on the activity of the hidden-layer neurons,
Figure PCTCN2022099638-appb-000009
which represents the activation of hidden neuron j-ac of the autoencoder network given the input X. The average activation of hidden neuron j-ac of the sparse autoencoder is defined as:
Figure PCTCN2022099638-appb-000010
Here the index value j-ac denotes the position label of each neuron, H denotes the number of neurons in the input layer, and h is the index of each input-layer neuron.
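Given those definitions, a plausible reading of the figure above, assuming it follows the usual averaging form for sparse autoencoders, is:

\hat{\rho}_{j\text{-}ac} \;=\; \frac{1}{H}\sum_{h=1}^{H} a_{j\text{-}ac}\!\left(x_{h}\right)

This reconstruction is an assumption; the authoritative expression is the one in the referenced figure.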
The loss function of the original sparse autoencoder is generally expressed as the mean squared error, to which the KL divergence is added as a sparsity constraint. The specific formulas are as follows:
Figure PCTCN2022099638-appb-000011
Figure PCTCN2022099638-appb-000012
Figure PCTCN2022099638-appb-000013
In these formulas,
Figure PCTCN2022099638-appb-000014
is the penalty factor of the function and
Figure PCTCN2022099638-appb-000015
is the penalty term of the function. The update mechanism of the sparse autoencoder is:
Figure PCTCN2022099638-appb-000016
where f′(z_q) denotes the derivative of the output layer z of the neural network and q denotes the number of output-layer neurons. When the improved sparse autoencoder is used to reduce the dimensionality of the target data set, this may specifically include: constructing the following target loss function from the improved sparse autoencoder:
Figure PCTCN2022099638-appb-000017
Figure PCTCN2022099638-appb-000018
where λ_1 is the weight of the sparsity penalty term, λ_2 is the weight decay coefficient, S_2 denotes the number of hidden-layer neurons, W_s denotes the weight coefficients of all hidden-layer neurons of the neural network, b denotes the bias term of the neural network, s denotes the index of a hidden-layer neuron in the range [1, S_2], J(W,b) denotes the initial loss function term of the sparse autoencoder, J_sparse(W,b) denotes the target loss function of the improved sparse autoencoder, y denotes the true value, h_{w,h}(x) denotes the predicted value of the neural network with input x,
Figure PCTCN2022099638-appb-000019
denotes the hidden-layer (second-layer) increment updated after error back-propagation through the network,
Figure PCTCN2022099638-appb-000020
denotes the weight of each neuron between the hidden layers (the second and third layers), and
Figure PCTCN2022099638-appb-000021
denotes the increment of the hidden layer (third layer) of the neural network, where r denotes the index of each neuron in that layer; and performing dimensionality reduction on the target data set according to the target loss function.
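As a concrete illustration of the structure described above, and not a reproduction of the exact formulas in the referenced figures, the following Python sketch assumes that the "sparse rule operator" is the L1 norm of the hidden activations and combines it with the mean-squared reconstruction error J(W,b) and an L2 weight-decay term; the λ values and sample data are placeholders.

import numpy as np

def improved_sae_loss(y_true, y_pred, hidden_activations, weights,
                      lambda1=1e-3, lambda2=1e-4):
    # J(W,b): mean-squared reconstruction error between prediction and target
    j_wb = 0.5 * np.mean(np.sum((y_pred - y_true) ** 2, axis=1))
    # Sparsity constraint: L1 penalty on hidden activations (assumed reading
    # of the sparse rule operator that replaces the KL term)
    l1_sparsity = lambda1 * np.sum(np.abs(hidden_activations))
    # L2 regularization term over all hidden-layer weight matrices
    l2_decay = 0.5 * lambda2 * sum(np.sum(W ** 2) for W in weights)
    return j_wb + l1_sparsity + l2_decay

# Hypothetical usage with random data
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))
loss = improved_sae_loss(x, x + 0.1, rng.normal(size=(8, 2)), [rng.normal(size=(4, 2))])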
Specifically, according to the constructed target loss function, the parameter update mechanism of the neural network is changed to:
Figure PCTCN2022099638-appb-000022
Reference may be made to FIG. 6, which is a schematic diagram of the network mechanism of the improved sparse autoencoder of the data processing method provided by the embodiment of the present application.
For the above specific flow, reference may be made to FIG. 7, which is an example flow diagram of the data processing method provided by an embodiment of the present application.
Replacing the KL relative entropy with the sparse rule operator as the sparsity constraint term improves the coefficient performance of the algorithm; using the L2 norm as the regularization term balances the weights of the polynomial components and strengthens the sparse autoencoder's ability to prevent overfitting when processing data. In addition, using the improved sparse autoencoder to reduce the dimensionality of the data that has passed outlier detection reduces data redundancy and improves the conciseness and reliability of the data.
In the data processing method provided by the embodiments of the present application, the shortest-bifurcation-tree rough clustering algorithm is used to roughly cluster the target data set into multiple shortest bifurcation trees; the threshold pruning algorithm of the rough clustering neighborhood information system is then used to prune and merge the shortest bifurcation trees; finally, the outlier detection algorithm that balances and fuses local multi-feature factors of the data is used to calculate the abnormality degree of the data objects in the shortest bifurcation trees, and abnormal data values are determined and eliminated according to that abnormality degree. Because the data of the target data set are analyzed automatically by these algorithms, the efficiency of data analysis can be improved. Moreover, because the outlier detection algorithm introduces the local relative proximity to replace the local reachability density in the standard local outlier factor, adjusts the ratio between the neighborhood dispersion and the distance calculation to a form suited to rough clustering, and introduces the coefficient of variation to characterize the intra-class dispersion, the abnormality degree of data objects can be analyzed quantitatively and accurately, so that abnormal data values in the original data (i.e., the target data set) are determined and eliminated according to the abnormality degree, improving the accuracy of the analysis results and of the decisions made from them.
The division of the above methods into steps is only for clarity of description; in implementation, steps may be combined into one step, or a step may be split into multiple steps, and all such variants fall within the protection scope of this patent as long as they contain the same logical relationship. Adding insignificant modifications to the algorithm or flow, or introducing insignificant designs, without changing the core design of the algorithm and flow, also falls within the protection scope of this patent.
The second embodiment of the present application relates to a data processing apparatus 200 which, as shown in FIG. 8, comprises a clustering module 201, a processing module 202, and a determination module 203. The functions of each module are described in detail as follows:
The clustering module 201 is configured to obtain a target data set, perform rough clustering on the target data set by using the shortest-bifurcation-tree rough clustering algorithm, and form multiple shortest bifurcation trees according to the rough clustering result.
The processing module 202 is configured to prune and merge the shortest bifurcation trees by using the threshold pruning algorithm based on the rough clustering neighborhood information system, to obtain simplified shortest bifurcation trees.
The determination module 203 is configured to calculate the abnormality degree of the data objects in the simplified shortest bifurcation trees by using the outlier detection algorithm that balances and fuses local multi-feature factors of the data, and to determine and eliminate the abnormal data values in the target data set according to the abnormality degree.
Further, the data processing apparatus 200 provided by the embodiment of the present application also includes a dimensionality reduction module, where the dimensionality reduction module is configured to reduce the dimensionality of the target data set by using an improved sparse autoencoder, and the improved sparse autoencoder uses a sparse rule operator in place of the KL relative entropy as the sparsity constraint term and uses the L2 norm as the regularization term.
Further, the dimensionality reduction module is also configured to:
construct the following target loss function from the improved sparse autoencoder:
Figure PCTCN2022099638-appb-000023
where λ_1 is the weight of the sparsity penalty term, λ_2 is the weight decay coefficient, S_2 denotes the number of hidden-layer neurons, W_s denotes the weight coefficients of all hidden-layer neurons of the neural network, b denotes the bias term of the neural network, s denotes the index of a hidden-layer neuron in the range [1, S_2], J(W,b) denotes the initial loss function term of the sparse autoencoder, and J_sparse(W,b) denotes the target loss function of the improved sparse autoencoder; and
perform dimensionality reduction on the target data set according to the target loss function.
Further, the determination module 203 is specifically configured to:
standardize the data in the simplified shortest bifurcation tree according to T_o-stand = T_o + |min(T_o)|;
calculate, according to
Figure PCTCN2022099638-appb-000024
the distance between the nodes of the same shortest bifurcation tree, where N_dis(x) is the computed inter-node distance for the shortest bifurcation tree, x is the specified data object, x_i denotes the other data objects in the shortest bifurcation tree class, K denotes the number of data objects in that class, and exp(1) denotes the constant e raised to the power 1;
calculate the coefficient of variation of the data in the shortest bifurcation tree according to the following formulas:
Figure PCTCN2022099638-appb-000025
Figure PCTCN2022099638-appb-000026
where T_i denotes the sum of the node distances in any shortest-bifurcation-tree cluster, x_c denotes each node distance in the shortest bifurcation tree corresponding to T_i, α denotes the number of nodes contained in that cluster, β is the number of shortest bifurcation trees, N_std(T_i) is the standard deviation of the cluster, N_mean(T_i) is the cluster mean, and N_cv(T_i) is the coefficient of variation;
compute, according to
Figure PCTCN2022099638-appb-000027
the local relative proximity of the data objects in the class; and
compute, from the local relative proximity,
Figure PCTCN2022099638-appb-000028
take the MDILAF as the abnormality degree of the data object, and determine and eliminate the abnormal data values in the target data set according to the abnormality degree, where LRP(x_i) is the local relative proximity among the data in the class other than x, N(x) is the shortest bifurcation tree of data object x, and |N(x)| is the sum of the distances of all the other data objects in the class.
Further, the clustering module 201 is specifically configured to:
determine the source node in the target data set;
search for the node nearest to the source node and take the node closest to the source node as the first-generation node;
starting with the first-generation node as the current parent node, cyclically search for the descendant node set using the current parent node as the starting point and the adaptive node spacing as the neighborhood search radius; if a new node exists within the neighborhood search radius, take the new node as the current parent node and continue to search for the descendant node set with the adaptive node spacing as the neighborhood search radius until no new node exists within the neighborhood search radius; then end the search and store all nodes and node distances from the source node to the last-generation node, and form the shortest bifurcation tree from all nodes between the source node and the last-generation node, where the node distance is the set of spacings between nodes of the same level and their descendant nodes, and the adaptive node spacing is Dist = arg min(Euclidean_dist(last-gener_i, next-gener_j)), with last-gener_i being the parent of two adjacent generations and next-gener_j being the child of two adjacent generations (see the sketch after the next item); and
according to the attributes of each data object in the shortest bifurcation tree, combine the branches containing shared nodes into one shortest-bifurcation-tree structure and prune the completely intersecting branches of the shortest bifurcation tree to obtain the simplified shortest bifurcation tree.
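A minimal Python sketch of the neighborhood search described above is given below. It assumes Euclidean distances between points and takes the adaptive node spacing at each generation to be the minimum parent-to-candidate distance; the function name, tie handling, and the returned values are illustrative assumptions rather than the authoritative procedure.

import numpy as np

def grow_shortest_bifurcation_tree(points, source_idx):
    # Start from the source node, take its nearest point as the
    # first-generation node, then repeatedly search for descendants within
    # a radius equal to the adaptive node spacing.
    pts = np.asarray(points, dtype=float)
    unvisited = set(range(len(pts))) - {source_idx}
    dists = {j: np.linalg.norm(pts[source_idx] - pts[j]) for j in unvisited}
    first = min(dists, key=dists.get)            # first-generation node
    tree_nodes, node_dists = [source_idx, first], [dists[first]]
    unvisited.discard(first)
    parents = [first]
    while parents and unvisited:
        pairs = [(p, j, np.linalg.norm(pts[p] - pts[j]))
                 for p in parents for j in unvisited]
        radius = min(d for _, _, d in pairs)     # adaptive node spacing
        children = {j for _, j, d in pairs if d <= radius}
        for j in children:
            node_dists.append(min(d for _, jj, d in pairs if jj == j))
            tree_nodes.append(j)
        unvisited -= children
        parents = list(children)
    return tree_nodes, node_dists

# Hypothetical usage
nodes, dists = grow_shortest_bifurcation_tree([[0, 0], [0.1, 0], [1, 1], [5, 5]], 0)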
Further, the processing module 202 is also configured to:
calculate, according to the Dist attribute of each data object in the shortest bifurcation tree, the median and the mean of the sums of the distances of the data objects in the shortest bifurcation tree, and prune the branches whose score is less than or equal to the deviation threshold according to the deviation threshold formula DEV = mean + |mean - median|, where DEV is the deviation threshold, mean is the mean value, and median is the median; pruning the branches whose score is less than or equal to the deviation threshold means pruning the branches whose data-object Dist values are less than or equal to the deviation threshold (a sketch follows below).
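As a sketch only, assuming each branch is summarized by the sum of its data objects' Dist values, the deviation-threshold pruning described above might look as follows; the input mapping and sample values are hypothetical.

import numpy as np

def deviation_threshold_prune(branch_dist_sums):
    # DEV = mean + |mean - median|; keep only branches strictly above DEV
    values = np.array(list(branch_dist_sums.values()), dtype=float)
    dev = values.mean() + abs(values.mean() - np.median(values))
    return {branch: s for branch, s in branch_dist_sums.items() if s > dev}

# Hypothetical usage
kept = deviation_threshold_prune({"b1": 3.2, "b2": 0.4, "b3": 5.1, "b4": 0.9})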
It is readily seen that this embodiment is an apparatus embodiment corresponding to the first embodiment, and the two embodiments can be implemented in cooperation with each other. The related technical details mentioned in the first embodiment remain valid in this embodiment and, to reduce repetition, are not repeated here. Correspondingly, the related technical details mentioned in this embodiment can also be applied in the first embodiment.
It is worth mentioning that the modules involved in this embodiment are all logical modules. In practical applications, a logical unit may be a physical unit, a part of a physical unit, or a combination of multiple physical units. In addition, in order to highlight the innovative part of the present application, units that are not closely related to solving the technical problem proposed by the present application are not introduced in this embodiment, but this does not mean that no other units exist in this embodiment.
The third embodiment of the present application relates to a network device which, as shown in FIG. 9, includes at least one processor 301 and a memory 302 communicatively connected to the at least one processor 301; the memory 302 stores instructions executable by the at least one processor 301, and the instructions are executed by the at least one processor 301 so that the at least one processor 301 can execute the data processing method described above.
The memory 302 and the processor 301 are connected by a bus. The bus may include any number of interconnected buses and bridges and links together one or more processors 301 and the various circuits of the memory 302. The bus may also connect various other circuits, such as peripheral devices, voltage regulators, and power management circuits, which are well known in the art and are therefore not further described herein. A bus interface provides an interface between the bus and a transceiver. The transceiver may be a single element or multiple elements, such as multiple receivers and transmitters, providing a unit for communicating with various other apparatuses over a transmission medium. Data processed by the processor 301 is transmitted over a wireless medium through an antenna; further, the antenna also receives data and transmits the data to the processor 301.
The processor 301 is responsible for managing the bus and general processing, and may also provide various functions, including timing, peripheral interfaces, voltage regulation, power management, and other control functions. The memory 302 may be used to store data used by the processor 301 when performing operations.
The fourth embodiment of the present application relates to a computer-readable storage medium storing a computer program. When the computer program is executed by a processor, the above method embodiments are implemented.
That is, those skilled in the art can understand that all or some of the steps of the methods in the above embodiments can be completed by instructing the relevant hardware through a program; the program is stored in a storage medium and includes several instructions for causing a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or some of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
Those of ordinary skill in the art can understand that the above embodiments are specific embodiments for implementing the present application, and that in practical applications various changes may be made to them in form and detail without departing from the spirit and scope of the present application.

Claims (10)

  1. A data processing method, characterized in that it comprises:
    obtaining a target data set, performing rough clustering on the target data set by using a shortest-bifurcation-tree rough clustering algorithm, and forming multiple shortest bifurcation trees according to a rough clustering result;
    pruning and merging the shortest bifurcation trees by using a threshold pruning algorithm based on a rough clustering neighborhood information system, to obtain simplified shortest bifurcation trees; and
    calculating an abnormality degree of data objects in the simplified shortest bifurcation trees by using an outlier detection algorithm that balances and fuses local multi-feature factors of the data, and determining and eliminating abnormal data values in the target data set according to the abnormality degree.
  2. The data processing method according to claim 1, characterized in that, after the calculating the abnormality degree of the data objects in the simplified shortest bifurcation trees by using the outlier detection algorithm that balances and fuses local multi-feature factors of the data and the determining and eliminating the abnormal data values in the target data set according to the abnormality degree of the data objects, the method further comprises:
    reducing the dimensionality of the target data set by using an improved sparse autoencoder, wherein the improved sparse autoencoder uses a sparse rule operator in place of KL relative entropy as a sparsity constraint term and uses an L2 norm as a regularization term.
  3. The data processing method according to claim 2, characterized in that the reducing the dimensionality of the target data set by using the improved sparse autoencoder comprises:
    constructing the following target loss function from the improved sparse autoencoder:
    Figure PCTCN2022099638-appb-100001
    wherein λ_1 is a weight of a sparsity penalty term, λ_2 is a weight decay coefficient, S_2 denotes the number of hidden-layer neurons, W_s denotes weight coefficients of all hidden-layer neurons of the neural network, b denotes a bias term of the neural network, s denotes an index of a hidden-layer neuron in the range [1, S_2], J(W,b) denotes an initial loss function term of the sparse autoencoder, and J_sparse(W,b) denotes the target loss function of the improved sparse autoencoder; and
    performing dimensionality reduction on the target data set according to the target loss function.
  4. The data processing method according to any one of claims 1 to 3, characterized in that the calculating the abnormality degree of the data objects in the simplified shortest bifurcation trees by using the outlier detection algorithm that balances and fuses local multi-feature factors of the data, and the determining and eliminating the abnormal data values in the target data set according to the abnormality degree of the data objects, comprise:
    standardizing the data in the simplified shortest bifurcation tree according to T_o-stand = T_o + |min(T_o)|;
    calculating, according to
    Figure PCTCN2022099638-appb-100002
    a distance between nodes of a same shortest bifurcation tree, wherein N_dis(x) is the computed inter-node distance for the shortest bifurcation tree, x is a specified data object, x_i denotes other data objects in the shortest bifurcation tree class, K denotes the number of data objects in the shortest bifurcation tree class, and exp(1) denotes the constant e raised to the power 1;
    calculating a coefficient of variation of the data in the shortest bifurcation tree according to the following formulas:
    Figure PCTCN2022099638-appb-100003
    Figure PCTCN2022099638-appb-100004
    Figure PCTCN2022099638-appb-100005
    wherein T_i denotes a sum of node distances in any shortest-bifurcation-tree cluster, x_c denotes each node distance in the shortest bifurcation tree corresponding to T_i, α denotes the number of nodes contained in the cluster, β is the number of shortest bifurcation trees, N_std(T_i) is a standard deviation of the cluster, N_mean(T_i) is a mean of the cluster, and N_cv(T_i) is the coefficient of variation;
    computing, according to
    Figure PCTCN2022099638-appb-100006
    a local relative proximity of the data objects in the class; and
    computing, from the local relative proximity,
    Figure PCTCN2022099638-appb-100007
    taking the MDILAF as the abnormality degree of the data object, and determining and eliminating the abnormal data values in the target data set according to the abnormality degree, wherein LRP(x_i) is the local relative proximity among the data in the class other than x, N(x) is the shortest bifurcation tree of data object x, and |N(x)| is the sum of the distances of all the other data objects in the class.
  5. The data processing method according to any one of claims 1 to 4, characterized in that the performing rough clustering on the target data set by using the shortest-bifurcation-tree rough clustering algorithm and the forming multiple shortest bifurcation trees according to the rough clustering result comprise:
    determining a source node in the target data set;
    searching for the node nearest to the source node, and taking the node closest to the source node as a first-generation node; and
    starting with the first-generation node as a current parent node, cyclically searching for a descendant node set using the current parent node as a starting point and an adaptive node spacing as a neighborhood search radius; if a new node exists within the neighborhood search radius, taking the new node as the current parent node and continuing to search for the descendant node set with the adaptive node spacing as the neighborhood search radius until no new node exists within the neighborhood search radius; then ending the search and storing all nodes and node distances from the source node to a last-generation node, and forming the shortest bifurcation tree from all nodes between the source node and the last-generation node, wherein the node distance is a set of spacings between nodes of a same level and their descendant nodes, and the adaptive node spacing is Dist = arg min(Euclidean_dist(last-gener_i, next-gener_j)), where last-gener_i is the parent of two adjacent generations and next-gener_j is the child of two adjacent generations.
  6. The data processing method according to any one of claims 1 to 5, characterized in that the pruning and merging the shortest bifurcation trees by using the threshold pruning algorithm based on the rough clustering neighborhood information system, to obtain the simplified shortest bifurcation tree, comprises:
    combining, according to attributes of each data object in the shortest bifurcation tree, branches containing shared nodes into one shortest-bifurcation-tree structure, and pruning completely intersecting branches of the shortest bifurcation tree to obtain the simplified shortest bifurcation tree.
  7. The data processing method according to claim 6, characterized in that, after the combining the branches containing shared nodes into one shortest-bifurcation-tree structure and the pruning the completely intersecting branches of the shortest bifurcation tree, the method further comprises:
    calculating, according to a Dist attribute of each data object in the shortest bifurcation tree, a median and a mean of sums of the distances of the data objects in the shortest bifurcation tree, and pruning branches whose score is less than or equal to a deviation threshold according to a deviation threshold formula, wherein the deviation threshold formula is DEV = mean + |mean - median|, DEV is the deviation threshold, mean is the mean value, and median is the median.
  8. A data processing apparatus, characterized in that it comprises:
    a clustering module, configured to obtain a target data set, perform rough clustering on the target data set by using a shortest-bifurcation-tree rough clustering algorithm, and form multiple shortest bifurcation trees according to a rough clustering result;
    a processing module, configured to prune and merge the shortest bifurcation trees by using a threshold pruning algorithm based on a rough clustering neighborhood information system, to obtain simplified shortest bifurcation trees; and
    a determination module, configured to calculate an abnormality degree of data objects in the simplified shortest bifurcation trees by using an outlier detection algorithm that balances and fuses local multi-feature factors of the data, and to determine and eliminate abnormal data values in the target data set according to the abnormality degree.
  9. A network device, characterized in that it comprises:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can execute the data processing method according to any one of claims 1 to 7.
  10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the data processing method according to any one of claims 1 to 7.
PCT/CN2022/099638 2021-06-18 2022-06-17 Data processing method and apparatus, network device, and storage medium WO2022262869A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110678862.XA CN113420804A (en) 2021-06-18 2021-06-18 Data processing method, device, network equipment and storage medium
CN202110678862.X 2021-06-18

Publications (1)

Publication Number Publication Date
WO2022262869A1 true WO2022262869A1 (en) 2022-12-22

Family

ID=77789079

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/099638 WO2022262869A1 (en) 2021-06-18 2022-06-17 Data processing method and apparatus, network device, and storage medium

Country Status (2)

Country Link
CN (1) CN113420804A (en)
WO (1) WO2022262869A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117272216A (en) * 2023-11-22 2023-12-22 中国建材检验认证集团湖南有限公司 Data analysis method for automatic flow monitoring station and manual water gauge observation station
CN117370331A (en) * 2023-12-08 2024-01-09 河北建投水务投资有限公司 Method and device for cleaning total water consumption data of cell, terminal equipment and storage medium

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113420804A (en) * 2021-06-18 2021-09-21 工业互联网创新中心(上海)有限公司 Data processing method, device, network equipment and storage medium
CN114742178B (en) * 2022-06-10 2022-11-08 航天亮丽电气有限责任公司 Method for non-invasive pressure plate state monitoring through MEMS six-axis sensor
CN115202661B (en) * 2022-09-15 2022-11-29 深圳大学 Hybrid generation method with hierarchical structure layout and related equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111444247A (en) * 2020-06-17 2020-07-24 北京必示科技有限公司 KPI (Key performance indicator) -based root cause positioning method and device and storage medium
CN111985837A (en) * 2020-08-31 2020-11-24 平安医疗健康管理股份有限公司 Risk analysis method, device and equipment based on hierarchical clustering and storage medium
CN112800148A (en) * 2021-02-04 2021-05-14 国网福建省电力有限公司 Scattered pollutant enterprise research and judgment method based on clustering feature tree and outlier quantization
CN113420804A (en) * 2021-06-18 2021-09-21 工业互联网创新中心(上海)有限公司 Data processing method, device, network equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6871201B2 (en) * 2001-07-31 2005-03-22 International Business Machines Corporation Method for building space-splitting decision tree

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111444247A (en) * 2020-06-17 2020-07-24 北京必示科技有限公司 KPI (Key performance indicator) -based root cause positioning method and device and storage medium
CN111985837A (en) * 2020-08-31 2020-11-24 平安医疗健康管理股份有限公司 Risk analysis method, device and equipment based on hierarchical clustering and storage medium
CN112800148A (en) * 2021-02-04 2021-05-14 国网福建省电力有限公司 Scattered pollutant enterprise research and judgment method based on clustering feature tree and outlier quantization
CN113420804A (en) * 2021-06-18 2021-09-21 工业互联网创新中心(上海)有限公司 Data processing method, device, network equipment and storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117272216A (en) * 2023-11-22 2023-12-22 中国建材检验认证集团湖南有限公司 Data analysis method for automatic flow monitoring station and manual water gauge observation station
CN117272216B (en) * 2023-11-22 2024-02-09 中国建材检验认证集团湖南有限公司 Data analysis method for automatic flow monitoring station and manual water gauge observation station
CN117370331A (en) * 2023-12-08 2024-01-09 河北建投水务投资有限公司 Method and device for cleaning total water consumption data of cell, terminal equipment and storage medium
CN117370331B (en) * 2023-12-08 2024-02-20 河北建投水务投资有限公司 Method and device for cleaning total water consumption data of cell, terminal equipment and storage medium

Also Published As

Publication number Publication date
CN113420804A (en) 2021-09-21

Similar Documents

Publication Publication Date Title
WO2022262869A1 (en) Data processing method and apparatus, network device, and storage medium
US11210144B2 (en) Systems and methods for hyperparameter tuning
US11449529B2 (en) Path generation and selection tool for database objects
US9684874B2 (en) Parallel decision or regression tree growing
WO2018205881A1 (en) Estimating the number of samples satisfying a query
CN108804473B (en) Data query method, device and database system
WO2018107128A9 (en) Systems and methods for automating data science machine learning analytical workflows
US20050278139A1 (en) Automatic match tuning
AU2017246552A1 (en) Self-service classification system
US20030208284A1 (en) Modular architecture for optimizing a configuration of a computer system
CN113110866B (en) Evaluation method and device for database change script
US10417580B2 (en) Iterative refinement of pathways correlated with outcomes
CN112328798A (en) Text classification method and device
Liu et al. Correlated aggregation operators for simplified neutrosophic set and their application in multi-attribute group decision making
US10884865B2 (en) Identifying redundant nodes in a knowledge graph data structure
CN111125199B (en) Database access method and device and electronic equipment
Yan et al. A clustering algorithm for multi-modal heterogeneous big data with abnormal data
US11113348B2 (en) Device, system, and method for determining content relevance through ranked indexes
US11741101B2 (en) Estimating execution time for batch queries
CN106789163A (en) A kind of network equipment power information monitoring method, device and system
US20220147515A1 (en) Systems, methods, and program products for providing investment expertise using a financial ontology framework
Karegar et al. Data-mining by probability-based patterns
CN115996169A (en) Network fault analysis method and device, electronic equipment and storage medium
US20220188308A1 (en) Selecting access flow path in complex queries
CN112365014A (en) GA-BP-CBR-based industrial equipment fault diagnosis system and method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22824341

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE