WO2022262869A1 - Data processing method and apparatus, network device, and storage medium - Google Patents

Data processing method and apparatus, network device, and storage medium Download PDF

Info

Publication number
WO2022262869A1
WO2022262869A1 · PCT/CN2022/099638 · CN2022099638W
Authority
WO
WIPO (PCT)
Prior art keywords
shortest
tree
data
node
forked
Prior art date
Application number
PCT/CN2022/099638
Other languages
French (fr)
Chinese (zh)
Inventor
郑忠斌
王朝栋
彭新
Original Assignee
工业互联网创新中心(上海)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 工业互联网创新中心(上海)有限公司
Publication of WO2022262869A1

Links

Images

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/23: Clustering techniques
    • G06F 18/232: Non-hierarchical techniques
    • G06F 18/2323: Non-hierarchical techniques based on graph theory, e.g. minimum spanning trees [MST] or graph cuts
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/24: Classification techniques
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/25: Fusion techniques
    • G06F 18/253: Fusion techniques of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods

Definitions

  • the present application relates to the technical field of communications, and in particular to a data processing method, device, network equipment, and storage medium.
  • Some embodiments of the present application provide a data processing method, device, network device, and storage medium.
  • The embodiment of the present application provides a data processing method, including: obtaining a target data set, performing rough clustering on the target data set with the shortest forked tree rough clustering algorithm, and forming multiple shortest forked trees according to the rough clustering result; pruning and merging the shortest forked trees with a threshold pruning algorithm based on a rough clustering neighborhood information system to obtain simplified shortest forked trees; and calculating the abnormality degree of data objects in the simplified shortest forked trees with an outlier detection algorithm that balances and fuses local multi-feature factors of the data, and determining and eliminating abnormal data values in the target data set according to the abnormality degree.
  • The embodiment of the present application also provides a data processing device, including: a clustering module, configured to obtain a target data set, perform rough clustering on it with the shortest forked tree rough clustering algorithm, and form multiple shortest forked trees according to the rough clustering result; a processing module, configured to prune and merge the shortest forked trees with a threshold pruning algorithm based on a rough clustering neighborhood information system to obtain simplified shortest forked trees; and a determination module, configured to calculate the abnormality degree of data objects in the simplified shortest forked trees with the outlier detection algorithm that balances and fuses local multi-feature factors of the data, and to determine and eliminate abnormal data values in the target data set according to the abnormality degree.
  • The embodiment of the present application also provides a network device, including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can execute the above data processing method.
  • Embodiments of the present application also provide a computer-readable storage medium storing a computer program, which implements the above data processing method when executed by a processor.
  • FIG. 1 is a schematic flowchart of the data processing method provided in the first embodiment of the present application;
  • FIG. 2 is a schematic diagram of the algorithm process of the shortest forked tree rough clustering algorithm in the data processing method provided in the first embodiment of the present application;
  • FIG. 3 is an example diagram of the search results of a first-generation node in the data processing method provided in the first embodiment of the present application;
  • FIG. 4 is a schematic diagram of the process of processing the shortest forked tree with the threshold pruning algorithm of the rough clustering neighborhood information system in the data processing method provided in the first embodiment of the present application;
  • FIG. 5 is a schematic flowchart of the outlier detection algorithm that balances and fuses local multi-feature factors of the data in the data processing method provided in the first embodiment of the present application;
  • FIG. 6 is a schematic diagram of the network mechanism of the improved sparse autoencoder in the data processing method provided in the first embodiment of the present application;
  • FIG. 7 is an example flowchart of the data processing method provided in the first embodiment of the present application;
  • FIG. 8 is a schematic diagram of the module structure of the data processing device provided in the second embodiment of the present application;
  • FIG. 9 is a schematic structural diagram of the network device provided in the third embodiment of the present application.
  • The first embodiment of the present application relates to a data processing method: the shortest forked tree rough clustering algorithm is used to perform rough clustering on a target data set to form multiple shortest forked trees; a threshold pruning algorithm based on the rough clustering neighborhood information system then prunes and merges the shortest forked trees; finally, an outlier detection algorithm that balances and fuses local multi-feature factors of the data calculates the abnormality degree of the data objects in the shortest forked trees, and abnormal data values are determined and eliminated according to the abnormality degree of the data objects.
  • Abnormal data values in the original data can be eliminated to improve the efficiency of data analysis and the accuracy of decision-making.
  • Because an algorithm automatically analyzes the data of the target data set, the efficiency of data analysis can be improved. At the same time, the outlier detection algorithm that balances and fuses local multi-feature factors of the data introduces the local relative proximity into the standard local outlier factor to replace the local reachability density of data objects, adjusts the ratio of neighborhood dispersion to distance calculation to a form suitable for rough clustering, and introduces the coefficient of variation to characterize the dispersion within a class. The abnormality degree of data objects can therefore be analyzed accurately and quantitatively, so abnormal data values in the original data (i.e., the target data set) can be accurately determined and eliminated, improving the accuracy of analysis results and decisions.
  • the execution subject of the data processing method provided in the embodiments of the present application may be a server, wherein the server may be implemented by a single server or a server cluster composed of multiple servers, and the following uses the server as an example for illustration.
  • S101 Obtain the target data set, perform rough clustering on the target data set by using the shortest fork tree rough clustering algorithm, and form multiple shortest fork trees according to the rough clustering results.
  • The target data set may be real-time data or offline data, such as the offline data of an enterprise; when the target data set is real-time data, it refers to the data at a certain moment.
  • S101 may include: determining the source nodes in the target data set, searching for the node nearest to each source node, and taking that nearest node as the first-generation node; then, starting with the first-generation node as the current parent node, cyclically searching for the descendant node set with the current parent node as the starting point and the adaptive node spacing as the neighborhood search radius; if a new node exists within the neighborhood search radius, taking the new node as the current parent node and continuing to search the descendant node set with the adaptive node spacing as the neighborhood search radius until no new node exists within the radius; ending the search and storing all nodes and node distances from the source node to the last-generation node; and forming the shortest forked tree from all nodes between the source node and the last-generation node, where the node distance is the set of spacings between nodes of the same level and their descendant nodes, and the adaptive node spacing is Dist = arg min(Euclidean_dist(last-gener_i, next-gener_j)), in which last-gener_i is the parent in two adjacent generations, next-gener_j is the child, Euclidean_dist() denotes the Euclidean distance, and i and j denote the node indexes of the parent and child respectively.
  • FIG. 2 is a schematic diagram of the algorithm process of the shortest bifurcated tree rough clustering algorithm in the data processing method provided by the embodiment of the present application.
  • the following uses a specific process as an example to illustrate:
  • 1. The server collects the offline data of the enterprise as the target data set, assumes that all data objects in the target data set are outliers and identifies them as source nodes (the number of data objects in the offline data equals the assumed number of source nodes), and stores the relevant attributes (loca, value) of each source node, where loca is the location of the data object and value is the data value.
  • 2. Globally search the descendant node sets; the global search strategy mainly includes calculating the adaptive node spacing and determining adjacent nodes. Take searching the next two generations of node data sets from a source node as an example:
  • The first-generation node contains three attributes (loca, value, |x_i, x_i1|). Taking the first-generation node as the starting point of the next level and the spacing |x_i, x_i1| as the neighborhood search radius, the child node set of the first-generation node is searched; the search result is shown in FIG. 3. The descendant node set of the first-generation node is not limited by the number of nodes, and all data objects within the neighborhood search radius belong to its child nodes, but the uniqueness principle must be followed: between two adjacent generations of the same level, next-gener_j can only be produced by searching from last-gener_i; the mapping may be one-to-one or one-to-many, but the data of the two generations cannot overlap, that is, last-gener_i → next-gener_j with an empty intersection between the two generations.
  • The shortest forked tree formed by rough clustering includes two types of data: one is the source node and its searched set of descendant nodes; the other is the set of spacings between corresponding nodes of all adjacent generations that form the shortest forked tree.
  • Because outliers in the target data set have low surrounding data object density and large spacing within their neighborhood, the dispersion between a local outlier and its adjacent points is large. Assuming each independent outlier as a source node and using the distance between different levels (i.e., the adaptive node spacing) as the neighborhood search radius, its adjacent points are searched step by step to form a complete tree structure that is identified as a rough class, which achieves the purpose of dividing the data into different clusters.
  • S102 Use a threshold pruning algorithm based on the rough clustering neighborhood information system to prune and merge the shortest forked tree to obtain a simplified shortest forked tree.
  • FIG. 4 is a schematic diagram of the process of processing the shortest forked tree by using the threshold pruning algorithm of the rough clustering neighborhood information system in the data processing method provided by the embodiment of the present application.
  • S102 may include: according to the attributes of each data object in the shortest forked trees, combining the branches containing shared nodes into one shortest forked tree structure, and pruning the branches that completely intersect in the shortest forked tree, to obtain the simplified shortest forked tree.
  • Pruning branches whose scores are less than or equal to the deviation threshold means: pruning branches whose data objects' Dist values are less than or equal to the deviation threshold.
  • Pruning branches whose scores are less than or equal to the deviation threshold according to the deviation threshold formula refers to pruning, according to that formula, the weak-weight branch clusters in the shortest forked tree whose scores are lower than the deviation threshold.
  • S103 Calculate the abnormality degree of the data object in the streamlined shortest bifurcation tree by using the outlier detection algorithm of the local multi-characteristic factors of the balanced fusion data, and determine and eliminate the abnormal data value in the target data set according to the abnormality degree.
  • S103 may include: numerically processing the data in the simplified shortest forked tree, i.e., a data standardization step, where T_o denotes a simplified shortest forked tree branch and T_o-stand denotes the T_o branch after numerical processing; calculating the distance between the nodes of the same shortest forked tree according to N_dis(x), where N_dis(x) is the computed distance between the nodes of the shortest forked tree, x is the specified data object, x_i are the other data objects in that shortest forked tree class, K denotes the number of data objects in the class, and exp(1) denotes the constant with base e and exponent 1; and calculating the coefficient of variation of the data in each shortest forked tree, where T_i denotes the sum of the distances of all nodes in any shortest forked tree cluster class, x_c denotes the distance of each node in the shortest forked tree corresponding to T_i, α denotes the number of nodes contained in the cluster class, β is the number of shortest forked trees, N_std(T_i) is the standard deviation of the shortest forked tree cluster class, N_mean(T_i) is the mean of the class, and N_cv(T_i) is the coefficient of variation.
  • FIG. 5 is a schematic flowchart of the outlier detection algorithm that balances and fuses local multi-feature factors of the data in the data processing method provided by the embodiment of the present application.
  • In this outlier detection algorithm, the local relative proximity (Local Relative Proximity, LRP) is introduced into the standard local outlier factor (Local Outlier Factor, LOF) to replace the local reachability density (Local Reachability Density, LRD) of data objects, the ratio of neighborhood dispersion to distance calculation is adjusted to a form suitable for rough clustering, and the coefficient of variation is introduced to characterize the dispersion within a class; therefore the abnormality degree of data objects can be analyzed accurately and quantitatively and the data objects judged abnormal (i.e., abnormal data values) can be eliminated.
  • After S103, the method may further include: using an improved sparse autoencoder to reduce the dimensionality of the target data set, where the improved sparse autoencoder uses a sparse regularization operator instead of the KL relative entropy as the sparsity constraint term and uses the L2 norm as the regularization term.
  • The sparse autoencoder imposes a sparsity constraint by adding a neuron-activity term in the hidden layer, which represents the activation of hidden neuron j-ac of the autoencoder neural network given the input X. The average activation of hidden neuron j-ac of the sparse autoencoder is defined accordingly, where the index value j-ac denotes the position label of each neuron, H denotes the number of neurons in the input layer, and h denotes the index of each neuron in the input layer.
  • The loss function of the original sparse autoencoder is generally expressed as the mean squared error, with the KL divergence added on this basis as a sparsity constraint; in the corresponding formulas, f'(z_q) denotes the derivative of the output layer z in the neural network and q denotes the number of neurons in the output layer.
  • The improved sparse autoencoder constructs a target loss function in which λ1 is the weight of the sparse penalty term, λ2 is the weight decay coefficient, S_2 denotes the number of hidden-layer neurons, W_s denotes the weight coefficients of all hidden-layer neurons in the neural network, b denotes the bias term of the neural network, s denotes the index of a hidden-layer neuron in the range [1, S_2], J(W, b) denotes the initial loss function term of the sparse autoencoder, J_sparse(W, b) denotes the target loss function of the improved sparse autoencoder, y denotes the true value, and h_{w,h}(x) denotes the predicted value of the neural network for input x.
  • FIG. 6 is a schematic diagram of a network mechanism of an improved sparse autoencoder of the data processing method provided in the embodiment of the present application.
  • FIG. 7 is an example flow diagram of a data processing method provided in an embodiment of the present application.
  • Replacing the KL relative entropy with the sparse regularization operator can improve the performance of the algorithm; using the L2 norm as the regularization term can balance the weights of the polynomial components and prevent the sparse autoencoder from overfitting when processing data.
  • using an improved sparse autoencoder to reduce the data dimensionality of the data that has been detected by outliers can reduce data redundancy and improve the simplicity and reliability of data.
  • The data processing method provided by the embodiment of the present application uses the shortest forked tree rough clustering algorithm to perform rough clustering on the target data set and form multiple shortest forked trees, then uses the threshold pruning algorithm of the rough clustering neighborhood information system to prune and merge the shortest forked trees, and then uses the outlier detection algorithm that balances and fuses local multi-feature factors of the data to calculate the abnormality degree of the data objects in the shortest forked trees and to determine and eliminate abnormal data values according to the abnormality degree of the data objects.
  • Because an algorithm automatically analyzes the data of the target data set, the efficiency of data analysis can be improved; at the same time, the outlier detection algorithm introduces the local relative proximity into the standard local outlier factor to replace the local reachability density of data objects, adjusts the ratio of neighborhood dispersion to distance calculation to a form suitable for rough clustering, and introduces the coefficient of variation to characterize the dispersion within a class, so the abnormality degree of data objects can be analyzed accurately and quantitatively; abnormal data values in the original data (i.e., the target data set) can therefore be accurately determined and eliminated, improving the accuracy of analysis results and decisions.
  • the second embodiment of the present application relates to a data processing device 200, as shown in FIG. 8 , comprising: a clustering module 201, a processing module 202, and a determination module 203.
  • the clustering module 201 is used to obtain the target data set, perform rough clustering on the target data set by using the shortest fork tree rough clustering algorithm, and form multiple shortest fork trees according to the rough clustering results;
  • the processing module 202 is configured to use a threshold pruning algorithm based on a rough clustering neighborhood information system to prune and merge the shortest forked tree to obtain a simplified shortest forked tree;
  • The determination module 203 is configured to calculate the abnormality degree of the data objects in the simplified shortest forked tree with the outlier detection algorithm that balances and fuses local multi-feature factors of the data, and to determine and eliminate abnormal data values in the target data set according to the abnormality degree.
  • The data processing device 200 provided in the embodiment of the present application further includes a dimensionality reduction module, where the dimensionality reduction module is configured to use an improved sparse autoencoder to reduce the dimensionality of the target data set, and the improved sparse autoencoder uses the sparse regularization operator instead of the KL relative entropy as the sparsity constraint term and uses the L2 norm as the regularization term.
  • The dimensionality reduction module is further configured to construct the target loss function of the improved sparse autoencoder, where λ1 is the weight of the sparse penalty term, λ2 is the weight decay coefficient, S_2 denotes the number of hidden-layer neurons, W_s denotes the weight coefficients of all hidden-layer neurons in the neural network, b denotes the bias term of the neural network, s denotes the index of a hidden-layer neuron in the range [1, S_2], J(W, b) denotes the initial loss function term of the sparse autoencoder, and J_sparse(W, b) denotes the target loss function of the improved sparse autoencoder.
  • The determination module 203 is specifically configured to: numerically process the data in the simplified shortest forked tree; calculate the distance N_dis(x) between the nodes of the same shortest forked tree, where x is the specified data object, x_i are the other data objects in the shortest forked tree class, K denotes the number of data objects in the class, and exp(1) denotes the constant with base e and exponent 1; calculate the coefficient of variation of the data in each shortest forked tree, where T_i denotes the sum of the distances of all nodes in any shortest forked tree cluster class, x_c denotes the distance of each node in the shortest forked tree corresponding to T_i, α denotes the number of nodes contained in the cluster class, β is the number of shortest forked trees, N_std(T_i) is the standard deviation of the shortest forked tree cluster class, N_mean(T_i) is the mean of the class, and N_cv(T_i) is the coefficient of variation; and take MDILAF as the abnormality degree of the data object and determine and eliminate abnormal data values in the target data set according to the abnormality degree, where LRP(x_i) is the local relative proximity among the data in the class other than x, N(x) is the shortest forked tree of the data object x, and |N(x)| is the sum of the distances of all the other data objects in the class.
  • The clustering module 201 is specifically configured to: combine the branches containing shared nodes into one shortest forked tree structure and prune the branches that completely intersect in the shortest forked tree, to obtain the simplified shortest forked tree.
  • The processing module 202 is further configured such that pruning branches whose scores are less than or equal to the deviation threshold refers to pruning branches whose data objects' Dist values are less than or equal to the deviation threshold.
  • this embodiment is a device embodiment corresponding to the first embodiment, and this embodiment can be implemented in cooperation with the first embodiment.
  • the relevant technical details mentioned in the first embodiment are still valid in this embodiment, and will not be repeated here in order to reduce repetition.
  • the relevant technical details mentioned in this implementation manner can also be applied in the first implementation manner.
  • modules involved in this embodiment are logical modules.
  • A logical unit may be a physical unit, a part of a physical unit, or a combination of multiple physical units.
  • units that are not closely related to solving the technical problems proposed in the present application are not introduced in this embodiment, but this does not mean that there are no other units in this embodiment.
  • The third embodiment of the present application relates to a network device. As shown in FIG. 9, it includes at least one processor 301 and a memory 302 communicatively connected to the at least one processor 301; the memory 302 stores instructions executable by the at least one processor 301, and the instructions are executed by the at least one processor 301 so that the at least one processor 301 can execute the above data processing method.
  • the memory 302 and the processor 301 are connected by a bus, and the bus may include any number of interconnected buses and bridges, and the bus connects one or more processors 301 and various circuits of the memory 302 together.
  • the bus may also connect together various other circuits such as peripherals, voltage regulators, and power management circuits, all of which are well known in the art and therefore will not be further described herein.
  • the bus interface provides an interface between the bus and the transceivers.
  • a transceiver may be a single element or multiple elements, such as multiple receivers and transmitters, providing means for communicating with various other devices over a transmission medium.
  • the data processed by the processor 301 is transmitted on the wireless medium through the antenna, and further, the antenna also receives the data and transmits the data to the processor 301 .
  • the processor 301 is responsible for managing the bus and general processing, and may also provide various functions including timing, peripheral interface, voltage regulation, power management and other control functions. And the memory 302 may be used to store data used by the processor 301 when performing operations.
  • the fourth embodiment of the present application relates to a computer-readable storage medium storing a computer program.
  • When the computer program is executed by a processor, the above method embodiments are implemented.
  • The program is stored in a storage medium and includes several instructions for making a device (which may be a single-chip microcomputer, a chip, or the like) or a processor execute all or part of the steps of the methods described in the embodiments of the present application.
  • The aforementioned storage media include various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Discrete Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A data processing method in the technical field of communications, comprising: acquiring a target data set, performing rough clustering on the target data set by using a shortest bifurcation tree rough clustering algorithm, and forming a plurality of shortest bifurcation trees according to a rough clustering result; pruning and merging the shortest bifurcation trees by using a threshold pruning algorithm based on a rough clustering neighborhood information system to obtain a simplified shortest bifurcation tree; and calculating abnormality of a data object in the simplified shortest bifurcation tree by using an abnormal value detection algorithm for local multi-feature factors of balanced fusion data, and determining and removing an abnormal data value in the target data set according to the abnormality. Further provided are a data processing apparatus, a network device, and a storage medium.

Description

Data processing method, apparatus, network device, and storage medium
Cross Reference
This application is filed based on the Chinese patent application No. 202110678862.X, filed on June 18, 2021, and claims priority to that Chinese patent application, the entire content of which is incorporated herein by reference.
Technical Field
The present application relates to the technical field of communications, and in particular to a data processing method, apparatus, network device, and storage medium.
Background
When an enterprise makes decisions, analyzing the data first can make the decisions more scientific and accurate. However, on the one hand, with the development of information technology, enterprises generate more and more data, so data analysis for decision-making often has to deal with a large amount of data; on the other hand, most enterprises still rely on experience or on traditional data analysis methods, and when such methods are used to analyze large amounts of data to discover underlying laws or changes, the analysis is inefficient and the results are not accurate enough because of subjective differences, which affects the accuracy of decisions. In particular, if abnormal data values exist in the original data and are not eliminated during data analysis, irreversible deviations may appear in the analysis, seriously affecting the accuracy of the analysis results and leading to major decision-making mistakes.
Summary
Some embodiments of the present application provide a data processing method, apparatus, network device, and storage medium.
To solve the above technical problem, an embodiment of the present application provides a data processing method, including: obtaining a target data set, performing rough clustering on the target data set with the shortest forked tree rough clustering algorithm, and forming multiple shortest forked trees according to the rough clustering result; pruning and merging the shortest forked trees with a threshold pruning algorithm based on a rough clustering neighborhood information system to obtain simplified shortest forked trees; and calculating the abnormality degree of data objects in the simplified shortest forked trees with an outlier detection algorithm that balances and fuses local multi-feature factors of the data, and determining and eliminating abnormal data values in the target data set according to the abnormality degree.
An embodiment of the present application also provides a data processing device, including: a clustering module, configured to obtain a target data set, perform rough clustering on the target data set with the shortest forked tree rough clustering algorithm, and form multiple shortest forked trees according to the rough clustering result; a processing module, configured to prune and merge the shortest forked trees with a threshold pruning algorithm based on a rough clustering neighborhood information system to obtain simplified shortest forked trees; and a determination module, configured to calculate the abnormality degree of data objects in the simplified shortest forked trees with the outlier detection algorithm that balances and fuses local multi-feature factors of the data, and to determine and eliminate abnormal data values in the target data set according to the abnormality degree.
An embodiment of the present application also provides a network device, including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can execute the above data processing method.
An embodiment of the present application also provides a computer-readable storage medium storing a computer program, which implements the above data processing method when executed by a processor.
Brief Description of the Drawings
One or more embodiments are exemplified by the figures in the corresponding drawings; these exemplary descriptions do not constitute a limitation on the embodiments.
FIG. 1 is a schematic flowchart of the data processing method provided in the first embodiment of the present application;
FIG. 2 is a schematic diagram of the algorithm process of the shortest forked tree rough clustering algorithm in the data processing method provided in the first embodiment of the present application;
FIG. 3 is an example diagram of the search results of a first-generation node in the data processing method provided in the first embodiment of the present application;
FIG. 4 is a schematic diagram of the process of processing the shortest forked tree with the threshold pruning algorithm of the rough clustering neighborhood information system in the data processing method provided in the first embodiment of the present application;
FIG. 5 is a schematic flowchart of the outlier detection algorithm that balances and fuses local multi-feature factors of the data in the data processing method provided in the first embodiment of the present application;
FIG. 6 is a schematic diagram of the network mechanism of the improved sparse autoencoder in the data processing method provided in the first embodiment of the present application;
FIG. 7 is an example flowchart of the data processing method provided in the first embodiment of the present application;
FIG. 8 is a schematic diagram of the module structure of the data processing device provided in the second embodiment of the present application;
FIG. 9 is a schematic structural diagram of the network device provided in the third embodiment of the present application.
Detailed Description
To make the objectives, technical solutions, and advantages of the present application clearer, the embodiments of the present application are described in detail below with reference to the accompanying drawings. However, those of ordinary skill in the art can understand that many technical details are given in each embodiment so that readers can better understand the present application; even without these technical details and the various changes and modifications based on the following embodiments, the technical solutions claimed in the present application can still be realized.
The first embodiment of the present application relates to a data processing method: the shortest forked tree rough clustering algorithm is used to perform rough clustering on a target data set to form multiple shortest forked trees; a threshold pruning algorithm based on the rough clustering neighborhood information system then prunes and merges the shortest forked trees; finally, an outlier detection algorithm that balances and fuses local multi-feature factors of the data calculates the abnormality degree of the data objects in the shortest forked trees, and abnormal data values are determined and eliminated according to the abnormality degree of the data objects. Abnormal data values in the original data can thus be eliminated, improving the efficiency of data analysis and the accuracy of decision-making. Because an algorithm automatically analyzes the data of the target data set, the efficiency of data analysis can be improved; at the same time, the outlier detection algorithm introduces the local relative proximity into the standard local outlier factor to replace the local reachability density of data objects, adjusts the ratio of neighborhood dispersion to distance calculation to a form suitable for rough clustering, and introduces the coefficient of variation to characterize the dispersion within a class, so the abnormality degree of data objects can be analyzed accurately and quantitatively; abnormal data values in the original data (i.e., the target data set) can therefore be accurately determined and eliminated, improving the accuracy of analysis results and decisions.
It should be noted that the data processing method provided in the embodiments of the present application may be executed by a server, where the server may be implemented by a single server or by a server cluster composed of multiple servers; the server is taken as an example in the following description.
The specific flow of the data processing method provided in the embodiment of the present application is shown in FIG. 1 and includes the following steps:
S101: Obtain a target data set, perform rough clustering on the target data set with the shortest forked tree rough clustering algorithm, and form multiple shortest forked trees according to the rough clustering result.
The target data set may be real-time data or offline data, such as the offline data of an enterprise; when the target data set is real-time data, it refers to the data at a certain moment.
Specifically, S101 may include: determining the source nodes in the target data set, searching for the node nearest to each source node, and taking that nearest node as the first-generation node; then, starting with the first-generation node as the current parent node, cyclically searching for the descendant node set with the current parent node as the starting point and the adaptive node spacing as the neighborhood search radius; if a new node exists within the neighborhood search radius, taking the new node as the current parent node and continuing to search the descendant node set with the adaptive node spacing as the neighborhood search radius until no new node exists within the radius; ending the search and storing all nodes and node distances from the source node to the last-generation node; and forming the shortest forked tree from all nodes between the source node and the last-generation node. Here the node distance is the set of spacings between nodes of the same level and their descendant nodes, and the adaptive node spacing is Dist = arg min(Euclidean_dist(last-gener_i, next-gener_j)), where last-gener_i is the parent in two adjacent generations, next-gener_j is the child, Euclidean_dist() denotes the Euclidean distance, and i and j denote the node indexes of the parent and child respectively.
Please refer to FIG. 2, which is a schematic diagram of the algorithm process of the shortest forked tree rough clustering algorithm in the data processing method provided by the embodiment of the present application. A specific process is described below as an example:
1. The server collects the offline data of the enterprise as the target data set, assumes that all data objects in the target data set are outliers and identifies them as source nodes (the number of data objects in the offline data equals the assumed number of source nodes), and stores the relevant attributes (loca, value) of each source node, where loca is the location of the data object and value is the data value.
2. Globally search the descendant node sets. During the global search, the global search strategy mainly includes calculating the adaptive node spacing and determining adjacent nodes. Take searching the next two generations of node data sets from a source node as an example:
2-1. Starting from any source node x_i, traverse all data objects and take the node nearest to it as the first-generation node of the source node, i.e., x_i → x_i1, ensuring that each source node has only one first-generation node.
2-2. Compute the spacing |x_i, x_i1| between the source node and its first-generation node, where the spacing may be computed with a distance formula (e.g., the Euclidean distance); the first-generation node then contains three attributes (loca, value, |x_i, x_i1|). The two adjacent generations for which Dist is computed belong to the same level; the next level is screened by taking the current level's next-gener_j node as the center and Dist as the neighborhood search radius to search for the next level's last-gener_i, and the search continues in this way. Every node other than the source node contains three attributes, defined as x_j = (loca, value, Dist), where the Dist attribute value is the set of spacings between the current node and its descendant nodes at the same level, i.e., the adaptive node spacing.
2-3. Taking the first-generation node as the starting point of the next level and the spacing |x_i, x_i1| (i.e., the spacing between the source node and the first-generation node) as the neighborhood search radius, search for the set of next-generation child nodes of the first-generation node; the search result is shown in FIG. 3. The descendant node set of the first-generation node is not limited by the number of nodes, and all data objects within the neighborhood search radius belong to its child nodes, but the uniqueness principle must be followed: between two adjacent generations of the same level, next-gener_j can only be produced by searching from last-gener_i; the mapping may be one-to-one or one-to-many, but the data of the two generations cannot overlap, that is, last-gener_i → next-gener_j with an empty intersection between the two generations (the intersection condition is given as formula image PCTCN2022099638-appb-000001 in the original filing).
3. Search level by level according to the search strategy of 2-3 above, finally forming a shortest forked tree (the Shortest Forked Tree, SFT) that is identified as one group of rough clusters; it includes the source node and the data set of all its descendant nodes together with the corresponding set of node distances. The shortest forked tree formed by rough clustering includes two types of data: one is the source node and its searched set of descendant nodes; the other is the set of spacings between corresponding nodes of all adjacent generations that form the shortest forked tree.
Because outliers in the target data set have low surrounding data object density and large spacing within their neighborhood, the dispersion between a local outlier and its adjacent points is large. Assuming each independent outlier as a source node and using the distance between different levels (i.e., the adaptive node spacing) as the neighborhood search radius, its adjacent points are searched step by step to form a complete tree structure that is identified as a rough class, which achieves the purpose of dividing the data into different clusters.
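The following is a minimal Python sketch of the search just described, assuming numeric feature vectors and the Euclidean distance; the function and variable names (build_sft, radius, and so on) are illustrative and not taken from the patent, and the handling of the adaptive node spacing between generations is one plausible reading of the text above.

```python
# Hedged sketch of growing one shortest forked tree (SFT) from a source node.
import numpy as np

def euclidean(a, b):
    return float(np.linalg.norm(a - b))

def build_sft(data, source_idx):
    """Grow one SFT: nearest neighbour first, then generation-by-generation
    neighbourhood search with the adaptive node spacing as the search radius."""
    unassigned = set(range(len(data))) - {source_idx}
    tree_nodes = [source_idx]
    node_dists = []

    # the first-generation node is the single nearest neighbour of the source node
    first = min(unassigned, key=lambda j: euclidean(data[source_idx], data[j]))
    radius = euclidean(data[source_idx], data[first])   # |x_i, x_i1|
    unassigned.discard(first)
    tree_nodes.append(first)
    node_dists.append(radius)

    parents = [first]
    while parents and unassigned:
        children, child_dists = [], []
        for p in parents:
            # every unassigned object within the neighbourhood search radius becomes a child
            for j in sorted(unassigned):
                d = euclidean(data[p], data[j])
                if d <= radius:
                    unassigned.discard(j)      # uniqueness: generations never overlap
                    children.append(j)
                    child_dists.append(d)
        if not children:
            break                              # no new node within the radius: stop
        radius = min(child_dists)              # adaptive node spacing for the next level
        tree_nodes.extend(children)
        node_dists.extend(child_dists)
        parents = children
    return tree_nodes, node_dists

# toy usage: one tree grown from one assumed source node
points = np.array([[0.0, 0.0], [0.2, 0.1], [0.3, 0.2], [5.0, 5.0]])
print(build_sft(points, source_idx=0))
```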
S102: Prune and merge the shortest forked trees with a threshold pruning algorithm based on the rough clustering neighborhood information system to obtain the simplified shortest forked trees.
Please refer to FIG. 4, which is a schematic diagram of the process of processing the shortest forked tree with the threshold pruning algorithm of the rough clustering neighborhood information system in the data processing method provided by the embodiment of the present application.
Specifically, S102 may include: according to the attributes of each data object in the shortest forked trees, combining the branches containing shared nodes into one shortest forked tree structure, and pruning the branches that completely intersect in the shortest forked tree, to obtain the simplified shortest forked tree.
The specific steps are as follows:
1. Extract any shortest forked tree formed in S101 and prune its completely intersecting branches: assume that there are two different branches T_i and T_j with |T_i| ≥ |T_j|; the pruning condition that must be satisfied (given as formula image PCTCN2022099638-appb-000002 in the original filing) requires the two branches to completely intersect, i.e., T_j is entirely contained in T_i, in which case T_j is pruned and only T_i is kept.
2. Shared-node branch clustering: assume that there are two different branches T_1 and T_2 with |T_1| ≥ |T_2|; shared-node clustering can be performed if T_2 contains a node of T_1, in which case the two branches are merged into T_1.
In a specific example, after combining the branches containing shared nodes into one shortest forked tree structure and pruning the completely intersecting branches in the shortest forked tree to obtain the simplified shortest forked tree, the method further includes: according to the Dist attribute of each data object in the shortest forked tree, calculating the median and the mean of the sums of the distances of the data objects in the shortest forked tree, and pruning branches whose scores are less than or equal to the deviation threshold according to the deviation threshold formula, where the deviation threshold formula is DEV = mean + |mean - median|, DEV is the deviation threshold, mean is the average value, and median is the median; pruning branches whose scores are less than or equal to the deviation threshold according to the deviation threshold formula means pruning branches whose data objects' Dist values are less than or equal to the deviation threshold.
It should be understood that pruning branches whose scores are less than or equal to the deviation threshold according to the deviation threshold formula means pruning, according to that formula, the weak-weight branch clusters in the shortest forked tree whose scores are lower than the deviation threshold.
Pruning the weak-weight branch clusters whose scores are lower than the deviation threshold with the deviation threshold formula, i.e., pruning the data objects and completely intersecting branches in the roughly clustered shortest forked tree and merging the branches containing shared nodes, can further streamline the data structure of the shortest forked tree and facilitate the further processing of subsequent data.
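A minimal Python sketch of this pruning-and-merging step follows; it assumes each branch is represented as a set of node indices plus the Dist values of its data objects, and the helper names (prune_and_merge and so on) are illustrative rather than from the patent. The deviation threshold follows DEV = mean + |mean - median| as given above; for example, with made-up branch scores [2.0, 2.5, 3.0, 9.5], mean = 4.25, median = 2.75, DEV = 5.75, so only the branch scoring 9.5 survives.

```python
# Hedged sketch of S102: prune contained branches, merge shared-node branches,
# then drop weak-weight branches whose Dist-sum score is <= the deviation threshold.
import statistics

def prune_and_merge(branches):
    """branches: list of dicts like {"nodes": set of node indices, "dists": list of Dist values}."""
    # 1. prune completely intersecting branches: drop a branch fully contained in a larger one
    kept = []
    for b in sorted(branches, key=lambda br: len(br["nodes"]), reverse=True):
        if not any(b["nodes"] <= k["nodes"] for k in kept):
            kept.append({"nodes": set(b["nodes"]), "dists": list(b["dists"])})

    # 2. merge branches that share at least one node into the branch already kept
    merged = []
    for b in kept:
        for m in merged:
            if m["nodes"] & b["nodes"]:
                m["nodes"] |= b["nodes"]
                m["dists"] += b["dists"]
                break
        else:
            merged.append(b)

    # 3. threshold pruning: DEV = mean + |mean - median| over the branch scores
    scores = [sum(m["dists"]) for m in merged]
    mean, median = statistics.mean(scores), statistics.median(scores)
    dev = mean + abs(mean - median)
    return [m for m, s in zip(merged, scores) if s > dev]

# toy usage with made-up branches
example = [
    {"nodes": {0, 1, 2}, "dists": [1.0, 1.0]},
    {"nodes": {1, 2}, "dists": [1.0]},          # fully contained in the first branch -> pruned
    {"nodes": {2, 7}, "dists": [0.5]},          # shares node 2 -> merged into the first branch
    {"nodes": {8, 9}, "dists": [6.0, 3.5]},     # large Dist sum -> kept after thresholding
]
print(prune_and_merge(example))
```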
S103: Calculate the abnormality degree of the data objects in the simplified shortest forked tree with the outlier detection algorithm that balances and fuses local multi-feature factors of the data, and determine and eliminate abnormal data values in the target data set according to the abnormality degree.
In a specific example, S103 may include: numerically processing the data in the simplified shortest forked tree according to T_o-stand = T_o + |min(T_o)|, i.e., a data standardization step, where T_o denotes a simplified shortest forked tree branch and T_o-stand denotes the T_o branch after numerical processing; calculating the distance between the nodes of the same shortest forked tree according to N_dis(x) (the formula is given as image PCTCN2022099638-appb-000003 in the original filing), where N_dis(x) is the computed distance between the nodes of the shortest forked tree, x is the specified data object, x_i are the other data objects in that shortest forked tree class, K denotes the number of data objects in the class, and exp(1) denotes the constant with base e and exponent 1; and calculating the coefficient of variation of the data in each shortest forked tree according to the formulas given as images PCTCN2022099638-appb-000004, PCTCN2022099638-appb-000005, and PCTCN2022099638-appb-000006 in the original filing, where T_i denotes the sum of the distances of all nodes in any shortest forked tree cluster class, x_c denotes the distance of each node in the shortest forked tree corresponding to T_i, α denotes the number of nodes contained in the cluster class, β is the number of shortest forked trees, N_std(T_i) is the standard deviation of the shortest forked tree cluster class, N_mean(T_i) is the mean of the class, and N_cv(T_i) is the coefficient of variation. The local relative proximity of the data objects in the class is then calculated (formula image PCTCN2022099638-appb-000007), and from the local relative proximity the MDILAF is calculated (formula image PCTCN2022099638-appb-000008); MDILAF is taken as the abnormality degree of the data object, and abnormal data values in the target data set are determined and eliminated according to the abnormality degree, where LRP(x_i) is the local relative proximity among the data in the class other than x, N(x) is the shortest forked tree of the data object x, and |N(x)| is the sum of the distances of all the other data objects in the class.
For the above flow, please refer to FIG. 5, which is a schematic flowchart of the outlier detection algorithm that balances and fuses local multi-feature factors of the data in the data processing method provided by the embodiment of the present application.
Because the outlier detection algorithm that balances and fuses local multi-feature factors of the data introduces the local relative proximity (Local Relative Proximity, LRP) into the standard local outlier factor (Local Outlier Factor, LOF) to replace the local reachability density (Local Reachability Density, LRD) of data objects, adjusts the ratio of neighborhood dispersion to distance calculation to a form suitable for rough clustering, and introduces the coefficient of variation to characterize the dispersion within a class, the abnormality degree of data objects can be analyzed accurately and quantitatively and the data objects judged abnormal (i.e., abnormal data values) can be eliminated.
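The exact formulas for N_dis, the coefficient of variation, LRP, and MDILAF are provided only as images in the filing; the following Python sketch therefore only illustrates the per-cluster dispersion statistics the text relies on, assuming the conventional definition of the coefficient of variation (standard deviation divided by mean) over the node distances of one shortest forked tree. The function name cluster_dispersion is illustrative.

```python
# Hedged sketch: dispersion statistics of one simplified shortest-forked-tree cluster,
# assuming the conventional definition N_cv = N_std / N_mean over the node distances x_c.
# The patent's exact formulas are given only as images in the original filing.
import numpy as np

def cluster_dispersion(node_dists):
    """node_dists: the Dist values x_c of one shortest forked tree cluster class."""
    x = np.asarray(node_dists, dtype=float)
    n_mean = x.mean()                            # class mean
    n_std = x.std()                              # class standard deviation
    n_cv = n_std / n_mean if n_mean else 0.0     # coefficient of variation: within-class dispersion
    return n_mean, n_std, n_cv

# a larger coefficient of variation suggests stronger dispersion within the class
print(cluster_dispersion([0.40, 0.45, 0.50, 2.00]))
```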
In a specific example, after S103 the method further includes: using an improved sparse autoencoder to reduce the dimensionality of the target data set, where the improved sparse autoencoder uses a sparse rule operator in place of the KL relative entropy as the sparsity constraint term and uses the L2 norm as the regularization term.
Specifically, the sparse autoencoder imposes a sparsity constraint on the activity of the hidden-layer neurons,
Figure PCTCN2022099638-appb-000009
which represents the activation of hidden neuron j-ac of the autoencoder network given the input X. The average activation of hidden neuron j-ac of the sparse autoencoder is defined as:
Figure PCTCN2022099638-appb-000010
Here the index value j-ac denotes the position label of each neuron, H denotes the number of neurons in the input layer, and h is the index of each input-layer neuron.
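Given those definitions, a plausible reading of the figure above, assuming it follows the usual averaging form for sparse autoencoders, is:

\hat{\rho}_{j\text{-}ac} \;=\; \frac{1}{H}\sum_{h=1}^{H} a_{j\text{-}ac}\!\left(x_{h}\right)

This reconstruction is an assumption; the authoritative expression is the one in the referenced figure.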
The loss function of the original sparse autoencoder is generally expressed as the mean squared error, to which the KL divergence is added as a sparsity constraint. The specific formulas are as follows:
Figure PCTCN2022099638-appb-000011
Figure PCTCN2022099638-appb-000012
Figure PCTCN2022099638-appb-000013
In these formulas,
Figure PCTCN2022099638-appb-000014
is the penalty factor of the function and
Figure PCTCN2022099638-appb-000015
is the penalty term of the function. The update mechanism of the sparse autoencoder is:
Figure PCTCN2022099638-appb-000016
where f′(z_q) denotes the derivative of the output layer z of the neural network and q denotes the number of output-layer neurons. When the improved sparse autoencoder is used to reduce the dimensionality of the target data set, this may specifically include: constructing the following target loss function from the improved sparse autoencoder:
Figure PCTCN2022099638-appb-000017
Figure PCTCN2022099638-appb-000018
where λ_1 is the weight of the sparsity penalty term, λ_2 is the weight decay coefficient, S_2 denotes the number of hidden-layer neurons, W_s denotes the weight coefficients of all hidden-layer neurons of the neural network, b denotes the bias term of the neural network, s denotes the index of a hidden-layer neuron in the range [1, S_2], J(W,b) denotes the initial loss function term of the sparse autoencoder, J_sparse(W,b) denotes the target loss function of the improved sparse autoencoder, y denotes the true value, h_{w,h}(x) denotes the predicted value of the neural network with input x,
Figure PCTCN2022099638-appb-000019
denotes the hidden-layer (second-layer) increment updated after error back-propagation through the network,
Figure PCTCN2022099638-appb-000020
denotes the weight of each neuron between the hidden layers (the second and third layers), and
Figure PCTCN2022099638-appb-000021
denotes the increment of the hidden layer (third layer) of the neural network, where r denotes the index of each neuron in that layer; and performing dimensionality reduction on the target data set according to the target loss function.
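As a concrete illustration of the structure described above, and not a reproduction of the exact formulas in the referenced figures, the following Python sketch assumes that the "sparse rule operator" is the L1 norm of the hidden activations and combines it with the mean-squared reconstruction error J(W,b) and an L2 weight-decay term; the λ values and sample data are placeholders.

import numpy as np

def improved_sae_loss(y_true, y_pred, hidden_activations, weights,
                      lambda1=1e-3, lambda2=1e-4):
    # J(W,b): mean-squared reconstruction error between prediction and target
    j_wb = 0.5 * np.mean(np.sum((y_pred - y_true) ** 2, axis=1))
    # Sparsity constraint: L1 penalty on hidden activations (assumed reading
    # of the sparse rule operator that replaces the KL term)
    l1_sparsity = lambda1 * np.sum(np.abs(hidden_activations))
    # L2 regularization term over all hidden-layer weight matrices
    l2_decay = 0.5 * lambda2 * sum(np.sum(W ** 2) for W in weights)
    return j_wb + l1_sparsity + l2_decay

# Hypothetical usage with random data
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))
loss = improved_sae_loss(x, x + 0.1, rng.normal(size=(8, 2)), [rng.normal(size=(4, 2))])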
Specifically, according to the constructed target loss function, the parameter update mechanism of the neural network is changed to:
Figure PCTCN2022099638-appb-000022
Reference may be made to FIG. 6, which is a schematic diagram of the network mechanism of the improved sparse autoencoder of the data processing method provided by the embodiment of the present application.
For the above specific flow, reference may be made to FIG. 7, which is an example flow diagram of the data processing method provided by an embodiment of the present application.
Replacing the KL relative entropy with the sparse rule operator as the sparsity constraint term improves the coefficient performance of the algorithm; using the L2 norm as the regularization term balances the weights of the polynomial components and strengthens the sparse autoencoder's ability to prevent overfitting when processing data. In addition, using the improved sparse autoencoder to reduce the dimensionality of the data that has passed outlier detection reduces data redundancy and improves the conciseness and reliability of the data.
In the data processing method provided by the embodiments of the present application, the shortest-bifurcation-tree rough clustering algorithm is used to roughly cluster the target data set into multiple shortest bifurcation trees; the threshold pruning algorithm of the rough clustering neighborhood information system is then used to prune and merge the shortest bifurcation trees; finally, the outlier detection algorithm that balances and fuses local multi-feature factors of the data is used to calculate the abnormality degree of the data objects in the shortest bifurcation trees, and abnormal data values are determined and eliminated according to that abnormality degree. Because the data of the target data set are analyzed automatically by these algorithms, the efficiency of data analysis can be improved. Moreover, because the outlier detection algorithm introduces the local relative proximity to replace the local reachability density in the standard local outlier factor, adjusts the ratio between the neighborhood dispersion and the distance calculation to a form suited to rough clustering, and introduces the coefficient of variation to characterize the intra-class dispersion, the abnormality degree of data objects can be analyzed quantitatively and accurately, so that abnormal data values in the original data (i.e., the target data set) are determined and eliminated according to the abnormality degree, improving the accuracy of the analysis results and of the decisions made from them.
The division of the above methods into steps is only for clarity of description; in implementation, steps may be combined into one step, or a step may be split into multiple steps, and all such variants fall within the protection scope of this patent as long as they contain the same logical relationship. Adding insignificant modifications to the algorithm or flow, or introducing insignificant designs, without changing the core design of the algorithm and flow, also falls within the protection scope of this patent.
The second embodiment of the present application relates to a data processing apparatus 200 which, as shown in FIG. 8, comprises a clustering module 201, a processing module 202, and a determination module 203. The functions of each module are described in detail as follows:
The clustering module 201 is configured to obtain a target data set, perform rough clustering on the target data set by using the shortest-bifurcation-tree rough clustering algorithm, and form multiple shortest bifurcation trees according to the rough clustering result.
The processing module 202 is configured to prune and merge the shortest bifurcation trees by using the threshold pruning algorithm based on the rough clustering neighborhood information system, to obtain simplified shortest bifurcation trees.
The determination module 203 is configured to calculate the abnormality degree of the data objects in the simplified shortest bifurcation trees by using the outlier detection algorithm that balances and fuses local multi-feature factors of the data, and to determine and eliminate the abnormal data values in the target data set according to the abnormality degree.
Further, the data processing apparatus 200 provided by the embodiment of the present application also includes a dimensionality reduction module, where the dimensionality reduction module is configured to reduce the dimensionality of the target data set by using an improved sparse autoencoder, and the improved sparse autoencoder uses a sparse rule operator in place of the KL relative entropy as the sparsity constraint term and uses the L2 norm as the regularization term.
Further, the dimensionality reduction module is also configured to:
construct the following target loss function from the improved sparse autoencoder:
Figure PCTCN2022099638-appb-000023
where λ_1 is the weight of the sparsity penalty term, λ_2 is the weight decay coefficient, S_2 denotes the number of hidden-layer neurons, W_s denotes the weight coefficients of all hidden-layer neurons of the neural network, b denotes the bias term of the neural network, s denotes the index of a hidden-layer neuron in the range [1, S_2], J(W,b) denotes the initial loss function term of the sparse autoencoder, and J_sparse(W,b) denotes the target loss function of the improved sparse autoencoder; and
perform dimensionality reduction on the target data set according to the target loss function.
Further, the determination module 203 is specifically configured to:
standardize the data in the simplified shortest bifurcation tree according to T_o-stand = T_o + |min(T_o)|;
calculate, according to
Figure PCTCN2022099638-appb-000024
the distance between the nodes of the same shortest bifurcation tree, where N_dis(x) is the computed inter-node distance for the shortest bifurcation tree, x is the specified data object, x_i denotes the other data objects in the shortest bifurcation tree class, K denotes the number of data objects in that class, and exp(1) denotes the constant e raised to the power 1;
calculate the coefficient of variation of the data in the shortest bifurcation tree according to the following formulas:
Figure PCTCN2022099638-appb-000025
Figure PCTCN2022099638-appb-000026
where T_i denotes the sum of the node distances in any shortest-bifurcation-tree cluster, x_c denotes each node distance in the shortest bifurcation tree corresponding to T_i, α denotes the number of nodes contained in that cluster, β is the number of shortest bifurcation trees, N_std(T_i) is the standard deviation of the cluster, N_mean(T_i) is the cluster mean, and N_cv(T_i) is the coefficient of variation;
compute, according to
Figure PCTCN2022099638-appb-000027
the local relative proximity of the data objects in the class; and
compute, from the local relative proximity,
Figure PCTCN2022099638-appb-000028
take the MDILAF as the abnormality degree of the data object, and determine and eliminate the abnormal data values in the target data set according to the abnormality degree, where LRP(x_i) is the local relative proximity among the data in the class other than x, N(x) is the shortest bifurcation tree of data object x, and |N(x)| is the sum of the distances of all the other data objects in the class.
Further, the clustering module 201 is specifically configured to:
determine the source node in the target data set;
search for the node nearest to the source node and take the node closest to the source node as the first-generation node;
starting with the first-generation node as the current parent node, cyclically search for the descendant node set using the current parent node as the starting point and the adaptive node spacing as the neighborhood search radius; if a new node exists within the neighborhood search radius, take the new node as the current parent node and continue to search for the descendant node set with the adaptive node spacing as the neighborhood search radius until no new node exists within the neighborhood search radius; then end the search and store all nodes and node distances from the source node to the last-generation node, and form the shortest bifurcation tree from all nodes between the source node and the last-generation node, where the node distance is the set of spacings between nodes of the same level and their descendant nodes, and the adaptive node spacing is Dist = arg min(Euclidean_dist(last-gener_i, next-gener_j)), with last-gener_i being the parent of two adjacent generations and next-gener_j being the child of two adjacent generations (see the sketch after the next item); and
according to the attributes of each data object in the shortest bifurcation tree, combine the branches containing shared nodes into one shortest-bifurcation-tree structure and prune the completely intersecting branches of the shortest bifurcation tree to obtain the simplified shortest bifurcation tree.
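A minimal Python sketch of the neighborhood search described above is given below. It assumes Euclidean distances between points and takes the adaptive node spacing at each generation to be the minimum parent-to-candidate distance; the function name, tie handling, and the returned values are illustrative assumptions rather than the authoritative procedure.

import numpy as np

def grow_shortest_bifurcation_tree(points, source_idx):
    # Start from the source node, take its nearest point as the
    # first-generation node, then repeatedly search for descendants within
    # a radius equal to the adaptive node spacing.
    pts = np.asarray(points, dtype=float)
    unvisited = set(range(len(pts))) - {source_idx}
    dists = {j: np.linalg.norm(pts[source_idx] - pts[j]) for j in unvisited}
    first = min(dists, key=dists.get)            # first-generation node
    tree_nodes, node_dists = [source_idx, first], [dists[first]]
    unvisited.discard(first)
    parents = [first]
    while parents and unvisited:
        pairs = [(p, j, np.linalg.norm(pts[p] - pts[j]))
                 for p in parents for j in unvisited]
        radius = min(d for _, _, d in pairs)     # adaptive node spacing
        children = {j for _, j, d in pairs if d <= radius}
        for j in children:
            node_dists.append(min(d for _, jj, d in pairs if jj == j))
            tree_nodes.append(j)
        unvisited -= children
        parents = list(children)
    return tree_nodes, node_dists

# Hypothetical usage
nodes, dists = grow_shortest_bifurcation_tree([[0, 0], [0.1, 0], [1, 1], [5, 5]], 0)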
Further, the processing module 202 is also configured to:
calculate, according to the Dist attribute of each data object in the shortest bifurcation tree, the median and the mean of the sums of the distances of the data objects in the shortest bifurcation tree, and prune the branches whose score is less than or equal to the deviation threshold according to the deviation threshold formula DEV = mean + |mean - median|, where DEV is the deviation threshold, mean is the mean value, and median is the median; pruning the branches whose score is less than or equal to the deviation threshold means pruning the branches whose data-object Dist values are less than or equal to the deviation threshold (a sketch follows below).
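As a sketch only, assuming each branch is summarized by the sum of its data objects' Dist values, the deviation-threshold pruning described above might look as follows; the input mapping and sample values are hypothetical.

import numpy as np

def deviation_threshold_prune(branch_dist_sums):
    # DEV = mean + |mean - median|; keep only branches strictly above DEV
    values = np.array(list(branch_dist_sums.values()), dtype=float)
    dev = values.mean() + abs(values.mean() - np.median(values))
    return {branch: s for branch, s in branch_dist_sums.items() if s > dev}

# Hypothetical usage
kept = deviation_threshold_prune({"b1": 3.2, "b2": 0.4, "b3": 5.1, "b4": 0.9})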
It is readily seen that this embodiment is an apparatus embodiment corresponding to the first embodiment, and the two embodiments can be implemented in cooperation with each other. The related technical details mentioned in the first embodiment remain valid in this embodiment and, to reduce repetition, are not repeated here. Correspondingly, the related technical details mentioned in this embodiment can also be applied in the first embodiment.
It is worth mentioning that the modules involved in this embodiment are all logical modules. In practical applications, a logical unit may be a physical unit, a part of a physical unit, or a combination of multiple physical units. In addition, in order to highlight the innovative part of the present application, units that are not closely related to solving the technical problem proposed by the present application are not introduced in this embodiment, but this does not mean that no other units exist in this embodiment.
The third embodiment of the present application relates to a network device which, as shown in FIG. 9, includes at least one processor 301 and a memory 302 communicatively connected to the at least one processor 301; the memory 302 stores instructions executable by the at least one processor 301, and the instructions are executed by the at least one processor 301 so that the at least one processor 301 can execute the data processing method described above.
The memory 302 and the processor 301 are connected by a bus. The bus may include any number of interconnected buses and bridges and links together one or more processors 301 and the various circuits of the memory 302. The bus may also connect various other circuits, such as peripheral devices, voltage regulators, and power management circuits, which are well known in the art and are therefore not further described herein. A bus interface provides an interface between the bus and a transceiver. The transceiver may be a single element or multiple elements, such as multiple receivers and transmitters, providing a unit for communicating with various other apparatuses over a transmission medium. Data processed by the processor 301 is transmitted over a wireless medium through an antenna; further, the antenna also receives data and transmits the data to the processor 301.
The processor 301 is responsible for managing the bus and general processing, and may also provide various functions, including timing, peripheral interfaces, voltage regulation, power management, and other control functions. The memory 302 may be used to store data used by the processor 301 when performing operations.
The fourth embodiment of the present application relates to a computer-readable storage medium storing a computer program. When the computer program is executed by a processor, the above method embodiments are implemented.
That is, those skilled in the art can understand that all or some of the steps of the methods in the above embodiments can be completed by instructing the relevant hardware through a program; the program is stored in a storage medium and includes several instructions for causing a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or some of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
Those of ordinary skill in the art can understand that the above embodiments are specific embodiments for implementing the present application, and that in practical applications various changes may be made to them in form and detail without departing from the spirit and scope of the present application.

Claims (10)

  1. A data processing method, characterized in that it comprises:
    obtaining a target data set, performing rough clustering on the target data set by using a shortest-bifurcation-tree rough clustering algorithm, and forming multiple shortest bifurcation trees according to a rough clustering result;
    pruning and merging the shortest bifurcation trees by using a threshold pruning algorithm based on a rough clustering neighborhood information system, to obtain simplified shortest bifurcation trees; and
    calculating an abnormality degree of data objects in the simplified shortest bifurcation trees by using an outlier detection algorithm that balances and fuses local multi-feature factors of the data, and determining and eliminating abnormal data values in the target data set according to the abnormality degree.
  2. The data processing method according to claim 1, characterized in that, after the calculating the abnormality degree of the data objects in the simplified shortest bifurcation trees by using the outlier detection algorithm that balances and fuses local multi-feature factors of the data and the determining and eliminating the abnormal data values in the target data set according to the abnormality degree of the data objects, the method further comprises:
    reducing the dimensionality of the target data set by using an improved sparse autoencoder, wherein the improved sparse autoencoder uses a sparse rule operator in place of KL relative entropy as a sparsity constraint term and uses an L2 norm as a regularization term.
  3. The data processing method according to claim 2, characterized in that the reducing the dimensionality of the target data set by using the improved sparse autoencoder comprises:
    constructing the following target loss function from the improved sparse autoencoder:
    Figure PCTCN2022099638-appb-100001
    wherein λ_1 is a weight of a sparsity penalty term, λ_2 is a weight decay coefficient, S_2 denotes the number of hidden-layer neurons, W_s denotes weight coefficients of all hidden-layer neurons of the neural network, b denotes a bias term of the neural network, s denotes an index of a hidden-layer neuron in the range [1, S_2], J(W,b) denotes an initial loss function term of the sparse autoencoder, and J_sparse(W,b) denotes the target loss function of the improved sparse autoencoder; and
    performing dimensionality reduction on the target data set according to the target loss function.
  4. The data processing method according to any one of claims 1 to 3, characterized in that the calculating the abnormality degree of the data objects in the simplified shortest bifurcation trees by using the outlier detection algorithm that balances and fuses local multi-feature factors of the data, and the determining and eliminating the abnormal data values in the target data set according to the abnormality degree of the data objects, comprise:
    standardizing the data in the simplified shortest bifurcation tree according to T_o-stand = T_o + |min(T_o)|;
    calculating, according to
    Figure PCTCN2022099638-appb-100002
    a distance between nodes of a same shortest bifurcation tree, wherein N_dis(x) is the computed inter-node distance for the shortest bifurcation tree, x is a specified data object, x_i denotes other data objects in the shortest bifurcation tree class, K denotes the number of data objects in the shortest bifurcation tree class, and exp(1) denotes the constant e raised to the power 1;
    calculating a coefficient of variation of the data in the shortest bifurcation tree according to the following formulas:
    Figure PCTCN2022099638-appb-100003
    Figure PCTCN2022099638-appb-100004
    Figure PCTCN2022099638-appb-100005
    wherein T_i denotes a sum of node distances in any shortest-bifurcation-tree cluster, x_c denotes each node distance in the shortest bifurcation tree corresponding to T_i, α denotes the number of nodes contained in the cluster, β is the number of shortest bifurcation trees, N_std(T_i) is a standard deviation of the cluster, N_mean(T_i) is a mean of the cluster, and N_cv(T_i) is the coefficient of variation;
    computing, according to
    Figure PCTCN2022099638-appb-100006
    a local relative proximity of the data objects in the class; and
    computing, from the local relative proximity,
    Figure PCTCN2022099638-appb-100007
    taking the MDILAF as the abnormality degree of the data object, and determining and eliminating the abnormal data values in the target data set according to the abnormality degree, wherein LRP(x_i) is the local relative proximity among the data in the class other than x, N(x) is the shortest bifurcation tree of data object x, and |N(x)| is the sum of the distances of all the other data objects in the class.
  5. The data processing method according to any one of claims 1 to 4, characterized in that the performing rough clustering on the target data set by using the shortest-bifurcation-tree rough clustering algorithm and the forming multiple shortest bifurcation trees according to the rough clustering result comprise:
    determining a source node in the target data set;
    searching for the node nearest to the source node, and taking the node closest to the source node as a first-generation node; and
    starting with the first-generation node as a current parent node, cyclically searching for a descendant node set using the current parent node as a starting point and an adaptive node spacing as a neighborhood search radius; if a new node exists within the neighborhood search radius, taking the new node as the current parent node and continuing to search for the descendant node set with the adaptive node spacing as the neighborhood search radius until no new node exists within the neighborhood search radius; then ending the search and storing all nodes and node distances from the source node to a last-generation node, and forming the shortest bifurcation tree from all nodes between the source node and the last-generation node, wherein the node distance is a set of spacings between nodes of a same level and their descendant nodes, and the adaptive node spacing is Dist = arg min(Euclidean_dist(last-gener_i, next-gener_j)), where last-gener_i is the parent of two adjacent generations and next-gener_j is the child of two adjacent generations.
  6. The data processing method according to any one of claims 1 to 5, characterized in that the pruning and merging the shortest bifurcation trees by using the threshold pruning algorithm based on the rough clustering neighborhood information system, to obtain the simplified shortest bifurcation tree, comprises:
    combining, according to attributes of each data object in the shortest bifurcation tree, branches containing shared nodes into one shortest-bifurcation-tree structure, and pruning completely intersecting branches of the shortest bifurcation tree to obtain the simplified shortest bifurcation tree.
  7. The data processing method according to claim 6, characterized in that, after the combining the branches containing shared nodes into one shortest-bifurcation-tree structure and the pruning the completely intersecting branches of the shortest bifurcation tree, the method further comprises:
    calculating, according to a Dist attribute of each data object in the shortest bifurcation tree, a median and a mean of sums of the distances of the data objects in the shortest bifurcation tree, and pruning branches whose score is less than or equal to a deviation threshold according to a deviation threshold formula, wherein the deviation threshold formula is DEV = mean + |mean - median|, DEV is the deviation threshold, mean is the mean value, and median is the median.
  8. A data processing apparatus, characterized in that it comprises:
    a clustering module, configured to obtain a target data set, perform rough clustering on the target data set by using a shortest-bifurcation-tree rough clustering algorithm, and form multiple shortest bifurcation trees according to a rough clustering result;
    a processing module, configured to prune and merge the shortest bifurcation trees by using a threshold pruning algorithm based on a rough clustering neighborhood information system, to obtain simplified shortest bifurcation trees; and
    a determination module, configured to calculate an abnormality degree of data objects in the simplified shortest bifurcation trees by using an outlier detection algorithm that balances and fuses local multi-feature factors of the data, and to determine and eliminate abnormal data values in the target data set according to the abnormality degree.
  9. A network device, characterized in that it comprises:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can execute the data processing method according to any one of claims 1 to 7.
  10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the data processing method according to any one of claims 1 to 7.
PCT/CN2022/099638 2021-06-18 2022-06-17 Data processing method and apparatus, network device, and storage medium WO2022262869A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110678862.XA CN113420804A (en) 2021-06-18 2021-06-18 Data processing method, device, network equipment and storage medium
CN202110678862.X 2021-06-18

Publications (1)

Publication Number Publication Date
WO2022262869A1 true WO2022262869A1 (en) 2022-12-22

Family

ID=77789079

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/099638 WO2022262869A1 (en) 2021-06-18 2022-06-17 Data processing method and apparatus, network device, and storage medium

Country Status (2)

Country Link
CN (1) CN113420804A (en)
WO (1) WO2022262869A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117272216A (en) * 2023-11-22 2023-12-22 中国建材检验认证集团湖南有限公司 Data analysis method for automatic flow monitoring station and manual water gauge observation station
CN117370331A (en) * 2023-12-08 2024-01-09 河北建投水务投资有限公司 Method and device for cleaning total water consumption data of cell, terminal equipment and storage medium

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113420804A (en) * 2021-06-18 2021-09-21 工业互联网创新中心(上海)有限公司 Data processing method, device, network equipment and storage medium
CN114742178B (en) * 2022-06-10 2022-11-08 航天亮丽电气有限责任公司 Method for non-invasive pressure plate state monitoring through MEMS six-axis sensor
CN115202661B (en) * 2022-09-15 2022-11-29 深圳大学 Hybrid generation method with hierarchical structure layout and related equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111444247A (en) * 2020-06-17 2020-07-24 北京必示科技有限公司 KPI (Key performance indicator) -based root cause positioning method and device and storage medium
CN111985837A (en) * 2020-08-31 2020-11-24 平安医疗健康管理股份有限公司 Risk analysis method, device and equipment based on hierarchical clustering and storage medium
CN112800148A (en) * 2021-02-04 2021-05-14 国网福建省电力有限公司 Scattered pollutant enterprise research and judgment method based on clustering feature tree and outlier quantization
CN113420804A (en) * 2021-06-18 2021-09-21 工业互联网创新中心(上海)有限公司 Data processing method, device, network equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6871201B2 (en) * 2001-07-31 2005-03-22 International Business Machines Corporation Method for building space-splitting decision tree

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111444247A (en) * 2020-06-17 2020-07-24 北京必示科技有限公司 KPI (Key performance indicator) -based root cause positioning method and device and storage medium
CN111985837A (en) * 2020-08-31 2020-11-24 平安医疗健康管理股份有限公司 Risk analysis method, device and equipment based on hierarchical clustering and storage medium
CN112800148A (en) * 2021-02-04 2021-05-14 国网福建省电力有限公司 Scattered pollutant enterprise research and judgment method based on clustering feature tree and outlier quantization
CN113420804A (en) * 2021-06-18 2021-09-21 工业互联网创新中心(上海)有限公司 Data processing method, device, network equipment and storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117272216A (en) * 2023-11-22 2023-12-22 中国建材检验认证集团湖南有限公司 Data analysis method for automatic flow monitoring station and manual water gauge observation station
CN117272216B (en) * 2023-11-22 2024-02-09 中国建材检验认证集团湖南有限公司 Data analysis method for automatic flow monitoring station and manual water gauge observation station
CN117370331A (en) * 2023-12-08 2024-01-09 河北建投水务投资有限公司 Method and device for cleaning total water consumption data of cell, terminal equipment and storage medium
CN117370331B (en) * 2023-12-08 2024-02-20 河北建投水务投资有限公司 Method and device for cleaning total water consumption data of cell, terminal equipment and storage medium

Also Published As

Publication number Publication date
CN113420804A (en) 2021-09-21

Similar Documents

Publication Publication Date Title
WO2022262869A1 (en) Data processing method and apparatus, network device, and storage medium
US11210144B2 (en) Systems and methods for hyperparameter tuning
US11449529B2 (en) Path generation and selection tool for database objects
US9684874B2 (en) Parallel decision or regression tree growing
WO2018205881A1 (en) Estimating the number of samples satisfying a query
CN108804473B (en) Data query method, device and database system
WO2018107128A9 (en) Systems and methods for automating data science machine learning analytical workflows
US20050278139A1 (en) Automatic match tuning
AU2017246552A1 (en) Self-service classification system
US20030208284A1 (en) Modular architecture for optimizing a configuration of a computer system
CN113110866B (en) Evaluation method and device for database change script
US10417580B2 (en) Iterative refinement of pathways correlated with outcomes
CN112328798A (en) Text classification method and device
Liu et al. Correlated aggregation operators for simplified neutrosophic set and their application in multi-attribute group decision making
US10884865B2 (en) Identifying redundant nodes in a knowledge graph data structure
CN111125199B (en) Database access method and device and electronic equipment
Yan et al. A clustering algorithm for multi-modal heterogeneous big data with abnormal data
US11113348B2 (en) Device, system, and method for determining content relevance through ranked indexes
US11741101B2 (en) Estimating execution time for batch queries
CN106789163A (en) A kind of network equipment power information monitoring method, device and system
US20220147515A1 (en) Systems, methods, and program products for providing investment expertise using a financial ontology framework
Karegar et al. Data-mining by probability-based patterns
CN115996169A (en) Network fault analysis method and device, electronic equipment and storage medium
US20220188308A1 (en) Selecting access flow path in complex queries
CN112365014A (en) GA-BP-CBR-based industrial equipment fault diagnosis system and method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22824341

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE