CN116521799A - Abnormal data detection method and device, electronic equipment and storage medium


Info

Publication number
CN116521799A
Authority
CN
China
Prior art keywords
data
target
detected
clustering
current
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210059916.9A
Other languages
Chinese (zh)
Inventor
姜乐
张增
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Wodong Tianjun Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN202210059916.9A priority Critical patent/CN116521799A/en
Publication of CN116521799A publication Critical patent/CN116521799A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/283Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2228Indexing structures
    • G06F16/2246Trees, e.g. B+trees

Abstract

The present disclosure provides an abnormal data detection method and apparatus, an electronic device, a computer-readable storage medium, and a computer program product. The abnormal data detection method comprises: clustering the service data to be detected of a target dimension in a data set by using a preset clustering algorithm, and determining a current target partition point of a current-layer data node in a binary tree; partitioning the service data to be detected in the current-layer data node based on the current target partition point, to obtain the service data to be detected corresponding to each next-layer data node; iteratively determining a next-layer target partition point of a next-layer data node by using the preset clustering algorithm and partitioning the service data to be detected in the next-layer data node based on the next-layer target partition point, until a preset partition termination condition is met, to obtain a constructed binary tree; and determining an abnormal value of each piece of service data to be detected according to its position in the binary tree.

Description

Abnormal data detection method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of big data technology, and more particularly, to an abnormal data detection method and apparatus, an electronic device, a computer readable storage medium, and a computer program product.
Background
In the big data era, as the volume of business data keeps growing, data warehouses are being applied ever more widely. After a large amount of service data is generated, it is processed in the data warehouse system, which outputs a series of data warehouse models (tables); the processed models are finally synchronized to data product applications. The data quality of these models underpins the data analysis, index evaluation, and similar work of downstream data users, so the reliability of the model data is very important.
In the course of realizing the concept of the present disclosure, the inventors found at least the following problem in the related art: anomaly detection on data warehouse model data is currently unsatisfactory, and in some cases part of the abnormal data cannot be identified.
Disclosure of Invention
In view of this, the present disclosure provides an abnormal data detection method and apparatus, an electronic device, a computer-readable storage medium, and a computer program product.
One aspect of the present disclosure provides an abnormal data detection method, including:
clustering the service data to be detected of a target dimension in a data set by using a preset clustering algorithm, and determining a current target partition point of a current-layer data node in a binary tree;
partitioning the service data to be detected in the current-layer data node based on the current target partition point, to obtain the service data to be detected corresponding to each next-layer data node;
iteratively determining a next-layer target partition point of a next-layer data node by using the preset clustering algorithm and partitioning the service data to be detected in the next-layer data node based on the next-layer target partition point, until a preset partition termination condition is met, to obtain a constructed binary tree; and
determining an abnormal value of each piece of service data to be detected according to the position of that piece of service data in the binary tree.
According to an embodiment of the present disclosure, clustering the service data to be detected of the target dimension in the data set by using the preset clustering algorithm, and determining the current target partition point of the current-layer data node in the binary tree, includes:
clustering the service data to be detected of the target dimension in the data set into a first target classification data set and a second target classification data set by using the preset clustering algorithm, wherein the first target classification data set corresponds to a first target clustering center and the second target classification data set corresponds to a second target clustering center; and
determining, as the target partition point, a data point in the first target classification data set whose distance to the second target clustering center satisfies a preset distance threshold; or
determining, as the target partition point, a data point in the second target classification data set whose distance to the first target clustering center satisfies the preset distance threshold.
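The selection of a partition point from the two clusters can be sketched as follows. This is only an illustrative sketch: the text does not state what the preset distance threshold is, so the example assumes that the criterion picks the member of one cluster that lies closest to the other cluster's center. The function name `choose_partition_point` and the sample values are hypothetical.

```python
from typing import Sequence


def choose_partition_point(cluster: Sequence[float], other_center: float) -> float:
    """Return the member of `cluster` closest to the other cluster's center.

    Assumption: "satisfies a preset distance threshold" is read here as
    "has the smallest distance"; the source does not fix this criterion.
    """
    if not cluster:
        raise ValueError("cluster must not be empty")
    return min(cluster, key=lambda x: abs(x - other_center))


# Hypothetical one-dimensional clusters produced by a prior clustering step.
first_cluster = [1.0, 1.2, 1.5]
second_cluster = [9.0, 9.5, 40.0]
first_center = sum(first_cluster) / len(first_cluster)
second_center = sum(second_cluster) / len(second_cluster)

# Option 1: a point of the first cluster, measured against the second center.
split_from_first = choose_partition_point(first_cluster, second_center)
# Option 2: a point of the second cluster, measured against the first center.
split_from_second = choose_partition_point(second_cluster, first_center)
```

Either option yields a boundary value lying between the two clusters, which is what makes it usable as a cut point for a tree node.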
According to an embodiment of the present disclosure, clustering the service data to be detected of the target dimension in the data set into the first target classification data set and the second target classification data set by using the preset clustering algorithm includes:
for a current round of clustering of the service data to be detected of the target dimension in the data set with the preset clustering algorithm, iteratively determining a next first clustering center from the current first clustering center and a next second clustering center from the current second clustering center, until a preset termination condition is reached, to obtain the finally determined first target clustering center and second target clustering center; and
clustering the service data to be detected of the target dimension in the data set into the first target classification data set and the second target classification data set according to the first target clustering center and the second target clustering center.
According to an embodiment of the present disclosure, clustering the service data to be detected of the target dimension in the data set into the first target classification data set and the second target classification data set according to the first target clustering center and the second target clustering center includes:
calculating the distance between each piece of service data to be detected of the target dimension and the first target clustering center, to obtain a plurality of first distances;
calculating the distance between each piece of service data to be detected of the target dimension and the second target clustering center, to obtain a plurality of second distances; and
clustering the service data to be detected of the target dimension in the data set into the first target classification data set and the second target classification data set according to the plurality of first distances and the plurality of second distances.
According to an embodiment of the present disclosure, determining the next first clustering center from the current first clustering center and determining the next second clustering center from the current second clustering center includes:
clustering the service data to be detected of the target dimension into a current first classification data set and a current second classification data set according to the current first clustering center and the current second clustering center; and
calculating the average value of the data in the current first classification data set to obtain the next first clustering center, and calculating the average value of the data in the current second classification data set to obtain the next second clustering center.
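The iteration described above (assign each value to the nearer center, then replace each center by the mean of its group) matches the usual two-cluster k-means update. A minimal sketch for one-dimensional data follows. The initial centers, the tie-breaking rule, the iteration cap, and the tolerance are assumptions made for illustration; the text only speaks of a "preset termination condition".

```python
from typing import List, Tuple


def two_means_1d(values: List[float], max_iter: int = 100,
                 tol: float = 1e-9) -> Tuple[float, float]:
    """Return two cluster centers for 1-D data via mean-update iteration."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    # Assumption: start from the extremes; the source does not say how
    # the initial centers are chosen.
    c1, c2 = min(values), max(values)
    for _ in range(max_iter):
        # Assignment step: compare the distance to each current center.
        # Assumption: ties go to the first cluster.
        first = [v for v in values if abs(v - c1) <= abs(v - c2)]
        second = [v for v in values if abs(v - c1) > abs(v - c2)]
        # Update step: the next center is the mean of the current group.
        new_c1 = sum(first) / len(first) if first else c1
        new_c2 = sum(second) / len(second) if second else c2
        moved = max(abs(new_c1 - c1), abs(new_c2 - c2))
        c1, c2 = new_c1, new_c2
        # Assumed termination condition: the centers stop moving.
        if moved <= tol:
            break
    return c1, c2


centers = two_means_1d([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
```

For the sample input the centers settle on the means of the two obvious groups after a single update.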
According to an embodiment of the present disclosure:
there are a plurality of data sets and a plurality of binary trees, wherein the service data to be detected differs between the data sets; and
in the case that there are a plurality of binary trees, the position of each piece of service data to be detected in the binary tree is the average of the positions of that piece of service data in the plurality of binary trees.
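Averaging a record's position over several trees can be sketched as below. The sketch assumes that "position" means the depth of the node in which a record ends up, and that a record missing from one tree's data set is simply left out of that tree's contribution. Both points are assumptions, and the record identifiers are hypothetical.

```python
from statistics import mean
from typing import Dict, Hashable, List


def average_depths(depths_per_tree: List[Dict[Hashable, int]]) -> Dict[Hashable, float]:
    """Average each record's depth over the trees in which it appears."""
    collected: Dict[Hashable, List[int]] = {}
    for depths in depths_per_tree:
        for record_id, depth in depths.items():
            collected.setdefault(record_id, []).append(depth)
    return {record_id: mean(ds) for record_id, ds in collected.items()}


# Hypothetical depths of records "a" and "b" in three trees built from
# three different sampled data sets; "b" was not sampled for the third.
avg = average_depths([
    {"a": 1, "b": 3},
    {"a": 2, "b": 5},
    {"a": 3},
])
```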
According to an embodiment of the present disclosure, the method further includes:
acquiring a plurality of pieces of original service data to be detected from a service data table to be processed;
standardizing each piece of original service data to be detected, to obtain a plurality of pieces of standardized service data to be detected; and
selecting part or all of the standardized service data to be detected as the data set.
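The preprocessing steps above might look like the following sketch. The text does not say which standardization is used, so z-score scaling is assumed here, and random sampling without replacement is assumed for selecting part of the data. The function names, the sample values, and the seed are hypothetical.

```python
import random
from statistics import mean, pstdev
from typing import List


def standardize(values: List[float]) -> List[float]:
    """Z-score standardization (assumed; the source only says 'standardize')."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values identical: map everything to 0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def sample_dataset(values: List[float], size: int, seed: int = 0) -> List[float]:
    """Select part (random sample) or all of the standardized data."""
    if size >= len(values):
        return list(values)
    return random.Random(seed).sample(values, size)


raw = [2.0, 4.0, 6.0]            # hypothetical column read from a service data table
standardized = standardize(raw)
subset = sample_dataset(standardized, 2)
```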
Another aspect of the present disclosure provides an abnormal data detection apparatus, including a first determination module, a segmentation module, an iteration module, and a second determination module.
The first determining module is configured to cluster the service data to be detected of a target dimension in a data set by using a preset clustering algorithm, and to determine a current target partition point of a current-layer data node in a binary tree.
The segmentation module is configured to partition the service data to be detected in the current-layer data node based on the current target partition point, to obtain the service data to be detected corresponding to each next-layer data node.
The iteration module is configured to iteratively determine a next-layer target partition point of a next-layer data node by using the preset clustering algorithm and to partition the service data to be detected in the next-layer data node based on the next-layer target partition point, until a preset partition termination condition is met, to obtain a constructed binary tree.
The second determining module is configured to determine an abnormal value of each piece of service data to be detected according to the position of that piece of service data in the binary tree.
According to an embodiment of the disclosure, the first determining module includes a clustering unit, a first determining unit, and a second determining unit.
The clustering unit is used for clustering the business data to be detected of the target dimension in the data set into a first target classification data set and a second target classification data set by using a preset clustering algorithm, wherein the first target classification data set corresponds to a first target clustering center, and the second target classification data set corresponds to a second target clustering center.
The first determining unit is configured to determine, as the target partition point, a data point in the first target classification data set whose distance to the second target clustering center satisfies a preset distance threshold.
The second determining unit is configured to determine, as the target partition point, a data point in the second target classification data set whose distance to the first target clustering center satisfies the preset distance threshold.
According to an embodiment of the present disclosure, the clustering unit comprises an iteration subunit and a clustering subunit.
The iteration subunit is configured to, for a current round of clustering of the service data to be detected of the target dimension in the data set with the preset clustering algorithm, iteratively determine a next first clustering center from the current first clustering center and a next second clustering center from the current second clustering center, until a preset termination condition is reached, to obtain the finally determined first target clustering center and second target clustering center.
The clustering subunit is configured to cluster the service data to be detected of the target dimension in the data set into the first target classification data set and the second target classification data set according to the first target clustering center and the second target clustering center.
According to an embodiment of the present disclosure, in the clustering subunit, clustering the service data to be detected of the target dimension in the data set into the first target classification data set and the second target classification data set according to the first target clustering center and the second target clustering center includes:
calculating the distance between each piece of service data to be detected of the target dimension and the first target clustering center, to obtain a plurality of first distances;
calculating the distance between each piece of service data to be detected of the target dimension and the second target clustering center, to obtain a plurality of second distances; and
clustering the service data to be detected of the target dimension in the data set into the first target classification data set and the second target classification data set according to the plurality of first distances and the plurality of second distances.
According to an embodiment of the present disclosure, in the iteration subunit, determining the next first clustering center from the current first clustering center and determining the next second clustering center from the current second clustering center includes:
clustering the service data to be detected of the target dimension into a current first classification data set and a current second classification data set according to the current first clustering center and the current second clustering center; and
calculating the average value of the data in the current first classification data set to obtain the next first clustering center, and calculating the average value of the data in the current second classification data set to obtain the next second clustering center.
According to an embodiment of the present disclosure:
there are a plurality of data sets and a plurality of binary trees, wherein the service data to be detected differs between the data sets; and
in the case that there are a plurality of binary trees, the position of each piece of service data to be detected in the binary tree is the average of the positions of that piece of service data in the plurality of binary trees.
According to an embodiment of the present disclosure, the apparatus further includes an acquisition module, a processing module, and a selection module.
The acquisition module is configured to acquire a plurality of pieces of original service data to be detected from a service data table to be processed; the processing module is configured to standardize each piece of original service data to be detected, to obtain a plurality of pieces of standardized service data to be detected; and the selection module is configured to select part or all of the standardized service data to be detected as the data set.
Another aspect of the present disclosure provides an electronic device, comprising: one or more processors, and memory; wherein the memory is for storing one or more programs; wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the abnormal data detection method as described above.
Another aspect of the present disclosure provides a computer-readable storage medium storing computer-executable instructions that, when executed, are configured to implement the abnormal data detection method as described above.
Another aspect of the present disclosure provides a computer program product comprising computer executable instructions which, when executed, are for implementing the abnormal data detection method as described above.
The embodiments of the present disclosure provide an abnormal data detection method based on the isolation forest algorithm, aimed mainly at the situation in the related art where anomaly detection on data warehouse model data is unsatisfactory and part of the abnormal data cannot be identified. The method determines a current target partition point of a current-layer data node in a binary tree, partitions the service data to be detected in the current-layer data node based on that partition point until the tree is fully constructed, and then determines an abnormal value of each piece of service data to be detected according to its position in the binary tree. The method can detect abnormal data in a model that is difficult to identify intuitively, rather than only checking abnormal data that can be judged intuitively (such as null values, 0 values, and negative values). It thereby helps model developers and users discover problems in time and intervene manually to handle alerted problems, further ensuring the accuracy of the model data output and the stability of service data across the full link.
According to an embodiment of the present disclosure, the method integrates a clustering algorithm with an isolation forest model. During construction of the binary tree, the choice of cut point determines where a target point ends up in the tree, and choosing cut points at random affects detection accuracy. The embodiments of the present disclosure therefore optimize the cut-point selection of the conventional isolation forest algorithm: the service data to be detected of the target dimension in the data set is clustered with a preset clustering algorithm, and the current target partition point of the current-layer data node in the binary tree is determined from the clustering result.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent from the following description of embodiments thereof with reference to the accompanying drawings in which:
FIG. 1 schematically illustrates an exemplary system architecture to which the abnormal data detection method and apparatus of the present disclosure may be applied;
FIG. 2 schematically illustrates a flow chart of an abnormal data detection method according to an embodiment of the present disclosure;
FIG. 3 schematically illustrates a flow chart of an abnormal data detection method according to another embodiment of the present disclosure;
FIG. 4 schematically illustrates a block diagram of an abnormal data detection apparatus according to an embodiment of the present disclosure; and
FIG. 5 schematically illustrates a block diagram of an electronic device for implementing an abnormal data detection method according to an embodiment of the present disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is only exemplary and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the present disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. In addition, in the following description, descriptions of well-known structures and techniques are omitted so as not to unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and/or the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It should be noted that the terms used herein should be construed to have meanings consistent with the context of the present specification and should not be construed in an idealized or overly formal manner.
Where an expression such as "at least one of A, B, and C, etc." is used, it should generally be interpreted in accordance with the meaning commonly understood by those skilled in the art (e.g., "a system having at least one of A, B, and C" shall include, but not be limited to, systems having A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together). Where an expression such as "at least one of A, B, or C, etc." is used, it should likewise be interpreted in accordance with the meaning commonly understood by those skilled in the art (e.g., "a system having at least one of A, B, or C" shall include, but not be limited to, systems having A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together).
In the technical solutions of the present disclosure, the acquisition, storage, and application of the users' personal information involved comply with relevant laws and regulations, necessary security measures are taken, and public order and good customs are not violated.
In the big data era, as the volume of business data keeps growing, data warehouse models come in more and more varieties, including buffer models, general models, aggregation models, dimension models, application models, and the like. After a large amount of service data is generated, it is processed in the data warehouse system, which outputs a series of data warehouse models (tables); the processed models are finally synchronized to data product applications. The data models in the data warehouse support business departments by providing data information, unified enterprise data views, and the like, and the data quality of the models underpins the data analysis, index evaluation, and similar work of downstream model users. Once the model data is wrong, business effect evaluation and important decisions may be affected, causing asset loss. It is therefore very important to ensure the reliability and accuracy of the data of each model in the data warehouse.
To ensure the accuracy of data in data warehouse models, a big data scheduling platform monitors all data processing schedules. At present, data quality checks are generally configured on scheduled tasks: after the corresponding task finishes, the model result data is queried and checked. Common data quality check rules (check SQL statements) are as follows:
1. Checking for duplicate data in the model:
SELECT COUNT(*) FROM (SELECT primary key field FROM table WHERE filter condition GROUP BY primary key field HAVING COUNT(*) > 1) a;
Note: if the query result is greater than 0, the model contains duplicate data; if the query result equals 0, the model contains no duplicate data.
2. Checking the data volume of the model:
SELECT COUNT(*) FROM table WHERE filter condition;
3. Checking for null values:
SELECT COUNT(*) FROM table WHERE field a IS NULL;
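The three check rules can be tried out against a small in-memory table. The sketch below uses Python's built-in `sqlite3` module purely for illustration; the table name `orders`, its columns, and the sample rows are hypothetical and do not come from the source, and a production data warehouse would use its own SQL engine.

```python
import sqlite3

# Hypothetical model table with one duplicated key and one NULL amount.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL, dt TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, 10.0, "2022-01-01"),
        (2, None, "2022-01-01"),
        (2, 25.0, "2022-01-01"),
        (3, 30.0, "2022-01-02"),
    ],
)
day = ("2022-01-01",)  # filter condition, assumed to be a date partition

# 1. Duplicate check: number of primary-key values that occur more than once.
duplicate_keys = conn.execute(
    "SELECT COUNT(*) FROM ("
    " SELECT order_id FROM orders WHERE dt = ?"
    " GROUP BY order_id HAVING COUNT(*) > 1) a",
    day,
).fetchone()[0]

# 2. Data volume check: number of rows matching the filter condition.
row_count = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE dt = ?", day
).fetchone()[0]

# 3. Null check: number of rows whose field is NULL.
null_count = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE dt = ? AND amount IS NULL", day
).fetchone()[0]

conn.close()
```

A result greater than 0 for the first or third query would flag the table. None of the three queries would notice an amount that is merely implausible, which is the gap discussed next.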
As described above, most existing data quality check rules verify duplicate data and data volume in a model, or check abnormal data that can be judged intuitively, such as null values, 0 values, and negative values. Other abnormal values in a model, such as amount data in an order model, volume and weight data in a commodity model, or bin quotation data in a stock model, may be written as abnormal data that cannot be identified intuitively because of production errors, irregular business operations, and the like. Such values are difficult to detect and monitor with simple data quality check rules, which affects the accuracy of the data analysis performed by downstream model users. In other words, these rules detect anomalies in data warehouse model data poorly, and part of the abnormal data cannot be identified.
In view of this, the present disclosure provides an abnormal data detection method that integrates a clustering algorithm with an isolation forest model, for detecting abnormal data in a model that is difficult to identify intuitively.
Before describing embodiments of the present disclosure in detail, the following description is given to a system structure and an application scenario related to the method provided by the embodiments of the present disclosure.
FIG. 1 schematically illustrates an exemplary system architecture 100 to which the abnormal data detection method and apparatus of the present disclosure may be applied. It should be noted that FIG. 1 is only an example of a system architecture to which embodiments of the present disclosure may be applied, intended to help those skilled in the art understand the technical content of the present disclosure; it does not mean that embodiments of the present disclosure cannot be used in other devices, systems, environments, or scenarios.
As shown in FIG. 1, the system architecture 100 according to this embodiment may include a terminal device 101, a server 102, and a data warehouse 103. The terminal device 101, the server 102, and the data warehouse 103 may communicate via a network, which may include various connection types, such as wired and/or wireless communication links.
A user may interact with the server 102 over a network using the terminal device 101 to receive or send messages, etc. Various communication client applications may be installed on the terminal device 101, such as shopping class applications, web browser applications, search class applications, instant messaging tools, mailbox clients and/or social platform software, to name a few.
The terminal device 101 may be a variety of electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.
The server 102 may be a server providing various services, such as a background management server that supports websites browsed by users using terminal devices. The services may include, but are not limited to, a first service, a second service, a third service, and a fourth service, each of which may be, for example, a service that supports a website browsed by the user using the terminal device. The background management server may analyze and otherwise process received data such as user requests, and feed the processing results (e.g., web pages, information, or data acquired or generated according to the user requests) back to the terminal device.
The data warehouse 103 is configured to process the large volume of service data after it is generated, output a series of data warehouse models (tables), and finally synchronize the processed data warehouse models to data product applications.
According to an embodiment of the present disclosure, the server 102 may perform detection of abnormal data on various types of model data tables (e.g., commodity tables, warehouse tables, order tables, user tables, etc.) output from the data warehouse 103 by performing the abnormal data detection method of the embodiment of the present disclosure.
It should be noted that the abnormal data detection method provided by the embodiments of the present disclosure may generally be performed by the server 102. Accordingly, the abnormal data detection apparatus provided by the embodiments of the present disclosure may generally be disposed in the server 102. The abnormal data detection method may also be performed by a server or a server cluster that is different from the server 102 and is capable of communicating with the terminal device 101 and/or the server 102. Accordingly, the abnormal data detection apparatus may also be disposed in a server or a server cluster that is different from the server 102 and is capable of communicating with the terminal device 101 and/or the server 102. Alternatively, the abnormal data detection method may be performed by the terminal device 101, or by another terminal device different from the terminal device 101. Accordingly, the abnormal data detection apparatus may also be disposed in the terminal device 101, or in another terminal device different from the terminal device 101.
According to an embodiment of the present disclosure, a user may use the terminal device 101 to interact with the server 102 over the network and send the server a request for an abnormal data detection result. In response to the request, the server 102 obtains a model data table of the corresponding type (such as a commodity table, a warehouse table, an order table, or a user table) from the data warehouse 103, performs abnormal data detection on the model data table by executing the abnormal data detection method of the embodiments of the present disclosure to obtain the abnormal data detection result, and returns the detection result to the terminal device 101.
It should be understood that the numbers of terminal devices and servers in FIG. 1 are merely illustrative. There may be any number of terminal devices and servers, as required by the implementation.
Fig. 2 schematically illustrates a flowchart of an abnormal data detection method according to an embodiment of the present disclosure.
As shown in fig. 2, the method includes operations S201 to S204.
In operation S201, the service data to be detected of a target dimension in a data set is clustered by using a preset clustering algorithm, and a current target division point of a current layer data node in a binary tree is determined;
in operation S202, the service data to be detected in the current layer data node is divided based on the current target division point, so as to obtain the service data to be detected respectively corresponding to the next layer data nodes;
in operation S203, the following is executed iteratively: a next-layer target division point of a next layer data node is determined by using the preset clustering algorithm, and the service data to be detected in the next layer data node is divided based on the next-layer target division point, until a preset division termination condition is satisfied, so as to obtain a constructed binary tree; and
in operation S204, an abnormal value of each piece of service data to be detected is determined according to the position of each piece of service data to be detected in the binary tree.
According to the embodiments of the present disclosure, the abnormal data detection method performs abnormal data detection on a model data table of the corresponding type (such as a commodity table, a warehouse table, an order table, or a user table) in a data warehouse to obtain an abnormal data detection result. The detection result is then monitored, with early warning, through a data quality product, which helps model developers and users discover and handle abnormal problems in time.
According to the embodiment of the disclosure, each piece of service data to be detected is multidimensional information data, and the information dimensions contained in different pieces of service data to be detected are the same. For example, each piece of data to be detected may be order data, each piece of order data includes a plurality of preset information dimensions, and the plurality of preset information dimensions are used for describing detailed data of the order, for example, the plurality of preset information dimensions included in each piece of order data may be: sales order number, commodity SKU number, pre-offer amount, sales amount, and the like. For another example, each piece of data to be detected may be warehouse data, each piece of warehouse data includes a plurality of preset information dimensions, and the plurality of preset information dimensions are used for describing detailed data of the piece of warehouse data, for example, the plurality of preset information dimensions included in each piece of warehouse data may be: commodity SKU numbers, warehouse numbers, distribution center numbers, inventory amounts, bin quotes, and the like.
According to an embodiment of the disclosure, the method is an abnormal data detection method integrating a clustering algorithm and an isolated forest model.
The basic idea of detecting abnormal model data with the isolated forest (Isolation Forest) algorithm is as follows. The algorithm defines an anomaly as an outlier that is easily isolated, which can be understood as a point that lies in a sparse region, far from any high-density group. For a given multi-dimensional data set, the algorithm recursively and randomly partitions the data set until all data points are isolated. This partitioning strategy can be expressed as the construction of a binary tree: at each split, the two resulting parts of the data are placed in the left subtree and the right subtree of the current node, until the node data can no longer be split or the maximum depth of the tree is reached. Normal points in the data set are generally concentrated in dense clusters, so many cuts are needed to isolate them. Abnormal points are rare and lie in low-density regions, so they are isolated easily; they are generally assigned to leaf nodes sooner, that is, they have shorter paths.
The specific implementation process of data detection by using an isolated forest is divided into two stages:
Stage one: constructing an isolated forest formed by a preset number of isolated binary trees;
stage two: an anomaly score for each piece of data in the dataset is calculated.
According to the embodiments of the present disclosure, each time the isolated forest algorithm cuts the data of a current layer data node in the binary tree, a target division point is determined first, and the data of the current data node is then divided by the hyperplane on which the target division point lies. This determines the left subtree and the right subtree of the node, and repeating the process completes the construction of the tree.
According to an embodiment of the present disclosure, the current target division point is determined from the service data to be detected in a preselected target dimension, and lies between the maximum value and the minimum value of the specified field (information dimension) in the current node data. For example, for order data, the price dimension may be selected as the dimension along which the current data node is cut. Suppose the current data node contains 5 pieces of order data to be detected, with prices of 10, 12, 9, 10, and 15 yuan respectively; the division point may ultimately be determined to be the data with a price of 10 yuan, which lies between the maximum of 15 yuan and the minimum of 9 yuan.
In the related art, the target division point is determined by randomly selecting a value within the data range as the cutting point, which determines the left subtree and the right subtree of the binary tree and completes the construction of the tree.
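For contrast with the clustering-based selection described next, the random selection used in the related art can be sketched as follows. This is a minimal illustration only: the function name `random_split_point`, the fixed seed, and the use of the five example prices are assumptions made for the sketch, not part of the disclosure.

```python
import random

def random_split_point(values, rng):
    """Draw a cut value uniformly between the field's minimum and maximum."""
    low, high = min(values), max(values)
    return rng.uniform(low, high)

prices = [10, 12, 9, 10, 15]   # price field of the five example orders
rng = random.Random(0)         # fixed seed only so the sketch is repeatable
cut = random_split_point(prices, rng)

# Values below the cut go to the left subtree, the rest to the right subtree.
left = [p for p in prices if p < cut]
right = [p for p in prices if p >= cut]
```

Because the cut is drawn without regard to where the data are dense, two runs can place the same record at very different depths, which is the sensitivity the embodiments aim to reduce.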
According to the embodiments of the present disclosure, during the construction of the binary tree, the choice of cutting point determines where a target point ends up in the binary tree, and choosing the cutting point at random affects detection accuracy. The embodiments of the present disclosure therefore optimize the way the traditional isolated forest algorithm selects cutting points: in operation S201, the service data to be detected of the target dimension in the data set is clustered by using a preset clustering algorithm to determine the current target division point of the current layer data node in the binary tree. The preset clustering algorithm may be a two-class k-means clustering algorithm (k-means with two clusters), which is used to select, as the cutting point for constructing the binary tree, a data point that isolates abnormal points more easily. The preset clustering algorithm may also be a clustering algorithm other than two-class k-means, such as the DBSCAN clustering algorithm.
According to an embodiment of the present disclosure, the operations S201, S202, and S203 are the operations of constructing a binary tree by using the isolated forest construction method combined with the clustering algorithm. After determining the current target division point of the current layer data node in the binary tree in operation S201, in operation S202 and operation S203, the service data to be detected in the current layer data node is divided sequentially based on the current target division point, so as to determine the left subtree and the right subtree of each data node in the current layer of the binary tree in sequence, and finally complete the construction of the tree.
According to an embodiment of the present disclosure, the preset division termination condition may be that the data of the current data node can no longer be divided (only one piece of data remains) or that the preset maximum depth of the tree has been reached, at which point the recursive process ends.
According to an embodiment of the present disclosure, in operation S204, an abnormal value of each piece of service data to be detected is determined according to its position in the binary tree. The anomaly score of each piece of service data to be detected in the data set is related to its depth in the isolated tree: the shallower the depth, the higher the anomaly score, and the deeper the depth, the lower the anomaly score.
The embodiments of the present disclosure provide an abnormal data detection method based on the isolated forest algorithm, aimed mainly at the problem in the related art that anomaly detection on data warehouse model data is unsatisfactory and some abnormal data cannot be identified. After the current target division point of the current layer data node in the binary tree is determined in operation S201, operations S202 and S203 divide the service data to be detected in each data node layer by layer based on the corresponding target division point, thereby determining the left subtree and the right subtree of each data node in turn and finally completing the construction of the tree. Operation S204 then determines the abnormal value of each piece of service data to be detected according to its position in the binary tree. The method is not limited to checking abnormal data that can be judged by inspection (such as null values, zero values, and negative values); it can also detect abnormal data in the model that is difficult to identify by inspection. This helps model developers and users find problems in time, send alarm notifications or automatically block task services, and handle the alarmed problems through manual intervention, which in turn helps ensure the accuracy of model data output and the stability of full-link service data.
According to an embodiment of the present disclosure, the method is an abnormal data detection method integrating a clustering algorithm and an isolated forest model. During the construction of the binary tree, the choice of cutting point determines where a target point ends up in the binary tree, and choosing the cutting point at random affects detection accuracy. The embodiments of the present disclosure therefore optimize the way the traditional isolated forest algorithm selects cutting points: in operation S201, the service data to be detected of the target dimension in the data set is clustered by using the preset clustering algorithm to determine the current target division point of the current layer data node in the binary tree. In this way, the accuracy of abnormal data detection can be improved over the original model. In addition, selecting a data point that isolates abnormal points more easily as the cutting point can speed up model computation, which improves detection efficiency and helps ensure the timeliness of full-link service data.
According to an embodiment of the present disclosure, there may be a plurality of data sets and a plurality of binary trees, and the service data to be detected differs from one data set to another. Where there are a plurality of binary trees, the position of each piece of service data to be detected in the binary tree is the average of its positions in the plurality of binary trees.
According to embodiments of the present disclosure, the anomaly score of each data point in the data set is related to its depth in each isolated tree: the shallower the depth, the higher the anomaly score, and the deeper the depth, the lower the anomaly score. Because the construction of an isolated tree involves randomness, the reliability of the anomaly score can be improved by constructing a plurality of binary trees and using the average path length of each data point across those isolated trees as its depth.
According to an embodiment of the present disclosure, further comprising:
acquiring a plurality of pieces of original service data to be detected from a service data table to be processed;
carrying out standardization processing on each piece of original service data to be detected to obtain a plurality of pieces of standardized service data to be detected;
and selecting part or all of the data from the standardized service data to be detected as a data set.
According to the embodiment of the disclosure, the influence of the unit or the magnitude of the model field can be eliminated by carrying out standardized processing on the service data to be detected, so that the model field is used as the basic data for subsequently establishing the data set, and the accuracy of data calculation can be improved.
According to an embodiment of the present disclosure, clustering service data to be detected of a target dimension in a data set by using a preset clustering algorithm, and determining a current target partition point of a current layer data node in a binary tree includes:
Clustering the business data to be detected of the target dimension in the dataset into a first target classification dataset and a second target classification dataset by using a preset clustering algorithm, wherein the first target classification dataset corresponds to a first target clustering center, and the second target classification dataset corresponds to a second target clustering center;
determining, as the target division point, a data point in the first target classification data set whose distance from the second target clustering center satisfies a preset distance threshold, which may be the data point closest to the second target clustering center; or determining, as the target division point, a data point in the second target classification data set whose distance from the first target clustering center satisfies the preset distance threshold, which may be the data point closest to the first target clustering center.
According to the embodiments of the present disclosure, the preset clustering algorithm may be, for example, a two-class k-means clustering algorithm (k-means with two clusters); a clustering algorithm other than two-class k-means, such as the DBSCAN clustering algorithm, may also be adopted.
According to an embodiment of the disclosure, the operation of clustering the to-be-detected business data of the target dimension in the dataset into the first target classification dataset and the second target classification dataset by using a preset clustering algorithm may be that distances between a plurality of to-be-detected business data of the target dimension and two target clustering centers are calculated according to the determined first target clustering center and the determined second target clustering center, and the to-be-detected business data are divided into classification datasets corresponding to the target clustering centers with smaller distances.
According to an embodiment of the present disclosure, the above operation may proceed, for example, as follows. For order data, the price dimension is selected as the target dimension along which the current data node is cut. The current data node contains 5 pieces of order data to be detected, with prices of 10, 12, 9, 10, and 15 yuan respectively. These five pieces of data are clustered into two classification data sets: the first contains the three pieces of data priced at 9, 10, and 10 yuan, and the second contains the two pieces of data priced at 12 and 15 yuan.
According to the embodiments of the present disclosure, a data point in the first target classification data set whose distance from the second target clustering center satisfies a preset distance threshold is determined to be the target division point; or a data point in the second target classification data set whose distance from the first target clustering center satisfies the preset distance threshold is determined to be the target division point. In the above example, the point in the first classification data set (9, 10, and 10 yuan) closest to the clustering center of the second classification data set (12 and 15 yuan), which is 13.5, is 10 yuan, and it may be taken as the target division point. Alternatively, the point in the second classification data set closest to the clustering center of the first classification data set, which is approximately 9.7, is 12 yuan, and it may be taken as the target division point.
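The arithmetic of this example can be checked with a short sketch. The two classification data sets are taken as given from the example; the helper names `mean` and `nearest_to` are hypothetical and introduced only for illustration.

```python
def mean(values):
    """Arithmetic mean, used here as the clustering center of a set."""
    return sum(values) / len(values)

def nearest_to(center, candidates):
    """Return the candidate value closest to center (first one wins on ties)."""
    return min(candidates, key=lambda v: abs(v - center))

first_set = [9, 10, 10]    # first classification data set (prices in yuan)
second_set = [12, 15]      # second classification data set

first_center = mean(first_set)     # 29 / 3, roughly 9.7
second_center = mean(second_set)   # 13.5

# Option 1: point of the first set closest to the second set's center.
split_from_first = nearest_to(second_center, first_set)    # 10
# Option 2: point of the second set closest to the first set's center.
split_from_second = nearest_to(first_center, second_set)   # 12
```

Either candidate sits at the boundary between the two clusters, which is why the disclosure treats it as a point that separates the data more cleanly than a randomly drawn value.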
According to the embodiments of the present disclosure, the data set is first divided into two classification data sets by the clustering algorithm, and the target division point is then determined on the basis of those classification data sets. The selection of the target division point is thus grounded in data clustering, so using it as the cutting point for constructing the binary tree allows the data to be classified and cut more accurately, which improves the accuracy of abnormal data detection. Further, the point in one classification data set that is closest to the clustering center of the other classification data set is used as the target cutting point; such a point isolates abnormal points more easily. In this way, detection accuracy can be improved over the original model, and model computation can be accelerated, which improves detection efficiency and helps ensure the timeliness of full-link service data.
According to an embodiment of the present disclosure, clustering service data to be detected of a target dimension in a dataset into a first target classification dataset and a second target classification dataset using a preset clustering algorithm includes:
(1) First, in each round of clustering the service data to be detected of the target dimension in the data set by using the preset clustering algorithm, the following is iterated: the next first clustering center is determined according to the current first clustering center, and the next second clustering center is determined according to the current second clustering center, until a preset termination condition is satisfied, so as to obtain the finally determined first target clustering center and second target clustering center. According to the embodiments of the present disclosure, the next clustering center is determined from the current classification data set; the mean of all data in the current classification data set may be used as the next clustering center.
(2) And then, clustering the business data to be detected of the target dimension in the data set into a first target classification data set and a second target classification data set according to the first target clustering center and the second target clustering center. According to an embodiment of the disclosure, the operation of clustering the to-be-detected business data of the target dimension in the dataset into the first target classification dataset and the second target classification dataset by using a preset clustering algorithm may be that distances between a plurality of to-be-detected business data of the target dimension and two target clustering centers are calculated according to the determined first target clustering center and the determined second target clustering center, and the to-be-detected business data are divided into classification datasets corresponding to the target clustering centers with smaller distances.
According to the embodiments of the present disclosure, because the operation of determining the next clustering center from the current clustering center is performed iteratively, the target clustering centers are the result of repeated iterative computation and are therefore more reliable. The target cutting point determined from these target clustering centers is correspondingly more accurate, so using it as the cutting point for constructing the binary tree allows the data to be classified and cut more accurately, which improves the accuracy of abnormal data detection.
According to an embodiment of the present disclosure, in the foregoing operation, determining the next first cluster center according to the current first cluster center and determining the next second cluster center according to the current second cluster center specifically includes:
clustering the service data to be detected of the target dimension into a current first classification data set and a current second classification data set according to the current first clustering center and the current second clustering center;
calculating the mean of the data in the current first classification data set to obtain the next first clustering center; and calculating the mean of the data in the current second classification data set to obtain the next second clustering center.
According to the embodiments of the present disclosure, with this method the mean of all data in each current classification data set is used as the next clustering center, so the determined clustering centers are more reliable. The target cutting point determined from the target clustering centers is correspondingly more accurate, so using it as the cutting point for constructing the binary tree allows the data to be classified and cut more accurately, which improves the accuracy of abnormal data detection.
According to an embodiment of the present disclosure, in the foregoing operation, clustering, according to a first target clustering center and a second target clustering center, service data to be detected of a target dimension in a dataset into a first target classification dataset and a second target classification dataset specifically includes:
calculating the distance between each piece of service data to be detected of the target dimension and the first target clustering center to obtain a plurality of first distances;
calculating the distance between each piece of service data to be detected of the target dimension and the second target clustering center to obtain a plurality of second distances;
and clustering the business data to be detected of the target dimension in the data set into a first target classification data set and a second target classification data set according to the first distances and the second distances.
According to the embodiments of the present disclosure, with this method the distances between each piece of service data to be detected of the target dimension and the determined first and second target clustering centers are calculated, and each piece of service data is assigned to the classification data set corresponding to the nearer target clustering center. The data in the resulting classification data sets are more tightly aggregated, and the target cutting point determined from them is more accurate, so using it as the cutting point for constructing the binary tree allows the data to be classified and cut accurately, which improves the accuracy of abnormal data detection.
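The alternation between assigning data to the nearer clustering center and recomputing each center as a mean can be sketched on a single field as follows. This is an illustrative sketch under stated assumptions: the function names, the choice of 10 and 12 as initial centers, the tie rule (equal distances go to the first class), and the fixed iteration budget are all assumptions, and only the price example comes from the text.

```python
def assign(values, center_a, center_b):
    """Put each value into the class of the nearer center (ties go to class A)."""
    set_a, set_b = [], []
    for v in values:
        if abs(v - center_a) <= abs(v - center_b):
            set_a.append(v)
        else:
            set_b.append(v)
    return set_a, set_b

def two_means_1d(values, center_a, center_b, max_iter=10):
    """Repeat assignment and mean update for a fixed number of iterations."""
    for _ in range(max_iter):
        set_a, set_b = assign(values, center_a, center_b)
        if not set_a or not set_b:
            break  # assumption: keep the previous centers if a class empties
        center_a = sum(set_a) / len(set_a)   # next first clustering center
        center_b = sum(set_b) / len(set_b)   # next second clustering center
    set_a, set_b = assign(values, center_a, center_b)
    return center_a, center_b, set_a, set_b

prices = [10, 12, 9, 10, 15]
# Assumption: the first two samples, 10 and 12, serve as the initial centers.
center_a, center_b, set_a, set_b = two_means_1d(prices, 10, 12)
```

With these initial centers the sketch settles on the same grouping as the example, 9, 10, and 10 yuan against 12 and 15 yuan. Other initial centers can converge to a different grouping, since k-means only finds a local optimum.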
Fig. 3 schematically illustrates a flowchart of an abnormal data detection method according to another embodiment of the present disclosure, and the method of the embodiment of the present disclosure is further exemplarily described below with reference to fig. 3.
1. First, data initialization is performed, and a data model table to be processed (such as a commodity table, a warehouse table, an order table, a user table and the like) is obtained from a data warehouse:
$$X = \{x_1, x_2, \ldots, x_n\}$$
In the data table, the number of pieces of data is n and the number of fields (dimensions) is d; the iteration variables are initialized as k = 1 and q = 1. Each piece of service data to be detected in the table is multidimensional (d-dimensional) information data. For example, each piece of data to be detected may be order data; each piece of order data includes a plurality of preset information dimensions that describe the details of the order, such as: sales order number, commodity SKU number, pre-offer amount, sales amount, and the like.
2. Data preprocessing is performed: the data is standardized as shown in the following formula (1) to eliminate the influence of the units or magnitudes of the model fields:
$$x'_{ij} = \frac{x_{ij} - x_j^{\min}}{x_j^{\max} - x_j^{\min}} \qquad (1)$$
In the above formula, $x_{ij}$ is the value of the j-th field of the i-th piece of data, $x_j^{\max}$ is the maximum value in the j-th field of the model, and $x_j^{\min}$ is the minimum value in the j-th field of the model.
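As a rough illustration of this preprocessing step, the sketch below scales every field into the range 0 to 1 using the per-field maximum and minimum. It assumes the standardization is min-max scaling, which is consistent with the maximum and minimum described above. The function name, the sample rows, and the handling of a constant field are assumptions for the sketch.

```python
def min_max_normalize(rows):
    """Scale each field (column) of equal-length numeric rows into [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    normalized = []
    for row in rows:
        normalized.append([
            # Assumption: a field whose values are all equal is mapped to 0.0.
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, lo, hi in zip(row, lows, highs)
        ])
    return normalized

# Hypothetical rows: [price in yuan, inventory amount]
rows = [[10, 200], [12, 400], [9, 300], [10, 200], [15, 600]]
scaled = min_max_normalize(rows)
```

After scaling, a price measured in yuan and an inventory count measured in units contribute on the same scale, so no single field dominates later distance calculations merely because of its magnitude.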
3. Constructing an isolated forest consisting of t isolated binary trees.
The number of binary trees is set to t, the root node of each binary tree is initialized as $N_k$ (k = 1, 2, …, t), and the r-th node of the l-th level of a binary tree is denoted $N_{lr}$.
(1) Initialize l = 0 and r = 1, so that $N_{01} = N_k$.
(2) Randomly select h pieces of data from the model data table as a sub-data set, and put the sub-data set into the root node $N_k$ of the k-th tree.
(3) Select a field (dimension, such as the commodity price dimension), and use the two-class k-means clustering algorithm to generate a current target division point p in the data of the current node $N_{lr}$, where p lies between the maximum value and the minimum value of the specified field in the data of the current node $N_{lr}$.
The specific method for generating the current target division point p in the data of the current node $N_{lr}$ by using the two-class k-means clustering algorithm is as follows:
(3.1) select two samples in the current node data (for example, two pieces of data with commodity prices of 100 yuan and 200 yuan) as the initial clustering centers $a = \{a_1, a_2\}$;
(3.2) for each sample $x_z$ in the node data ($x_z \in N_{lr}$), calculate its distance to each of the two clustering centers, and assign it to the class corresponding to the nearer clustering center;
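(3.3) for the two classes 1 and 2, recalculate the clustering centers as shown in the following formula (2):
$$a_i = \frac{1}{|c_i|} \sum_{x_z \in c_i} x_z, \quad i = 1, 2 \qquad (2)$$
where, in the above formula (2), $c_i$ (i = 1, 2) denotes the data assigned to class i, and $|c_i|$ is the number of data in that class.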
As described above, according to the embodiments of the present disclosure, the next clustering center is determined from the current clustering center by calculating the mean of all data in the current classification data set, that is, the centroid of the classification data set, and using it as the next clustering center.
(3.4) let q = q + 1 and repeat steps (3.2) and (3.3) until the termination condition is satisfied, that is, until the number of iterations reaches the maximum value m (m is a preset parameter);
(3.5) after classification is completed, select the data point in class 1 that is closest to the clustering center $a_2$ of class 2 as the current target division point (for example, the current target division point is ultimately determined to be the data with a price of 140 yuan).
(4) Form a hyperplane based on the division point p, dividing the data space of the current node $N_{lr}$ into 2 subspaces: data whose value in the specified field (dimension) is less than p is placed in the left child of the current node $N_{lr}$, forming a new child node $N_{(l+1)(2r-1)}$, and data whose value is greater than or equal to p is placed in the right child of the current node $N_{lr}$, forming a new child node $N_{(l+1)(2r)}$ (for example, among the pieces of order data to be detected, data with a price less than 140 yuan is placed in the left subtree node, and data with a price greater than or equal to 140 yuan is placed in the right subtree node);
(5) Let l = l + 1 and r = 1, 2, …, $2^l$, and recursively perform steps (3) and (4) on each node of the new set of child nodes $N_{lr}$ (r = 1, 2, …, $2^l$) to continue constructing new child nodes; when the data can no longer be divided or the maximum depth of the tree, $\log_2 h$, has been reached, the recursive process ends.
(6) Let k = k + 1 and repeat steps (1) through (5) until t binary trees have been built, that is, k = t, at which point the construction of the isolated forest is complete.
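The construction of a single tree in steps (3) to (5) can be sketched as below. This is a simplified illustration, not the implementation of the disclosure: all function names and the dictionary layout of a node are hypothetical, the field is chosen at random at each node, and when step (3.5) would leave the left side empty the sketch falls back to the alternative rule (the class-2 point nearest the class-1 center). Both choices are assumptions.

```python
import math
import random

def cluster_split_point(values, rng, max_iter=10):
    """Two-cluster k-means on one field, then pick the class-1 value nearest a2."""
    distinct = sorted(set(values))
    if len(distinct) < 2:
        return None                                  # all values equal: cannot split
    a1, a2 = rng.sample(distinct, 2)                 # step (3.1): two samples as centers
    class1, class2 = [], []
    for _ in range(max_iter):                        # step (3.4): at most m iterations
        class1 = [v for v in values if abs(v - a1) <= abs(v - a2)]  # step (3.2)
        class2 = [v for v in values if abs(v - a1) > abs(v - a2)]
        if not class1 or not class2:
            return None
        a1 = sum(class1) / len(class1)               # step (3.3): class means
        a2 = sum(class2) / len(class2)
    p = min(class1, key=lambda v: abs(v - a2))       # step (3.5)
    if p == min(values):
        # Assumption: nothing would fall left of p, so use the alternative rule.
        p = min(class2, key=lambda v: abs(v - a1))
    return p

def build_tree(rows, depth, max_depth, rng):
    """Steps (4) and (5): split at p until unsplittable or max depth is reached."""
    if len(rows) <= 1 or depth >= max_depth:
        return {"size": len(rows)}
    field = rng.randrange(len(rows[0]))              # assumption: random field per node
    p = cluster_split_point([row[field] for row in rows], rng)
    if p is None:
        return {"size": len(rows)}
    left = [row for row in rows if row[field] < p]
    right = [row for row in rows if row[field] >= p]
    return {"field": field, "split": p,
            "left": build_tree(left, depth + 1, max_depth, rng),
            "right": build_tree(right, depth + 1, max_depth, rng)}

def path_length(tree, row):
    """Number of splits passed before the row reaches a leaf."""
    depth = 0
    while "split" in tree:
        tree = tree["left"] if row[tree["field"]] < tree["split"] else tree["right"]
        depth += 1
    return depth

rows = [[10], [12], [9], [10], [15], [100]]          # one field; 100 is the odd one out
rng = random.Random(0)
tree = build_tree(rows, 0, math.ceil(math.log2(len(rows))), rng)
depth_outlier = path_length(tree, [100])
depth_typical = path_length(tree, [10])
```

In this small data set the record with value 100 reaches a leaf after fewer splits than the record with value 10, which matches the idea that abnormal points have shorter paths. Repeating `build_tree` on t random sub-data sets would give the forest of step (6).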
4. After the construction of the isolated forest is completed, calculating the abnormal score of each business data to be detected in the data set:
According to an embodiment of the present disclosure, the depth of each piece of service data to be detected $x_i$ (i = 1, 2, …, n) in the isolated trees is judged by the average value $E(h(x_i))$ of its path length $h(x_i)$ over the isolated trees.
The anomaly score of each piece of data $x_i$ is defined as shown in the following formula (3):
$$s(x_i, n) = 2^{-\frac{E(h(x_i))}{c(n)}} \qquad (3)$$
In the above formula (3), c(n) is the average path length for a given data amount n and is used to normalize the path length $h(x_i)$ of the data $x_i$; it is calculated as shown in the following formula (4):
$$c(n) = 2H(n-1) - \frac{2(n-1)}{n} \qquad (4)$$
In the above formula (4), H(n-1) is a harmonic number, which can be estimated as ln(n-1) plus Euler's constant.
According to an embodiment of the present disclosure, the following can be obtained from the above formula (3):
when $E(h(x_i)) \to 0$, $s(x_i, n) \to 1$, and the probability that the data $x_i$ is judged to be abnormal data is higher;
when $E(h(x_i)) \to n-1$, $s(x_i, n) \to 0$, and the data $x_i$ can be judged to be normal data;
when $E(h(x_i)) \to c(n)$, $s(x_i, n) \to 0.5$, and the data $x_i$ is judged to have no obvious abnormality.
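The scoring stage can be sketched as follows. The sketch assumes the standard isolation forest forms $s(x_i, n) = 2^{-E(h(x_i))/c(n)}$ and $c(n) = 2H(n-1) - 2(n-1)/n$, which agree with the limiting cases listed above. The function names, the requirement n ≥ 2, and the sample path lengths are assumptions for illustration.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler's constant, used to estimate the harmonic number

def average_path_length(n):
    """c(n): normalizing average path length for a data amount n (requires n >= 2)."""
    if n < 2:
        raise ValueError("n must be at least 2")
    harmonic = math.log(n - 1) + EULER_GAMMA   # estimate of H(n-1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(path_lengths, n):
    """s(x, n) from the path lengths of one record across all trees."""
    mean_path = sum(path_lengths) / len(path_lengths)   # E(h(x))
    return 2.0 ** (-mean_path / average_path_length(n))

n = 256
# Hypothetical path lengths of two records in three trees.
score_short = anomaly_score([1, 2, 1], n)     # isolated quickly, score nearer 1
score_long = anomaly_score([9, 11, 10], n)    # isolated slowly, lower score
score_mid = anomaly_score([average_path_length(n)], n)
```

A record whose mean path length equals c(n) receives a score of 0.5, and shorter mean paths push the score toward 1. A threshold on this score, chosen by the operator, would then decide which records are reported as abnormal.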
Through the above flow, the detection of the abnormal data in the model table is completed, and then, the corresponding data quality check rule can be configured through the data quality product, so that the monitoring and early warning of the abnormal data of the model can be realized, the abnormal data in the model can be conveniently found and processed in time, the accuracy of the output of the model data is further ensured, and higher-quality data support service is provided for a downstream model user.
Fig. 4 schematically illustrates a block diagram 400 of an abnormal data detection apparatus according to an embodiment of the present disclosure. The abnormal data detecting apparatus 400 may be used to implement the method shown with reference to fig. 2.
As shown in fig. 4, the abnormal data detecting apparatus 400 includes a first determining module 401, a dividing module 402, an iterating module 403, and a second determining module 404.
The first determining module 401 is configured to cluster to-be-detected service data of a target dimension in a data set by using a preset clustering algorithm, and determine a current target partition point of a current layer data node in a binary tree.
The segmentation module 402 is configured to partition the service data to be detected in the current layer data node based on the current target partition point, so as to obtain the service data to be detected corresponding to each next-layer data node.
The iteration module 403 is configured to iteratively determine a next-layer target partition point of the next-layer data node by using the preset clustering algorithm and partition the service data to be detected in the next-layer data node based on the next-layer target partition point, until a preset partition termination condition is met, so as to obtain a constructed binary tree.
The second determining module 404 is configured to determine an abnormal value of each piece of service data to be detected according to the position of each piece of service data to be detected in the binary tree.
The embodiment of the present disclosure provides an abnormal data detection apparatus based on the isolation forest algorithm, aimed mainly at the situation in the related art where anomaly detection on data warehouse model data is unsatisfactory and some abnormal data cannot be identified. After the first determining module 401 determines the current target partition point of the current layer data node in the binary tree, the segmentation module 402 and the iteration module 403 partition the service data to be detected in each layer of data nodes in turn based on the corresponding target partition point, thereby determining the left subtree and the right subtree of each data node layer by layer until construction of the tree is complete. The second determining module 404 then determines the abnormal value of each piece of service data to be detected according to its position in the binary tree. The apparatus can detect abnormal data in the model that is difficult to identify visually, rather than only checking abnormal data that can be judged visually (such as null values, 0 values, and negative values). This helps model developers and users find problems in time and manually intervene in and handle alarms, further ensuring the accuracy of the model's data output and the stability of full-link service data.
According to the embodiment of the present disclosure, the choice of partition points during construction of the binary tree determines the position of each data point in the tree, and choosing partition points at random affects detection accuracy. The embodiment of the present disclosure therefore optimizes the way the traditional isolation forest algorithm selects partition points: the first determining module 401 clusters the service data to be detected of the target dimension in the data set by using the preset clustering algorithm and determines the current target partition point of the current layer data node in the binary tree. With this apparatus, the accuracy of abnormal data detection can be further improved on the basis of the original data model. In addition, selecting data points that more readily isolate abnormal points as the partition points for constructing the binary tree can speed up model calculation, improve data detection efficiency, and ensure the timeliness of full-link service data.
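The layer-by-layer construction performed by modules 401 to 404 can be sketched as a recursive tree builder with a pluggable partition-point rule. This is an illustrative sketch for one-dimensional data only: `Node`, `build_tree`, `path_length`, `max_depth`, and the `midpoint_split` stand-in are hypothetical names, and the stand-in rule is merely a placeholder for the clustering-based rule of the disclosure.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Node:
    size: int                       # number of records held by this node
    split: Optional[float] = None   # partition point; None marks a leaf
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build_tree(values: List[float],
               choose_split: Callable[[List[float]], float],
               depth: int = 0,
               max_depth: int = 8) -> Node:
    """Partition `values` layer by layer until a termination condition holds."""
    # Assumed termination conditions: a single record, identical values,
    # or a depth limit.
    if len(values) <= 1 or depth >= max_depth or min(values) == max(values):
        return Node(size=len(values))
    split = choose_split(values)
    left = [v for v in values if v < split]
    right = [v for v in values if v >= split]
    if not left or not right:       # the split failed to separate anything
        return Node(size=len(values))
    return Node(len(values), split,
                build_tree(left, choose_split, depth + 1, max_depth),
                build_tree(right, choose_split, depth + 1, max_depth))


def path_length(node: Node, value: float, depth: int = 0) -> int:
    """Depth at which `value` ends up in a leaf of the tree."""
    if node.split is None:
        return depth
    child = node.left if value < node.split else node.right
    return path_length(child, value, depth + 1)


def midpoint_split(values: List[float]) -> float:
    """Placeholder rule (an assumption): midpoint of the value range."""
    return (min(values) + max(values)) / 2.0


if __name__ == "__main__":
    data = [1.0, 1.1, 0.9, 1.05, 9.0]
    tree = build_tree(data, midpoint_split)
    for v in data:
        print(v, path_length(tree, v))
```

Passing the partition rule in as a function keeps the tree logic independent of how the point is chosen, so a clustering-based rule can be substituted without touching the recursion. The sketch omits the usual leaf-size correction to the path length.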
According to an embodiment of the disclosure, the first determining module includes a clustering unit, a first determining unit, and a second determining unit.
The clustering unit is used for clustering the business data to be detected of the target dimension in the data set into a first target classification data set and a second target classification data set by using a preset clustering algorithm, wherein the first target classification data set corresponds to a first target clustering center, and the second target classification data set corresponds to a second target clustering center.
The first determining unit is configured to determine, as the target partition point, a data point in the first target classification data set whose distance to the second target clustering center meets a preset distance threshold.
The second determining unit is configured to determine, as the target partition point, a data point in the second target classification data set whose distance to the first target clustering center meets the preset distance threshold.
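One possible reading of the selection rule above is sketched below. The disclosure does not spell out what "meets the preset distance threshold" means, so the sketch assumes that a point qualifies when its distance to the opposite cluster center does not exceed the threshold, and that the closest qualifying point is chosen; `pick_partition_point` and its parameters are hypothetical names.

```python
from typing import List, Optional


def pick_partition_point(cluster: List[float],
                         other_center: float,
                         max_distance: float) -> Optional[float]:
    """Return a point of `cluster` lying within `max_distance` of the other
    cluster's center, or None if no point qualifies.

    Assumption: among qualifying points, the one nearest to the other
    center is taken as the partition point.
    """
    candidates = [v for v in cluster if abs(v - other_center) <= max_distance]
    if not candidates:
        return None
    return min(candidates, key=lambda v: abs(v - other_center))


if __name__ == "__main__":
    first_cluster = [0.8, 1.0, 1.2]
    second_center = 10.0
    print(pick_partition_point(first_cluster, second_center, max_distance=9.0))
    print(pick_partition_point(first_cluster, second_center, max_distance=5.0))
```

The same function covers both determining units: the first unit passes the first data set with the second center, and the second unit passes the second data set with the first center.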
According to an embodiment of the present disclosure, the clustering unit comprises an iteration subunit and a clustering subunit.
The iteration subunit is configured to iterate, each time the current round of clustering of the service data to be detected of the target dimension in the data set by the preset clustering algorithm is completed, as follows: determine a next first clustering center according to the current first clustering center and determine a next second clustering center according to the current second clustering center, until a preset termination condition is reached, so as to obtain the finally determined first target clustering center and second target clustering center.
The clustering subunit is configured to cluster the service data to be detected of the target dimension in the data set into the first target classification data set and the second target classification data set according to the first target clustering center and the second target clustering center.
According to an embodiment of the present disclosure, in the clustering subunit, clustering, according to the first target clustering center and the second target clustering center, service data to be detected of a target dimension in the dataset into a first target classification dataset and a second target classification dataset includes:
calculating the distance between the service data to be detected of each target dimension and the first target clustering center to obtain a plurality of first distances;
calculating the distance between the service data to be detected of each target dimension and the second target clustering center to obtain a plurality of second distances;
and clustering the business data to be detected of the target dimension in the data set into a first target classification data set and a second target classification data set according to the first distances and the second distances.
According to an embodiment of the present disclosure, in the iteration subunit, determining the next first clustering center according to the current first clustering center and determining the next second clustering center according to the current second clustering center includes:
clustering, according to the current first clustering center and the current second clustering center, the service data to be detected of the target dimension into a current first target classification data set and a current second target classification data set;
calculating the average value of the plurality of data in the current first target classification data set to obtain the next first clustering center; and calculating the average value of the plurality of data in the current second target classification data set to obtain the next second clustering center.
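The assignment and center-update steps described above correspond to k-means clustering with two clusters. The following is a minimal one-dimensional sketch; the initialisation at the minimum and maximum values, the `max_iter` and `tol` termination settings, and the function names are assumptions made for illustration.

```python
from typing import List, Tuple


def assign(values: List[float], c1: float, c2: float
           ) -> Tuple[List[float], List[float]]:
    """Put each value in the set whose center is nearer (ties go to the first)."""
    first, second = [], []
    for v in values:
        first_distance = abs(v - c1)
        second_distance = abs(v - c2)
        (first if first_distance <= second_distance else second).append(v)
    return first, second


def two_means(values: List[float], max_iter: int = 100, tol: float = 1e-9):
    """Return (center1, center2, first_set, second_set) for 1-D data."""
    c1, c2 = min(values), max(values)   # assumed initial centers
    for _ in range(max_iter):
        first, second = assign(values, c1, c2)
        if not second:                  # all values identical: nothing to split
            break
        # The next centers are the means of the current two sets.
        new_c1 = sum(first) / len(first)
        new_c2 = sum(second) / len(second)
        converged = abs(new_c1 - c1) <= tol and abs(new_c2 - c2) <= tol
        c1, c2 = new_c1, new_c2
        if converged:                   # assumed termination condition
            break
    first, second = assign(values, c1, c2)
    return c1, c2, first, second


if __name__ == "__main__":
    c1, c2, first, second = two_means([1.0, 1.2, 0.8, 10.0])
    print(c1, c2, first, second)
```

In this example the three values near 1 end up in the first set and the value 10.0 forms the second set on its own, which is the kind of separation that makes an outlying value easy to isolate near the root of the tree.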
According to an embodiment of the present disclosure, wherein:
there are a plurality of data sets and a plurality of binary trees, wherein the service data to be detected in each data set is different.
In the case that there are a plurality of binary trees, the position of each piece of service data to be detected in the binary tree is the average value of the positions of that piece of service data to be detected in the plurality of binary trees.
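Averaging a record's position over several trees can be sketched as follows. The sketch assumes each tree reports a depth per record identifier and that a record is averaged over the trees in which it has a depth; `average_positions` and the identifier scheme are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, List


def average_positions(per_tree_depths: List[Dict[Hashable, float]]
                      ) -> Dict[Hashable, float]:
    """Average each record's depth over the trees that contain it."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for depths in per_tree_depths:
        for record_id, depth in depths.items():
            totals[record_id] += depth
            counts[record_id] += 1
    return {record_id: totals[record_id] / counts[record_id]
            for record_id in totals}


if __name__ == "__main__":
    trees = [
        {"a": 1, "b": 4},
        {"a": 2, "b": 6},
        {"a": 3, "b": 5},
    ]
    print(average_positions(trees))
```

Here record "a" sits at an average depth of 2 and record "b" at an average depth of 5, so "a" would be the more suspicious of the two.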
According to an embodiment of the present disclosure, the apparatus further comprises an acquisition module, a processing module, and a selection module.
The acquisition module is used for acquiring a plurality of pieces of original service data to be detected from the service data table to be processed; the processing module is used for carrying out standardization processing on each piece of original service data to be detected so as to obtain a plurality of pieces of standardized service data to be detected; and the selection module is used for selecting part or all of the data from the standardized service data to be detected as the data set.
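The processing and selection steps can be sketched as below. The disclosure does not say which standardization is used, so z-score scaling is assumed here, and drawing random subsets without replacement is likewise an assumption; `standardize`, `draw_data_sets`, and their parameters are hypothetical names.

```python
import random
import statistics
from typing import List


def standardize(values: List[float]) -> List[float]:
    """Z-score standardization (an assumed choice of standardization)."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def draw_data_sets(values: List[float], n_sets: int, sample_size: int,
                   seed: int = 0) -> List[List[float]]:
    """Select part (or all) of the standardized data as each data set."""
    rng = random.Random(seed)
    size = min(sample_size, len(values))
    return [rng.sample(values, size) for _ in range(n_sets)]


if __name__ == "__main__":
    raw = [2.0, 4.0, 6.0, 8.0, 100.0]
    standardized = standardize(raw)
    for data_set in draw_data_sets(standardized, n_sets=3, sample_size=4):
        print(data_set)
```

Each drawn data set could then be used to build one binary tree, so that different trees see different service data.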
Any number of modules, sub-modules, units, sub-units, or at least some of the functionality of any number of the sub-units according to embodiments of the present disclosure may be implemented in one module. Any one or more of the modules, sub-modules, units, sub-units according to embodiments of the present disclosure may be implemented as split into multiple modules. Any one or more of the modules, sub-modules, units, sub-units according to embodiments of the present disclosure may be implemented at least in part as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system-on-chip, a system-on-substrate, a system-on-package, an Application Specific Integrated Circuit (ASIC), or in any other reasonable manner of hardware or firmware that integrates or encapsulates the circuit, or in any one of or a suitable combination of three of software, hardware, and firmware. Alternatively, one or more of the modules, sub-modules, units, sub-units according to embodiments of the present disclosure may be at least partially implemented as computer program modules, which when executed, may perform the corresponding functions.
For example, any of the first determining module 401, the dividing module 402, the iterating module 403, and the second determining module 404 may be combined in one module/unit/sub-unit, or any of them may be split into a plurality of modules/units/sub-units. Alternatively, at least some of the functionality of one or more of these modules/units/sub-units may be combined with at least some of the functionality of other modules/units/sub-units and implemented in one module/unit/sub-unit. According to embodiments of the present disclosure, at least one of the first determination module 401, the segmentation module 402, the iteration module 403, and the second determination module 404 may be implemented at least in part as hardware circuitry, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented in hardware or firmware in any other reasonable way of integrating or packaging the circuitry, or in any one of or a suitable combination of three of software, hardware, and firmware. Alternatively, at least one of the first determination module 401, the segmentation module 402, the iteration module 403, and the second determination module 404 may be at least partially implemented as computer program modules, which, when executed, may perform the respective functions.
Fig. 5 schematically illustrates a block diagram of an electronic device for implementing an abnormal data detection method according to an embodiment of the present disclosure. The electronic device shown in fig. 5 is merely an example and should not be construed to limit the functionality and scope of use of the disclosed embodiments.
As shown in fig. 5, an electronic device 500 according to an embodiment of the present disclosure includes a processor 501 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 502 or a program loaded from a storage section 508 into a Random Access Memory (RAM) 503. The processor 501 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or an associated chipset and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), or the like. The processor 501 may also include on-board memory for caching purposes. The processor 501 may comprise a single processing unit or a plurality of processing units for performing different actions of the method flows according to embodiments of the disclosure.
In the RAM 503, various programs and data required for the operation of the electronic device 500 are stored. The processor 501, the ROM 502, and the RAM 503 are connected to each other by a bus 504. The processor 501 performs various operations of the method flow according to the embodiments of the present disclosure by executing programs in the ROM 502 and/or the RAM 503. Note that the programs may also be stored in one or more memories other than the ROM 502 and the RAM 503. The processor 501 may also perform various operations of the method flow according to embodiments of the present disclosure by executing programs stored in the one or more memories.
According to an embodiment of the present disclosure, the electronic device 500 may also include an input/output (I/O) interface 505, which is also connected to the bus 504. The electronic device 500 may also include one or more of the following components connected to the I/O interface 505: an input section 506 including a keyboard, a mouse, and the like; an output section 507 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, as well as a speaker and the like; a storage section 508 including a hard disk and the like; and a communication section 509 including a network interface card such as a LAN card, a modem, or the like. The communication section 509 performs communication processing via a network such as the Internet. A drive 510 is also connected to the I/O interface 505 as needed. A removable medium 511, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 510 as needed, so that a computer program read therefrom is installed into the storage section 508 as needed.
According to embodiments of the present disclosure, the method flow according to embodiments of the present disclosure may be implemented as a computer software program. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable storage medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via the communication portion 509, and/or installed from the removable media 511. The above-described functions defined in the system of the embodiments of the present disclosure are performed when the computer program is executed by the processor 501. The systems, devices, apparatus, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the disclosure.
The present disclosure also provides a computer-readable storage medium that may be embodied in the apparatus/device/system described in the above embodiments; or may exist alone without being assembled into the apparatus/device/system. The computer-readable storage medium carries one or more programs which, when executed, implement methods in accordance with embodiments of the present disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium. Examples may include, but are not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
For example, according to embodiments of the present disclosure, the computer-readable storage medium may include ROM 502 and/or RAM 503 and/or one or more memories other than ROM 502 and RAM 503 described above.
Embodiments of the present disclosure also include a computer program product comprising a computer program comprising program code for performing the methods provided by the embodiments of the present disclosure, the program code for causing an electronic device to implement the method of anomaly data detection provided by the embodiments of the present disclosure when the computer program product is run on the electronic device.
The above-described functions defined in the system/apparatus of the embodiments of the present disclosure are performed when the computer program is executed by the processor 501. The systems, apparatus, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the disclosure.
In one embodiment, the computer program may be based on a tangible storage medium such as an optical storage device, a magnetic storage device, or the like. In another embodiment, the computer program may also be transmitted, distributed, and downloaded and installed in the form of a signal on a network medium, and/or installed from a removable medium 511 via the communication portion 509. The computer program may include program code that may be transmitted using any appropriate network medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
According to embodiments of the present disclosure, program code for performing computer programs provided by embodiments of the present disclosure may be written in any combination of one or more programming languages, and in particular, such computer programs may be implemented in high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. Programming languages include, but are not limited to, Java, C++, Python, C, or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. Those skilled in the art will appreciate that the features recited in the various embodiments of the disclosure and/or in the claims may be combined and/or integrated in various ways, even if such combinations or integrations are not explicitly recited in the disclosure. In particular, the features recited in the various embodiments of the present disclosure and/or the claims may be combined and/or integrated in various ways without departing from the spirit and teachings of the present disclosure. All such combinations and/or integrations fall within the scope of the present disclosure.
The embodiments of the present disclosure are described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described above separately, this does not mean that the measures in the embodiments cannot be used advantageously in combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be made by those skilled in the art without departing from the scope of the disclosure, and such alternatives and modifications are intended to fall within the scope of the disclosure.

Claims (11)

1. An abnormal data detection method, comprising:
clustering the business data to be detected of the target dimension in the data set by using a preset clustering algorithm, and determining a current target partition point of a current layer data node in the binary tree;
partitioning the service data to be detected in the current layer data node based on the current target partition point to obtain the service data to be detected corresponding to each next-layer data node;
iteratively determining a next-layer target partition point of the next-layer data node by using the preset clustering algorithm and partitioning the service data to be detected in the next-layer data node based on the next-layer target partition point, until a preset partition termination condition is met, to obtain a constructed binary tree; and
determining an abnormal value of each piece of service data to be detected according to the position of each piece of service data to be detected in the binary tree.
2. The method of claim 1, wherein the clustering the service data to be detected of the target dimension in the data set by using a preset clustering algorithm, and determining the current target partition point of the current layer data node in the binary tree comprises:
clustering the business data to be detected of the target dimension in the dataset into a first target classification dataset and a second target classification dataset by using the preset clustering algorithm, wherein the first target classification dataset corresponds to a first target clustering center, and the second target classification dataset corresponds to a second target clustering center;
determining, as the target partition point, a data point in the first target classification dataset whose distance to the second target clustering center meets a preset distance threshold; or
determining, as the target partition point, a data point in the second target classification dataset whose distance to the first target clustering center meets the preset distance threshold.
3. The method of claim 2, wherein the clustering the traffic data to be detected of the target dimension in the dataset into a first target classification dataset and a second target classification dataset using the preset clustering algorithm comprises:
each time the current round of clustering of the service data to be detected of the target dimension in the dataset by the preset clustering algorithm is completed, iteratively performing: determining a next first clustering center according to the current first clustering center and determining a next second clustering center according to the current second clustering center, until a preset termination condition is reached, so as to obtain the finally determined first target clustering center and second target clustering center;
and clustering the business data to be detected of the target dimension in the dataset into the first target classification dataset and the second target classification dataset according to the first target clustering center and the second target clustering center.
4. The method of claim 3, wherein the clustering the business data to be detected for the target dimension in the dataset into the first target classification dataset and the second target classification dataset according to the first target clustering center and the second target clustering center comprises:
calculating the distance between the service data to be detected of each target dimension and the first target clustering center to obtain a plurality of first distances;
calculating the distance between the service data to be detected of each target dimension and the second target clustering center to obtain a plurality of second distances;
and clustering the service data to be detected of the target dimension in the dataset into the first target classification dataset and the second target classification dataset according to the first distances and the second distances.
5. The method of claim 3, wherein the determining a next first clustering center according to the current first clustering center and determining a next second clustering center according to the current second clustering center comprises:
clustering, according to the current first clustering center and the current second clustering center, the service data to be detected of the target dimension into a current first target classification dataset and a current second target classification dataset;
calculating the average value of a plurality of data in the current first target classification dataset to obtain the next first clustering center;
and calculating the average value of a plurality of data in the current second target classification dataset to obtain the next second clustering center.
6. The method according to claim 1, wherein:
there are a plurality of data sets and a plurality of binary trees, and the service data to be detected in each data set is different;
in the case that there are a plurality of binary trees, the position of each piece of service data to be detected in the binary tree is the average value of the positions of that piece of service data to be detected in the plurality of binary trees.
7. The method of claim 1, further comprising:
acquiring a plurality of pieces of original service data to be detected from a service data table to be processed;
carrying out standardization processing on each piece of original service data to be detected to obtain a plurality of pieces of standardized service data to be detected;
and selecting part or all of the data from the plurality of standardized service data to be detected as the data set.
8. An abnormal data detection apparatus comprising:
the first determining module is used for clustering the business data to be detected of the target dimension in the data set by using a preset clustering algorithm, and determining the current target partitioning point of the current layer data node in the binary tree;
the segmentation module is used for partitioning the service data to be detected in the current layer data node based on the current target partition point to obtain the service data to be detected corresponding to each next-layer data node;
the iteration module is used for iteratively determining a next-layer target partition point of the next-layer data node by using the preset clustering algorithm and partitioning the service data to be detected in the next-layer data node based on the next-layer target partition point, until a preset partition termination condition is met, to obtain a constructed binary tree; and
the second determining module is used for determining the abnormal value of each piece of service data to be detected according to the position of each piece of service data to be detected in the binary tree.
9. An electronic device, comprising:
one or more processors;
a memory for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1 to 7.
10. A computer readable storage medium having stored thereon executable instructions which when executed by a processor cause the processor to implement the method of any of claims 1 to 7.
11. A computer program product comprising computer executable instructions for implementing the method of any one of claims 1 to 7 when executed.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210059916.9A CN116521799A (en) 2022-01-19 2022-01-19 Abnormal data detection method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116521799A true CN116521799A (en) 2023-08-01

Family

ID=87405136

Country Status (1)

Country Link
CN (1) CN116521799A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination