CN114781688A - Method, device, equipment and storage medium for identifying abnormal data of business expansion project - Google Patents
- Publication number
- CN114781688A (application number CN202210276723.9A)
- Authority
- CN
- China
- Prior art keywords
- data
- expansion project
- business expansion
- isolated
- business
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/04—Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/06—Energy or water supply
Landscapes
- Engineering & Computer Science (AREA)
- Business, Economics & Management (AREA)
- Theoretical Computer Science (AREA)
- Economics (AREA)
- Physics & Mathematics (AREA)
- Human Resources & Organizations (AREA)
- Strategic Management (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Marketing (AREA)
- Tourism & Hospitality (AREA)
- General Business, Economics & Management (AREA)
- Health & Medical Sciences (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Computation (AREA)
- Public Health (AREA)
- Water Supply & Treatment (AREA)
- General Health & Medical Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Primary Health Care (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Development Economics (AREA)
- Game Theory and Decision Science (AREA)
- Entrepreneurship & Innovation (AREA)
- Operations Research (AREA)
- Quality & Reliability (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention relates to a method for identifying abnormal data of a business expansion project, comprising the following steps: acquiring business expansion project data, wherein the business expansion project data comprises the number of project process nodes, the node working duration and the project cost; performing feature recombination and dimensionality reduction on the business expansion project data by using the t-distributed stochastic neighbor embedding (t-SNE) algorithm; constructing an isolated forest model; and inputting the dimension-reduced business expansion project data into the isolated forest model to perform anomaly detection and obtain a detection result. The method addresses anomaly detection under unsupervised learning: the high-dimensional data is reduced in dimension and recombined by the t-SNE algorithm, and abnormal data is then detected with the isolated forest. The isolated forest is a fast, unsupervised, ensemble-based anomaly detection method with linear time complexity and high accuracy; it achieves a good anomaly detection effect with only a small number of samples and converges quickly.
Description
Technical Field
The invention relates to the technical field of abnormal data identification for business expansion projects, and in particular to a method, device, equipment and storage medium for identifying abnormal data of a business expansion project.
Background
With the rapid development of the power grid industry in China, how to improve the quality of electric energy and provide high-quality power utilization service is always a concern in the field of power grids.
During business expansion and installation, a small amount of abnormal data often appears in the data source of business expansion supporting projects for various reasons. Identifying this abnormal data is key to the success of both the clustering of business expansion supporting projects and the prediction of the working duration of each node in the business expansion process. However, most common anomaly identification methods target imbalanced data. High-dimensional data is harder to handle because it is distributed more sparsely in space than low-dimensional data, and ignoring this sparsity is itself a factor that degrades outlier identification. As a result, abnormal data identification becomes inaccurate and overly sensitive.
Disclosure of Invention
In view of this, the present application provides a method, device, equipment and storage medium for identifying abnormal data of a business expansion project. The method addresses anomaly detection under unsupervised learning: the high-dimensional data is reduced in dimension and recombined by the t-SNE algorithm, and abnormal data is then detected with the isolated forest. The isolated forest is a fast, unsupervised, ensemble-based anomaly detection method with linear time complexity and high accuracy; it achieves a good anomaly detection effect with only a small number of samples and converges quickly.
According to a first aspect of some embodiments herein, there is provided a method for identifying abnormal data of a business expansion project, comprising the steps of:
acquiring business expansion project data, wherein the business expansion project data comprises the number of project process nodes, the node working duration and the project cost;
performing feature recombination and dimensionality reduction on the business expansion project data by using the t-distributed stochastic neighbor embedding (t-SNE) algorithm;
constructing an isolated forest model;
and inputting the dimension-reduced business expansion project data into the isolated forest model to perform anomaly detection on the business expansion project data and obtain a detection result.
Further, performing feature recombination and dimensionality reduction on the business expansion project data by using the t-SNE algorithm comprises the following steps:
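defining the business expansion project data as a set of n d-dimensional samples $X = \{x_1, x_2, \ldots, x_n\}$, and assuming that, for any two points $x_i$ and $x_j$, $x_j$ follows a Gaussian distribution $P_i$ centered at $x_i$ with variance parameter $\sigma_i$, and likewise $x_i$ follows a Gaussian distribution $P_j$ centered at $x_j$ with variance parameter $\sigma_j$. Following the standard t-SNE formulation, the conditional probability $p_{j|i}$ that $x_i$ and $x_j$ are similar is:

$$p_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)}$$

wherein $p_{j|i}$ is the high-dimensional distribution probability between data points $x_i$ and $x_j$ in the business expansion project data; $\sigma_i$ is the variance parameter of the Gaussian centered at data point $x_i$, and its optimal value is found by binary search on the perplexity; the perplexity $\mathrm{Perp}(P_i)$ can be expressed as:

$$\mathrm{Perp}(P_i) = 2^{H(P_i)}, \qquad H(P_i) = -\sum_{j} p_{j|i} \log_2 p_{j|i}$$

defining a low-dimensional sample set $Y = \{y_1, y_2, \ldots, y_n\}$ as the low-dimensional embedding of the high-dimensional sample set $X = \{x_1, x_2, \ldots, x_n\}$; the joint probability $q_{ij}$ of the points $y_i$ and $y_j$ that correspond to data points $x_i$ and $x_j$ in the low-dimensional space is defined as:

$$q_{ij} = \frac{\left(1 + \|y_i - y_j\|^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \|y_k - y_l\|^2\right)^{-1}}$$

measuring the similarity $C$ between the high-dimensional and low-dimensional distributions using the KL divergence, where, as in standard t-SNE, the conditional probabilities are symmetrized as $p_{ij} = (p_{j|i} + p_{i|j}) / 2n$:

$$C = KL(P \,\|\, Q) = \sum_{i} \sum_{j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$

minimizing the KL divergence by gradient descent, with the gradient:

$$\frac{\partial C}{\partial y_i} = 4 \sum_{j} (p_{ij} - q_{ij})(y_i - y_j)\left(1 + \|y_i - y_j\|^2\right)^{-1}$$

to speed up the optimization process and avoid poor local optima, a relatively large momentum term is used in the gradient update to obtain the low-dimensional embedding $Y^{(t)}$:

$$Y^{(t)} = Y^{(t-1)} - \eta \frac{\partial C}{\partial Y} + \alpha(t)\left(Y^{(t-1)} - Y^{(t-2)}\right)$$

wherein $Y^{(t)}$ is the embedding at iteration t, $\eta$ is the learning rate, and $\alpha(t)$ is the momentum at iteration t.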
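The gradient-descent step with momentum described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the applicant's implementation: it assumes the standard t-SNE gradient with a Student-t low-dimensional kernel and a symmetric joint matrix `P`, and the function names (`low_dim_affinities`, `kl_divergence`, `gradient`, `momentum_step`), the toy matrix `P`, and the values of `eta` and `alpha` are all hypothetical.

```python
import math
import random


def low_dim_affinities(Y):
    """Return (Q, W): Student-t joint probabilities q_ij and raw weights w_ij."""
    n = len(Y)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(Y[i], Y[j]))
                W[i][j] = 1.0 / (1.0 + d2)
    total = sum(sum(row) for row in W)
    Q = [[w / total for w in row] for row in W]
    return Q, W


def kl_divergence(P, Q):
    """C = sum_ij p_ij * log(p_ij / q_ij), skipping zero entries of P."""
    return sum(p * math.log(p / q)
               for row_p, row_q in zip(P, Q)
               for p, q in zip(row_p, row_q) if p > 0)


def gradient(P, Y):
    """dC/dy_i = 4 * sum_j (p_ij - q_ij) * (y_i - y_j) * w_ij."""
    Q, W = low_dim_affinities(Y)
    n, dim = len(Y), len(Y[0])
    G = [[0.0] * dim for _ in range(n)]
    for i in range(n):
        for j in range(n):
            coeff = 4.0 * (P[i][j] - Q[i][j]) * W[i][j]
            for k in range(dim):
                G[i][k] += coeff * (Y[i][k] - Y[j][k])
    return G


def momentum_step(P, Y, Y_prev, eta, alpha):
    """One update: Y_new = Y - eta * grad + alpha * (Y - Y_prev)."""
    G = gradient(P, Y)
    return [[y - eta * g + alpha * (y - yp)
             for y, g, yp in zip(row_y, row_g, row_p)]
            for row_y, row_g, row_p in zip(Y, G, Y_prev)]


# Toy symmetric joint matrix (assumed): points 0-1 and 2-3 form two close pairs.
P = [[0.0, 0.2, 0.025, 0.025],
     [0.2, 0.0, 0.025, 0.025],
     [0.025, 0.025, 0.0, 0.2],
     [0.025, 0.025, 0.2, 0.0]]
rng = random.Random(0)
Y_prev = [[rng.gauss(0.0, 0.01) for _ in range(2)] for _ in range(4)]
Y = [row[:] for row in Y_prev]
for t in range(100):
    Y, Y_prev = momentum_step(P, Y, Y_prev, eta=0.5, alpha=0.5), Y
print(kl_divergence(P, low_dim_affinities(Y)[0]))
```

The gradient is subtracted because the KL divergence is being minimized. With a modest learning rate the printed divergence would normally be expected to fall over the iterations, although the sketch does not guarantee a decrease at every step.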
Further, the step of constructing the isolated forest model comprises:
initializing an isolated forest;
training an isolated tree in the isolated forest model.
Further, initializing the isolated forest comprises:
building, according to the dimension-reduced business expansion project data, an isolated forest consisting of a plurality of isolated trees, and setting a suitable sub-sampling size $\psi$;
wherein the height limit $l$ of each tree is determined by the following formula:
$l = \lceil \log_2 \psi \rceil$, wherein $\lceil \cdot \rceil$ (ceiling) denotes rounding up.
Further, training the isolated trees in the isolated forest model comprises:
randomly selecting an isolated tree and obtaining its current tree height e; if e > l, stopping the split and returning the current node as a leaf node;
otherwise, randomly selecting an attribute q from the dimension-reduced business expansion project data, and randomly selecting a split value p between the minimum and maximum values of q;
after p is selected, splitting the samples contained in the current node into two parts, a new left child node holding the samples with q < p and a new right child node holding the samples with q ≥ p;
updating the tree height of the isolated tree to e + 1;
and repeating the above steps for each isolated tree until every isolated tree has been trained.
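The initialization and training steps above can be sketched as follows. This is a hedged illustration only: it assumes that q denotes a randomly chosen attribute (column) of the dimension-reduced data, it stops splitting when `e >= limit` as in the original isolation-forest literature (the text above states `e > l`), and it treats a node whose chosen attribute is constant as a leaf. All names (`height_limit`, `build_tree`, `build_forest`, `depth`, `leaf_total`) and the toy data are hypothetical.

```python
import math
import random


def height_limit(psi):
    """Height limit l = ceiling(log2(psi)) for a sub-sample of size psi."""
    return math.ceil(math.log2(psi))


def build_tree(samples, e, limit, rng):
    """Recursively split samples; a leaf is a dict holding only 'size'."""
    if e >= limit or len(samples) <= 1:
        return {"size": len(samples)}
    q = rng.randrange(len(samples[0]))          # random attribute index
    lo = min(s[q] for s in samples)
    hi = max(s[q] for s in samples)
    if lo == hi:                                # simplification: cannot split on q
        return {"size": len(samples)}
    p = rng.uniform(lo, hi)                     # random split value
    left = [s for s in samples if s[q] < p]
    right = [s for s in samples if s[q] >= p]
    return {"q": q, "p": p,
            "left": build_tree(left, e + 1, limit, rng),
            "right": build_tree(right, e + 1, limit, rng)}


def build_forest(data, n_trees, psi, seed=0):
    """Train n_trees trees, each on psi samples drawn without replacement."""
    rng = random.Random(seed)
    limit = height_limit(psi)
    return [build_tree(rng.sample(data, psi), 0, limit, rng)
            for _ in range(n_trees)]


def depth(tree):
    """Number of edges on the longest root-to-leaf path."""
    if "size" in tree:
        return 0
    return 1 + max(depth(tree["left"]), depth(tree["right"]))


def leaf_total(tree):
    """Total number of samples held in the leaves of a tree."""
    if "size" in tree:
        return tree["size"]
    return leaf_total(tree["left"]) + leaf_total(tree["right"])


# Toy two-dimensional data standing in for the t-SNE output (assumed).
data_rng = random.Random(1)
data = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(64)]
forest = build_forest(data, n_trees=10, psi=16)
print(height_limit(16), max(depth(t) for t in forest))
```

Capping the height at roughly log2 of the sub-sample size is what keeps each tree cheap to build: points that are still not isolated at that depth are treated as ordinary, since only short paths matter for anomaly detection.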
Further, inputting the dimension-reduced business expansion project data into the isolated forest model to perform anomaly detection on the business expansion project data and obtain a detection result comprises:
performing anomaly detection on each sample in the dimension-reduced business expansion project data;
determining how abnormal each sample is by using an anomaly score s, which is determined by the following formulas:

$$s(x, n) = 2^{-E(h(x)) / c(n)}$$

$$c(n) = 2H(n-1) - \frac{2(n-1)}{n}$$

$$H(i) \approx \ln(i) + 0.5772156649$$

wherein x is a sample in the dimension-reduced business expansion project data, n is the number of samples in the dimension-reduced business expansion project data, h(x) is the number of edges traversed by sample x from the root node to the leaf node where it ends, and E(h(x)) is the expected value of h(x) over all trees of the isolated forest;
the closer the leaf reached by a sample x is to the root node, the greater its anomaly score.
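As a small worked example of the scoring formulas, the sketch below evaluates c(n) and H(i) as given above and assumes the standard isolation-forest score s = 2^(-E(h(x))/c(n)). The function `anomaly_score` and the sample path lengths are hypothetical and chosen only for illustration.

```python
import math

EULER_GAMMA = 0.5772156649


def H(i):
    """Harmonic number approximation H(i) ~ ln(i) + Euler's constant."""
    return math.log(i) + EULER_GAMMA


def c(n):
    """Normalizing average path length for n samples (requires n >= 2)."""
    if n <= 1:
        return 0.0
    return 2.0 * H(n - 1) - 2.0 * (n - 1) / n


def anomaly_score(path_lengths, n):
    """s = 2 ** (-E(h(x)) / c(n)), with E(h(x)) the mean over all trees."""
    mean_h = sum(path_lengths) / len(path_lengths)
    return 2.0 ** (-mean_h / c(n))


# Hypothetical path lengths of two samples across three trees, n = 256.
isolated_early = anomaly_score([2, 3, 2], n=256)
isolated_late = anomaly_score([10, 11, 10], n=256)
print(round(c(256), 4), isolated_early > isolated_late)
```

A sample whose mean path length equals c(n) scores exactly 0.5. Shorter paths push the score toward 1, and longer paths push it toward 0, which is why samples isolated near the root are flagged as abnormal.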
According to a second aspect of some embodiments of the present application, there is provided an apparatus for identifying abnormal data of a business expansion project, comprising:
a business expansion project acquisition module, configured to acquire business expansion project data, wherein the business expansion project data comprises the number of project process nodes, the node working duration and the project cost;
a dimension reduction module, configured to perform feature recombination and dimensionality reduction on the business expansion project data by using the t-distributed stochastic neighbor embedding (t-SNE) algorithm;
a construction module, configured to construct an isolated forest model;
and a detection module, configured to input the dimension-reduced business expansion project data into the isolated forest model to perform anomaly detection on the business expansion project data and obtain a detection result.
Further, the dimension reduction module includes:
a high-dimensional probability calculation unit, configured to define the business expansion project data as a set of n d-dimensional samples $X = \{x_1, x_2, \ldots, x_n\}$ and, assuming that for any two points $x_i$ and $x_j$, $x_j$ follows a Gaussian distribution $P_i$ centered at $x_i$ with variance parameter $\sigma_i$ and $x_i$ likewise follows a Gaussian distribution $P_j$ centered at $x_j$ with variance parameter $\sigma_j$, to calculate the conditional probability $p_{j|i}$ that $x_i$ and $x_j$ are similar, wherein $p_{j|i}$ is the high-dimensional distribution probability between data points $x_i$ and $x_j$ in the business expansion project data, and the optimal value of $\sigma_i$ is found by binary search on the perplexity $\mathrm{Perp}(P_i)$;
a low-dimensional probability calculation unit, configured to define a low-dimensional sample set $Y = \{y_1, y_2, \ldots, y_n\}$ as the low-dimensional embedding of the high-dimensional sample set $X = \{x_1, x_2, \ldots, x_n\}$, and to calculate the joint probability $q_{ij}$ of the points $y_i$ and $y_j$ that correspond to data points $x_i$ and $x_j$ in the low-dimensional space;
a dimension reduction determination unit, configured to measure the similarity $C$ between the high-dimensional probabilities and $q_{ij}$ using the KL divergence, to minimize the KL divergence by gradient descent, and, in order to speed up the optimization process and avoid poor local optima, to use a relatively large momentum in the gradient update to obtain the low-dimensional embedding $Y^{(t)}$,
wherein $Y^{(t)}$ is the embedding at iteration t, $\eta$ is the learning rate, and $\alpha(t)$ is the momentum at iteration t.
According to a third aspect of some embodiments herein there is provided an apparatus comprising:
at least one memory and at least one processor;
the memory is configured to store one or more programs;
when executed by the at least one processor, the one or more programs cause the at least one processor to implement the steps of the method for identifying anomalous data in a business expansion project according to any one of the first aspects.
According to a fourth aspect of some embodiments of the present application, there is provided a computer readable storage medium storing a computer program which, when executed by a processor, performs the steps of the method according to any one of the first aspect.
According to the method, feature recombination and dimensionality reduction are performed on the original high-dimensional business expansion project data by using the t-distributed stochastic neighbor embedding algorithm; abnormal data in the recombined, dimension-reduced business expansion project data is then discovered by the isolated forest algorithm and finally eliminated. The isolated forest algorithm aims to discover the characteristics of abnormal points rather than to generalize those of normal points. Because abnormal points make up only a small part of the whole sample set and their attributes differ greatly from those of normal points, they are easier to isolate than normal points. The isolated forest is a fast, unsupervised, ensemble-based anomaly detection method with linear time complexity and high accuracy; it achieves a good anomaly detection effect with only a small number of samples and converges quickly, which is of great significance for eliminating abnormal data from business expansion project data.
To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings used in describing them are briefly introduced below. The drawings described below show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
Drawings
FIG. 1 is a flowchart illustrating steps of a method for identifying anomalous data in a business extension project according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of an apparatus for identifying abnormal data of a business expansion project according to an embodiment of the present application;
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
It should be understood that the embodiments described are only a few embodiments of the present application, and not all embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without making any creative effort belong to the protection scope of the embodiments in the present application.
The terminology used in the embodiments of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the embodiments of the present application. As used in the examples of this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
The following description refers to the accompanying drawings in which the same numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the application, as detailed in the appended claims. In the description of the present application, it is to be understood that the terms "first," "second," "third," and the like are used solely to distinguish one from another and are not necessarily used to describe a particular order or sequence, nor are they to be construed as indicating or implying relative importance. The specific meaning of the above terms in the present application can be understood by those of ordinary skill in the art as the case may be.
Further, in the description of the present application, "a plurality" means two or more unless otherwise specified. "And/or" describes an association between associated objects and covers three cases; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates an "or" relationship between the objects before and after it.
The method aims to solve the problem, described in the background, that in actual engineering normal data and abnormal data are mixed in the same data set and there is not enough information to tell them apart.
The present application provides a method for identifying abnormal data of a business expansion project which, referring to fig. 1, includes the following steps:
Step S1: acquiring business expansion project data, wherein the business expansion project data comprises the number of project process nodes, the node working duration and the project cost.
The business expansion project comprises a plurality of project process nodes, and the working time and the cost of different process nodes are different.
Step S2: performing feature recombination and dimensionality reduction on the business expansion project data by using the t-distributed stochastic neighbor embedding (t-SNE) algorithm.
The t-SNE algorithm reduces the dimension of high-dimensional data; its aim is to represent a set of data points in a high-dimensional space accurately in a low-dimensional space, typically a two-dimensional one. The algorithm is non-linear and adapts to the underlying data, and it exposes a tuning parameter, the perplexity, which in short balances attention between the local and global aspects of the data.
Because the business expansion project data involves not only project categories and node counts but also time and cost, it is highly complex and large in volume. Reducing its dimension with this algorithm therefore helps the subsequent identification of abnormal data.
Step S3: constructing an isolated forest model.
The isolated forest (iForest) model is an unsupervised anomaly detection method suitable for continuous numerical data; that is, it needs no labeled samples for training, but the features must be continuous. To find which points are easily isolated, iForest uses a very efficient strategy: the data set is recursively and randomly partitioned until all sample points are isolated. Under this random partitioning strategy, outliers typically have shorter paths.
Step S4: inputting the dimension-reduced business expansion project data into the isolated forest model to perform anomaly detection on the business expansion project data and obtain a detection result.
The method addresses anomaly detection under unsupervised learning: the high-dimensional data is reduced in dimension and recombined by the t-SNE algorithm, and abnormal data is then detected with the isolated forest. The isolated forest is a fast, unsupervised, ensemble-based anomaly detection method with linear time complexity and high accuracy; it achieves a good anomaly detection effect with only a small number of samples and converges quickly.
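In a specific example, performing feature recombination and dimensionality reduction on the business expansion project data by using the t-SNE algorithm comprises the following steps:
The business expansion project data is defined as a set of n d-dimensional samples $X = \{x_1, x_2, \ldots, x_n\}$, wherein each data point in the sample set represents any one of the number of project process nodes, the node working duration and the project cost in the business expansion project data.
Assume that, for any two data points $x_i$ and $x_j$, $x_j$ follows a Gaussian distribution $P_i$ centered at $x_i$ with variance parameter $\sigma_i$, and likewise $x_i$ follows a Gaussian distribution $P_j$ centered at $x_j$ with variance parameter $\sigma_j$. Following the standard t-SNE formulation, the conditional probability $p_{j|i}$ that $x_i$ and $x_j$ are similar is:

$$p_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)}$$

wherein $p_{j|i}$ is the high-dimensional distribution probability between data points $x_i$ and $x_j$ in the business expansion project data; $\sigma_i$ is the variance parameter of the Gaussian centered at data point $x_i$, and its optimal value is found by binary search on the perplexity; the perplexity $\mathrm{Perp}(P_i)$ can be expressed as:

$$\mathrm{Perp}(P_i) = 2^{H(P_i)}, \qquad H(P_i) = -\sum_{j} p_{j|i} \log_2 p_{j|i}$$

A low-dimensional sample set $Y = \{y_1, y_2, \ldots, y_n\}$ is defined as the low-dimensional embedding of the high-dimensional sample set $X = \{x_1, x_2, \ldots, x_n\}$; the joint probability $q_{ij}$ of the points $y_i$ and $y_j$ that correspond to data points $x_i$ and $x_j$ in the low-dimensional space is defined as:

$$q_{ij} = \frac{\left(1 + \|y_i - y_j\|^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \|y_k - y_l\|^2\right)^{-1}}$$

The similarity $C$ between the high-dimensional and low-dimensional distributions is measured using the KL divergence, where, as in standard t-SNE, the conditional probabilities are symmetrized as $p_{ij} = (p_{j|i} + p_{i|j}) / 2n$:

$$C = KL(P \,\|\, Q) = \sum_{i} \sum_{j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$

The KL divergence is minimized by gradient descent, with the gradient:

$$\frac{\partial C}{\partial y_i} = 4 \sum_{j} (p_{ij} - q_{ij})(y_i - y_j)\left(1 + \|y_i - y_j\|^2\right)^{-1}$$

To speed up the optimization process and avoid poor local optima, a relatively large momentum term is used in the gradient update to obtain the low-dimensional embedding $Y^{(t)}$:

$$Y^{(t)} = Y^{(t-1)} - \eta \frac{\partial C}{\partial Y} + \alpha(t)\left(Y^{(t-1)} - Y^{(t-2)}\right)$$

wherein $Y^{(t)}$ is the embedding at iteration t, $\eta$ is the learning rate, and $\alpha(t)$ is the momentum at iteration t.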
In a specific embodiment, the step of constructing the isolated forest model comprises:
Initializing the isolated forest.
Specifically, initializing the isolated forest includes:
building, according to the dimension-reduced business expansion project data, an isolated forest consisting of a plurality of isolated trees, and setting a suitable sub-sampling size $\psi$;
wherein the height limit $l$ of each isolated tree is determined by the following formula:
$l = \lceil \log_2 \psi \rceil$, wherein $\lceil \cdot \rceil$ (ceiling) denotes rounding up.
Training the isolated trees in the isolated forest model.
Specifically, training the isolated trees in the isolated forest model includes:
Randomly selecting an isolated tree and obtaining its current tree height e; if e > l, stopping the split and returning the current node as a leaf node.
Otherwise, randomly selecting an attribute q from the dimension-reduced business expansion project data, and randomly selecting a split value p between the minimum and maximum values of q.
After p is selected, splitting the samples contained in the current node into two parts, a new left child node holding the samples with q < p and a new right child node holding the samples with q ≥ p.
Updating the tree height of the isolated tree to e + 1.
Repeating the above steps for each isolated tree until every isolated tree has been trained.
In a preferred embodiment, the step of inputting the dimension-reduced business expansion project data into the isolated forest model to perform anomaly detection on the business expansion project data and obtain a detection result includes:
performing anomaly detection on each sample in the dimension-reduced business expansion project data;
determining how abnormal each sample is by using an anomaly score s, which is determined by the following formulas:

$$s(x, n) = 2^{-E(h(x)) / c(n)}$$

$$c(n) = 2H(n-1) - \frac{2(n-1)}{n}$$

$$H(i) \approx \ln(i) + 0.5772156649$$

wherein x is a sample in the dimension-reduced business expansion project data, n is the number of samples in the dimension-reduced business expansion project data, h(x) is the number of edges traversed by sample x from the root node to the leaf node where it ends, and E(h(x)) is the expected value of h(x) over all trees of the isolated forest;
the closer the leaf reached by a sample x is to the root node, the greater its anomaly score.
Corresponding to the above method for identifying abnormal data of a business expansion project, as shown in fig. 2, the present application further provides an apparatus 200 for identifying abnormal data of a business expansion project, including:
a business expansion project acquisition module 210, configured to acquire business expansion project data, wherein the business expansion project data includes the number of project process nodes, the node working duration and the project cost;
a dimension reduction module 220, configured to perform feature recombination and dimensionality reduction on the business expansion project data by using the t-distributed stochastic neighbor embedding (t-SNE) algorithm;
a building module 230, configured to construct the isolated forest model;
and a detection module 240, configured to input the dimension-reduced business expansion project data into the isolated forest model to perform anomaly detection on the business expansion project data and obtain a detection result.
In an alternative example, the dimension reduction module 220 includes:
a high-dimensional probability calculation unit, configured to define the business expansion project data as a set of n d-dimensional samples $X = \{x_1, x_2, \ldots, x_n\}$ and, assuming that for any two points $x_i$ and $x_j$, $x_j$ follows a Gaussian distribution $P_i$ centered at $x_i$ with variance parameter $\sigma_i$ and $x_i$ likewise follows a Gaussian distribution $P_j$ centered at $x_j$ with variance parameter $\sigma_j$, to calculate the conditional probability $p_{j|i}$ that $x_i$ and $x_j$ are similar, wherein $p_{j|i}$ is the high-dimensional distribution probability between data points $x_i$ and $x_j$ in the business expansion project data, and the optimal value of $\sigma_i$ is found by binary search on the perplexity $\mathrm{Perp}(P_i)$;
a low-dimensional probability calculation unit, configured to define a low-dimensional sample set $Y = \{y_1, y_2, \ldots, y_n\}$ as the low-dimensional embedding of the high-dimensional sample set $X = \{x_1, x_2, \ldots, x_n\}$, and to calculate the joint probability $q_{ij}$ of the points $y_i$ and $y_j$ that correspond to data points $x_i$ and $x_j$ in the low-dimensional space;
a dimension reduction determination unit, configured to measure the similarity $C$ between the high-dimensional probabilities and $q_{ij}$ using the KL divergence, to minimize the KL divergence by gradient descent, and, in order to speed up the optimization process and avoid poor local optima, to use a relatively large momentum in the gradient update to obtain the low-dimensional embedding $Y^{(t)}$,
wherein $Y^{(t)}$ is the embedding at iteration t, $\eta$ is the learning rate, and $\alpha(t)$ is the momentum at iteration t.
In an alternative example, the building block 230 includes:
and the initialization unit is used for initializing the isolated forest.
And the training unit is used for training the isolated trees in the isolated forest model.
In an alternative example, the initialization unit includes:
the isolated forest establishment element is used for setting an isolated forest consisting of a plurality of isolated trees according to the operation expansion project data after the dimension reduction, and setting a proper secondary sampling sample size psi;
a tree height determining element, the height limit for each tree, l, being determined by the formula: ceiling (log)2Ψ), wherein ceiling represents rounding.
In an alternative example, the training unit comprises:
an isolated tree selecting element for randomly selecting an isolated tree and obtaining its current tree height e, and, if e > l, returning to the previous leaf node;
a split value determining element for randomly selecting an attribute q from the dimension-reduced business expansion project data and randomly selecting a split value p between the maximum and minimum values of q;
a new leaf determining element for splitting, once p is selected, the samples contained in the current leaf node into two parts, a left new leaf node and a right new leaf node, according to whether q ≥ p or q < p;
an update element for updating the tree height e = e + 1; and
a training confirmation element for repeating the above steps for each isolated tree until every isolated tree has been trained.
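The training steps above can be sketched as a small recursive routine. This is an illustrative sketch, not the application's implementation: the dictionary-based node layout, the function names, the synthetic two-dimensional data, and the extra stopping rules (a node with at most one sample, or an attribute with no spread) are all assumptions. Sending samples with q < p to the left child is likewise a choice made here.

```python
import math
import random


def build_tree(samples, e, limit, rng):
    """Grow one isolated tree from a list of equal-length numeric tuples."""
    # The text stops when e > l; the original isolation forest paper uses e >= l.
    # Stopping on a node with at most one sample is an added assumption.
    if e > limit or len(samples) <= 1:
        return {"size": len(samples)}
    q = rng.randrange(len(samples[0]))      # randomly chosen attribute index
    lo = min(s[q] for s in samples)
    hi = max(s[q] for s in samples)
    if lo == hi:                            # attribute cannot separate the samples
        return {"size": len(samples)}
    p = rng.uniform(lo, hi)                 # random split value between min and max
    left = [s for s in samples if s[q] < p]
    right = [s for s in samples if s[q] >= p]
    return {
        "q": q,
        "p": p,
        "left": build_tree(left, e + 1, limit, rng),
        "right": build_tree(right, e + 1, limit, rng),
    }


def path_length(x, node, e=0):
    """Count the edges sample x traverses from the root to its leaf node."""
    if "size" in node:
        return e
    child = node["left"] if x[node["q"]] < node["p"] else node["right"]
    return path_length(x, child, e + 1)


def total_size(node):
    """Sum the sample counts stored in all leaf nodes."""
    if "size" in node:
        return node["size"]
    return total_size(node["left"]) + total_size(node["right"])


rng = random.Random(0)
# Hypothetical data: a Gaussian cluster plus one far-away point.
data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(64)] + [(8.0, 8.0)]
limit = math.ceil(math.log2(len(data)))
tree = build_tree(data, 0, limit, rng)
print(path_length((8.0, 8.0), tree), path_length(data[0], tree))
```

A single tree is noisy, which is why the forest averages path lengths over many trees before scoring.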
In an alternative example, the detection module 240 includes:
an anomaly detection element for performing anomaly detection on each sample in the dimension-reduced business expansion project data;
an anomaly determining element for determining whether each sample is anomalous by means of an anomaly score s, which is determined by the following formulas:
s(x, n) = 2^(-E(h(x))/c(n))
c(n) = 2H(n-1) - (2(n-1)/n)
H(i) ≈ ln(i) + 0.5772156649
wherein x is a sample in the dimension-reduced business expansion project data, n is the number of samples in the dimension-reduced business expansion project data, h(x) is the number of edges that sample x traverses from the root node to its final leaf node, and E(h(x)) is the expected value of h(x) for sample x over all trees of the isolated forest;
and an anomaly score determining element for determining the anomaly score such that the closer the sample x is to the root node, the larger its anomaly score.
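The scoring step can be sketched as follows, using c(n) and H(i) as given above together with the standard isolation forest score s(x, n) = 2^(-E(h(x))/c(n)), which is assumed to be the score intended here. The function names and the example values of n and E(h(x)) are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649


def harmonic(i: float) -> float:
    """Approximate the harmonic number H(i) as ln(i) + 0.5772156649."""
    return math.log(i) + EULER_GAMMA


def c(n: int) -> float:
    """Normalizing term c(n) = 2H(n-1) - 2(n-1)/n."""
    if n < 2:
        raise ValueError("n must be at least 2")
    return 2.0 * harmonic(n - 1) - 2.0 * (n - 1) / n


def anomaly_score(mean_path_length: float, n: int) -> float:
    """Score s = 2 ** (-E(h(x)) / c(n)); a shorter mean path gives a larger score."""
    return 2.0 ** (-mean_path_length / c(n))


n = 256  # hypothetical number of samples
for mean_h in (2.0, c(n), 20.0):
    print(round(mean_h, 3), round(anomaly_score(mean_h, n), 3))
```

A sample whose mean path length equals c(n) scores exactly 0.5. Samples isolated near the root score closer to 1, and samples buried deep in the trees score closer to 0.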
Corresponding to the method for identifying abnormal data of the business expansion project, the application also provides a device comprising at least one memory and at least one processor;
the memory is configured to store one or more programs;
when the one or more programs are executed by the at least one processor, the at least one processor implements the steps of any one of the methods for identifying abnormal data of a business expansion project.
The implementation process of the functions and actions of each component in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again. For the apparatus embodiment, since it basically corresponds to the method embodiment, reference may be made to the partial description of the method embodiment for relevant points. The above-described device embodiments are merely illustrative, wherein the components described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the disclosed solution. One of ordinary skill in the art can understand and implement it without inventive effort.
Corresponding to the method for identifying abnormal data of the business expansion project, the application also provides a computer readable storage medium, wherein a computer program is stored in the computer readable storage medium, and when the computer program is executed by a processor, the computer program realizes the steps of any one of the methods.
The present disclosure may take the form of a computer program product embodied on one or more storage media having program code embodied therein, including, but not limited to, disk storage, CD-ROM, and optical storage. Computer-usable storage media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to: phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape storage or other magnetic storage devices, or any other non-transmission medium that may be used to store information accessible by a computing device.
According to the method, a t-distribution random neighborhood embedding (t-SNE) algorithm is used to perform feature recombination and dimensionality reduction on the original high-dimensional business expansion project data, an isolated forest algorithm is used to discover abnormal data in the recombined, dimension-reduced business expansion project data, and the abnormal data are finally eliminated. The isolated forest algorithm aims to discover the characteristics of abnormal points rather than to generalize the characteristics of normal points. Because abnormal points make up only a small part of the whole sample set and their attributes differ greatly from those of normal points, they are easier to isolate than normal points. The isolated forest is an unsupervised, ensemble-based, fast anomaly detection method with linear time complexity and high accuracy; it achieves good anomaly detection with only a small number of samples and converges quickly, and is therefore of considerable value for eliminating abnormal data from business expansion project data.
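The claim that abnormal points are easier to isolate can be illustrated with a one-dimensional experiment. This is a minimal sketch under stated assumptions: the data set, the function names, and the number of trials are hypothetical, and the routine only counts random splits. It is not a full isolated forest.

```python
import random
import statistics


def isolation_depth(values, target, rng):
    """Count the random splits needed until `target` is alone in its partition."""
    current = list(values)
    depth = 0
    while len(current) > 1:
        p = rng.uniform(min(current), max(current))   # random split value
        if target < p:
            current = [v for v in current if v < p]    # keep the side holding target
        else:
            current = [v for v in current if v >= p]
        depth += 1
    return depth


def mean_depth(values, target, trials=500, seed=0):
    """Average isolation depth of `target` over repeated random partitionings."""
    rng = random.Random(seed)
    return statistics.mean(isolation_depth(values, target, rng) for _ in range(trials))


# Hypothetical data: 50 evenly spaced normal points in [0, 1) and one outlier at 10.
values = [i / 50 for i in range(50)] + [10.0]
normal_depth = mean_depth(values, 0.5)
outlier_depth = mean_depth(values, 10.0)
print("normal point:", normal_depth, "outlier:", outlier_depth)
```

Because most random split values fall in the wide gap between the cluster and the outlier, the outlier is usually separated within one or two splits, while a point inside the cluster needs many more.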
It is to be understood that the embodiments of the present application are not limited to the precise arrangements described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the embodiments of the present application is limited only by the following claims. The above-mentioned embodiments express only a few embodiments of the present application; their description is relatively specific and detailed, but should not be construed as limiting the scope of the invention. It should be noted that those skilled in the art can make several variations and modifications without departing from the concept of the embodiments of the present application, all of which fall within the scope of the embodiments of the present application.
Claims (10)
1. A method for identifying abnormal data of a business expansion project is characterized by comprising the following steps:
acquiring business expansion project data, wherein the business expansion project data comprises project process node number, node working duration and project cost;
performing characteristic recombination and dimensionality reduction on the business expansion project data by adopting a t-distribution random neighborhood embedding algorithm t-SNE;
constructing an isolated forest model;
and inputting the dimension-reduced business expansion project data into the isolated forest model to perform anomaly detection on the business expansion project data and obtain a detection result.
2. The method for identifying abnormal data of a business expansion project according to claim 1, wherein performing feature recombination and dimensionality reduction on the business expansion project data by adopting the t-distribution random neighborhood embedding algorithm t-SNE comprises:
defining the business expansion project data as a set of n d-dimensional samples X = {x_1, x_2, …, x_n}, and assuming that, for any two points x_i and x_j, x_j follows a Gaussian distribution P_i centered at x_i with variance σ_i and, likewise, x_i follows a Gaussian distribution P_j centered at x_j with variance δ_j, so that the conditional probability P_{j|i} that x_i and x_j are similar is:
wherein P_{j|i} is the high-dimensional distribution probability between data points x_i and x_j in the business expansion project data; σ_i is the variance of the Gaussian centered at data point x_i; the optimal value of σ_i is found by bisection using the concept of perplexity, and the perplexity Perp(P_i) can be expressed as:
defining a low-dimensional sample set Y = {y_1, y_2, …, y_n} as the low-dimensional embedding of the high-dimensional sample set X = {x_1, x_2, …, x_n}, the joint probability q_ij of the points y_i and y_j that correspond to data points x_i and x_j in the low-dimensional space being defined as:
minimizing the KL distance by gradient descent, according to the formula:
to speed up the optimization process and avoid local optima, using a larger momentum term in the gradient update to obtain the low-dimensional embedding Y^t:
wherein Y^t is the value at iteration t, η is the learning rate, and α(t) is the momentum at iteration t.
3. The method for identifying abnormal data of business expansion projects according to claim 1, wherein the step of constructing the isolated forest model comprises the following steps:
initializing an isolated forest;
training an isolated tree in the isolated forest model.
4. The method for identifying abnormal data of business expansion projects according to claim 3, wherein the initializing the isolated forest comprises:
setting up, according to the dimension-reduced business expansion project data, an isolated forest consisting of a plurality of isolated trees, and setting a suitable sub-sampling size ψ;
wherein the height limit l of each tree is determined by the following formula:
l = ceiling(log2(ψ)), wherein ceiling denotes rounding up.
5. The method for identifying abnormal data of a business expansion project according to claim 3, wherein training the isolated trees in the isolated forest model comprises:
randomly selecting an isolated tree and obtaining its current tree height e, and, if e > l, returning to the previous leaf node;
otherwise, randomly selecting an attribute q from the dimension-reduced business expansion project data, and randomly selecting a split value p between the maximum and minimum values of q;
after p is selected, splitting the samples contained in the current leaf node into two parts, a left new leaf node and a right new leaf node, according to whether q ≥ p or q < p;
updating the tree height of the isolated tree, e = e + 1;
and repeating the above steps for each isolated tree until every isolated tree has been trained.
6. The method for identifying abnormal data of a business expansion project according to claim 3, wherein inputting the dimension-reduced business expansion project data into the isolated forest model to perform anomaly detection on the business expansion project data and obtain a detection result comprises:
performing anomaly detection on each sample in the dimension-reduced business expansion project data;
determining whether each sample is anomalous by means of an anomaly score s, which is determined by the following formulas:
s(x, n) = 2^(-E(h(x))/c(n))
c(n) = 2H(n-1) - (2(n-1)/n)
H(i) ≈ ln(i) + 0.5772156649
wherein x is a sample in the dimension-reduced business expansion project data, n is the number of samples in the dimension-reduced business expansion project data, h(x) is the number of edges that sample x traverses from the root node to its final leaf node, and E(h(x)) is the expected value of h(x) for sample x over all trees of the isolated forest;
the closer a sample x is to the root node, the larger its anomaly score.
7. An apparatus for identifying abnormal data of a business expansion project, comprising:
a business expansion project acquisition module for acquiring business expansion project data, wherein the business expansion project data comprises project process node number, node working duration and project cost;
a dimension reduction module for performing feature recombination and dimensionality reduction on the business expansion project data by adopting a t-distribution random neighborhood embedding algorithm t-SNE;
a construction module for constructing an isolated forest model;
and a detection module for inputting the dimension-reduced business expansion project data into the isolated forest model to perform anomaly detection on the business expansion project data and obtain a detection result.
8. The apparatus for identifying abnormal data of a business expansion project according to claim 7, wherein the dimension reduction module comprises:
a high-dimensional probability calculation unit for defining the business expansion project data as a set of n d-dimensional samples X = {x_1, x_2, …, x_n}, and assuming that, for any two points x_i and x_j, x_j follows a Gaussian distribution P_i centered at x_i with variance σ_i and, likewise, x_i follows a Gaussian distribution P_j centered at x_j with variance δ_j, so that the conditional probability P_{j|i} that x_i and x_j are similar is:
wherein P_{j|i} is the high-dimensional distribution probability between data points x_i and x_j in the business expansion project data; σ_i is the variance of the Gaussian centered at data point x_i; the optimal value of σ_i is found by bisection using the concept of perplexity, and the perplexity Perp(P_i) can be expressed as:
a low-dimensional probability calculation unit for defining a low-dimensional sample set Y = {y_1, y_2, …, y_n} as the low-dimensional embedding of the high-dimensional sample set X = {x_1, x_2, …, x_n}, the joint probability q_ij of the points y_i and y_j that correspond to data points x_i and x_j in the low-dimensional space being defined as:
a dimension reduction determination unit for measuring the similarity C between P_{j|i} and q_ij using the KL divergence:
the KL distance is minimized by gradient descent, according to the formula:
to speed up the optimization process and avoid local optima, a larger momentum term is used in the gradient update to obtain the low-dimensional embedding Y^t:
wherein Y^t is the value at iteration t, η is the learning rate, and α(t) is the momentum at iteration t.
9. An apparatus, comprising:
at least one memory and at least one processor;
the memory is configured to store one or more programs;
when executed by the at least one processor, the one or more programs cause the at least one processor to implement the steps of the method for identifying abnormal data of a business expansion project according to any one of claims 1 to 6.
10. A computer-readable storage medium storing a computer program, the computer program characterized in that:
the computer program when executed by a processor implements the steps of the method of any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210276723.9A CN114781688A (en) | 2022-03-21 | 2022-03-21 | Method, device, equipment and storage medium for identifying abnormal data of business expansion project |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114781688A (en) | 2022-07-22 |
Family
ID=82425451
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210276723.9A Pending CN114781688A (en) | 2022-03-21 | 2022-03-21 | Method, device, equipment and storage medium for identifying abnormal data of business expansion project |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114781688A (en) |
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109213786A (en) * | 2018-09-17 | 2019-01-15 | 广州供电局有限公司 | Low pressure industry expands data processing method and system |
CN110825724A (en) * | 2019-10-11 | 2020-02-21 | 深圳供电局有限公司 | User charging data exception interception system and method in electricity price system |
Non-Patent Citations (2)
Title |
---|
操江能: "基于优化孤立森林的船舶柴油机故障监测", 《船舶工程》, vol. 43, no. 11, 25 November 2021 (2021-11-25), pages 125 - 132 * |
李倩: "基于模糊孤立森林算法的多维数据异常检测方法", 《计算机与数字工程》, vol. 48, no. 04, 20 April 2020 (2020-04-20), pages 862 - 866 * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115348097A (en) * | 2022-08-18 | 2022-11-15 | 北京天融信网络安全技术有限公司 | Method and device for acquiring abnormal assets, electronic equipment and storage medium |
CN117077067A (en) * | 2023-10-18 | 2023-11-17 | 北京亚康万玮信息技术股份有限公司 | Information system automatic deployment planning method based on intelligent matching |
CN117077067B (en) * | 2023-10-18 | 2023-12-22 | 北京亚康万玮信息技术股份有限公司 | Information system automatic deployment planning method based on intelligent matching |
CN117313555A (en) * | 2023-11-28 | 2023-12-29 | 南京信息工程大学 | Distributed storage-based adaptive OATS-AJSA improved GRU humidity prediction method |
CN117313555B (en) * | 2023-11-28 | 2024-03-08 | 南京信息工程大学 | GRU humidity prediction method based on self-adaptive OATS-AJSA |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||