CN114781688A - Method, device, equipment and storage medium for identifying abnormal data of business expansion project - Google Patents


Publication number
CN114781688A
Authority
CN
China
Prior art keywords
data
expansion project
business expansion
isolated
business
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210276723.9A
Other languages
Chinese (zh)
Inventor
许斌斌
林镜星
周鑫
林其雄
谢志炜
陈刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Power Supply Bureau of Guangdong Power Grid Co Ltd
Original Assignee
Guangzhou Power Supply Bureau of Guangdong Power Grid Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Power Supply Bureau of Guangdong Power Grid Co Ltd
Priority to CN202210276723.9A
Publication of CN114781688A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00: Administration; Management
    • G06Q10/04: Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/23: Clustering techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00: Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/06: Energy or water supply

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Economics (AREA)
  • Physics & Mathematics (AREA)
  • Human Resources & Organizations (AREA)
  • Strategic Management (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Marketing (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Water Supply & Treatment (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Primary Health Care (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Development Economics (AREA)
  • Game Theory and Decision Science (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to a method for identifying abnormal data of a business expansion project, comprising the following steps: acquiring business expansion project data, which comprises the number of project process nodes, the node working duration and the project cost; performing feature recombination and dimensionality reduction on the business expansion project data using the t-distributed stochastic neighbor embedding algorithm (t-SNE); constructing an isolated forest model; and inputting the dimension-reduced business expansion project data into the isolated forest model to perform anomaly detection and obtain a detection result. The method addresses anomaly detection under unsupervised learning: high-dimensional data are reduced in dimension and recombined by the t-SNE algorithm, and abnormal data are then detected with an isolated forest. The isolated forest is a fast, unsupervised, ensemble-based anomaly detection method with linear time complexity and high accuracy; it achieves a good detection effect with only a small number of samples and converges quickly.

Description

Method, device, equipment and storage medium for identifying abnormal data of business expansion project
Technical Field
The invention relates to the technical field of anomaly detection for business expansion project data, and in particular to a method, device, equipment and storage medium for identifying abnormal data of a business expansion project.
Background
With the rapid development of the power grid industry in China, how to improve the quality of electric energy and provide high-quality power utilization service is always a concern in the field of power grids.
During business expansion and installation, a small amount of abnormal data often appears in the data source of business expansion supporting projects for various reasons. Identifying this abnormal data is key to the success of clustering those projects and of predicting the working duration of each node in the business expansion process. However, most common anomaly identification methods are designed for imbalanced data. High-dimensional data are harder to handle because they are more sparsely distributed in space than low-dimensional data, and overlooking this sparsity is a further factor that degrades outlier identification. As a result, abnormal data identification becomes inaccurate and overly sensitive.
Disclosure of Invention
Based on this, the present application provides a method, device, equipment and storage medium for identifying abnormal data of a business expansion project. The method addresses anomaly detection under unsupervised learning: high-dimensional data are reduced in dimension and recombined by the t-SNE algorithm, and abnormal data are then detected with an isolated forest. The isolated forest is a fast, unsupervised, ensemble-based anomaly detection method with linear time complexity and high accuracy; it achieves a good detection effect with only a small number of samples and converges quickly.
According to a first aspect of some embodiments herein, there is provided a method for identifying abnormal data of a business expansion project, comprising the steps of:
acquiring business expansion project data, wherein the business expansion project data comprises project process node number, node working duration and project cost;
performing feature recombination and dimensionality reduction on the business expansion project data by adopting a t-distribution random neighborhood embedding algorithm t-SNE;
constructing an isolated forest model;
and inputting the dimension-reduced business expansion project data into the isolated forest model to perform anomaly detection on the business expansion project data and obtain a detection result.
Further, performing feature recombination and dimensionality reduction on the business expansion project data using the t-distributed stochastic neighbor embedding algorithm (t-SNE) comprises:
defining the business expansion project data as a set of n d-dimensional samples $X = \{x_1, x_2, \ldots, x_n\}$; for any two points $x_i$ and $x_j$, assuming that $x_j$ obeys a Gaussian distribution $P_i$ centered on $x_i$ with variance $\sigma_i$, and likewise that $x_i$ obeys a Gaussian distribution $P_j$ centered on $x_j$ with variance $\sigma_j$. The conditional probability $p_{j|i}$ that $x_i$ and $x_j$ are similar is then:
$$p_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)}$$
wherein $p_{j|i}$ is the high-dimensional distribution probability between data points $x_i$ and $x_j$ in the business expansion project data, and $\sigma_i$ is the variance of the Gaussian centered on data point $x_i$. The optimal value of $\sigma_i$ is found by binary search on the perplexity, where the perplexity $Perp(P_i)$ can be expressed as:
$$Perp(P_i) = 2^{H(P_i)}$$
$$H(P_i) = -\sum_j p_{j|i} \log_2 p_{j|i}$$
defining a low-dimensional sample set $Y = \{y_1, y_2, \ldots, y_n\}$ as the low-dimensional embedding of the high-dimensional sample set $X = \{x_1, x_2, \ldots, x_n\}$; the joint probability $q_{ij}$ of the points $y_i$ and $y_j$ in the low-dimensional space that correspond to data points $x_i$ and $x_j$ is defined as:
$$q_{ij} = \frac{\left(1 + \|y_i - y_j\|^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \|y_k - y_l\|^2\right)^{-1}}$$
measuring the similarity C between $p_{j|i}$ and $q_{ij}$ using the KL divergence:
$$C = KL(P \| Q) = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{ij}}$$
minimizing the KL distance by gradient descent, with the gradient:
$$\frac{\delta C}{\delta y_i} = 4 \sum_j \left(p_{j|i} - q_{ij}\right)\left(y_i - y_j\right)\left(1 + \|y_i - y_j\|^2\right)^{-1}$$
to speed up the optimization and avoid local optima, a larger momentum is used in the gradient update to obtain the low-dimensional embedding $Y^t$:
$$Y^t = Y^{t-1} + \eta \frac{\delta C}{\delta Y} + \alpha(t)\left(Y^{t-1} - Y^{t-2}\right)$$
wherein $Y^t$ is the value at iteration t, $\eta$ is the learning rate, and $\alpha(t)$ is the momentum at iteration t.
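The binary search on the perplexity described above can be sketched as follows. This is only an illustrative sketch: the function names, the search bounds, the iteration count and the sample distances are assumptions and do not come from the application.

```python
import math

def conditional_probs(sq_dists, sigma):
    """p_{j|i} for one point x_i, given squared distances to the other points."""
    d_min = min(sq_dists)  # shift by the minimum for numerical stability
    weights = [math.exp(-(d - d_min) / (2.0 * sigma ** 2)) for d in sq_dists]
    total = sum(weights)
    return [w / total for w in weights]

def perplexity(probs):
    """Perp(P_i) = 2 ** H(P_i), with H the Shannon entropy in bits."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0.0)
    return 2.0 ** entropy

def find_sigma(sq_dists, target_perp, lo=1e-3, hi=1e3, iters=60):
    """Bisection on sigma_i; the perplexity grows as sigma_i grows."""
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if perplexity(conditional_probs(sq_dists, mid)) > target_perp:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0

# Squared distances from x_i to four hypothetical neighbours.
sq_dists = [0.5, 1.0, 4.0, 9.0]
sigma_i = find_sigma(sq_dists, target_perp=2.5)
p_i = conditional_probs(sq_dists, sigma_i)
```

The target perplexity must lie between 1 and the number of neighbours for the search to have a solution.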
Further, the step of constructing the isolated forest model comprises:
initializing an isolated forest;
training an isolated tree in the isolated forest model.
Further, initializing the isolated forest comprises:
setting up, according to the dimension-reduced business expansion project data, an isolated forest consisting of a plurality of isolated trees, and setting a suitable sub-sampling size ψ;
wherein the height limit l of each tree is determined by the following formula:
$l = \mathrm{ceiling}(\log_2 \psi)$, wherein ceiling denotes rounding up.
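As a small worked example of the height limit, the snippet below evaluates it for a few sub-sampling sizes. The sizes shown are arbitrary illustrative choices; the application does not prescribe a value of ψ.

```python
import math

def height_limit(psi):
    """l = ceiling(log2(psi)) for a sub-sample of size psi."""
    return math.ceil(math.log2(psi))

# Height limits for three hypothetical sub-sampling sizes.
limits = {psi: height_limit(psi) for psi in (64, 100, 256)}
```

Because the limit grows only logarithmically with ψ, each tree stays shallow even for larger sub-samples.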
Further, training the isolated trees in the isolated forest model comprises:
randomly selecting an isolated tree and obtaining its current tree height e; if e > l, returning to the previous leaf node;
otherwise, randomly selecting an attribute q from the dimension-reduced business expansion project data, and randomly selecting a split value p between the maximum value and the minimum value of q;
after p is selected, splitting the samples contained in the current leaf node into two parts, a new left leaf node and a new right leaf node, according to the conditions q ≥ p and q < p;
updating the tree height of the isolated tree to e + 1;
and repeating the above steps for each isolated tree until every isolated tree has been trained.
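The training steps above can be sketched as a recursive tree builder. This is a simplified illustration, not the applicant's implementation: the dictionary-based node layout, the function names and the sample data are assumptions. The sketch stops once the height reaches the limit, and it places samples with q < p in the left branch, which is one possible reading of the split condition.

```python
import random

def build_tree(samples, height, limit, rng):
    """Recursively split `samples` (lists of floats) until the height limit
    is reached or the node can no longer be divided."""
    if height >= limit or len(samples) <= 1:
        return {"size": len(samples)}           # leaf node
    q = rng.randrange(len(samples[0]))          # random attribute index
    lo = min(s[q] for s in samples)
    hi = max(s[q] for s in samples)
    if lo == hi:
        return {"size": len(samples)}           # nothing left to split on
    p = rng.uniform(lo, hi)                     # random split value
    left = [s for s in samples if s[q] < p]
    right = [s for s in samples if s[q] >= p]
    return {"q": q, "p": p,
            "left": build_tree(left, height + 1, limit, rng),
            "right": build_tree(right, height + 1, limit, rng)}

def path_length(x, node, edges=0):
    """Number of edges from the root to the leaf that x falls into."""
    if "size" in node:
        return edges
    child = node["left"] if x[node["q"]] < node["p"] else node["right"]
    return path_length(x, child, edges + 1)

rng = random.Random(0)
# Four clustered points and one distant point (hypothetical 2-D data).
data = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.25], [0.3, 0.2], [5.0, 5.0]]
tree = build_tree(data, height=0, limit=3, rng=rng)
depths = [path_length(x, tree) for x in data]
```

A forest repeats `build_tree` on independent random sub-samples and averages the resulting path lengths per sample.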
Further, inputting the dimension-reduced business expansion project data into the isolated forest model to perform anomaly detection on the business expansion project data and obtain a detection result comprises:
performing anomaly detection on each sample in the dimension-reduced business expansion project data;
determining the abnormality of each sample by an anomaly score s, which is given by the following formulas:
$$s(x, n) = 2^{-E(h(x)) / c(n)}$$
$$c(n) = 2H(n-1) - \frac{2(n-1)}{n}$$
$$H(i) \approx \ln(i) + 0.5772156649$$
wherein x is a sample in the dimension-reduced business expansion project data, n is the number of samples in the dimension-reduced business expansion project data, h(x) is the number of edges traversed by sample x from the root node to the leaf node where it ends, and E(h(x)) is the expectation of h(x) over all trees of the isolated forest;
the closer a sample x is to the root node, the greater its outlier score.
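The score formulas can be evaluated directly, as in the sketch below. The function names and the example path lengths are illustrative assumptions; only the formulas for c(n) and H(i) are taken from the text.

```python
import math

EULER_GAMMA = 0.5772156649

def harmonic(i):
    """H(i) ~ ln(i) + Euler's constant."""
    return math.log(i) + EULER_GAMMA

def c(n):
    """Normalising term c(n) = 2H(n-1) - 2(n-1)/n, for n > 2."""
    return 2.0 * harmonic(n - 1) - 2.0 * (n - 1) / n

def anomaly_score(mean_path_length, n):
    """s(x, n) = 2 ** (-E(h(x)) / c(n))."""
    return 2.0 ** (-mean_path_length / c(n))

n = 256
short_path = anomaly_score(2.0, n)     # sample isolated close to the root
long_path = anomaly_score(8.0, n)      # sample isolated deep in the trees
at_average = anomaly_score(c(n), n)    # E(h(x)) equal to c(n) gives 0.5
```

A shorter mean path yields a score closer to 1, matching the statement that samples near the root receive larger anomaly scores.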
According to a second aspect of some embodiments of the present application, there is provided an apparatus for identifying abnormal data of a business expansion project, comprising:
the business expansion project acquisition module is used for acquiring business expansion project data, and the business expansion project data comprises project process node number, node working time and project cost;
the dimensionality reduction module is used for performing characteristic recombination and dimensionality reduction on the business expansion project data by adopting a t-distribution random neighborhood embedding algorithm t-SNE;
the construction module is used for constructing an isolated forest model;
and the detection module is used for inputting the dimension-reduced business expansion project data into the isolated forest model to perform anomaly detection on the business expansion project data and obtain a detection result.
Further, the dimension reduction module comprises:
a high-dimensional probability calculation unit, configured to define the business expansion project data as a set of n d-dimensional samples $X = \{x_1, x_2, \ldots, x_n\}$; for any two points $x_i$ and $x_j$, it is assumed that $x_j$ obeys a Gaussian distribution $P_i$ centered on $x_i$ with variance $\sigma_i$, and likewise that $x_i$ obeys a Gaussian distribution $P_j$ centered on $x_j$ with variance $\sigma_j$. The conditional probability $p_{j|i}$ that $x_i$ and $x_j$ are similar is then:
$$p_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)}$$
wherein $p_{j|i}$ is the high-dimensional distribution probability between data points $x_i$ and $x_j$ in the business expansion project data, and $\sigma_i$ is the variance of the Gaussian centered on data point $x_i$. The optimal value of $\sigma_i$ is found by binary search on the perplexity, where the perplexity $Perp(P_i)$ can be expressed as:
$$Perp(P_i) = 2^{H(P_i)}$$
$$H(P_i) = -\sum_j p_{j|i} \log_2 p_{j|i}$$
a low-dimensional probability calculation unit, configured to define a low-dimensional sample set $Y = \{y_1, y_2, \ldots, y_n\}$ as the low-dimensional embedding of the high-dimensional sample set $X = \{x_1, x_2, \ldots, x_n\}$; the joint probability $q_{ij}$ of the points $y_i$ and $y_j$ in the low-dimensional space that correspond to data points $x_i$ and $x_j$ is defined as:
$$q_{ij} = \frac{\left(1 + \|y_i - y_j\|^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \|y_k - y_l\|^2\right)^{-1}}$$
a dimension reduction determination unit, configured to measure the similarity C between $p_{j|i}$ and $q_{ij}$ using the KL divergence:
$$C = KL(P \| Q) = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{ij}}$$
the KL distance is minimized by gradient descent, with the gradient:
$$\frac{\delta C}{\delta y_i} = 4 \sum_j \left(p_{j|i} - q_{ij}\right)\left(y_i - y_j\right)\left(1 + \|y_i - y_j\|^2\right)^{-1}$$
to speed up the optimization and avoid local optima, a larger momentum is used in the gradient update to obtain the low-dimensional embedding $Y^t$:
$$Y^t = Y^{t-1} + \eta \frac{\delta C}{\delta Y} + \alpha(t)\left(Y^{t-1} - Y^{t-2}\right)$$
wherein $Y^t$ is the value at iteration t, $\eta$ is the learning rate, and $\alpha(t)$ is the momentum at iteration t.
According to a third aspect of some embodiments herein there is provided an apparatus comprising:
at least one memory and at least one processor;
the memory is configured to store one or more programs;
when executed by the at least one processor, the one or more programs cause the at least one processor to implement the steps of the method for identifying anomalous data in a business expansion project according to any one of the first aspects.
According to a fourth aspect of some embodiments of the present application, there is provided a computer readable storage medium storing a computer program which, when executed by a processor, performs the steps of the method according to any one of the first aspect.
According to the method, feature recombination and dimensionality reduction are performed on the original high-dimensional business expansion project data using the t-distributed stochastic neighbor embedding algorithm, abnormal data in the recombined, dimension-reduced data are found using the isolated forest algorithm, and those abnormal data are finally eliminated. The isolated forest algorithm aims to discover the characteristics of abnormal points rather than to generalize the characteristics of normal points. Because abnormal points make up only a small part of the whole sample set and their attributes differ greatly from those of normal points, they are easier to isolate than normal points. The isolated forest is a fast, unsupervised, ensemble-based anomaly detection method with linear time complexity and high accuracy; it achieves a good detection effect with only a small number of samples and converges quickly, which is of great significance for eliminating abnormal data from business expansion project data.
To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings used in describing them are briefly introduced below. The drawings described below show only some embodiments of the present application; a person skilled in the art can obtain other drawings from them without creative effort.
Drawings
FIG. 1 is a flowchart illustrating the steps of a method for identifying abnormal data of a business expansion project according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of an apparatus for identifying abnormal data of a business expansion project according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
It should be understood that the embodiments described are only a few embodiments of the present application, and not all embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without making any creative effort belong to the protection scope of the embodiments in the present application.
The terminology used in the embodiments of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the embodiments of the present application. As used in the examples of this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
The following description refers to the accompanying drawings in which the same numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the application, as detailed in the appended claims. In the description of the present application, it is to be understood that the terms "first," "second," "third," and the like are used solely to distinguish one from another and are not necessarily used to describe a particular order or sequence, nor are they to be construed as indicating or implying relative importance. The specific meaning of the above terms in the present application can be understood by those of ordinary skill in the art as the case may be.
Further, in the description of the present application, "a plurality" means two or more unless otherwise specified. "And/or" describes an association between associated objects and covers three cases; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates an "or" relationship between the objects before and after it.
The present application aims to solve the problem, noted in the Background, that in actual engineering normal data and abnormal data are distributed in the same data set and there is insufficient information to tell them apart.
The present application provides a method for identifying abnormal data of a business expansion project which, referring to FIG. 1, includes the following steps:
Step S1: acquiring business expansion project data, wherein the business expansion project data comprises the number of project process nodes, the node working duration and the project cost.
The business expansion project comprises a plurality of project process nodes, and the working time and the cost of different process nodes are different.
Step S2: performing feature recombination and dimensionality reduction on the business expansion project data using the t-distributed stochastic neighbor embedding algorithm (t-SNE).
The t-distributed stochastic neighbor embedding algorithm reduces the dimension of high-dimensional data: it aims to represent a set of data points from a high-dimensional space accurately in a low-dimensional space, usually a two-dimensional one. The algorithm is non-linear and adapts to the underlying data, and it exposes one tuning parameter, the perplexity, which in short balances attention between the local and global aspects of the data.
Because the business expansion project data involve not only project categories and node counts but also time and cost, they are highly complex and large in volume. Reducing their dimension with this algorithm therefore helps the subsequent identification of abnormal data.
Step S3: constructing an isolated forest model.
The isolated forest (iForest) model is an unsupervised anomaly detection method suitable for continuous numerical data: no labeled samples are needed for training, but the features must be continuous. iForest uses a very efficient strategy to find which points are easily isolated. In an isolated forest, the data set is recursively and randomly partitioned until all sample points are isolated. Under this random partitioning strategy, outliers typically have shorter paths.
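To illustrate why outliers tend to have shorter paths under random partitioning, the following sketch counts how many random splits are needed to isolate a point in one dimension. The data values, the function names and the number of runs are hypothetical choices made only for this illustration.

```python
import random

def splits_to_isolate(values, target, rng):
    """Count random splits until `target` is alone in its partition."""
    current = list(values)
    splits = 0
    while len(current) > 1:
        lo, hi = min(current), max(current)
        if lo == hi:
            break
        p = rng.uniform(lo, hi)
        # Keep only the side of the split that still contains the target.
        current = [v for v in current if (v < p) == (target < p)]
        splits += 1
    return splits

def mean_splits(values, target, rng, runs=200):
    """Average number of splits over repeated random partitionings."""
    return sum(splits_to_isolate(values, target, rng) for _ in range(runs)) / runs

rng = random.Random(0)
# Eleven clustered values plus one distant value.
values = [10.0 + 0.1 * k for k in range(11)] + [25.0]
mean_outlier = mean_splits(values, values[-1], rng)
mean_inlier = mean_splits(values, values[5], rng)
```

The distant value is usually separated by the very first split, while a value inside the cluster needs several more, which is the effect the anomaly score exploits.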
Step S4: inputting the dimension-reduced business expansion project data into the isolated forest model to perform anomaly detection on the business expansion project data and obtain a detection result.
The method mainly researches abnormal detection under unsupervised learning, reduces the dimensionality of high-dimensionality data through a t-SNE algorithm, recombines the high-dimensionality data, and then detects abnormal data by using isolated forests. The isolated forest is an unsupervised rapid anomaly detection method based on combination, has linear time complexity and high accuracy, can achieve good anomaly detection effect only by a small amount of samples, and has the advantage of rapid convergence.
In a specific example, performing feature recombination and dimensionality reduction on the business expansion project data using the t-distributed stochastic neighbor embedding algorithm (t-SNE) comprises:
defining the business expansion project data as a set of n d-dimensional samples $X = \{x_1, x_2, \ldots, x_n\}$, wherein each data point in the sample set indicates any one of the number of project process nodes, the node working duration and the project cost in the business expansion project data.
For any two data points $x_i$ and $x_j$, assume that $x_j$ obeys a Gaussian distribution $P_i$ centered on $x_i$ with variance $\sigma_i$, and likewise that $x_i$ obeys a Gaussian distribution $P_j$ centered on $x_j$ with variance $\sigma_j$. The conditional probability $p_{j|i}$ that $x_i$ and $x_j$ are similar is then:
$$p_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)}$$
wherein $p_{j|i}$ is the high-dimensional distribution probability between data points $x_i$ and $x_j$ in the business expansion project data, and $\sigma_i$ is the variance of the Gaussian centered on data point $x_i$. The optimal value of $\sigma_i$ is found by binary search on the perplexity, where the perplexity $Perp(P_i)$ can be expressed as:
$$Perp(P_i) = 2^{H(P_i)}$$
$$H(P_i) = -\sum_j p_{j|i} \log_2 p_{j|i}$$
A low-dimensional sample set $Y = \{y_1, y_2, \ldots, y_n\}$ is defined as the low-dimensional embedding of the high-dimensional sample set $X = \{x_1, x_2, \ldots, x_n\}$; the joint probability $q_{ij}$ of the points $y_i$ and $y_j$ in the low-dimensional space that correspond to data points $x_i$ and $x_j$ is defined as:
$$q_{ij} = \frac{\left(1 + \|y_i - y_j\|^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \|y_k - y_l\|^2\right)^{-1}}$$
The similarity C between $p_{j|i}$ and $q_{ij}$ is measured using the KL divergence:
$$C = KL(P \| Q) = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{ij}}$$
The KL distance is minimized by gradient descent, with the gradient:
$$\frac{\delta C}{\delta y_i} = 4 \sum_j \left(p_{j|i} - q_{ij}\right)\left(y_i - y_j\right)\left(1 + \|y_i - y_j\|^2\right)^{-1}$$
To speed up the optimization and avoid local optima, a larger momentum is used in the gradient update to obtain the low-dimensional embedding $Y^t$:
$$Y^t = Y^{t-1} + \eta \frac{\delta C}{\delta Y} + \alpha(t)\left(Y^{t-1} - Y^{t-2}\right)$$
wherein $Y^t$ is the value at iteration t, $\eta$ is the learning rate, and $\alpha(t)$ is the momentum at iteration t.
In a specific embodiment, the step of constructing the isolated forest model comprises:
initializing the isolated forest.
Specifically, initializing the isolated forest comprises:
setting up, according to the dimension-reduced business expansion project data, an isolated forest consisting of a plurality of isolated trees, and setting a suitable sub-sampling size ψ;
wherein the height limit l of each isolated tree is determined by the following formula:
$l = \mathrm{ceiling}(\log_2 \psi)$, wherein ceiling denotes rounding up.
Training the isolated trees in the isolated forest model.
Specifically, training the isolated trees in the isolated forest model comprises:
randomly selecting an isolated tree and obtaining its current tree height e; if e > l, returning to the previous leaf node.
Otherwise, randomly selecting an attribute q from the dimension-reduced business expansion project data, and randomly selecting a split value p between the maximum value and the minimum value of q.
After p is selected, the samples contained in the current leaf node are split into two parts, a new left leaf node and a new right leaf node, according to the conditions q ≥ p and q < p.
The tree height of the isolated tree is updated to e + 1.
The above steps are repeated for each isolated tree until every isolated tree has been trained.
In a preferred embodiment, the step of inputting the dimension-reduced business expansion project data into the isolated forest model to perform anomaly detection on the business expansion project data and obtain a detection result comprises:
performing anomaly detection on each sample in the dimension-reduced business expansion project data;
determining the abnormality of each sample by an anomaly score s, which is given by the following formulas:
$$s(x, n) = 2^{-E(h(x)) / c(n)}$$
$$c(n) = 2H(n-1) - \frac{2(n-1)}{n}$$
$$H(i) \approx \ln(i) + 0.5772156649$$
wherein x is a sample in the dimension-reduced business expansion project data, n is the number of samples in the dimension-reduced business expansion project data, h(x) is the number of edges traversed by sample x from the root node to the leaf node where it ends, and E(h(x)) is the expectation of h(x) over all trees of the isolated forest;
the closer a sample x is to the root node, the greater its outlier score.
Corresponding to the above method for identifying the abnormal data of the business expansion project, as shown in fig. 2, the present application further provides an apparatus 200 for identifying the abnormal data of the business expansion project, including:
a business expansion project acquisition module 210, configured to acquire business expansion project data, where the business expansion project data includes the number of project process nodes, the node working duration and the project cost;
a dimension reduction module 220, configured to perform feature recombination and dimensionality reduction on the business expansion project data using the t-distributed stochastic neighbor embedding algorithm (t-SNE);
a building module 230, configured to build the isolated forest model;
and a detection module 240, configured to input the dimension-reduced business expansion project data into the isolated forest model to perform anomaly detection on the business expansion project data and obtain a detection result.
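One way to picture how the four modules fit together is the sketch below, which wires them up as pluggable callables. The class name, the field names, the score threshold and the stand-in functions are all hypothetical; the application does not specify an implementation or a cut-off value.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Sample = Sequence[float]

@dataclass
class AnomalyIdentificationApparatus:
    acquire: Callable[[], List[Sample]]                                # module 210
    reduce_dim: Callable[[List[Sample]], List[Sample]]                 # module 220
    build_model: Callable[[List[Sample]], Callable[[Sample], float]]   # module 230
    threshold: float = 0.6   # assumed cut-off; the source gives none

    def detect(self) -> List[bool]:                                    # module 240
        data = self.acquire()
        reduced = self.reduce_dim(data)
        score = self.build_model(reduced)
        return [score(x) > self.threshold for x in reduced]

# Stand-ins only: a real system would plug in t-SNE and an isolated forest.
apparatus = AnomalyIdentificationApparatus(
    acquire=lambda: [[3.0, 10.0, 1.2], [3.0, 11.0, 1.1], [9.0, 90.0, 8.5]],
    reduce_dim=lambda rows: [row[:2] for row in rows],
    build_model=lambda rows: (lambda x: 0.9 if x[1] > 50 else 0.4),
)
flags = apparatus.detect()
```

Keeping each module behind a plain callable lets the dimension reduction and the detector be replaced or tested independently.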
In an alternative example, the dimension reduction module 220 includes:
a high-dimensional probability calculation unit, configured to define the business expansion project data as a set of n d-dimensional samples $X = \{x_1, x_2, \ldots, x_n\}$; for any two points $x_i$ and $x_j$, it is assumed that $x_j$ obeys a Gaussian distribution $P_i$ centered on $x_i$ with variance $\sigma_i$, and likewise that $x_i$ obeys a Gaussian distribution $P_j$ centered on $x_j$ with variance $\sigma_j$. The conditional probability $p_{j|i}$ that $x_i$ and $x_j$ are similar is then:
$$p_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)}$$
wherein $p_{j|i}$ is the high-dimensional distribution probability between data points $x_i$ and $x_j$ in the business expansion project data, and $\sigma_i$ is the variance of the Gaussian centered on data point $x_i$. The optimal value of $\sigma_i$ is found by binary search on the perplexity, where the perplexity $Perp(P_i)$ can be expressed as:
$$Perp(P_i) = 2^{H(P_i)}$$
$$H(P_i) = -\sum_j p_{j|i} \log_2 p_{j|i}$$
a low-dimensional probability calculation unit, configured to define a low-dimensional sample set $Y = \{y_1, y_2, \ldots, y_n\}$ as the low-dimensional embedding of the high-dimensional sample set $X = \{x_1, x_2, \ldots, x_n\}$; the joint probability $q_{ij}$ of the points $y_i$ and $y_j$ in the low-dimensional space that correspond to data points $x_i$ and $x_j$ is defined as:
$$q_{ij} = \frac{\left(1 + \|y_i - y_j\|^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \|y_k - y_l\|^2\right)^{-1}}$$
a dimension reduction determination unit, configured to measure the similarity C between $p_{j|i}$ and $q_{ij}$ using the KL divergence:
$$C = KL(P \| Q) = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{ij}}$$
the KL distance is minimized by gradient descent, with the gradient:
$$\frac{\delta C}{\delta y_i} = 4 \sum_j \left(p_{j|i} - q_{ij}\right)\left(y_i - y_j\right)\left(1 + \|y_i - y_j\|^2\right)^{-1}$$
to speed up the optimization and avoid local optima, a larger momentum is used in the gradient update to obtain the low-dimensional embedding $Y^t$:
$$Y^t = Y^{t-1} + \eta \frac{\delta C}{\delta Y} + \alpha(t)\left(Y^{t-1} - Y^{t-2}\right)$$
wherein $Y^t$ is the value at iteration t, $\eta$ is the learning rate, and $\alpha(t)$ is the momentum at iteration t.
In an alternative example, the building block 230 includes:
and the initialization unit is used for initializing the isolated forest.
And the training unit is used for training the isolated trees in the isolated forest model.
In an alternative example, the initialization unit includes:
the isolated forest establishment element is used for setting an isolated forest consisting of a plurality of isolated trees according to the operation expansion project data after the dimension reduction, and setting a proper secondary sampling sample size psi;
a tree height determining element, the height limit for each tree, l, being determined by the formula: ceiling (log)2Ψ), wherein ceiling represents rounding.
In an alternative example, the training unit comprises:
an isolated tree selection element for randomly selecting an isolated tree and obtaining its current tree height e, and, if e > l, returning to the previous leaf node;
a split value determining element for randomly selecting a sample q from the dimension-reduced business expansion project data and randomly selecting a split value p between the maximum value and the minimum value of q;
a new leaf determining element for, after p is selected, splitting the samples contained in the current leaf node into two parts, a left new leaf and a right new leaf, according to q ≥ p and q < p;
an update element for updating the tree height e = e + 1; and
a training confirmation element for repeating the above steps for each isolated tree until every isolated tree has been trained.
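The training steps above can be sketched roughly as follows. This is an illustrative reading under several assumptions, not the applicant's code: q is treated as a randomly chosen feature index (the text calls it a sample, but takes its maximum and minimum), growth stops once e reaches the limit l or a node holds at most one sample, and samples with q < p go to the left child. All function names and the dictionary layout are hypothetical.

```python
import math
import random

def build_tree(samples, e, limit, rng):
    """Recursively grow one isolated tree from a list of feature vectors."""
    # Stop at the height limit or when the node can no longer be split.
    if e >= limit or len(samples) <= 1:
        return {"size": len(samples)}
    q = rng.randrange(len(samples[0]))       # randomly chosen feature index q
    lo = min(s[q] for s in samples)
    hi = max(s[q] for s in samples)
    if lo == hi:                             # identical values: nothing to split
        return {"size": len(samples)}
    p = rng.uniform(lo, hi)                  # random split value p in [min, max]
    left = [s for s in samples if s[q] < p]
    right = [s for s in samples if s[q] >= p]
    return {
        "q": q,
        "p": p,
        "left": build_tree(left, e + 1, limit, rng),    # tree height e + 1
        "right": build_tree(right, e + 1, limit, rng),
    }

def build_forest(data, n_trees, psi, seed=0):
    """Train n_trees isolated trees, each on psi samples drawn without replacement."""
    if not data:
        raise ValueError("data must not be empty")
    rng = random.Random(seed)
    psi = min(psi, len(data))
    limit = math.ceil(math.log2(psi))        # height limit l = ceiling(log2(psi))
    return [build_tree(rng.sample(data, psi), 0, limit, rng)
            for _ in range(n_trees)]
```

Each internal node stores the feature index and split value, and each leaf stores only how many samples reached it, which is all that is needed later to measure path lengths.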
In an alternative example, the detection module 240 includes:
an anomaly detection element for performing anomaly detection on each sample in the dimension-reduced business expansion project data;
an abnormal situation determining element for determining an abnormal situation of each of the samples by using an abnormal score s, which is determined by the following formula:
Figure BDA0003556316470000111
c(n)=2H(n-1)-(2(n-1)/n)
H(i)≈ln(i)+0.5772156649
wherein x is a sample in the dimension-reduced business expansion project data, n is the number of samples in the dimension-reduced business expansion project data, h(x) is the number of edges that sample x passes from the root node to its final leaf node, and E(h(x)) is the expectation of h(x) for sample x over all trees of the isolated forest;
and an abnormal score determining element for determining that the closer a sample x is to the root node, the larger its abnormal score.
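The score computation can be illustrated with the two formulas given in the text, c(n) and H(i). The formula for s itself appears only as a figure that is not recoverable here, so the sketch assumes the standard form s(x, n) = 2^(-E(h(x)) / c(n)); the function names are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649  # constant used in the source's H(i) approximation

def harmonic(i):
    """H(i) ≈ ln(i) + 0.5772156649, as stated in the text."""
    return math.log(i) + EULER_GAMMA

def c(n):
    """c(n) = 2H(n-1) - 2(n-1)/n, used to normalize the path length."""
    if n <= 1:
        return 0.0
    return 2.0 * harmonic(n - 1) - 2.0 * (n - 1) / n

def anomaly_score(mean_path_length, n):
    """Assumed standard form: s(x, n) = 2 ** (-E(h(x)) / c(n))."""
    if n <= 1:
        raise ValueError("n must be greater than 1")
    return 2.0 ** (-mean_path_length / c(n))
```

Under this form, a sample whose expected path length equals c(n) scores exactly 0.5, and shorter paths (samples isolated closer to the root) push the score toward 1.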
Corresponding to the method for identifying abnormal data of the business expansion project, the application also provides equipment which comprises at least one memory and at least one processor;
the memory is configured to store one or more programs;
when the one or more programs are executed by the at least one processor, the at least one processor implements the steps of any one of the above methods for identifying abnormal data of a business expansion project.
The implementation process of the functions and actions of each component in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again. For the apparatus embodiment, since it basically corresponds to the method embodiment, reference may be made to the partial description of the method embodiment for relevant points. The above-described device embodiments are merely illustrative, wherein the components described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the disclosed solution. One of ordinary skill in the art can understand and implement it without inventive effort.
Corresponding to the method for identifying abnormal data of the business expansion project, the application also provides a computer readable storage medium, wherein a computer program is stored in the computer readable storage medium, and when the computer program is executed by a processor, the computer program realizes the steps of any one of the methods.
The present disclosure may take the form of a computer program product embodied on one or more storage media having program code embodied therein, including, but not limited to, disk storage, CD-ROM, and optical storage. Computer-usable storage media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to: phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
In the present application, the original high-dimensional business expansion project data are subjected to feature recombination and dimension reduction using the t-distributed stochastic neighbor embedding (t-SNE) algorithm; abnormal data in the recombined, dimension-reduced business expansion project data are then found using the isolated forest algorithm and finally eliminated. The isolated forest algorithm aims to find the characteristics of abnormal points rather than to generalize the characteristics of normal points. Because abnormal points make up only a small part of the whole sample set and their attributes differ greatly from those of normal points, they are easier to isolate than normal points. The isolated forest is a fast, unsupervised, ensemble-based anomaly detection method with linear time complexity and high accuracy; it achieves a good anomaly detection effect with only a small number of samples, converges quickly, and is therefore of significance for eliminating abnormal data in business expansion project data.
It is to be understood that the embodiments of the present application are not limited to the precise arrangements described above and shown in the drawings, and that various modifications and changes may be made without departing from their scope. The scope of the embodiments of the present application is limited only by the following claims. The above embodiments express only a few implementations of the present application; although their description is specific and detailed, it should not be construed as limiting the scope of the invention. It should be noted that those skilled in the art can make several variations and modifications without departing from the concept of the embodiments of the present application, all of which fall within the scope of the embodiments of the present application.

Claims (10)

1. A method for identifying abnormal data of a business expansion project is characterized by comprising the following steps:
acquiring business expansion project data, wherein the business expansion project data comprises project process node number, node working duration and project cost;
performing feature recombination and dimension reduction on the business expansion project data using the t-distributed stochastic neighbor embedding (t-SNE) algorithm;
constructing an isolated forest model;
and inputting the dimension-reduced business expansion project data into the isolated forest model to perform anomaly detection on the business expansion project data and obtain a detection result.
2. The method for identifying abnormal data of a business expansion project according to claim 1, wherein performing feature recombination and dimension reduction on the business expansion project data using the t-distributed stochastic neighbor embedding (t-SNE) algorithm comprises:
defining the business expansion project data as a set X = {x_1, x_2, …, x_n} of n d-dimensional samples, and assuming that, for any two points x_i and x_j, x_j obeys a Gaussian distribution P_i centered on x_i with variance σ_i and, likewise, x_i obeys a Gaussian distribution P_j centered on x_j with variance σ_j, so that the similarity conditional probability P_{j|i} of x_i and x_j is:
Figure FDA0003556316460000011
wherein P_{j|i} is the high-dimensional conditional probability between data points x_i and x_j in the business expansion project data, and σ_i is the variance of the Gaussian centered on data point x_i; the optimal value of σ_i is found by binary search on the perplexity, and the perplexity Perp(P_i) can be expressed as:
Figure FDA0003556316460000012
Figure FDA0003556316460000013
defining a low-dimensional sample set Y = {y_1, y_2, …, y_n}, which is a low-dimensional embedding of the high-dimensional sample set X = {x_1, x_2, …, x_n}, wherein the joint probability q_{ij} of the points y_i and y_j in the low-dimensional space that correspond to data points x_i and x_j is defined as follows:
Figure FDA0003556316460000015
measuring the similarity C between P_{j|i} and q_{ij} using the KL divergence:
Figure FDA0003556316460000017
minimizing the KL divergence by gradient descent, with the gradient given by:
Figure FDA0003556316460000021
to speed up the optimization and avoid poor local optima, adding a relatively large momentum term to the gradient update to obtain the low-dimensional embedding Y^(t):
Figure FDA0003556316460000022
wherein Y^(t) is the embedding at iteration t, η is the learning rate, and α(t) is the momentum at iteration t.
3. The method for identifying abnormal data of business expansion projects according to claim 1, wherein the step of constructing the isolated forest model comprises the following steps:
initializing an isolated forest;
training an isolated tree in the isolated forest model.
4. The method for identifying abnormal data of business expansion projects according to claim 3, wherein the initializing the isolated forest comprises:
setting up, according to the dimension-reduced business expansion project data, an isolated forest consisting of a plurality of isolated trees, and setting a suitable sub-sampling size Ψ;
wherein the height limit l of each tree is determined by the following formula:
l = ceiling(log2(Ψ)), wherein ceiling denotes rounding up to the nearest integer.
5. The method for identifying abnormal data of a business expansion project according to claim 3, wherein training the isolated trees in the isolated forest model comprises:
randomly selecting an isolated tree and obtaining its tree height e; if e > l, returning to the previous leaf node;
otherwise, randomly selecting a sample q from the dimension-reduced business expansion project data, and randomly selecting a split value p between the maximum value and the minimum value of q;
after p is selected, splitting the samples contained in the current leaf node into two parts, a left new leaf and a right new leaf, according to q ≥ p and q < p;
updating the tree height of the isolated tree, e = e + 1;
and repeating the above steps for each isolated tree until every isolated tree has been trained.
6. The method for identifying abnormal data of business expansion projects according to claim 3, wherein the business expansion project data after dimension reduction is input to the isolated forest model to perform abnormal detection on the business expansion project data to obtain a detection result, and the method comprises the following steps:
performing anomaly detection on each sample in the dimension-reduced business expansion project data;
determining the abnormal condition of each sample by using an abnormal score s, which is determined by the following formula:
Figure FDA0003556316460000031
c(n)=2H(n-1)-(2(n-1)/n)
H(i)≈ln(i)+0.5772156649
wherein x is a sample in the dimension-reduced business expansion project data, n is the number of samples in the dimension-reduced business expansion project data, h(x) is the number of edges that sample x passes from the root node to its final leaf node, and E(h(x)) is the expectation of h(x) for sample x over all trees of the isolated forest;
the closer a sample x is to the root node, the larger its outlier score.
7. An apparatus for identifying abnormal data of a business expansion project, comprising:
a business expansion project acquisition module for acquiring business expansion project data, the business expansion project data comprising project process node number, node working duration and project cost;
a dimension reduction module for performing feature recombination and dimension reduction on the business expansion project data using the t-distributed stochastic neighbor embedding (t-SNE) algorithm;
a construction module for constructing an isolated forest model;
and a detection module for inputting the dimension-reduced business expansion project data into the isolated forest model to perform anomaly detection on the business expansion project data and obtain a detection result.
8. The apparatus for identifying abnormal data of a business expansion project according to claim 7, wherein the dimension reduction module comprises:
a high-dimensional probability calculation unit for defining the business expansion project data as a set X = {x_1, x_2, …, x_n} of n d-dimensional samples, and assuming that, for any two points x_i and x_j, x_j obeys a Gaussian distribution P_i centered on x_i with variance σ_i and, likewise, x_i obeys a Gaussian distribution P_j centered on x_j with variance σ_j, so that the similarity conditional probability P_{j|i} of x_i and x_j is:
Figure FDA0003556316460000032
wherein P_{j|i} is the high-dimensional conditional probability between data points x_i and x_j in the business expansion project data, and σ_i is the variance of the Gaussian centered on data point x_i; the optimal value of σ_i is found by binary search on the perplexity, and the perplexity Perp(P_i) can be expressed as:
Figure FDA0003556316460000033
Figure FDA0003556316460000034
a low-dimensional probability calculation unit for defining a low-dimensional sample set Y = {y_1, y_2, …, y_n}, which is a low-dimensional embedding of the high-dimensional sample set X = {x_1, x_2, …, x_n}, wherein the joint probability q_{ij} of the points y_i and y_j in the low-dimensional space that correspond to data points x_i and x_j is defined as follows:
Figure FDA0003556316460000042
a dimension reduction determination unit for measuring the similarity C between P_{j|i} and q_{ij} using the KL divergence:
Figure FDA0003556316460000043
the KL divergence is minimized by gradient descent, with the gradient given by:
Figure FDA0003556316460000044
to speed up the optimization and avoid poor local optima, a relatively large momentum term is added to the gradient update to obtain the low-dimensional embedding Y^(t):
Figure FDA0003556316460000045
wherein Y^(t) is the embedding at iteration t, η is the learning rate, and α(t) is the momentum at iteration t.
9. An apparatus, comprising:
at least one memory and at least one processor;
the memory to store one or more programs;
when executed by the at least one processor, the one or more programs cause the at least one processor to implement the steps of the method for identifying abnormal data of a business expansion project according to any one of claims 1 to 6.
10. A computer-readable storage medium storing a computer program, characterized in that:
the computer program when executed by a processor implements the steps of the method of any one of claims 1 to 6.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210276723.9A CN114781688A (en) 2022-03-21 2022-03-21 Method, device, equipment and storage medium for identifying abnormal data of business expansion project

Publications (1)

Publication Number Publication Date
CN114781688A true CN114781688A (en) 2022-07-22

Family

ID=82425451

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210276723.9A Pending CN114781688A (en) 2022-03-21 2022-03-21 Method, device, equipment and storage medium for identifying abnormal data of business expansion project

Country Status (1)

Country Link
CN (1) CN114781688A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109213786A (en) * 2018-09-17 2019-01-15 广州供电局有限公司 Low pressure industry expands data processing method and system
CN110825724A (en) * 2019-10-11 2020-02-21 深圳供电局有限公司 User charging data exception interception system and method in electricity price system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
操江能: "基于优化孤立森林的船舶柴油机故障监测", 《船舶工程》, vol. 43, no. 11, 25 November 2021 (2021-11-25), pages 125 - 132 *
李倩: "基于模糊孤立森林算法的多维数据异常检测方法", 《计算机与数字工程》, vol. 48, no. 04, 20 April 2020 (2020-04-20), pages 862 - 866 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115348097A (en) * 2022-08-18 2022-11-15 北京天融信网络安全技术有限公司 Method and device for acquiring abnormal assets, electronic equipment and storage medium
CN117077067A (en) * 2023-10-18 2023-11-17 北京亚康万玮信息技术股份有限公司 Information system automatic deployment planning method based on intelligent matching
CN117077067B (en) * 2023-10-18 2023-12-22 北京亚康万玮信息技术股份有限公司 Information system automatic deployment planning method based on intelligent matching
CN117313555A (en) * 2023-11-28 2023-12-29 南京信息工程大学 Distributed storage-based adaptive OATS-AJSA improved GRU humidity prediction method
CN117313555B (en) * 2023-11-28 2024-03-08 南京信息工程大学 GRU humidity prediction method based on self-adaptive OATS-AJSA

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination