CN114781688A

CN114781688A - Method, device, equipment and storage medium for identifying abnormal data of business expansion project

Info

Publication number: CN114781688A
Application number: CN202210276723.9A
Authority: CN
Inventors: 许斌斌; 林镜星; 周鑫; 林其雄; 谢志炜; 陈刚
Original assignee: Guangzhou Power Supply Bureau of Guangdong Power Grid Co Ltd
Current assignee: Guangzhou Power Supply Bureau of Guangdong Power Grid Co Ltd
Priority date: 2022-03-21
Filing date: 2022-03-21
Publication date: 2022-07-22

Abstract

The invention relates to a method for identifying abnormal data of a business expansion project, comprising the following steps: acquiring business expansion project data, wherein the business expansion project data comprises the number of project process nodes, the node working duration and the project cost; performing feature recombination and dimensionality reduction on the business expansion project data by using the t-distributed stochastic neighbor embedding (t-SNE) algorithm; constructing an isolation forest model; and inputting the dimension-reduced business expansion project data into the isolation forest model to perform anomaly detection on the business expansion project data and obtain a detection result. The method addresses anomaly detection under unsupervised learning: high-dimensional data are reduced in dimensionality and recombined by the t-SNE algorithm, and abnormal data are then detected by the isolation forest. The isolation forest is an unsupervised, fast, ensemble-based anomaly detection method; it has linear time complexity and high accuracy, achieves a good anomaly detection effect with only a small number of samples, and converges quickly.

Description

Method, device, equipment and storage medium for identifying abnormal data of business expansion project

Technical Field

The invention relates to the technical field of abnormal data in business expansion projects, and in particular to a method, device, equipment and storage medium for identifying abnormal data of a business expansion project.

Background

With the rapid development of the power grid industry in China, how to improve the quality of electric energy and provide high-quality power utilization service is always a concern in the field of power grids.

In the business expansion and installation process, a small amount of abnormal data often appears in the data source of business expansion supporting projects for various reasons, and identifying this abnormal data is key to the success of clustering business expansion supporting projects and of predicting the working duration of each node in the business expansion process. However, most common anomaly identification methods target imbalanced data. High-dimensional data are harder to handle because they are distributed more sparsely in space than low-dimensional data, and neglecting this sparsity also affects outlier identification. As a result, abnormal data identification becomes inaccurate and overly sensitive.

Disclosure of Invention

Based on this, a method, device, equipment and storage medium for identifying abnormal data of a business expansion project are provided. The method addresses anomaly detection under unsupervised learning: high-dimensional data are reduced in dimensionality and recombined by the t-SNE algorithm, and abnormal data are then detected by the isolation forest. The isolation forest is an unsupervised, fast, ensemble-based anomaly detection method; it has linear time complexity and high accuracy, achieves a good anomaly detection effect with only a small number of samples, and converges quickly.

According to a first aspect of some embodiments herein, there is provided a method for identifying abnormal data of a business expansion project, comprising the following steps:

acquiring business expansion project data, wherein the business expansion project data comprises project process node number, node working duration and project cost;

performing feature recombination and dimensionality reduction on the business expansion project data by adopting the t-distributed stochastic neighbor embedding (t-SNE) algorithm;

constructing an isolation forest model;

and inputting the dimension-reduced business expansion project data into the isolation forest model to perform anomaly detection on the business expansion project data and obtain a detection result.

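Further, performing feature recombination and dimensionality reduction on the business expansion project data by adopting the t-SNE algorithm comprises the following steps:

defining the business expansion project data as a set of n d-dimensional samples X = {x_1, x_2, …, x_n}; for any two points x_i and x_j, assuming that x_j obeys a Gaussian distribution P_i centered at x_i with variance σ_i², and likewise that x_i obeys a Gaussian distribution P_j centered at x_j with variance σ_j², the conditional probability p_{j|i} that x_i and x_j are similar is:

$$p_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)}$$

wherein p_{j|i} is the high-dimensional distribution probability between data points x_i and x_j in the business expansion project data, and σ_i is the standard deviation of the Gaussian centered at data point x_i; the optimal value of σ_i is found by a binary search (bisection) on the perplexity, and the perplexity Perp(P_i) can be expressed as:

$$\mathrm{Perp}(P_i) = 2^{H(P_i)}, \qquad H(P_i) = -\sum_{j} p_{j|i} \log_2 p_{j|i}$$

defining a low-dimensional sample set Y = {y_1, y_2, …, y_n} as the low-dimensional embedding of the high-dimensional sample set X = {x_1, x_2, …, x_n}, the joint probability q_{ij} of the points y_i and y_j in the low-dimensional space that correspond to data points x_i and x_j is defined as follows:

$$q_{ij} = \frac{\left(1 + \|y_i - y_j\|^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \|y_k - y_l\|^2\right)^{-1}}$$

measuring the similarity C between the high-dimensional and low-dimensional distributions using the KL divergence, where, in the standard t-SNE formulation, p_{ij} = (p_{j|i} + p_{i|j}) / (2n) is the symmetrized form of p_{j|i}:

$$C = KL(P \,\|\, Q) = \sum_{i} \sum_{j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$

minimizing the KL divergence by gradient descent, with the gradient given by:

$$\frac{\partial C}{\partial y_i} = 4 \sum_{j} \left(p_{ij} - q_{ij}\right)\left(y_i - y_j\right)\left(1 + \|y_i - y_j\|^2\right)^{-1}$$

to speed up the optimization process and avoid poor local optima, a relatively large momentum term is added to the gradient update to obtain the low-dimensional embedding Y_t:

$$Y_t = Y_{t-1} - \eta \frac{\partial C}{\partial Y} + \alpha(t)\left(Y_{t-1} - Y_{t-2}\right)$$

wherein Y_t is the embedding after t iterations, η is the learning rate, and α(t) is the momentum at iteration t.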
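By way of illustration only, the bisection search for σ_i described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of the present application: the function names, the search interval [1e-3, 1e3], the iteration count, and the toy distances are all hypothetical, and the perplexity is taken in the usual form 2 raised to the Shannon entropy of P_i.

```python
import math

def conditional_probs(sq_dists, sigma):
    """p_{j|i} for a fixed i, from squared distances ||x_i - x_j||^2 (j != i)."""
    d_min = min(sq_dists)  # shifting by the minimum avoids underflow; ratios are unchanged
    weights = [math.exp(-(d - d_min) / (2.0 * sigma ** 2)) for d in sq_dists]
    total = sum(weights)
    return [w / total for w in weights]

def perplexity(probs):
    """Perp(P_i) = 2 ** H(P_i), with the entropy H measured in bits."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0.0)
    return 2.0 ** entropy

def find_sigma(sq_dists, target_perp, lo=1e-3, hi=1e3, iters=60):
    """Bisection on sigma_i (hypothetical bounds); perplexity grows with sigma_i."""
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if perplexity(conditional_probs(sq_dists, mid)) > target_perp:
            hi = mid  # distribution too flat: shrink sigma_i
        else:
            lo = mid  # distribution too peaked: grow sigma_i
    return (lo + hi) / 2.0

# Toy squared distances from x_i to five other points (illustrative values only).
sq_dists = [1.0, 4.0, 9.0, 16.0, 25.0]
sigma_i = find_sigma(sq_dists, target_perp=3.0)
p_i = conditional_probs(sq_dists, sigma_i)
print(sigma_i, perplexity(p_i))
```

The search relies on the perplexity increasing monotonically with σ_i, which is why a simple bisection is sufficient; the target perplexity must lie between 1 and the number of neighbors.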

Further, the step of constructing the isolation forest model comprises:

initializing the isolation forest;

training the isolation trees in the isolation forest model.

Further, initializing the isolation forest comprises:

according to the dimension-reduced business expansion project data, setting up an isolation forest consisting of a plurality of isolation trees, and setting a suitable sub-sampling size Ψ;

wherein the height limit l of each tree is determined by the following formula:

l = ceiling(log₂Ψ), wherein ceiling represents rounding up to the nearest integer.
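As a small worked example of the height limit, the following Python sketch evaluates l = ceiling(log₂Ψ) for a few sub-sampling sizes. The function name and the sample values of Ψ are assumptions chosen for illustration; the present application does not fix a particular Ψ.

```python
import math

def height_limit(psi):
    """Height limit l = ceiling(log2(psi)) for a sub-sampling size psi."""
    return math.ceil(math.log2(psi))

# Illustrative sub-sampling sizes (hypothetical values).
for psi in (64, 100, 256):
    print(psi, height_limit(psi))
```

Because the limit grows only logarithmically with Ψ, each tree stays shallow even when the sub-sample is enlarged.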

Further, training the isolation trees in the isolation forest model comprises:

randomly selecting an isolation tree and obtaining its current tree height e; if e > l, stopping the split and returning to the previous leaf node;

otherwise, randomly selecting an attribute q from the dimension-reduced business expansion project data, and randomly selecting a split value p between the maximum value and the minimum value of q;

after p is selected, splitting the samples contained in the current node into two parts, namely a new left child node and a new right child node, according to whether q < p or q ≥ p;

updating the tree height of the isolation tree, e = e + 1;

and repeating the above steps for each isolation tree until every isolation tree has been trained.
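The training steps above can be sketched as a short recursive routine. This is a hedged illustration rather than the implementation of the present application: the dictionary-based tree layout, the function names, and the toy data are hypothetical, and the stopping rule (stop when the height reaches the limit, when at most one sample remains, or when the chosen attribute is constant) is an assumption taken from the common formulation of isolation trees.

```python
import math
import random

def build_tree(rows, height, limit, rng):
    """Grow one isolation tree; `rows` is a list of equal-length float lists."""
    if height >= limit or len(rows) <= 1:
        return {"size": len(rows)}            # leaf node: stop splitting
    q = rng.randrange(len(rows[0]))           # random attribute index q
    lo = min(r[q] for r in rows)
    hi = max(r[q] for r in rows)
    if lo == hi:
        return {"size": len(rows)}            # attribute q cannot separate these rows
    p = rng.uniform(lo, hi)                   # random split value p between min and max
    left = [r for r in rows if r[q] < p]      # q < p goes to the new left child
    right = [r for r in rows if r[q] >= p]    # q >= p goes to the new right child
    return {"q": q, "p": p,
            "left": build_tree(left, height + 1, limit, rng),
            "right": build_tree(right, height + 1, limit, rng)}

def build_forest(data, n_trees, psi, seed=0):
    """Train n_trees isolation trees, each on a random sub-sample of size psi."""
    rng = random.Random(seed)
    limit = math.ceil(math.log2(psi))         # height limit l
    return [build_tree(rng.sample(data, psi), 0, limit, rng)
            for _ in range(n_trees)]

# Toy two-dimensional data standing in for dimension-reduced samples.
data = [[float(i % 7), float(i % 11)] for i in range(64)]
forest = build_forest(data, n_trees=10, psi=16)
print(len(forest))
```

Each tree sees only Ψ samples, so training cost depends on the number of trees and Ψ rather than on the full data set size.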

Further, inputting the dimension-reduced business expansion project data into the isolation forest model to perform anomaly detection on the business expansion project data and obtain a detection result comprises:

performing anomaly detection on each sample in the dimension-reduced business expansion project data;

determining the degree of abnormality of each sample by using an anomaly score s, which is determined by the following formulas:

$$s(x, n) = 2^{-\frac{E(h(x))}{c(n)}}$$

$$c(n) = 2H(n-1) - \frac{2(n-1)}{n}$$

$$H(i) \approx \ln(i) + 0.5772156649$$

wherein x is a sample in the dimension-reduced business expansion project data, n is the number of samples in the dimension-reduced business expansion project data, h(x) is the number of edges that sample x passes from the root node to the final leaf node, and E(h(x)) is the expectation of h(x) for sample x over all trees of the isolation forest;

the closer the leaf node reached by a sample x is to the root node, the greater its anomaly score.
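The scoring formulas can be evaluated directly. The sketch below assumes the standard isolation-forest score s(x, n) = 2^(-E(h(x))/c(n)) together with the c(n) and H(i) expressions given above; the function names and the guards for n ≤ 2 are assumptions added for illustration and are not stated in the present application.

```python
import math

EULER_GAMMA = 0.5772156649

def harmonic(i):
    """H(i) ~ ln(i) + Euler's constant."""
    return math.log(i) + EULER_GAMMA

def c(n):
    """Normalizing term c(n) = 2H(n-1) - 2(n-1)/n."""
    if n <= 1:
        return 0.0      # assumed guard: nothing left to isolate
    if n == 2:
        return 1.0      # assumed guard, following the common formulation
    return 2.0 * harmonic(n - 1) - 2.0 * (n - 1) / n

def anomaly_score(mean_path_length, n):
    """s = 2 ** (-E(h(x)) / c(n)); shorter paths give scores closer to 1."""
    return 2.0 ** (-mean_path_length / c(n))

n = 256
for mean_h in (2.0, c(n), 2.0 * c(n)):
    print(round(mean_h, 3), round(anomaly_score(mean_h, n), 3))
```

A sample whose expected path length equals c(n) scores exactly 0.5; samples isolated much earlier score close to 1, which is consistent with the statement that samples nearer the root have larger anomaly scores.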

According to a second aspect of some embodiments of the present application, there is provided an apparatus for identifying abnormal data of a business expansion project, comprising:

a business expansion project acquisition module, configured to acquire business expansion project data, wherein the business expansion project data comprises the number of project process nodes, the node working duration and the project cost;

a dimension reduction module, configured to perform feature recombination and dimensionality reduction on the business expansion project data by adopting the t-distributed stochastic neighbor embedding (t-SNE) algorithm;

a construction module, configured to construct an isolation forest model;

and a detection module, configured to input the dimension-reduced business expansion project data into the isolation forest model to perform anomaly detection on the business expansion project data and obtain a detection result.

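Further, the dimension reduction module includes:

a high-dimensional probability calculation unit, configured to define the business expansion project data as a set of n d-dimensional samples X = {x_1, x_2, …, x_n}; for any two points x_i and x_j, assuming that x_j obeys a Gaussian distribution P_i centered at x_i with variance σ_i², and likewise that x_i obeys a Gaussian distribution P_j centered at x_j with variance σ_j², the conditional probability p_{j|i} that x_i and x_j are similar is:

$$p_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)}$$

wherein p_{j|i} is the high-dimensional distribution probability between data points x_i and x_j in the business expansion project data, and σ_i is the standard deviation of the Gaussian centered at data point x_i; the optimal value of σ_i is found by a binary search (bisection) on the perplexity, and the perplexity Perp(P_i) can be expressed as:

$$\mathrm{Perp}(P_i) = 2^{H(P_i)}, \qquad H(P_i) = -\sum_{j} p_{j|i} \log_2 p_{j|i}$$

a low-dimensional probability calculation unit, configured to define a low-dimensional sample set Y = {y_1, y_2, …, y_n} as the low-dimensional embedding of the high-dimensional sample set X = {x_1, x_2, …, x_n}, wherein the joint probability q_{ij} of the points y_i and y_j in the low-dimensional space that correspond to data points x_i and x_j is defined as follows:

$$q_{ij} = \frac{\left(1 + \|y_i - y_j\|^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \|y_k - y_l\|^2\right)^{-1}}$$

a dimension reduction determination unit, configured to measure the similarity C between the high-dimensional and low-dimensional distributions using the KL divergence, where, in the standard t-SNE formulation, p_{ij} = (p_{j|i} + p_{i|j}) / (2n) is the symmetrized form of p_{j|i}, and to minimize C by gradient descent with momentum to obtain the low-dimensional embedding Y_t:

$$C = KL(P \,\|\, Q) = \sum_{i} \sum_{j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$

$$Y_t = Y_{t-1} - \eta \frac{\partial C}{\partial Y} + \alpha(t)\left(Y_{t-1} - Y_{t-2}\right)$$

wherein Y_t is the embedding after t iterations, η is the learning rate, and α(t) is the momentum at iteration t.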

According to a third aspect of some embodiments herein there is provided an apparatus comprising:

at least one memory and at least one processor;

the memory to store one or more programs;

when executed by the at least one processor, the one or more programs cause the at least one processor to implement the steps of the method for identifying anomalous data in a business expansion project according to any one of the first aspects.

According to a fourth aspect of some embodiments of the present application, there is provided a computer readable storage medium storing a computer program which, when executed by a processor, performs the steps of the method according to any one of the first aspect.

According to the method, feature recombination and dimensionality reduction are performed on the original high-dimensional business expansion project data by using the t-distributed stochastic neighbor embedding (t-SNE) algorithm, abnormal data in the recombined and dimension-reduced business expansion project data are discovered by using the isolation forest algorithm, and the abnormal data are finally eliminated. The isolation forest algorithm aims to discover the characteristics of abnormal points rather than to generalize the characteristics of normal points. Because abnormal points occupy only a small part of the whole sample set and their attributes differ greatly from those of normal points, abnormal points are easier to isolate than normal points. The isolation forest is an unsupervised, fast, ensemble-based anomaly detection method; it has linear time complexity and high accuracy, achieves a good anomaly detection effect with only a small number of samples, converges quickly, and is of great significance for eliminating abnormal data in business expansion project data.
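The final elimination step can be illustrated by a simple filter on anomaly scores. This is only a sketch: the present application does not specify a decision threshold, so the threshold value, the function name, and the example scores below are all hypothetical.

```python
def split_by_score(samples, scores, threshold):
    """Separate samples into kept (score < threshold) and flagged (score >= threshold)."""
    kept = [x for x, s in zip(samples, scores) if s < threshold]
    flagged = [x for x, s in zip(samples, scores) if s >= threshold]
    return kept, flagged

# Hypothetical project identifiers and anomaly scores, for illustration only.
samples = ["project_a", "project_b", "project_c"]
scores = [0.42, 0.81, 0.47]
kept, flagged = split_by_score(samples, scores, threshold=0.6)
print(kept, flagged)
```

In practice the threshold would be chosen from the score distribution or an expected proportion of abnormal projects, which is a design choice left open here.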

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

Drawings

FIG. 1 is a flowchart illustrating the steps of a method for identifying abnormal data of a business expansion project according to an embodiment of the present application;

FIG. 2 is a schematic structural diagram of an apparatus for identifying abnormal data of a business expansion project according to an embodiment of the present application;

Detailed Description

To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.

It should be understood that the embodiments described are only a few embodiments of the present application, and not all embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without making any creative effort belong to the protection scope of the embodiments in the present application.

The terminology used in the embodiments of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the embodiments of the present application. As used in the examples of this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.

The following description refers to the accompanying drawings in which the same numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the application, as detailed in the appended claims. In the description of the present application, it is to be understood that the terms "first," "second," "third," and the like are used solely to distinguish one from another and are not necessarily used to describe a particular order or sequence, nor are they to be construed as indicating or implying relative importance. The specific meaning of the above terms in the present application can be understood by those of ordinary skill in the art as the case may be.

Further, in the description of the present application, "a plurality" means two or more unless otherwise specified. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship.

The method aims to solve the problem that in actual engineering related to the background art, normal data and abnormal data are distributed in the same data set, and sufficient information is not available for distinguishing.

The present application provides a method for identifying abnormal data of a business expansion project. Referring to FIG. 1, the method includes the following steps:

Step S1: acquiring business expansion project data, wherein the business expansion project data comprises the number of project process nodes, the node working duration and the project cost.

The business expansion project comprises a plurality of project process nodes, and the working time and the cost of different process nodes are different.
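One possible way to turn such a project record into a fixed-length sample vector is sketched below. The class name, field names, the zero-padding of node durations to a common length, and the example values are all assumptions made for illustration; the present application does not prescribe a data layout or units.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class BusinessExpansionProject:
    """Hypothetical record for one business expansion project."""
    node_count: int                 # number of project process nodes
    node_durations: List[float]     # working duration of each node (unit not specified)
    cost: float                     # project cost

    def to_vector(self, max_nodes: int) -> List[float]:
        """Flatten to a (max_nodes + 2)-dimensional sample, padding durations with 0.0."""
        padded = (self.node_durations + [0.0] * max_nodes)[:max_nodes]
        return [float(self.node_count)] + padded + [self.cost]

project = BusinessExpansionProject(node_count=3,
                                   node_durations=[2.0, 5.0, 1.5],
                                   cost=12000.0)
print(project.to_vector(max_nodes=5))
```

Stacking one such vector per project yields the n d-dimensional sample set on which dimensionality reduction is subsequently performed.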

Step S2: performing feature recombination and dimensionality reduction on the business expansion project data by adopting the t-distributed stochastic neighbor embedding (t-SNE) algorithm.

The t-SNE algorithm is a dimensionality reduction algorithm for high-dimensional data. It aims to represent a set of data points from a high-dimensional space faithfully in a low-dimensional space, where the low-dimensional space is generally two-dimensional. The algorithm is non-linear and adapts to the underlying data. It has a tunable parameter, the perplexity, which in short balances attention between the local and global structure of the data.

Because the business expansion project data not only relates to the categories of projects and the number of nodes, but also relates to the problems of time and cost, the business expansion project data has the characteristics of high complexity and huge number. Therefore, the dimension reduction is carried out on the business expansion project data through the algorithm, and the method is beneficial to the subsequent identification of abnormal data.

Step S3: constructing an isolation forest model.

The isolation forest (iForest) model is an unsupervised anomaly detection method suitable for continuous numerical data; that is, labeled samples are not needed for training, but the features need to be continuous. To find which points are easily isolated, iForest uses a very efficient strategy: the data set is recursively and randomly partitioned until all sample points are isolated. Under this random partitioning strategy, outliers typically have shorter paths.
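The claim that outliers tend to have shorter paths under random partitioning can be checked with a small one-dimensional simulation. The sketch below is illustrative only: the function names, the toy data (fifty clustered values plus one distant value), the number of trials, and the seed are assumptions.

```python
import random

def splits_to_isolate(values, target, rng):
    """Count random 1-D splits until `target` is alone in its partition."""
    current = list(values)
    depth = 0
    while len(current) > 1:
        lo, hi = min(current), max(current)
        if lo == hi:
            break
        p = rng.uniform(lo, hi)                                    # random split value
        current = [v for v in current if (v < p) == (target < p)]  # keep target's side
        depth += 1
    return depth

def mean_depth(values, target, trials=500, seed=0):
    """Average number of splits needed to isolate `target` over many random trials."""
    rng = random.Random(seed)
    return sum(splits_to_isolate(values, target, rng) for _ in range(trials)) / trials

# Fifty clustered values plus one distant outlier (toy data).
values = [10.0 + 0.02 * k for k in range(50)] + [100.0]
normal_point, outlier = values[25], values[-1]
print(mean_depth(values, outlier), mean_depth(values, normal_point))
```

The distant value is almost always separated by the first split, whereas a value inside the cluster needs several further splits, which is the behavior the path-length-based score exploits.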

Step S4: inputting the dimension-reduced business expansion project data into the isolation forest model to perform anomaly detection on the business expansion project data and obtain a detection result.

The method addresses anomaly detection under unsupervised learning: high-dimensional data are reduced in dimensionality and recombined by the t-SNE algorithm, and abnormal data are then detected by the isolation forest. The isolation forest is an unsupervised, fast, ensemble-based anomaly detection method; it has linear time complexity and high accuracy, achieves a good anomaly detection effect with only a small number of samples, and converges quickly.

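In a specific example, performing feature recombination and dimensionality reduction on the business expansion project data by adopting the t-SNE algorithm comprises the following steps:

the business expansion project data is defined as a set of n d-dimensional samples X = {x_1, x_2, …, x_n}, wherein each data point in the sample set indicates any one of the number of project process nodes, the node working duration and the project cost in the business expansion project data.

For any two data points x_i and x_j, assume that x_j obeys a Gaussian distribution P_i centered at x_i with variance σ_i², and likewise that x_i obeys a Gaussian distribution P_j centered at x_j with variance σ_j². The conditional probability p_{j|i} that x_i and x_j are similar is then:

$$p_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)}$$

wherein p_{j|i} is the high-dimensional distribution probability between data points x_i and x_j in the business expansion project data, and σ_i is the standard deviation of the Gaussian centered at data point x_i; the optimal value of σ_i is found by a binary search (bisection) on the perplexity, and the perplexity Perp(P_i) can be expressed as:

$$\mathrm{Perp}(P_i) = 2^{H(P_i)}, \qquad H(P_i) = -\sum_{j} p_{j|i} \log_2 p_{j|i}$$

A low-dimensional sample set Y = {y_1, y_2, …, y_n} is defined as the low-dimensional embedding of the high-dimensional sample set X = {x_1, x_2, …, x_n}, and the joint probability q_{ij} of the points y_i and y_j in the low-dimensional space that correspond to data points x_i and x_j is defined as follows:

$$q_{ij} = \frac{\left(1 + \|y_i - y_j\|^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \|y_k - y_l\|^2\right)^{-1}}$$

The similarity C between the high-dimensional and low-dimensional distributions is measured using the KL divergence, where, in the standard t-SNE formulation, p_{ij} = (p_{j|i} + p_{i|j}) / (2n) is the symmetrized form of p_{j|i}:

$$C = KL(P \,\|\, Q) = \sum_{i} \sum_{j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$

The KL divergence is minimized by gradient descent, with the gradient given by:

$$\frac{\partial C}{\partial y_i} = 4 \sum_{j} \left(p_{ij} - q_{ij}\right)\left(y_i - y_j\right)\left(1 + \|y_i - y_j\|^2\right)^{-1}$$

To speed up the optimization process and avoid poor local optima, a relatively large momentum term is added to the gradient update, resulting in the low-dimensional embedding Y_t:

$$Y_t = Y_{t-1} - \eta \frac{\partial C}{\partial Y} + \alpha(t)\left(Y_{t-1} - Y_{t-2}\right)$$

wherein Y_t is the embedding after t iterations, η is the learning rate, and α(t) is the momentum at iteration t.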
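The low-dimensional side of the optimization can be sketched in a few lines of pure Python. The sketch assumes the standard t-SNE forms: a Student-t joint probability q_ij, a KL divergence C over a symmetric P that sums to 1, the gradient 4 Σ_j (p_ij - q_ij)(y_i - y_j)/(1 + ||y_i - y_j||²), and a momentum update that subtracts the gradient. The function names, the toy P, the initial layout, and the values of η and α are hypothetical and are not taken from the present application.

```python
import math

def sq_dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def student_t_joint(Y):
    """q_ij: Student-t (one degree of freedom) similarities, normalized over all pairs."""
    n = len(Y)
    w = [[0.0 if i == j else 1.0 / (1.0 + sq_dist(Y[i], Y[j])) for j in range(n)]
         for i in range(n)]
    total = sum(map(sum, w))
    return [[v / total for v in row] for row in w]

def kl_divergence(P, Q):
    """C = sum_ij p_ij * log(p_ij / q_ij), skipping zero entries of P."""
    return sum(p * math.log(p / q)
               for row_p, row_q in zip(P, Q)
               for p, q in zip(row_p, row_q) if p > 0.0)

def gradient(P, Y):
    """dC/dy_i = 4 * sum_j (p_ij - q_ij) (y_i - y_j) / (1 + ||y_i - y_j||^2)."""
    Q = student_t_joint(Y)
    n, dim = len(Y), len(Y[0])
    grad = [[0.0] * dim for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                coeff = 4.0 * (P[i][j] - Q[i][j]) / (1.0 + sq_dist(Y[i], Y[j]))
                for k in range(dim):
                    grad[i][k] += coeff * (Y[i][k] - Y[j][k])
    return grad

def momentum_step(P, Y_prev, Y_prev2, eta, alpha):
    """Y_t = Y_{t-1} - eta * dC/dY + alpha * (Y_{t-1} - Y_{t-2})."""
    g = gradient(P, Y_prev)
    return [[y1 - eta * gk + alpha * (y1 - y2)
             for y1, y2, gk in zip(r1, r2, rg)]
            for r1, r2, rg in zip(Y_prev, Y_prev2, g)]

# Toy symmetric P that sums to 1, borrowed here from a reference layout (assumption).
P = student_t_joint([[0.0, 0.0], [0.1, 0.0], [3.0, 3.0], [3.1, 3.0]])
Y0 = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Y1 = momentum_step(P, Y0, Y0, eta=0.1, alpha=0.5)
print(kl_divergence(P, student_t_joint(Y0)), kl_divergence(P, student_t_joint(Y1)))
```

On the first iteration the two previous embeddings coincide, so the momentum term vanishes and the update reduces to a plain gradient step; the momentum only takes effect from the second iteration onward.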

In a specific embodiment, the step of constructing the isolation forest model comprises:

initializing the isolation forest.

Specifically, initializing the isolation forest includes:

according to the dimension-reduced business expansion project data, setting up an isolation forest consisting of a plurality of isolation trees, and setting a suitable sub-sampling size Ψ;

wherein the height limit l of each isolation tree is determined by the following formula:

l = ceiling(log₂Ψ), wherein ceiling represents rounding up to the nearest integer.

Training the isolation trees in the isolation forest model.

Specifically, training the isolation trees in the isolation forest model includes:

randomly selecting an isolation tree and obtaining its current tree height e; if e > l, stopping the split and returning to the previous leaf node.

Otherwise, randomly selecting an attribute q from the dimension-reduced business expansion project data, and randomly selecting a split value p between the maximum value and the minimum value of q.

After p is selected, the samples contained in the current node are split into two parts, namely a new left child node and a new right child node, according to whether q < p or q ≥ p.

Updating the tree height of the isolation tree, e = e + 1.

The above steps are repeated for each isolation tree until every isolation tree has been trained.

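In a preferred embodiment, the step of inputting the dimension-reduced business expansion project data into the isolation forest model to perform anomaly detection on the business expansion project data and obtain a detection result includes:

performing anomaly detection on each sample in the dimension-reduced business expansion project data;

determining the degree of abnormality of each sample by using an anomaly score s, which is determined by the following formulas:

$$s(x, n) = 2^{-\frac{E(h(x))}{c(n)}}$$

$$c(n) = 2H(n-1) - \frac{2(n-1)}{n}$$

$$H(i) \approx \ln(i) + 0.5772156649$$

wherein x is a sample in the dimension-reduced business expansion project data, n is the number of samples in the dimension-reduced business expansion project data, h(x) is the number of edges that sample x passes from the root node to the final leaf node, and E(h(x)) is the expectation of h(x) for sample x over all trees of the isolation forest;

the closer the leaf node reached by a sample x is to the root node, the greater its anomaly score.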
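To make the relationship between path length and score concrete, the sketch below walks two samples through a small hand-built tree and scores them. It assumes the standard isolation-forest score s(x, n) = 2^(-E(h(x))/c(n)). The tree, its split values, the sample coordinates, and n = 8 are hypothetical, and adding c(size) at a leaf that still holds several samples is an assumption taken from the common formulation rather than from the present application.

```python
import math

EULER_GAMMA = 0.5772156649

def c(n):
    """c(n) = 2H(n-1) - 2(n-1)/n with H(i) ~ ln(i) + Euler's constant."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def path_length(x, node, depth=0):
    """Edges from the root to x's leaf, plus c(size) for a leaf that was not fully split."""
    if "size" in node:
        return depth + c(node["size"])
    child = node["left"] if x[node["q"]] < node["p"] else node["right"]
    return path_length(x, child, depth + 1)

def score(x, forest, n):
    """Anomaly score from the mean path length of x over all trees."""
    mean_h = sum(path_length(x, tree) for tree in forest) / len(forest)
    return 2.0 ** (-mean_h / c(n))

# Hand-built tree over 8 samples: attribute 0 is split at 50.0, then attribute 1 at 3.0.
tree = {"q": 0, "p": 50.0,
        "left": {"q": 1, "p": 3.0, "left": {"size": 4}, "right": {"size": 3}},
        "right": {"size": 1}}
forest = [tree]

far_sample = [90.0, 1.0]      # isolated directly below the root
typical_sample = [10.0, 2.0]  # ends in a leaf shared with other samples
print(score(far_sample, forest, n=8), score(typical_sample, forest, n=8))
```

The sample isolated one edge below the root receives a score above 0.5, while the sample that ends deeper in a shared leaf receives a score below 0.5.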

Corresponding to the above method for identifying abnormal data of a business expansion project, as shown in FIG. 2, the present application further provides an apparatus 200 for identifying abnormal data of a business expansion project, including:

a business expansion project acquisition module 210, configured to acquire business expansion project data, where the business expansion project data includes the number of project process nodes, the node working duration and the project cost;

a dimension reduction module 220, configured to perform feature recombination and dimensionality reduction on the business expansion project data by adopting the t-distributed stochastic neighbor embedding (t-SNE) algorithm;

a construction module 230, configured to construct an isolation forest model;

and a detection module 240, configured to input the dimension-reduced business expansion project data into the isolation forest model to perform anomaly detection on the business expansion project data, so as to obtain a detection result.

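In an alternative example, the dimension reduction module 220 includes:

a high-dimensional probability calculation unit, configured to define the business expansion project data as a set of n d-dimensional samples X = {x_1, x_2, …, x_n}; for any two points x_i and x_j, assuming that x_j obeys a Gaussian distribution P_i centered at x_i with variance σ_i², and likewise that x_i obeys a Gaussian distribution P_j centered at x_j with variance σ_j², the conditional probability p_{j|i} that x_i and x_j are similar is:

$$p_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)}$$

a low-dimensional probability calculation unit, configured to define a low-dimensional sample set Y = {y_1, y_2, …, y_n} as the low-dimensional embedding of the high-dimensional sample set X = {x_1, x_2, …, x_n}, wherein the joint probability q_{ij} of the points y_i and y_j in the low-dimensional space that correspond to data points x_i and x_j is defined as follows:

$$q_{ij} = \frac{\left(1 + \|y_i - y_j\|^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \|y_k - y_l\|^2\right)^{-1}}$$

a dimension reduction determination unit, configured to measure the similarity C between the high-dimensional and low-dimensional distributions using the KL divergence, where, in the standard t-SNE formulation, p_{ij} = (p_{j|i} + p_{i|j}) / (2n) is the symmetrized form of p_{j|i}:

$$C = KL(P \,\|\, Q) = \sum_{i} \sum_{j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$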

In an alternative example, the construction module 230 includes:

an initialization unit, configured to initialize the isolation forest;

and a training unit, configured to train the isolation trees in the isolation forest model.

In an alternative example, the initialization unit includes:

an isolation forest establishment element, configured to set up an isolation forest consisting of a plurality of isolation trees according to the dimension-reduced business expansion project data, and to set a suitable sub-sampling size Ψ;

a tree height determining element, configured to determine the height limit l of each tree by the formula l = ceiling(log₂Ψ), wherein ceiling represents rounding up to the nearest integer.

In an alternative example, the training unit comprises:

an isolation tree selecting element, configured to randomly select an isolation tree and obtain its current tree height e, and, if e > l, to stop the split and return to the previous leaf node;

a split value determining element, configured to randomly select an attribute q from the dimension-reduced business expansion project data and to randomly select a split value p between the maximum value and the minimum value of q;

a new leaf determining element, configured to split the samples contained in the current node into two parts, namely a new left child node and a new right child node, according to whether q < p or q ≥ p, after p is selected;

an update element, configured to update the tree height, e = e + 1;

and a training confirmation element, configured to repeat the above steps for each isolation tree until every isolation tree has been trained.

In an alternative example, the detection module 240 includes:

an anomaly detection element, configured to perform anomaly detection on each sample in the dimension-reduced business expansion project data;

an abnormal situation determining element, configured to determine the degree of abnormality of each sample by using an anomaly score s, which is determined by the following formulas:

$$s(x, n) = 2^{-\frac{E(h(x))}{c(n)}}$$

$$c(n) = 2H(n-1) - \frac{2(n-1)}{n}$$

$$H(i) \approx \ln(i) + 0.5772156649$$

and an anomaly score determining element, configured to determine that the closer the leaf node reached by a sample x is to the root node, the greater its anomaly score.

Corresponding to the method for identifying abnormal data of the business expansion project, the application also provides equipment which comprises at least one memory and at least one processor;

the memory to store one or more programs;

when the one or more programs are executed by the at least one processor, the at least one processor is enabled to implement the steps of any one of the identification methods for the abnormal data of the business expansion projects.

The implementation process of the functions and actions of each component in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again. For the apparatus embodiment, since it basically corresponds to the method embodiment, reference may be made to the partial description of the method embodiment for relevant points. The above-described device embodiments are merely illustrative, wherein the components described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the disclosed solution. One of ordinary skill in the art can understand and implement it without inventive effort.

Corresponding to the method for identifying abnormal data of the business expansion project, the application also provides a computer readable storage medium, wherein a computer program is stored in the computer readable storage medium, and when the computer program is executed by a processor, the computer program realizes the steps of any one of the methods.

The present disclosure may take the form of a computer program product embodied on one or more storage media including, but not limited to, disk storage, CD-ROM, optical storage, and the like, having program code embodied therein. Computer-usable storage media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of the storage medium of the computer include, but are not limited to: phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technologies, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape storage or other magnetic storage devices, or any other non-transmission medium, may be used to store information that may be accessed by a computing device.

It is to be understood that the embodiments of the present application are not limited to the precise arrangements which have been described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the embodiments of the present application is limited only by the following claims. The above-mentioned embodiments only express a few embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for those skilled in the art, without departing from the concept of the embodiments of the present application, several variations and modifications can be made, which all fall within the scope of the embodiments of the present application.

Claims

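1. A method for identifying abnormal data of a business expansion project, characterized by comprising the following steps:

acquiring business expansion project data, wherein the business expansion project data comprises the number of project process nodes, the node working duration and the project cost;

performing feature recombination and dimensionality reduction on the business expansion project data by adopting the t-distributed stochastic neighbor embedding (t-SNE) algorithm;

constructing an isolation forest model;

and inputting the dimension-reduced business expansion project data into the isolation forest model to perform anomaly detection on the business expansion project data and obtain a detection result.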

2. The method for identifying abnormal data of a business expansion project according to claim 1, wherein performing feature recombination and dimensionality reduction on the business expansion project data by adopting the t-distributed stochastic neighbor embedding algorithm t-SNE comprises:

defining the business expansion project data as a set of n d-dimensional samples $X = \{x_1, x_2, \ldots, x_n\}$; for any two points $x_i$ and $x_j$, assuming that $x_j$ obeys a Gaussian distribution $P_i$ centered at $x_i$ with variance $\sigma_i$, and likewise that $x_i$ obeys a Gaussian distribution $P_j$ centered at $x_j$ with variance $\sigma_j$, so that the conditional probability $P_{j|i}$ that $x_i$ and $x_j$ are similar is:

wherein $P_{j|i}$ is the high-dimensional distribution probability between data points $x_i$ and $x_j$ in the business expansion project data; $\sigma_i$ is the variance of the Gaussian centered at data point $x_i$; the optimal value of $\sigma_i$ is found by binary search on the perplexity, and the perplexity $\mathrm{Perp}(P_i)$ can be expressed as:

defining a low-dimensional sample set $Y = \{y_1, y_2, \ldots, y_n\}$ as the low-dimensional embedding of the high-dimensional sample set $X = \{x_1, x_2, \ldots, x_n\}$, wherein the joint probability of the points $y_i$ and $y_j$ in the low-dimensional space corresponding to data points $x_i$ and $x_j$ is defined as follows:

measuring, using the KL divergence, the similarity $C$ between $P_{j|i}$ and this joint probability:

wherein $Y_t$ is the solution after $t$ iterations, $\eta$ is the learning rate, and $\alpha(t)$ is the momentum at iteration $t$.
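For reference, the steps of claim 2 appear to follow the standard t-SNE formulation, which can be written as below. This is a sketch under that assumption: the symbols $p_{ij}$ and $q_{ij}$, the symmetrization step, and the exact form of the momentum update are taken from the standard algorithm, not from the claim text, so the formulas of the application itself may differ in detail.

```latex
\begin{align}
P_{j|i} &= \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
                {\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)} \\
\mathrm{Perp}(P_i) &= 2^{H(P_i)}, \qquad H(P_i) = -\sum_{j} P_{j|i} \log_2 P_{j|i} \\
p_{ij} &= \frac{P_{j|i} + P_{i|j}}{2n} \\
q_{ij} &= \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
               {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}} \\
C &= \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}} \\
Y_t &= Y_{t-1} - \eta \frac{\partial C}{\partial Y} + \alpha(t)\left(Y_{t-1} - Y_{t-2}\right)
\end{align}
```

The last line is a gradient-descent step with momentum: $\eta$ scales the step against the gradient of $C$, and $\alpha(t)$ weights the previous displacement.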

3. The method for identifying abnormal data of business expansion projects according to claim 1, wherein the step of constructing the isolated forest model comprises the following steps:

initializing an isolated forest;

training an isolated tree in the isolated forest model.

4. The method for identifying abnormal data of business expansion projects according to claim 3, wherein the initializing the isolated forest comprises:

wherein the height limit $l$ of each tree is determined by the following formula:

$l = \mathrm{ceiling}(\log_2 \Psi)$, wherein ceiling denotes rounding up.
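The height limit can be evaluated as in the following minimal sketch. It assumes that Ψ is the number of samples drawn for each tree, as in the standard isolation forest algorithm; the claim text shown here does not define Ψ, and the function name `height_limit` is hypothetical.

```python
import math


def height_limit(psi: int) -> int:
    """Return l = ceiling(log2(psi)) for a per-tree sample count psi (assumed meaning of Psi)."""
    if psi < 1:
        raise ValueError("psi must be a positive integer")
    return math.ceil(math.log2(psi))


# Height limits for a few illustrative sample counts.
limits = {psi: height_limit(psi) for psi in (64, 256, 300)}
```

When Ψ is not a power of two, rounding up gives the next whole level, so a sample of 300 points gets one more level than a sample of 256.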

5. The method for identifying the abnormal data of the business extension project according to claim 3, wherein training the isolated trees in the isolated forest model comprises:

after the split value p is selected, the samples contained in the current leaf node are divided into two parts, a new left leaf node and a new right leaf node, according to whether q ≥ p or q < p;

updating the tree height of the isolated tree, e = e + 1;
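The splitting step can be sketched as a recursive routine. This is an illustration under stated assumptions, not the procedure of the application: the attribute q and the split value p are drawn uniformly at random as in the standard isolation forest algorithm, samples with a value below p go to the left child, and the names `build_tree`, `leaf_sizes` and `depth` are hypothetical.

```python
import random


def build_tree(samples, e, limit, rng):
    """Grow one isolated tree over a list of equal-length tuples.

    e is the current tree height and limit is the height limit l.
    """
    # Stop at the height limit or when the node holds at most one sample.
    if e >= limit or len(samples) <= 1:
        return {"size": len(samples)}
    q = rng.randrange(len(samples[0]))      # randomly chosen attribute index
    lo = min(s[q] for s in samples)
    hi = max(s[q] for s in samples)
    if lo == hi:                            # identical values on q: cannot split
        return {"size": len(samples)}
    p = rng.uniform(lo, hi)                 # random split value between min and max
    left = [s for s in samples if s[q] < p]
    right = [s for s in samples if s[q] >= p]
    # The tree height is updated (e + 1) for both new children.
    return {"q": q, "p": p,
            "left": build_tree(left, e + 1, limit, rng),
            "right": build_tree(right, e + 1, limit, rng)}


def leaf_sizes(node):
    """Collect the number of samples held by every leaf."""
    if "size" in node:
        return [node["size"]]
    return leaf_sizes(node["left"]) + leaf_sizes(node["right"])


def depth(node):
    """Height of the deepest leaf below this node."""
    if "size" in node:
        return 0
    return 1 + max(depth(node["left"]), depth(node["right"]))


rng = random.Random(0)
data = [(rng.random(), rng.random()) for _ in range(64)]
tree = build_tree(data, 0, 6, rng)
```

Every sample ends up in exactly one leaf, and no leaf lies deeper than the height limit, which keeps the cost of building each tree bounded.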

6. The method for identifying abnormal data of a business expansion project according to claim 3, wherein inputting the dimension-reduced business expansion project data into the isolated forest model to perform anomaly detection on the business expansion project data to obtain a detection result comprises:

determining the abnormal condition of each sample by using an anomaly score s, which is determined by the following formula:

$c(n) = 2H(n-1) - \frac{2(n-1)}{n}$

$H(i) \approx \ln(i) + 0.5772156649$

the closer a sample x is to the root node, the larger its anomaly score.
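The normalization terms c(n) and H(i) can be combined into a score as sketched below. The score formula itself is not reproduced in the text above, so this sketch assumes the standard isolation forest score, s = 2^(-E(h(x))/c(n)), where E(h(x)) is the mean path length of sample x over all trees. The function names are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649  # constant used in the H(i) approximation


def harmonic(i: float) -> float:
    """H(i) ~ ln(i) + 0.5772156649."""
    return math.log(i) + EULER_GAMMA


def c(n: int) -> float:
    """c(n) = 2H(n-1) - 2(n-1)/n, used to normalize path lengths for n samples."""
    if n < 3:
        raise ValueError("this sketch only evaluates c(n) for n >= 3")
    return 2.0 * harmonic(n - 1) - 2.0 * (n - 1) / n


def anomaly_score(mean_path_length: float, n: int) -> float:
    """Assumed standard score s = 2 ** (-E(h(x)) / c(n))."""
    return 2.0 ** (-mean_path_length / c(n))


shallow = anomaly_score(2.0, 256)   # isolated close to the root
deep = anomaly_score(12.0, 256)     # isolated far from the root
```

A sample isolated after few splits gets a score close to 1, and a sample whose mean path length equals c(n) gets a score of 0.5, which matches the statement that samples nearer the root score higher.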

7. An apparatus for identifying abnormal data of a business expansion project, comprising:

the business expansion project acquisition module is used for acquiring business expansion project data, and the business expansion project data comprises project process node number, node working duration and project cost;

the dimensionality reduction module is used for performing feature recombination and dimensionality reduction on the business expansion project data by adopting the t-distributed stochastic neighbor embedding algorithm t-SNE;

the construction module is used for constructing an isolated forest model;

and the detection module is used for inputting the dimension-reduced business expansion project data into the isolated forest model to perform anomaly detection on the business expansion project data to obtain a detection result.

8. The apparatus for identifying anomalous data in a business expansion project of claim 7, wherein the dimension reduction module comprises:

a high-dimensional probability calculation unit, configured to define the business expansion project data as a set of n d-dimensional samples $X = \{x_1, x_2, \ldots, x_n\}$; for any two points $x_i$ and $x_j$, assuming that $x_j$ obeys a Gaussian distribution $P_i$ centered at $x_i$ with variance $\sigma_i$, and likewise that $x_i$ obeys a Gaussian distribution $P_j$ centered at $x_j$ with variance $\sigma_j$, so that the conditional probability $P_{j|i}$ that $x_i$ and $x_j$ are similar is:

wherein $P_{j|i}$ is the high-dimensional distribution probability between data points $x_i$ and $x_j$ in the business expansion project data; $\sigma_i$ is the variance of the Gaussian centered at data point $x_i$; the optimal value of $\sigma_i$ is found by binary search on the perplexity, and the perplexity $\mathrm{Perp}(P_i)$ can be expressed as:

Is defined as follows

a dimension reduction determination unit, configured to measure, using the KL divergence, the similarity $C$ between $P_{j|i}$ and $q_{i|j}$:
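The binary search performed by the high-dimensional probability calculation unit can be sketched as follows. This is a minimal illustration assuming the standard t-SNE definitions, a Gaussian kernel for $P_{j|i}$ and perplexity equal to 2 raised to the Shannon entropy. The function names, the search bounds and the iteration count are hypothetical choices, not values from the application.

```python
import math


def conditional_probs(sq_dists, sigma):
    """P_{j|i} for one point x_i, given squared distances to every other point."""
    d_min = min(sq_dists)
    # Subtracting d_min leaves the ratios unchanged and avoids an all-zero sum.
    weights = [math.exp(-(d - d_min) / (2.0 * sigma ** 2)) for d in sq_dists]
    total = sum(weights)
    return [w / total for w in weights]


def perplexity(probs):
    """Perp(P_i) = 2 ** H(P_i), with H the Shannon entropy in bits."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0.0)
    return 2.0 ** entropy


def find_sigma(sq_dists, target, lo=1e-3, hi=1e3, iters=100):
    """Bisection on sigma_i; perplexity grows as sigma_i grows."""
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if perplexity(conditional_probs(sq_dists, mid)) > target:
            hi = mid    # distribution too flat: shrink sigma
        else:
            lo = mid    # distribution too peaked: grow sigma
    return (lo + hi) / 2.0


sq_dists = [1.0, 4.0, 9.0, 16.0, 25.0]   # illustrative squared distances from x_i
sigma_i = find_sigma(sq_dists, target=3.0)
probs = conditional_probs(sq_dists, sigma_i)
```

The target perplexity acts as an effective number of neighbors, so each point gets its own $\sigma_i$: small in dense regions and large in sparse ones.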

9. An apparatus, comprising:

at least one memory and at least one processor;

the memory being configured to store one or more programs;

wherein the one or more programs, when executed by the at least one processor, cause the at least one processor to implement the steps of the method for identifying abnormal data of a business expansion project according to any one of claims 1 to 6.

10. A computer-readable storage medium storing a computer program, characterized in that:

the computer program when executed by a processor implements the steps of the method of any one of claims 1 to 6.