CN116128544A

CN116128544A - Active auditing method and system for electric power marketing abnormal business data

Info

Publication number: CN116128544A
Application number: CN202211642952.4A
Authority: CN
Inventors: 于瑞强; 喻魏贤; 闫谷丰; 李锐; 赵轩臣; 李晓宇; 刘效强; 邵江东; 李慧霖; 李万勇; 杨玉传
Original assignee: YANTAI HAIYI SOFTWARE CO Ltd
Current assignee: YANTAI HAIYI SOFTWARE CO Ltd
Priority date: 2022-12-20
Filing date: 2022-12-20
Publication date: 2023-05-16

Abstract

The invention discloses an active auditing method and system for abnormal business data in electric power marketing. The method comprises the following steps: calculating the information gain rate among the classification attributes of the given data to be audited and analyzing the business association among the classification attributes; splicing the attributes with significant correlation, based on the correlation analysis result, to form new mixed data; performing frequency feature transformation on the classification attributes of the new mixed data to generate attribute feature data that can be input directly into model training; constructing and training an enhanced isolated forest model on the attribute feature data using the enhanced isolated forest algorithm; calculating anomaly scores of the electric power marketing business data with the trained enhanced isolated forest model; adaptively discriminating the abnormal group based on the anomaly scores to obtain an adaptive abnormal result; and combining the adaptive abnormal result with the electric power marketing business data to output the final abnormality judgment result.

Description

Active auditing method and system for electric power marketing abnormal business data

Technical Field

The invention belongs to the technical field of power systems, and particularly relates to an active auditing method and system for abnormal business data of power marketing.

Background

In the electric power marketing business, marketing auditing is an important means of improving internal control, finding marketing errors and strengthening risk management of the marketing business. With the advance of informatization in the electric power field, it is necessary to use automated means to carry out online, routine auditing of business data such as electricity-use change services, business expansion and installation, and electricity price and electricity charge management. The informatization of electric power marketing has entered the fast lane and is currently at a critical stage of digital transformation.

In conventional anomaly auditing, anomalies are usually checked against rules manually defined by business experts, and a record that does not conform to the rules is treated as an anomaly. The existing technical solutions can be summarized as follows:

The first category is the method of manually defined rules. Existing electric power marketing auditing mainly audits business data with rule-based methods. For example, for anomalies in the meter reading, verification and collection category: "abnormal electricity consumption type: the electricity consumption type and the electricity price code in the customer file information are inconsistent, which is regarded as abnormal"; for capacity anomalies in the user file category: "a power supply capacity of 0 or empty in the customer file information is regarded as abnormal". Anomalies are audited with expert-defined rules such as these, and a record that does not conform to a rule is reported as an anomaly.

Drawbacks of manually defined rules: auditing with manually defined rules has the following main drawbacks. The rules are defined manually by experts, so the workload is huge and a great deal of labor and time is consumed. Human effort is limited: facing huge data volumes and complex anomaly types, business experts cannot formulate a complete rule set that covers every abnormal condition, so audit coverage is limited. Rule-based auditing can only detect anomalies covered by the rules. For example, under the user file capacity rule "a power supply capacity of 0 or empty in the customer file information is regarded as abnormal", only a capacity of 0 or an empty capacity is flagged; an obvious extreme value such as 99999 goes undetected, so the detection rate of abnormal data during auditing is low.
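The coverage limitation described above can be illustrated with a minimal sketch. The function name `rule_capacity_anomaly` and the sample capacities are hypothetical; only the rule itself (capacity of 0 or empty is abnormal) and the extreme value 99999 come from the text.

```python
def rule_capacity_anomaly(capacity):
    """Expert rule quoted in the text: a capacity of 0 or empty is abnormal."""
    return capacity is None or capacity == 0

# Hypothetical power supply capacities; 99999 is the extreme value from the text.
capacities = [50, 0, None, 80, 99999, 100]
flags = [rule_capacity_anomaly(c) for c in capacities]

# Only the 0 and the empty value are flagged; 99999 passes the rule unnoticed.
print(flags)
```

The rule behaves exactly as written, which is the point: any anomaly that no rule anticipates is never reported.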

The second category is supervised machine learning. Since the goal of electric power marketing auditing is to find problem data, many supervised learning methods such as decision trees, random forests and logistic regression have been introduced into the auditing field to identify problem data. Because the audited business covers different fields such as business expansion, meter reading, verification and collection, and customer service, each with different requirements, different models must be built.

Drawbacks of supervised methods: for supervised methods such as logistic regression, decision trees and random forests, the main drawbacks are as follows. A separate model must be built for each target, so the models generalize poorly and cannot adapt to the many business categories brought by the continuous expansion of the auditing business. Training a supervised model requires labeled data, but the business data generated by electric power marketing are unlabeled and very large in volume, so manual labeling would require a huge workload.

The third category is unsupervised machine learning. Unsupervised learning methods are also applied in the auditing field, mainly in the form of anomaly detection, for example cluster analysis, LOF (Local Outlier Factor) and the isolated forest algorithm.

Drawbacks of unsupervised methods: most existing unsupervised anomaly detection methods are suitable only for data of a single type and cannot be used for mixed data such as electric power marketing business data. Methods such as LOF and kNN (K-Nearest Neighbor) are computationally expensive and cannot meet the high-performance requirements of processing the large data volumes in the auditing business. Existing methods such as the isolated forest iForest (Isolation Forest) and POD (Pattern-Based Outlier Detection) only give an anomaly score and cannot state specifically whether each piece of data is abnormal. In addition, the classical iForest algorithm can only process numerical data and cannot meet the requirements of processing diverse business data.

In the electric power marketing business, marketing auditing is an important means of improving internal control, discovering marketing errors and strengthening risk management of the marketing business. The business data of electric power marketing include customer file data from business expansion and installation management, and meter reading and charge data from electricity price and electricity charge management. These data are structured data stored in relational databases. When the data are entered, erroneous data may be introduced by operator mistakes, by unfamiliarity with the relevant business policies, or by data acquisition and transmission faults. With the advance of informatization in the electric power field, it is therefore necessary to use informatization and automation means to carry out online, routine auditing of business data such as electricity-use change services, business expansion and installation, and electricity price and electricity charge management, to find problem data and to improve the compliance of the marketing business.

In summary, in addition to rule-driven auditing, current power marketing auditing services are urgently required to actively identify anomalies for which rules have not been explicitly defined.

Disclosure of Invention

In order to overcome the problems in the prior art, the invention provides an active auditing method and system for abnormal business data of electric power marketing.

The technical scheme for solving the technical problems is as follows:

an active auditing method for electric power marketing abnormal business data comprises the following steps:

step 1, calculating the correlation among classification attributes by using the information gain rate as an evaluation index for given electric power marketing business data to be audited;

step 2, based on the correlation analysis result among the classification attributes in the electric power marketing business data, splicing the classification attributes with obvious correlation to generate new classification attributes, deleting the original classification attributes and forming new mixed data;

step 3, based on the new mixed data, carrying out frequency characteristic transformation on the classification attribute, converting the classification data into numerical data, and generating attribute characteristic data which can be directly input into model training;

step 4, constructing and training an enhanced isolated forest model on the attribute characteristic data using the enhanced isolated forest algorithm;

step 5, calculating abnormal scores of the electric power marketing business data based on the trained enhanced isolated forest model;

step 6, constructing a unary beta mixture model (Unary Beta Mixed Model) based on the abnormal scores of the electric power marketing business data to discriminate the abnormal group adaptively;

step 7, identifying the optimal unary beta mixture model through the ICL-BIC criterion (Integrated Completed Likelihood-Bayesian Information Criterion), and taking the abnormal result output by the optimal model as the final output abnormal result.

Further, the step 1 calculates the correlation between the classification attributes by using the information gain rate as the evaluation index for the given electric power marketing business data to be audited, and specifically includes:

step 1-1, calculating information gain rates among various classification attributes in electric power marketing business data to be audited, wherein the information gain rates are calculated according to the following formula:

$$\mathrm{GainRatio}(A_i \mid A_j) = \frac{\mathrm{Gain}(A_i \mid A_j)}{H(A_j)}$$

$$\mathrm{Gain}(A_i \mid A_j) = H(A_i) - H(A_i \mid A_j)$$

$$H(A_i) = -\sum_{m} p(v_m) \log p(v_m)$$

wherein $A_i$ and $A_j$ respectively represent two different classification attributes; $\mathrm{GainRatio}(\cdot)$ and $\mathrm{Gain}(\cdot)$ respectively represent the information gain rate and the information gain between classification attributes, and $\mathrm{GainRatio}(A_i \mid A_j)$ represents the information gain rate of attribute $A_j$ with respect to $A_i$; $H(\cdot)$ and $H(\cdot \mid \cdot)$ respectively represent information entropy and conditional entropy, e.g. $H(A_i \mid A_j)$ is the information entropy of attribute $A_i$ under the condition of attribute $A_j$; $v_m$ represents the m-th attribute value of classification attribute $A_i$, and $p(v_m)$ is the proportion of the values of attribute $A_i$ that are equal to $v_m$;

step 1-2, converting the information gain rate among the classification attributes into correlation among the classification attributes, wherein the formula is as follows:

wherein the correlation between classification attributes $A_i$ and $A_j$ takes values in $[0, 1]$; the larger the value, the stronger the correlation.

Further, the step 4 constructs and trains an enhanced isolated forest model on the attribute characteristic data using the enhanced isolated forest algorithm, and the modeling process is as follows:

step 4-1, setting the number/proportion of samples and the number of isolated trees required for constructing a model;

step 4-2, sampling: randomly extracting a set number or proportion of subsamples from the electric power marketing business data; setting a root node for each partition tree, and taking the root node as a current node;

step 4-3, attribute segmentation; on the current node, randomly selecting a plurality of attributes as target attributes for segmentation, and respectively constructing a left subtree and a right subtree of the segmentation node according to a segmentation result;

the segmentation strategy is specifically as follows:

$$(\vec{x} - \vec{p}) \cdot \vec{n}$$

wherein $\vec{x}$ represents a vector consisting of the values of the several target attributes; $\vec{p}$ represents the intercept vector, obtained by drawing, for each target attribute, a value from the uniform distribution between its minimum and maximum values; $\vec{n}$ represents the normal vector, which can be understood simply as the slope of the dividing plane, composed of one value drawn for each target attribute from the standard normal distribution N(0, 1);

then, by calculating

$$(\vec{x} - \vec{p}) \cdot \vec{n} = 0$$

a segmentation hyperplane is obtained, and the sample set of the current node is segmented based on this hyperplane: if the value of the segmentation strategy for a sample's target attribute values is smaller than 0, i.e. the sample lies below the segmentation hyperplane, the sample falls into the left subtree, and the remaining samples fall into the right subtree;

step 4-4, constructing an isolated tree; repeating step 4-3 in the child nodes until a child node contains only one piece of data and cannot be split further, or the child node reaches the set maximum depth of the tree, at which point the construction of the isolated tree is complete;

step 4-5, constructing an enhanced isolated forest; according to the sampling number/proportion and the number of trees set in step 4-1, repeating steps 4-2, 4-3 and 4-4 to complete the construction of all the isolated trees and form the enhanced isolated forest.

Further, in the step 5, based on the trained enhanced isolated forest model, an anomaly score of the electric power marketing business data is calculated, which specifically includes the following sub-steps:

step 5-1, inputting all data to be audited into an enhanced isolated forest model, and obtaining T segmentation results when each data point passes through the enhanced isolated forest model formed by T trees;

Step 5-2, for each data point, calculating the path length h (x) of the point in each isolated tree, namely the number of edges through which the sample point passes from the root node to the leaf node of the tree;

step 5-3, calculating the average path length of each data point in the enhanced isolated forest (namely over all trees), wherein the calculation formula is as follows:

$$E(h(x)) = \frac{1}{T} \sum_{t=1}^{T} h_t(x)$$

wherein $E(h(x))$ represents the average path length, $T$ represents the number of trees in the enhanced isolated forest, and $h_t(x)$ represents the path length of the data point on the t-th tree;

step 5-4, carrying out normalization processing on the average path length of all the data points, and calculating an anomaly score for each data point, wherein the calculation formula is as follows:

$$s(x, n) = 2^{-\frac{E(h(x))}{c(n)}}$$

wherein $s(x, n)$ represents the anomaly score of data point $x$, $n$ represents the data amount of the input sample when the tree is constructed, and $c(n)$ is the global average path length used for normalization, given by:

$$c(n) = 2H(n-1) - \frac{2(n-1)}{n}$$

where $H(k) = \ln(k) + \varepsilon$ and $\varepsilon = 0.5772156649$ is the Euler constant.

Further, in the step 6, based on the anomaly scores of the electric power marketing business data, a unary beta mixture model is constructed to discriminate the abnormal group adaptively, which specifically comprises:

step 6-1, performing cluster analysis on the anomaly scores using the k-means method, dividing the anomaly scores of the electric power marketing business data to be audited into k components;

step 6-2, constructing a unary beta mixture model, and outputting the unary beta mixture model and the abnormal result.

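Further, the unary beta mixture model includes:

unary beta mixture distribution:

$$f(V_i \mid \lambda, \alpha, \beta) = \sum_{m=1}^{M} \lambda_m B_m(V_i \mid \alpha_m, \beta_m)$$

wherein $B_m(V_i \mid \alpha_m, \beta_m)$ is the unary beta distribution of the m-th component; $V_i$ represents the i-th anomaly score in the anomaly score vector; $\alpha = \{\alpha_1, \ldots, \alpha_M\}$ and $\beta = \{\beta_1, \ldots, \beta_M\}$ represent the parameters of the unary beta distributions of the $M$ components; $\lambda = \{\lambda_1, \ldots, \lambda_M\}$ represents the mixing coefficients of the $M$ components, and $\sum_{m=1}^{M} \lambda_m = 1$;

the probability density function of the unary beta distribution is:

$$B_m(V_i \mid \alpha_m, \beta_m) = \frac{\Gamma(\alpha_m + \beta_m)}{\Gamma(\alpha_m)\,\Gamma(\beta_m)} V_i^{\alpha_m - 1} (1 - V_i)^{\beta_m - 1}$$

wherein $\Gamma(\cdot)$ represents the gamma function;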

using maximum likelihood estimation to estimate the parameters of the unary beta mixture model, the log-likelihood function is:

$$\log L(V, \eta \mid P) = \sum_{i=1}^{N} \sum_{m=1}^{M} \eta_{im} \log\big(\lambda_m B_m(V_i \mid \alpha_m, \beta_m)\big)$$

wherein $\log(\cdot)$ represents the logarithm; $V$ represents the anomaly score vector; the parameter set $P = \{\lambda_1, \ldots, \lambda_M, \alpha_1, \ldots, \alpha_M, \beta_1, \ldots, \beta_M\}$ is the set of parameters to be estimated; $\eta = \{\eta_1, \ldots, \eta_N\}$ is the membership vector associated with the $N$ anomaly scores in $V$ and indicates the component to which each anomaly score $V_i$ belongs: $\eta_{im}$ indicates whether the i-th anomaly score $V_i$ belongs to the m-th component, with $\eta_{im} = 1$ if it does and $\eta_{im} = 0$ if it does not.

Further, in the step 7, an ICL-BIC criterion is adopted to search an optimal model, and the calculation formula is as follows:

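$$\mathrm{ICL\text{-}BIC}(M) = -2 \log L(V, \hat{\eta} \mid \hat{P}) + Q_M \log(N)$$

wherein $\hat{P}$ represents the estimate of the parameter set $P$, $\hat{\eta}$ represents the estimate of the membership vector $\eta$, $Q_M$ represents the number of parameters of the model having $M$ components, and $N$ represents the amount of data.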
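The model selection in step 7 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the criterion is taken in its common form (minus twice the completed log-likelihood plus a parameter penalty, smaller is better), the parameter count Q_M = 3M - 1 (M alphas, M betas, M - 1 free mixing coefficients) is an assumed counting convention, and the log-likelihood values are hypothetical.

```python
import math

def icl_bic(complete_log_likelihood, n_components, n_data):
    """ICL-BIC in the form -2 * log L + Q_M * log(N); smaller is better."""
    # Assumed count: M alphas + M betas + (M - 1) free mixing coefficients.
    q_m = 3 * n_components - 1
    return -2.0 * complete_log_likelihood + q_m * math.log(n_data)

# Hypothetical completed log-likelihoods of fitted models, keyed by M.
fitted = {2: -120.0, 3: -118.5}
n_data = 1000

scores = {m: icl_bic(ll, m, n_data) for m, ll in fitted.items()}
best_m = min(scores, key=scores.get)
print(best_m, scores)
```

In this made-up example the three-component model fits slightly better, but the penalty on its extra parameters outweighs the gain, so the two-component model is selected.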

The invention also discloses an active auditing system for abnormal business data in electric power marketing, comprising: a data source management module, a data reading module, a feature processing module, a model training module, an abnormal output module and an abnormal output management module;

the data source management module is used for managing a database of data required by the power marketing business data abnormal auditing system, access information and a database table of the marketing business data required to be audited;

the data reading module is used for acquiring required data from the data source management module, performing preprocessing operations including missing value processing, deduplication, format conversion and the like, and uniformly storing the processed data as table data required by the system;

the feature processing module is used for acquiring table data from the data reading module, carrying out certain transformation and integration on the table data, calculating information gain rate on classification attribute features in the electric power marketing business data features, evaluating correlation among the classification attributes and completing field splicing, and then carrying out feature transformation on the classification data based on the classification attribute frequency to convert the classification data into numerical value data so as to generate attribute feature data capable of being directly input into model training;

The model training module is used for randomly extracting a certain amount or proportion of data from the data output by the characteristic processing module to train the enhanced isolated forest algorithm model;

the abnormal output module is used for outputting normal values and abnormal values in the electric power marketing business data;

the abnormal output management module is used for managing final result display and output work by combining the abnormal result output by the abnormal result identification module with the electric power marketing business data output by the data source management module.

Further, the abnormal output module comprises three sub-modules: an algorithm prediction sub-module, an adaptive abnormal population discrimination sub-module and an abnormal result identification sub-module;

the algorithm prediction sub-module is used for acquiring all data output by the feature processing module, inputting them into the model trained by the model training module for prediction, and outputting anomaly scores of the electric power marketing business data;

the adaptive abnormal population discrimination sub-module is used for, based on the anomaly score vector of the electric power marketing business data output by the algorithm prediction sub-module, initializing with k-means for each k value in a specified range and constructing a unary beta mixture model to adaptively determine the normal values and abnormal values in the data;

the abnormal result identification sub-module is used for identifying the optimal mixture model through the ICL-BIC criterion from the plurality of unary beta mixture models and abnormal results output by the adaptive abnormal population discrimination sub-module, and taking the abnormal result output by the optimal mixture model as the final output abnormal result.

Compared with the prior art, the invention has the following technical effects:

(1) In the auditing of electric power marketing business data, experts must spend a great deal of effort formulating rules, the summarized rules do not provide full coverage, and abnormal data cannot be identified adaptively because thresholds must be defined manually for anomaly evaluation. To address these problems, the invention provides an unsupervised auditing method for structured marketing data that is not driven by experience rules and supports different data types at the same time; it does not rely on the business experience of business experts to formulate prior rules, and uses machine learning to screen for anomalies actively and heuristically;

(2) Aiming at the problem that the classification data in the electric power marketing business data have an interdependence relationship, and the existing unsupervised anomaly detection method only supports independent attributes and ignores the association between businesses, the invention provides a correlation analysis and attribute processing method based on the information gain rate, which can better meet the requirements of electric power marketing inspection businesses;

(3) Aiming at the problem that the classical isolated forest algorithm cannot process the classified data, the invention provides an improved method, which is used for carrying out frequency characteristic transformation on the classified attribute and converting the classified attribute into numerical data so that the algorithm can be suitable for the mixed data;

(4) Aiming at the problem that the conventional unsupervised anomaly detection method only provides anomaly degree evaluation and cannot clearly identify an anomaly individual, the invention provides an anomaly result identification strategy based on k-means and unitary beta distribution, and anomaly data can be directly identified.

Drawings

FIG. 1 is a flow chart of an active auditing method for power marketing abnormal business data according to the present invention;

FIG. 2 is a block diagram of an active auditing system for power marketing exception business data according to the present invention;

FIG. 3 is a three-dimensional segmented hyperplane schematic of the present invention.

Detailed Description

The principles and features of the present invention are described below with reference to the drawings, the examples are illustrated for the purpose of illustrating the invention and are not to be construed as limiting the scope of the invention.

The invention aims to solve the following problem in current electric power marketing auditing: the auditing of business data depends too heavily on manually set rules, so problem data outside the scope of the rules cannot be found actively, which creates business risk.

In view of the above problems, the present invention provides an active auditing method for abnormal business data of electric power marketing, the method comprising the following steps:

and acquiring electric power marketing business data to be audited, performing preprocessing operations including missing value processing, deduplication, format conversion and the like, and uniformly storing the processed data as required table data.

For business-specific reasons, dependency relationships exist among multiple data values of different types in the electric power marketing business data, and these relationships reflect accumulated business experience. If every attribute is treated as independent, the effectiveness of anomaly detection is greatly reduced and the results are inconsistent with business expectations.

The invention firstly solves the problem of analyzing the potential association relation between the classified attribute services, namely, calculating the correlation between the classified attributes by using the information gain ratio as an evaluation index, and specifically comprises the following steps:

step 1-1, calculating information gain rates among various classification attributes in given electric power marketing business data to be audited;

Let $X = \{x_1, \ldots, x_N\}$ be a set of $N$ mixed-attribute data points, where $A_c$ and $A_n$ respectively represent the sets of classification attributes and continuous attributes.

For each classification attribute, the information gain rate between the classification attribute and other classification attributes is calculated, and the correlation between classification attributes is measured based on the information gain rate.

The information gain ratio has the following calculation formula:

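$$\mathrm{GainRatio}(A_i \mid A_j) = \frac{\mathrm{Gain}(A_i \mid A_j)}{H(A_j)}$$

$$\mathrm{Gain}(A_i \mid A_j) = H(A_i) - H(A_i \mid A_j)$$

$$H(A_i) = -\sum_{m} p(v_m) \log p(v_m)$$

wherein $A_i$ and $A_j$ respectively represent two different classification attributes; $\mathrm{GainRatio}(A_i \mid A_j)$ represents the information gain rate of attribute $A_j$ with respect to $A_i$; $H(A_i \mid A_j)$ is the information entropy of attribute $A_i$ under the condition of attribute $A_j$; $v_m$ represents the m-th attribute value of classification attribute $A_i$, and $p(v_m)$ is the proportion of the values of attribute $A_i$ that are equal to $v_m$.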
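As a minimal sketch of the information gain rate between two classification attributes, the helper functions below compute entropy, conditional entropy and their ratio on plain Python lists. Two assumptions are made that the text does not state: the gain is normalized by the entropy of the conditioning attribute (the usual C4.5 convention), and logarithms are taken to base 2. All function names and sample values are hypothetical.

```python
import math
from collections import Counter

def entropy(values):
    """Information entropy H(A) of a categorical column."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def conditional_entropy(target, given):
    """Conditional entropy H(target | given)."""
    groups = {}
    for t, g in zip(target, given):
        groups.setdefault(g, []).append(t)
    n = len(target)
    return sum(len(ts) / n * entropy(ts) for ts in groups.values())

def gain_ratio(target, given):
    """Gain(target | given) divided by H(given) (assumed normalization)."""
    split_info = entropy(given)
    if split_info == 0.0:
        return 0.0  # a constant attribute carries no information
    return (entropy(target) - conditional_entropy(target, given)) / split_info

# Hypothetical columns: a coarse code and a finer code that determines it.
coarse = ["a", "a", "b", "b"]
fine = ["1", "2", "3", "4"]
print(gain_ratio(coarse, fine), gain_ratio(fine, coarse))
```

The two printed values differ, which illustrates why the gain rate is asymmetric and needs a further transformation before it can serve as a correlation measure.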

Step 1-2, converting the information gain rate among the classification attributes into correlation among the classification attributes;

due to the asymmetry of the information gain rate, i.e. $\mathrm{GainRatio}(A_i \mid A_j)$ and $\mathrm{GainRatio}(A_j \mid A_i)$ ($i \neq j$) are in general not equal, it cannot be used directly to measure the correlation between classification attributes, and a further transformation is required, as shown in the following formula:

wherein the correlation between classification attributes $A_i$ and $A_j$ takes values in $[0, 1]$; the larger the value, the stronger the correlation.

Step 2, based on the correlation analysis results among the classification attributes in the electric power marketing business data, splicing the classification attributes with significant correlation to generate new classification attributes, and deleting the original classification attributes to form new mixed data.

Correlations may exist between classification attributes, and in electric power marketing business data they can be significant. For example, there is a strong correspondence between the classification attributes "YHLBDM" (user class code) and "JLFSDM" (metering mode code), and between "DJDM" (electricity price code) and "YDLBDM" (electricity use class code). To be able to find anomalies hidden in multidimensional mixed data, the invention splices the classification attributes with significant correlation, based on the correlations calculated in step 1, to generate new classification attributes, deletes the original classification attributes, and merges the result with the continuous attributes. Typically the correlation threshold may be set to 0.5: a correlation greater than 0.5 indicates significant correlation; otherwise the attributes are regarded as uncorrelated or weakly correlated.
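The splicing step can be sketched as below. The attribute codes and the 0.5 threshold come from the text; the correlation values, the record contents, the `|` separator and the function names are hypothetical. How to treat an attribute that appears in more than one significant pair is not specified in the text, and this sketch does not address that case.

```python
def select_pairs(correlations, threshold=0.5):
    """Keep attribute pairs whose correlation exceeds the threshold."""
    return [pair for pair, value in correlations.items() if value > threshold]

def splice(rows, pairs, sep="|"):
    """Concatenate each correlated pair into one new attribute, drop the originals."""
    result = []
    for row in rows:
        new_row = dict(row)
        for a, b in pairs:
            new_row[f"{a}_{b}"] = f"{row[a]}{sep}{row[b]}"
        for a, b in pairs:
            new_row.pop(a, None)
            new_row.pop(b, None)
        result.append(new_row)
    return result

# Hypothetical correlation values for the attribute pairs named in the text.
correlations = {
    ("YHLBDM", "JLFSDM"): 0.82,
    ("DJDM", "YDLBDM"): 0.67,
    ("YHLBDM", "DJDM"): 0.21,
}
rows = [{"YHLBDM": "100", "JLFSDM": "1", "DJDM": "A1", "YDLBDM": "R", "capacity": 50.0}]

pairs = select_pairs(correlations)
mixed = splice(rows, pairs)
print(mixed)
```

The continuous attribute `capacity` is carried over unchanged, so the output is again mixed data with fewer, combined classification attributes.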

Step 3, carrying out frequency feature transformation on the classification attributes based on the new mixed data, converting the classification data into numerical data, and generating attribute feature data that can be input directly into model training.

The classical isolated forest algorithm is applicable only to numerical data. To make the enhanced isolated forest algorithm applicable to numerical data, classification data, and mixed data containing both, frequency feature transformation is performed on the classification data to convert them into numerical data.
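A minimal sketch of a frequency feature transformation follows. The text does not give the exact formula, so this example assumes that each category is replaced by its relative frequency in the column; the function name and the sample codes are hypothetical.

```python
from collections import Counter

def frequency_encode(column):
    """Replace each category by its relative frequency in the column (assumed form)."""
    counts = Counter(column)
    n = len(column)
    return [counts[value] / n for value in column]

# Hypothetical spliced attribute values: one combination is rare.
codes = ["100|1", "100|1", "100|1", "200|3"]
encoded = frequency_encode(codes)
print(encoded)  # [0.75, 0.75, 0.75, 0.25]
```

Under this assumption, rare combinations map to small numbers, so an unusual pairing of codes becomes a low value that a numerical anomaly detector can isolate.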

Step 4, constructing and training an enhanced isolated forest model on the attribute feature data using the enhanced isolated forest algorithm.

The invention calculates anomaly scores for the blended data based on an enhanced isolated forest algorithm. Because the classical isolated forest algorithm can only be applied to numerical data, the method expands the algorithm to support classification attributes.

Constructing and training an enhanced isolated forest model based on an enhanced isolated forest algorithm, wherein the modeling process is as follows:

step 4-1, setting the number (or proportion) of samples and the number of isolated trees required for constructing a model;

step 4-2, sampling: randomly extracting a set number or proportion of subsamples from the electric power marketing business data; for each partition tree, a root node is set, and the root node is used as a current node.

Step 4-3, attribute segmentation; on the current node, a plurality of attributes are randomly selected as target attributes for segmentation, and a left subtree and a right subtree of the segmentation node are respectively constructed according to segmentation results.

The segmentation strategy is specifically as follows:

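$$(\vec{x} - \vec{p}) \cdot \vec{n}$$

wherein $\vec{x}$ represents a vector consisting of the values of the several target attributes; $\vec{p}$ represents the intercept vector, obtained by drawing, for each target attribute, a value from the uniform distribution between its minimum and maximum values; $\vec{n}$ represents the normal vector, composed of one value drawn for each target attribute from the standard normal distribution N(0, 1);

then, by calculating

$$(\vec{x} - \vec{p}) \cdot \vec{n} = 0$$

a segmentation hyperplane is obtained (taking the selection of 3 target attributes as an example, the segmentation hyperplane is shown in fig. 3). The sample set of the current node is segmented based on this hyperplane: if the value of the segmentation strategy for a sample's target attribute values is smaller than 0, i.e. the sample lies below the segmentation hyperplane, the sample falls into the left subtree, and the remaining samples fall into the right subtree;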

step 4-4, constructing an isolated tree; repeating step 4-3 in the child nodes until a child node contains only one piece of data (and cannot be split further) or the child node reaches the set maximum depth of the tree, at which point the construction of the isolated tree is complete.

Step 4-5, constructing an enhanced isolated forest; according to the sampling number (or proportion) and the number of trees set in step 4-1, repeating steps 4-2, 4-3 and 4-4 to complete the construction of all the isolated trees and form the enhanced isolated forest.
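Steps 4-1 to 4-5 can be sketched in plain Python as follows. This is an illustrative reading of the steps, not the patented implementation: the dictionary-based tree layout, the function names, the parameter values and the synthetic data (63 points in the unit square plus one far-away point) are all assumptions.

```python
import random

def split_value(x, dims, intercept, normal):
    """Value of (x - p) . n over the selected target attributes."""
    return sum((x[d] - p) * n for d, p, n in zip(dims, intercept, normal))

def build_tree(samples, depth, max_depth, n_target, rng):
    # Step 4-4: stop when the node cannot be split or the depth limit is hit.
    if len(samples) <= 1 or depth >= max_depth:
        return {"size": len(samples)}
    # Step 4-3: random target attributes, N(0, 1) normal, uniform intercept.
    dims = rng.sample(range(len(samples[0])), n_target)
    normal = [rng.gauss(0.0, 1.0) for _ in dims]
    intercept = [rng.uniform(min(s[d] for s in samples),
                             max(s[d] for s in samples)) for d in dims]
    left = [s for s in samples if split_value(s, dims, intercept, normal) < 0]
    right = [s for s in samples if split_value(s, dims, intercept, normal) >= 0]
    return {"dims": dims, "normal": normal, "intercept": intercept,
            "left": build_tree(left, depth + 1, max_depth, n_target, rng),
            "right": build_tree(right, depth + 1, max_depth, n_target, rng)}

def build_forest(data, n_trees, sample_size, max_depth, n_target, seed=0):
    # Steps 4-1, 4-2 and 4-5: one random subsample and one tree per iteration.
    rng = random.Random(seed)
    return [build_tree(rng.sample(data, sample_size), 0, max_depth, n_target, rng)
            for _ in range(n_trees)]

def path_length(tree, x, depth=0):
    """Number of edges from the root to the leaf that x falls into."""
    if "size" in tree:
        return depth
    value = split_value(x, tree["dims"], tree["intercept"], tree["normal"])
    return path_length(tree["left" if value < 0 else "right"], x, depth + 1)

data_rng = random.Random(42)
inliers = [(data_rng.random(), data_rng.random()) for _ in range(63)]
outlier = (100.0, 100.0)
forest = build_forest(inliers + [outlier], n_trees=50, sample_size=64,
                      max_depth=8, n_target=2)
mean_outlier = sum(path_length(t, outlier) for t in forest) / len(forest)
mean_inlier = (sum(path_length(t, x) for t in forest for x in inliers)
               / (len(forest) * len(inliers)))
print(mean_outlier, mean_inlier)
```

The far-away point is separated after very few splits, so its mean path length is much shorter than that of the clustered points, which is the property the anomaly score relies on.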

And 5, calculating abnormal scores of the electric power marketing business data based on the trained enhanced isolated forest model. The step 5 specifically comprises the following substeps:

and 5-1, inputting data to be audited into the enhanced isolated forest model. For each data point, T segmentation results are obtained when an enhanced isolated forest model consisting of T trees is passed.

Step 5-2 for each data point, the path length h (x) of that point in each enhanced orphan tree is calculated, i.e., the number of edges that the sample point passes from the root node to the leaf node of the tree.

$$c(n) = 2H(n-1) - \frac{2(n-1)}{n}$$
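A short sketch of how c(n) is typically used. The text above gives only c(n); the scoring rule s = 2^(-E[h(x)] / c(n)) and the approximation H(i) ≈ ln(i) + 0.5772156649 are taken from the standard isolation forest formulation and are assumptions here, as are the function names.

```python
import math

EULER_GAMMA = 0.5772156649

def harmonic(i):
    """Approximate the harmonic number H(i) by ln(i) + Euler's constant."""
    return math.log(i) + EULER_GAMMA

def c(n):
    """Normalising term c(n) = 2H(n-1) - 2(n-1)/n for a subsample of n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0  # exact value, since H(1) = 1
    return 2.0 * harmonic(n - 1) - 2.0 * (n - 1) / n

def anomaly_score(path_lengths, n):
    """Assumed scoring rule: 2 ** (-mean path length / c(n)), higher = more abnormal."""
    mean_h = sum(path_lengths) / len(path_lengths)
    return 2.0 ** (-mean_h / c(n))
```

Under this rule a point whose mean path length equals c(n) scores 0.5, and shorter paths push the score towards 1.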

In order to obtain anomalies adaptively, a unitary beta mixture model is used to automatically distinguish normal values from abnormal values in the data space. The main steps are as follows:

Step 6-1, performing cluster analysis on the anomaly scores using the k-means method.

Min-Max normalization is applied to the anomaly scores obtained in step 5, so that each anomaly score is a real number in [0,1]. Cluster analysis is then performed on the normalized anomaly scores using the k-means method, dividing the anomaly scores of the data to be audited into k components. For each division result, the following unitary beta modeling operation is carried out.
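Step 6-1 can be sketched with plain Python. This is an illustrative one-dimensional k-means, not the implementation used in the patent; `min_max`, `kmeans_1d` and the random initialisation are assumptions.

```python
import random

def min_max(scores):
    """Rescale scores linearly to the interval [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def kmeans_1d(values, k, iters=50, seed=0):
    """Cluster scalar values into k groups; returns (labels, centers)."""
    rng = random.Random(seed)
    centers = rng.sample(values, k)  # initial centers picked from the data
    labels = [0] * len(values)
    for _ in range(iters):
        # Assign each value to its nearest center.
        labels = [min(range(k), key=lambda j: abs(v - centers[j])) for v in values]
        new_centers = []
        for j in range(k):
            members = [v for v, l in zip(values, labels) if l == j]
            # An empty cluster keeps its previous center.
            new_centers.append(sum(members) / len(members) if members else centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers
```

The resulting labels serve as the initial partition for the beta mixture fitted in step 6-2.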

Step 6-2, constructing a unitary beta mixture model to adaptively determine normal values and abnormal values in the data.

In order to adaptively distinguish normal values from abnormal values in the data, a unitary beta mixture model is constructed; describing the anomaly score vector by a mixture of unitary beta distributions gives a flexible model.

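Unitary beta mixture distribution:

$$f(v \mid P) = \sum_{m=1}^{M} \lambda_m \, \mathrm{Beta}(v;\, \alpha_m, \beta_m)$$

The probability density function of the unitary beta distribution is:

$$\mathrm{Beta}(v;\, \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)} \, v^{\alpha - 1} (1 - v)^{\beta - 1}$$

wherein Γ(·) represents the gamma function.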

wherein log(·) represents the logarithm, V represents the anomaly score vector, and the parameter set P = {λ_1, …, λ_M, α_1, …, α_M, β_1, …, β_M} represents the set of parameters to be estimated; η = {η_1, …, η_N} is associated with the N anomaly scores in V and indicates the component to which each anomaly score V_i belongs: η_im indicates whether the i-th anomaly score V_i belongs to the m-th component, where η_im = 1 means that it does and η_im = 0 means that it does not.
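The expression that the paragraph above annotates is not reproduced in this text. A form consistent with the symbols it defines, given as an assumption rather than as the patent's exact formula, is the complete-data log-likelihood of the mixture, with Beta(·; α_m, β_m) denoting the unitary beta density (notation chosen for this sketch):

```latex
\log L(V, \eta \mid P)
  = \sum_{i=1}^{N} \sum_{m=1}^{M}
    \eta_{im} \, \log\bigl( \lambda_m \, \mathrm{Beta}(V_i;\, \alpha_m, \beta_m) \bigr)
```

Because η_im equals 1 for exactly one component per score, each score contributes only the log term of the component to which it is assigned.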

The parameter set P is estimated using the EM algorithm, which iterates between an expectation step (E-step) and a maximization step (M-step) to produce a sequence of estimates of P, where I denotes the current iteration step. The iteration stops when the log-likelihood function value log(L(V, η|P)) converges, i.e. its change no longer exceeds a threshold.
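A minimal sketch of such an EM iteration, under loudly stated assumptions: the M-step uses weighted moment matching instead of an exact maximisation (beta parameters have no closed-form maximum likelihood estimate), the E-step uses soft responsibilities, and convergence is tested on the observed-data log-likelihood. `fit_beta_mixture` and its parameters are hypothetical names, and `labels` stands for the initial partition from k-means.

```python
import math

def beta_logpdf(v, a, b):
    """Log density of the beta distribution at v in (0, 1)."""
    return (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
            + (a - 1.0) * math.log(v) + (b - 1.0) * math.log(1.0 - v))

def fit_beta_mixture(values, labels, k, iters=100, tol=1e-6):
    eps = 1e-6
    v = [min(max(x, eps), 1.0 - eps) for x in values]  # keep log() finite at 0 and 1
    n = len(v)
    # Initial memberships come from the supplied partition.
    resp = [[1.0 if labels[i] == m else 0.0 for m in range(k)] for i in range(n)]
    prev = -math.inf
    lam, alpha, beta, loglik = [], [], [], prev
    for _ in range(iters):
        # M-step (approximate): weighted moment matching per component.
        lam, alpha, beta = [], [], []
        for m in range(k):
            w = max(sum(r[m] for r in resp), eps)
            mean = sum(r[m] * x for r, x in zip(resp, v)) / w
            mean = min(max(mean, 1e-4), 1.0 - 1e-4)
            var = sum(r[m] * (x - mean) ** 2 for r, x in zip(resp, v)) / w
            var = min(max(var, 1e-8), 0.99 * mean * (1.0 - mean))
            common = mean * (1.0 - mean) / var - 1.0
            lam.append(w / n)
            alpha.append(mean * common)
            beta.append((1.0 - mean) * common)
        # E-step: posterior probability of each component for each score.
        loglik = 0.0
        for i, x in enumerate(v):
            logs = [math.log(lam[m]) + beta_logpdf(x, alpha[m], beta[m])
                    for m in range(k)]
            top = max(logs)
            norm = top + math.log(sum(math.exp(l - top) for l in logs))
            resp[i] = [math.exp(l - norm) for l in logs]
            loglik += norm
        if abs(loglik - prev) < tol:  # stop once the log-likelihood settles
            break
        prev = loglik
    return lam, alpha, beta, resp, loglik
```

A hard membership vector can be read off afterwards by assigning each score to the component with the largest responsibility.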

Through step 6, a unitary beta mixture model is built separately for each k value within the specified range. In order to identify the optimal model, the ICL-BIC criterion is adopted to search for the optimal model, and the calculation formula is as follows:

wherein $\hat{P}$ represents an estimate of the parameter set P, $\hat{\eta}$ represents an estimate of the membership vector η, Q_M represents the number of model parameters (i.e. the length of the parameter set P) of a model having M components, and N represents the amount of data.

The smaller the calculated ICL-BIC value, the better the model; the model with the smallest value describes the spatial distribution of the data most effectively.
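Model selection by ICL-BIC can be sketched as below. The criterion is assumed to take the commonly used form ICL-BIC(M) = -2 log L + Q_M log N, which agrees with the quantities defined above and with the smaller-is-better rule, but it is not confirmed by this text. Q_M is taken as 3M (one λ, α and β per component), and `icl_bic` and `select_model` are hypothetical names.

```python
import math

def icl_bic(loglik, n_components, n_data):
    """Assumed form: -2 * log-likelihood + Q_M * log(N), with Q_M = 3 * M."""
    q_m = 3 * n_components  # lambda, alpha and beta for each component
    return -2.0 * loglik + q_m * math.log(n_data)

def select_model(candidates, n_data):
    """candidates maps a component count M to the log-likelihood of its fitted model."""
    return min(candidates, key=lambda m: icl_bic(candidates[m], m, n_data))
```

The penalty term grows with the number of components, so an extra component is only accepted when it raises the log-likelihood enough to offset it.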

In the selected optimal model, all data are divided into M components according to the unitary beta distributions. Owing to the characteristics of the anomaly scores produced by the enhanced isolated forest algorithm, the anomaly scores of normal data points are extremely similar (or equal), so these points are assigned to the same component. Because in the present invention a larger anomaly score indicates a more abnormal record and a smaller score a more normal one, the components with higher mean anomaly scores are merged to obtain the abnormal result, outliers. The outliers flag takes a value in {0,1}, where 0 represents normal and 1 represents abnormal.
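The merging of components into a 0/1 flag can be sketched as follows. The text does not state which components count as having "higher" mean scores, so this sketch assumes that only the component with the lowest mean score is normal and that all others are merged as abnormal; `flag_outliers` is a hypothetical name.

```python
def flag_outliers(scores, labels):
    """Return 0 for the lowest-mean component and 1 for every other component."""
    groups = {}
    for s, l in zip(scores, labels):
        groups.setdefault(l, []).append(s)
    means = {l: sum(g) / len(g) for l, g in groups.items()}
    normal = min(means, key=means.get)  # assumption: one normal component
    return [0 if l == normal else 1 for l in labels]
```

A different cut between normal and abnormal components would only change how `normal` is chosen.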

Method verification

In order to verify the feasibility and effectiveness of the active auditing method for abnormal data, experiments were carried out on electric power marketing business data, taking business expansion archive data, the most common data in marketing auditing, as an example:

Electric power marketing business data with a sample size of 1,000,000 records is extracted from the database; the specific information is shown in Table 1 below. There are 11 classification variables, namely { "YHLBDM", "JFFDM", "YDLDM", "DJDM", "BSJFBZ", "XSJSYSBDM", "XSJFBZ", "JBDFJSBSM", "BSFTFSDM", "LTDFJSFS" }, and 5 continuous variables, namely { "JFRL", "YDRL", "YGXSJSZ", "WGXSJSZ", "JBFFTZ" }.

First, the classification attributes in the electric power business data are analyzed: the information gain rates among the classification attributes are calculated according to step 1, and attribute splicing is completed. The calculated information gain rates show significant correlation within { "YHLBDM", "JLFSDM" }, { "DJDM", "YDLBDM" } and { "BSJFBZ", "BSFTFSDM" }. The significant correlation within { "DJDM", "YDLBDM" } is consistent with the rule example given above, in which a record whose electricity use category (YDLBDM) in the customer profile information does not match its electricity price code (DJDM) is abnormal, which shows that macroscopic business rules exist among the classification attributes. The attributes with significant correlation are therefore spliced and encoded to form new classification attributes, and the original attributes are deleted. Splicing the related fields ensures that multidimensional mixed anomalies in the data can be found in later steps, and also reduces the dimensionality and improves the operating efficiency of the algorithm. The power business data set after field splicing is shown in Table 2 below.
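The correlation analysis and field splicing described above can be sketched as follows. The exact gain-rate formula is not reproduced in this text, so the normalisation used here, (H(X) - H(X|Y)) / H(X), which keeps the result in [0,1], is an assumption; `entropy`, `conditional_entropy`, `gain_ratio` and `splice` are hypothetical names.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (base 2) of a categorical column."""
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())

def conditional_entropy(x, y):
    """H(X | Y): entropy of x within each group of equal y, weighted by group size."""
    n = len(x)
    groups = {}
    for xv, yv in zip(x, y):
        groups.setdefault(yv, []).append(xv)
    return sum(len(g) / n * entropy(g) for g in groups.values())

def gain_ratio(x, y):
    """Assumed normalisation: information gain divided by H(X), a value in [0, 1]."""
    hx = entropy(x)
    if hx == 0:
        return 0.0
    return (hx - conditional_entropy(x, y)) / hx

def splice(x, y, sep="_"):
    """Concatenate two strongly correlated columns into one new categorical column."""
    return [f"{a}{sep}{b}" for a, b in zip(x, y)]
```

In use, column pairs whose gain ratio exceeds a chosen threshold would be passed to `splice`, and the two original columns dropped.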

Table 3 shows the output form of the anomaly results of the auditing method according to the present invention. As shown in the table, the anomaly score and the adaptive anomaly result, outliers, are output alongside the electric power business data. The anomaly score is obtained by calculating an anomaly coefficient with the enhanced isolated forest algorithm and normalizing it by Min-Max; outliers is the anomaly flag (1 is abnormal, 0 is normal) for the electric power marketing business data, output by the unitary beta mixture model.

Data anomalies 1 to 3 in Table 3 are extreme-value anomalies in the continuous attributes: for example, in the first record the continuous attributes "JFRL", "YDRL" and "JBDFFTZ" are all 80000, whereas their averages are only 2.20, 40.66 and 2.69, so this record is abnormal. Data anomalies 4 to 7 are multi-dimensional sparse anomalies in the classification attributes: for example, when "YDLDM" is "400" and "DJDM" is "4002071", only a few records in the whole data set have this combination after the two attributes are associated, so such records are detected as anomalies. The 8th data anomaly is a multidimensional mixed anomaly, i.e. both the classification attributes and the continuous attributes in the record are abnormal. These actively identified anomalies are not covered by the existing auditing rules, but were confirmed as genuine anomalies on review by business experts, so the invention can effectively detect various anomaly types in the data.

Table 1 Electric power marketing business data set (partial)

Table 2 Electric power marketing business data set after field splicing (partial)

Table 3 Output form of electric power marketing business data anomaly results (partial)

Based on the method, the invention also provides an active auditing system for electric power marketing abnormal business data, which comprises the following modules: a data source management module, a data reading module, a feature processing module, a model training module, an abnormal output module and an abnormal output management module.

The data source management module is used for managing the databases and access information for the data required by the electric power marketing business data anomaly auditing system, the database tables of the marketing business data to be audited, and the like.

The data reading module is used for acquiring required data from the data source management module, performing preprocessing operations including missing value processing, deduplication, format conversion and the like, and uniformly storing the processed data as table data required by the system.

The feature processing module is used for acquiring table data from the data reading module and transforming and integrating it: the information gain rate is calculated for the classification attributes in the electric power marketing business data, the correlation among the classification attributes is evaluated and field splicing is completed, and the classification data are then converted into numerical data by a feature transformation based on classification attribute frequency, generating attribute feature data that can be directly input into model training.

The model training module is used for randomly extracting a certain number or proportion of records from the data output by the feature processing module, for training the enhanced isolated forest model.

The abnormal output module is used for outputting normal values and abnormal values in the electric power marketing business data, and comprises three sub-modules: a model prediction sub-module, a self-adaptive abnormal output sub-module and an abnormal result identification sub-module.

Model prediction sub-module: used for acquiring all data output by the feature processing module, inputting them into the model trained by the model training module for prediction, and outputting the anomaly scores of the electric power marketing business data.

Adaptive abnormal group discrimination sub-module: based on the anomaly score vector of the electric power marketing business data output by the model prediction sub-module, k-means is initialized for each k value in a specified range, and a unitary beta mixture model is constructed to adaptively determine normal values and abnormal values in the data.

Abnormal result identification sub-module: based on the plurality of unitary beta mixture models and anomaly results output by the adaptive abnormal group discrimination sub-module, the optimal mixture model is identified through the ICL-BIC criterion, and the anomaly result output by the optimal mixture model is used as the final anomaly result.
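The frequency-based feature transformation performed by the feature processing module can be illustrated with a short sketch. Whether the transformation uses raw counts or relative frequencies is not stated in this section, so the use of relative frequency is an assumption, and `frequency_encode` is a hypothetical name.

```python
from collections import Counter

def frequency_encode(column):
    """Replace each category by its relative frequency in the column."""
    counts = Counter(column)
    n = len(column)
    return [counts[v] / n for v in column]
```

Rare categories map to small numbers, so sparse attribute combinations become numerically distinct from common ones before they reach the isolated forest.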

To address the problems that, in auditing electric power marketing business data, rule setting by experts consumes a great deal of effort, the summarized rules do not provide full coverage, and manually defined thresholds for anomaly evaluation prevent abnormal data from being identified adaptively, the invention provides an unsupervised marketing data auditing method for structured data that is not driven by empirical rules and supports different data types at the same time. To address the problem that classification data in electric power marketing business data are interdependent, while existing unsupervised anomaly detection methods support only independent attributes and ignore the associations between businesses, the invention provides a correlation analysis and attribute processing method based on the information gain rate, which better meets the requirements of electric power marketing auditing. To address the problem that the classical isolated forest algorithm cannot process classification data, the invention provides an improved method that converts classification attributes into numerical data through frequency feature transformation, so that the algorithm is applicable to mixed data. To address the problem that conventional unsupervised anomaly detection methods provide only an evaluation of the degree of anomaly and cannot explicitly identify abnormal individuals, the invention provides an anomaly result identification strategy based on k-means and the unitary beta distribution, by which abnormal data can be identified directly.

The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within the scope of the invention.

Claims

1. An active auditing method for electric power marketing abnormal business data is characterized by comprising the following steps:

Step 6, constructing a unitary beta mixed model based on the abnormal scores of the electric power marketing business data to judge the self-adaptive abnormal group;

Step 7, identifying the optimal unitary beta mixture model through the ICL-BIC criterion, and taking the anomaly result it outputs as the final anomaly result.

2. The method for actively auditing the abnormal power marketing business data according to claim 1, wherein the step 1 calculates the correlation between the classified attributes by using the information gain ratio as an evaluation index for the given power marketing business data to be audited, specifically comprising:

Step 1-1, calculating information gain rates among the classification attributes in the given electric power marketing business data to be audited, wherein the calculation formula of the information gain rate is as follows:

wherein the entropy terms in the formula represent, respectively, the information entropy of one classification attribute and the information entropy of that attribute given another classification attribute; v_m represents the m-th attribute value of the classification attribute, and p(v_m) is the corresponding probability term for that attribute;

wherein the value range of the information gain rate is [0,1], representing the correlation between the two classification attributes; the larger the value, the stronger the correlation.

3. The method for actively auditing the abnormal business data of electric power marketing according to claim 2, wherein the step 3 is based on new mixed data, performs frequency characteristic transformation on classified attributes, converts the classified data into numerical data, and generates attribute characteristic data which can be directly input into model training.

4. The method for actively auditing abnormal business data for electric power marketing according to claim 3, wherein the step 4 is based on attribute feature data, and an enhanced isolated forest model is constructed based on an enhanced isolated forest algorithm, and the modeling process is as follows:

the segmentation strategy is specifically as follows:

wherein the vector appearing in the strategy consists of the selected target attributes; a split hyperplane is then obtained by calculation, and the sample set of the current node is divided based on the split hyperplane: if the value of the segmentation strategy for a sample's target attribute values is less than 0, i.e. the sample lies below the split hyperplane, the sample falls into the left subtree, and the remaining samples fall into the right subtree;

5. The method for actively auditing electric power marketing business data according to claim 4, wherein in the step 5, the anomaly score of the electric power marketing business data is calculated based on a trained enhanced isolated forest model, and specifically comprising the following sub-steps:

$$c(n) = 2H(n-1) - \frac{2(n-1)}{n}$$

6. The method for actively auditing the power marketing abnormal business data according to claim 5, wherein the step 6 is based on the abnormal score of the power marketing abnormal business data, and the step comprises the steps of:

7. The method for actively auditing power marketing exception business data according to claim 6, wherein the unitary beta hybrid model comprises:

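unitary beta mixture distribution:

$$f(v \mid P) = \sum_{m=1}^{M} \lambda_m \, \mathrm{Beta}(v;\, \alpha_m, \beta_m)$$

the probability density function of the unitary beta distribution is:

$$\mathrm{Beta}(v;\, \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)} \, v^{\alpha - 1} (1 - v)^{\beta - 1}$$

wherein Γ(·) represents the gamma function;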

8. The method for actively auditing abnormal power marketing business data according to claim 7, wherein in the step 7, the ICL-BIC criterion is adopted to search for the optimal model, and the calculation formula is as follows:

wherein $\hat{P}$ represents an estimate of the parameter set P;

9. An active auditing system for power marketing exception business data, comprising: a data source management module, a data reading module, a feature processing module, a model training module, an abnormal output module and an abnormal output management module;

10. The system of claim 9, wherein the anomaly output module comprises three sub-modules: a model prediction sub-module, a self-adaptive abnormal output sub-module and an abnormal result identification sub-module;

model prediction sub-module: used for acquiring all data output by the feature processing module, inputting them into the model trained by the model training module for prediction, and outputting the anomaly scores of the electric power marketing business data;