CN116128544A - Active auditing method and system for electric power marketing abnormal business data - Google Patents

Info

Publication number
CN116128544A
Authority
CN
China
Prior art keywords
data
abnormal
power marketing
electric power
attribute
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211642952.4A
Other languages
Chinese (zh)
Inventor
于瑞强
喻魏贤
闫谷丰
李锐
赵轩臣
李晓宇
刘效强
邵江东
李慧霖
李万勇
杨玉传
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
YANTAI HAIYI SOFTWARE CO Ltd
Original Assignee
YANTAI HAIYI SOFTWARE CO Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by YANTAI HAIYI SOFTWARE CO Ltd filed Critical YANTAI HAIYI SOFTWARE CO Ltd
Priority to CN202211642952.4A priority Critical patent/CN116128544A/en
Publication of CN116128544A publication Critical patent/CN116128544A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201 Market modelling; Market analysis; Collecting market data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201 Market modelling; Market analysis; Collecting market data
    • G06Q30/0202 Market predictions or forecasting for commercial activities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/06 Electricity, gas or water supply

Abstract

The invention discloses an active auditing method and system for abnormal business data in electric power marketing. The method comprises the following steps: calculating the information gain rate between the classification attributes in the given data to be audited, and analyzing the business association between classification attributes; splicing attributes with significant correlation, based on the correlation analysis result, to form new mixed data; performing frequency characteristic transformation on the classification attributes of the new mixed data to generate attribute characteristic data that can be input directly into model training; constructing and training an enhanced isolated forest model on the attribute characteristic data; calculating anomaly scores of the electric power marketing business data with the trained enhanced isolated forest model; adaptively discriminating the abnormal group based on the anomaly scores to obtain an adaptive abnormal result; and combining the adaptive abnormal result with the electric power marketing business data to output the final abnormality judgment result.

Description

Active auditing method and system for electric power marketing abnormal business data
Technical Field
The invention belongs to the technical field of power systems, and particularly relates to an active auditing method and system for abnormal business data of power marketing.
Background
In the electric power marketing business, marketing auditing is an important means of improving internal control, finding marketing errors, and strengthening risk management of the marketing business. With the advancement of informatization in the electric power field, it has become necessary to carry out online, routine auditing of business data such as electricity-change business, business expansion and installation, and electricity price and electricity charge management by automated means. The informatization of electric power marketing has entered the fast lane and is currently at a critical stage of digital transformation.
In conventional anomaly auditing, anomalies are usually checked against rules manually defined by business experts, and data that do not conform to a rule are treated as abnormal. The existing technical solutions can be summarized as follows:
First category, manually defined rules: existing electric power marketing auditing mainly uses rule-based methods to audit business data. For example, for meter reading, accounting and collection anomalies: "abnormal electricity consumption type: the electricity consumption type and the electricity price code in the customer file information do not match"; for capacity anomalies in user profiles: "the power supply capacity in the customer file information is 0 or empty". Anomalies are checked against expert-defined rules of this kind, and data that do not conform to a rule are reported as abnormal.
Drawbacks of manually defined rules: auditing by manually defined rules has the following main drawbacks. The rules are defined manually by experts, which is a heavy workload and consumes a great deal of labor and time. Human effort is limited: faced with huge data volumes and complex anomaly types, business experts cannot formulate a complete rule set that covers every abnormal condition, so audit coverage is limited. Rule-based auditing can only detect anomalies covered by the rules. For example, under the user-profile capacity rule "the power supply capacity in the customer file information is 0 or empty", only a capacity of 0 or an empty value is flagged, whereas an obvious extreme value such as 99999 goes undetected, so the detection rate for abnormal data is low.
Second category, supervised machine learning methods: since the goal of electric power marketing auditing is to find problematic data, supervised learning methods such as decision trees, random forests, and logistic regression have been introduced into the auditing field to identify such data. Because the audited business spans different fields, such as business expansion, meter reading, accounting and collection, and customer service, each with different requirements, a different model must be built for each.
Drawbacks of supervised methods: for supervised methods such as logistic regression, decision trees, and random forests, the main drawbacks are as follows. A separate model must be built for each target, so the models generalize poorly and cannot keep up with the growing range of business categories that auditing must cover. Training a supervised model requires labeled data, but the business data generated by electric power marketing are unlabeled and very large in volume, so manual labeling would be a huge workload.
Third category, unsupervised machine learning methods: unsupervised learning is also applied in the auditing field, mainly in the form of anomaly detection methods such as cluster analysis, LOF (Local Outlier Factor), and the isolated forest algorithm.
Drawbacks of unsupervised methods: most existing unsupervised anomaly detection methods are suitable only for single-type data and cannot be used on mixed data such as electric power marketing business data. Methods such as LOF and kNN (K-Nearest Neighbor) are computationally expensive and cannot meet the high-performance requirements of processing large data volumes in auditing. Existing methods such as isolated forest iForest (Isolation Forest) and POD (Pattern-Based Outlier Detection) only output an anomaly score and do not determine whether each individual record is abnormal. In addition, the classical iForest algorithm can only process numerical data and cannot handle the varied types of business data.
In the electric power marketing business, marketing auditing is an important means of improving internal control, discovering marketing errors, and strengthening risk management of the marketing business. The business data of electric power marketing include customer file data from business expansion and installation management, and meter reading and charge data from electricity price and electricity charge management. These data are structured data stored in a relational database. When the data are entered, erroneous records can result from operator mistakes, unfamiliarity with the relevant business policies, or data acquisition and transmission faults. With the advancement of informatization in the electric power field, it is necessary to use informatization and automation to carry out online, routine auditing of business data such as electricity-change business, business expansion and installation, and electricity price and electricity charge management, so as to find problematic data and improve the compliance of the marketing business.
In summary, in addition to rule-driven auditing, current power marketing auditing services are urgently required to actively identify anomalies for which rules have not been explicitly defined.
Disclosure of Invention
In order to overcome the problems in the prior art, the invention provides an active auditing method and system for abnormal business data of electric power marketing.
The technical scheme for solving the technical problems is as follows:
an active auditing method for electric power marketing abnormal business data comprises the following steps:
step 1, calculating the correlation among classification attributes by using the information gain rate as an evaluation index for given electric power marketing business data to be audited;
step 2, based on the correlation analysis result among the classification attributes in the electric power marketing business data, splicing the classification attributes with obvious correlation to generate new classification attributes, deleting the original classification attributes and forming new mixed data;
step 3, based on the new mixed data, carrying out frequency characteristic transformation on the classification attribute, converting the classification data into numerical data, and generating attribute characteristic data which can be directly input into model training;
step 4, constructing and training an enhanced isolated forest model on the attribute characteristic data using the enhanced isolated forest algorithm;
step 5, calculating abnormal scores of the electric power marketing business data based on the trained enhanced isolated forest model;
step 6, constructing unitary beta mixed models (Unary Beta Mixed Model) based on the anomaly scores of the electric power marketing business data to discriminate the abnormal group adaptively;
step 7, identifying the optimal unitary beta mixed model by the ICL-BIC criterion (Integrated Completed Likelihood-Bayesian Information Criterion), and taking the abnormal result output by the optimal model as the final abnormal result.
Further, the step 1 calculates the correlation between the classification attributes by using the information gain rate as the evaluation index for the given electric power marketing business data to be audited, and specifically includes:
step 1-1, calculating the information gain rate between the classification attributes in the electric power marketing business data to be audited, wherein the information gain rate is calculated according to the following formulas:
$$\mathrm{GainRatio}(a_i^c \mid a_j^c) = \frac{\mathrm{Gain}(a_i^c \mid a_j^c)}{H(a_j^c)}$$
$$\mathrm{Gain}(a_i^c \mid a_j^c) = H(a_i^c) - H(a_i^c \mid a_j^c)$$
$$H(a_i^c) = -\sum_{m} p(v_m) \log p(v_m)$$
wherein $a_i^c$ and $a_j^c$ respectively represent two different classification attributes; GainRatio(·) and Gain(·) respectively represent the information gain rate and the information gain between classification attributes, and $\mathrm{GainRatio}(a_i^c \mid a_j^c)$ represents the information gain rate of attribute $a_i^c$ with respect to $a_j^c$; H(·) and H(·|·) respectively represent information entropy and conditional entropy, for example $H(a_i^c \mid a_j^c)$ is the information entropy of attribute $a_i^c$ given attribute $a_j^c$; $v_m$ represents the m-th attribute value of classification attribute $a_i^c$, and $p(v_m)$ represents the proportion of entries of attribute $a_i^c$ whose value equals $v_m$;
step 1-2, converting the information gain rates between classification attributes into a correlation between classification attributes, according to the following formula:
[Formula: Figure BDA00040084528500000414]
wherein the resulting correlation takes values in [0, 1] and represents the correlation between classification attributes $a_i^c$ and $a_j^c$; the larger the value, the stronger the correlation.
Further, in the step 4, an enhanced isolated forest model is constructed and trained on the attribute characteristic data using the enhanced isolated forest algorithm, and the modeling process is as follows:
step 4-1, setting the number or proportion of samples and the number of isolated trees required for constructing the model;
step 4-2, sampling: randomly extracting the set number or proportion of subsamples from the electric power marketing business data; setting a root node for each isolated tree, and taking the root node as the current node;
step 4-3, attribute segmentation: on the current node, randomly selecting a plurality of attributes as target attributes for segmentation, and constructing the left subtree and the right subtree of the segmentation node according to the segmentation result;
the segmentation strategy is specifically as follows:
$$(\vec{x} - \vec{p}) \cdot \vec{n}$$
wherein $\vec{x}$ represents the vector consisting of the target attributes; $\vec{p}$ represents the intercept vector, obtained by drawing a value for each target attribute from a uniform distribution between its minimum and maximum values; $\vec{n}$ represents the normal vector, which can be understood simply as the slope of the dividing plane, and is formed by randomly drawing a value for each target attribute from the standard normal distribution N(0, 1);
then the segmentation hyperplane is obtained from $(\vec{x} - \vec{p}) \cdot \vec{n} = 0$, and the sample set of the current node is segmented by this hyperplane: if the value of the segmentation strategy for the target attribute values of a sample is smaller than 0, namely the sample lies below the segmentation hyperplane, the sample falls into the left subtree, and the remaining samples fall into the right subtree;
step 4-4, constructing an isolated tree: repeating the step 4-3 in each child node, and stopping when a child node contains only one data point and cannot be split further, or when the child node reaches the set maximum tree depth, whereupon the construction of the isolated tree is complete;
step 4-5, constructing the enhanced isolated forest: according to the number or proportion of samples and the number of trees set in the step 4-1, repeating the steps 4-2, 4-3 and 4-4 until all isolated trees are constructed, forming the enhanced isolated forest.
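Steps 4-1 to 4-5 can be illustrated with a minimal sketch. It is only an illustration under stated assumptions, not the patented implementation: the names `build_tree`, `path_length` and `build_forest` are hypothetical, every attribute is used as a target attribute at each split (rather than a random subset), and leaves record only their size.

```python
import random

def build_tree(points, depth=0, max_depth=8, rng=random):
    """Recursively build one isolated tree using random hyperplane splits."""
    if len(points) <= 1 or depth >= max_depth:
        return {"size": len(points)}  # leaf: cannot or must not split further
    dims = len(points[0])
    # Normal vector n: one draw from N(0, 1) per attribute.
    normal = [rng.gauss(0.0, 1.0) for _ in range(dims)]
    # Intercept vector p: uniform draw between the min and max of each attribute.
    intercept = [rng.uniform(min(p[d] for p in points), max(p[d] for p in points))
                 for d in range(dims)]
    left, right = [], []
    for x in points:
        side = sum((x[d] - intercept[d]) * normal[d] for d in range(dims))
        (left if side < 0 else right).append(x)  # (x - p) . n < 0 goes left
    return {"normal": normal, "intercept": intercept,
            "left": build_tree(left, depth + 1, max_depth, rng),
            "right": build_tree(right, depth + 1, max_depth, rng)}

def path_length(tree, x, depth=0):
    """Number of edges traversed by x from the root to a leaf."""
    if "size" in tree:
        return depth
    side = sum((xi - pi) * ni
               for xi, pi, ni in zip(x, tree["intercept"], tree["normal"]))
    child = tree["left"] if side < 0 else tree["right"]
    return path_length(child, x, depth + 1)

def build_forest(data, n_trees=100, sample_size=256, max_depth=8, seed=0):
    """Steps 4-1 to 4-5: subsample, then build one tree per subsample."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = rng.sample(data, min(sample_size, len(data)))
        forest.append(build_tree(sample, 0, max_depth, rng))
    return forest
```

A point far from the bulk of the data tends to be separated by an early hyperplane, so its path from the root is short; this is the property that the anomaly score in step 5 relies on.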
Further, in the step 5, based on the trained enhanced isolated forest model, an anomaly score of the electric power marketing business data is calculated, which specifically includes the following sub-steps:
step 5-1, inputting all data to be audited into an enhanced isolated forest model, and obtaining T segmentation results when each data point passes through the enhanced isolated forest model formed by T trees;
Step 5-2, for each data point, calculating the path length h (x) of the point in each isolated tree, namely the number of edges through which the sample point passes from the root node to the leaf node of the tree;
step 5-3, calculating the average path length of each data point in the enhanced isolated forest (namely over all trees), wherein the calculation formula is as follows:
$$E(h(x)) = \frac{1}{T} \sum_{t=1}^{T} h_t(x)$$
wherein E(h(x)) represents the average path length, T represents the number of trees in the enhanced isolated forest, and $h_t(x)$ represents the path length of the data point on the t-th tree;
step 5-4, normalizing the average path lengths of all data points and calculating the anomaly score of each data point, wherein the calculation formula is as follows:
$$s(x, n) = 2^{-\frac{E(h(x))}{c(n)}}$$
wherein n represents the number of input samples used when constructing a tree, and c(n) is the global average path length used for normalization, given by:
$$c(n) = 2H(n-1) - \frac{2(n-1)}{n}$$
where $H(k) = \ln(k) + \epsilon$ and $\epsilon = 0.5772156649$ is the Euler constant.
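The normalization in steps 5-3 and 5-4 can be sketched as follows. The function names `c` and `anomaly_score` are hypothetical, and the score is assumed to take the standard isolation forest form 2^(-E(h(x))/c(n)), with c(n) exactly as given above.

```python
import math

EULER_CONSTANT = 0.5772156649  # epsilon in H(k) = ln(k) + epsilon

def c(n):
    """Normalizing constant c(n) = 2H(n-1) - 2(n-1)/n for a sample of n points."""
    if n < 2:
        raise ValueError("c(n) requires at least two samples")
    harmonic = math.log(n - 1) + EULER_CONSTANT
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(path_lengths, n):
    """Score of one data point from its path lengths h_t(x) over all T trees."""
    mean_path = sum(path_lengths) / len(path_lengths)  # E(h(x))
    return 2.0 ** (-mean_path / c(n))
```

Under this form, a data point whose average path length equals c(n) scores 0.5, and shorter average paths push the score towards 1, so larger scores indicate more anomalous data.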
Further, in the step 6, based on the anomaly scores of the electric power marketing business data, a unitary beta mixed model is constructed to perform adaptive abnormal group discrimination, which specifically comprises:
step 6-1, performing cluster analysis on the anomaly scores using the k-means method, dividing the anomaly scores of the electric power marketing business data to be audited into k components;
step 6-2, constructing a unitary beta mixed model, and outputting the unitary beta mixed model and the abnormal results.
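Step 6-1 amounts to one-dimensional k-means over the anomaly scores. The sketch below is a minimal illustration; the function name `kmeans_1d` and the deterministic quantile-based initialization are assumptions made for the example and are not taken from the patent.

```python
def kmeans_1d(scores, k, iters=100):
    """Cluster scalar anomaly scores into k groups; returns (labels, centers)."""
    ordered = sorted(scores)
    # Assumed initialization: evenly spaced quantiles of the sorted scores.
    centers = [ordered[int((i + 0.5) * len(ordered) / k)] for i in range(k)]
    labels = [0] * len(scores)
    for _ in range(iters):
        # Assignment step: each score joins its nearest center.
        labels = [min(range(k), key=lambda m: abs(s - centers[m])) for s in scores]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for m in range(k):
            members = [s for s, lab in zip(scores, labels) if lab == m]
            new_centers.append(sum(members) / len(members) if members else centers[m])
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers
```

The resulting labels can serve as the initial component memberships when the unitary beta mixed model of step 6-2 is fitted.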
Further, the unitary beta mixed model includes:
the unitary beta mixture distribution:
$$f(V_i \mid \lambda, \alpha, \beta) = \sum_{m=1}^{M} \lambda_m B_m(V_i \mid \alpha_m, \beta_m)$$
wherein $B_m(V_i \mid \alpha_m, \beta_m)$ is the unitary beta distribution of the m-th component, $V_i$ represents the i-th anomaly score, $\alpha = \{\alpha_1, \dots, \alpha_M\}$ and $\beta = \{\beta_1, \dots, \beta_M\}$ represent the parameters of the unitary beta distributions of the M components, and $\lambda = \{\lambda_1, \dots, \lambda_M\}$ represents the mixing coefficients of the M components, with
$$\sum_{m=1}^{M} \lambda_m = 1$$
the probability density function of the unitary beta distribution is:
$$B(v \mid \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, v^{\alpha - 1} (1 - v)^{\beta - 1}$$
wherein Γ(·) represents the gamma function;
the parameters of the unitary beta mixed model are estimated by maximum likelihood estimation, and the log-likelihood function is:
$$\log L(P \mid V, \eta) = \sum_{i=1}^{N} \sum_{m=1}^{M} \eta_{im} \log\left[\lambda_m B_m(V_i \mid \alpha_m, \beta_m)\right]$$
wherein log(·) represents the logarithm, V represents the anomaly score vector, and $P = \{\lambda_1, \dots, \lambda_M, \alpha_1, \dots, \alpha_M, \beta_1, \dots, \beta_M\}$ represents the set of parameters to be estimated; $\eta = \{\eta_1, \dots, \eta_N\}$ is associated with the N anomaly scores in V and indicates the component membership of each anomaly score $V_i$: $\eta_{im}$ indicates whether the i-th anomaly score $V_i$ belongs to the m-th component, with $\eta_{im} = 1$ if it does and $\eta_{im} = 0$ if it does not.
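The quantities of the unitary beta mixed model can be evaluated with the standard library alone. The sketch below assumes the usual beta density and a log-likelihood in which each score contributes the log of its own component's weighted density (that is, η_im = 1 only for the assigned component); the function names are hypothetical, and parameter estimation itself is not shown.

```python
import math

def beta_pdf(v, alpha, beta):
    """Beta density Gamma(a+b) / (Gamma(a) Gamma(b)) * v^(a-1) * (1-v)^(b-1), 0 < v < 1."""
    log_norm = math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta)
    return math.exp(log_norm + (alpha - 1) * math.log(v) + (beta - 1) * math.log(1 - v))

def mixture_pdf(v, lambdas, alphas, betas):
    """Mixture density: sum over components of lambda_m * B_m(v | alpha_m, beta_m)."""
    return sum(lam * beta_pdf(v, a, b) for lam, a, b in zip(lambdas, alphas, betas))

def complete_log_likelihood(scores, labels, lambdas, alphas, betas):
    """Sum of log(lambda_m * B_m(V_i)) where m = labels[i] is the assigned component."""
    total = 0.0
    for v, m in zip(scores, labels):
        total += math.log(lambdas[m]) + math.log(beta_pdf(v, alphas[m], betas[m]))
    return total
```

Working with `math.lgamma` rather than the gamma function directly avoids overflow when the shape parameters are large.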
Further, in the step 7, the ICL-BIC criterion is adopted to search for the optimal model, and the calculation formula is as follows:
$$\mathrm{ICL\text{-}BIC}(M) = -2 \log L(\hat{P} \mid V, \hat{\eta}) + Q_M \log(N)$$
wherein $\hat{P}$ represents the estimate of the parameter set P, $\hat{\eta}$ represents the estimate of the membership vector η, $Q_M$ represents the number of parameters of a model with M components, and N represents the amount of data.
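Model selection in step 7 can be sketched as below, assuming the commonly used form of the criterion, ICL-BIC = -2 log L + Q_M log N, under which the smallest value is best. The parameter count Q_M = 3M - 1 (M values of α, M values of β, and M - 1 free mixing coefficients) and the function names are assumptions made for the illustration.

```python
import math

def icl_bic(log_likelihood, n_components, n_points):
    """ICL-BIC for a fitted mixture with M components on N data points."""
    # Assumed parameter count Q_M: M alphas + M betas + (M - 1) free mixing weights.
    q_m = 3 * n_components - 1
    return -2.0 * log_likelihood + q_m * math.log(n_points)

def select_best(candidates, n_points):
    """candidates maps M to its log-likelihood; returns the M with the smallest ICL-BIC."""
    return min(candidates, key=lambda m: icl_bic(candidates[m], m, n_points))
```

The penalty term grows with the number of components, so a model with more components is preferred only when it improves the log-likelihood by more than the added penalty.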
The invention also discloses an active auditing system for electric power marketing abnormal business data, comprising: a data source management module, a data reading module, a feature processing module, a model training module, an abnormal output module, and an abnormal output management module;
the data source management module is used for managing the databases that hold the data required by the electric power marketing business data abnormal auditing system, their access information, and the database tables of the marketing business data to be audited;
the data reading module is used for acquiring required data from the data source management module, performing preprocessing operations including missing value processing, deduplication, format conversion and the like, and uniformly storing the processed data as table data required by the system;
the feature processing module is used for acquiring table data from the data reading module, carrying out certain transformation and integration on the table data, calculating information gain rate on classification attribute features in the electric power marketing business data features, evaluating correlation among the classification attributes and completing field splicing, and then carrying out feature transformation on the classification data based on the classification attribute frequency to convert the classification data into numerical value data so as to generate attribute feature data capable of being directly input into model training;
The model training module is used for randomly extracting a certain amount or proportion of data from the data output by the characteristic processing module to train the enhanced isolated forest algorithm model;
the abnormal output module is used for outputting normal values and abnormal values in the electric power marketing business data;
the abnormal output management module is used for managing final result display and output work by combining the abnormal result output by the abnormal result identification module with the electric power marketing business data output by the data source management module.
Further, the abnormal output module comprises three sub-modules: an algorithm prediction sub-module, an adaptive abnormal population discrimination sub-module, and an abnormal result identification sub-module;
the algorithm prediction sub-module is used for acquiring all data output by the feature processing module, inputting them into the model trained by the model training module for prediction, and outputting the anomaly scores of the electric power marketing business data;
the adaptive abnormal population discrimination sub-module is used for, based on the anomaly score vector of the electric power marketing business data output by the algorithm prediction sub-module, initializing with k-means for each k value in a specified range and constructing a unitary beta mixed model, so as to adaptively determine the normal values and abnormal values in the data;
the abnormal result identification sub-module is used for identifying the optimal mixed model by the ICL-BIC criterion among the plurality of unitary beta mixed models and abnormal results output by the adaptive abnormal population discrimination sub-module, and taking the abnormal result output by the optimal mixed model as the final abnormal result.
Compared with the prior art, the invention has the following technical effects:
(1) In auditing electric power marketing business data, experts must spend a great deal of effort formulating rules, the summarized rules do not cover all cases, and anomaly evaluation requires a manually defined threshold, so abnormal data cannot be identified adaptively. To address these problems, the invention provides an unsupervised auditing method for structured marketing data that is not driven by empirical rules and supports different data types simultaneously. It does not rely on prior rules formulated from the experience of business experts, and uses machine learning to screen for anomalies actively and heuristically;
(2) Aiming at the problem that the classification data in the electric power marketing business data have an interdependence relationship, and the existing unsupervised anomaly detection method only supports independent attributes and ignores the association between businesses, the invention provides a correlation analysis and attribute processing method based on the information gain rate, which can better meet the requirements of electric power marketing inspection businesses;
(3) Aiming at the problem that the classical isolated forest algorithm cannot process the classified data, the invention provides an improved method, which is used for carrying out frequency characteristic transformation on the classified attribute and converting the classified attribute into numerical data so that the algorithm can be suitable for the mixed data;
(4) Aiming at the problem that the conventional unsupervised anomaly detection method only provides anomaly degree evaluation and cannot clearly identify an anomaly individual, the invention provides an anomaly result identification strategy based on k-means and unitary beta distribution, and anomaly data can be directly identified.
Drawings
FIG. 1 is a flow chart of an active auditing method for power marketing abnormal business data according to the present invention;
FIG. 2 is a block diagram of an active auditing system for power marketing exception business data according to the present invention;
FIG. 3 is a three-dimensional segmented hyperplane schematic of the present invention.
Detailed Description
The principles and features of the present invention are described below with reference to the drawings, the examples are illustrated for the purpose of illustrating the invention and are not to be construed as limiting the scope of the invention.
The invention aims to solve the problem that current electric power marketing auditing relies too heavily on manually set rules to check business data and therefore cannot actively discover problematic data outside the defined rules, which leaves business risks.
In view of the above problems, the present invention provides an active auditing method for abnormal business data of electric power marketing, the method comprising the following steps:
step 1, calculating the correlation among classification attributes by using the information gain rate as an evaluation index for given electric power marketing business data to be audited;
The electric power marketing business data to be audited are acquired, preprocessing operations including missing value processing, deduplication, and format conversion are performed, and the processed data are stored uniformly as the required table data.
For business-specific reasons, dependency relationships exist among data values of different types in the electric power marketing business data, and these relationships reflect past business experience. If every attribute were regarded as independent, the effect of anomaly detection would be greatly reduced and the results would be inconsistent with business expectations.
The invention therefore first analyzes the potential business associations between classification attributes, namely calculates the correlation between classification attributes using the information gain rate as the evaluation index, which specifically comprises the following steps:
step 1-1, calculating information gain rates among various classification attributes in given electric power marketing business data to be audited;
Let $X = \{x_1, \dots, x_N\}$ be a set of N mixed-attribute data points, where $A^c$ and $A^n$ respectively represent the sets of classification attributes and continuous attributes.
For each classification attribute, the information gain rate between the classification attribute and other classification attributes is calculated, and the correlation between classification attributes is measured based on the information gain rate.
The information gain rate is calculated according to the following formulas:
$$\mathrm{GainRatio}(a_i^c \mid a_j^c) = \frac{\mathrm{Gain}(a_i^c \mid a_j^c)}{H(a_j^c)}$$
$$\mathrm{Gain}(a_i^c \mid a_j^c) = H(a_i^c) - H(a_i^c \mid a_j^c)$$
$$H(a_i^c) = -\sum_{m} p(v_m) \log p(v_m)$$
wherein $a_i^c$ and $a_j^c$ respectively represent two different classification attributes; GainRatio(·) and Gain(·) respectively represent the information gain rate and the information gain between classification attributes, and $\mathrm{GainRatio}(a_i^c \mid a_j^c)$ represents the information gain rate of attribute $a_i^c$ with respect to $a_j^c$; H(·) and H(·|·) respectively represent information entropy and conditional entropy, for example $H(a_i^c \mid a_j^c)$ is the information entropy of attribute $a_i^c$ given attribute $a_j^c$; $v_m$ represents the m-th attribute value of classification attribute $a_i^c$, and $p(v_m)$ represents the proportion of entries of attribute $a_i^c$ whose value equals $v_m$.
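The calculation of step 1-1 can be sketched in a few lines. The sketch assumes that the information gain is normalized by the entropy of the conditioning attribute, as in C4.5, and that logarithms are taken to base 2; both choices, as well as the function names and sample values, are assumptions made for the illustration.

```python
import math
from collections import Counter

def entropy(values):
    """Information entropy H(a) = -sum_m p(v_m) * log2 p(v_m) of one attribute."""
    n = len(values)
    return -sum((cnt / n) * math.log2(cnt / n) for cnt in Counter(values).values())

def conditional_entropy(target, given):
    """H(target | given): entropy of target within each group of given, size-weighted."""
    n = len(target)
    groups = {}
    for t, g in zip(target, given):
        groups.setdefault(g, []).append(t)
    return sum(len(ts) / n * entropy(ts) for ts in groups.values())

def gain(target, given):
    """Information gain: H(target) - H(target | given)."""
    return entropy(target) - conditional_entropy(target, given)

def gain_ratio(target, given):
    """Assumed normalization: Gain(target | given) / H(given)."""
    split_info = entropy(given)
    if split_info == 0.0:
        return 0.0  # a constant attribute carries no information
    return gain(target, given) / split_info

# Hypothetical attribute columns: a user class and a metering mode.
user_class = ["res", "res", "com", "com", "ind", "ind"]
metering = ["low", "low", "high", "high", "high", "high"]
forward = gain_ratio(user_class, metering)
backward = gain_ratio(metering, user_class)
```

In this example `forward` and `backward` differ, which illustrates why the gain rate in one direction cannot by itself serve as a symmetric measure of correlation.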
Step 1-2, converting the information gain rate among the classification attributes into correlation among the classification attributes;
Because the information gain rate is asymmetric, i.e. $\mathrm{GainRatio}(a_i^c \mid a_j^c)$ and $\mathrm{GainRatio}(a_j^c \mid a_i^c)$ ($i \neq j$) are not equal, it cannot be used directly to measure the correlation between classification attributes, and a further transformation is required, as shown in the following formula:
[Formula: Figure BDA0004008452850000113]
wherein the resulting correlation takes values in [0, 1] and represents the correlation between classification attributes $a_i^c$ and $a_j^c$; the larger the value, the stronger the correlation.
Step 2, based on the correlation analysis results among all classification attributes in the electric power marketing business data, splicing the classification attributes with significant correlation to generate new classification attributes, and deleting the original classification attributes to form new mixed data.
Classification attributes in electric power marketing business data may be significantly correlated: for example, there is a strong correspondence between the classification attributes "YHLBDM" (user category code) and "JLFSDM" (metering mode code), and between "DJDM" (electricity price code) and "YDLBDM" (electricity use category code). To be able to retrieve anomalies that span several dimensions of the mixed data, the invention splices the classification attributes with significant correlation, based on the correlations calculated in step 1, to generate new classification attributes, deletes the original classification attributes, and merges the result with the continuous attributes. Typically the correlation threshold may be set to 0.5: a correlation greater than 0.5 indicates significant correlation; otherwise the attributes are treated as uncorrelated or weakly correlated.
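A minimal sketch of the splicing in step 2 is given below. It assumes that the pairwise correlations from step 1 are already available as a dictionary, that records are plain Python dictionaries, and that an attribute takes part in at most one spliced pair; the helper names, the separator and all numeric values are hypothetical.

```python
def significant_pairs(corr, threshold=0.5):
    """Keep attribute pairs whose correlation exceeds the threshold.

    A pair is skipped if one of its attributes was already used, so that
    every original column is spliced at most once (a simplification).
    """
    used, pairs = set(), []
    for (a, b), value in sorted(corr.items(), key=lambda kv: -kv[1]):
        if value > threshold and a not in used and b not in used:
            pairs.append((a, b))
            used.update((a, b))
    return pairs

def splice(rows, pairs, sep="_"):
    """Join each correlated pair into one new column and drop the originals."""
    spliced = []
    for row in rows:
        new_row = dict(row)
        for a, b in pairs:
            joined = f"{new_row.pop(a)}{sep}{new_row.pop(b)}"
            new_row[f"{a}{sep}{b}"] = joined
        spliced.append(new_row)
    return spliced

# Hypothetical correlations and one record (values invented for illustration).
corr = {("YHLBDM", "JLFSDM"): 0.82, ("DJDM", "YDLBDM"): 0.77, ("YHLBDM", "DJDM"): 0.21}
rows = [{"YHLBDM": "100", "JLFSDM": "1", "DJDM": "4002071", "YDLBDM": "400", "JFRL": 5.0}]
pairs = significant_pairs(corr)
print(splice(rows, pairs))
```

The continuous attribute "JFRL" passes through unchanged, which corresponds to merging the spliced classification attributes with the continuous attributes.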
Step 3, based on the new mixed data, performing a frequency feature transformation on the classification attributes, converting the categorical data into numerical data and generating attribute feature data that can be input directly into model training.
The classical isolated forest algorithm is applicable only to numerical data. To make the enhanced isolated forest algorithm applicable to numerical data, categorical data, and mixed data containing both, a frequency feature transformation is applied to the categorical data to convert it into numerical data.
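One simple way to realise such a frequency feature transformation is sketched below. It assumes that each categorical value is replaced by its relative frequency in the column; the text does not state whether raw counts or relative frequencies are used, so this choice and the function name are assumptions.

```python
from collections import Counter

def frequency_encode(column):
    """Replace each categorical value by its relative frequency in the column."""
    counts = Counter(column)
    n = len(column)
    return [counts[value] / n for value in column]

# Hypothetical spliced attribute values.
column = ["100_1", "100_1", "200_2", "100_1"]
print(frequency_encode(column))
```

Under this encoding, rare category combinations receive small numerical values, so they tend to lie far from the bulk of the data.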
Step 4, based on the attribute feature data, constructing and training an enhanced isolated forest model using the enhanced isolated forest algorithm.
The invention calculates anomaly scores for the blended data based on an enhanced isolated forest algorithm. Because the classical isolated forest algorithm can only be applied to numerical data, the method expands the algorithm to support classification attributes.
Constructing and training an enhanced isolated forest model based on an enhanced isolated forest algorithm, wherein the modeling process is as follows:
Step 4-1, setting the number (or proportion) of samples and the number of isolated trees required for constructing the model;
Step 4-2, sampling: randomly extracting the set number or proportion of subsamples from the electric power marketing business data; for each isolated tree, a root node is set and taken as the current node.
Step 4-3, attribute segmentation: at the current node, several attributes are randomly selected as target attributes for segmentation, and the left subtree and the right subtree of the split node are constructed according to the segmentation result.
The segmentation strategy is specifically as follows:
$$(\vec{x} - \vec{p}) \cdot \vec{n}$$
where $\vec{x}$ denotes the vector formed by the selected target attributes; $\vec{p}$ denotes the intercept vector, obtained by drawing a value for each target attribute from a uniform distribution between its minimum and maximum values; and $\vec{n}$ denotes the normal vector, which can be understood simply as the slope of the splitting plane and is formed by randomly drawing a value for each target attribute from the standard normal distribution N(0, 1);
The splitting hyperplane is then obtained by computing $(\vec{x} - \vec{p}) \cdot \vec{n} = 0$ (taking the selection of 3 target attributes as an example, the splitting hyperplane is shown in fig. 3). The sample set of the current node is split by this hyperplane: if the value of the segmentation strategy for a sample's target attribute values is less than 0, i.e. the sample lies below the splitting hyperplane, the sample falls into the left subtree, and the remaining samples fall into the right subtree;
Step 4-4, constructing an isolated tree: step 4-3 is repeated in the child nodes until a child node contains only one data record (and cannot be split further) or reaches the set maximum depth of the tree; the construction of the isolated tree is then complete.
Step 4-5, constructing an enhanced isolated forest: according to the number (or proportion) of samples and the number of trees set in step 4-1, steps 4-2, 4-3 and 4-4 are repeated until all isolated trees have been constructed, forming the enhanced isolated forest.
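Steps 4-1 to 4-5 can be sketched as follows. This is a simplified illustration rather than the implementation of the invention: it splits on all attributes at every node instead of a random subset of target attributes, it assumes a depth limit of ceil(log2(sample size)) as in the classical isolation forest, and the data, seed and function names are hypothetical.

```python
import math
import random

def side(x, p, n):
    """Value of the split strategy (x - p) . n for one sample x."""
    return sum((xi - pi) * ni for xi, pi, ni in zip(x, p, n))

def build_tree(points, depth, max_depth, rng):
    """Recursively split numeric samples with random hyperplanes."""
    if len(points) <= 1 or depth >= max_depth:
        return {"size": len(points)}  # leaf: isolated sample or depth limit
    dims = range(len(points[0]))
    # Intercept p: one uniform draw between each attribute's min and max.
    p = [rng.uniform(min(x[d] for x in points), max(x[d] for x in points)) for d in dims]
    # Normal vector n: one N(0, 1) draw per attribute.
    n = [rng.gauss(0.0, 1.0) for _ in dims]
    left = [x for x in points if side(x, p, n) < 0]
    right = [x for x in points if side(x, p, n) >= 0]
    return {"p": p, "n": n,
            "left": build_tree(left, depth + 1, max_depth, rng),
            "right": build_tree(right, depth + 1, max_depth, rng)}

def build_forest(data, n_trees, sample_size, seed=0):
    """Steps 4-2 to 4-5: one random subsample and one tree per iteration."""
    rng = random.Random(seed)
    max_depth = math.ceil(math.log2(sample_size))  # assumed depth limit
    return [build_tree(rng.sample(data, sample_size), 0, max_depth, rng)
            for _ in range(n_trees)]

def path_length(tree, x):
    """Number of edges from the root to the leaf that x reaches."""
    edges = 0
    while "n" in tree:
        tree = tree["left"] if side(x, tree["p"], tree["n"]) < 0 else tree["right"]
        edges += 1
    return edges

# Hypothetical data: a tight cluster plus one far-away point.
data = [[0.01 * i, 0.01 * ((7 * i) % 10)] for i in range(32)] + [[50.0, 50.0]]
forest = build_forest(data, n_trees=50, sample_size=len(data))
mean_path = lambda x: sum(path_length(t, x) for t in forest) / len(forest)
print(mean_path(data[-1]), mean_path(data[5]))
```

The far-away point is usually separated after very few splits, so its mean path length is shorter than that of a point inside the cluster; this is the property the anomaly score relies on.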
Step 5, calculating the anomaly scores of the electric power marketing business data based on the trained enhanced isolated forest model. Step 5 specifically comprises the following sub-steps:
Step 5-1, inputting the data to be audited into the enhanced isolated forest model. Each data point, when passed through an enhanced isolated forest model consisting of T trees, yields T segmentation results.
Step 5-2, for each data point, calculating the path length h(x) of that point in each isolated tree, i.e. the number of edges that the sample point passes from the root node to the leaf node of the tree.
Step 5-3, calculating the average path length of each data point in the enhanced isolated forest (i.e. over all trees), wherein the calculation formula is as follows:
$$E(h(x)) = \frac{1}{T} \sum_{t=1}^{T} h_t(x)$$
where $E(h(x))$ denotes the average path length, $T$ denotes the number of trees in the enhanced isolated forest, and $h_t(x)$ denotes the path length of the data point on the t-th tree;
Step 5-4, normalizing the average path length of all data points and calculating the anomaly score of each data point, wherein the calculation formula is as follows:
$$s(x, n) = 2^{-\frac{E(h(x))}{c(n)}}$$
where $n$ denotes the number of input samples used when the tree is constructed, and $c(n)$ is the global average path length used for normalization, given by:
$$c(n) = 2H(n-1) - \frac{2(n-1)}{n}$$
where $H(k) = \ln(k) + \epsilon$ and $\epsilon = 0.5772156649$ is the Euler constant.
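As a worked illustration of steps 5-3 and 5-4, the sketch below computes c(n) and the anomaly score from a list of per-tree path lengths. It assumes the standard isolation forest score 2^(-E(h(x))/c(n)); the function names and the example path lengths are hypothetical.

```python
import math

EULER_CONSTANT = 0.5772156649

def harmonic(k):
    """H(k) = ln(k) + Euler constant, the approximation used in c(n)."""
    return math.log(k) + EULER_CONSTANT

def c(n):
    """Global average path length c(n) = 2 H(n - 1) - 2 (n - 1) / n."""
    if n < 2:
        raise ValueError("c(n) needs at least two samples")
    return 2.0 * harmonic(n - 1) - 2.0 * (n - 1) / n

def anomaly_score(path_lengths, n):
    """Score in (0, 1]; shorter average paths give scores closer to 1."""
    mean_path = sum(path_lengths) / len(path_lengths)
    return 2.0 ** (-mean_path / c(n))

# Hypothetical path lengths of one data point over three trees, n = 256.
print(c(256), anomaly_score([2, 3, 2], 256))
```

A point whose average path length equals c(n) receives a score of 0.5, and points that are isolated after only a few splits receive scores well above 0.5.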
Step 6, constructing a unitary beta mixed model (Unary Beta Mixed Model) based on the anomaly scores of the electric power marketing business data, to adaptively discriminate the abnormal group;
To identify anomalies adaptively, the unitary beta mixed model is used to automatically distinguish normal values from abnormal values in the data space. The main steps are as follows:
Step 6-1, performing cluster analysis on the anomaly scores using the k-means method;
Min-Max normalization is applied to the anomaly scores obtained in step 5, so that each score is a real number in [0, 1]. Cluster analysis is then performed on the normalized scores using the k-means method, dividing the anomaly scores of the data to be audited into k components. For each such partition, the following unitary beta modeling operation is carried out.
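Step 6-1 can be sketched with a small one-dimensional k-means, as below. The initialisation at evenly spaced quantiles, the iteration limit and the example scores are assumptions made for this illustration; any standard k-means implementation could be substituted.

```python
def min_max(scores):
    """Rescale scores linearly to the interval [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def kmeans_1d(values, k, iters=100):
    """Lloyd's algorithm on scalars; centres start at evenly spaced quantiles."""
    ordered = sorted(values)
    centres = [ordered[(2 * j + 1) * len(ordered) // (2 * k)] for j in range(k)]
    labels = []
    for _ in range(iters):
        # Assignment step: nearest centre for every value.
        labels = [min(range(k), key=lambda j: abs(v - centres[j])) for v in values]
        # Update step: mean of each cluster (empty clusters keep their centre).
        new_centres = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            new_centres.append(sum(members) / len(members) if members else centres[j])
        if new_centres == centres:
            break
        centres = new_centres
    return labels, centres

# Hypothetical anomaly scores: four similar scores and two high ones.
scores = min_max([0.40, 0.41, 0.42, 0.43, 0.90, 0.95])
labels, centres = kmeans_1d(scores, k=2)
print(labels, centres)
```

The resulting labels serve only as the initial partition; the component parameters are then refined by the unitary beta mixed model.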
Step 6-2, constructing a unitary beta mixed model to adaptively determine the normal values and abnormal values in the data;
To distinguish normal values from abnormal values adaptively, a unitary beta mixed model is constructed: the anomaly score vector is described by a mixture of unitary beta distributions, which gives a flexible model.
Unitary beta mixture distribution:
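$$p(V_i \mid \lambda, \alpha, \beta) = \sum_{m=1}^{M} \lambda_m B_m(V_i \mid \alpha_m, \beta_m)$$
where $B_m(V_i \mid \alpha_m, \beta_m)$ is the unitary beta distribution of the m-th component, $V_i$ denotes the i-th anomaly score, $\alpha = \{\alpha_1, \ldots, \alpha_M\}$ and $\beta = \{\beta_1, \ldots, \beta_M\}$ denote the parameters of the unitary beta distributions of the M components, and $\lambda = \{\lambda_1, \ldots, \lambda_M\}$ denotes the mixing coefficients of the M components, with
$$\sum_{m=1}^{M} \lambda_m = 1$$
The probability density function of the unitary beta distribution is:
$$B_m(V_i \mid \alpha_m, \beta_m) = \frac{\Gamma(\alpha_m + \beta_m)}{\Gamma(\alpha_m)\,\Gamma(\beta_m)} V_i^{\alpha_m - 1} (1 - V_i)^{\beta_m - 1}$$
where $\Gamma(\cdot)$ denotes the gamma function.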
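Maximum likelihood estimation is used to estimate the parameters of the unitary beta mixed model, and the log-likelihood function is:
$$\log L(V, \eta \mid P) = \sum_{i=1}^{N} \sum_{m=1}^{M} \eta_{im} \left[ \log \lambda_m + \log B_m(V_i \mid \alpha_m, \beta_m) \right]$$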
where log(·) denotes the logarithm, $V$ denotes the anomaly score vector, and the parameter set $P = \{\lambda_1, \ldots, \lambda_M, \alpha_1, \ldots, \alpha_M, \beta_1, \ldots, \beta_M\}$ is the set of parameters to be estimated; $\eta = \{\eta_1, \ldots, \eta_N\}$ is associated with the N anomaly scores in $V$ and indicates the component to which each anomaly score $V_i$ belongs, e.g. $\eta_{im}$ indicates whether the i-th anomaly score $V_i$ belongs to the m-th component: $\eta_{im} = 1$ if it does and $\eta_{im} = 0$ if it does not.
The parameter set $P$ is estimated using the EM algorithm, which iterates between an expectation step (Expectation) and a maximization step (Maximization) to produce a sequence of estimates $\{\hat{P}^{(I)}\}$, where $I$ denotes the current iteration step; the iteration stops when the log-likelihood value $\log L(V, \eta \mid P)$ has converged, i.e. its change no longer exceeds a threshold.
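The density and the expectation step of such a mixture can be sketched as follows. The sketch assumes the standard beta density and the usual posterior-responsibility form of the E-step; the maximization step, which has no closed form for beta parameters, is omitted. The clipping constant, function names and parameter values are hypothetical.

```python
import math

def beta_pdf(v, a, b, eps=1e-9):
    """Density of Beta(a, b) at v; v is clipped away from 0 and 1."""
    v = min(max(v, eps), 1.0 - eps)
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1.0) * math.log(v) + (b - 1.0) * math.log(1.0 - v))

def mixture_pdf(v, lambdas, alphas, betas):
    """Weighted sum of the component densities."""
    return sum(l * beta_pdf(v, a, b) for l, a, b in zip(lambdas, alphas, betas))

def e_step(v, lambdas, alphas, betas):
    """Posterior probability that score v belongs to each component."""
    weighted = [l * beta_pdf(v, a, b) for l, a, b in zip(lambdas, alphas, betas)]
    total = sum(weighted)
    return [w / total for w in weighted]

# Hypothetical two-component model: a large low-score group and a small
# high-score group.
lambdas, alphas, betas = [0.9, 0.1], [2.0, 8.0], [8.0, 2.0]
resp = e_step(0.9, lambdas, alphas, betas)
print(mixture_pdf(0.9, lambdas, alphas, betas), resp)
```

Clipping is needed because Min-Max normalised scores include exactly 0 and 1, where the logarithms in the density are undefined.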
Step 7, identifying the optimal unitary beta mixed model through the ICL_BIC criterion (Integrated Completed Likelihood-Bayesian Information Criterion), and taking the anomaly result it outputs as the final anomaly result.
Through step 6, a unitary beta mixed model is built separately for each k value within the specified range. To identify the optimal model, the ICL-BIC criterion is used to search for it, and the calculation formula is as follows:
$$\mathrm{ICL\text{-}BIC}(M) = -2 \log L(V, \hat{\eta} \mid \hat{P}) + Q_M \log N$$
where $\hat{P}$ denotes the estimate of the parameter set $P$, $\hat{\eta}$ denotes the estimate of the membership vector $\eta$, $Q_M$ denotes the number of parameters of the model with M components (i.e. the length of parameter set $P$), and $N$ denotes the amount of data.
The smaller the calculated ICL-BIC value, the better the model, i.e. the more effectively it describes the spatial distribution of the data.
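The model selection in step 7 can be sketched as below. The sketch assumes the commonly used form ICL-BIC = -2 log L + Q_M log N, with Q_M = 3M because the parameter set contains one mixing coefficient and two beta parameters per component; the log-likelihood values of the candidate models are invented for illustration.

```python
import math

def icl_bic(log_likelihood, n_components, n_data):
    """ICL-BIC = -2 * log-likelihood + Q_M * log(N), assuming Q_M = 3 * M."""
    n_params = 3 * n_components
    return -2.0 * log_likelihood + n_params * math.log(n_data)

def select_best(candidates, n_data):
    """Return the component count M with the smallest ICL-BIC.

    candidates maps M to the complete-data log-likelihood of the fitted model.
    """
    return min(candidates, key=lambda m: icl_bic(candidates[m], m, n_data))

# Hypothetical log-likelihoods for models with 2, 3 and 4 components.
candidates = {2: -520.0, 3: -505.0, 4: -503.0}
print(select_best(candidates, n_data=1000))
```

In this example the 4-component model fits slightly better than the 3-component model, but the penalty term for its extra parameters outweighs the gain, so the 3-component model is selected.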
In the selected optimal model, all data are divided into M components according to the unitary beta distributions. Owing to the characteristics of the anomaly scores produced by the enhanced isolated forest algorithm, the anomaly scores of normal data points are very similar (or equal), so normal data are assigned to the same component. Because a larger anomaly score indicates a more abnormal data point and a smaller score a more normal one, the components with higher average anomaly scores are merged to obtain the anomaly result outliers. The outliers label takes a value in {0, 1}, where 0 denotes normal and 1 denotes abnormal.
Method verification
In order to verify the feasibility and effectiveness of the active auditing method for abnormal data, experiments based on electric power marketing business data are carried out, and the most common business expansion archival data in marketing auditing are taken as an example:
Electric power marketing business data with a sample size of 1,000,000 records (100w) is extracted from the database; the specific information is shown in Table 1 below. There are 11 classification variables, namely { "YHLBDM", "JFFDM", "YDLDM", "DJDM", "BSJFBZ", "XSJSYSBDM", "XSJFBZ", "JBDFJSBSM", "BSFTFSDM", "LTDFJSFS" }, and 5 continuous variables, namely { "JFRL", "YDRL", "YGXSJSZ", "WGXSJSZ", "JBFFTZ" }.
First, the classification attributes in the electric power business data are analysed: the information gain ratios among the classification attributes are calculated according to step 1, and attribute splicing is completed. The calculation shows significant correlation within { "YHLBDM", "JLFSDM" }, { "DJDM", "YDLBDM" } and { "BSJFBZ", "BSFTFSDM" }. The significant correlation within { "DJDM", "YDLBDM" } is consistent with the rule example given above, a macro-level business rule under which a mismatch between the electricity use category (YDLBDM) in the customer file information and the electricity price code (DJDM) is treated as abnormal. The attributes with significant correlation are therefore spliced and encoded to form new classification attributes, and the original attributes are deleted. Splicing the related fields ensures that multidimensional mixed anomalies in the data can be retrieved in the later steps, and also reduces the dimensionality and improves the running efficiency of the algorithm. The power business data set after field splicing is shown in Table 2 below.
Table 3 shows the output form of the anomaly results of the auditing method according to the present invention. As shown in the table, the anomaly score and the adaptive anomaly result outliers are output alongside the electric power business data. The anomaly score is obtained by calculating the anomaly coefficient with the enhanced isolated forest algorithm and applying Min-Max normalization; outliers is the anomaly label (1 is abnormal, 0 is normal) output for the electric power marketing business data by the unitary beta mixed model.
Records 1 to 3 in Table 3 are extreme-value anomalies in the continuous attributes: for example, in the first record the continuous attributes "JFRL", "YDRL" and "JBDFFTZ" are all 80000, whereas their averages are only 2.20, 40.66 and 2.69, so this record is abnormal. Records 4 to 7 are multidimensional sparse anomalies in the classification attributes: for example, when "YDLDM" is "400" and "DJDM" is "4002071", only a few records in the whole data set have this combination once the two attributes are associated, so these records are detected as anomalies. Record 8 is a multidimensional mixed anomaly, i.e. both the classification attributes and the continuous attributes of the record are abnormal. These actively identified anomalies are not covered by the existing auditing rules, but were confirmed as genuine anomalies on review by business experts, so the invention can effectively detect multiple anomaly types in the data.
TABLE 1 electric marketing business data set (section)
Figure BDA0004008452850000171
Table 2 electric marketing business data sheet (part of) after field splicing
Figure BDA0004008452850000172
TABLE 3 output form (part) of Power marketing business data anomaly results
Figure BDA0004008452850000181
Based on the method, the invention also provides an active auditing system of the electric power marketing abnormal business data, which comprises the following modules: the system comprises a data source management module, a data reading module, a feature processing module, a model training module, an abnormal output module and an abnormal output management module.
The data source management module is used for managing a database and access information of data required by the power marketing business data abnormal auditing system, a database table of the marketing business data required to be audited and the like.
The data reading module is used for acquiring required data from the data source management module, performing preprocessing operations including missing value processing, deduplication, format conversion and the like, and uniformly storing the processed data as table data required by the system.
The feature processing module is used for acquiring table data from the data reading module, carrying out certain transformation and integration on the table data, calculating information gain rate on classification attribute features in the electric power marketing business data features, evaluating correlation among the classification attributes and completing field splicing, and then carrying out feature transformation on the classification data based on the classification attribute frequency to convert the classification data into numerical value data so as to generate attribute feature data capable of being directly input into model training.
The model training module is used for randomly extracting a certain amount or proportion of data from the data output by the characteristic processing module and used for training the enhanced isolated forest algorithm model.
The abnormal output module is used for outputting normal values and abnormal values in the electric power marketing business data, and comprises three sub-modules: the system comprises a model prediction sub-module, a self-adaptive abnormal output sub-module and an abnormal result identification sub-module.
Model prediction submodule: the method is used for acquiring all data output from the feature processing module based on the model trained by the model training module, inputting the model for prediction, and outputting abnormal scores of the electric power marketing business data.
An adaptive abnormal population discrimination sub-module: based on the anomaly score vector of the electric power marketing business data output by the model prediction sub-module, k-means is initialized for each k value in a specified range and a unitary beta mixed model is constructed to adaptively determine the normal values and abnormal values in the data.
An abnormal result identification sub-module: based on a plurality of unitary beta mixed models and abnormal results output by the self-adaptive abnormal group judging module, the optimal mixed model is identified through ICL_BIC criteria, and the abnormal results output by the optimal mixed model are used as the finally output abnormal results.
The abnormal output management module is used for managing final result display and output work by combining the abnormal result output by the abnormal result identification module with the electric power marketing business data output by the data source management module.
In auditing electric power marketing business data, rules set by experts consume a great deal of effort, the summarized rules do not cover all cases, and manually defined thresholds for anomaly evaluation cannot identify abnormal data adaptively. To address these problems, the invention provides an unsupervised marketing data auditing method for structured data that is not driven by empirical rules and supports different data types at the same time. The categorical data in electric power marketing business data are interdependent, whereas existing unsupervised anomaly detection methods support only independent attributes and ignore the associations between businesses; to address this, the invention provides a correlation analysis and attribute processing method based on the information gain ratio, which better meets the needs of electric power marketing audit business. The classical isolated forest algorithm cannot process categorical data; to address this, the invention provides an improved method that converts classification attributes into numerical data through frequency features, so that the algorithm is applicable to mixed data. Existing unsupervised anomaly detection methods only provide an evaluation of the degree of abnormality and cannot explicitly identify abnormal individuals; to address this, the invention provides an anomaly result identification strategy based on k-means and the unitary beta distribution, which identifies abnormal data directly.
The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within the scope of the invention.

Claims (10)

1. An active auditing method for electric power marketing abnormal business data is characterized by comprising the following steps:
step 1, calculating the correlation among classification attributes by using the information gain rate as an evaluation index for given electric power marketing business data to be audited;
step 2, based on the correlation analysis result among the classification attributes in the electric power marketing business data, splicing the classification attributes with obvious correlation to generate new classification attributes, deleting the original classification attributes and forming new mixed data;
step 3, based on the new mixed data, carrying out frequency characteristic transformation on the classification attribute, converting the classification data into numerical data, and generating attribute characteristic data which can be directly input into model training;
step 4, constructing and training an enhanced isolated forest model based on the enhanced isolated forest algorithm based on the attribute characteristic data;
step 5, calculating abnormal scores of the electric power marketing business data based on the trained enhanced isolated forest model;
Step 6, constructing a unitary beta mixed model based on the abnormal scores of the electric power marketing business data to judge the self-adaptive abnormal group;
step 7, identifying the optimal unitary beta mixed model through the ICL_BIC criterion, and taking the output abnormal result as the final output abnormal result.
2. The method for actively auditing the abnormal power marketing business data according to claim 1, wherein the step 1 calculates the correlation between the classified attributes by using the information gain ratio as an evaluation index for the given power marketing business data to be audited, specifically comprising:
step 1-1, calculating information gain rates among various classification attributes in given electric power marketing business data to be audited, wherein a calculation formula of the information gain rates is as follows:
Figure FDA0004008452840000011
Figure FDA0004008452840000012
Figure FDA0004008452840000021
wherein $A_i$ and $A_j$ denote two different classification attributes; GainRatio(·) and Gain(·) denote the information gain ratio and the information gain between classification attributes, e.g. $\mathrm{GainRatio}(A_i \mid A_j)$ is the information gain ratio of attribute $A_i$ with respect to attribute $A_j$; $H(\cdot)$ and $H(\cdot \mid \cdot)$ denote the information entropy and the conditional entropy, e.g. $H(A_i \mid A_j)$ is the information entropy of attribute $A_i$ given attribute $A_j$; $v_m$ denotes the m-th attribute value of classification attribute $A_i$, and $p(v_m)$ denotes the proportion of entries in attribute $A_i$ that are equal to $v_m$, relative to the overall attribute length;
step 1-2, converting the information gain rate among the classification attributes into correlation among the classification attributes, wherein the formula is as follows:
Figure FDA00040084528400000212
wherein the resulting correlation value lies in the range [0, 1] and represents the correlation between the two classification attributes $A_i$ and $A_j$; the larger the value, the stronger the correlation.
3. The method for actively auditing the abnormal business data of electric power marketing according to claim 2, wherein the step 3 is based on new mixed data, performs frequency characteristic transformation on classified attributes, converts the classified data into numerical data, and generates attribute characteristic data which can be directly input into model training.
4. The method for actively auditing abnormal business data for electric power marketing according to claim 3, wherein the step 4 is based on attribute feature data, and an enhanced isolated forest model is constructed based on an enhanced isolated forest algorithm, and the modeling process is as follows:
step 4-1, setting the number/proportion of samples and the number of isolated trees required for constructing a model;
step 4-2, sampling: randomly extracting a set number or proportion of subsamples from the electric power marketing business data; setting a root node for each partition tree, and taking the root node as a current node;
step 4-3, attribute segmentation: at the current node, randomly selecting several attributes as target attributes for segmentation, and constructing the left subtree and the right subtree of the split node according to the segmentation result;
the segmentation strategy is specifically as follows:
$$(\vec{x} - \vec{p}) \cdot \vec{n}$$
wherein $\vec{x}$ denotes the vector formed by the selected target attributes; $\vec{p}$ denotes the intercept vector, obtained by drawing a value for each target attribute from a uniform distribution between its minimum and maximum values; and $\vec{n}$ denotes the normal vector, which can be understood simply as the slope of the splitting plane and is formed by randomly drawing a value for each target attribute from the standard normal distribution N(0, 1);
the splitting hyperplane is then obtained by computing $(\vec{x} - \vec{p}) \cdot \vec{n} = 0$, and the sample set of the current node is split by this hyperplane: if the value of the segmentation strategy for a sample's target attribute values is less than 0, i.e. the sample lies below the splitting hyperplane, the sample falls into the left subtree, and the remaining samples fall into the right subtree;
step 4-4, constructing an isolated tree; repeating the step 4-3 in the child node until only one data in the child node can not be cut continuously or the child node reaches the set maximum depth of the tree, stopping the step, and completing the construction of the isolated tree;
step 4-5, constructing an enhanced isolated forest; and (3) according to the sampling number/proportion set in the step (4-1) and the number of trees to be constructed, the steps (4-2), the step (4-3) and the step (4-4) are cycled to finish the construction of all the isolated trees and form an enhanced isolated forest.
5. The method for actively auditing electric power marketing business data according to claim 4, wherein in the step 5, the anomaly score of the electric power marketing business data is calculated based on a trained enhanced isolated forest model, and specifically comprising the following sub-steps:
step 5-1, inputting all data to be audited into an enhanced isolated forest model, and obtaining T segmentation results when each data point passes through the enhanced isolated forest model formed by T trees;
step 5-2, for each data point, calculating the path length h (x) of the point in each isolated tree, namely the number of edges through which the sample point passes from the root node to the leaf node of the tree;
step 5-3, calculating the average path length of each data point in the enhanced isolated forest (i.e. over all trees), wherein the calculation formula is as follows:
$$E(h(x)) = \frac{1}{T} \sum_{t=1}^{T} h_t(x)$$
wherein $E(h(x))$ denotes the average path length, $T$ denotes the number of trees in the enhanced isolated forest, and $h_t(x)$ denotes the path length of the data point on the t-th tree;
step 5-4, normalizing the average path length of all data points and calculating the anomaly score of each data point, wherein the calculation formula is as follows:
$$s(x, n) = 2^{-\frac{E(h(x))}{c(n)}}$$
wherein $n$ denotes the number of input samples used when the tree is constructed, and $c(n)$ is the global average path length used for normalization, given by:
$$c(n) = 2H(n-1) - \frac{2(n-1)}{n}$$
wherein $H(k) = \ln(k) + \epsilon$ and $\epsilon = 0.5772156649$ is the Euler constant.
6. The method for actively auditing the power marketing abnormal business data according to claim 5, wherein in the step 6 a unitary beta mixed model is constructed based on the anomaly scores of the electric power marketing business data to discriminate the adaptive abnormal group, specifically comprising:
step 6-1, clustering analysis is carried out on the abnormal scoring result by using a k-means method, and abnormal scoring of the electric power marketing business data to be audited is divided into k components;
step 6-2, constructing a unitary beta mixed model, and outputting the unitary beta mixed model and the abnormal results.
7. The method for actively auditing power marketing exception business data according to claim 6, wherein the unitary beta hybrid model comprises:
unitary beta mix distribution:
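$$p(V_i \mid \lambda, \alpha, \beta) = \sum_{m=1}^{M} \lambda_m B_m(V_i \mid \alpha_m, \beta_m)$$
wherein $B_m(V_i \mid \alpha_m, \beta_m)$ is the unitary beta distribution of the m-th component, $V_i$ denotes the i-th anomaly score, $\alpha = \{\alpha_1, \ldots, \alpha_M\}$ and $\beta = \{\beta_1, \ldots, \beta_M\}$ denote the parameters of the unitary beta distributions of the M components, and $\lambda = \{\lambda_1, \ldots, \lambda_M\}$ denotes the mixing coefficients of the M components, with
$$\sum_{m=1}^{M} \lambda_m = 1$$
the probability density function of the unitary beta distribution is:
$$B_m(V_i \mid \alpha_m, \beta_m) = \frac{\Gamma(\alpha_m + \beta_m)}{\Gamma(\alpha_m)\,\Gamma(\beta_m)} V_i^{\alpha_m - 1} (1 - V_i)^{\beta_m - 1}$$
wherein $\Gamma(\cdot)$ denotes the gamma function;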
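maximum likelihood estimation is used to estimate the parameters of the unitary beta mixed model, and the log-likelihood function is:
$$\log L(V, \eta \mid P) = \sum_{i=1}^{N} \sum_{m=1}^{M} \eta_{im} \left[ \log \lambda_m + \log B_m(V_i \mid \alpha_m, \beta_m) \right]$$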
wherein log(·) denotes the logarithm, $V$ denotes the anomaly score vector, and the parameter set $P = \{\lambda_1, \ldots, \lambda_M, \alpha_1, \ldots, \alpha_M, \beta_1, \ldots, \beta_M\}$ is the set of parameters to be estimated; $\eta = \{\eta_1, \ldots, \eta_N\}$ is associated with the N anomaly scores in $V$ and indicates the component to which each anomaly score $V_i$ belongs, e.g. $\eta_{im}$ indicates whether the i-th anomaly score $V_i$ belongs to the m-th component: $\eta_{im} = 1$ if it does and $\eta_{im} = 0$ if it does not.
8. The method for actively auditing abnormal power marketing business data according to claim 7, wherein in the step 7, an ICL-BIC rule is adopted to search an optimal model, and the calculation formula is as follows:
$$\mathrm{ICL\text{-}BIC}(M) = -2 \log L(V, \hat{\eta} \mid \hat{P}) + Q_M \log N$$
wherein $\hat{P}$ denotes the estimate of the parameter set $P$, $\hat{\eta}$ denotes the estimate of the membership vector $\eta$, $Q_M$ denotes the number of parameters of the model with M components, and $N$ denotes the amount of data.
9. An active auditing system for abnormal electric power marketing business data, comprising: a data source management module, a data reading module, a feature processing module, a model training module, an abnormal output module and an abnormal output management module;
the data source management module is used for managing the databases holding the data required by the system, their access information, and the database tables of the marketing business data to be audited;
the data reading module is used for acquiring the required data from the data source management module, performing preprocessing operations including missing-value handling, deduplication and format conversion, and storing the processed data uniformly as the table data required by the system;
the feature processing module is used for acquiring the table data from the data reading module and transforming and integrating it: calculating the information gain ratio of the categorical attribute features in the electric power marketing business data, evaluating the correlation among the categorical attributes and completing field splicing, and then converting the categorical data into numerical data by a feature transformation based on categorical attribute frequency, so as to generate attribute feature data that can be input directly into model training;
the model training module is used for randomly extracting a certain amount or proportion of the data output by the feature processing module to train the enhanced isolation forest algorithm model;
the abnormal output module is used for outputting the normal values and abnormal values in the electric power marketing business data;
the abnormal output management module is used for managing the display and output of the final results by combining the abnormal results output by the abnormal result identification module with the electric power marketing business data output by the data source management module.
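The feature processing described in claim 9 can be illustrated with a minimal Python sketch. It assumes that the information gain ratio between two categorical attributes is the information gain divided by the entropy of the splitting attribute, and that the frequency-based transformation replaces each category with its relative frequency; the claim does not fix either detail, and the function names are hypothetical.

```python
from collections import Counter
import math


def entropy(values):
    """Shannon entropy (in bits) of a list of categorical values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def gain_ratio(attr, target):
    """Information gain of `target` given `attr`, divided by the entropy of `attr`."""
    n = len(attr)
    groups = {}
    for a, t in zip(attr, target):
        groups.setdefault(a, []).append(t)
    # Conditional entropy of target within each value of attr, weighted by group size.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    split_info = entropy(attr)
    if split_info == 0.0:
        return 0.0  # attr takes a single value and carries no information
    return (entropy(target) - conditional) / split_info


def frequency_encode(values):
    """Replace each category with its relative frequency in the column (assumed encoding)."""
    n = len(values)
    counts = Counter(values)
    return [counts[v] / n for v in values]
```

A gain ratio near 1 between two attributes suggests they are strongly related and are candidates for field splicing, while a value near 0 suggests they are unrelated; the threshold for splicing is left open here because the claim does not state one.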
10. The system of claim 9, wherein the abnormal output module comprises three sub-modules: a model prediction sub-module, an adaptive abnormal output sub-module and an abnormal result identification sub-module;
the model prediction sub-module is used for acquiring all the data output by the feature processing module, inputting it into the model trained by the model training module for prediction, and outputting the anomaly scores of the electric power marketing business data;
the adaptive abnormal population discrimination sub-module is used for, based on the anomaly score vector of the electric power marketing business data output by the algorithm prediction module, initializing with k-means for each k value in a specified range and constructing a univariate beta mixture model, so as to adaptively determine the normal values and abnormal values in the data;
the abnormal result identification sub-module is used for identifying the optimal mixture model by the ICL-BIC criterion from the plurality of univariate beta mixture models and abnormal results output by the adaptive abnormal population discrimination sub-module, and taking the abnormal result output by the optimal mixture model as the final abnormal result.
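The k-means initialization named in claim 10 can be sketched in Python as follows. The claim only states that k-means is run for each k in a specified range before a beta mixture model is built; starting the centres at evenly spaced order statistics and deriving initial beta shape parameters by moment matching are assumptions made for illustration, and the subsequent mixture fitting is omitted. All function names are hypothetical.

```python
def kmeans_1d(scores, k, iters=50):
    """Lloyd's algorithm on 1-D anomaly scores.
    Centres start at evenly spaced order statistics (an assumed, deterministic choice)."""
    ordered = sorted(scores)
    n = len(ordered)
    centres = [ordered[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    labels = []
    for _ in range(iters):
        # Assign each score to its nearest centre.
        labels = [min(range(k), key=lambda m: abs(x - centres[m])) for x in scores]
        new_centres = []
        for m in range(k):
            members = [x for x, lab in zip(scores, labels) if lab == m]
            # Keep the old centre if a cluster ends up empty.
            new_centres.append(sum(members) / len(members) if members else centres[m])
        if new_centres == centres:
            break
        centres = new_centres
    return labels, centres


def beta_moments(members):
    """Moment-matched (alpha, beta) for one cluster of scores in (0, 1)."""
    n = len(members)
    mean = sum(members) / n
    var = sum((x - mean) ** 2 for x in members) / n
    if var == 0.0:
        raise ValueError("cluster has zero variance; cannot match moments")
    common = mean * (1.0 - mean) / var - 1.0
    return mean * common, (1.0 - mean) * common
```

For each k in the chosen range, the cluster sizes divided by N would give starting mixing weights and `beta_moments` would give starting shape parameters for that candidate mixture.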
CN202211642952.4A 2022-12-20 2022-12-20 Active auditing method and system for electric power marketing abnormal business data Pending CN116128544A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211642952.4A CN116128544A (en) 2022-12-20 2022-12-20 Active auditing method and system for electric power marketing abnormal business data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211642952.4A CN116128544A (en) 2022-12-20 2022-12-20 Active auditing method and system for electric power marketing abnormal business data

Publications (1)

Publication Number Publication Date
CN116128544A true CN116128544A (en) 2023-05-16

Family

ID=86302009

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211642952.4A Pending CN116128544A (en) 2022-12-20 2022-12-20 Active auditing method and system for electric power marketing abnormal business data

Country Status (1)

Country Link
CN (1) CN116128544A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117151768A (en) * 2023-10-30 2023-12-01 国网浙江省电力有限公司营销服务中心 Construction method and system of wind control rule base of generated marketing event

Similar Documents

Publication Publication Date Title
CN107577785B (en) Hierarchical multi-label classification method suitable for legal identification
Benabdellah et al. A survey of clustering algorithms for an industrial context
CN111324642A (en) Model algorithm type selection and evaluation method for power grid big data analysis
CN110717654B (en) Product quality evaluation method and system based on user comments
CN110335168B (en) Method and system for optimizing power utilization information acquisition terminal fault prediction model based on GRU
CN111967721A (en) Comprehensive energy system greening level evaluation method and system
CN113240201B (en) Method for predicting ship host power based on GMM-DNN hybrid model
CN114065605A (en) Intelligent electric energy meter running state detection and evaluation system and method
Jeong et al. A systemic approach to exploring an essential patent linking standard and patent maps: Application of generative topographic mapping (GTM)
CN112800232A (en) Big data based case automatic classification and optimization method and training set correction method
EP3901784A1 (en) Patent evaluation method and system
CN116128544A (en) Active auditing method and system for electric power marketing abnormal business data
Li et al. An improved genetic-XGBoost classifier for customer consumption behavior prediction
CN115392710A (en) Wind turbine generator operation decision method and system based on data filtering
CN112215420B (en) Customer passing identification method and system for resident electricity consumption
CN115081515A (en) Energy efficiency evaluation model construction method and device, terminal and storage medium
Jeong et al. Approximate life cycle assessment using case-based reasoning for the eco design of products
CN112258235A (en) Method and system for discovering new service of electric power marketing audit
CN112506930A (en) Data insight platform based on machine learning technology
CN111353523A (en) Method for classifying railway customers
CN117093935B (en) Classification method and system for service system
CN117216490B (en) Intelligent big data acquisition system
CN117131449A (en) Data management-oriented anomaly identification method and system with propagation learning capability
CN117745471A (en) Transmission and distribution network project management and control method and device based on big data association algorithm
Murata et al. Feature analysis applying clustering and optimisation methods to Mahalanobis-Taguchi method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination