CN116502813A - Abnormal order detection method based on ensemble learning - Google Patents
- Publication number
- CN116502813A (application CN202211293899.1A)
- Authority
- CN
- China
- Prior art keywords
- abnormal
- data set
- order detection
- order
- method based
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
- G06Q10/063—Operations research, analysis or management
- G06Q10/0631—Resource planning, allocation, distributing or scheduling for enterprises or organisations
- G06Q10/06311—Scheduling, planning or task assignment for a person or group
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F30/00—Computer-aided design [CAD]
- G06F30/20—Design optimisation, verification or simulation
- G06F30/27—Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
- G06Q10/063—Operations research, analysis or management
- G06Q10/0635—Risk analysis of enterprise or organisation activities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
- G06Q10/063—Operations research, analysis or management
- G06Q10/0639—Performance analysis of employees; Performance analysis of enterprise or organisation operations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/06—Buying, selling or leasing transactions
- G06Q30/0601—Electronic shopping [e-shopping]
- G06Q30/0633—Lists, e.g. purchase orders, compilation or processing
- G06Q30/0635—Processing of requisition or of purchase orders
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2119/00—Details relating to the type or aim of the analysis or the optimisation
- G06F2119/02—Reliability analysis or reliability optimisation; Failure analysis, e.g. worst case scenario performance, failure mode and effects analysis [FMEA]
Abstract
The invention discloses an abnormal order detection method based on ensemble learning. First, the original order data set of the relevant e-commerce platform is collected. Missing values, duplicate values and outliers in the data set are then cleaned, and more interpretable and relevant features are extracted from the preprocessed order data set using methods such as feature extraction, feature aggregation and binning. Next, ensemble learning base classifiers based on XGBoost, CatBoost and GBDT are trained on the sample data set, an evaluation index system is defined, the base classifiers are tuned over multiple rounds with OpenBox, a black-box optimization system based on Bayesian optimization, and the tuned base classifiers are tested for abnormal order detection on a test set. Finally, a voting-based fusion model is built from the base classifiers obtained with the optimal parameters, which improves the generalization ability and accuracy of the model. The resulting fused ensemble learning model is used for detection or early warning of abnormal e-commerce orders and offers good interpretability and precision.
Description
Technical Field
The invention relates to the technical field of abnormal order detection, and in particular to an e-commerce abnormal order detection method based on the fusion of ensemble learning models.
Background
When selling goods, enterprises and platforms often encounter abnormal orders, such as scalper orders, malicious orders and merchant order-brushing. Scalper orders greatly reduce the appeal of a promotion to ordinary users, because the promotional benefits are captured by a small number of people instead of reaching the target members. Malicious orders are more harmful: competitors place a large number of seemingly normal orders during a promotion to lock up commodity stock, and then release the stock by cancelling or returning the orders after the promotion ends. Because the goods are never really sold, the promotion fails to achieve its purpose, and the company also wastes considerable manpower and material resources, which makes this a form of malicious competition that all companies strongly resent. Order-brushing is a common way for merchants to raise their ranking: the merchant typically arranges for internal or associated personnel to purchase large quantities of goods in order to inflate traffic and sales figures. Abnormal order detection aims to find the order records of unusual users, that is, to identify the abnormal state of an order, issue an early warning, and reduce the transaction risk of the e-commerce platform.
Disclosure of Invention
To address the transaction risk that abnormal orders bring to an e-commerce platform, the invention provides an e-commerce abnormal order detection method based on ensemble learning with multi-model fusion. The method can effectively identify the abnormal state of an order, send an abnormal order warning to the back end, and reduce the transaction risk of the e-commerce platform.
According to a first aspect of the present invention, there is provided an abnormal order detection method based on ensemble learning, comprising the following steps:
step one, collecting the original order data of the relevant e-commerce platform;
step two, cleaning missing values, duplicate values and outliers in the data set, extracting more interpretable and relevant features from the preprocessed order data set, and dividing the data set into a training set and a test set;
step three, training ensemble learning base classifiers based on XGBoost, CatBoost and GBDT on the sample data set, defining an evaluation index system, tuning the base classifiers over multiple rounds with OpenBox, a black-box optimization system based on Bayesian optimization, and testing the tuned base classifiers for abnormal order detection on the test set;
step four, building a voting-based fusion model from the base classifiers obtained with the optimal parameters, thereby improving the generalization ability and accuracy of the model.
Further, the features of the original data set in step one include at least: order ID, order time, first-level commodity category, commodity channel, commodity ID, brand, order amount, commodity sales quantity, order channel, payment method, ordering user ID, city, and a label value.
Further, the preprocessing of the abnormal order detection data set in step two comprises missing value processing, duplicate value processing and outlier processing. Outliers are detected and handled for two continuous features, the transaction amount and the transaction quantity, using the box plot method and the standard deviation method, which are computed as follows:
Box plot method: when a feature value of any sample falls outside [QL - 1.5 × IQR, QU + 1.5 × IQR], that feature value is considered an outlier, where QL is the lower quartile, QU is the upper quartile, and IQR = QU - QL.
Standard deviation method: the judgment is based on the mean and standard deviation; a value outside [mean - 2 × std, mean + 2 × std] is an outlier, where mean is the mean and std is the standard deviation.
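The two outlier rules can be sketched in a few lines of Python. This is a minimal illustration only: the function names, the sample amounts, the inclusive quartile convention and the use of the population standard deviation are assumptions, since the text does not specify them.

```python
import statistics

def boxplot_bounds(values):
    """Return (QL - 1.5*IQR, QU + 1.5*IQR) for a list of numbers."""
    # Assumption: quartiles use the "inclusive" interpolation convention.
    ql, _, qu = statistics.quantiles(values, n=4, method="inclusive")
    iqr = qu - ql
    return ql - 1.5 * iqr, qu + 1.5 * iqr

def std_bounds(values, k=2.0):
    """Return (mean - k*std, mean + k*std); k=2 matches the rule in the text."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # assumption: population standard deviation
    return mean - k * std, mean + k * std

def outliers(values, bounds):
    """List the values that fall outside the closed interval `bounds`."""
    lower, upper = bounds
    return [v for v in values if v < lower or v > upper]

# Hypothetical order amounts with one obviously extreme value.
amounts = [20, 22, 25, 27, 30, 31, 33, 35, 500]
box_outliers = outliers(amounts, boxplot_bounds(amounts))
std_outliers = outliers(amounts, std_bounds(amounts))
```

On this sample both rules flag only the value 500. How a flagged value is then handled (dropped, capped, or imputed) is not stated in the text and is left open here.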
Further, step two defines a key evaluation index for abnormal risk, the anomaly rate, which is the ratio of abnormal transactions to all transactions. The anomaly rate is computed for each category of each feature to determine the relationship between a specific feature or category and abnormal risk. The anomaly rate is used as an aggregation feature, that is, a new feature derived from statistics over a discrete variable among the original features: each category of the original feature maps to one value of the new feature, and that value carries business meaning for the category. On this basis, the following features are constructed: payment method anomaly rate, order hour anomaly rate, province anomaly rate, first-level commodity category anomaly rate, order channel anomaly rate, commodity channel anomaly rate and user anomaly rate.
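As a rough sketch of the anomaly-rate aggregation feature, the snippet below computes the share of abnormal orders per category and attaches it to each row. The field names (`pay_type`, `label`), the helper names and the toy rows are hypothetical; the label is assumed to be 1 for abnormal orders and 0 otherwise.

```python
from collections import defaultdict

def anomaly_rate_by_category(rows, feature, label="label"):
    """Map each category of `feature` to abnormal orders / all orders."""
    totals = defaultdict(int)
    abnormal = defaultdict(int)
    for row in rows:
        totals[row[feature]] += 1
        abnormal[row[feature]] += row[label]  # label is 0 or 1
    return {cat: abnormal[cat] / totals[cat] for cat in totals}

def add_rate_feature(rows, feature, label="label"):
    """Return copies of the rows with a '<feature>_anomaly_rate' column added."""
    rates = anomaly_rate_by_category(rows, feature, label)
    return [{**row, f"{feature}_anomaly_rate": rates[row[feature]]} for row in rows]

# Hypothetical orders: payment method plus abnormality label.
orders = [
    {"pay_type": "card", "label": 0},
    {"pay_type": "card", "label": 1},
    {"pay_type": "cod", "label": 1},
    {"pay_type": "cod", "label": 1},
    {"pay_type": "wallet", "label": 0},
]
rates = anomaly_rate_by_category(orders, "pay_type")
enriched = add_rate_feature(orders, "pay_type")
```

The same helper could be applied to the order hour, province, category, channel and user ID columns. Because the rate is derived from the label, in practice it would normally be computed on the training portion only and then mapped onto held-out rows; the text does not say how this is handled.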
Further, in step two, K-means is used to bin the order amount, and two new features are added: the bin mean and the bin anomaly rate. The commodity sales quantity is binarized against a threshold: a transaction whose sales quantity is above the threshold is a high-risk transaction and is marked 1; otherwise it is a low-risk transaction and is marked 0.
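The binning step might look like the following sketch, which runs a small one-dimensional K-means (Lloyd's algorithm) on the order amount and derives the bin mean, the bin anomaly rate and the thresholded quantity flag. The number of bins, the deterministic initialization, the threshold value and the sample data are all assumptions made for illustration; the text gives none of them.

```python
def kmeans_1d(values, k, iters=50):
    """Lloyd's algorithm on scalars; returns (centers, bin index per value)."""
    ordered = sorted(values)
    # Assumption: deterministic start, centers spread evenly over the sorted data.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    assign = [0] * len(values)
    for _ in range(iters):
        # Assign each value to its nearest center.
        assign = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        new_centers = []
        for c in range(k):
            members = [v for v, a in zip(values, assign) if a == c]
            new_centers.append(sum(members) / len(members) if members else centers[c])
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return centers, assign

# Hypothetical order amounts, abnormality labels and sales quantities.
amounts = [10, 12, 11, 100, 105, 98, 1000, 990]
labels = [0, 0, 0, 0, 1, 0, 1, 1]
quantities = [1, 2, 1, 3, 40, 2, 60, 55]
SALES_THRESHOLD = 10  # hypothetical threshold, not taken from the text

centers, bins = kmeans_1d(amounts, k=3)
bin_mean = [centers[b] for b in bins]  # new feature: mean amount of the bin
bin_rate_by_bin = {
    b: sum(y for y, a in zip(labels, bins) if a == b) / bins.count(b)
    for b in set(bins)
}
bin_anomaly_rate = [bin_rate_by_bin[b] for b in bins]  # new feature
high_risk = [1 if q > SALES_THRESHOLD else 0 for q in quantities]
```

A library implementation of K-means could replace `kmeans_1d`; the hand-written version is used here only to keep the sketch free of third-party dependencies.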
Further, in step two, 80% of the abnormal order detection data set obtained after feature engineering is assigned to the training set and 20% to the validation set, for training and checking the model.
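The 80/20 division can be illustrated with a short helper. Shuffling before the cut and the fixed seed are assumptions; the text states only the proportions, and says nothing about whether the split is random or stratified by label.

```python
import random

def train_validation_split(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of the rows and cut it into (training, validation) parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # assumption: random split, fixed seed
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical data set of 100 feature-engineered orders, represented by indices.
dataset = list(range(100))
train_set, validation_set = train_validation_split(dataset)
```

With heavily imbalanced labels, which is common for abnormal orders, a stratified split may be preferable so that both parts contain abnormal samples; that choice is outside what the text specifies.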
Further, in step three, the processed training sample data is fed into pre-built XGBoost, CatBoost and GBDT models for training, yielding three base classifiers.
Further, in step three, the hyperparameters of the base classifiers are determined with OpenBox, a black-box optimization system based on Bayesian optimization, which tunes the three base classifiers over multiple rounds using (1 - AUC) as the objective function. In this embodiment, the number of optimization iterations is set to max_runs=200, a random forest is used as the surrogate model (surrogate_type='prf'), and a maximum time budget of time_limit_per_trial=180 (unit: seconds) is set for each objective function evaluation; hyperparameter optimization is performed on the XGBoost, CatBoost and GBDT base classifiers;
wherein AUC and Accuracy are calculated as follows:
$\mathrm{AUC} = P(P_{\text{pos}} > P_{\text{neg}})$
$\mathrm{Accuracy} = (TP + TN) / (TP + FP + FN + TN)$
where $P_{\text{pos}}$ is the predicted probability that a positive sample belongs to class 1, and $P_{\text{neg}}$ is the predicted probability that a negative sample belongs to class 1.
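To make the two metrics concrete, the sketch below evaluates AUC directly from its pairwise definition and computes Accuracy from the confusion-matrix counts, then forms the (1 - AUC) value used as the tuning objective. The function names, the 0.5 decision threshold, the convention of counting ties as one half, and the sample scores are assumptions for illustration.

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Assumption: a tie between a positive and a negative counts as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def accuracy(labels, scores, threshold=0.5):
    """(TP + TN) / (TP + FP + FN + TN) after thresholding the scores."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    return (tp + tn) / (tp + fp + fn + tn)

# Hypothetical validation labels and predicted probabilities of class 1.
y_true = [1, 1, 0, 0, 1]
y_score = [0.9, 0.4, 0.3, 0.6, 0.8]

auc_value = auc(y_true, y_score)        # 5 of 6 pairs are ordered correctly
objective = 1.0 - auc_value             # quantity a minimizer would reduce
accuracy_value = accuracy(y_true, y_score)
```

The pairwise loop is quadratic in the number of samples, so it is suitable only for small illustrations; a rank-based computation gives the same value on large data sets.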
Further, the output probabilities of the XGBoost, GBDT and CatBoost base classifiers are summed according to their weights to build a fusion model based on soft voting; the probability weight ratio of the three base classifiers is 0.8, 0.2 and 4 respectively.
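Weighted soft voting over the three base classifiers can be sketched as below. The weights and probabilities are illustrative placeholders rather than the values of this embodiment, and dividing by the total weight (so that the fused score stays in [0, 1]) as well as the 0.5 decision threshold are assumptions.

```python
def soft_vote(prob_by_model, weights, threshold=0.5):
    """Weighted average of per-model class-1 probabilities, plus 0/1 decisions."""
    total = sum(weights[m] for m in prob_by_model)
    n = len(next(iter(prob_by_model.values())))
    fused = [
        sum(weights[m] * prob_by_model[m][i] for m in prob_by_model) / total
        for i in range(n)
    ]
    decisions = [1 if p >= threshold else 0 for p in fused]
    return fused, decisions

# Hypothetical class-1 probabilities from the three base classifiers for 3 orders.
probs = {
    "xgboost": [0.9, 0.2, 0.6],
    "gbdt": [0.7, 0.4, 0.3],
    "catboost": [0.8, 0.1, 0.5],
}
# Illustrative weights only; not the weights stated in the text.
weights = {"xgboost": 0.5, "gbdt": 0.2, "catboost": 0.3}

fused, decisions = soft_vote(probs, weights)
```

Because only the ratio of the weights matters after normalization, any positive weight triple can be passed in without rescaling it first.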
The technical scheme of the application has the following beneficial effects:
In the feature construction stage, the method analyzes the relationship between order features and labels and creates a series of new anomaly-rate-based features that are strongly correlated with the labels, which effectively improves model performance. Several classification algorithms are trained to build the base classifiers, and their hyperparameters are tuned with OpenBox, a black-box system based on Bayesian optimization, which strengthens predictive performance. The resulting base models form the bottom layer, on top of which a fusion model based on soft voting is built and fed with the processed data. Compared with a single model, fusing single models built on different algorithms can exceed the accuracy ceiling of any single model and improves generalization, thereby improving abnormal order detection accuracy.
Drawings
FIG. 1 is a flow chart of an embodiment.
Fig. 2 is a feature engineering flow chart.
FIG. 3 shows the results of the OpenBox hyperparameter optimization.
Claims (9)
1. An abnormal order detection method based on ensemble learning, characterized by comprising the following steps:
step one, collecting the original order data of the relevant e-commerce platform;
step two, cleaning missing values, duplicate values and outliers in the data set, extracting more interpretable and relevant features from the preprocessed order data set, and dividing the data set into a training set and a test set;
step three, training ensemble learning base classifiers based on XGBoost, CatBoost and GBDT on the sample data set, defining an evaluation index system, tuning the base classifiers over multiple rounds with OpenBox, a black-box optimization system based on Bayesian optimization, and testing the tuned base classifiers for abnormal order detection on the test set;
step four, building a voting-based fusion model from the base classifiers obtained with the optimal parameters, thereby improving the generalization ability and accuracy of the model.
2. The abnormal order detection method based on ensemble learning according to claim 1, wherein the features of the original data set in step one include at least: order ID, order time, first-level commodity category, commodity channel, commodity ID, brand, order amount, commodity sales quantity, order channel, payment method, ordering user ID, city, and a label value indicating abnormality.
3. The abnormal order detection method based on ensemble learning according to claim 1, wherein the preprocessing of the abnormal order detection data set in step two comprises missing value processing, duplicate value processing and outlier processing; outliers are detected and handled for two continuous features, the transaction amount and the transaction quantity, using the box plot method and the standard deviation method.
Box plot method: when a feature value of any sample falls outside [QL - 1.5 × IQR, QU + 1.5 × IQR], that feature value is considered an outlier, where QL is the lower quartile, QU is the upper quartile, and IQR = QU - QL.
Standard deviation method: the judgment is based on the mean and standard deviation; a value outside [mean - 2 × std, mean + 2 × std] is an outlier, where mean is the mean and std is the standard deviation.
4. The abnormal order detection method based on ensemble learning as set forth in claim 2, wherein step two further includes: defining a key evaluation index for abnormal risk, the anomaly rate, which is the ratio of abnormal transactions to all transactions, and computing the anomaly rate for each category of each feature to determine the relationship between a specific feature or category and abnormal risk. The anomaly rate is used as an aggregation feature, that is, a new feature derived from statistics over a discrete variable among the original features: each category of the original feature maps to one value of the new feature, and that value carries business meaning for the category. On this basis, the following features are constructed: payment method anomaly rate, order hour anomaly rate, province anomaly rate, first-level commodity category anomaly rate, order channel anomaly rate, commodity channel anomaly rate and user anomaly rate.
5. The abnormal order detection method based on ensemble learning as set forth in claim 2, wherein step two further includes: using K-means to bin the order amount and adding two new features, the bin mean and the bin anomaly rate; and binarizing the commodity sales quantity against a threshold, wherein a transaction whose sales quantity is above the threshold is a high-risk transaction and is marked 1, and otherwise it is a low-risk transaction and is marked 0.
6. The abnormal order detection method based on ensemble learning as set forth in claim 1, wherein step two further includes: assigning 80% of the processed abnormal order detection data set to the training set and 20% to the validation set, for training and checking the model.
7. The abnormal order detection method based on ensemble learning as set forth in claim 1, wherein step three includes: feeding the processed training sample data into pre-built XGBoost, CatBoost and GBDT models for training to obtain three base classifiers.
8. The abnormal order detection method based on ensemble learning of claim 7, wherein the hyperparameters of the base classifiers are determined with OpenBox, a black-box optimization system based on Bayesian optimization, which tunes the three base classifiers over multiple rounds using (1 - AUC) as the objective function.
The AUC is calculated as follows:
$\mathrm{AUC} = P(P_{\text{pos}} > P_{\text{neg}})$
where $P_{\text{pos}}$ is the predicted probability that a positive sample belongs to class 1, and $P_{\text{neg}}$ is the predicted probability that a negative sample belongs to class 1.
9. The abnormal order detection method based on ensemble learning as set forth in claim 1, wherein step four further includes: summing the output probabilities of the XGBoost, GBDT and CatBoost base classifiers according to their weights to build a fusion model based on soft voting, wherein the probability weight ratio of the three base classifiers is 0.8, 0.2 and 4 respectively.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211293899.1A CN116502813A (en) | 2022-10-21 | 2022-10-21 | Abnormal order detection method based on ensemble learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211293899.1A CN116502813A (en) | 2022-10-21 | 2022-10-21 | Abnormal order detection method based on ensemble learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116502813A (en) | 2023-07-28 |
Family
ID=87329105
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211293899.1A Pending CN116502813A (en) | 2022-10-21 | 2022-10-21 | Abnormal order detection method based on ensemble learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116502813A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116843346A (en) * | 2023-09-01 | 2023-10-03 | 北京三五通联科技发展有限公司 | Abnormal order monitoring and early warning method and system based on cloud platform |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116843346A (en) * | 2023-09-01 | 2023-10-03 | 北京三五通联科技发展有限公司 | Abnormal order monitoring and early warning method and system based on cloud platform |
CN116843346B (en) * | 2023-09-01 | 2023-11-17 | 北京三五通联科技发展有限公司 | Abnormal order monitoring and early warning method and system based on cloud platform |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||