CN116502813A - Abnormal order detection method based on ensemble learning - Google Patents

Info

Publication number
CN116502813A
Authority
CN
China
Prior art keywords
abnormal
data set
order detection
order
method based
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211293899.1A
Other languages
Chinese (zh)
Inventor
谭家荣
冯翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China University of Science and Technology
Original Assignee
East China University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China University of Science and Technology
Priority to CN202211293899.1A
Publication of CN116502813A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0631 Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • G06Q10/06311 Scheduling, planning or task assignment for a person or group
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00 Computer-aided design [CAD]
    • G06F30/20 Design optimisation, verification or simulation
    • G06F30/27 Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0635 Risk analysis of enterprise or organisation activities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0639 Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/06 Buying, selling or leasing transactions
    • G06Q30/0601 Electronic shopping [e-shopping]
    • G06Q30/0633 Lists, e.g. purchase orders, compilation or processing
    • G06Q30/0635 Processing of requisition or of purchase orders
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2119/00 Details relating to the type or aim of the analysis or the optimisation
    • G06F2119/02 Reliability analysis or reliability optimisation; Failure analysis, e.g. worst case scenario performance, failure mode and effects analysis [FMEA]

Landscapes

  • Business, Economics & Management (AREA)
  • Engineering & Computer Science (AREA)
  • Human Resources & Organizations (AREA)
  • Strategic Management (AREA)
  • Economics (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Development Economics (AREA)
  • General Physics & Mathematics (AREA)
  • Marketing (AREA)
  • Educational Administration (AREA)
  • General Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Tourism & Hospitality (AREA)
  • Quality & Reliability (AREA)
  • Operations Research (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Geometry (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)
  • Testing And Monitoring For Control Systems (AREA)

Abstract

The invention discloses an abnormal order detection method based on ensemble learning. First, an original order data set is collected from the corresponding e-commerce platform. Missing values, duplicate values and outliers in the data set are then cleaned, and more explanatory and relevant features are extracted from the preprocessed order data set using methods including feature extraction, feature aggregation and binning. Next, ensemble learning base classifiers based on XGBoost, CatBoost and GBDT are trained on the sample data set, an evaluation index system is defined, the base classifiers are tuned over multiple rounds with OpenBox, a black-box optimization system based on Bayesian optimization, and the tuned base classifiers are tested for abnormal order detection on a test set. Finally, a voting-based fusion model is built from the base classifiers obtained with the optimal parameters, improving the generalization ability and accuracy of the model. The resulting fused ensemble learning model is used to detect or give early warning of abnormal e-commerce orders, and offers good interpretability and precision.

Description

Abnormal order detection method based on ensemble learning
Technical Field
The invention relates to the technical field of abnormal order detection, and in particular to an e-commerce abnormal order detection method based on ensemble learning model fusion.
Background
In the course of selling goods, enterprises and platforms often encounter abnormal orders, such as scalper orders, malicious orders and merchant brushing orders. Scalper orders greatly reduce the appeal of a promotion to ordinary users, because the promotional benefits are captured by a small group of people instead of reaching the intended members. Malicious orders are more harmful: competitors place a large number of seemingly normal orders during a promotion to lock up a large amount of inventory, then release it through cancellations, returns and similar means once the promotion ends. Because the goods are never actually sold, the promotion fails to achieve its purpose, and the company also wastes considerable manpower and material resources; it is a form of malicious competition that all companies strongly resent. Merchant order brushing is a common way to raise a merchant's ranking, typically carried out by having internal or associated personnel purchase large quantities of goods to inflate the merchant's traffic and sales. Abnormal order detection aims to find the order records of unusual users, that is, to identify the abnormal state of an order, issue an abnormality warning, and reduce the transaction risk of the e-commerce platform.
Disclosure of Invention
To address the transaction risk that abnormal orders bring to an e-commerce platform, the invention provides an e-commerce abnormal order detection method based on ensemble learning with multi-model fusion. The method can effectively identify the abnormal state of an order, send an abnormal order warning to the back end, and thereby reduce the transaction risk of the e-commerce platform.
According to a first aspect of the present invention, there is provided an abnormal order detection method based on ensemble learning, including the following steps:
step one, collecting original order data from the corresponding e-commerce platform;
step two, cleaning missing values, duplicate values and outliers in the data set, extracting more explanatory and relevant features from the preprocessed order data set, and dividing the data set into a training set and a test set;
step three, training ensemble learning base classifiers based on XGBoost, CatBoost and GBDT on the sample data set, defining an evaluation index system, tuning the base classifiers over multiple rounds with OpenBox, a black-box optimization system based on Bayesian optimization, and testing the tuned base classifiers for abnormal order detection on the test set;
step four, building a voting-based fusion model from the base classifiers obtained with the optimal parameters, so that the generalization ability and accuracy of the model are improved.
Further, the features of the original data set in step one include at least an order ID, order time, first-level commodity category, commodity channel, commodity ID, brand, order amount, commodity sales quantity, order channel, payment mode, ordering user ID, city, and label value.
Further, the preprocessing of the abnormal order detection data set in step two comprises missing value processing, duplicate value processing and outlier processing. Outliers are detected and handled for the two continuous features, transaction amount and transaction quantity, using the box plot method and the standard deviation method, which are calculated as follows:
Box plot method: when a feature value of any sample falls outside [QL - 1.5×IQR, QU + 1.5×IQR], that feature value is considered an outlier, where QL is the lower quartile, QU is the upper quartile, and IQR = QU - QL.
Standard deviation method: the judgment is made from the mean and the standard deviation. A value outside [mean - 2×std, mean + 2×std] is an outlier, where mean is the mean and std is the standard deviation.
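The two outlier rules above can be illustrated with a minimal sketch using only the Python standard library. The function names, the sample amounts, the quartile interpolation method and the use of the population standard deviation are all assumptions made for illustration; the source does not specify them.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Box plot rule: return the interval [QL - k*IQR, QU + k*IQR]."""
    # method="inclusive" is an assumption; the source does not say how
    # the quartiles are interpolated.
    ql, _, qu = statistics.quantiles(values, n=4, method="inclusive")
    iqr = qu - ql
    return ql - k * iqr, qu + k * iqr


def std_bounds(values, k=2.0):
    """Standard deviation rule: return [mean - k*std, mean + k*std]."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std, an assumption
    return mean - k * std, mean + k * std


def flag_outliers(values, bounds):
    """Mark each value that lies outside the closed interval `bounds`."""
    lower, upper = bounds
    return [v < lower or v > upper for v in values]


# Hypothetical order amounts with one extreme value.
amounts = [10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 10.0, 500.0]
iqr_flags = flag_outliers(amounts, iqr_bounds(amounts))
std_flags = flag_outliers(amounts, std_bounds(amounts))
```

On this sample both rules flag only the last amount. What to do with a flagged value (drop, cap, or impute) is not stated in the source and is left out of the sketch.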
Further, step two defines a key evaluation index for abnormality risk, the anomaly rate, which is the ratio of abnormal transactions to all transactions. The anomaly rate is calculated for each category of each feature to determine the relationship between a specific feature or category and abnormality risk. The anomaly rate is used as an aggregated feature: a new feature extracted from statistics over a discrete variable among the original features, in which each category corresponds to a value of the original feature and the value carries business meaning for that category. On this basis, the payment mode anomaly rate, order hour anomaly rate, province anomaly rate, first-level commodity category anomaly rate, order channel anomaly rate, commodity channel anomaly rate and user anomaly rate are constructed.
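As a rough illustration of the anomaly rate as an aggregated feature, the sketch below computes the rate per category and attaches it to each order. The field names (`pay_type`, `label`), the helper names and the sample rows are hypothetical and do not come from the source.

```python
from collections import defaultdict


def anomaly_rate_by_category(rows, feature, label="label"):
    """Return {category: abnormal orders / all orders} for one discrete feature."""
    counts = defaultdict(lambda: [0, 0])  # category -> [abnormal, total]
    for row in rows:
        counts[row[feature]][0] += int(row[label] == 1)
        counts[row[feature]][1] += 1
    return {cat: abnormal / total for cat, (abnormal, total) in counts.items()}


def add_rate_feature(rows, feature, label="label"):
    """Attach the category's anomaly rate to every row as a new feature."""
    rates = anomaly_rate_by_category(rows, feature, label)
    return [{**row, f"{feature}_anomaly_rate": rates[row[feature]]} for row in rows]


# Hypothetical orders: label 1 means abnormal, 0 means normal.
orders = [
    {"pay_type": "card", "label": 0},
    {"pay_type": "card", "label": 1},
    {"pay_type": "cod", "label": 1},
    {"pay_type": "cod", "label": 1},
    {"pay_type": "wallet", "label": 0},
]
rates = anomaly_rate_by_category(orders, "pay_type")
enriched = add_rate_feature(orders, "pay_type")
```

Because the rate is derived from the labels, a common precaution is to compute it on the training split only and then map it onto the held-out split; the source does not say whether this is done.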
Further, in step two, K-means is used to bin the order amount, and two new features are added: the bin mean and the bin anomaly rate. The commodity sales quantity is binned by threshold-based binarization: a sales quantity above the threshold is a high-risk transaction and is marked 1; otherwise it is a low-risk transaction and is marked 0.
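The binning step can be sketched as follows. To stay dependency-free, the example uses a small one-dimensional Lloyd iteration as a stand-in for a library K-means; the number of bins, the initialization, the threshold value and the sample data are all assumptions, since the source gives none of them.

```python
def kmeans_1d(values, k, iters=100):
    """Tiny 1-D Lloyd iteration; returns (centers, assignments)."""
    ordered = sorted(values)
    # Deterministic initialization spread over the sorted values (an assumption).
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    assign = []
    for _ in range(iters):
        # Assign every value to its nearest center.
        assign = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        new_centers = []
        for c in range(k):
            members = [v for v, a in zip(values, assign) if a == c]
            new_centers.append(sum(members) / len(members) if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign


# Hypothetical order amounts, labels (1 = abnormal) and sales quantities.
amounts = [10, 12, 11, 100, 105, 98, 1000, 990]
labels = [0, 0, 0, 0, 1, 0, 1, 1]
quantities = [1, 2, 1, 3, 80, 2, 120, 60]

centers, bins = kmeans_1d(amounts, k=3)

# New feature 1: mean order amount of the bin each order falls into.
bin_mean = [centers[b] for b in bins]

# New feature 2: anomaly rate of the bin each order falls into.
bin_rate_lookup = {
    b: sum(y for y, a in zip(labels, bins) if a == b) / bins.count(b)
    for b in set(bins)
}
bin_anomaly_rate = [bin_rate_lookup[b] for b in bins]

# Threshold binarization of sales quantity; the threshold 50 is hypothetical.
QUANTITY_THRESHOLD = 50
high_risk = [int(q > QUANTITY_THRESHOLD) for q in quantities]
```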
Further, in step two, after feature engineering, 80% of the abnormal order detection data set is assigned to a training set and 20% to a validation set, for training and checking the model.
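A minimal sketch of the 80/20 split is shown below. Shuffling before the split and the fixed seed are assumptions; the source does not say whether the split is random, chronological or stratified by label.

```python
import random


def train_validation_split(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of the rows and cut it into training and validation parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical data set of ten order IDs.
order_ids = list(range(10))
train_set, validation_set = train_validation_split(order_ids)
```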
Further, in step three, the processed training sample data is input into pre-built XGBoost, CatBoost and GBDT models for training to obtain three base classifiers.
Further, in step three, the hyperparameters of the base classifiers are determined by running OpenBox, a black-box optimization system based on Bayesian optimization, for multiple rounds, with (1 - AUC) as the objective function for hyperparameter optimization of the three base classifiers. In this embodiment, OpenBox is configured with a number of optimization iterations max_runs=200, a random forest as the surrogate model (surrogate_type='prf'), and a maximum time budget for each objective function evaluation of time_limit_per_trial=180 (unit: seconds), and hyperparameter optimization is performed on the XGBoost, CatBoost and GBDT base classifiers.
The calculation formulas of AUC and Accuracy are as follows:
$\mathrm{AUC} = P(P_{\text{positive}} > P_{\text{negative}})$
$\mathrm{Accuracy} = \dfrac{TP + TN}{TP + FP + FN + TN}$
where $P_{\text{positive}}$ is the predicted probability that a positive sample is labeled 1, and $P_{\text{negative}}$ is the predicted probability that a negative sample is labeled 1; TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives.
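The two metrics can be worked through on a small example. The sketch below computes AUC directly from its pairwise definition (the fraction of positive/negative pairs in which the positive sample receives the higher score) and Accuracy from the confusion counts. Counting ties as half a win and the 0.5 decision threshold are assumptions, and the labels and scores are made up for illustration.

```python
def auc(labels, scores):
    """Pairwise AUC: P(score of a positive sample > score of a negative sample)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Ties count as half a win (a common convention, assumed here).
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy(labels, scores, threshold=0.5):
    """Accuracy = (TP + TN) / (TP + FP + FN + TN) at a given threshold."""
    preds = [int(s >= threshold) for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    tn = sum(p == 0 and y == 0 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    return (tp + tn) / (tp + fp + fn + tn)


# Hypothetical labels (1 = abnormal) and predicted probabilities.
labels = [1, 1, 0, 0, 1]
scores = [0.9, 0.4, 0.35, 0.8, 0.7]

auc_value = auc(labels, scores)        # 4 of 6 pairs are ordered correctly
objective = 1.0 - auc_value            # the quantity minimized during tuning
accuracy_value = accuracy(labels, scores)
```

The pairwise form is quadratic in the number of samples, so it is suited to illustration only; library implementations use a rank-based computation that gives the same value.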
Further, the output probabilities of the XGBoost, GBDT and CatBoost base classifiers are summed according to their weights to build a fusion model based on soft voting, with the probability weight ratio of the three base classifiers being 0.8, 0.2 and 4 respectively.
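Soft voting as described here can be sketched as a weighted combination of the three probability outputs. The weights below are taken as stated in the text, in the order XGBoost, GBDT, CatBoost. Dividing by the sum of the weights so that the fused value stays in [0, 1], the 0.5 decision threshold and the sample probabilities are assumptions.

```python
def soft_vote(prob_lists, weights, threshold=0.5):
    """Fuse per-model probabilities by a weighted average, then threshold."""
    total = sum(weights)
    fused = [
        sum(w * p for w, p in zip(weights, probs)) / total
        for probs in zip(*prob_lists)  # one tuple of model outputs per sample
    ]
    preds = [int(p >= threshold) for p in fused]
    return fused, preds


# Hypothetical predicted probabilities of "abnormal" for two orders.
xgb_probs = [0.9, 0.2]
gbdt_probs = [0.6, 0.4]
cat_probs = [0.7, 0.1]

# Weight ratio as given in the text for XGBoost, GBDT and CatBoost.
weights = (0.8, 0.2, 4)

fused, preds = soft_vote([xgb_probs, gbdt_probs, cat_probs], weights)
```

With these inputs the first order is classified as abnormal and the second as normal; the heavily weighted third model dominates the fused probability.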
The technical solution of the application has the following beneficial effects:
In the feature construction stage, the method analyzes the relationship between the order features and the labels and creates a series of new anomaly-rate-based features that are strongly correlated with the labels, which effectively improves model performance. Several classification algorithms are trained to build the base classifiers, and the base classifiers are tuned with OpenBox, a black-box system based on Bayesian optimization, which enhances predictive performance. The resulting base models are placed at the bottom layer, a fusion model based on soft voting is built on top of them, and the processed data is fed into the fusion model. Compared with a single model, fusing single models built on different algorithms can exceed the accuracy ceiling of any one of them and improves generalization ability, thereby improving abnormal order detection accuracy.
Drawings
FIG. 1 is a flow chart of an embodiment.
FIG. 2 is a feature engineering flow chart.
FIG. 3 shows the results of the OpenBox hyperparameter optimization.

Claims (9)

1. An abnormal order detection method based on ensemble learning, characterized by comprising the following steps:
step one, collecting original order data from the corresponding e-commerce platform;
step two, cleaning missing values, duplicate values and outliers in the data set, extracting more explanatory and relevant features from the preprocessed order data set, and dividing the data set into a training set and a test set;
step three, training ensemble learning base classifiers based on XGBoost, CatBoost and GBDT on the sample data set, defining an evaluation index system, tuning the base classifiers over multiple rounds with OpenBox, a black-box optimization system based on Bayesian optimization, and testing the tuned base classifiers for abnormal order detection on the test set;
step four, building a voting-based fusion model from the base classifiers obtained with the optimal parameters, so that the generalization ability and accuracy of the model are improved.
2. The abnormal order detection method based on ensemble learning according to claim 1, wherein: the features of the original data set in step one include at least an order ID, order time, first-level commodity category, commodity channel, commodity ID, brand, order amount, commodity sales quantity, order channel, payment mode, ordering user ID, city, and an abnormality label value.
3. The abnormal order detection method based on ensemble learning according to claim 1, wherein: the preprocessing of the abnormal order detection data set in step two comprises missing value processing, duplicate value processing and outlier processing. Outliers are detected and handled for the two continuous features, transaction amount and transaction quantity, using the box plot method and the standard deviation method.
Box plot method: when a feature value of any sample falls outside [QL - 1.5×IQR, QU + 1.5×IQR], that feature value is considered an outlier, where QL is the lower quartile, QU is the upper quartile, and IQR = QU - QL.
Standard deviation method: the judgment is made from the mean and the standard deviation. A value outside [mean - 2×std, mean + 2×std] is an outlier, where mean is the mean and std is the standard deviation.
4. The abnormal order detection method based on ensemble learning as set forth in claim 2, wherein said step two further includes: defining a key evaluation index for abnormality risk, the anomaly rate, which is the ratio of abnormal transactions to all transactions, and calculating the anomaly rate for each category of each feature to determine the relationship between a specific feature or category and abnormality risk. The anomaly rate is used as an aggregated feature: a new feature extracted from statistics over a discrete variable among the original features, in which each category corresponds to a value of the original feature and the value carries business meaning for that category. On this basis, the payment mode anomaly rate, order hour anomaly rate, province anomaly rate, first-level commodity category anomaly rate, order channel anomaly rate, commodity channel anomaly rate and user anomaly rate are constructed.
5. The abnormal order detection method based on ensemble learning as set forth in claim 2, wherein said step two further includes: using K-means to bin the order amount and adding two new features, the bin mean and the bin anomaly rate; and binning the commodity sales quantity by threshold-based binarization, wherein a sales quantity above the threshold is a high-risk transaction and is marked 1, and otherwise it is a low-risk transaction and is marked 0.
6. The abnormal order detection method based on ensemble learning as set forth in claim 1, wherein said step two further includes: assigning 80% of the processed abnormal order detection data set to a training set and 20% to a validation set, for training and checking the model.
7. The abnormal order detection method based on ensemble learning as set forth in claim 1, wherein said step three includes: inputting the processed training sample data into pre-built XGBoost, CatBoost and GBDT models for training to obtain three base classifiers.
8. The abnormal order detection method based on ensemble learning of claim 7, wherein: the hyperparameters of the base classifiers are determined by running OpenBox, a black-box optimization system based on Bayesian optimization, for multiple rounds, with (1 - AUC) as the objective function for hyperparameter optimization of the three base classifiers.
The AUC calculation formula is as follows:
$\mathrm{AUC} = P(P_{\text{positive}} > P_{\text{negative}})$
where $P_{\text{positive}}$ is the predicted probability that a positive sample is labeled 1, and $P_{\text{negative}}$ is the predicted probability that a negative sample is labeled 1.
9. The abnormal order detection method based on ensemble learning as set forth in claim 1, wherein said step four further includes: summing the output probabilities of the XGBoost, GBDT and CatBoost base classifiers according to their weights to build a fusion model based on soft voting, wherein the probability weight ratio of the three base classifiers is 0.8, 0.2 and 4 respectively.
CN202211293899.1A 2022-10-21 2022-10-21 Abnormal order detection method based on ensemble learning Pending CN116502813A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211293899.1A CN116502813A (en) 2022-10-21 2022-10-21 Abnormal order detection method based on ensemble learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211293899.1A CN116502813A (en) 2022-10-21 2022-10-21 Abnormal order detection method based on ensemble learning

Publications (1)

Publication Number Publication Date
CN116502813A 2023-07-28

Family

ID=87329105

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211293899.1A Pending CN116502813A (en) 2022-10-21 2022-10-21 Abnormal order detection method based on ensemble learning

Country Status (1)

Country Link
CN (1) CN116502813A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116843346A (en) * 2023-09-01 2023-10-03 北京三五通联科技发展有限公司 Abnormal order monitoring and early warning method and system based on cloud platform
CN116843346B (en) * 2023-09-01 2023-11-17 北京三五通联科技发展有限公司 Abnormal order monitoring and early warning method and system based on cloud platform

Similar Documents

Publication Publication Date Title
Lin et al. Detecting the financial statement fraud: The analysis of the differences between data mining techniques and experts’ judgments
Chung et al. Insolvency prediction model using multivariate discriminant analysis and artificial neural network for the finance industry in New Zealand
CN110417721A (en) Safety risk estimating method, device, equipment and computer readable storage medium
CN109711955B (en) Poor evaluation early warning method and system based on current order and blacklist base establishment method
CN111178675A (en) LR-Bagging algorithm-based electric charge recycling risk prediction method, system, storage medium and computer equipment
CN112070543B (en) Method for detecting comment quality in E-commerce website
CN104321794A (en) A system and method using multi-dimensional rating to determine an entity's future commercial viability
CN104915842A (en) Electronic commerce transaction monitoring method based on internet transaction data
CN111369171A (en) User service safety comprehensive risk assessment method based on combined empowerment
Ramaki et al. Credit card fraud detection based on ontology graph
CN116502813A (en) Abnormal order detection method based on ensemble learning
CN110162958A (en) For calculating the method, apparatus and recording medium of the synthesis credit score of equipment
Fan Data mining model for predicting the quality level and classification of construction projects
CN117132383A (en) Credit data processing method, device, equipment and readable storage medium
CN116228403A (en) Personal bad asset valuation method and system based on machine learning algorithm
CN105930430A (en) Non-cumulative attribute based real-time fraud detection method and apparatus
CN113763032A (en) Commodity purchase intention identification method and device
CN114298472A (en) Method and system for evaluating images of upstream and downstream enterprises of digital factory
CN113220970A (en) E-commerce big data platform based on block chain
Jan et al. Detection of fraudulent financial statements using decision tree and artificial neural network
Nadali et al. Class Labeling of Bank Credit's Customers Using AHP and SAW for Credit Scoring with Data Mining Algorithms
Florbäck Anomaly detection in logged sensor data
Yasser Sahib Nassar Building and analyzing a crisis management model using Fuzzy DEMATEL technique
Religia et al. Analysis of the Use of Particle Swarm Optimization on Naïve Bayes for Classification of Credit Bank Applications
Seo et al. A Unified Model for Bid Landscape Forecasting in the Mixed Auction Types of Real-Time Bidding

Legal Events

Date Code Title Description
PB01 Publication