WO2018097653A1 - Method and program for predicting chargeback fraud user - Google Patents

Method and program for predicting chargeback fraud user

Info

Publication number
WO2018097653A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
user
feature
classification
performance
Prior art date
Application number
PCT/KR2017/013539
Other languages
French (fr)
Korean (ko)
Inventor
서재현
최대선
Original Assignee
공주대학교 산학협력단
Priority date
Filing date
Publication date
Application filed by 공주대학교 산학협력단
Publication of WO2018097653A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/06 Buying, selling or leasing transactions
    • G06Q30/0601 Electronic shopping [e-shopping]
    • G06Q30/0609 Buyer or seller confidence or verification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/18 Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N99/00 Subject matter not provided for in other groups of this subclass
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q20/00 Payment architectures, schemes or protocols
    • G06Q20/38 Payment protocols; Details thereof
    • G06Q20/40 Authorisation, e.g. identification of payer or payee, verification of customer or shop credentials; Review and approval of payers, e.g. check credit lines or negative lists
    • G06Q20/401 Transaction verification
    • G06Q20/4016 Transaction verification involving fraud or risk level assessment in transaction processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/06 Buying, selling or leasing transactions

Definitions

  • The present invention relates to a method and program for predicting chargeback fraud users. More particularly, it relates to a method and program that process the transaction history data of conventional users to implement a machine learning model satisfying a target prediction classification performance, and that use the implemented model to predict chargeback fraud from a new user's transaction history data.
  • FIG. 1 shows a flowchart of a chargeback fraud of a game user.
  • As one example of chargeback fraud, chargeback fraud by an online game user is committed, as shown in FIG. 1, when the user (game user) purchases game money or the like from a game company with a credit card, spends it, and then requests a chargeback from the bank. This is a serious problem because it can cause enormous damage to game companies and the like.
  • Patent Document 1 KR10-2016-0017629 A
  • To prevent in advance the damage caused by chargeback fraud as described above, an object of the present invention is to provide a method and program for predicting chargeback fraud users that implement a machine learning model satisfying a target prediction classification performance by processing the transaction history data of conventional users, and that use the implemented model to predict chargeback fraud from a new user's transaction history data.
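  • A method of predicting a chargeback fraud user according to an embodiment of the present invention includes: (1) a data processing step of processing the transaction history data of conventional normal users and chargeback fraud users into one record per user; (2) a data classification step of dividing the processed transaction history data into training data and test data; (3) a data adjustment step of adjusting the number of data by oversampling the data for chargeback fraud users among the training data; (4) a prediction classification step of training with a specific machine learning technique using the adjusted training data, and using the learned machine learning model to predict and classify whether each test data item corresponds to a chargeback fraud user; (5) a performance measurement step of measuring the performance of the prediction classification; (6) an iteration step of oversampling or undersampling the data for chargeback fraud users among the training data and repeating the prediction classification step and the performance measurement step until the performance of the prediction classification reaches a target value; and (7) a prediction step of predicting chargeback fraud from a new user's transaction history data using the machine learning model that has reached the target prediction classification performance.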
  • The data processing step may include: (1) a primary feature deletion step of evaluating each feature and a plurality of feature sets of the transaction history data and deleting the features that fall below a criterion; (2) a feature generation step of generating new features while using statistical methods to process the transaction history data that has passed through the primary feature deletion step into one record per user; and (3) a secondary feature deletion step of evaluating the generated features and deleting the features that do not meet the evaluation criteria.
  • The primary feature deletion step may include: (1) evaluating each feature of the transaction history data using an information gain technique; and (2) evaluating the plurality of feature sets of the transaction history data using a principal component analysis technique.
  • the performance measuring step may measure the performance of the prediction classification by using a confusion matrix.
  • The program for predicting a chargeback fraud user according to an embodiment of the present invention may be stored in a medium in order to predict chargeback fraud users according to the above-described method for predicting a chargeback fraud user.
  • The method and program for predicting a chargeback fraud user configured as described above implement a machine learning model that satisfies the target prediction classification performance and can therefore predict chargeback fraud from a new user's transaction history data, which has the advantage that damage caused by chargeback fraud can be prevented in advance.
  • FIG. 1 shows a flowchart of chargeback fraud.
  • FIG. 2 illustrates a method for predicting a chargeback fraud user according to an embodiment of the present invention.
  • FIG. 3 illustrates a data processing step S10 of a method for predicting a chargeback fraud user according to an embodiment of the present invention.
  • FIG. 2 illustrates a method for predicting a chargeback fraud user according to an embodiment of the present invention.
  • the method of predicting a chargeback fraud user may be performed by a computer.
  • the computer may be a desktop personal computer, a laptop personal computer, a netbook computer, a tablet personal computer, or the like, but is not limited thereto.
  • As shown in FIG. 2, the method of predicting a chargeback fraud user according to an embodiment of the present invention includes a data processing step (S10), a data classification step (S20), a data adjustment step (S30), a prediction classification step (S40), a performance measurement step (S50), an iteration step (S60), and a prediction step (S70).
  • the data processing step S10 is a step of processing transaction history data of a conventional user.
  • The conventional transaction history data includes the transaction details of normal users and of chargeback fraud users, and may be provided from the database of a game company or the like that stores and manages it.
  • The data processing step (S10) processes the transaction history data into one record per user.
  • the transaction history data includes a plurality of attributes having different data characteristics and physical forms (record format, record length, etc.) of the data. This attribute of data is hereinafter referred to as "feature".
  • Table 1 shows the features of the actual transaction history data stored in the database of a game company.
  • The actual transaction history data provided by the game company comprised hundreds of thousands of transaction records for 62,092 normal users and thousands of transaction records for 372 chargeback fraud users.
  • the transaction history data may include a plurality of features as shown in Table 1. That is, each transaction history data may include features of "user_no, standard_country_code, charge_status, charge_no, payment_method_no, charge_amount, bonus_amount, datetime, charge_product_name and hash_ip, ip_addr".
  • Table 1. Features of the transaction history data
      No.  Feature                 Description
      1    user_no                 User's identifier
      2    standard_country_code   User's country code
      3    charge_status           User's charging stage
      4    charge_no               Charging identifier
      5    payment_method_no       Form of payment identifier
      6    charge_amount           Charge amount
      7    bonus_amount            Bonus amount
      8    datetime                Transaction date
      9    charge_product_name     Payment gateway name
      10   hash_ip                 IP address of the user converted by a hash function
      11   ip_addr                 IP address of the user
  • FIG. 3 illustrates a data processing step S10 of a method for predicting a chargeback fraud user according to an embodiment of the present invention.
  • The data processing step (S10) may include a primary feature deletion step (S11), a feature generation step (S12), and a secondary feature deletion step (S13).
  • the primary feature deletion step S11 is a step of evaluating each feature and a plurality of feature sets of the transaction history data and deleting a feature that does not meet the evaluation criteria.
  • In the primary feature deletion step (S11), each feature of the transaction history data (e.g., user_no, standard_country_code, charge_status, charge_no, payment_method_no, charge_amount, bonus_amount, datetime, charge_product_name, hash_ip, and ip_addr) is first evaluated using an information gain technique.
  • The information gain is the reduction in entropy expected when one feature is selected; the higher the value, the better that feature separates the data. That is, in the primary feature deletion step (S11), the information gain technique yields, for each feature, a value indicating how well selecting that feature distinguishes chargeback fraud users.
  • Any feature whose information gain value is below a predetermined criterion is then deleted, because such a feature is not needed to distinguish chargeback fraud users.
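The information gain criterion described above can be sketched as follows. This is a minimal illustration for discrete features, not the evaluator used in the patent: the helper names (`entropy`, `information_gain`, `drop_low_gain`), the threshold of 0.1, and the toy values are all assumptions made for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Expected entropy reduction obtained by splitting the labels on one discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def drop_low_gain(features, labels, threshold):
    """Keep only the features whose information gain reaches the threshold (hypothetical helper)."""
    return {name: values for name, values in features.items()
            if information_gain(values, labels) >= threshold}

# Toy data (made up): 1 = chargeback fraud user, 0 = normal user.
labels = [1, 1, 0, 0]
features = {
    "payment_method_no": [7, 7, 3, 3],                   # separates the two classes
    "standard_country_code": ["KR", "US", "KR", "US"],   # carries no class information
}
kept = drop_low_gain(features, labels, threshold=0.1)
```

In this toy case only `payment_method_no` survives, because splitting on it removes all class uncertainty while splitting on `standard_country_code` removes none. Continuous features such as amounts would need to be discretized first, which this sketch does not cover.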
  • a plurality of feature sets of transaction history data are evaluated using a principal component analysis technique.
  • Principal component analysis is a technique for reducing high-dimensional data to low-dimensional data by finding the principal components of the distributed data. That is, in the primary feature deletion step (S11), the feature sets that serve as principal components for distinguishing chargeback fraud users can be extracted using the principal component analysis technique.
  • The plurality of feature sets of the transaction history data are combinations of two or more features, for example {user_no, standard_country_code}, {user_no, charge_status}, ..., {user_no, standard_country_code, charge_status}, {user_no, standard_country_code, charge_no}, and the like.
  • Features included in feature sets that fall below a predetermined criterion, that is, feature sets that do not correspond to a principal component, are then deleted, because such features are not needed to distinguish chargeback fraud users.
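The text does not spell out how principal component analysis is turned into a deletion criterion. One possible reading, sketched below under that assumption, is to compute the first principal component and treat features with a small absolute loading as deletion candidates. The function names, the 0.1 cutoff, the feature names paired with the columns, and the numeric values are all hypothetical; real data would normally be standardized first.

```python
import math

def covariance_matrix(rows):
    """Sample covariance matrix of a list of equal-length numeric rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_principal_component(rows, iterations=200):
    """Approximate the dominant eigenvector of the covariance matrix by power iteration."""
    cov = covariance_matrix(rows)
    d = len(cov)
    vec = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0:
            break  # all features are constant; no direction of variance exists
        vec = [x / norm for x in nxt]
    return vec

# Toy per-user rows (made up); the second column is constant.
names = ["charge_amount_sum", "payment_method_no_kind", "bonus_amount_sum"]
rows = [[1.0, 5.0, 0.1], [2.0, 5.0, 0.2], [3.0, 5.0, 0.1], [4.0, 5.0, 0.2]]
loadings = first_principal_component(rows)
weak = [name for name, w in zip(names, loadings) if abs(w) < 0.1]
```

Here the first column carries almost all of the variance, so the other two columns end up in `weak`. Because unscaled PCA favors features with large numeric ranges, this outcome reflects the toy values rather than anything about the patent's data.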
  • The feature generation step (S12) is a step of generating new features while using statistical methods to process the transaction history data that has passed through the primary feature deletion step into one record per user.
  • the statistical method may include, but is not limited to, methods such as count, sum, difference, average, standard deviation, maximum value, minimum value, date statistics, time statistics, and the like with respect to data.
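As a rough illustration of the feature generation step, the sketch below collapses several transactions into one record per user using count, sum, average, and standard deviation. The output names are borrowed from the feature names that appear in this document, but their exact definitions here (for example, `_kind` as a count of distinct values, and the population standard deviation) are assumptions, as is the `aggregate_per_user` helper itself.

```python
from collections import defaultdict
from statistics import mean, pstdev

def aggregate_per_user(transactions):
    """Collapse a list of transaction dicts into one feature record per user."""
    by_user = defaultdict(list)
    for tx in transactions:
        by_user[tx["user_no"]].append(tx)
    records = {}
    for user_no, txs in by_user.items():
        amounts = [tx["charge_amount"] for tx in txs]
        records[user_no] = {
            "transaction_cnt_sum": len(txs),               # count
            "charge_amount_sum": sum(amounts),             # sum
            "charge_amount_avg": mean(amounts),            # average
            "charge_amount_stddev": pstdev(amounts),       # population standard deviation
            "payment_method_no_kind": len({tx["payment_method_no"] for tx in txs}),
        }
    return records

# Toy transactions (made up).
transactions = [
    {"user_no": 1, "charge_amount": 10, "payment_method_no": 1},
    {"user_no": 1, "charge_amount": 30, "payment_method_no": 2},
    {"user_no": 2, "charge_amount": 5, "payment_method_no": 1},
]
records = aggregate_per_user(transactions)
```

Each key of `records` is a user and each value is a single row, which is the "one record per user" shape that the later classification steps work on.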
  • the secondary feature deletion step S13 is a step of deleting a feature that does not meet the evaluation criteria by performing an evaluation on the generated feature.
  • each feature of the transaction history data is evaluated by using an information gain technique.
  • the corresponding feature having an information gain value less than a predetermined criterion is deleted. This is because a feature with an information gain value below a certain criterion corresponds to a feature that is not necessary to distinguish between chargeback fraud users.
  • Table 2 shows the features obtained when the transaction history data of Table 1 passes through the data processing step (S10), including the primary feature deletion step (S11) and the secondary feature deletion step (S13).
  • standard_country_code_kind is additionally created from standard_country_code
  • charge_stat10, charge_stat20, and charge_stat30 are additionally created from charge_status
  • payment_method_no_kind is additionally created from payment_method_no
  • charge_amount_sum is additionally created from payment_method_no
  • charge_amount_sum is additionally created from payment_amount_av_
  • transaction_recent_monthday, transaction_recent_hour, transaction_cnt_sum, transaction_cnt_1_month, transaction_cnt_2_month, transaction_cnt_3_month, transaction_cnt_6_month and transaction_cnt_else were additionally created from datetime
  • charge_product_name_kind_ was added from ip_addrkind ip_addrkind.
  • charge_no and hash_ip are deleted, and a class feature is added to distinguish normal users from chargeback fraud users.
  • The information gain values of the features in Table 2 were determined using the ClassifierSubsetEval attribute evaluator based on a decision tree (DT) and a genetic algorithm. Features 4, 5, 7, 8, 10, 11, 12, 17, 18, 19, and 20 correspond to charge_stat10, charge_stat20, payment_method_no, payment_method_no_kind, charge_amount_avg, charge_amount_stddev, bonus_amount_sum, transaction_cnt_sum, transaction_cnt_1_month, transaction_cnt_2_month, and transaction_cnt_3, respectively.
  • the data classification step S20 is a step of dividing the processed transaction details data into training data and test data.
  • The training data is used to train a specific machine learning model in a later step.
  • The test data is used to test the performance of the trained machine learning model.
  • Table 3 shows the various dataset types for dividing processed transaction history data into training data and test data.
  • The 66% split assigns 66% of the transaction history data to training data and the remaining 34% to test data. The 10-fold type assigns 9/10 of the transaction history data to training data and 1/10 to test data, and is evaluated by cross validation.
  • The 50% split assigns 50% of the transaction history data to training data and the other 50% to test data, with the data divided by StratifiedFolds preprocessing.
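A percentage split of the kind listed above might look like the sketch below. It stratifies by class so that the rare chargeback fraud users appear in both parts; the text only mentions stratification for the 50% split, so applying it to a 66% split here is an assumption, and `stratified_split` is a hypothetical helper rather than the preprocessing filter named in the text.

```python
import random

def stratified_split(records, labels, train_fraction, seed=0):
    """Split records into training and test parts, keeping each class's share in both."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    train = [(records[i], labels[i]) for i in train_idx]
    test = [(records[i], labels[i]) for i in test_idx]
    return train, test

# Toy data (made up): 10 chargeback fraud users (1) and 90 normal users (0).
records = list(range(100))
labels = [1] * 10 + [0] * 90
train, test = stratified_split(records, labels, train_fraction=0.66)
```

With these toy counts the training part holds 66 records, 7 of them fraud users, and the test part holds the remaining 34.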
  • the data adjustment step (S30) is a step of adjusting the number of data by oversampling the data for the chargeback fraud user among the training data. Since the transaction history data of the chargeback fraud user is less than that of the normal user, the performance of the machine learning model learned from the training data may be degraded. Accordingly, the performance of the machine learning model may be improved by oversampling data for the chargeback fraud user in the training data through the data adjustment step S30. Specific experimental examples for improving the performance of the machine learning model through the data adjustment step (S30) will be described later.
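The data adjustment step can be illustrated with simple random oversampling. The text does not say which oversampling algorithm is used or exactly how a percentage such as 100% is defined; this sketch assumes random duplication of minority records and assumes the percentage is the number of added copies relative to the original minority count. `oversample_minority` is a hypothetical helper.

```python
import random

def oversample_minority(train, minority_label, percent, seed=0):
    """Return the training pairs plus randomly duplicated minority-class pairs.

    `percent` is the number of added copies as a percentage of the original
    minority count (assumption): 100 doubles the minority class.
    """
    rng = random.Random(seed)
    minority = [pair for pair in train if pair[1] == minority_label]
    extra_count = len(minority) * percent // 100
    extra = [rng.choice(minority) for _ in range(extra_count)]
    return train + extra

# Toy training data (made up): (record, label) pairs, label 1 = chargeback fraud user.
toy_train = [((i,), 0) for i in range(8)] + [((100,), 1), ((101,), 1)]
balanced = oversample_minority(toy_train, minority_label=1, percent=200)
```

Only the training data is resampled; the test data keeps its original class balance so that the measured performance reflects realistic conditions.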
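  • In the prediction classification step (S40), the training data whose number of data has been adjusted is used to train a model with a predetermined machine learning technique, and the learned machine learning model predicts and classifies whether each test data item corresponds to a chargeback fraud user.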
  • the machine learning includes various algorithms such as supervised learning, unsupervised learning, semi-supervised learning, and is not particularly limited.
  • Supervised learning may include a Support Vector Machine (SVM), a Hidden Markov Model, regression, a neural network, Naive Bayes classification, and the like.
  • the performance measurement step S50 is a step of measuring the performance for the prediction classification. That is, the performance measurement step S50 measures performance indicating the accuracy of the test data predicted and classified by the machine learning model in the prediction classification step S40. In this case, the performance measurement step S50 may measure the performance of the prediction classification by using a confusion matrix.
  • Table 4 shows the confusion matrix.
  • TP (true positive) is the case where the machine learning model predicts a test data item to be a chargeback fraud user and it actually is a chargeback fraud user.
  • TN (true negative) is the case where the machine learning model predicts a test data item to be a normal user and it actually is a normal user.
  • FP (false positive) is the case where the machine learning model predicts a test data item to be a chargeback fraud user but it actually is a normal user.
  • FN (false negative) is the case where the machine learning model predicts a test data item to be a normal user but it actually is a chargeback fraud user.
  • The number of results predicted and classified by the machine learning model for each test data item is tallied according to the confusion matrix of Table 4. The performance measurement step (S50) then calculates the value of each performance indicator.
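The tallying and indicator calculation described above can be sketched as follows. Since the indicators of Table 5 are not reproduced in this text, the sketch assumes the standard precision, recall, F-measure, and accuracy definitions; the function names and the toy labels are made up for illustration.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, TN, FP and FN, where `positive` marks a chargeback fraud user."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn

def performance_indicators(tp, tn, fp, fn):
    """Standard indicators derived from the confusion matrix (assumed set)."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall,
            "f_measure": f_measure, "accuracy": accuracy}

# Toy labels (made up): 1 = chargeback fraud user, 0 = normal user.
actual = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]
counts = confusion_counts(actual, predicted)
scores = performance_indicators(*counts)
```

For heavily imbalanced data such as this, recall on the fraud class is usually more informative than accuracy, since a model that labels every user as normal would still score a high accuracy.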
  • The performance indicators may include, but are not limited to, any measure capable of measuring the accuracy of data classification.
  • Table 5 shows each performance index for measuring the performance of the machine learning model predicted and classified in the prediction classification step (S40).
  • Tables 6 and 7 show the prediction classification performance measured by performing the prediction classification step (S40) and the performance measurement step (S50) for each dataset type shown in Table 3, using the decision tree (DT) and support vector machine (SVM) machine learning techniques.
  • Table 6 shows the results of performing the data adjustment step (S30), the prediction classification step (S40) and the performance measurement step (S50).
  • The support vector machine (SVM) shows better prediction classification performance than the decision tree (DT), and performing the data adjustment step (S30) yields better prediction classification performance.
  • The iteration step (S60) is a step of repeating the prediction classification step (S40) and the performance measurement step (S50) while oversampling or undersampling the data for chargeback fraud users among the training data, until the performance of the prediction classification reaches the target value. If the oversampling ratio is too high and the amount of training data grows excessively, training in the prediction classification step (S40) may become overloaded; for this reason the iteration step (S60) may perform undersampling in addition to oversampling of the training data. Undersampling may also be performed in the iteration step (S60) to reach the target value more precisely.
  • the ratio of oversampling or undersampling at the time of performing the repetition step S60 may be regular or arbitrary, and is not particularly limited.
  • the oversampling ratio at the nth iteration can be defined as A ⁇ B n (where A and B are natural numbers and n is an integer).
  • the undersampling ratio at the mth iteration can be defined as A ⁇ B n ⁇ (C ⁇ m) (where A, B and C are natural numbers, n and m are integers, n ⁇ m)
  • For example, the prediction classification step (S40) and the performance measurement step (S50) are first performed with the oversampling set to 100%.
  • In the first iteration, the oversampling is raised to 200%, and the prediction classification step (S40) and the performance measurement step (S50) are performed again.
  • In the second iteration, the oversampling is raised to 300%, and the prediction classification step (S40) and the performance measurement step (S50) are performed again.
  • Recall is 0.948, which is above the target value.
  • In the third iteration, undersampling may be performed; that is, the oversampling may be set to a value below 300%, for example 280%, and the prediction classification step (S40) and the performance measurement step (S50) may be performed again.
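The worked example above can be expressed as a small search loop. This is only a sketch of one possible control flow: the `evaluate` callable stands in for running the adjustment, prediction classification, and performance measurement steps at a given oversampling percentage, and the step size, back-off amount, target of 0.94, and all recall values other than 0.948 are assumptions invented for the illustration.

```python
def tune_oversampling(evaluate, target, start=100, step=100, back_off=20, max_rounds=10):
    """Raise the oversampling percentage until the score reaches the target,
    then try one smaller percentage and keep it if it still meets the target."""
    percent = start
    history = []
    for _ in range(max_rounds):
        score = evaluate(percent)
        history.append((percent, score))
        if score >= target:
            break
        percent += step
    else:
        return None, history  # the target was never reached
    reduced = percent - back_off
    reduced_score = evaluate(reduced)
    history.append((reduced, reduced_score))
    if reduced_score >= target:
        percent = reduced
    return percent, history

# Stand-in for training and measuring recall at each oversampling percentage.
# Only 0.948 comes from the text; its pairing with 300% and the other values are made up.
fake_recall = {100: 0.90, 200: 0.93, 300: 0.948, 280: 0.945}
chosen, history = tune_oversampling(fake_recall.get, target=0.94)
```

With these stand-in numbers the loop visits 100%, 200%, and 300%, then backs off to 280% and keeps it because the score still meets the target, which mirrors the sequence of iterations in the example while using less duplicated training data.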
  • the predicting step S70 is a step of predicting a chargeback fraud on transaction history data of a new user using a machine learning model that has reached the target predictive classification performance.
  • The program for predicting a chargeback fraud user according to an embodiment of the present invention is stored in a medium in order to predict chargeback fraud users according to the above-described method for predicting a chargeback fraud user.
  • the predictive program of a chargeback fraud user may be recorded in a recording medium readable by a computer or similar device.
  • The recording medium may be a hard disk type, a magnetic media type, a compact disc read-only memory (CD-ROM), an optical media type, a magneto-optical media type, a multimedia card micro type, a card-type memory (e.g., SD or XD memory), a flash memory type, a read-only memory (ROM), a random access memory (RAM), or a main memory or secondary memory device composed of a combination of these, but is not limited thereto.
  • The program may also be stored in an attachable storage device that is accessible through a communication network such as the Internet, an intranet, a local area network (LAN), a wide area network (WLAN), a storage area network (SAN), or a combination thereof.

Abstract

The present invention relates to a method for predicting a chargeback fraud user, the method comprising: a data processing step of processing conventional transaction detail data regarding normal users and chargeback fraud users into one record per user; a data classification step of dividing the processed transaction detail data into training data and test data; a data adjustment step of oversampling data regarding chargeback fraud users among the training data, thereby adjusting the number of pieces of data; a prediction/classification step of training with a specific machine learning technique using the adjusted training data, and predicting/classifying whether test data corresponds to a chargeback fraud user using the learned machine learning model; a performance measurement step of measuring the performance of the prediction/classification; an iteration step of oversampling or undersampling data regarding chargeback fraud users among the training data and repeating the prediction/classification step and the performance measurement step until the performance of the prediction/classification reaches a target value; and a prediction step of predicting chargeback fraud with regard to a new user's transaction detail data using the machine learning model that has reached the target prediction/classification performance.

Description

지불 거절 사기 사용자의 예측 방법 및 프로그램Method and program for predicting chargeback fraud users
본 발명은 지불 거절 사기 사용자의 예측 방법 및 프로그램에 관한 것으로서, 더욱 상세하게는 종래 사용자의 거래 내역 데이터를 가공 이용하여 목표 예측 분류 성능을 만족하는 머신 러닝 모델을 구현하고, 구현된 머신 러닝 모델을 이용하여 새로운 사용자의 거래 내역 데이터에 대한 지불 거절 사기를 예측하는 방법 및 프로그램에 관한 것이다.The present invention relates to a method and program for predicting a fraudulent fraud user. More particularly, the present invention provides a machine learning model that satisfies the target prediction classification performance by processing transaction history data of a conventional user, and implements the implemented machine learning model. And a method and program for predicting chargeback fraud for a new user's transaction history data.
최근, 전자 결제 수단의 사용이 보편화됨에 따라 전자 결제 수단을 이용한 사기(fraud)의 일종인 지불 거절 사기(chargeback fraud)의 사례가 급격히 증가하는 추세에 있다. In recent years, as the use of electronic payment means is becoming more common, cases of chargeback fraud, which is a kind of fraud using electronic payment methods, have been increasing rapidly.
도 1은 게임 사용자의 지불 거절 사기(chargeback fraud)의 흐름도를 나타낸다.1 shows a flowchart of a chargeback fraud of a game user.
지불 거절 사기의 일 예로서, 온라인 게임 사용자에 의한 지불 거절 사기는, 도 1에 도시된 바와 같이, 사용자(game user)가 신용 카드로 구매하여 게임 회사로부터 지급 받은 게임 머니 등을 소진한 후, 지불 거절(chargeback)을 은행에 요청함으로써 성립되는 것으로서, 게임 회사 등에 막대한 손해를 입힐 수 있어 큰 문제가 되고 있다. As an example of the chargeback fraud, the chargeback fraud by the online game user, as shown in Figure 1, after the user (game user) has spent the game money and the like paid by the game company purchased with a credit card, Established by requesting a bank for a chargeback, it is a huge problem because it can cause enormous damage to game companies and the like.
따라서, 지불 거절 사기를 사전 예측하여 지불 거절 사기에 의한 손해를 미연에 방지할 수 있는 기술 개발이 필요한 실정이다.Therefore, there is a need for a technology development capable of preventing damages due to chargeback fraud by predicting chargeback fraud in advance.
[선행문헌][Prior literature]
(특허문헌 1) KR10-2016-0017629 A (Patent Document 1) KR10-2016-0017629 A
상기한 바와 같은 지불 거절 사기에 의한 손해를 미연에 방지하기 위하여, 본 발명은 종래 사용자의 거래 내역 데이터를 가공 이용하여 목표한 예측 분류 성능을 만족하는 머신 러닝 모델을 구현하고, 구현된 머신 러닝 모델을 이용하여 새로운 사용자의 거래 내역 데이터에 대한 지불 거절 사기를 예측하는 지불 거절 사기 사용자의 예측 방법 및 프로그램을 제공하는데 그 목적이 있다.In order to prevent damages due to chargeback fraud as described above, the present invention implements a machine learning model that satisfies the targeted prediction classification performance by processing and using transaction history data of a conventional user, and implemented the machine learning model. The purpose of the present invention is to provide a method and program for predicting a chargeback fraud user that predicts a chargeback fraud on a transaction history data of a new user.
상기와 같은 과제를 해결하기 위한 본 발명의 일 실시예에 따른 지불 거절 사기 사용자의 예측 방법은, (1) 종래의 정상 사용자 및 지불 거절 사기(chargeback fraud) 사용자에 대한 거래 내역 데이터를 각 사용자마다 1건 기준의 데이터로 가공하는 데이터 가공 단계, (2) 가공된 거래 내역 데이터를 트레이닝 데이터(training data)와 테스트 데이터(test data)로 나누는 데이터 분류 단계, (3) 트레이닝 데이터 중 지불 거절 사기 사용자에 대한 데이터를 오버샘플링(oversampling)하여 데이터의 수를 조절하는 데이터 조절 단계, (4) 데이터의 수가 조절된 트레이닝 데이터를 이용하여 특정 머신 러닝 기법으로 학습시키고, 학습된 머신 러닝 모델을 이용하여 테스트 데이터에 대한 지불 거절 사기 사용자 해당 여부를 예측 분류하는 예측 분류 단계, (5) 예측 분류에 대한 성능을 측정하는 성능 측정 단계, (6) 예측 분류에 대한 성능이 목표값에 도달할 때까지 트레이닝 데이터 중 지불 거절 사기 사용자에 대한 데이터를 오버샘플링(oversampling)하거나 언더샘플링(undersampling)하여 예측 분류 단계 및 성능 측정 측정 단계를 반복 수행하는 반복 수행 단계, (7) 목표한 예측 분류 성능에 도달한 머신 러닝 모델을 이용하여 새로운 사용자의 거래 내역 데이터에 대한 지불 거절 사기를 예측하는 예측 단계를 포함한다.In order to solve the above problems, a method of predicting a chargeback fraud user according to an embodiment of the present invention includes: (1) transaction history data of a conventional normal user and a chargeback fraud user for each user; A data processing step of processing the data based on one case, (2) A data classification step of dividing the processed transaction history data into training data and test data, and (3) A chargeback fraud user among the training data. A data adjustment step of adjusting the number of data by oversampling the data for (4) training using a specific machine learning method using the training data in which the number of data is adjusted, and testing using the learned machine learning model Predictive classification step for predicting whether a chargeback fraud user is against data, (5) performance for predictive classification (6) Predictive classification step and performance by oversampling or undersampling data for chargeback fraud users in the training data until the performance for the predictive classification reaches the target value. 
Iteratively performing a repeating step of measuring the measurement, (7) predicting the chargeback fraud for the transaction history data of the new user using the machine learning model that has reached the target prediction classification performance.
The data processing step may include: (1) a first feature deletion step of evaluating each feature and each multi-feature set of the transaction history data and deleting features that fall below a criterion; (2) a feature generation step of generating new features while processing the transaction history data that has passed through the first feature deletion step into one record per user by statistical methods; and (3) a second feature deletion step of evaluating the generated features and deleting those that fall below the evaluation criterion.
The first feature deletion step may include: (1) evaluating each feature of the transaction history data using the information gain technique; and (2) evaluating multi-feature sets of the transaction history data using principal component analysis.
The performance measurement step may measure the performance of the prediction classification using a confusion matrix.
In addition, a program for predicting chargeback fraud users according to an embodiment of the present invention may be stored in a medium in order to predict chargeback fraud users according to the method described above.
The method and program for predicting chargeback fraud users according to an embodiment of the present invention, configured as described above, implement a machine learning model that satisfies the target prediction classification performance and can therefore predict chargeback fraud from the transaction history data of a new user, which has the advantage of preventing losses from chargeback fraud before they occur.
FIG. 1 is a flowchart of chargeback fraud.
FIG. 2 illustrates a method for predicting chargeback fraud users according to an embodiment of the present invention.
FIG. 3 illustrates the data processing step (S10) of the method for predicting chargeback fraud users according to an embodiment of the present invention.
The above objects and means of the present invention, and the effects that follow from them, will become more apparent from the following detailed description taken in conjunction with the accompanying drawings, so that a person of ordinary skill in the art to which the present invention pertains can readily practice its technical idea. In describing the present invention, detailed descriptions of related known technology are omitted where they could unnecessarily obscure the gist of the invention.
The terminology used herein is for describing the embodiments and is not intended to limit the present invention. In this specification, singular forms also include plural forms unless the text specifically states otherwise. As used herein, "comprises" and/or "comprising" do not exclude the presence or addition of one or more components other than those mentioned. Unless otherwise defined, all terms used herein have the meanings commonly understood by a person of ordinary skill in the art to which the present invention pertains. Terms defined in commonly used dictionaries are not to be interpreted ideally or excessively unless they are clearly and specifically defined.
Hereinafter, a preferred embodiment of the present invention is described in detail with reference to the accompanying drawings.
FIG. 2 illustrates a method for predicting chargeback fraud users according to an embodiment of the present invention.
The method for predicting chargeback fraud users according to an embodiment of the present invention may be performed by a computer. For example, the computer may be a desktop personal computer, a laptop personal computer, a netbook computer, a tablet personal computer, or the like, but is not limited thereto.
Specifically, as shown in FIG. 2, the method for predicting chargeback fraud users according to an embodiment of the present invention includes a data processing step (S10), a data classification step (S20), a data adjustment step (S30), a prediction classification step (S40), a performance measurement step (S50), an iteration step (S60), and a prediction step (S70).
The data processing step (S10) processes the transaction history data of existing users. This existing transaction history data includes the transaction histories of both normal users and chargeback fraud users, and may be provided from the database of an entity that stores and manages it, such as a game company. Because the number of transactions differs from user to user, the data processing step (S10) processes the transaction history data into one record per user. The transaction history data includes a plurality of attributes that differ in data characteristics and in physical form (record format, record length, and so on). These attributes are hereinafter referred to as "features".
Table 1 shows the features of actual transaction history data stored in the database of a game company. The actual transaction history data provided by that company contained transaction history data for 62,092 normal users (hundreds of thousands of transactions) and for 372 chargeback fraud users (thousands of transactions).
For example, the transaction history data may include a plurality of features as shown in Table 1. That is, each transaction record may include the features user_no, standard_country_code, charge_status, charge_no, payment_method_no, charge_amount, bonus_amount, datetime, charge_product_name, hash_ip, and ip_addr.
[Table 1]
No. | Feature | Description
1 | user_no | User identifier
2 | standard_country_code | User's country code
3 | charge_status | User's charge status
4 | charge_no | Charge identifier
5 | payment_method_no | Payment method identifier
6 | charge_amount | Charge amount
7 | bonus_amount | Bonus amount
8 | datetime | Transaction date and time
9 | charge_product_name | Payment gateway name
10 | hash_ip | User's IP address converted by a hash function
11 | ip_addr | User's IP address
FIG. 3 illustrates the data processing step (S10) of the method for predicting chargeback fraud users according to an embodiment of the present invention.
Specifically, as shown in FIG. 3, the data processing step (S10) may include a first feature deletion step (S11), a feature generation step (S12), and a second feature deletion step (S13).
In the first feature deletion step (S11), each feature and each multi-feature set of the transaction history data is evaluated, and features that fall below the evaluation criterion are deleted. In this step, the information gain technique is used to evaluate each individual feature (for example, user_no, standard_country_code, charge_status, charge_no, payment_method_no, charge_amount, bonus_amount, datetime, charge_product_name, hash_ip, and ip_addr).
Information gain is the reduction in entropy expected when a given feature is selected; the higher the value, the better that feature separates the data. In other words, the first feature deletion step (S11) uses the information gain technique to obtain, for each feature, a value indicating how well selecting that feature distinguishes chargeback fraud users.
The first feature deletion step (S11) then deletes any feature whose information gain is below a given threshold, because such a feature is not needed to distinguish chargeback fraud users.
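The information-gain filter described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the embodiment: the function names, the toy records, and the threshold of 0.1 are hypothetical, and every feature is treated as categorical (a continuous feature such as charge_amount would first need to be discretized).

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(values, labels):
    """Expected entropy reduction from splitting `labels` by a categorical feature."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[value].append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def drop_low_gain_features(rows, labels, threshold):
    """Return the names of the features whose information gain is at least `threshold`."""
    kept = []
    for name in rows[0]:
        gain = information_gain([row[name] for row in rows], labels)
        if gain >= threshold:
            kept.append(name)
    return kept


# Toy data invented for illustration: class 1 = chargeback fraud user.
rows = [
    {"payment_method_no": 1, "standard_country_code": "KR"},
    {"payment_method_no": 1, "standard_country_code": "US"},
    {"payment_method_no": 2, "standard_country_code": "KR"},
    {"payment_method_no": 2, "standard_country_code": "US"},
]
labels = [0, 0, 1, 1]
kept_features = drop_low_gain_features(rows, labels, threshold=0.1)
```

In this toy data payment_method_no separates the two classes perfectly (gain of 1 bit) and is kept, while standard_country_code carries no information about the class (gain of 0) and is dropped.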
The first feature deletion step (S11) also evaluates multi-feature sets of the transaction history data using principal component analysis.
Principal component analysis reduces high-dimensional data to low-dimensional data by finding the principal components of the distributed data. That is, the first feature deletion step (S11) uses principal component analysis to extract the multi-feature sets, corresponding to principal components, that better distinguish chargeback fraud users. Here, a multi-feature set of the transaction history data is a combination of two or more features, for example {user_no, standard_country_code}, {user_no, charge_status}, … {user_no, standard_country_code, charge_status}, {user_no, standard_country_code, charge_no}, … and so on.
The first feature deletion step (S11) then deletes the features contained in multi-feature sets that fall below a given criterion, that is, sets that do not correspond to a principal component, because such features are not needed to distinguish chargeback fraud users.
The feature generation step (S12) generates new features while processing the transaction history data that has passed through the first feature deletion step into one record per user by statistical methods. The statistical methods may include, for example, count, sum, difference, mean, standard deviation, maximum, minimum, date statistics, and time statistics computed over the data, but are not limited thereto.
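As a rough sketch of the feature generation step, the per-transaction rows can be grouped by user_no and collapsed into one record per user. The function name and the toy transactions are hypothetical, and the use of the population standard deviation is an assumption; the output field names follow the generated features listed in Table 2.

```python
import statistics
from collections import defaultdict


def build_user_records(transactions):
    """Collapse per-transaction rows into one record per user_no."""
    by_user = defaultdict(list)
    for tx in transactions:
        by_user[tx["user_no"]].append(tx)

    records = []
    for user_no, txs in by_user.items():
        amounts = [tx["charge_amount"] for tx in txs]
        records.append({
            "user_no": user_no,
            "transaction_cnt_sum": len(txs),
            "charge_amount_sum": sum(amounts),
            "charge_amount_avg": statistics.mean(amounts),
            # Population standard deviation (an assumption); it is defined
            # even when a user has only a single transaction.
            "charge_amount_stddev": statistics.pstdev(amounts),
        })
    return records


# Toy transactions invented for illustration.
transactions = [
    {"user_no": 1, "charge_amount": 10},
    {"user_no": 1, "charge_amount": 30},
    {"user_no": 2, "charge_amount": 5},
]
records = build_user_records(transactions)
```

Each user ends up with exactly one record regardless of how many transactions that user made, which is what allows users with different transaction counts to be compared by the classifier.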
The second feature deletion step (S13) evaluates the generated features and deletes those that fall below the evaluation criterion. In the second feature deletion step (S13), the information gain technique is used to evaluate each feature of the transaction history data.
The second feature deletion step (S13) then deletes any feature whose information gain is below a given threshold, because such a feature is not needed to distinguish chargeback fraud users.
Table 2 shows the features obtained by processing the features of the transaction history data in Table 1 through the data processing step (S10), the first feature deletion step (S11), and the second feature deletion step (S13).
[Table 2]
No. | Feature | Description
1 | user_no | User identifier
2 | standard_country_code | User's country code
3 | standard_country_code_kind | Kind of the user's country code
4 | charge_stat10 | User's charge count is 10 or fewer
5 | charge_stat20 | User's charge count is 20 or fewer
6 | charge_stat30 | User's charge count is 30 or fewer
7 | payment_method_no | Most recent payment method
8 | payment_method_no_kind | Kind of payment method
9 | charge_amount_sum | Total charge amount
10 | charge_amount_avg | Average charge amount
11 | charge_amount_stddev | Standard deviation of the charge amount
12 | bonus_amount_sum | Total bonus amount
13 | bonus_amount_avg | Average bonus amount
14 | bonus_amount_stddev | Standard deviation of the bonus amount
15 | transaction_recent_monthday | Date of the most recent transaction
16 | transaction_recent_hour | Time of the most recent transaction
17 | transaction_cnt_sum | Total number of transactions
18 | transaction_cnt_1_month | Number of transactions during 1 month
19 | transaction_cnt_2_month | Number of transactions during 2 months, excluding the most recent 1 month
20 | transaction_cnt_3_month | Number of transactions during 3 months, excluding the most recent 2 months
21 | transaction_cnt_6_month | Number of transactions during 6 months, excluding the most recent 3 months
22 | transaction_cnt_else | Total number of transactions, excluding the most recent 6 months
23 | charge_product_name | Payment gateway
24 | charge_product_name_kind | Kind of payment gateway
25 | ip_addr | IP address
26 | ip_addr_kind | Kind of IP address
27 | class | 0: normal user, 1: chargeback fraud user
That is, standard_country_code_kind was generated from standard_country_code; charge_stat10, charge_stat20, and charge_stat30 were generated from charge_status; payment_method_no_kind was generated from payment_method_no; and charge_amount_sum, charge_amount_avg, charge_amount_stddev, bonus_amount_sum, bonus_amount_avg, and bonus_amount_stddev were generated from charge_amount and bonus_amount. In addition, transaction_recent_monthday, transaction_recent_hour, transaction_cnt_sum, transaction_cnt_1_month, transaction_cnt_2_month, transaction_cnt_3_month, transaction_cnt_6_month, and transaction_cnt_else were generated from datetime; charge_product_name_kind was generated from charge_product_name; and ip_addr_kind was generated from ip_addr. Further, charge_no and hash_ip were deleted, and class was added to distinguish the two kinds of user. The class feature may also be included in Table 1 from the outset.
For reference, when the information gain values of the features in Table 2 were obtained using a decision tree (DT) based ClassifierSubsetEval attribute evaluator together with a genetic algorithm, the features numbered 4, 5, 7, 8, 10, 11, 12, 17, 18, 19, and 20, namely charge_stat10, charge_stat20, payment_method_no, payment_method_no_kind, charge_amount_avg, charge_amount_stddev, bonus_amount_sum, transaction_cnt_sum, transaction_cnt_1_month, transaction_cnt_2_month, and transaction_cnt_3_month, were extracted as features with relatively high information gain.
Next, the data classification step (S20) divides the processed transaction history data into training data and test data. The training data is used to train the particular machine learning technique applied later, and the test data is used to test the performance of the trained machine learning model.
Table 3 shows various dataset types for dividing the processed transaction history data into training data and test data.
[Table 3]
User type | 66% split | 10-fold | 50% split
Normal user | 21,113 | 62,092 | 31,046
Chargeback fraud user | 125 | 372 | 186
For example, in the 66% split, 66% of the transaction history data is used as training data and the remaining 34% as test data. In the 10-fold case, 9/10 of the transaction history data is used as training data and the remaining 1/10 as test data, and cross-validation is performed. In the 50% split, the transaction history data is divided half and half into training data and test data, with the division performed by StratifiedFolds preprocessing.
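A stratified hold-out split of the kind used for the 50% split can be sketched as follows. This is an illustrative stand-in written in plain Python, not the StratifiedFolds preprocessing named in the text; the function name, the seed, and the toy records are assumptions.

```python
import random


def stratified_split(records, train_fraction, seed=0):
    """Split records into (train, test) while keeping the class ratio in both parts."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted({r["class"] for r in records}):
        group = [r for r in records if r["class"] == cls]
        rng.shuffle(group)
        cut = round(len(group) * train_fraction)
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test


# Toy data invented for illustration: 100 normal users and 10 fraud users.
records = [{"user_no": i, "class": 0} for i in range(100)]
records += [{"user_no": 100 + i, "class": 1} for i in range(10)]
train, test = stratified_split(records, train_fraction=0.5)
```

Splitting each class separately matters here because fraud users are rare; a purely random split could leave very few of them in one of the two parts.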
Next, the data adjustment step (S30) adjusts the amount of data by oversampling the chargeback fraud user data within the training data. Because there are far fewer transaction history records for chargeback fraud users than for normal users, a machine learning model trained on the unadjusted training data may perform poorly. Oversampling the chargeback fraud user data within the training data in the data adjustment step (S30) can therefore improve the performance of the machine learning model. Specific experimental examples of this improvement are described later.
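One simple way to realize the data adjustment step is random oversampling with replacement, sketched below. The text does not say which oversampling technique is used, so this choice is an assumption, as are the function name and the interpretation of the rate: here a rate of 100% adds as many extra fraud records as the training data already contains, 200% adds twice as many, and so on.

```python
import random


def oversample_minority(train, rate_percent, seed=0):
    """Append randomly duplicated class-1 records to the training data.

    rate_percent is interpreted (as an assumption) as the number of extra
    records relative to the original number of class-1 records.
    """
    rng = random.Random(seed)
    minority = [r for r in train if r["class"] == 1]
    extra_count = len(minority) * rate_percent // 100
    extras = [dict(rng.choice(minority)) for _ in range(extra_count)]
    return train + extras


# Toy training data invented for illustration.
train = [{"user_no": i, "class": 0} for i in range(100)]
train += [{"user_no": 100 + i, "class": 1} for i in range(10)]
adjusted = oversample_minority(train, rate_percent=200)
```

Only the training data is adjusted; the test data keeps its original class ratio so that the measured performance reflects realistic conditions.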
Next, in the prediction classification step (S40), the training data whose amount has been adjusted is used to train a particular, previously selected machine learning technique. Machine learning here covers a variety of algorithms, including supervised learning, unsupervised learning, and semi-supervised learning, and is not particularly limited. Supervised learning may include a support vector machine (SVM), a hidden Markov model, regression, a neural network, naive Bayes classification, and the like.
The prediction classification step (S40) then uses the machine learning model trained on the training data to predict and classify whether each item of test data corresponds to a chargeback fraud user.
Next, the performance measurement step (S50) measures the performance of the prediction classification. That is, it measures how accurately the machine learning model classified the test data in the prediction classification step (S40). The performance measurement step (S50) may measure this performance using a confusion matrix.
Table 4 shows the confusion matrix.
[Table 4]
Actual result \ Prediction | True | False
True | True Positives (TP) | False Negatives (FN)
False | False Positives (FP) | True Negatives (TN)
TP denotes the case where the machine learning model classified an item of test data as a chargeback fraud user and the user actually is one, and TN denotes the case where the model classified an item as a normal user and the user actually is one. FP denotes the case where the model classified an item as a chargeback fraud user but the user is actually a normal user, and FN denotes the case where the model classified an item as a normal user but the user is actually a chargeback fraud user.
That is, in the performance measurement step (S50), the numbers of prediction classification results the machine learning model produced for the test data are tallied according to the confusion matrix of Table 4. The performance measurement step (S50) then calculates the values of the performance indicators. The performance indicators measure the classification accuracy of the machine learning model used in the prediction classification step (S40) and may include Recall (= TPR), Precision, F-measure, ROC curve, and the like, as listed in Table 5, but are not limited thereto; any measure of data classification accuracy may serve as a performance indicator without limitation.
Table 5 shows the performance indicators used to measure the performance of the machine learning model that performed the prediction classification in the prediction classification step (S40).
[Table 5]
Performance indicator | Calculation
Recall = TPR (True Positive Rate) | Recall = TPR = TP / (TP + FN)
FPR (False Positive Rate) | FPR = FP / (FP + TN)
Precision | Precision = TP / (TP + FP)
F-measure | F-measure = 2 × Precision × Recall / (Precision + Recall)
ROC curve area (receiver operating characteristic) | Area under the curve with FPR on the X axis and TPR on the Y axis
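The confusion-matrix counts and the indicators derived from them can be computed as in the following sketch. The function names and the toy labels are hypothetical. The F-measure is taken here to be the harmonic mean of Precision and Recall, an assumption that is consistent with the values reported in Tables 6 and 7 (for example, Precision 0.841 and Recall 0.552 give about 0.667). The ROC curve area is omitted because it requires classifier scores rather than hard labels.

```python
def confusion_counts(actual, predicted):
    """Count TP, FN, FP, TN, treating class 1 (chargeback fraud user) as positive."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1
        elif a == 1:
            fn += 1
        elif p == 1:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


def performance_indicators(tp, fn, fp, tn):
    """Recall (TPR), FPR, Precision, and F-measure from confusion-matrix counts."""
    recall = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {"recall": recall, "fpr": fpr, "precision": precision, "f_measure": f_measure}


# Toy labels invented for illustration.
actual = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]
counts = confusion_counts(actual, predicted)
indicators = performance_indicators(*counts)
```

For fraud detection, Recall on class 1 is the indicator of most interest, since a false negative means a fraud user went undetected.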
Tables 6 and 7 show the measured prediction classification performance when the prediction classification step (S40) and the performance measurement step (S50) were carried out on each dataset type in Table 3 using the decision tree (DT) and support vector machine (SVM) machine learning techniques. To compare the effect of the data adjustment step (S30) directly, Table 6 shows the results of performing the prediction classification step (S40) and the performance measurement step (S50) with the data adjustment step (S30) omitted, and Table 7 shows the results of performing the data adjustment step (S30), the prediction classification step (S40), and the performance measurement step (S50).
[Table 6]
Algorithm | Dataset type | TPR | FPR | Precision | Recall | F-measure | ROC Area | Class
DT | 66% split | 0.552 | 0.001 | 0.841 | 0.552 | 0.667 | 0.828 | 1
DT | 66% split | 0.999 | 0.448 | 0.997 | 0.999 | 0.998 | 0.828 | 0
DT | 10-fold | 0.530 | 0.001 | 0.853 | 0.530 | 0.653 | 0.877 | 1
DT | 10-fold | 0.999 | 0.470 | 0.997 | 0.999 | 0.998 | 0.877 | 0
DT | 50% split | 0.516 | 0.001 | 0.787 | 0.516 | 0.623 | 0.808 | 1
DT | 50% split | 0.999 | 0.484 | 0.997 | 0.999 | 0.998 | 0.808 | 0
SVM | 66% split | 0.544 | 0.000 | 0.883 | 0.544 | 0.673 | 0.772 | 1
SVM | 66% split | 1.000 | 0.456 | 0.997 | 1.000 | 0.998 | 0.772 | 0
SVM | 10-fold | 0.573 | 0.000 | 0.914 | 0.573 | 0.704 | 0.786 | 1
SVM | 10-fold | 1.000 | 0.427 | 0.997 | 1.000 | 0.999 | 0.786 | 0
SVM | 50% split | 0.570 | 0.001 | 0.869 | 0.570 | 0.688 | 0.785 | 1
SVM | 50% split | 0.999 | 0.430 | 0.997 | 0.999 | 0.998 | 0.785 | 0
[Table 7]
Oversampling rate (%) | Algorithm | TPR | FPR | Precision | Recall | F-measure | ROC Area | Class
100 | DT | 0.661 | 0.001 | 0.909 | 0.661 | 0.766 | 0.896 | 1
100 | DT | 0.999 | 0.339 | 0.996 | 0.999 | 0.998 | 0.896 | 0
100 | SVM | 0.820 | 0.001 | 0.941 | 0.820 | 0.876 | 0.910 | 1
100 | SVM | 0.999 | 0.180 | 0.998 | 0.999 | 0.999 | 0.910 | 0
200 | DT | 0.677 | 0.001 | 0.941 | 0.677 | 0.788 | 0.924 | 1
200 | DT | 0.999 | 0.323 | 0.994 | 0.999 | 0.997 | 0.924 | 0
200 | SVM | 0.887 | 0.001 | 0.964 | 0.887 | 0.924 | 0.943 | 1
200 | SVM | 0.999 | 0.113 | 0.998 | 0.999 | 0.999 | 0.943 | 0
300 | DT | 0.792 | 0.002 | 0.944 | 0.792 | 0.861 | 0.920 | 1
300 | DT | 0.998 | 0.208 | 0.993 | 0.998 | 0.995 | 0.920 | 0
300 | SVM | 0.948 | 0.001 | 0.980 | 0.948 | 0.964 | 0.974 | 1
300 | SVM | 0.999 | 0.052 | 0.998 | 0.999 | 0.999 | 0.974 | 0
Looking at the Recall indicator for chargeback fraud users (Class = 1) in Tables 6 and 7, the support vector machine (SVM) shows better prediction classification performance than the decision tree (DT), and performing the data adjustment step (S30) improves the prediction classification performance further.
The iteration step (S60) repeats the prediction classification step (S40) and the performance measurement step (S50), oversampling or undersampling the chargeback fraud user data within the training data each time, until the prediction classification performance reaches the target value. If the oversampling rate is so high that the amount of training data grows excessively, training in the prediction classification step (S40) may become overloaded, so the iteration step (S60) may apply undersampling to the training data in addition to oversampling. The iteration step (S60) may also apply undersampling in order to reach the target value more precisely.
The oversampling or undersampling rate used in the iteration step (S60) may be regular or arbitrary and is not particularly limited. For example, the oversampling rate at the n-th iteration may be set to A × B^n (where A and B are natural numbers and n is an integer). After the n-th iteration, the undersampling rate at the m-th iteration may be set to A × B^n − (C × m) (where A, B, and C are natural numbers, n and m are integers, and n < m).
For instance, suppose that "Recall of 0.940 or higher" is set as the performance target for the support vector machine (SVM). The prediction classification step (S40) and the performance measurement step (S50) are first performed with 100% oversampling. The resulting Recall is 0.820, which is below the target, so in the first iteration the oversampling rate is raised to 200% and the prediction classification step (S40) and the performance measurement step (S50) are performed again. The resulting Recall is 0.887, still below the target, so in the second iteration the oversampling rate is raised to 300% and both steps are performed again. The resulting Recall is 0.948, which meets the target. The method may then proceed directly to the prediction step (S70). Alternatively, the iteration step (S60) may undersample in a third iteration, that is, set the oversampling rate below 300%, for example to 280%, and perform the prediction classification step (S40) and the performance measurement step (S50) once more.
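The control flow of the iteration step can be sketched as a loop that tries successive oversampling rates until the target Recall is reached. The function name tune_oversampling and the evaluate_recall callback are hypothetical; in a real run the callback would carry out the data adjustment, prediction classification, and performance measurement steps (S30, S40, S50). Here it is replaced by a lookup of the class-1 SVM Recall values reported in Table 7, so no model is actually trained.

```python
def tune_oversampling(evaluate_recall, target_recall, rates):
    """Try oversampling rates in order; return the first that reaches the target.

    evaluate_recall(rate) stands for steps S30, S40, and S50 performed at the
    given oversampling rate and returns the measured class-1 Recall.
    """
    history = []
    for rate in rates:
        recall = evaluate_recall(rate)
        history.append((rate, recall))
        if recall >= target_recall:
            return rate, history
    return None, history


# Stand-in for retraining: class-1 SVM Recall values reported in Table 7.
reported_svm_recall = {100: 0.820, 200: 0.887, 300: 0.948}
best_rate, history = tune_oversampling(
    reported_svm_recall.__getitem__, target_recall=0.940, rates=[100, 200, 300]
)
```

With these values the loop stops at 300%, matching the example in the text. A further refinement pass at a lower rate such as 280% would need an actual retraining run, since no Recall is reported for that rate.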
The prediction step (S70) uses the machine learning model that has reached the target predictive classification performance to predict chargeback fraud from the transaction history data of a new user.
Meanwhile, the program for predicting a chargeback fraud user according to an embodiment of the present invention is a program stored in a medium in order to predict chargeback fraud users according to the above-described prediction method according to an embodiment of the present invention. For example, the program may be recorded on a recording medium readable by a computer or a similar device.
For example, the recording medium may be a buffer, a main memory, or an auxiliary memory made up of a hard disk type, a magnetic media type, a CD-ROM (compact disc read-only memory), an optical media type, a magneto-optical media type, a multimedia card micro type, a card-type memory (for example, SD or XD memory), a flash memory type, a ROM (read-only memory), a RAM (random access memory), or a memory composed of a combination of these, but is not limited thereto.
In addition, the program may be stored in an attachable storage device that can access the input device through a communication network such as the Internet, an intranet, a LAN (Local Area Network), a WLAN (Wide LAN), or a SAN (Storage Area Network), or through a communication network composed of a combination of these.
Although specific embodiments have been described in the detailed description of the present invention, various modifications are of course possible without departing from the scope of the present invention. The scope of the present invention should therefore not be limited to the described embodiments, but should be defined by the claims set out below and their equivalents.

Claims (5)

  1. A method for predicting a chargeback fraud user, the method comprising:
    a data processing step of processing transaction history data of past normal users and chargeback fraud users into one record per user;
    a data classification step of dividing the processed transaction history data into training data and test data;
    a data adjustment step of adjusting the amount of data by oversampling the data on chargeback fraud users in the training data;
    a predictive classification step of training a model with a specific machine learning technique on the adjusted training data, and using the trained machine learning model to predict and classify whether each user in the test data is a chargeback fraud user;
    a performance measurement step of measuring the performance of the predictive classification;
    a repetition step of repeating the predictive classification step and the performance measurement step while oversampling or undersampling the data on chargeback fraud users in the training data, until the performance of the predictive classification reaches a target value; and
    a prediction step of predicting chargeback fraud from the transaction history data of a new user, using the machine learning model that has reached the target predictive classification performance.
  2. The method of claim 1,
    wherein the data processing step comprises:
    a primary feature deletion step of evaluating each feature and each set of multiple features of the transaction history data and deleting features that fall short of a criterion;
    a feature generation step of generating new features while processing, by a statistical method, the transaction history data that has passed through the primary feature deletion step into one record per user; and
    a secondary feature deletion step of evaluating the generated features and deleting features that fall short of the evaluation criterion.
  3. The method of claim 2,
    wherein the primary feature deletion step comprises:
    evaluating each feature of the transaction history data using an information gain technique; and
    evaluating sets of multiple features of the transaction history data using a principal component analysis technique.
  4. The method of claim 1,
    wherein the performance measurement step
    measures the performance of the predictive classification using a confusion matrix.
  5. A program for predicting a chargeback fraud user, stored in a medium in order to predict chargeback fraud users according to the method of any one of claims 1 to 4.
PCT/KR2017/013539 2016-11-25 2017-11-24 Method and program for predicting chargeback fraud user WO2018097653A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR1020160158491A KR20180059203A (en) 2016-11-25 2016-11-25 Method and program for predicting chargeback fraud user
KR10-2016-0158491 2016-11-25

Publications (1)

Publication Number Publication Date
WO2018097653A1 (en) 2018-05-31

Family

ID=62195250

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2017/013539 WO2018097653A1 (en) 2016-11-25 2017-11-24 Method and program for predicting chargeback fraud user

Country Status (2)

Country Link
KR (1) KR20180059203A (en)
WO (1) WO2018097653A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20180128874A (en) * 2018-11-14 2018-12-04 주식회사 미탭스플러스 Apparatus and method of deposit of cryptocurrency exchange using transaction verification
KR102607383B1 (en) * 2021-01-05 2023-11-29 중소기업은행 Method for recognizing suspicious money laundering transactions and apparatus therefor

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20020086695A (en) * 2000-03-24 2002-11-18 알티코 인크. System and method for detecting fraudulent transactions
US20080288405A1 (en) * 2007-05-20 2008-11-20 Michael Sasha John Systems and Methods for Automatic and Transparent Client Authentication and Online Transaction Verification
US20120158540A1 (en) * 2010-12-16 2012-06-21 Verizon Patent And Licensing, Inc. Flagging suspect transactions based on selective application and analysis of rules
KR20160017629A (en) * 2014-08-06 2016-02-16 아마데우스 에스.에이.에스. Predictive fraud screening
US20160328715A1 (en) * 2015-05-06 2016-11-10 Forter Ltd. Gating decision system and methods for determining whether to allow material implications to result from online activities

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11151573B2 (en) * 2017-11-30 2021-10-19 Accenture Global Solutions Limited Intelligent chargeback processing platform
CN110675220A (en) * 2019-09-12 2020-01-10 深圳前海大数金融服务有限公司 Method, system and computer readable storage medium for identifying fraudulent user
CN114297054A (en) * 2021-12-17 2022-04-08 北京交通大学 Software defect number prediction method based on subspace mixed sampling
CN114297054B (en) * 2021-12-17 2023-06-30 北京交通大学 Software defect number prediction method based on subspace mixed sampling

Also Published As

Publication number Publication date
KR20180059203A (en) 2018-06-04

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17874986

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17874986

Country of ref document: EP

Kind code of ref document: A1