WO2018097653A1

WO2018097653A1 - Method and program for predicting chargeback fraud user

Info

Publication number: WO2018097653A1
Application number: PCT/KR2017/013539
Authority: WO
Inventors: 서재현; 최대선
Original assignee: 공주대학교 산학협력단
Priority date: 2016-11-25
Filing date: 2017-11-24
Publication date: 2018-05-31
Also published as: KR20180059203A

Abstract

The present invention relates to a method for predicting a chargeback fraud user, the method being characterized by comprising: a data processing step of processing existing transaction history data regarding normal users and chargeback fraud users into one record per user; a data classification step of dividing the processed transaction history data into training data and test data; a data adjustment step of oversampling data regarding chargeback fraud users among the training data, thereby adjusting the number of pieces of data; a prediction/classification step of conducting learning on the basis of a specific machine learning technique using the training data, the number of pieces of which has been adjusted, and predicting/classifying whether test data corresponds to a chargeback fraud user using the learned machine learning model; a performance measurement step of measuring performance regarding prediction/classification; an iteration step of oversampling or undersampling data regarding chargeback fraud users among the training data and repeating the prediction/classification step and the performance measurement step until the performance regarding prediction/classification reaches a target value; and a prediction step of predicting chargeback fraud with regard to a new user's transaction history data using a machine learning model that has reached the target prediction/classification performance.

Description

Method and program for predicting chargeback fraud users

The present invention relates to a method and program for predicting a chargeback fraud user. More particularly, the present invention relates to a method and program that process the transaction history data of existing users to implement a machine learning model satisfying a target prediction/classification performance, and that use the implemented model to predict chargeback fraud from a new user's transaction history data.

In recent years, as the use of electronic payment means is becoming more common, cases of chargeback fraud, which is a kind of fraud using electronic payment methods, have been increasing rapidly.

FIG. 1 shows a flowchart of chargeback fraud by a game user.

One example of chargeback fraud is that committed by online game users. As shown in FIG. 1, a user (game user) purchases game money or the like from a game company with a credit card, spends it, and then requests a chargeback from the bank. This is a serious problem because it can cause enormous damage to game companies and the like.

Therefore, there is a need to develop technology capable of preventing damage due to chargeback fraud by predicting chargeback fraud in advance.

[Prior literature]

(Patent Document 1) KR10-2016-0017629 A

In order to prevent damage due to chargeback fraud as described above, an object of the present invention is to provide a method and program for predicting a chargeback fraud user that process and use the transaction history data of existing users to implement a machine learning model satisfying a target prediction/classification performance, and that use the implemented model to predict chargeback fraud from a new user's transaction history data.

In order to solve the above problems, a method of predicting a chargeback fraud user according to an embodiment of the present invention includes: (1) a data processing step of processing transaction history data of existing normal users and chargeback fraud users into one record per user; (2) a data classification step of dividing the processed transaction history data into training data and test data; (3) a data adjustment step of adjusting the number of data by oversampling the data for chargeback fraud users among the training data; (4) a prediction/classification step of training by a specific machine learning method using the training data whose number has been adjusted, and predicting, using the learned machine learning model, whether the test data corresponds to a chargeback fraud user; (5) a performance measurement step of measuring the performance of the prediction/classification; (6) an iteration step of oversampling or undersampling the data for chargeback fraud users in the training data and repeating the prediction/classification step and the performance measurement step until the prediction/classification performance reaches a target value; and (7) a prediction step of predicting chargeback fraud from the transaction history data of a new user using the machine learning model that has reached the target prediction/classification performance.

The data processing step may include: (1) a first feature deletion step of evaluating each feature and a plurality of feature sets of the transaction history data and deleting features that fall below a criterion; (2) a feature generation step of generating new features while processing the transaction history data that has passed through the first feature deletion step into one record per user using a statistical method; and (3) a second feature deletion step of evaluating the generated features and deleting features that do not meet the evaluation criteria.

The first feature deletion step may include: (1) evaluating each feature of the transaction history data using an information gain technique; and (2) evaluating the plurality of feature sets of the transaction history data using a principal component analysis technique.

The performance measuring step may measure the performance of the prediction classification by using a confusion matrix.

In addition, a program for predicting a chargeback fraud user according to an embodiment of the present invention may be stored in a medium in order to predict chargeback fraud users according to the above-described method for predicting a chargeback fraud user.

The method and program for predicting a chargeback fraud user according to an embodiment of the present invention, configured as described above, can implement a machine learning model that satisfies the target prediction/classification performance and use it to predict chargeback fraud from a new user's transaction history data, and therefore have the advantage that damage due to chargeback fraud can be prevented in advance.

FIG. 1 shows a flow diagram of chargeback fraud.

FIG. 2 illustrates a method for predicting a chargeback fraud user according to an embodiment of the present invention.

FIG. 3 illustrates the data processing step (S10) of a method for predicting a chargeback fraud user according to an embodiment of the present invention.

The above objects, means, and effects will become more apparent from the following detailed description taken in conjunction with the accompanying drawings, so that those skilled in the art to which the present invention pertains will be able to readily practice the technical idea of the present invention. In addition, in describing the present invention, when it is determined that a detailed description of known technology related to the present invention may unnecessarily obscure the gist of the present invention, that detailed description will be omitted.

Also, the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the invention. In this specification, the singular forms also include the plural forms unless the context clearly indicates otherwise. As used herein, "comprises" and/or "comprising" does not exclude the presence or addition of one or more components other than the mentioned components. Unless otherwise defined, all terms used in the present specification may be used in the sense commonly understood by those skilled in the art. Moreover, terms defined in commonly used dictionaries are not to be interpreted in an idealized or excessive manner unless they are explicitly and specifically defined.

Hereinafter, exemplary embodiments of the present invention will be described in detail with reference to the accompanying drawings.

The method of predicting a chargeback fraud user according to an embodiment of the present invention may be performed by a computer. For example, the computer may be a desktop personal computer, a laptop personal computer, a netbook computer, a tablet personal computer, or the like, but is not limited thereto.

Specifically, as shown in FIG. 2, the method of predicting a chargeback fraud user according to an embodiment of the present invention includes a data processing step (S10), a data classification step (S20), a data adjustment step (S30), a prediction/classification step (S40), a performance measurement step (S50), an iteration step (S60), and a prediction step (S70).

The data processing step (S10) is a step of processing the transaction history data of existing users. The existing transaction history data includes the transaction details of normal users and of chargeback fraud users, and may be provided from a database, such as that of a game company, that stores and manages them. In particular, because the number of transactions differs from user to user, the data processing step (S10) processes the transaction history data into one record per user. The transaction history data includes a plurality of attributes that differ in data characteristics and physical form (record format, record length, etc.). Such an attribute of the data is hereinafter referred to as a "feature".

Table 1 shows the features of actual transaction history data stored in the database of a game company. The actual transaction history data provided by the game company included the transaction history data (hundreds of thousands of records) of 62,092 normal users and the transaction history data (thousands of records) of 372 chargeback fraud users.

For example, the transaction history data may include a plurality of features as shown in Table 1. That is, each transaction history record may include the features user_no, standard_country_code, charge_status, charge_no, payment_method_no, charge_amount, bonus_amount, datetime, charge_product_name, hash_ip, and ip_addr.

[Table 1]
No.	Feature	Description
1	user_no	User's identifier
2	standard_country_code	User's country code
3	charge_status	User's charge stage
4	charge_no	Charge identifier
5	payment_method_no	Payment method identifier
6	charge_amount	Charge amount
7	bonus_amount	Bonus amount
8	datetime	Transaction date and time
9	charge_product_name	Payment gateway name
10	hash_ip	User's IP address converted by a hash function
11	ip_addr	User's IP address

Specifically, as illustrated in FIG. 3, the data processing step S10 may include a first feature deletion step S11, a feature generation step S12, and a second feature deletion step S13.

The first feature deletion step (S11) is a step of evaluating each feature and a plurality of feature sets of the transaction history data and deleting features that do not meet the evaluation criteria. In the first feature deletion step (S11), each feature (e.g., user_no, standard_country_code, charge_status, charge_no, payment_method_no, charge_amount, bonus_amount, datetime, charge_product_name, hash_ip, and ip_addr) is evaluated using an information gain technique.

The information gain is an amount of reduction in entropy expected when one feature is selected, and the higher the value, the better the data can be distinguished. That is, in the first feature deletion step (S11), a value for the degree of discrimination of the chargeback fraud user according to the selection of each feature is obtained according to the information gain technique.

Subsequently, in the first feature deletion step S11, the corresponding feature having an information gain value less than a predetermined criterion is deleted. This is because a feature with an information gain value below a certain criterion corresponds to a feature that is not necessary to distinguish between chargeback fraud users.
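The information gain filtering described above can be sketched as follows. This is a minimal illustration only, assuming categorical feature values; the function names, the toy records, and the threshold of 0.1 are hypothetical and do not come from the specification.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the expected entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def drop_weak_features(rows, labels, threshold):
    """Return the feature names whose information gain reaches the threshold, plus all gains."""
    names = list(rows[0].keys())
    gains = {n: information_gain([r[n] for r in rows], labels) for n in names}
    return [n for n in names if gains[n] >= threshold], gains


# Toy, made-up records (class 1 = chargeback fraud user, 0 = normal user).
rows = [
    {"payment_method_no": 1, "standard_country_code": "KR"},
    {"payment_method_no": 1, "standard_country_code": "US"},
    {"payment_method_no": 2, "standard_country_code": "KR"},
    {"payment_method_no": 2, "standard_country_code": "US"},
]
labels = [1, 1, 0, 0]
kept, gains = drop_weak_features(rows, labels, threshold=0.1)
```

In this toy data payment_method_no separates the two classes perfectly and is kept, while standard_country_code carries no information about the class and is dropped.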

In addition, in the first feature deletion step S11, a plurality of feature sets of transaction history data are evaluated using a principal component analysis technique.

Principal component analysis is a technique for reducing high-dimensional data to low-dimensional data by finding the principal components of the data distribution. That is, in the first feature deletion step (S11), the feature sets corresponding to principal components for distinguishing chargeback fraud users can be extracted using the principal component analysis technique. Here, each feature set of the transaction history data is a combination of two or more features, for example {user_no, standard_country_code}, {user_no, charge_status}, ..., {user_no, standard_country_code, charge_status}, {user_no, standard_country_code, charge_no}, and so on.

Subsequently, in the first feature deletion step (S11), features included in feature sets that fall below a predetermined criterion, that is, that do not correspond to a principal component, are deleted. This is because such features are not needed to distinguish chargeback fraud users.

The feature generation step (S12) is a step of generating new features while processing the transaction history data that has passed through the first feature deletion step into one record per user using a statistical method. For example, the statistical method may include, but is not limited to, the count, sum, difference, average, standard deviation, maximum value, minimum value, date statistics, and time statistics of the data.

The second feature deletion step (S13) is a step of evaluating the generated features and deleting features that do not meet the evaluation criteria. In the second feature deletion step (S13), each feature of the transaction history data is evaluated using the information gain technique.
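A minimal sketch of this per-user aggregation is given below. It assumes the transactions are available as dictionaries keyed by the Table 1 feature names; the function name and sample rows are hypothetical, and reading the "_kind" suffix as "number of distinct values" is an assumption rather than something the specification states.

```python
import statistics
from collections import defaultdict


def aggregate_per_user(transactions):
    """Collapse many transaction rows into one feature record per user_no."""
    by_user = defaultdict(list)
    for tx in transactions:
        by_user[tx["user_no"]].append(tx)

    records = {}
    for user_no, txs in by_user.items():
        amounts = [tx["charge_amount"] for tx in txs]
        records[user_no] = {
            "transaction_cnt_sum": len(txs),
            "charge_amount_sum": sum(amounts),
            "charge_amount_avg": statistics.mean(amounts),
            # Population standard deviation; a single transaction yields 0.
            "charge_amount_stddev": statistics.pstdev(amounts),
            # Assumption: "_kind" is read as the number of distinct values.
            "payment_method_no_kind": len({tx["payment_method_no"] for tx in txs}),
        }
    return records


# Made-up sample transactions for two users.
transactions = [
    {"user_no": 1, "charge_amount": 10, "payment_method_no": 3},
    {"user_no": 1, "charge_amount": 30, "payment_method_no": 4},
    {"user_no": 2, "charge_amount": 50, "payment_method_no": 3},
]
features = aggregate_per_user(transactions)
```

Each user ends up with exactly one record regardless of how many transactions they made, which is what lets users with different transaction counts be fed to the same classifier.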

Subsequently, in the second feature deletion step (S13), the corresponding feature having an information gain value less than a predetermined criterion is deleted. This is because a feature with an information gain value below a certain criterion corresponds to a feature that is not necessary to distinguish between chargeback fraud users.

Table 2 shows the features obtained after the transaction history data shown in Table 1 has passed through the data processing step (S10), that is, the first feature deletion step (S11), the feature generation step (S12), and the second feature deletion step (S13).

[Table 2]
No.	Feature	Description
1	user_no	User's identifier
2	standard_country_code	User's country code
3	standard_country_code_kind	Type of user's country code
4	charge_stat10	User's number of charges is 10 or fewer
5	charge_stat20	User's number of charges is 20 or fewer
6	charge_stat30	User's number of charges is 30 or fewer
7	payment_method_no	Most recent payment method
8	payment_method_no_kind	Type of payment method
9	charge_amount_sum	Total charge amount
10	charge_amount_avg	Average charge amount
11	charge_amount_stddev	Standard deviation of charge amount
12	bonus_amount_sum	Total bonus amount
13	bonus_amount_avg	Average bonus amount
14	bonus_amount_stddev	Standard deviation of bonus amount
15	transaction_recent_monthday	Date of last transaction
16	transaction_recent_hour	Time of last transaction
17	transaction_cnt_sum	Total number of transactions
18	transaction_cnt_1_month	Number of transactions in the last 1 month
19	transaction_cnt_2_month	Number of transactions in the last 2 months, excluding the last 1 month
20	transaction_cnt_3_month	Number of transactions in the last 3 months, excluding the last 2 months
21	transaction_cnt_6_month	Number of transactions in the last 6 months, excluding the last 3 months
22	transaction_cnt_else	Total number of transactions, excluding the last 6 months
23	charge_product_name	Payment gateway
24	charge_product_name_kind	Type of payment gateway
25	ip_addr	IP address
26	ip_addr_kind	Type of IP address
27	class	0: normal user, 1: chargeback fraud user

That is, standard_country_code_kind is additionally created from standard_country_code; charge_stat10, charge_stat20, and charge_stat30 from charge_status; payment_method_no_kind from payment_method_no; charge_amount_sum, charge_amount_avg, and charge_amount_stddev from charge_amount; and bonus_amount_sum, bonus_amount_avg, and bonus_amount_stddev from bonus_amount. In addition, transaction_recent_monthday, transaction_recent_hour, transaction_cnt_sum, transaction_cnt_1_month, transaction_cnt_2_month, transaction_cnt_3_month, transaction_cnt_6_month, and transaction_cnt_else are additionally created from datetime; charge_product_name_kind from charge_product_name; and ip_addr_kind from ip_addr. Furthermore, charge_no and hash_ip are deleted, and class is added to distinguish users. The class feature may instead be included from the beginning in Table 1.

For reference, the information gain values of the features in Table 2 were determined using a ClassifierSubsetEval attribute evaluator based on a decision tree (DT) and a genetic algorithm. The evaluation identified features No. 4, 5, 7, 8, 10, 11, 12, 17, 18, 19, and 20, which correspond to charge_stat10, charge_stat20, payment_method_no, payment_method_no_kind, charge_amount_avg, charge_amount_stddev, bonus_amount_sum, transaction_cnt_sum, transaction_cnt_1_month, transaction_cnt_2_month, and transaction_cnt_3_month, respectively.

Next, the data classification step (S20) is a step of dividing the processed transaction history data into training data and test data. The training data is used to train the specific machine learning model used later, and the test data is used to test the performance of the trained machine learning model.

Table 3 shows the various dataset types for dividing processed transaction history data into training data and test data.

[Table 3]
	66% split	10-fold	50% split
Normal user	21,113	62,092	31,046
Chargeback fraud user	125	372	186

For example, the 66% split uses 66% of the transaction history data as training data and the remaining 34% as test data. 10-fold is the case in which the transaction history data is divided into ten parts, with 9/10 used as training data and 1/10 as test data, and cross-validation is performed. The 50% split uses 50% of the transaction history data as training data and 50% as test data, with the data divided by StratifiedFolds preprocessing.
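The percentage splits above can be illustrated with a small stratified split, which keeps the proportion of chargeback fraud users roughly equal in both parts. This is a sketch under assumptions: it is not the StratifiedFolds preprocessing named in the text, and the function name, seed, and toy records are hypothetical.

```python
import random


def stratified_split(records, train_fraction, seed=0):
    """Split records into train/test while keeping each class's share similar.

    Each record is a dict with a "class" key (0 = normal, 1 = chargeback fraud).
    """
    rng = random.Random(seed)
    by_class = {}
    for record in records:
        by_class.setdefault(record["class"], []).append(record)

    train, test = [], []
    for group in by_class.values():
        shuffled = group[:]
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * train_fraction)
        train.extend(shuffled[:cut])
        test.extend(shuffled[cut:])
    return train, test


# Made-up toy data: 100 normal users and 10 chargeback fraud users.
records = [{"user_no": i, "class": 0} for i in range(100)]
records += [{"user_no": 100 + i, "class": 1} for i in range(10)]
train, test = stratified_split(records, train_fraction=0.5)
```

Splitting each class separately matters here because fraud users are so rare that a purely random split could leave very few of them in one of the two parts.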

Next, the data adjustment step (S30) is a step of adjusting the number of data by oversampling the data for chargeback fraud users among the training data. Because far less transaction history data exists for chargeback fraud users than for normal users, the performance of a machine learning model learned from the unadjusted training data may be degraded. Accordingly, the performance of the machine learning model may be improved by oversampling the data for chargeback fraud users in the training data through the data adjustment step (S30). Specific experimental examples of improving the performance of the machine learning model through the data adjustment step (S30) are described later.
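One simple way to oversample is to duplicate randomly chosen minority records, sketched below. The specification does not say which oversampling technique is used, so random duplication is an assumption, as is the reading of the percentage (100% adds as many extra fraud records as already exist); the function name and toy data are hypothetical.

```python
import random


def oversample_minority(train, percent, minority_class=1, seed=0):
    """Return train plus `percent`% extra minority records drawn with replacement.

    Under the assumed reading, percent=100 doubles the minority class,
    percent=200 triples it, and so on.
    """
    rng = random.Random(seed)
    minority = [r for r in train if r["class"] == minority_class]
    extra_count = len(minority) * percent // 100
    extras = [dict(rng.choice(minority)) for _ in range(extra_count)]
    return train + extras


# Made-up toy training set: 8 normal users and 2 chargeback fraud users.
train_set = [{"user_no": i, "class": 0} for i in range(8)]
train_set += [{"user_no": 8 + i, "class": 1} for i in range(2)]
adjusted = oversample_minority(train_set, percent=200)
```

Only the training data is adjusted; the test data keeps its original class balance so that the measured performance reflects realistic conditions.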

Next, in the prediction/classification step (S40), the training data whose number of data has been adjusted is used to train a model by a predetermined machine learning technique. Machine learning here includes various approaches such as supervised learning, unsupervised learning, and semi-supervised learning, and is not particularly limited. Supervised learning may include a support vector machine (SVM), a hidden Markov model, regression, a neural network, naive Bayes classification, and the like.

Subsequently, in the prediction/classification step (S40), the machine learning model trained on the training data is used to predict whether each item of test data corresponds to a chargeback fraud user.

Next, the performance measurement step S50 is a step of measuring the performance for the prediction classification. That is, the performance measurement step S50 measures performance indicating the accuracy of the test data predicted and classified by the machine learning model in the prediction classification step S40. In this case, the performance measurement step S50 may measure the performance of the prediction classification by using a confusion matrix.

Table 4 shows the confusion matrix.

[Table 4]
Actual result \ Prediction	True	False
True	True Positives (TP)	False Negatives (FN)
False	False Positives (FP)	True Negatives (TN)

TP is the case in which the machine learning model predicts a chargeback fraud user for an item of test data and the user is actually a chargeback fraud user, and TN is the case in which the model predicts a normal user and the user is actually a normal user. FP is the case in which the model predicts a chargeback fraud user but the user is actually a normal user, and FN is the case in which the model predicts a normal user but the user is actually a chargeback fraud user.

That is, in the performance measurement step (S50), the prediction results of the machine learning model for each item of test data are tallied according to the confusion matrix of Table 4. The performance measurement step (S50) then calculates the values of the performance indicators. The performance indicators measure the classification accuracy of the machine learning model used in the prediction/classification step (S40) and may include Recall (= TPR), Precision, F-measure, the ROC curve, and the like, but are not limited thereto; any indicator capable of measuring data classification accuracy may be used.
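The four cases can be tallied from paired actual and predicted labels as in the following sketch. The function name and the sample label lists are hypothetical; class 1 is treated as the chargeback fraud user, matching the class encoding in Table 2.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, FN, FP, TN, treating `positive` as the chargeback fraud class."""
    counts = {"TP": 0, "FN": 0, "FP": 0, "TN": 0}
    for truth, guess in zip(actual, predicted):
        if truth == positive:
            counts["TP" if guess == positive else "FN"] += 1
        else:
            counts["FP" if guess == positive else "TN"] += 1
    return counts


# Made-up labels: three fraud users followed by five normal users.
actual = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]
counts = confusion_counts(actual, predicted)
```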

Table 5 shows each performance index for measuring the performance of the machine learning model predicted and classified in the prediction classification step (S40).

[Table 5]
Performance indicator	Calculation method
Recall = TPR (True Positive Rate)	Recall = TPR = TP / (TP + FN)
FPR (False Positive Rate)	FPR = FP / (FP + TN)
Precision	Precision = TP / (TP + FP)
F-measure	F-measure = 2 × Precision × Recall / (Precision + Recall)
ROC curve area (receiver operating characteristic)	Area under the curve with FPR on the X axis and TPR on the Y axis
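The indicators of Table 5 can be computed from the confusion matrix counts as sketched below. The F-measure is taken here as the harmonic mean of precision and recall, an assumption that is consistent with the reported decision tree results for the 66% split (precision 0.841, recall 0.552, F-measure 0.667). The function name and the sample counts are hypothetical, and the ROC curve area is not covered.

```python
def classification_metrics(tp, fn, fp, tn):
    """Recall (TPR), FPR, precision and F-measure from confusion-matrix counts."""
    recall = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + recall
    # Assumed definition: harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {"recall": recall, "fpr": fpr, "precision": precision, "f_measure": f_measure}


# Made-up counts: 2 true positives, 1 false negative, 1 false positive, 4 true negatives.
metrics = classification_metrics(tp=2, fn=1, fp=1, tn=4)
```

Each ratio falls back to 0.0 when its denominator is zero, which can happen on small test sets that contain no fraud users or where the model never predicts fraud.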

Tables 6 and 7 show the prediction/classification performance measured by performing the prediction/classification step (S40) and the performance measurement step (S50) for each dataset type shown in Table 3, using the decision tree (DT) and support vector machine (SVM) machine learning techniques. To allow a direct comparison of the effect of the data adjustment step (S30), Table 6 shows the results of performing the prediction/classification step (S40) and the performance measurement step (S50) without the data adjustment step (S30), and Table 7 shows the results of performing the data adjustment step (S30), the prediction/classification step (S40), and the performance measurement step (S50).

[Table 6]
Algorithm	Dataset type	TPR	FPR	Precision	Recall	F-measure	ROC Area	Class
DT	66% split	0.552	0.001	0.841	0.552	0.667	0.828	1
DT	66% split	0.999	0.448	0.997	0.999	0.998	0.828	0
DT	10-fold	0.530	0.001	0.853	0.530	0.653	0.877	1
DT	10-fold	0.999	0.470	0.997	0.999	0.998	0.877	0
DT	50% split	0.516	0.001	0.787	0.516	0.623	0.808	1
DT	50% split	0.999	0.484	0.997	0.999	0.998	0.808	0
SVM	66% split	0.544	0.000	0.883	0.544	0.673	0.772	1
SVM	66% split	1.000	0.456	0.997	1.000	0.998	0.772	0
SVM	10-fold	0.573	0.000	0.914	0.573	0.704	0.786	1
SVM	10-fold	1.000	0.427	0.997	1.000	0.999	0.786	0
SVM	50% split	0.570	0.001	0.869	0.570	0.688	0.785	1
SVM	50% split	0.999	0.430	0.997	0.999	0.998	0.785	0

[Table 7]
Oversampling rate (%)	Algorithm	TPR	FPR	Precision	Recall	F-measure	ROC Area	Class
100	DT	0.661	0.001	0.909	0.661	0.766	0.896	1
100	DT	0.999	0.339	0.996	0.999	0.998	0.896	0
100	SVM	0.820	0.001	0.941	0.820	0.876	0.910	1
100	SVM	0.999	0.180	0.998	0.999	0.999	0.910	0
200	DT	0.677	0.001	0.941	0.677	0.788	0.924	1
200	DT	0.999	0.323	0.994	0.999	0.997	0.924	0
200	SVM	0.887	0.001	0.964	0.887	0.924	0.943	1
200	SVM	0.999	0.113	0.998	0.999	0.999	0.943	0
300	DT	0.792	0.002	0.944	0.792	0.861	0.920	1
300	DT	0.998	0.208	0.993	0.998	0.995	0.920	0
300	SVM	0.948	0.001	0.980	0.948	0.964	0.974	1
300	SVM	0.999	0.052	0.998	0.999	0.999	0.974	0

Referring to the Recall indicator for chargeback fraud users (Class = 1) in Tables 6 and 7, the support vector machine (SVM) shows better prediction/classification performance than the decision tree (DT) in most cases (the exception being the 66% split in Table 6, 0.544 versus 0.552), and performing the data adjustment step (S30) yields better prediction/classification performance than omitting it.

The iteration step (S60) is a step of repeating the prediction/classification step (S40) and the performance measurement step (S50) while oversampling or undersampling the data for chargeback fraud users in the training data, until the prediction/classification performance reaches the target value. If the oversampling ratio is too high and the amount of training data grows excessively, training in the prediction/classification step (S40) may become overloaded; the iteration step (S60) may therefore perform undersampling in addition to oversampling of the training data. Undersampling may also be performed in the iteration step (S60) to approach the target value more precisely.

The ratio of oversampling or undersampling used in the iteration step (S60) may be regular or arbitrary, and is not particularly limited. For example, the oversampling ratio at the n-th iteration can be defined as A × Bⁿ (where A and B are natural numbers and n is an integer). After the n-th iteration, the undersampling ratio at the m-th iteration can be defined as A × Bⁿ − (C × m) (where A, B, and C are natural numbers, n and m are integers, and n < m).

For example, suppose that "a Recall of 0.940 or more" is set as the performance target for the support vector machine (SVM). When the prediction/classification step (S40) and the performance measurement step (S50) are performed with 100% oversampling, Recall is 0.820, which is below the target, so the oversampling is raised to 200% in the first iteration and the prediction/classification step (S40) and the performance measurement step (S50) are performed again. Recall is then 0.887, still below the target, so the oversampling is raised to 300% in the second iteration and the two steps are performed again. Recall is then 0.948, which meets the target. The method may then proceed directly to the prediction step (S70); alternatively, in the iteration step (S60), undersampling may be performed in a third iteration, that is, the oversampling may be set below 300%, for example to 280%, and the prediction/classification step (S40) and the performance measurement step (S50) performed once more.
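The iteration in this example can be sketched as a loop that raises the oversampling percentage until the target Recall is met. This is an illustration under assumptions: the function name, the fixed step of 100 percentage points, and the upper bound are hypothetical, and evaluate_recall stands in for running steps S30 to S50. Here it simply looks up the SVM Recall values reported in Table 7. The optional undersampling refinement (for example, 280%) is not covered.

```python
def tune_oversampling(evaluate_recall, target, start=100, step=100, max_percent=1000):
    """Raise the oversampling percentage until recall reaches the target.

    `evaluate_recall(percent)` is a hypothetical callback that runs the
    adjustment, training, and measurement steps and returns the fraud-class recall.
    Returns (percent, recall, history), or (None, None, history) if never reached.
    """
    history = []
    percent = start
    while percent <= max_percent:
        recall = evaluate_recall(percent)
        history.append((percent, recall))
        if recall >= target:
            return percent, recall, history
        percent += step
    return None, None, history


# SVM recall values for the fraud class (Class = 1) as reported in Table 7.
reported_svm_recall = {100: 0.820, 200: 0.887, 300: 0.948}
percent, recall, history = tune_oversampling(
    reported_svm_recall.__getitem__, target=0.940, max_percent=300
)
```

With the reported values the loop stops at 300% oversampling, matching the sequence 0.820, 0.887, 0.948 described above.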

The predicting step S70 is a step of predicting a chargeback fraud on transaction history data of a new user using a machine learning model that has reached the target predictive classification performance.

Meanwhile, the program for predicting a chargeback fraud user according to an embodiment of the present invention is a program stored in a medium in order to predict chargeback fraud users according to the above-described method of predicting a chargeback fraud user. For example, the program may be recorded on a recording medium readable by a computer or a similar device.

For example, the recording medium may be a hard disk type, a magnetic media type, a compact disc read-only memory (CD-ROM), an optical media type, a magneto-optical media type, a multimedia card micro type, a card-type memory (e.g., SD or XD memory), a flash memory type, a read-only memory (ROM), a random access memory (RAM), a memory consisting of a combination of these, a main memory device, or an auxiliary memory device, but is not limited thereto.

In addition, the program may be stored in an attachable storage device that is accessible through a communication network such as the Internet, an intranet, a local area network (LAN), a wide area network (WAN), or a storage area network (SAN), or through a communication network composed of a combination thereof.

In the detailed description of the present invention, specific embodiments have been described, but various modifications may be made without departing from the scope of the present invention. Therefore, the scope of the present invention should not be limited to the described embodiments, but should be defined by the following claims and their equivalents.

Claims

1. A method for predicting a chargeback fraud user, the method comprising:

a data processing step of processing transaction history data of existing normal users and chargeback fraud users into one record per user;

a data classification step of dividing the processed transaction history data into training data and test data;

a data adjustment step of oversampling the data for chargeback fraud users among the training data to adjust the number of data;

a prediction/classification step of training by a specific machine learning method using the training data whose number has been adjusted, and predicting, using the learned machine learning model, whether the test data corresponds to a chargeback fraud user;

a performance measurement step of measuring the performance of the prediction/classification;

an iteration step of oversampling or undersampling the data for chargeback fraud users in the training data and repeating the prediction/classification step and the performance measurement step until the prediction/classification performance reaches a target value; and

a prediction step of predicting chargeback fraud from the transaction history data of a new user using the machine learning model that has reached the target prediction/classification performance.

2. The method of claim 1,

wherein the data processing step comprises:

a first feature deletion step of evaluating each feature and a plurality of feature sets of the transaction history data and deleting features that do not meet a criterion;

a feature generation step of generating new features while processing the transaction history data that has passed through the first feature deletion step into one record per user using a statistical method; and

a second feature deletion step of evaluating the generated features and deleting features that do not meet the evaluation criteria.

3. The method of claim 2,

wherein the first feature deletion step comprises:

evaluating each feature of the transaction history data using an information gain technique; and

evaluating the plurality of feature sets of the transaction history data using a principal component analysis technique.

4. The method of claim 1,

wherein the performance measurement step

measures the performance of the prediction/classification using a confusion matrix.

5. A program for predicting a chargeback fraud user, stored in a medium in order to predict a chargeback fraud user according to the method of predicting a chargeback fraud user of any one of claims 1 to 4.