CN116757702A - Transaction data determining method and device, processor and electronic equipment - Google Patents

Transaction data determining method and device, processor and electronic equipment

Info

Publication number
CN116757702A
Authority
CN
China
Prior art keywords
data
transaction
transaction data
model
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310807655.9A
Other languages
Chinese (zh)
Inventor
谭宗麟
曾炜
李杰一
温卓宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC filed Critical Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202310807655.9A priority Critical patent/CN116757702A/en
Publication of CN116757702A publication Critical patent/CN116757702A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q20/00 Payment architectures, schemes or protocols
    • G06Q20/38 Payment protocols; Details thereof
    • G06Q20/40 Authorisation, e.g. identification of payer or payee, verification of customer or shop credentials; Review and approval of payers, e.g. check credit lines or negative lists
    • G06Q20/401 Transaction verification
    • G06Q20/4016 Transaction verification involving fraud or risk level assessment in transaction processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2148 Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the process organisation or structure, e.g. boosting cascade
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/2433 Single-class perspective, e.g. one-against-all classification; Novelty detection; Outlier detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/254 Fusion techniques of classification results, e.g. of results related to same input data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/259 Fusion by voting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/04 Trading; Exchange, e.g. stocks, commodities, derivatives or currency exchange
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/02 Banking, e.g. interest calculation or account maintenance
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/03 Credit; Loans; Processing thereof

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • General Business, Economics & Management (AREA)
  • Strategic Management (AREA)
  • Technology Law (AREA)
  • Marketing (AREA)
  • Economics (AREA)
  • Computer Security & Cryptography (AREA)
  • Development Economics (AREA)
  • Financial Or Insurance-Related Operations Such As Payment And Settlement (AREA)

Abstract

The application discloses a transaction data determining method and device, a processor and an electronic device, relating to the field of artificial intelligence. The method comprises the following steps: acquiring transaction data of a target object and determining feature data corresponding to the transaction data; inputting the feature data into a pre-trained transaction model so that the transaction model outputs a voting result for the transaction data, wherein the transaction model is trained by machine learning using multiple sets of data, and each set of data comprises feature sample data and a label indicating whether the transaction data corresponding to the feature sample data is abnormal transaction data; and determining whether the transaction data is abnormal transaction data according to the voting result. The application solves the problem in the prior art that, in the process of screening and identifying transaction data, the rules are static and based on known transaction data and cannot be adjusted flexibly to different situations.

Description

Transaction data determining method and device, processor and electronic equipment
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a method and apparatus for determining transaction data, a processor, and an electronic device.
Background
Abnormal transactions often involve forged or concealed transaction information, which causes financial institutions to make erroneous transaction and lending decisions and increases financial risk. The abnormal accumulation of funds caused by abnormal transactions also disturbs the balance of the financial market.
Existing schemes for identifying abnormal transaction data are based on fixed rules. Transaction data are screened and identified by formulating a series of rules and thresholds, for example the amount of a single transaction exceeding a certain threshold, the number of transactions within a certain period exceeding a limit, or the transaction account being inconsistent with the account actually used. When a transaction triggers such a rule or threshold, the system automatically identifies the transaction data as abnormal and processes it accordingly.
Rule-based identification of abnormal transaction data is simple, feasible and relatively cheap to implement, but its drawbacks are equally obvious: 1) it is difficult to cover all abnormal transaction data: because the rules are formulated from known abnormal transaction data, it is hard to cover every abnormal situation, and when new abnormal transaction data appears the rules must be updated continuously, increasing maintenance and update costs; 2) misjudgments and missed judgments occur easily: because the rules are static and based on known transaction data, they cannot be adjusted flexibly to different situations; 3) it is hard to extend: a rule-based scheme cannot be transferred to other fields, and identifying other types of abnormal transaction data requires designing new rules; 4) complex situations are hard to handle: when several kinds of abnormal transaction data are interleaved, a rule-based scheme can hardly judge and process them effectively.
Aiming at the problem in the prior art that, in the process of screening and identifying transaction data, the rules are static and based on known transaction data and cannot be adjusted flexibly to different situations, no effective solution has yet been proposed.
Disclosure of Invention
The main purpose of the application is to provide a transaction data determining method and device, a processor and an electronic device, so as to solve the problem in the related art that, in the process of screening and identifying transaction data, the rules are static and based on known transaction data and cannot be adjusted flexibly to different situations.
In order to achieve the above object, according to one aspect of the present application, a method for determining transaction data is provided. The method comprises the following steps: acquiring transaction data of a target object, and determining feature data corresponding to the transaction data; inputting the feature data into a pre-trained transaction model so that the transaction model outputs a voting result for the transaction data, wherein the transaction model is trained by machine learning using multiple sets of data, and each set of data comprises feature sample data and a label indicating whether the transaction data corresponding to the feature sample data is abnormal transaction data; and determining whether the transaction data is abnormal transaction data according to the voting result.
Optionally, when the transaction model is a random forest model, inputting the feature data into the pre-trained transaction model so that the transaction model outputs a voting result for the transaction data comprises: determining a classification result for the transaction data through each decision tree in the random forest model; dividing the classification results into groups, wherein the classification results within each group are identical; and determining the number of classification results in each group, and taking the classification result corresponding to the largest group as the voting result.
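A minimal sketch of this voting step, assuming the random forest is held as a list of already-trained decision trees whose predict method returns a class for a feature vector (the names trees and majority_vote are illustrative, not part of the application):

```python
from collections import Counter

def majority_vote(trees, feature_vector):
    """Collect one classification result per decision tree and return the class
    of the largest group of identical results as the voting result."""
    # Each decision tree classifies the same feature vector independently.
    votes = [tree.predict([feature_vector])[0] for tree in trees]
    # Group identical classification results and count the size of each group.
    groups = Counter(votes)
    voting_result, _count = groups.most_common(1)[0]
    return voting_result
```

Ties between equally large groups are broken arbitrarily here; the application does not specify a tie-breaking rule.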
Optionally, before inputting the feature data into the pre-trained transaction model, the method further comprises: cyclically performing a construction step until every decision tree in the random forest model has been constructed, wherein the construction step comprises: calculating the information gain ratio of the feature corresponding to each non-leaf node in the decision tree; taking the feature with the largest information gain ratio as the partition feature, and taking the node corresponding to the partition feature as the splitting node; and dividing the feature sample data into a plurality of sub-datasets according to the partition feature, and constructing the child nodes of the splitting node from the sub-datasets.
Optionally, after the construction step has been cyclically performed until every decision tree of the random forest model is constructed, the method further comprises: inputting feature verification data into the random forest model and obtaining the prediction result output by the random forest model; determining index information of the random forest model according to the prediction result and the label data corresponding to the feature verification data, wherein the index information comprises: accuracy, precision, recall, F1 value, specificity and false positive rate; and determining whether the random forest model passes verification according to the index information.
Optionally, determining whether the random forest model passes verification according to the index information comprises: determining that the random forest model passes verification when the accuracy is greater than or equal to a first threshold, the precision is greater than or equal to a second threshold, the recall is greater than or equal to a third threshold, the F1 value is greater than or equal to a fourth threshold, the specificity is greater than or equal to a fifth threshold, and the false positive rate is less than or equal to a sixth threshold; and determining that the random forest model does not pass verification if the accuracy is less than the first threshold, and/or the precision is less than the second threshold, and/or the recall is less than the third threshold, and/or the F1 value is less than the fourth threshold, and/or the specificity is less than the fifth threshold, and/or the false positive rate is greater than the sixth threshold.
Optionally, after determining whether the transaction data is abnormal transaction data according to the voting result, the method further includes: under the condition that the transaction data is abnormal transaction data, carrying out differential privacy calculation on the transaction data to obtain encrypted transaction data; and sending the encrypted transaction data to a terminal device, so that the terminal device outputs an analysis result according to the encrypted transaction data.
Optionally, inputting the feature data into a pre-trained transaction model comprises: determining a transaction value of the target object and a transaction frequency of the target object according to the transaction data; determining whether the transaction data of the target object is abnormal transaction data according to the transaction value and the transaction frequency; and inputting the feature data into the pre-trained transaction model in the case that the transaction data is determined to be normal transaction data.
In order to achieve the above object, according to another aspect of the present application, a transaction data determining device is provided. The device comprises: an acquisition module, configured to acquire transaction data of a target object and determine feature data corresponding to the transaction data; an input module, configured to input the feature data into a pre-trained transaction model so that the transaction model outputs a voting result for the transaction data, wherein the transaction model is trained by machine learning using multiple sets of data, and each set of data comprises feature sample data and a label indicating whether the transaction data corresponding to the feature sample data is abnormal transaction data; and a determining module, configured to determine whether the transaction data is abnormal transaction data according to the voting result.
According to the application, the following steps are adopted: acquiring transaction data of a target object and determining feature data corresponding to the transaction data; inputting the feature data into a pre-trained transaction model so that the transaction model outputs a voting result for the transaction data; and determining whether the transaction data is abnormal transaction data according to the voting result. This solves the problem in the prior art that, in the process of screening and identifying transaction data, the rules are static and based on known transaction data and cannot be adjusted flexibly to different situations, thereby improving financial stability and safeguarding financial security.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application. In the drawings:
FIG. 1 is a flow chart of a method of determining transaction data provided in accordance with an embodiment of the present application;
FIG. 2 is a block diagram (I) of a transaction data determination device provided according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an alternative electronic device provided in accordance with an embodiment of the present application;
FIG. 4 is a block diagram (II) of a transaction data determination device according to an embodiment of the present application;
FIG. 5 is a random forest model training flowchart in accordance with an embodiment of the present application;
FIG. 6 is a transaction behavior discrimination flow diagram according to an embodiment of the application;
FIG. 7 is a flowchart of data reporting according to an embodiment of the present application.
Detailed Description
It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other. The application will be described in detail below with reference to the drawings in connection with embodiments.
In order that those skilled in the art will better understand the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art based on the embodiments of the present application without inventive effort shall fall within the scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate in order to describe the embodiments of the application herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
For convenience of description, the following will describe some terms or terminology involved in the embodiments of the present application:
Random forests: the random forest algorithm is an extension of the Bagging algorithm; it uses the decision tree model as the base learner and combines multiple decision trees into an ensemble model. In order to build diverse decision trees, the random forest algorithm introduces randomness into the model training process. The randomness has two layers: the first layer is randomness in data sampling, and the second layer is randomness in feature extraction, that is, each decision tree randomly extracts part of the features for training.
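A brief sketch of these two layers of randomness for a single base tree, assuming X and y are NumPy arrays; the function and parameter names are illustrative, and scikit-learn's decision tree stands in for the base learner:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_randomized_tree(X, y, n_sub_features, rng):
    """One base learner with both layers of randomness: a bootstrap sample of
    the rows (data-sampling randomness) and a random subset of the feature
    columns (feature-extraction randomness)."""
    n_samples, n_features = X.shape
    row_idx = rng.integers(0, n_samples, size=n_samples)                  # sample rows with replacement
    col_idx = rng.choice(n_features, size=n_sub_features, replace=False)  # pick a feature subset
    tree = DecisionTreeClassifier().fit(X[row_idx][:, col_idx], y[row_idx])
    return tree, col_idx   # the column indices are needed again at prediction time
```

For example, rng = np.random.default_rng(0) with n_sub_features set to roughly the square root of the feature count is a common choice, although the application does not fix a value.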
The application is described below with reference to preferred implementation steps. FIG. 1 is a flowchart of a method for determining transaction data according to an embodiment of the application; as shown in FIG. 1, the method comprises the following steps:
step S101, acquiring transaction data of a target object, and determining characteristic data corresponding to the transaction data;
it should be noted that the transaction data includes, but is not limited to: recharging data and transferring data.
Step S102, inputting the characteristic data into a pre-trained transaction model so that the transaction model outputs voting results of the transaction data;
Step S103, determining whether the transaction data is abnormal transaction data according to the voting result.
In the method for determining transaction data provided by the embodiment of the application, the transaction data of a target object is first acquired and the feature data corresponding to the transaction data is determined; the feature data is input into a pre-trained transaction model so that the transaction model outputs a voting result for the transaction data; and whether the transaction data is abnormal transaction data is determined according to the voting result. This solves the problem in the prior art that, in the process of screening and identifying transaction data, the rules are static and based on known transaction data and cannot be adjusted flexibly to different situations, thereby achieving the technical effects of improving financial stability and safeguarding financial security.
Optionally, in the foregoing embodiment, when the transaction model is a random forest model, inputting the feature data into the pre-trained transaction model so that the transaction model outputs a voting result for the transaction data comprises: determining a classification result for the transaction data through each decision tree in the random forest model; dividing the classification results into groups, wherein the classification results within each group are identical; and determining the number of classification results in each group, and taking the classification result corresponding to the largest group as the voting result.
In the embodiment of the invention, the transaction data is decided through a random forest model: all decision trees in the trained random forest model vote, and the result for the transaction is judged from the proportion of the voting results; specifically, the classification result with the highest proportion is taken as the voting result.
Optionally, in the foregoing embodiment, before the feature data is input into the pre-trained transaction model, a construction step is cyclically performed until every decision tree in the random forest model has been constructed, wherein the construction step comprises: calculating the information gain ratio of the feature corresponding to each non-leaf node in the decision tree; taking the feature with the largest information gain ratio as the partition feature, and taking the node corresponding to the partition feature as the splitting node; and dividing the feature sample data into a plurality of sub-datasets according to the partition feature, and constructing the child nodes of the splitting node from the sub-datasets.
It can be understood that a decision tree in the embodiment of the present invention is constructed as follows: one feature is selected as the partition feature of the root node; for each non-leaf node, the information gain ratio of each feature under that node is calculated, and the feature with the largest information gain ratio is selected as the partition feature; the data set is divided into several sub-datasets according to the selected partition feature, each sub-dataset corresponding to one feature value; and for each sub-dataset a subtree is constructed recursively until a termination condition is met, for example all samples belonging to the same class. The construction step is performed cyclically until every decision tree in the random forest model has been constructed, so that decision trees are added to the random forest model and its accuracy on the label data corresponding to the verification data is improved as far as possible.
Optionally, in the above embodiment, after the construction step has been cyclically performed until every decision tree of the random forest model is constructed, feature verification data is input into the random forest model and the prediction result output by the random forest model is obtained; index information of the random forest model is determined according to the prediction result and the label data corresponding to the feature verification data, wherein the index information comprises: accuracy, precision, recall, F1 value, specificity and false positive rate; and whether the random forest model passes verification is determined according to the index information.
It can be understood that the accuracy refers to the proportion of all correctly predicted samples in the total sample; the precision refers to the proportion of samples correctly predicted as abnormal transaction data among all samples predicted as abnormal transaction data; the recall refers to the proportion of abnormal transaction data correctly predicted as abnormal among all actually abnormal transaction data; the F1 value is a comprehensive index which considers both precision and recall and uses their harmonic mean as a balance between the two; the specificity refers to the proportion of samples correctly predicted as normal transaction data among all samples predicted as normal transaction data; and the false positive rate refers to the proportion of samples wrongly predicted as normal transaction data among all samples predicted as normal transaction data.
Optionally, in the foregoing embodiment, determining whether the random forest model passes verification according to the index information comprises: determining that the random forest model passes verification when the accuracy is greater than or equal to a first threshold, the precision is greater than or equal to a second threshold, the recall is greater than or equal to a third threshold, the F1 value is greater than or equal to a fourth threshold, the specificity is greater than or equal to a fifth threshold, and the false positive rate is less than or equal to a sixth threshold; and determining that the random forest model does not pass verification if the accuracy is less than the first threshold, and/or the precision is less than the second threshold, and/or the recall is less than the third threshold, and/or the F1 value is less than the fourth threshold, and/or the specificity is less than the fifth threshold, and/or the false positive rate is greater than the sixth threshold.
It can be appreciated that, in the above embodiment, the closer an index computed from correctly predicted samples is to 1, the better the model's prediction effect. For example, in this embodiment it can be stipulated that the first to fifth thresholds are 0.80 and the sixth threshold is 0.20; the model then passes verification when the accuracy is greater than or equal to 0.80, the precision is greater than or equal to 0.80, the recall is greater than or equal to 0.80, the F1 value is greater than or equal to 0.80, the specificity is greater than or equal to 0.80, and the false positive rate is less than or equal to 0.20. A model that passes verification is output; a model that does not pass verification is not output.
Optionally, in the foregoing embodiment, after determining whether the transaction data is abnormal transaction data according to the voting result, the method further includes: under the condition that the transaction data is abnormal transaction data, carrying out differential privacy calculation on the transaction data to obtain encrypted transaction data; and sending the encrypted transaction data to a terminal device, so that the terminal device outputs an analysis result according to the encrypted transaction data.
It can be understood that the abnormal transaction data determined by the model can be counted periodically, taking a certain period of time as the unit, and the transaction data that has been successfully blocked can be reported. The reported statistics may include, but are not limited to, the abnormal transaction amounts and the bank card account information from which the abnormal transactions were debited, so that the terminal device can accurately know whether the transaction data is abnormal and perform further operations.
It should be noted that differential privacy is a privacy protection technology for protecting personal privacy during the statistical analysis of personal data. Differential privacy calculation methods include the following:
1. Noise addition: in the statistical analysis process, a certain amount of noise is added to each individual's data to protect privacy. For example, for a binary attribute, Laplacian noise or Gaussian noise may be added. The magnitude of the noise can be adjusted according to the differential privacy parameters.
2. Data perturbation: the individual data are perturbed so that the analysis results do not leak sensitive information about individuals. For example, the data are randomized, desensitized or generalized to protect individual privacy.
3. Query restriction: query operations on the data are restricted to reduce the disclosure of individual privacy. For example, the results of a restricted query may only be certain predefined statistics and cannot provide information about a specific individual.
4. Randomized response: for query results, individual privacy is protected by randomizing the results. For example, when a query result is returned, it is randomized so that the querier cannot determine the information of a specific individual.
5. Differential privacy algorithms: differential privacy algorithms are computational methods specifically designed to protect privacy. They provide a range of privacy-preserving mechanisms, such as locality-sensitive hashing (LSH), generalization and aggregation, to protect the privacy of individual data.
It should be noted that the above methods may be selected and combined according to the specific application scenario and requirements to implement differential privacy calculation on the transaction data. Meanwhile, the differential privacy parameters should be set, and the privacy protection effect evaluated, according to the specific differential privacy requirements and risk assessment.
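As a concrete illustration of the noise-addition method above, the following sketch perturbs an aggregated amount with the Laplace mechanism before it is reported; the function name, the sensitivity, the epsilon value and the example figure are all assumptions for illustration rather than values taken from the application:

```python
import numpy as np

def add_laplace_noise(value, epsilon, sensitivity=1.0, rng=None):
    """Add Laplace noise with scale b = sensitivity / epsilon so that the
    reported figure no longer reveals the exact underlying value."""
    if rng is None:
        rng = np.random.default_rng()
    scale = sensitivity / epsilon
    return value + rng.laplace(loc=0.0, scale=scale)

# e.g. perturb an aggregated abnormal-transaction amount before publishing it
noisy_total = add_laplace_noise(12850.0, epsilon=0.5, sensitivity=100.0)
```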
Optionally, in the foregoing embodiment, inputting the feature data into a pre-trained transaction model comprises: determining a transaction value of the target object and a transaction frequency of the target object according to the transaction data; determining whether the transaction data of the target object is abnormal transaction data according to the transaction value and the transaction frequency; and inputting the feature data into the pre-trained transaction model in the case that the transaction data is determined to be normal transaction data.
In order to accelerate the determination of whether the transaction data is abnormal transaction data, before the feature data is input into the pre-trained transaction model, whether the transaction data is abnormal transaction data is first determined according to the transaction value and the transaction frequency of the target object. Specifically, the transaction data is determined to be abnormal transaction data when the transaction value is greater than a seventh threshold and the transaction frequency is greater than an eighth threshold, and the feature data is input into the pre-trained transaction model when the transaction value is less than or equal to the seventh threshold and/or the transaction frequency is less than or equal to the eighth threshold.
In the embodiment of the invention, before the characteristic data is input into the pre-trained transaction model, whether the transaction data is abnormal transaction data or not is determined according to the transaction value and the transaction frequency of the target object, so that the speed of determining whether the transaction data is abnormal transaction data or not can be improved.
It will be appreciated that the transaction value of the target object may be discretized, for example 0-50 as a small transaction, 50-100 as a medium transaction, and above 100 as a large transaction; the transaction frequency of the target object is the number of transactions up to the current time, which may be counted and divided on a monthly basis, for example into the three classes of low frequency, medium frequency and high frequency.
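A minimal sketch of this pre-filter under the interpretation above; the concrete threshold values are assumptions, since the application only refers to a seventh threshold (transaction value) and an eighth threshold (transaction frequency):

```python
VALUE_THRESHOLD = 100.0   # assumed seventh threshold (transaction value)
FREQ_THRESHOLD = 30       # assumed eighth threshold (monthly transaction count)

def pre_filter(transaction_value: float, monthly_frequency: int) -> str:
    """Fast rule check before invoking the model: a transaction exceeding both
    thresholds is flagged directly, everything else is sent to the model."""
    if transaction_value > VALUE_THRESHOLD and monthly_frequency > FREQ_THRESHOLD:
        return "abnormal"          # determined without running the model
    return "send_to_model"         # feature data is input into the transaction model
```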
The embodiment of the application also provides a device for determining transaction data, and the device for determining transaction data can be used for executing the method for determining transaction data provided by the embodiment of the application. The following describes a device for determining transaction data provided by an embodiment of the present application.
Fig. 2 is a block diagram (I) of a transaction data determination device according to an embodiment of the present application. As shown in fig. 2, the apparatus includes:
the acquiring module 22 is configured to acquire transaction data of a target object, and determine feature data corresponding to the transaction data;
an input module 24, configured to input the feature data into a pre-trained transaction model, so that the transaction model outputs a voting result of the transaction data;
A determining module 26 is configured to determine whether the transaction data is abnormal transaction data according to the voting result.
With the device of the application: transaction data of a target object is acquired and the feature data corresponding to the transaction data is determined; the feature data is input into a pre-trained transaction model so that the transaction model outputs a voting result for the transaction data; and whether the transaction data is abnormal transaction data is determined according to the voting result. This solves the problem in the prior art that, in the process of screening and identifying transaction data, the rules are static and based on known transaction data and cannot be adjusted flexibly to different situations, thereby achieving the technical effects of improving financial stability and safeguarding financial security.
Optionally, the input module 24 is further configured to determine a classification result of the transaction data through each decision tree in the random forest model; dividing the plurality of classification results into a plurality of groups, wherein the classification results in each group are consistent; and determining the number of the classification results in each group, and determining the classification result corresponding to the group with the largest number as the voting result.
Optionally, the apparatus further includes: the circulation module is configured to perform a construction step in a circulation manner until each decision tree in the random forest model is constructed before the feature data is input into the pre-trained transaction model, where the construction step includes: calculating the information gain ratio of the corresponding characteristics of each non-leaf node in the decision tree; taking the feature with the maximum information gain ratio as a dividing feature, and taking the node corresponding to the dividing feature as a splitting node; and dividing the characteristic sample data into a plurality of sub-data sets according to the dividing characteristics, and constructing the sub-nodes of the split node according to the plurality of sub-data sets.
Optionally, the apparatus further includes: a verification module, configured to, after the construction step has been cyclically performed until every decision tree of the random forest model is constructed, input feature verification data into the random forest model and obtain the prediction result output by the random forest model; determine index information of the random forest model according to the prediction result and the label data corresponding to the feature verification data, wherein the index information comprises: accuracy, precision, recall, F1 value, specificity and false positive rate; and determine whether the random forest model passes verification according to the index information.
Optionally, the verification module is further configured to determine that the random forest model passes verification when the accuracy is greater than or equal to a first threshold, the precision is greater than or equal to a second threshold, the recall is greater than or equal to a third threshold, the F1 value is greater than or equal to a fourth threshold, the specificity is greater than or equal to a fifth threshold, and the false positive rate is less than or equal to a sixth threshold; and to determine that the random forest model does not pass verification if the accuracy is less than the first threshold, and/or the precision is less than the second threshold, and/or the recall is less than the third threshold, and/or the F1 value is less than the fourth threshold, and/or the specificity is less than the fifth threshold, and/or the false positive rate is greater than the sixth threshold.
Optionally, the input module is further configured to perform differential privacy calculation on the transaction data to obtain encrypted transaction data when the transaction data is determined to be abnormal transaction data; and sending the encrypted transaction data to a terminal device, so that the terminal device outputs an analysis result according to the encrypted transaction data.
Optionally, the input module is further configured to determine a transaction value of the target object and a transaction frequency of the target object according to the transaction data; determining whether the transaction data of the target object is abnormal transaction data according to the transaction data and the transaction frequency; and under the condition that the transaction data are determined to be normal transaction data, inputting the characteristic data into a pre-trained transaction model.
In order to better understand the above method for determining transaction data, its implementation flow is described below in conjunction with an optional embodiment, but the technical solution of the embodiments of the present application is not limited thereto.
This optional embodiment establishes a method for determining transaction data that is suitable for a mobile phone top-up scenario. As shown in fig. 4, fig. 4 is a block diagram (II) of a transaction data determining apparatus according to an embodiment of the present application; the module structure of the system includes a data acquisition module 400, a data preprocessing module 401, a data storage module 402, a model training module 403, a model generation module 404, a model tuning module 405, a model evaluation module 406, a data interception module 407, a model prediction module 408, a result execution module 409 and a data reporting module 410.
1) The data acquisition module 400: responsible for collecting multi-source historical transaction data. The transaction data read from the database include: the transaction order ID, the transaction mobile phone number, the transaction time, the transaction amount, the transaction type, the transaction mode, the payment channel, the interval since the last transaction, the account balance before the transaction, the transaction frequency of the record, whether the transaction is settled across operators, whether the transaction is abnormal, and the like.
2) The data preprocessing module 401:
(1) The data are processed to form a dataset. The transaction data are converted into a multi-attribute data set; continuously distributed data are discretized and classified according to preset rules; and the abnormal attribute of the transaction data is determined according to the historical judgment results and used as the label of the transaction data.
Transaction order ID: each transaction order is numbered, and each ID corresponds to one transaction order.
Transaction mobile phone number: the region is divided according to digits 4-7 of the mobile phone number.
Transaction source IP address: transactions are divided into in-house transactions and cross-house transactions.
Transaction time: the transaction time is discretized into four periods: morning, afternoon, evening and early morning.
Transaction amount: the transaction amount is discretized into three grades: small transactions (0-40), medium transactions (40-100) and large transactions (>100).
Transaction type: includes prepaid transactions and post-payment transactions.
Transaction mode: includes electronic payment transactions and over-the-air transactions.
Payment channel: includes bank card payment, third-party payment platform payment, credit deduction and offline payment.
Interval since the last transaction: the time interval is discretized into instantaneous, short, daily, weekly, monthly and long-term intervals.
Pre-transaction account balance: the account balance is discretized into low balance, general balance and rich balance.
Transaction frequency: the number of transactions up to the current transaction is counted per month and divided into three classes: low frequency, medium frequency and high frequency.
Whether settled across operators: whether the current transaction involves settlement across operators.
Whether a foreign mobile phone number: whether the mobile phone number corresponding to the current transaction is a foreign mobile phone number.
Whether a virtual mobile phone number: whether the mobile phone number corresponding to the current transaction is a virtual mobile phone number.
Finally, whether the transaction data is abnormal transaction data is determined according to the historical judgment result, and a dataset of multiple data records is formed, for example: "[attribute 1, attribute 2, ..., attribute n] - normal/abnormal transaction".
(2) Incomplete data and null data are processed: data records that are incomplete or contain null values are deleted and removed from the dataset.
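A small sketch of how one raw record could be discretized into the labeled form described above; the field names, hour boundaries and the choice of attributes are illustrative assumptions, while the amount grades follow the small/medium/large division given above:

```python
from typing import List, Optional, Tuple

def discretize_amount(amount: float) -> str:
    """Small (0-40), medium (40-100) or large (>100) transaction."""
    if amount <= 40:
        return "small"
    return "medium" if amount <= 100 else "large"

def discretize_time(hour: int) -> str:
    """Morning, afternoon, evening or early morning (assumed hour boundaries)."""
    if 6 <= hour < 12:
        return "morning"
    if 12 <= hour < 18:
        return "afternoon"
    return "evening" if 18 <= hour < 24 else "early_morning"

def preprocess(record: dict) -> Optional[Tuple[List[str], str]]:
    """Turn one raw record into ([attribute 1, ..., attribute n], label),
    or return None so the record is dropped when a required field is missing."""
    required = ("order_id", "amount", "hour", "pay_channel", "is_abnormal")
    if any(record.get(key) is None for key in required):
        return None                      # incomplete / null records are removed
    attributes = [
        discretize_amount(record["amount"]),
        discretize_time(record["hour"]),
        str(record["pay_channel"]),
    ]
    label = "abnormal" if record["is_abnormal"] else "normal"
    return attributes, label
```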
3) The data storage module 402: stores the preprocessed dataset to be trained as well as new transaction records that have been judged by the model; when the trained model is updated, a corresponding number of data sets can be randomly extracted from this module for training.
4) The model training module 403: partitions the preprocessed data, with 70% of the data used to train the model and 30% of the data used to evaluate the model.
5) The model generation module 404 and model tuning module 405: the training set is sampled and trained using the Bagging method, and the model is optimized by K-fold cross-validation. The data set is divided into K parts, 1 part being used as the verification set and the remaining K-1 parts as the training set; each part contains X samples and Y features, so (K-1)×X samples participate in each round of training. For a requirement of a random forest with G decision trees:
(1) Randomness of sampling: G/K rounds of random sampling are performed; the number of data records sampled in each round is (K-1)×X, i.e. samples are drawn (K-1)×X times, and they are used to train G/K decision trees.
(2) Randomness of feature selection: each decision tree randomly extracts a of the features of the data records for training, where a is smaller than the total number of features Y.
The embodiment of the invention constructs each of the G/K decision trees using the C4.5 algorithm. The successive partition features of the decision tree are selected according to the information gain ratio, and the attribute with the larger information gain ratio is used as the preferential splitting node:
a. Information amount (entropy): entropy represents the degree of uncertainty of an event, i.e. the size of the amount of information. In this scenario, D represents the training set, pi represents the probability of occurrence of the i-th class of the "whether the transaction data is abnormal" classification in the training set D, and m represents the number of classes: Info(D) = -Σ_{i=1..m} pi × log2(pi).
b. Information entropy: assume that the training set D is partitioned by attribute A, and that attribute A partitions D into v different classes. Info(Dj) represents the information amount of each part Dj calculated after dividing the training set D according to attribute A, and the information entropy of attribute A is: Info_A(D) = Σ_{j=1..v} (|Dj|/|D|) × Info(Dj).
c. Information gain: the difference between the original information requirement and the new requirement (i.e. the information obtained after dividing by A):
Gain(A) = Info(D) - Info_A(D).
d. Split information: the information gain ratio normalizes the information gain using the "split info" value: SplitInfo_A(D) = -Σ_{j=1..v} (|Dj|/|D|) × log2(|Dj|/|D|).
e. Information gain ratio: using the split information corresponding to dividing the training set D into the v outputs of the test on attribute A, the attribute with the largest gain ratio is selected as the preferential splitting attribute: GainRatio(A) = Gain(A) / SplitInfo_A(D).
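The quantities a-e above can be computed directly from a partitioned training set; the following is a minimal sketch with illustrative function names, not code from the application:

```python
import math
from collections import Counter

def entropy(labels):
    """Info(D): -sum over classes of p_i * log2(p_i)."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def gain_ratio(records, labels, attribute_index):
    """C4.5 gain ratio of splitting (records, labels) on one attribute."""
    total = len(records)
    # Partition the labels by the value that each record takes on the attribute.
    partitions = {}
    for row, label in zip(records, labels):
        partitions.setdefault(row[attribute_index], []).append(label)
    # Info_A(D): weighted entropy of the partitions.
    info_a = sum(len(part) / total * entropy(part) for part in partitions.values())
    gain = entropy(labels) - info_a                                # Gain(A)
    split_info = -sum((len(part) / total) * math.log2(len(part) / total)
                      for part in partitions.values())             # SplitInfo_A(D)
    return gain / split_info if split_info > 0 else 0.0            # GainRatio(A)
```

The attribute with the largest gain_ratio value is then chosen as the splitting node, as described above.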
Finally, G/K decision trees are constructed; the constructed G/K decision trees are verified with the 1 verification set, and the hyperparameters of the decision trees are adjusted as required, so that one round generates G/K decision trees. Another part of the data is then selected as the verification set and the remaining K-1 parts are used as the training set; the above steps are repeated for K rounds, finally yielding a random forest model consisting of G decision trees that has undergone K-fold cross-validation.
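A condensed sketch of this training loop, assuming X and y are NumPy arrays; scikit-learn's decision tree (entropy criterion) is used here as a stand-in for the C4.5 trees described above, since C4.5's gain ratio is not available in scikit-learn, and the hyperparameter tuning on the held-out fold is only indicated by a comment:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

def train_forest(X, y, G=50, K=5, seed=0):
    """Train G trees over K folds: each fold contributes G/K trees, and each
    tree is fitted on a bootstrap sample of that fold's K-1 training parts."""
    rng = np.random.default_rng(seed)
    forest = []
    folds = KFold(n_splits=K, shuffle=True, random_state=seed)
    for train_idx, _val_idx in folds.split(X):
        X_tr, y_tr = X[train_idx], y[train_idx]
        for _ in range(G // K):
            # Bagging: draw len(X_tr) records with replacement for this tree.
            boot = rng.integers(0, len(X_tr), size=len(X_tr))
            tree = DecisionTreeClassifier(criterion="entropy")
            forest.append(tree.fit(X_tr[boot], y_tr[boot]))
        # _val_idx (the held-out fold) would be used here to tune hyperparameters.
    return forest
```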
6) The model evaluation module 406: the trained random forest model is evaluated with the 30% of the data set aside earlier; for the evaluation data set, all decision trees of the forest vote through the random forest model to determine the result jointly, and six indexes of the model are calculated: accuracy, precision, recall, F1 value, specificity and false positive rate.
In this transaction scenario, abnormal transaction data is recorded as the positive class and normal transaction data as the negative class. The confusion matrix is shown in Table 1:
TABLE 1 Confusion matrix
                                 Predicted positive (abnormal)    Predicted negative (normal)
Actually positive (abnormal)     TP (true positive)               FN (false negative)
Actually negative (normal)       FP (false positive)              TN (true negative)
Based on the established confusion matrix, whether the model passes verification is judged by calculating six indexes of accuracy, precision, recall rate, F1 value, specificity and false positive rate of the model and defining a threshold value.
(1) Accuracy: the proportion of all correctly predicted samples (both positive and negative classes) in the total sample: Accuracy = (TP + TN) / (TP + TN + FP + FN).
(2) Precision: the proportion of samples correctly predicted as the positive class (abnormal transaction data) among all samples predicted as the positive class: Precision = TP / (TP + FP).
(3) Recall: the proportion of actually positive samples that are correctly predicted as positive (abnormal transaction data): Recall = TP / (TP + FN).
(4) F1 value: considers both precision and recall and uses their harmonic mean as a comprehensive index balancing the two: F1 = 2 × Precision × Recall / (Precision + Recall).
(5) Specificity: the proportion of samples correctly predicted as the negative class (normal transaction data) among all samples predicted as the negative class.
(6) False positive rate: the proportion of samples incorrectly predicted as the negative class (normal transaction data) among all samples predicted as the negative class.
Among the above six model evaluation indexes, TP and TN are correctly predicted samples and FN and FP are incorrectly predicted samples; the closer an index whose numerator is TP or TN is to 1, the better the model's prediction effect. The invention therefore stipulates that the model passes verification and is output when the accuracy is greater than or equal to 0.80, the precision is greater than or equal to 0.80, the recall is greater than or equal to 0.80, the F1 value is greater than or equal to 0.80, the specificity is greater than or equal to 0.80, and the false positive rate is less than or equal to 0.2.
7) The data interception module 407: the user's transaction data is first intercepted rather than being transacted directly; the data record is processed by the data preprocessing module 401 into an unlabeled data record, such as [attribute 1, attribute 2, ..., attribute n], and the data is forwarded to the model prediction module 408 for prediction and judgment.
8) The model prediction module 408: stores the trained random forest model so that transaction behavior records can be judged immediately. The intercepted user transaction data is decided by the random forest model: all decision trees in the trained random forest model vote, and whether the transaction is abnormal is judged from the proportion of the voting results. After being judged, the new transaction data is returned to the data storage module 402, which waits until a certain amount of newly judged transaction data has accumulated and then randomly extracts a sufficient number of samples to update the trained model.
9) The result execution module 409: performs the corresponding operation according to the model's judgment result, as illustrated in the sketch after this module list. If the transaction is normal, the transaction data is released and passed upwards, and the transaction is accepted; if the transaction is abnormal, the transaction data is not released, the transaction service is refused, and a transaction failure is returned.
10) The data reporting module 410: according to the abnormal transaction results judged by the model, abnormal transactions are counted periodically on a monthly basis, and the order data that has been successfully blocked is reported. The reported data include the amounts involved in the transactions and the bank card account information from which the abnormal transactions were debited.
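A minimal sketch tying together the interception, prediction and execution modules (407-409) described above; preprocess and predict stand for helpers such as the earlier sketches and are passed in as assumptions, and blocked_orders is an assumed buffer later used by the data reporting module (410):

```python
def handle_transaction(raw_record, preprocess, predict, blocked_orders):
    """One pass through interception, prediction and result execution.
    `preprocess` turns a raw record into an unlabeled attribute list (or None),
    `predict` returns the random forest's voting result for that list."""
    attributes = preprocess(raw_record)        # unlabeled [attribute 1, ..., attribute n]
    if attributes is None:
        return "rejected_incomplete"           # incomplete records are dropped
    if predict(attributes) == "abnormal":      # voting result of the random forest
        blocked_orders.append(raw_record)      # remembered for the monthly report
        return "transaction_refused"           # service refused, failure returned
    return "transaction_accepted"              # released and passed upwards
```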
The process of this optional embodiment is divided into a random forest model training process, a transaction behavior discrimination process and a data reporting process, whose flowcharts are fig. 5, fig. 6 and fig. 7, respectively.
(1) Training process of random forest model:
step S500: the data collection is from the database of the operator, the historical collected mobile phone charge transaction records exist in the database, and the historical transaction records already exist a classification label which is used as the transaction record after judging and verifying whether the transaction records are abnormal transactions or not.
Step S501: preprocessing the collected data, extracting a plurality of attributes of a transaction record, including but not limited to a transaction order ID, a transaction mobile phone number, a transaction source IP address, a transaction time, a transaction amount, a transaction type, a transaction mode, a payment channel, a last transaction interval, a pre-transaction account balance, a transaction frequency, whether settlement is carried out across operators, whether an overseas mobile phone number, whether a virtual mobile phone number is carried out, and whether abnormal transaction is carried out. And carrying out discrete division on the data according to the continuously distributed attributes, deleting records of data incomplete and null values, and finally prompting whether the abnormal transaction attributes form a multi-attribute transaction record data set with labels for model training.
Step S502: this step enables the storage of the data set to be trained for use in the next model training step.
Step S503: the extracted training data are divided, 70% for training the model and 30% for evaluating the model. Model training uses Bagging sampling and K-fold cross-validation for optimization; samples and features are extracted randomly, decision trees are constructed with the C4.5 algorithm, and all the decision trees together form the random forest.
Step S504: the trained random forest model is evaluated with the 30% of the data; the model's judgments on the data are compared with the labels, i.e. the true values are compared with the predicted values, indexes such as the accuracy, recall and F1 value of the model are calculated, and whether the model passes is judged against the index thresholds. If it passes, the model is output and the random forest model training flow ends; if not, model training is performed again and the model is re-evaluated until it passes and is output.
Step S505: the trained model is output; newly intercepted transaction records to be identified are fed into the model, completing the identification of whether they are abnormal transactions.
(2) Transaction behavior discriminating process
Step S600: and uniformly intercepting the customer transaction data, and performing subsequent step operation on the customer transaction data.
Step S601: the newly intercepted customer orders are processed; according to the data attributes required in data preprocessing, the attributes are extracted from the transaction data to form a multidimensional data record containing the required attributes.
Step S602: the processed data are input into the trained model for prediction and judgment; all decision trees in the model vote, and the judgment result is output according to the voting results. For a normal transaction the transaction data is transmitted onwards; for an abnormal transaction the transaction is stopped and a transaction abnormality is returned.
Step S603: and releasing the transaction data, uploading the order, and completing the transaction.
Step S604: and stopping the transaction data, returning to the transaction abnormality, and failing the transaction.
Step S605: and marking the transaction data judged by the random forest model with a classification label of whether the transaction data belongs to abnormal transactions or not, adding the new data record into a training set, and randomly extracting enough data records as a new training set updating model after a certain number of increment is reached.
Step S606: when the increment of data records reaches a certain amount, randomly fetching enough data from the data storage module 402 and retraining an updated random forest model.
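The incremental update of Steps S605-S606 can be sketched as a small accumulator; the increment threshold and sample size below are hypothetical configuration values, not values taken from this embodiment.

```python
import random

class TrainingSetUpdater:
    """Accumulate newly labelled records and signal when the model should be retrained."""

    def __init__(self, base_records, increment_threshold=10_000, sample_size=200_000):
        self.records = list(base_records)      # existing labelled training records
        self.increment_threshold = increment_threshold
        self.sample_size = sample_size
        self.new_since_update = 0

    def add(self, record, label) -> bool:
        self.records.append((record, label))
        self.new_since_update += 1
        return self.new_since_update >= self.increment_threshold  # True -> retrain

    def draw_training_sample(self):
        self.new_since_update = 0
        k = min(self.sample_size, len(self.records))
        return random.sample(self.records, k)  # random draw from the data store
```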
(3) Data reporting process
Step S701: judging whether the reporting time point has been reached, wherein the reporting time point is defined on a monthly basis, with the last day of each month as the data reporting day.
Step S702: summarizing the transaction data that was judged to be abnormal transaction data by the random forest model and successfully blocked in the current month, wherein the summarized data includes the total amount of the blocked transactions and the debit bank card account numbers used in the bank card transactions.
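A monthly summary for Step S702 could be assembled as below; the DataFrame field names are hypothetical.

```python
import pandas as pd

def monthly_summary(blocked: pd.DataFrame) -> dict:
    """Aggregate the abnormal transactions blocked in the current month."""
    return {
        "blocked_total_amount": float(blocked["transaction_amount"].sum()),
        "blocked_count": int(len(blocked)),
        "debit_card_accounts": sorted(blocked["debit_card_account"].astype(str).unique()),
    }
```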
Step S703: after summarizing and counting the data, carrying out the differential privacy calculation on the data locally and publishing the result to the financial system.
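One common way to realize such a local differential-privacy step is the Laplace mechanism applied to the published aggregate, as in the sketch below; the embodiment does not prescribe a particular mechanism, and the epsilon and sensitivity parameters are illustrative assumptions.

```python
import numpy as np

def privatize_total(blocked_total: float, sensitivity: float, epsilon: float = 1.0) -> float:
    """Add Laplace noise calibrated to sensitivity/epsilon before publishing the total."""
    noise = float(np.random.laplace(loc=0.0, scale=sensitivity / epsilon))
    return blocked_total + noise
```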
Step S704: the financial system receives the submitted data and, by analyzing and integrating it, determines the size of the abnormal funds blocked from flowing into the regular market, gains awareness of the related bank account information, and uses the data to promote joint anti-fraud research across the financial system.
It will be appreciated that this alternative embodiment determines whether the transaction data is abnormal by means of machine learning. In the case that the transaction data belongs to abnormal transaction data, the transaction behavior is cut off, thereby maintaining the stability of the financial system and safeguarding financial security.
It can be appreciated that this alternative embodiment extracts the multidimensional data attributes of a piece of transaction data and, through a trained machine learning model, quickly determines whether the input transaction data is an abnormal transaction. Compared with traditional rule-based judgment and manual judgment, this has the characteristics of strong expansibility, easy maintenance, high efficiency and a low misjudgment rate.
It can be appreciated that this alternative embodiment adopts a random forest model to judge the transaction data: the sampling is random, the feature selection is random, and the decision result is decided and output through the voting of a plurality of decision trees. Random forest classification can reduce overfitting, has high classification accuracy, can process high-dimensional data, is suitable for transaction data judgment scenarios requiring multidimensional feature judgment, allows the model to be trained quickly, and is suitable for actual business operation.
It can be appreciated that the present alternative embodiment periodically gathers the model's discrimination results on the transaction data, counts the transaction data discriminated as abnormal transactions, and periodically transmits the abnormal transaction data to the financial system. The published data comprises the amount of the transactions successfully judged as abnormal and blocked, together with the account information of the bank cards used for the abnormal deductions, helping the financial system, through regular and continuous feedback, to perfect fund management and promote joint anti-fraud research.
The device for determining the transaction data comprises a processor and a memory, wherein the above-described modules are stored in the memory as program units, and the processor executes the program units stored in the memory to realize the corresponding functions.
The processor includes a kernel, and the kernel fetches the corresponding program unit from the memory. One or more kernels may be provided. By adjusting kernel parameters, the problem in the related art that the existing model feature analysis method ignores complex relations among features, thereby consuming a large amount of time and manpower and resulting in poor user experience, is solved.
The memory may include volatile memory in a computer-readable medium, Random Access Memory (RAM) and/or nonvolatile memory, such as Read-Only Memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
An embodiment of the present invention provides a computer-readable storage medium having stored thereon a program which, when executed by a processor, implements a method of determining transaction data.
The embodiment of the invention provides a processor which is used for running a program, wherein the program runs to execute the method for determining the transaction data.
As shown in fig. 3, an embodiment of the present invention provides an electronic device, where the device includes a processor, a memory, and a program stored in the memory and executable on the processor, and when the processor executes the program, the following steps are implemented: inputting the characteristic data into a pre-trained transaction model so that the transaction model outputs voting results of the transaction data, wherein the transaction model is trained by machine learning using a plurality of sets of data, and each set of data in the plurality of sets of data comprises: feature sample data and a label indicating whether the transaction data corresponding to the feature sample data is abnormal transaction data; and determining whether the transaction data is abnormal transaction data according to the voting result.
Optionally, when the transaction model is a random forest model, inputting the feature data into a pre-trained transaction model, so that the transaction model outputs a voting result of the transaction data, including: determining a classification result of the transaction data through each decision tree in the random forest model; dividing the plurality of classification results into a plurality of groups, wherein the classification results in each group are consistent; and determining the number of the classification results in each group, and determining the classification result corresponding to the group with the largest number as the voting result.
Optionally, before inputting the feature data into the pre-trained transaction model, the method further comprises: and circularly executing a construction step until each decision tree in the random forest model is constructed, wherein the construction step comprises the following steps: calculating the information gain ratio of the corresponding characteristics of each non-leaf node in the decision tree; taking the feature with the maximum information gain ratio as a dividing feature, and taking the node corresponding to the dividing feature as a splitting node; and dividing the characteristic sample data into a plurality of sub-data sets according to the dividing characteristics, and constructing the sub-nodes of the split node according to the plurality of sub-data sets.
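The gain-ratio computation behind this construction step (the heart of C4.5-style splitting) can be sketched as follows; this is a simplified illustration assuming discrete features, and a full implementation would also handle continuous attributes and pruning.

```python
import numpy as np

def entropy(labels: np.ndarray) -> float:
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def gain_ratio(feature: np.ndarray, labels: np.ndarray) -> float:
    values, counts = np.unique(feature, return_counts=True)
    weights = counts / counts.sum()
    conditional = sum(w * entropy(labels[feature == v]) for v, w in zip(values, weights))
    info_gain = entropy(labels) - conditional
    split_info = float(-(weights * np.log2(weights)).sum())
    return info_gain / split_info if split_info > 0 else 0.0

def choose_split_feature(features: np.ndarray, labels: np.ndarray) -> int:
    # The feature with the maximum information gain ratio becomes the dividing feature.
    ratios = [gain_ratio(features[:, j], labels) for j in range(features.shape[1])]
    return int(np.argmax(ratios))
```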
Optionally, after the construction step is circularly performed until each decision tree of the random forest model is constructed, the method further comprises: inputting feature verification data into the random forest model, and obtaining a prediction result output by the random forest model; determining index information of the random forest model according to the prediction result and the label data corresponding to the feature verification data, wherein the index information comprises: accuracy, precision, recall, F1 value, specificity, and false positive rate; and determining whether the random forest model passes verification according to the index information.
Optionally, determining whether the random forest model is verified according to the index information includes: when the accuracy is larger than or equal to a first threshold, the precision is larger than or equal to a second threshold, the recall rate is larger than or equal to a third threshold, the F1 value is larger than or equal to a fourth threshold, the specificity is larger than or equal to a fifth threshold, and the false positive rate is smaller than or equal to a sixth threshold, the random forest model is determined to pass verification; determining that the random forest model is not validated if the accuracy is less than the first threshold, and/or the precision is less than the second threshold, and/or the recall is less than the third threshold, and/or the F1 value is less than the fourth threshold, and/or the specificity is less than the fifth threshold, and/or the false positive rate is greater than the sixth threshold.
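Since specificity and false-positive rate are not returned directly by common metric helpers, the six-metric gate can be derived from the confusion matrix as in the sketch below; the six threshold arguments are placeholders for the unspecified threshold values.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

def passes_verification(model, x_val, y_val, t1, t2, t3, t4, t5, t6) -> bool:
    y_pred = model.predict(x_val)
    tn, fp, fn, tp = confusion_matrix(y_val, y_pred, labels=[0, 1]).ravel()
    specificity = tn / (tn + fp) if (tn + fp) else 0.0          # true-negative rate
    false_positive_rate = fp / (fp + tn) if (fp + tn) else 0.0  # complement of specificity
    return (accuracy_score(y_val, y_pred) >= t1
            and precision_score(y_val, y_pred) >= t2
            and recall_score(y_val, y_pred) >= t3
            and f1_score(y_val, y_pred) >= t4
            and specificity >= t5
            and false_positive_rate <= t6)
```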
Optionally, after determining whether the transaction data is abnormal transaction data according to the voting result, the method further includes: under the condition that the transaction data is abnormal transaction data, carrying out differential privacy calculation on the transaction data to obtain encrypted transaction data; and sending the encrypted transaction data to a terminal device, so that the terminal device outputs an analysis result according to the encrypted transaction data.
Optionally, inputting the feature data into a pre-trained transaction model includes: determining a transaction value of the target object and a transaction frequency of the target object according to the transaction data; determining whether the transaction data of the target object is abnormal transaction data according to the transaction value and the transaction frequency; and under the condition that the transaction data is determined to be normal transaction data, inputting the characteristic data into a pre-trained transaction model.
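A minimal version of this rule-based pre-screen is sketched below; the value and frequency limits are assumptions for illustration, not values taken from the embodiment.

```python
def prefilter_is_normal(transaction_value: float, transaction_frequency: int,
                        value_limit: float = 50_000.0, frequency_limit: int = 20) -> bool:
    """True means the record is provisionally normal and its features go to the model."""
    return transaction_value <= value_limit and transaction_frequency <= frequency_limit
```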
The device in the application can be a server, a PC, a PAD, a mobile phone and the like.
The application also provides a computer program product adapted to perform, when executed on a data processing device, a program initialized with the following method steps: acquiring transaction data of a target object, and determining characteristic data corresponding to the transaction data; inputting the characteristic data into a pre-trained transaction model so that the transaction model outputs voting results of the transaction data, wherein the transaction model is trained by machine learning using a plurality of groups of data, and each group of data in the plurality of groups of data comprises: feature sample data and a label indicating whether the transaction data corresponding to the feature sample data is abnormal transaction data; and determining whether the transaction data is abnormal transaction data according to the voting result.
Optionally, when the transaction model is a random forest model, inputting the feature data into a pre-trained transaction model, so that the transaction model outputs a voting result of the transaction data, including: determining a classification result of the transaction data through each decision tree in the random forest model; dividing the plurality of classification results into a plurality of groups, wherein the classification results in each group are consistent; and determining the number of the classification results in each group, and determining the classification result corresponding to the group with the largest number as the voting result.
Optionally, before inputting the feature data into the pre-trained transaction model, the method further comprises: and circularly executing a construction step until each decision tree in the random forest model is constructed, wherein the construction step comprises the following steps: calculating the information gain ratio of the corresponding characteristics of each non-leaf node in the decision tree; taking the feature with the maximum information gain ratio as a dividing feature, and taking the node corresponding to the dividing feature as a splitting node; and dividing the characteristic sample data into a plurality of sub-data sets according to the dividing characteristics, and constructing the sub-nodes of the split node according to the plurality of sub-data sets.
Optionally, after the construction step is circularly performed until each decision tree of the random forest model is constructed, the method further comprises: inputting feature verification data into the random forest model, and obtaining a prediction result output by the random forest model; determining index information of the random forest model according to the prediction result and the label data corresponding to the feature verification data, wherein the index information comprises: accuracy, precision, recall, F1 value, specificity, and false positive rate; and determining whether the random forest model passes verification according to the index information.
Optionally, determining whether the random forest model is verified according to the index information includes: when the accuracy is larger than or equal to a first threshold, the precision is larger than or equal to a second threshold, the recall rate is larger than or equal to a third threshold, the F1 value is larger than or equal to a fourth threshold, the specificity is larger than or equal to a fifth threshold, and the false positive rate is smaller than or equal to a sixth threshold, the random forest model is determined to pass verification; determining that the random forest model is not validated if the accuracy is less than the first threshold, and/or the precision is less than the second threshold, and/or the recall is less than the third threshold, and/or the F1 value is less than the fourth threshold, and/or the specificity is less than the fifth threshold, and/or the false positive rate is greater than the sixth threshold.
Optionally, after determining whether the transaction data is abnormal transaction data according to the voting result, the method further includes: under the condition that the transaction data is abnormal transaction data, carrying out differential privacy calculation on the transaction data to obtain encrypted transaction data; and sending the encrypted transaction data to a terminal device, so that the terminal device outputs an analysis result according to the encrypted transaction data.
Optionally, inputting the feature data into a pre-trained transaction model includes: determining a transaction value of the target object and a transaction frequency of the target object according to the transaction data; determining whether the transaction data of the target object is abnormal transaction data according to the transaction value and the transaction frequency; and under the condition that the transaction data is determined to be normal transaction data, inputting the characteristic data into a pre-trained transaction model.

It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, Random Access Memory (RAM) and/or nonvolatile memory, such as Read-Only Memory (ROM) or flash RAM. Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, Phase-change Memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash memory or other memory technology, Compact Disc Read-Only Memory (CD-ROM), Digital Versatile Disc (DVD) or other optical storage, magnetic cassettes, magnetic tape and magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. Computer-readable media, as defined herein, do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a/an ..." does not exclude the presence of other like elements in a process, method, article or apparatus that comprises the element.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and variations of the present application will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. which come within the spirit and principles of the application are to be included in the scope of the claims of the present application.

Claims (10)

1. A method of determining transaction data, comprising:
acquiring transaction data of a target object, and determining characteristic data corresponding to the transaction data;
inputting the characteristic data into a pre-trained transaction model so that the transaction model outputs voting results of the transaction data, wherein the transaction model is trained by machine learning using a plurality of groups of data, and each group of data in the plurality of groups of data comprises: feature sample data and a label indicating whether the transaction data corresponding to the feature sample data is abnormal transaction data;
and determining whether the transaction data is abnormal transaction data according to the voting result.
2. The method of claim 1, wherein, in the case where the transaction model is a random forest model, inputting the feature data into a pre-trained transaction model to cause the transaction model to output voting results of the transaction data, comprises:
determining a classification result of the transaction data through each decision tree in the random forest model;
dividing the plurality of classification results into a plurality of groups, wherein the classification results in each group are consistent;
And determining the number of the classification results in each group, and determining the classification result corresponding to the group with the largest number as the voting result.
3. The method of claim 2, wherein prior to inputting the characteristic data into a pre-trained transaction model, the method further comprises:
circularly executing the construction step until each decision tree in the random forest model is constructed,
wherein, the construction steps include: calculating the information gain ratio of the corresponding characteristics of each non-leaf node in the decision tree; taking the feature with the maximum information gain ratio as a dividing feature, and taking the node corresponding to the dividing feature as a splitting node; and dividing the characteristic sample data into a plurality of sub-data sets according to the dividing characteristics, and constructing the sub-nodes of the split node according to the plurality of sub-data sets.
4. A method according to claim 3, wherein after the construction step is circularly performed until each decision tree of the random forest model is constructed, the method further comprises:
inputting feature verification data into the random forest model, and obtaining a prediction result output by the random forest model;
determining index information of the random forest model according to the prediction result and the label data corresponding to the feature verification data, wherein the index information comprises: accuracy, precision, recall, F1 value, specificity, and false positive rate;
and determining whether the random forest model passes verification according to the index information.
5. The method of claim 4, wherein determining whether the random forest model is validated based on the metric information comprises:
when the accuracy is larger than or equal to a first threshold, the precision is larger than or equal to a second threshold, the recall rate is larger than or equal to a third threshold, the F1 value is larger than or equal to a fourth threshold, the specificity is larger than or equal to a fifth threshold, and the false positive rate is smaller than or equal to a sixth threshold, the random forest model is determined to pass verification;
determining that the random forest model is not validated if the accuracy is less than the first threshold, and/or the precision is less than the second threshold, and/or the recall is less than the third threshold, and/or the F1 value is less than the fourth threshold, and/or the specificity is less than the fifth threshold, and/or the false positive rate is greater than the sixth threshold.
6. The method of claim 1, wherein after determining whether the transaction data is abnormal transaction data based on the voting results, the method further comprises:
under the condition that the transaction data is abnormal transaction data, carrying out differential privacy calculation on the transaction data to obtain encrypted transaction data;
and sending the encrypted transaction data to a terminal device, so that the terminal device outputs an analysis result according to the encrypted transaction data.
7. The method of claim 1, wherein inputting the characteristic data into a pre-trained transaction model comprises:
determining a transaction value of the target object and a transaction frequency of the target object according to the transaction data;
determining whether the transaction data of the target object is abnormal transaction data according to the transaction value and the transaction frequency;
and under the condition that the transaction data are determined to be normal transaction data, inputting the characteristic data into a pre-trained transaction model.
8. A transaction data determining device, comprising:
the acquisition module is used for acquiring the transaction data of the target object and determining the characteristic data corresponding to the transaction data;
The input module is used for inputting the characteristic data into a pre-trained transaction model so that the transaction model outputs voting results of the transaction data, wherein the transaction model is trained by machine learning using a plurality of groups of data, and each group of data in the plurality of groups of data comprises: feature sample data and a label indicating whether the transaction data corresponding to the feature sample data is abnormal transaction data;
and the determining module is used for determining whether the transaction data is abnormal transaction data according to the voting result.
9. A processor for running a program, wherein the program when run performs the method of any one of claims 1 to 7.
10. An electronic device comprising one or more processors and memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-7.
CN202310807655.9A 2023-07-03 2023-07-03 Transaction data determining method and device, processor and electronic equipment Pending CN116757702A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310807655.9A CN116757702A (en) 2023-07-03 2023-07-03 Transaction data determining method and device, processor and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310807655.9A CN116757702A (en) 2023-07-03 2023-07-03 Transaction data determining method and device, processor and electronic equipment

Publications (1)

Publication Number Publication Date
CN116757702A true CN116757702A (en) 2023-09-15

Family

ID=87949597

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310807655.9A Pending CN116757702A (en) 2023-07-03 2023-07-03 Transaction data determining method and device, processor and electronic equipment

Country Status (1)

Country Link
CN (1) CN116757702A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117273749A (en) * 2023-11-21 2023-12-22 青岛巨商汇网络科技有限公司 Transaction management method and system based on intelligent interaction
CN118158021A (en) * 2024-04-12 2024-06-07 天津市瑞晟电子科技有限公司 Data transmission processing method and system based on Glink bus protocol

Similar Documents

Publication Publication Date Title
US10997599B2 (en) Method for detecting merchant data breaches with a computer network server
CN116757702A (en) Transaction data determining method and device, processor and electronic equipment
CN109543373B (en) Information identification method and device based on user behaviors
CN108268886B (en) Method and system for identifying plug-in operation
CN111541702B (en) Network threat security detection method and device
CN112711757B (en) Data security centralized management and control method and system based on big data platform
CN109583731B (en) Risk identification method, device and equipment
Liu et al. A hybrid semi-supervised approach for financial fraud detection
CN113486983A (en) Big data office information analysis method and system for anti-fraud processing
CN117009509A (en) Data security classification method, apparatus, device, storage medium and program product
CN113205442A (en) E-government data feedback management method and device based on block chain
Rahman et al. To predict customer churn by using different algorithms
Wang Research on bank marketing behavior based on machine learning
CN110738570A (en) Information type manipulation automatic identification method based on multi-channel heterogeneous data
CN116228312A (en) Processing method and device for large-amount point exchange behavior
CN112712270B (en) Information processing method, device, equipment and storage medium
CN116957785A (en) Channel account checking method, device, electronic equipment and computer program product
CN111507397B (en) Abnormal data analysis method and device
CN114493858A (en) Illegal fund transfer suspicious transaction monitoring method and related components
CN118229424B (en) Public resource transaction data analysis method and system based on blockchain technology
CN118133167A (en) Training method and device for account classification model and electronic equipment
CN116563001A (en) Processing system and method for supervising report data
CN118608282A (en) Transaction information checking method and device, storage medium and electronic equipment
CN118071488A (en) Transaction report generation method and device, computer storage medium and electronic equipment
CN114662824A (en) Wind control strategy switching method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination