EP3655893A1

EP3655893A1 - Machine learning system for various computer applications

Info

Publication number: EP3655893A1
Application number: EP18755710.3A
Authority: EP
Inventors: Olivier CAELEN; Liyun HE-GUELTON; Pierre-Edouard PORTIER; Michael GRANITZER; Konstantin ZIEGLER; Johannes JURGOVSKY
Original assignee: Worldline SA
Current assignee: Worldline SA
Priority date: 2017-07-18
Filing date: 2018-07-13
Publication date: 2020-05-27
Also published as: CN110998608B; FR3069357B1; FR3069357A1; US20200257964A1; CN110998608A; US11763137B2; WO2019016106A1

Abstract

The invention relates to a machine learning system for various computer applications enabling text mining in order to detect faults or anomalies in an authentication, transaction or operation carried out by the application, comprising: - a hardware and software arrangement forming a preprocessing system; - a hardware and software arrangement forming a neural network leading to an enriched aggregated data processing model, - a hardware and software arrangement for injecting enriched aggregated data into the neural network, and - a hardware and software arrangement for validating the operation or transaction on the basis of results obtained at the output of the neural network.

Description

MACHINE LEARNING SYSTEM FOR VARIOUS IT APPLICATIONS

Technical field of the invention

The invention relates to the field of fraud detection systems, particularly fraud detection during an authentication, an operation or a transaction.

State of the art

Due to the constantly increasing volume of electronic exchanges, the various players are constantly looking for new ways to detect fraud during authentications, operations or transactions.

With the large amount of data seen today, traditional human observation does not meet the essential requirements for accurate fraud detection, given the volume, diversity and dynamic nature of malicious behavior.

Systems using modern data-driven methods and self-learning methods are beginning to be used for the detection of defects in computer applications, such as authentication fraud, particularly fraud related to the use of credit cards.

To do this, these systems generally use neural networks whose statistical learning is based on decision tree forests (random forests) that analyze a sample of non-sequential data.

However, decision tree learning can generate very complex decision trees that generalize poorly beyond the training set and lead to the acceptance of a fraudulent identification that goes undetected. There is therefore a need for a system making it possible to identify anomalies that are not detected by neural networks whose statistical learning is based on decision tree forests (random forests).

Description of the invention

The object of the present invention is therefore to provide a system for detecting fraud during identification, to overcome at least some of the disadvantages of the prior art, by providing a machine learning system for various computer applications allowing a text search for the detection of defects or anomalies in an authentication, operation or transaction performed by the application, comprising:

a hardware and software arrangement forming a preprocessing system;

a hardware and software arrangement forming a neural network leading to an aggregated enriched data processing model,

a hardware and software arrangement for injecting aggregated enriched data into the neural network,

a hardware and software arrangement for validating the operation or transaction on the basis of the results obtained at the output of the neural network.

The neural network leading to the processing model is advantageously:

- a recurrent neural network of the long short-term memory (LSTM) type;

- a neural network for statistical learning of the decision tree type; or

- a combination of both.

Advantageously, the recurrent neural network of the LSTM type comprises at least two recurrent layers and a logistic regression classifier positioned above the last recurrent layer, taking into account the time elapsed between two authentications, operations or transactions.

Advantageously, the hardware and software arrangement for validating the authentication, operation or transaction is parameterized with a Jaccard index matrix in order to measure the degree of similarity between the output data of a first neural network of the LSTM type and those from a hardware and software arrangement of a second neural network for statistical learning of the decision tree type, and to validate the results of one of the two neural networks.

The system is advantageously used for a computer application allowing a risk prediction from the detection of fraud in operations for authenticating electronic memory objects containing, in a zone, secret information used to authenticate the object and its bearer.

Advantageously, the hardware and software arrangement forming a recurrent neural network resulting in an LSTM-type model uses a GPU.

Advantageously, the hardware and software arrangement forming a preprocessing system comprises:

at least one first database containing at least one set of sequential patterns of raw data relating to said computer application,

a hardware and software arrangement forming at least a second database containing at least one set of external data;

a hardware and software arrangement for enriching the raw data with external data;

a hardware and software arrangement for aggregating the enriched data.

Advantageously, the preprocessing system uses a multi-threaded mode.

Brief description of the figures

Other features, details and advantages of the invention will become apparent from a reading of the description which follows, with reference to the appended figures, in which:

- Figure 1 is a schematic representation of a recurrent neural network unrolled in time by creating a copy of the model for each time step;

- Figure 2 shows averaged precision-recall curves on the test set (the figure shows the LSTM results on LONG sequences);

- Figure 3 shows the evolution of the AUCPR over all test days; the horizontal dashed lines indicate the average AUCPR for each curve (the figure shows the LSTM results on LONG sequences);

- Figure 4 shows a pairwise comparison of the true positive sets of two models, measured with the Jaccard index and color-coded in a heat map;

- Figure 5 shows the training architecture of an LSTM model;

- Figure 6 shows a meta-classifier that combines the LSTM model and the random forest model;

- Figure 7 shows a fraud detection framework according to the invention.

Detailed description of various embodiments of the invention

The following description focuses on a credit card fraud detection application of the system, but the system can be applied to other frauds, defects or anomalies in an authentication, operation or transaction performed by various applications executed by a computer system or network.

Depending on the perspective envisaged, fraudulent authentications, operations or transactions can be understood as anomalies in consumer purchasing behavior or as a set of outliers in the class of genuine authentications, operations or transactions, which itself forms a class opposed to the fraudulent ones. In all cases, in the feature space, frauds blend in very well with genuine authentications, operations or transactions, for two reasons. First, the actual purchasing actions of millions of consumers naturally cover a broad spectrum of variability. Second, fraudsters apply a variety of inscrutable, yet rational, strategies to carry out fraudulent acts that span multiple consumer accounts over different time periods; in the end, however, these acts likewise appear only as individual authentications, operations or transactions in a data set. At the same time, identical purchasing actions may reflect either completely legitimate behavior in the context of certain consumers, or obvious anomalies in the context of other consumers.

In order to support better discrimination among authentications, operations or transactions that are difficult to distinguish, two approaches have been identified that make it possible to summarize the transaction history of a consumer and to use this summary when classifying individual transactions. The first method is a well-established practice in the field of credit card fraud detection and is based on manual feature engineering. The second method focuses on recovering the sequential structure of a user's history of authentications, operations or transactions by modeling the transition dynamics between them by means of a recurrent neural network.

A long short-term memory network (LSTM) is a special variant of a recurrent neural network (RNN). Recurrent neural networks were developed in the 1980s [Williams and Hinton, 1986, Werbos, 1988, Elman, 1990] for time series modeling. The structure of an RNN is similar to that of a standard multilayer perceptron, with the difference that it allows connections among hidden units associated with discrete time steps. The time steps index the individual elements in an input sequence. Through the connections between time steps, the model can retain information about past inputs, which allows it to discover temporal correlations between events that may be far apart from one another in the input sequence. This is a crucial property for the appropriate learning of time series in which the occurrence of an event is likely to depend on the presence of several other events that are even more distant in time.

A generic recurrent neural network, with an input $x_t$ and a state $s_t$ for a time step $t$, is represented by equation 1.

The parameters of the model $\Theta = \{W, U, b\}$ are given by the recurrent weight matrix $W$, the input weight matrix $U$ and the bias $b$. The initial state $s_0$ is the zero vector and $\sigma$ is an element-wise nonlinear activation function, tanh in this case. A cost $\varepsilon$ measures network performance on a given task and is typically composed of the costs at all time steps:

$$\varepsilon = \sum_{t} \varepsilon_t$$

Such a composite cost is applicable, for example, to text tagging tasks, in which a label is assigned to each input word. In the present case, only the label of the last authentication, operation or transaction in a sequence is predicted.

The distribution over the fraud and non-fraud classes, given the state $s_t$, is modeled by means of a logistic regression output model. The true label $y_t \in \{0, 1\}$ of an authentication, operation or transaction is interpreted as the probability that $x_t$ belongs to class 0 or 1, and the cost induced by the probabilities $\hat{y}_t$ predicted by the model is measured by means of the cross-entropy error, defined by

$$\varepsilon_t = \mathcal{L}(x_{1:t}, y_t) = -y_t \log \hat{y}_t - (1 - y_t) \log(1 - \hat{y}_t)$$
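The forward pass and the cost of the last time step can be illustrated with a short sketch. The recurrence itself is not reproduced in this text, so the form $s_t = \tanh(W s_{t-1} + U x_t + b)$ used below is an assumption that is merely consistent with the parameters $W$, $U$ and $b$ named above. The output weights `v` and `c` and all numeric values are hypothetical.

```python
import math


def rnn_forward(xs, W, U, b):
    """Run a plain RNN over a sequence and return the last state.

    Assumed recurrence (not given in the text):
        s_t = tanh(W s_{t-1} + U x_t + b)
    """
    n = len(b)
    s = [0.0] * n  # the initial state s_0 is the zero vector
    for x in xs:
        s = [
            math.tanh(
                sum(W[i][j] * s[j] for j in range(n))
                + sum(U[i][k] * x[k] for k in range(len(x)))
                + b[i]
            )
            for i in range(n)
        ]
    return s


def predict_fraud_probability(s, v, c):
    """Logistic regression output on a state: y_hat = sigmoid(v . s + c)."""
    z = sum(vi * si for vi, si in zip(v, s)) + c
    return 1.0 / (1.0 + math.exp(-z))


def cross_entropy(y, y_hat):
    """Cross-entropy cost of one labelled transaction, with y in {0, 1}."""
    return -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat)


# Toy example: three one-dimensional inputs, two hidden units.
W = [[0.5, 0.0], [0.0, 0.5]]
U = [[1.0], [-1.0]]
b = [0.0, 0.0]
xs = [[0.1], [0.2], [0.3]]

s_last = rnn_forward(xs, W, U, b)
y_hat = predict_fraud_probability(s_last, v=[1.0, -1.0], c=0.0)
loss = cross_entropy(1, y_hat)  # only the last transaction is labelled
```

Only the last state feeds the classifier, which mirrors the statement that only the label of the last element of a sequence is predicted.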

The model parameters $\Theta$ are learned by minimizing the cost $\varepsilon_t$ with a gradient-based optimization method. One approach that can be used to calculate the required gradients is backpropagation through time (BPTT). BPTT works by unrolling a recurrent network in time to represent it as a deep multilayer network with as many hidden layers as there are time steps (see Figure 1). Next, the well-known backpropagation algorithm [Williams and Hinton, 1986] is applied to the unrolled network.

Although in principle the recurrent network is a simple and powerful model, in practice it is difficult to train appropriately with gradient descent. Among the many reasons why this model is so hard to train, there are two major problems that have been called the vanishing gradient problem and the exploding gradient problem [Bengio et al., 1994].

With the recurrent connection between latent states, the parameters $\Theta$ affect the error not only through the last state but also through all the previous states. Similarly, the error depends on $W$ through all states $s$. This dependence becomes problematic when calculating the gradient of $\varepsilon_t$ with respect to $\Theta$:

$$\frac{\partial \varepsilon_t}{\partial \Theta} = \sum_{1 \le k \le t} \frac{\partial \varepsilon_t}{\partial s_t} \frac{\partial s_t}{\partial s_k} \frac{\partial s_k}{\partial \Theta}$$

The Jacobian matrix $\partial s_t / \partial s_k$ contains all the component-wise interactions between the state $s_k$ and the state $s_t$. It can be understood as a means of transporting the error back from state $t$ to state $k$. It is given by the product of all pairwise interactions between consecutive states:

$$\frac{\partial s_t}{\partial s_k} = \prod_{t \ge i > k} \frac{\partial s_i}{\partial s_{i-1}}$$

This product is the actual reason why it is so difficult to learn long-term dependencies with gradient-based optimization methods. The longer the dependence between $t$ and $k$, the more factors are multiplied into $\partial s_t / \partial s_k$, as a result of which the gradient norm increases or decreases exponentially with $t - k$. Each factor $\partial s_i / \partial s_{i-1}$ involves both the recurrent weight matrix and the derivative of the activation function. [Pascanu et al., 2013] show that it is sufficient for the largest eigenvalue of the recurrent weight matrix to be smaller than 1 for long-term components to vanish, and necessary for it to be larger than 1 for the gradients to explode.

There are several solutions to reduce these problems.

Using an L1 or L2 penalty on the recurrent weight matrix can ensure that the largest eigenvalue never exceeds 1, given initialization with sufficiently small weights. Another proposal is based on the assumption that if the model exhibits from the beginning the same kind of asymptotic behavior as the target requires, then the gradients are less likely to explode [Doya, 1993]. However, it is not trivial to initialize a model in this specific regime. Gradient clipping is another approach, which clips the element-wise components of the gradient when they exceed a fixed threshold [Mikolov et al., 2011]. Finally, a solution for avoiding the vanishing gradient problem has been proposed by [Hochreiter and Schmidhuber, 1997], by removing the direct dependence of $\partial s_i / \partial s_{i-1}$ on the recurrent weight matrix [Bayer, 2015]. This modified network structure is called the long short-term memory network (LSTM), and it is the state of the art for many real-world tasks such as speech recognition, handwriting recognition and statistical machine translation.
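Element-wise gradient clipping, as mentioned above, can be sketched in a few lines. The function name and the threshold value are illustrative assumptions, not values taken from the described system.

```python
def clip_gradient(grad, threshold):
    """Clip each gradient component to the interval [-threshold, threshold]."""
    return [max(-threshold, min(threshold, g)) for g in grad]


# Hypothetical gradient with two components beyond the threshold of 1.0.
clipped = clip_gradient([0.5, -3.0, 7.0], threshold=1.0)
```

Components within the threshold are left untouched, so the clipping only intervenes when a gradient explodes.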

As an alternative to modeling sequences of authentications, operations or transactions with an LSTM, traditional feature engineering is employed.

Feature aggregations: one means for extracting information from a sequence of authentications, operations or transactions consists in aggregating the values of certain variables along the sequence. To build these feature aggregations, the procedure recently proposed by [Bahnsen et al., 2016] is followed. This simple but powerful procedure can be considered the state-of-the-art feature engineering technique in credit card fraud detection. It adds new features to each authentication, operation or transaction based on certain predefined rules. The value of a new feature is calculated with an aggregation function applied to a subset of previous transactions. The goal is to create a record of activity from the history of authentications, operations or transactions of a cardholder, which quantifies the degree to which the current authentication, operation or transaction conforms to the previous ones.

Let $(x_t)_{t \in \mathbb{N}}$ be the temporally ordered sequence of authentications, operations or transactions of a given cardholder, where $t$ indexes the individual authentications, operations or transactions in the sequence. The value of a particular variable in an authentication, operation or transaction is denoted by a superscript: for example, $x_t^{(Amt)}$ is the amount spent in the authentication, operation or transaction $x_t$. Based on one authentication, operation or transaction $x_k$, a subset of past authentications, operations or transactions is selected up to a maximum time horizon $t_h$ and according to certain nominal variables $A$ and $B$:

$$S_k = \{ x_i \mid hours(x_i^{(Time)}, x_k^{(Time)}) < t_h \wedge x_i^{(A)} = x_k^{(A)} \wedge x_i^{(B)} = x_k^{(B)} \}_{i < k}$$

The set $S_k$ contains all the authentications, operations or transactions of the $t_h$ hours preceding $x_k$ in which the nominal variables $A$ and $B$ took the same values as for $x_k$. The nominal variables $A$ and $B$ and the time horizon $t_h$ can be considered as constraints imposed on the subset. For example, with $A$ := Country, $B$ := MCC and $t_h = 24$, the subset $S_k$ contains all the authentications, operations or transactions of the previous 24 hours that were performed in the same country and in the same merchant category as the authentication, operation or transaction $x_k$.

Aggregation functions can now be defined on $S_k$. There are many possible ways to define such functions, and even though all of them may be equally valid, the present description is limited to the two functions proposed by the authors, namely the total amount spent and the number of transactions:

$$sum_{S_k} = \sum_{x \in S_k} x^{(Amt)}, \qquad count_{S_k} = |S_k|$$

The pair $(sum_{S_k}, count_{S_k})$ corresponds to a single constraint given by $A$, $B$ and $t_h$. To cover a wider range of statistics from the history of authentications, operations or transactions, these pairs are calculated for all combinations of the country, merchant category and card entry mode variables, within a time horizon of 24 hours. Finally, all these pairs are added to the feature vector of the authentication, operation or transaction $x_k$.
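The selection of the subset and the two aggregation functions can be sketched as follows. This is a minimal illustration under stated assumptions: transactions are plain dictionaries, and the field names `time`, `amount`, `country` and `mcc` are hypothetical.

```python
from datetime import datetime, timedelta


def aggregate_features(history, current, key_a, key_b, horizon_hours=24):
    """Return (total amount, count) over the subset of past transactions that
    fall within the time horizon and share the values of two nominal
    variables with the current transaction."""
    horizon = timedelta(hours=horizon_hours)
    subset = [
        tx
        for tx in history  # history holds the transactions preceding `current`
        if timedelta(0) <= current["time"] - tx["time"] < horizon
        and tx[key_a] == current[key_a]
        and tx[key_b] == current[key_b]
    ]
    return sum(tx["amount"] for tx in subset), len(subset)


t0 = datetime(2015, 3, 1, 12, 0)
history = [
    # Older than 24 hours: excluded.
    {"time": t0 - timedelta(hours=30), "amount": 10.0, "country": "FR", "mcc": "5411"},
    {"time": t0 - timedelta(hours=5), "amount": 20.0, "country": "FR", "mcc": "5411"},
    # Different country: excluded.
    {"time": t0 - timedelta(hours=2), "amount": 5.0, "country": "DE", "mcc": "5411"},
    {"time": t0 - timedelta(hours=1), "amount": 7.5, "country": "FR", "mcc": "5411"},
]
current = {"time": t0, "amount": 99.0, "country": "FR", "mcc": "5411"}

total, count = aggregate_features(history, current, "country", "mcc")
```

Calling the function once per pair of nominal variables yields the pairs that are appended to the feature vector.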

Time delta: a sequence learner detects patterns in sequences of consecutive transactions. These patterns are assumed to resemble some form of latent purchasing behavior of cardholders. If this is the case, the behavioral patterns should be invariant to the concrete points in time at which the purchases were actually made. To support temporal normalization on input sequences that span very different time periods, the time in minutes between two consecutive authentications, operations or transactions is extracted and explicitly added as an additional feature:

$$tdelta_i = x_i^{(Time)} - x_{i-1}^{(Time)} \qquad (7)$$
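A minimal sketch of this feature is given below. The text does not say what value the first transaction of a sequence receives, so the value 0.0 used here is an assumption.

```python
from datetime import datetime


def time_deltas_minutes(timestamps):
    """Minutes elapsed between consecutive, temporally ordered timestamps.

    The first element has no predecessor; it is given 0.0 here (assumption).
    """
    if not timestamps:
        return []
    deltas = [0.0]
    for previous, current in zip(timestamps, timestamps[1:]):
        deltas.append((current - previous).total_seconds() / 60.0)
    return deltas


stamps = [
    datetime(2015, 3, 1, 12, 0),
    datetime(2015, 3, 1, 12, 30),
    datetime(2015, 3, 2, 12, 30),
]
deltas = time_deltas_minutes(stamps)
```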

As in any statistical modeling task, the true phenomenon can be observed in the real world only through a proxy, given as a finite set of point observations.

In credit card fraud detection, the true phenomenon of interest is the genuine purchasing behavior of cardholders or, likewise, the malicious behavior of fraudsters. It is assumed that this object, loosely called behavior, is controlled by certain latent but consistent qualities. With its state variables, the LSTM is in principle able to identify these qualities from the sequence of observations.

In the real world, societal conventions, official regulations or simple physics impose constraints on the potential variability of observations and consequently on the complexity of the qualities that control them. For example, opening hours strictly limit when and where consumers are likely to buy their goods or services. Geographic distances and modes of travel limit the possibilities for consecutive transactions. It is to be expected that all of the face-to-face authentications, operations or transactions observed in this database respect, to some extent, these real-world constraints. By contrast, e-commerce authentications, operations or transactions, or rather their corresponding online purchases, are largely unconstrained with respect to both time and location. There is virtually no attribute that cannot change arbitrarily between one authentication, operation or transaction and the next.

It is assumed that the presence of real-world constraints in face-to-face transactions leads to more obvious behavioral patterns with fewer variations. In this case, a sequence learner will take advantage of a more regular sequential structure.

Motivated by these considerations and by previous statistical analyses of purchasing behavior in the real world, it was decided to study separately the impact of a sequence learner on the detection accuracy for e-commerce and for face-to-face authentications, operations or transactions. The results are contrasted with a non-sequence learner, in other words a random forest.

On the basis of a labelled data set of credit card authentications, operations or transactions recorded between March and May 2015, data sets were created as follows: all the authentications, operations or transactions of an identified cardholder are grouped, and the authentications, operations or transactions of each cardholder are sorted by time. As a result, a temporally ordered sequence of authentications, operations or transactions is obtained for each cardholder. In the rest of this description, this sequence is called a cardholder's account, and the complete set of all accounts is called the sequence data set. The sequence data set is further divided into two mutually exclusive sets: one sequence data set contains only the e-commerce authentications, operations or transactions (ECOM), and the other contains only the authentications, operations or transactions made at points of sale (F2F).
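The construction of the sequence data set can be sketched as follows. The field names `card_id`, `time` and `channel`, and the channel values `"ECOM"` and `"F2F"`, are hypothetical, and integer time stamps stand in for real ones.

```python
from collections import defaultdict


def build_accounts(transactions):
    """Group transactions by cardholder and sort each group by time."""
    accounts = defaultdict(list)
    for tx in transactions:
        accounts[tx["card_id"]].append(tx)
    for txs in accounts.values():
        txs.sort(key=lambda tx: tx["time"])
    return dict(accounts)


def split_by_channel(accounts):
    """Split every account into an e-commerce part and a face-to-face part."""
    ecom, f2f = {}, {}
    for card_id, txs in accounts.items():
        ecom_txs = [tx for tx in txs if tx["channel"] == "ECOM"]
        f2f_txs = [tx for tx in txs if tx["channel"] == "F2F"]
        if ecom_txs:
            ecom[card_id] = ecom_txs
        if f2f_txs:
            f2f[card_id] = f2f_txs
    return ecom, f2f


transactions = [
    {"card_id": "A", "time": 3, "channel": "ECOM"},
    {"card_id": "B", "time": 1, "channel": "F2F"},
    {"card_id": "A", "time": 1, "channel": "F2F"},
    {"card_id": "A", "time": 2, "channel": "ECOM"},
]
accounts = build_accounts(transactions)
ecom, f2f = split_by_channel(accounts)
```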

Table 1: Data Set Sizes and Fraud Proportions

Account sampling: a typical characteristic of fraud detection problems is the strong imbalance between the minority class (fraudulent transactions) and the majority class (genuine transactions). The overall fraction of fraudulent authentications, operations or transactions is usually about 0.5% or less. In the F2F data set, frauds occur with a frequency an order of magnitude lower than in the ECOM data set, which further exacerbates the detection problem. Studies in the literature [Bhattacharyya et al., 2011] and previous experiments have shown that some form of under-sampling of the majority class in the training set improves learning. However, unlike transaction-based data sets, in which authentications, operations or transactions are considered as independent training examples, such an under-sampling strategy cannot be applied directly to a sequence data set. Therefore, under-sampling is applied at the account level. In this respect, an account is considered to be compromised if it contains at least one fraudulent authentication, operation or transaction, and is considered to be genuine if it contains only genuine transactions. A simple account-based sampling process was used to construct the training set. With a probability $p_g = 0.9$, an account was randomly selected from the set of genuine accounts and, with a probability $1 - p_g$, an account was selected from the set of compromised accounts. This process is repeated $10^6$ times to create a training set with one million accounts. The de facto fraud ratio at the transaction level is still lower than 1/10, but this simple approach was found to work well in practice. See Table 1 for details on data set sizes and time periods.
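The account-level sampling process can be sketched as below. Drawing with replacement and the fixed random seed are assumptions made for the illustration; the account identifiers are placeholders.

```python
import random


def sample_accounts(genuine, compromised, n_samples, p_genuine=0.9, seed=0):
    """Draw accounts one at a time: a genuine account with probability
    p_genuine, otherwise a compromised one (sampling with replacement)."""
    rng = random.Random(seed)
    sample = []
    for _ in range(n_samples):
        pool = genuine if rng.random() < p_genuine else compromised
        sample.append(rng.choice(pool))
    return sample


# Placeholder identifiers; the text repeats the draw 10**6 times.
training_accounts = sample_accounts(["g1", "g2"], ["c1"], n_samples=10_000)
genuine_share = sum(a.startswith("g") for a in training_accounts) / len(training_accounts)
```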

Delayed ground truth: the test period begins more than a week after the training period. The reason for this decision is twofold. First, in a production system, the labels of authentications, operations or transactions are only available after human investigators have reviewed them. As a result, the availability of accurate ground truth is always delayed by about a week. The second reason is that classification is typically more accurate on recent authentications, operations or transactions that closely follow the training period. But this accuracy is likely to be an overly optimistic estimate of the performance of the classifier in a production system, since in practice the true labels are not yet available at that point.

Alignment of data sets: both the random forest and the LSTM were trained to predict the label of individual transactions. However, there is a difference that must be taken into account in the experiments. With an LSTM, the label of an authentication, operation or transaction can only be predicted after several authentications, operations or transactions have preceded it, whereas with the random forest no previous transaction is required. To improve the comparability of the results, this difference is taken into account by removing all the authentications, operations or transactions that are not preceded by at least w = 9 previous transactions. The random forest (RF) and the LSTM can then be trained, validated and tested on identical sets of transactions. To study the influence of the length of the input sequence on the LSTM predictions, only the 4 (SHORT) or 9 (LONG) previous authentications, operations or transactions are retained.
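The alignment can be sketched as a windowing step over one account. The field names `id` and `fraud` and the helper name are hypothetical; `min_history=9` corresponds to w = 9, and `n_previous` is 4 for SHORT or 9 for LONG.

```python
def make_sequences(account, n_previous, min_history=9):
    """Build (input sequence, label) pairs for one temporally ordered account.

    Only transactions preceded by at least `min_history` others are kept, so
    the SHORT and LONG variants cover exactly the same target transactions.
    """
    if n_previous > min_history:
        raise ValueError("n_previous must not exceed min_history")
    examples = []
    for k in range(min_history, len(account)):
        window = account[k - n_previous : k + 1]  # predecessors plus target
        examples.append((window, account[k]["fraud"]))
    return examples


account = [{"id": i, "fraud": int(i == 11)} for i in range(12)]
short = make_sequences(account, n_previous=4)
long_ = make_sequences(account, n_previous=9)
```

A random forest would use only the last element of each window, which is why both learners can be evaluated on identical target transactions.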

As the data collected during a credit card authentication, operation or transaction must comply with the IFRS (International Financial Reporting Standards), all the raw features are very similar throughout the literature. As a result, all business-specific features were removed and only those commonly used in other studies were kept [Bhattacharyya et al., 2011, Bahnsen et al., 2016, Carneiro et al., 2017]. In order to determine the impact of additional features on classification accuracy, three feature sets were defined.

The first feature set (BASE) contains all the raw features after the business-specific variables have been removed. Since frauds do not usually appear in isolation but rather as elements of complete fraud sequences that may span several hours or days, the identity of the cardholder was removed from the feature set. Otherwise, a classifier could simply remember the identities of cardholders with compromised accounts and make decisions only within this much smaller set of transactions. In practice, however, one would rather know whether there is a fraudulent authentication, operation or transaction and then mark the account as compromised. The second feature set (TDELTA) contains all the features of the BASE set plus the time-delta feature described above. The third feature set (AGG) contains all the features of the TDELTA set plus 14 aggregated features as described above. The authentications, operations or transactions of the preceding 24 hours were aggregated in terms of the amount and the number of authentications, operations or transactions, based on all combinations of the nominal variables term-mcc, term-country and card-entry-mode. See Table 2 for an overview of the features.

Table 2: List of features in these data sets. Marked features (*) are composite features composed of several lower-level features.

Feature                Type
TERM-MCC               Nominal
TERM-COUNTRY           Nominal
TX-AMOUNT              Proportional
TX-DATETIME (*)        Nominal
TX-3D-SECURE           Nominal
TX-EMV                 Nominal
TX-LOCAL-CURRENCY      Nominal
TX-LOCAL-AMOUNT        Proportional
TX-PROCESS             Nominal
TX-CARD-ENTRY-MODE     Nominal
BROKER                 Nominal
CARD-BRAND             Nominal
CARD-EXPIRY            Nominal
CARD-TYPE              Nominal
CREDIT-LIMIT           Proportional
CARD-AUTHENTICATION    Nominal
TDELTA                 Proportional
AGGREGATIONS (*)       Proportional

Proportional variables: a Gaussian normalization was applied to proportional variables, such as the amount of an authentication, operation or transaction or the credit limit, in order to center each variable on μ = 0 with a standard deviation σ = 1. This normalization has no effect on the learning of a random forest, but it accelerates the convergence of gradient-based optimization in neural networks.
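The Gaussian normalization of a proportional variable can be sketched as follows. Using the population standard deviation and mapping a constant variable to zeros are assumptions made for this illustration.

```python
import math


def standardize(values):
    """Center values on mean 0 and scale them to standard deviation 1."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance)
    if std == 0.0:
        return [0.0 for _ in values]  # a constant variable carries no signal
    return [(v - mean) / std for v in values]


normalized = standardize([1.0, 2.0, 3.0])
```

In a train and test setting, the mean and standard deviation would typically be estimated on the training set only and then reused for the other sets.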

Nominal variables: in the case of the random forest, the nominal variables can be used just as they are; each value is simply mapped to an integer. In the case of neural networks, the aim was to avoid very high-dimensional one-hot encoded feature vectors. Therefore, a label encoding mechanism was employed that is very popular in the field of natural language processing and neural networks [Collobert et al., 2011, Socher et al., 2013, Tang et al., 2014] and that is applicable to arbitrary nominal variables other than words [Guo and Berkhahn, 2016]. For a nominal variable with its set of values $C$, each value is assigned a random $d$-dimensional weight vector $v$ drawn from a multivariate uniform distribution $v \sim U([-0.05, 0.05]^d)$, with $d = \lceil \log_2(|C|) \rceil$. The feature values and their corresponding vectors (vector embeddings of the feature values) are stored in a dictionary. To encode a particular value of the nominal variable, the value is looked up in the dictionary and its vector is retrieved. The embedding vectors are part of the parameters of the model and can be adjusted jointly during parameter estimation.
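The initialization of such an embedding dictionary can be sketched as follows. The function name, the seed and the minimum dimension of 1 for a single-valued variable are assumptions. The sketch covers initialization and lookup only, not the joint training of the vectors.

```python
import math
import random


def build_embedding_table(values, seed=0):
    """Assign each distinct value a random vector of dimension
    d = ceil(log2(|C|)), drawn uniformly from [-0.05, 0.05]."""
    rng = random.Random(seed)
    categories = sorted(set(values))
    # Assumption: use at least one dimension when the variable has one value.
    d = max(1, math.ceil(math.log2(len(categories))))
    return {c: [rng.uniform(-0.05, 0.05) for _ in range(d)] for c in categories}


table = build_embedding_table(["FR", "DE", "US", "FR", "BE", "IT"])
vector = table["FR"]  # encoding a value is a dictionary lookup
```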

Time feature: the time feature is considered as a composition of several nominal variables. For each temporal resolution of the time feature, i.e. the year, the month, the day of the week, the day, the hour, the minute and the second, a nominal variable is defined in the same way as described above.

The long short-term memory network has two recurrent layers and a logistic regression classifier stacked above the last layer. The logistic regression classifier can be trained jointly with the LSTM state transition model via error backpropagation. Dropout [Srivastava et al., 2014] is applied to the LSTM nodes to regularize the parameters, and the whole model is trained by minimizing the cross-entropy between the predicted class distribution and the true class distribution with the Adam algorithm. This implementation is based on the Keras deep learning library.

As the potential benefits of an LSTM-based sequence learning approach over a static learner are studied, an instance of the class of static learners must be chosen. Random forests are chosen here for the comparison. In previous experiments, it was observed that random forests provide a strong baseline for this task, which also explains their widespread use for fraud detection [Carneiro et al., 2017, Bahnsen et al., 2016, Ngai et al., 2011]. The random forest implementation of scikit-learn is used.

Grid search: both the random forest (RF) and the LSTM must be parameterized with hyper-parameters. The space of possible hyper-parameter configurations was searched on a coarse grid spanned by a subset of all hyper-parameters (see Table 3). The configuration with the maximum AUCPR@0.2 on the validation set was then selected.

Table 3: Hyper-parameters taken into consideration during the grid search

Two criteria guide the selection of suitable performance metrics: robustness to imbalanced classes and attention to business-specific interests.

AUCPR: a precision-recall (PR) curve, and in particular the area under this curve, was used to quantify the detection accuracy. Each point on the PR curve corresponds to the precision of the classifier at a specific recall level. As a result, the entire curve gives a complete picture of the accuracy of a classifier and of its robustness even in imbalanced settings. The integral over this curve yields a single-valued summary of performance, and is called AUCPR.
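The area under the PR curve can be approximated from ranked classifier scores as sketched below. The step-wise integration, the handling of tied scores (ignored) and the optional `max_recall` argument that truncates the integral at a given recall level are assumptions of this sketch, not details given in the text.

```python
def precision_recall_points(labels, scores):
    """Return (recall, precision) after each item of the ranked result list."""
    n_positive = sum(labels)
    if n_positive == 0:
        raise ValueError("at least one positive label is required")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    tp = fp = 0
    points = []
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((tp / n_positive, tp / (tp + fp)))
    return points


def aucpr(labels, scores, max_recall=1.0):
    """Step-wise area under the PR curve, truncated at max_recall."""
    area = 0.0
    previous_recall = 0.0
    for recall, precision in precision_recall_points(labels, scores):
        if recall > previous_recall:
            upper = min(recall, max_recall)
            area += (upper - previous_recall) * precision
            previous_recall = recall
            if recall >= max_recall:
                break
    return area


labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
full_area = aucpr(labels, scores)
truncated_area = aucpr(labels, scores, max_recall=0.2)
```

With the integral truncated at a recall of 0.2, the result cannot exceed 0.2, since precision is at most 1.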

AUCPR@0.2: from a business point of view, low recall and high precision are preferable to high recall and low precision. A typical choice is therefore to measure the precision on the first K elements of the ranked result list. This precision at K corresponds to an isolated point on the PR curve and is likely to vary with the value chosen for K. In order to reflect business interests and to avoid this variability problem, it is suggested to use the integral over the PR curve calculated up to a certain recall level (0.2 in the present experiments). The maximum value for AUCPR@0.2 is 0.2.

Jaccard index: to explore the qualitative differences between the two present approaches, the Jaccard index was used to measure the degree to which two classifiers are similar in terms of the frauds they detect. With two sets of results (true positives) $A$ and $B$, the Jaccard index is defined by

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

The decision threshold is set such that it corresponds to a recall of 0.2.
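The Jaccard index of two true-positive sets can be computed directly with set operations. The transaction identifiers below are made up, and returning 1.0 for two empty sets is an assumed convention.

```python
def jaccard_index(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # assumed convention: two empty sets are identical
    return len(a & b) / len(union)


# Hypothetical identifiers of the frauds detected by each model.
true_positives_lstm = {"tx1", "tx2", "tx3"}
true_positives_rf = {"tx2", "tx3", "tx4"}
similarity = jaccard_index(true_positives_lstm, true_positives_rf)
```

A low value indicates that the two models tend to detect different frauds, which is the situation in which combining or cross-validating their outputs is informative.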

Savings: savings are another metric that is often used in the field of credit card fraud detection. They measure the monetary benefit of a given algorithm over a trivial acceptor/rejector and are based on a predefined cost matrix. Classifying a single authentication, operation or transaction with a binary classifier can have four possible outcomes, defined by the two predictions (p = 0 or p = 1) and the two true labels (y = 0 or y = 1). Each of these outcomes can be associated with a monetary cost induced by an investigation process that accepts p as a decision in the light of the true label y. Table 4 presents the cost matrix.

Table 4: Cost Matrix

         y = 1                    y = 0
p = 1    [illegible in source]    [illegible in source]
p = 0    [illegible in source]    0

The individual entries are composed of a processing cost $C_p$, a chargeback cost $C_{cb}$ and a transaction-dependent cost $g(\cdot)$. g represents the loss of money due to fraud occurring while the investigation process is in progress. It is defined by:

$$g(x_i) = \sum_{x_j \in F_i} x_j^{(Amt)} \qquad (8)$$

where $F_i$ is the set of fraudulent authentications, operations or transactions that occur up to T hours after the authentication, operation or transaction $x_i$:

$$F_i = \{\, x_j \mid \mathrm{hours}(x_i^{(Time)}, x_j^{(Time)}) < T \wedge x_j^{(Fraud)} = 1 \,\} \qquad (9)$$

[0059] Due to business regulations, the particular values of $C_p$, $C_{cb}$ and T cannot be given. It can be clearly stated, however, that outside of a particular business context there is no reason to report the classification performance of statistical models in terms of monetary savings. This measure depends entirely on the cost matrix. It has been included only because it was found to be commonly used in related work. In contrast, the AUCPR should be the metric of choice for comparisons between different classification methods: it is objective and therefore allows more general conclusions that are also valid outside a particular business context.
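By way of a hedged illustration, the transaction-dependent cost g can be sketched as follows. The sketch assumes that g adds up the amounts of the fraudulent transactions falling in the window of T hours that starts at the transaction under investigation, that this transaction itself belongs to the window when it is fraudulent, and that the list holds the history of a single card. The field names, the sample data and the value T = 24 are hypothetical, since the actual value of T is not disclosed.

```python
from datetime import datetime, timedelta


def fraud_window(transactions, i, horizon_hours):
    """Fraudulent transactions occurring less than horizon_hours after transaction i."""
    start = transactions[i]["time"]
    horizon = timedelta(hours=horizon_hours)
    return [
        tx for tx in transactions
        if tx["fraud"] == 1 and timedelta(0) <= tx["time"] - start < horizon
    ]


def g(transactions, i, horizon_hours):
    """Assumed loss: total amount of the fraudulent transactions in the window."""
    return sum(tx["amount"] for tx in fraud_window(transactions, i, horizon_hours))


# Hypothetical history of one card.
history = [
    {"time": datetime(2020, 5, 9, 10, 0), "amount": 50.0, "fraud": 1},
    {"time": datetime(2020, 5, 9, 12, 0), "amount": 20.0, "fraud": 0},
    {"time": datetime(2020, 5, 9, 15, 0), "amount": 30.0, "fraud": 1},
    {"time": datetime(2020, 5, 10, 15, 0), "amount": 99.0, "fraud": 1},
]
loss_first = g(history, 0, horizon_hours=24)
loss_second = g(history, 1, horizon_hours=24)
```

In this example the last transaction occurs 29 hours after the first one and therefore does not contribute to either loss.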

A model was trained for each combination of feature set, data set and sequence length, and its classification performance was tested on the held-out test set. In the case of random forests, the length of the input sequence has no influence on the model, since only the last authentication, operation or transaction of the input sequence is used. The trained models were evaluated on each of the 24 test days individually, and their average performance is reported in terms of the metrics defined above.

Table 5 and Table 6 show a summary of the results for the face-to-face (F2F) and e-commerce (ECOM) data sets. A first observation is that the overall detection accuracy is much higher on ECOM than on F2F, which can be explained by the higher proportion of frauds in ECOM. Secondly, longer input sequences seem to have no effect on the detection accuracy, neither for F2F nor for ECOM. Third, taking into account prior authentications, operations or transactions with an LSTM significantly improves the detection of fraud on F2F. However, this improvement is not observable on ECOM; instead, the results of the static learning approach and of the sequence learning approach are surprisingly similar.

Table 5: Average AUC on all test days. Sequence lengths (SHORT, LONG) and sets of features (BASE, TDELTA, AGG)

Table 6: Average AUC on all test days. Sequence lengths (SHORT, LONG) and sets of features (BASE, TDELTA, AGG)

ECOM

Sequence  Features  AUCPR (μ)        AUCPR@0.2 (μ)    Savings [%]
length              RF      LSTM     RF      LSTM     RF       LSTM
SHORT     BASE      0.179   0.180    0.102   0.099    7.13%    18.82%
SHORT     TDELTA    0.236   0.192    0.124   0.107    9.02%    15.30%
SHORT     AGG       0.394   0.380    0.158   0.157    39.58%   45.00%
LONG      BASE      0.179   0.178    0.101   0.104    7.60%    15.04%
LONG      TDELTA    0.228   0.238    0.118   0.115    10.77%   18.51%
LONG      AGG       0.404   0.402    0.158   0.160    38.73%   42.93%

Another observation confirms the finding that feature aggregations improve the detection of fraud. Their impact is much more obvious on ECOM than on F2F. The observation that feature aggregations are useful in cases where the sequence model is not suggests that these two forms of context representation are not correlated, and that the approaches are complementary. Whatever information the LSTM states track in the history of authentications, operations or transactions, it is not the same as the information that has been manually added through aggregations.

Apparently, an LSTM improves the detection of fraud in face-to-face authentications, operations or transactions in terms of AUCPR. It is worth asking where this improvement comes from. Figure 2 shows the precision-recall curves of all model variants. In Figure 2a, it can be seen that the PR curves of the RF models have a high precision peak at low recall levels, but that the precision drops rapidly as the recall increases. In contrast, the LSTM models have slightly lower precision at low recall levels, but retain higher precision as the recall increases. However, there is an interesting exception: once aggregated features have been added, the PR curve of the random forest rises by an appreciable margin to a performance equal to that of the LSTM models. No such clear gain is observed for the LSTMs. For e-commerce authentications, operations or transactions (see Figure 2b), the PR curves of the random forest and of the LSTM are virtually identical for all feature sets. RF and LSTM benefit from aggregated features by the same margin.

[0064] Tables 5 and 6 report the average statistics across all test days. When the AUCPRs of the RF and of the LSTM are plotted for the individual test days, it can be seen in Figure 3 that the predictions of the two classifiers vary strongly from day to day. However, as the curves are correlated, we can deduce that on some days the detection problem is more difficult than on others. For example, both classifiers reach their minimum AUCPR in the periods 9/05 - 10/05 and 25/05 - 26/05. By manual inspection, attempts were made to link the authentications, operations or transactions of these days to public events or to the calendar, but no satisfactory explanation could be found for this poor performance.

In this analysis, a more in-depth examination of the frauds detected with the RF and the LSTM was carried out. Pairs of models were drawn from the set of all trained models and their predictions were compared. The decision threshold was again chosen to correspond to a recall level of 0.2. All predictions with a score above the threshold were considered positive predictions, and all others negative predictions. Fixing the recall ensured an equal number of true positives in the result sets of a pair of models. The question of interest, however, was whether the true positives of the RF are the same as those of the LSTM. The overlap of the true positive sets of a pair of models was measured with the Jaccard index. Figure 4 shows all pairwise comparisons in the form of a density map.
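The comparison procedure described above can be sketched as follows. This is a minimal illustration rather than the evaluation code actually used: the function names and the toy scores are hypothetical, and the threshold is taken as the score of the lowest-ranked fraud needed to reach the requested recall, with scores equal to the threshold counted as positive.

```python
import math


def threshold_at_recall(scores, labels, recall_level):
    """Score of the lowest-ranked fraud needed to reach at least recall_level."""
    fraud_scores = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    # Rounding guards against floating-point noise before taking the ceiling.
    needed = max(1, math.ceil(round(recall_level * len(fraud_scores), 9)))
    return fraud_scores[needed - 1]


def true_positive_ids(scores, labels, recall_level=0.2):
    """Indices of the frauds whose score reaches the decision threshold."""
    threshold = threshold_at_recall(scores, labels, recall_level)
    return {i for i, (s, y) in enumerate(zip(scores, labels)) if y == 1 and s >= threshold}


def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hypothetical test set: ten frauds (indices 0 to 9) followed by five genuine cases.
labels = [1] * 10 + [0] * 5
rf_scores = [0.9, 0.8] + [0.1] * 13
lstm_scores = [0.2, 0.95, 0.85] + [0.1] * 12

rf_hits = true_positive_ids(rf_scores, labels)
lstm_hits = true_positive_ids(lstm_scores, labels)
overlap = jaccard(rf_hits, lstm_hits)
```

Here both models detect two frauds at a recall of 0.2 but share only one of them, so the overlap is one third. Repeating the computation for every pair of models gives the entries of a pairwise comparison matrix.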

On the two density maps, four distinct zones are observed: two zones that correspond to intra-model comparisons and two zones that correspond to inter-model comparisons. The Jaccard indices suggest that both the RF and the LSTM are consistent in the frauds they detect. This property is slightly more pronounced in the random forest comparisons. However, the central and most striking observation is that the RF and the LSTM tend to detect different frauds. On F2F, the RF models agree on 50.8% of their true positives on average and the LSTM models on 37.8%. Between the two model classes, there is an average agreement of only 25.2%. The picture is similar for ECOM, with average intra-model agreements of 47.5% (RF) and 50.8% (LSTM) and an average inter-model agreement of only 35.0%.

[0067] There is one exception to this general observation. Models that have been trained with aggregated features tend to detect a common, distinct set of frauds that is detected neither by random forests nor by LSTMs without aggregated features. This property is much more pronounced for ECOM than for F2F.

During the present experiments, it has been found that applying long short-term memory networks to such structured data is not as simple as one might think. We would therefore like to share some observations that might be useful for practitioners.

Model regularization: when dealing with a temporal process for which one aims to predict certain properties of future events, no collection of historical data points can truly satisfy the requirements of a representative validation set. The accuracy of a prediction on the day just after the end of the training set is better than on days further in the future, which suggests a time dependence of the conditional distribution. When the days just after the training period are chosen as the validation set, the results on this set will suggest little regularization of the model. But this choice has the opposite effect on the performance on days further in the future. A very confident model of today's data will probably be badly wrong in a few days, while a less confident model of today's data will still be valid in a few days. This is less problematic for ensemble classifiers such as random forests, but it is a problem for neural networks. A simple workaround is to use dropout on the network structure. It samples smaller networks from the complete structure, trains them independently and finally averages the hypotheses of these smaller networks. Predictions based on this averaged hypothesis are more stable over time.

Online learning: stochastic gradient descent and the many variants that have been developed for training neural networks (ADAM, RMSprop, Adagrad) are able to update the model iteratively, even from imprecise errors that have been estimated on small sets of training examples. This property combines well with the requirement that businesses keep their detection models up to date with the incoming stream of authentication, operation or transaction data.
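The iterative update property can be illustrated with a minimal sketch in which a logistic regression model stands in for the neural network and is refreshed with one small batch per day. The model, the synthetic data stream, the learning rate and all names are assumptions chosen for this example and do not describe the system actually deployed.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sgd_update(weights, batch, learning_rate=0.5):
    """One gradient step of logistic regression on a small batch.

    `batch` is a list of (features, label) pairs. The gradient computed here is
    only a noisy estimate of the gradient over all data.
    """
    grad = [0.0] * len(weights)
    for features, label in batch:
        error = sigmoid(sum(w * x for w, x in zip(weights, features))) - label
        for j, x in enumerate(features):
            grad[j] += error * x / len(batch)
    return [w - learning_rate * g for w, g in zip(weights, grad)]


# Hypothetical stream: each day delivers eight examples; the label is 1 when x > 0.
random.seed(0)
weights = [0.0, 0.0]  # [bias, slope]
for day in range(200):
    batch = []
    for _ in range(8):
        x = random.uniform(-1.0, 1.0)
        batch.append(([1.0, x], 1 if x > 0 else 0))
    weights = sgd_update(weights, batch)
```

Each daily call touches only the newest batch, so the model follows the data stream without being retrained on the full history.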

Comments on the training of LSTMs: because of its recurrent structure, the LSTM is prone to overfitting even when the LSTM layers have only a few nodes. Therefore, it is recommended to start with a rather small structure and to increase its size with caution, as long as there is reason to expect further gains in generalization performance. We have noticed that an ℓ2 penalty leads to much smoother convergence and better optima than an ℓ1 penalty. The ADAM optimizer works much better than a conventional SGD algorithm in the present experiments, since it estimates an appropriate learning rate schedule on the fly.
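For illustration, one ADAM update combined with a weight penalty can be sketched as below. The sketch assumes a squared-norm penalty whose gradient is simply added to the loss gradient; the hyperparameter values, the one-dimensional toy objective and all names are assumptions made for this example and are not taken from the experiments.

```python
import math


def adam_step(params, grads, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, l2=0.0):
    """One ADAM update; `l2` adds the gradient of 0.5 * l2 * ||params||^2."""
    state["t"] += 1
    t = state["t"]
    new_params = []
    for j, (p, g) in enumerate(zip(params, grads)):
        g = g + l2 * p  # penalty gradient
        # Running averages of the gradient and of its square.
        state["m"][j] = beta1 * state["m"][j] + (1 - beta1) * g
        state["v"][j] = beta2 * state["v"][j] + (1 - beta2) * g * g
        # Bias correction for the zero-initialised averages.
        m_hat = state["m"][j] / (1 - beta1 ** t)
        v_hat = state["v"][j] / (1 - beta2 ** t)
        # The step is rescaled per parameter, which adapts the effective learning rate.
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
    return new_params


def minimise(l2, steps=1000):
    """Minimise the toy objective (w - 3)^2 plus the optional penalty."""
    w = [0.0]
    state = {"t": 0, "m": [0.0], "v": [0.0]}
    for _ in range(steps):
        grads = [2.0 * (w[0] - 3.0)]
        w = adam_step(w, grads, state, lr=0.1, l2=l2)
    return w[0]


w_plain = minimise(0.0)
w_penalised = minimise(0.1)
```

With the penalty switched on, the minimiser of the toy objective moves from 3 towards zero (to 6/2.1), which is the shrinking effect a weight penalty is meant to have.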

Combined approach: qualitatively, one difference remains between the random forests and the LSTMs even after the addition of aggregated features. On face-to-face transactions, the LSTM detects a different set of frauds than the random forest, and the two sets differ consistently more than the sets detected within each individual model family. It is presumed that this difference can be explained by the presence of more distinct sequential patterns, which are guided and framed by real-world constraints. Therefore, in the F2F scenario, the combination of a sequence learner with a static learner and aggregated features is likely to further improve the detection accuracy.

Depending on the type of application, or on the type of fraud, defect or anomaly in an authentication, operation or transaction that the operator wants to detect, the system can use only the recurrent neural network of the long short-term memory (LSTM) type, or only the statistical learning algorithm of the decision tree type, or a combination of both (see Figure 6).

It will be readily understood from the present description that the features of the present invention, as generally described and illustrated in the figures, can be arranged and designed in a wide variety of different configurations. Thus, the description of the present invention and accompanying figures are not intended to limit the scope of the invention, but represent only selected embodiments.

Those skilled in the art will understand that the technical features of a given embodiment may in fact be combined with features of another embodiment, unless the reverse is explicitly mentioned, or if it is obvious that these features are incompatible. In addition, the technical features described in one embodiment can be isolated from the other features of this mode, unless the reverse is explicitly mentioned.

It should be obvious to those skilled in the art that the present invention allows embodiments in many other specific forms without departing from the scope defined by the intended protection. The illustration and the invention should not be limited to the details given above.

Claims

1. A machine learning system for various computer applications enabling text mining in order to detect defects or anomalies in an authentication, transaction or operation performed by the application, comprising:

• a hardware and software arrangement forming a preprocessing system;

• a hardware and software arrangement forming a recurrent neural network of the long short-term memory (LSTM) type, alone or in combination with a statistical learning algorithm of the decision tree type, and leading to a model for processing aggregated enriched data from the preprocessing system;

• a hardware and software arrangement for injecting aggregated enriched data from the preprocessing system into the neural network;

• a hardware and software arrangement for validating the authentication, operation or transaction based on the results obtained at the output of the neural network,

characterized in that the recurrent neural network of the LSTM type comprises at least two recurrent layers and a logistic regression classifier positioned above the last recurrent layer, the logistic regression classifier taking into account the time elapsed between two authentications, operations or transactions during its implementation.

2. System according to the preceding claim, wherein the hardware and software arrangement for validating the authentication, operation or transaction is parameterized with a matrix of Jaccard indices in order to measure the degree of similarity between the output data of a first algorithm in the form of a neural network of the LSTM type and the output data coming from a hardware and software arrangement of a second algorithm for statistical learning of the decision tree type, and to validate the results of one of the two neural networks.

3. System according to one of the preceding claims, which is used for a computer application allowing a prediction of risk from the detection of fraud in operations for authenticating objects having an electronic memory that contains, in one area, secret information used to authenticate the object and its holder.

4. System according to one of the preceding claims, wherein the hardware and software arrangement forming a recurrent neural network driving an LSTM type model uses a GPU.

5. System according to one of the preceding claims, wherein the hardware and software arrangement forming a preprocessing system comprises:

a hardware and software arrangement forming at least a second database containing at least one set of external data,

a hardware and software arrangement for enriching raw data with external data,

a hardware and software arrangement for aggregating the enriched data.

6. System according to one of the preceding claims, wherein the preprocessing system uses a multithreaded mode.