WO2022143431A1 - Method and apparatus for training anti-money laundering model

Info

Publication number: WO2022143431A1
Application number: PCT/CN2021/140997
Authority: WO
Inventors: 徐紫绮; 朱晓丹; 王萌
Original assignee: 第四范式（北京）技术有限公司
Priority date: 2020-12-30
Filing date: 2021-12-23
Publication date: 2022-07-07
Also published as: CN112634048B; CN112634048A

Abstract

Provided are a method and apparatus for training an anti-money laundering model, comprising: acquiring a source domain sample set and a target domain sample set, both source domain samples and target domain samples being transaction samples used to train an anti-money laundering model; classifying features involved in the source domain sample set and the target domain sample set, and determining a common feature set of the source domain sample set and the target domain sample set, a unique feature set of the source domain sample set, and a unique feature set of the target domain sample set; uniformly encoding the features in the source domain sample set and the features in the target domain sample set into a feature space corresponding to the union of the common feature set of the source domain sample set and the target domain sample set, the unique feature set of the source domain sample set, and the unique feature set of the target domain sample set; merging the source domain sample set and the target domain sample set that have been uniformly encoded; and training the anti-money laundering model on the basis of the merged sample set.

Description

An anti-money laundering model training method and device

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to Chinese patent application No. 202011625865.9, filed by Fourth Paradigm (Beijing) Technology Co., Ltd. on December 30, 2020 and entitled "An Anti-Money Laundering Model Training Method and Device", the disclosure of which is incorporated herein by reference.

TECHNICAL FIELD

The present application relates to the field of computer technology, and in particular to a training method and device for an anti-money laundering model.

BACKGROUND

With the development of Internet technology, more and more transactions in the financial field rely on the Internet, and money laundering has gradually moved onto the Internet along with them. Money laundering is the process of concealing or disguising illegally obtained income by passing it through apparently legitimate activities or investment. To maintain social justice and combat economic crimes such as corruption, money laundering on the Internet must be monitored. This monitoring is mainly accomplished by using anti-money laundering models to analyze and identify Internet data.

Traditional anti-money laundering methods usually use anti-money laundering models to identify money laundering behavior. Such models need to be trained on a large number of samples with known labels. The sample labels come mainly from a rule system configured by professionals with high business literacy, and the quality of the labels may vary. Therefore, to train an anti-money laundering model that identifies money laundering behavior well, a large amount of human resources must be invested in label review over a long period. However, label review carries operational risk, and the experience of reviewers may become invalid.

SUMMARY OF THE INVENTION

In view of this, the present application proposes an anti-money laundering model training method and device, the main purpose of which is to complete the anti-money laundering model training task for the target domain sample set by introducing the features of the source domain sample set, so as to improve the effect of anti-money laundering identification. The main technical solutions include:

In a first aspect, the present application provides an anti-money laundering model training method. The method includes: obtaining a source domain sample set and a target domain sample set, wherein the source domain samples and the target domain samples are both transaction samples used for training the anti-money laundering model; classifying the features involved in the source domain sample set and the target domain sample set to determine a common feature set of the source domain sample set and the target domain sample set, a unique feature set of the source domain sample set, and a unique feature set of the target domain sample set; uniformly encoding the features in the source domain sample set and the features in the target domain sample set into the feature space corresponding to the union of the common feature set, the unique feature set of the source domain sample set, and the unique feature set of the target domain sample set; merging the uniformly encoded source domain sample set and target domain sample set; and training the anti-money laundering model based on the merged sample set.

In a second aspect, the present application provides an anti-money laundering model training device. The device includes: an acquisition unit configured to acquire a source domain sample set and a target domain sample set, wherein the source domain samples and the target domain samples are both transaction samples used for training the anti-money laundering model; a classification unit configured to classify the features involved in the source domain sample set and the target domain sample set and to determine a common feature set of the source domain sample set and the target domain sample set, a unique feature set of the source domain sample set, and a unique feature set of the target domain sample set; an encoding unit configured to uniformly encode the features in the source domain sample set and the features in the target domain sample set into the feature space corresponding to the union of the common feature set, the unique feature set of the source domain sample set, and the unique feature set of the target domain sample set; a merging unit configured to merge the uniformly encoded source domain sample set and target domain sample set; and a training unit configured to train the anti-money laundering model based on the merged sample set.

In a third aspect, the present application provides a computer-readable storage medium. The storage medium includes a stored program, wherein, when the program is run, a device on which the storage medium is located is controlled to execute the anti-money laundering model training method of the first aspect.

In a fourth aspect, the present application provides a storage management device. The storage management device comprises: a memory configured to store a program; and a processor coupled to the memory and configured to execute the program so as to perform the anti-money laundering model training method of the first aspect.

With the above technical solutions, the anti-money laundering model training method and device provided by the present application first obtain a source domain sample set and a target domain sample set, classify the features involved in the two sample sets, and determine the common feature set of the source domain sample set and the target domain sample set, the unique feature set of the source domain sample set, and the unique feature set of the target domain sample set. The features in the source domain sample set and the features in the target domain sample set are then uniformly encoded into the feature space corresponding to the union of these three feature sets. The uniformly encoded source domain sample set and target domain sample set are merged, and an anti-money laundering model is trained based on the merged sample set. It can be seen that the solution provided by this application completes the anti-money laundering model training task for the target domain sample set by introducing the features of the source domain sample set. The anti-money laundering model can therefore learn the existing knowledge in the source domain sample set and the new knowledge in the target domain sample set at the same time, which accumulates and consolidates existing knowledge while acquiring new knowledge and can improve the anti-money laundering identification effect of the model.

The above description is only an overview of the technical solution of the present application. So that the technical means of the present application can be understood more clearly and implemented according to the content of the description, and so that the above and other purposes, features, and advantages of the present application are more obvious and easier to understand, specific embodiments of the present application are set out below.

BRIEF DESCRIPTION OF THE DRAWINGS

To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the accompanying drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 shows a flowchart of a training method for an anti-money laundering model provided by an embodiment of the present application;

FIG. 2 shows a flowchart of a training method for an anti-money laundering model provided by another embodiment of the present application;

FIG. 3 shows a schematic structural diagram of a training device for an anti-money laundering model provided by an embodiment of the present application;

FIG. 4 shows a schematic structural diagram of a training device for an anti-money laundering model provided by another embodiment of the present application.

DETAILED DESCRIPTION

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided so that the present disclosure will be more thoroughly understood, and will fully convey the scope of the present disclosure to those skilled in the art.

Money laundering behavior is often hidden in the transaction process in the financial field, so the data generated by transaction behavior in that process includes a large number of features related to money laundering behavior, and these features can be used as the training basis for anti-money laundering models. At present, small or newly created financial institutions may not have enough data to train an anti-money laundering model with a good anti-money laundering effect. The embodiments of the present application therefore use a source domain sample set, which carries existing knowledge, together with a target domain sample set, which carries the new knowledge to be learned, to train the anti-money laundering model. A model trained in this way can learn the existing knowledge in the source domain sample set and the new knowledge in the target domain sample set at the same time, which accumulates and consolidates existing knowledge while acquiring new knowledge and can improve the anti-money laundering identification effect of the model.

As shown in FIG. 1 , an embodiment of the present application provides a training method for an anti-money laundering model, and the method mainly includes:

101. Obtain a source domain sample set and a target domain sample set, where the source domain sample and the target domain sample are both transaction samples used for training an anti-money laundering model.

Money laundering behavior is often hidden in the transaction process in the financial field, so the data generated by transaction behavior in that process includes a large number of features related to money laundering behavior, and these features can be used as the training basis for anti-money laundering models. The source domain sample set and the target domain sample set are both datasets oriented to the financial field. The source domain samples in the source domain sample set and the target domain samples in the target domain sample set are both transaction samples for training anti-money laundering models. Each transaction sample has a corresponding binary label that indicates whether the transaction sample is money laundering behavior or legal behavior.

The transaction samples in the source domain sample set and those in the target domain sample set are determined in basically the same way. The difference is that the knowledge involved in the source domain sample set is existing knowledge, whereas the knowledge involved in the target domain sample set is new knowledge that needs to be learned. The process of determining transaction samples is described below and includes the following steps 1 and 2:

Step 1: Determine the transaction sample and define the label of the transaction sample.

Customers generate a large number of transaction records in the course of financial transactions, and these transaction records are the basis for determining transaction samples. To determine transaction samples, a time granularity is defined first, the transaction records generated by a customer at that time granularity are determined as candidate samples, and the transaction samples are then selected from the candidate samples. Because the transaction samples are used to train the anti-money laundering model, each one must be clearly identifiable as either money laundering behavior or legal behavior. Candidate samples that can be clearly identified in this way are selected as transaction samples, and candidate samples that cannot be clearly identified as either money laundering behavior or legal behavior are excluded.

Exemplarily, the time granularity is one day. At customer-day granularity, the dates on which a customer has transactions are selected from the financial institution's transaction records to form customer-day transaction records, and the transaction records generated by one customer in one day are determined as one candidate sample. Transaction samples are then screened from the candidate samples through the following operations. First, determine whether the customer-day transaction records contain records for which the date of the money laundering report differs greatly from the date of the money laundering activity. If so, the candidate samples corresponding to these records are excluded and are not selected as transaction samples. Here, the date of the money laundering report is the date of manual reporting, and the date of the money laundering activity is the date reported by the financial institution's anti-money laundering rule system. Second, if the anti-money laundering rule system of a financial institution such as a bank reports money laundering triggered by a certain customer, the candidate samples corresponding to that customer's transaction records on the reporting date and within the N days before the reporting date (N is greater than or equal to 1; exemplarily, N is 30) are each selected as transaction samples; these transaction samples are regarded as suspicious behavior and labeled label=1. Third, the candidate samples remaining after the above two operations are selected as transaction samples; these are regarded as legal behavior and labeled label=0.

It should be noted that the transaction samples in the source domain sample set and the target domain sample set come from different sources: the knowledge involved in the source domain sample set is existing knowledge, and the target domain sample set involves new knowledge that needs to be learned. Exemplarily, the source domain sample set consists of the transaction records generated by financial institution A in January, whose features are already known, and the target domain sample set consists of the transaction records generated by financial institution A in February, which include new knowledge that needs to be learned. To facilitate the accumulation and inheritance of knowledge, both sample sets are obtained so that, by means of transfer learning, the anti-money laundering model can learn both the existing knowledge involved in the source domain sample set and the new knowledge involved in the target domain sample set.
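The three screening operations can be sketched as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the record layout, the function name `label_customer_days`, and the `max_gap_days` cutoff for "a large difference" are hypothetical, and the sketch drops only the customer-day on the alert date when the two dates disagree, since the text does not say how far the exclusion extends.

```python
from datetime import date, timedelta


def label_customer_days(candidates, alerts, n_days=30, max_gap_days=90):
    """Label customer-day candidate samples: 1 = suspicious, 0 = legal.

    candidates: set of (customer_id, day) pairs, one per customer-day.
    alerts: dicts with "customer_id", "rule_date" (date reported by the rule
        system) and "manual_date" (date of the manual report).
    max_gap_days: hypothetical cutoff for "a large difference" between dates.
    """
    excluded, suspicious = set(), set()
    for alert in alerts:
        customer, rule_date = alert["customer_id"], alert["rule_date"]
        gap = abs((alert["manual_date"] - rule_date).days)
        if gap > max_gap_days:
            # Operation 1: drop the sample whose two dates disagree too much.
            excluded.add((customer, rule_date))
            continue
        # Operation 2: the reporting date and the N days before it are suspicious.
        for back in range(n_days + 1):
            suspicious.add((customer, rule_date - timedelta(days=back)))
    labels = {}
    for sample in candidates:
        if sample in excluded:
            continue
        # Operation 3: every remaining candidate is treated as legal behavior.
        labels[sample] = 1 if sample in suspicious else 0
    return labels


# Made-up example data for illustration only.
candidates = {
    ("123", date(2020, 1, 1)),
    ("123", date(2020, 1, 2)),
    ("124", date(2020, 1, 2)),
    ("125", date(2020, 1, 3)),
}
alerts = [
    {"customer_id": "123", "rule_date": date(2020, 1, 2), "manual_date": date(2020, 1, 5)},
    {"customer_id": "125", "rule_date": date(2020, 1, 3), "manual_date": date(2020, 12, 1)},
]
labels = label_customer_days(candidates, alerts)
```

In this example the alert for customer 125 is discarded because its two dates are far apart, so that customer-day does not appear in `labels` at all.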

Step 2: Perform feature splicing on transaction samples.

The features of transaction samples mainly include user features and user behavior features. User features mainly describe information about the user, such as age, gender, deposit balance, and number of family members. User behavior features mainly describe information related to the user's transaction behavior, such as the amount of the user's late-night transfers, the number of the user's ATM withdrawals, and the number of the user's counter deposits within a week.

Feature splicing is mainly used to enrich the features of transaction samples so that the anti-money laundering model can learn more useful anti-money laundering information. Feature splicing in practice means deriving new features from the existing features of a transaction sample. Exemplarily, if the transaction sample includes the number of the user's counter deposits in one week and the amount of each of those deposits, the feature "total amount of user counter deposits in one week" can be derived.
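The derivation in this example can be sketched in a few lines. The field names (`counter_deposit_count_1w`, `counter_deposit_amounts_1w`, `counter_deposit_total_1w`) and the values are hypothetical and chosen only for illustration.

```python
def derive_weekly_deposit_total(sample):
    """Return a copy of the sample with the derived weekly total added."""
    enriched = dict(sample)  # leave the original sample untouched
    enriched["counter_deposit_total_1w"] = sum(sample["counter_deposit_amounts_1w"])
    return enriched


# Made-up transaction sample: three counter deposits within one week.
sample = {
    "customer_id": "123",
    "counter_deposit_count_1w": 3,
    "counter_deposit_amounts_1w": [500.0, 1200.0, 300.0],
}
enriched = derive_weekly_deposit_total(sample)
```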

Exemplarily, Table 1 shows transaction samples after feature splicing.

Table 1

Customer ID	Transaction date	F1 (ATM withdrawals)	F2 (deposit amount / 10,000 yuan)	F3 (branch number)	Transaction behavior
123	2020.1.2	10000	0	203	suspicious
124	2020.1.2	20000	20000000	304	legitimate
125	2020.1.3	3000	33999	335	legitimate
123	2020.1.3	30	44888	445	legitimate
126	2020.1.3	100000	90189	515	legitimate
122	2020.1.4	20000	1000000	895	legitimate
128	2020.1.4	3000	55888	233	legitimate
124	2020.1.4	43	32	452	suspicious

After the source domain sample set and the target domain sample set are obtained, they can be stored in the database in multiple copies, either as daily partitions (slice tables) or as full tables (zipper tables).

102. Classify the features involved in the source domain sample set and the target domain sample set, and determine the common feature set of the source domain sample set and the target domain sample set, the unique feature set of the source domain sample set, and the unique feature set of the target domain sample set.

Classifying the features involved in the source domain sample set and the target domain sample set serves two main purposes. First, it checks whether the source domain sample set and the target domain sample set share some parameters of the anti-money laundering model, where the parameters include model parameters or model hyperparameters. If the two sample sets are found to share parameters, the anti-money laundering model can be trained on them by means of transfer learning. If they are found not to share parameters, transfer learning cannot be used to train the anti-money laundering model on these two sample sets, and business personnel are informed so that they can reselect the source domain sample set and the target domain sample set. It should be noted that checking whether the two sample sets share some parameters of the anti-money laundering model is essentially determining whether they have a common feature set. Second, once the shared parameters are determined, it identifies the common parameters and the unique parameters of the source domain sample set and the target domain sample set in their respective anti-money laundering tasks.

The following describes the process of classifying the features involved in the source domain sample set and the target domain sample set. The process specifically includes the following steps 1 and 2:

Step 1: Determine the stability index of each continuous feature involved in the source domain sample set and the target domain sample set.

Specifically, the following formula can be used to determine the stability index of each continuous feature involved in the source domain sample set and the target domain sample set. For any continuous feature, the stability index reflects how differently the feature is distributed in the source domain sample set and the target domain sample set. Based on this difference, it can be determined whether the continuous feature is a common feature of the two sample sets or a unique feature of one sample set.

The formula for determining the stability index of each continuous feature involved in the source domain sample set and the target domain sample set is:

$$\mathrm{PSI}(Y_e, Y; B)_j = \sum_{i=1}^{B} \left( y_{ij} - y_{eij} \right) \ln \frac{y_{ij}}{y_{eij}}$$

Here, PSI(Y_e, Y; B)_j represents the stability index of the j-th continuous feature among the continuous features involved in the source domain sample set and the target domain sample set; Y_e represents the expected distribution, which is the full data of the target domain sample set; Y represents the actual distribution, which is the full data of the source domain sample set; B represents the preset number of buckets; y_ij represents the proportion of the j-th continuous feature in the i-th bucket of the source domain sample set; and y_eij represents the proportion of the j-th continuous feature in the i-th bucket of the target domain sample set.

Specifically, the number of buckets may be determined based on business requirements and is not specifically limited in this embodiment. It should be noted that if the number of buckets is too large, each bucket may contain too few samples to be statistically meaningful, and if the number of buckets is too small, the calculated result becomes less accurate. The total number of samples in the source domain sample set and the target domain sample set should therefore be taken into account when choosing the number of buckets. When bucketing, buckets of equal size can be used. Exemplarily, the number of buckets is 15.

Specifically, the smaller the stability index of a continuous feature, the smaller the difference in that feature between the two sample sets, and the feature is a common feature of the two sample sets. The larger the stability index, the greater the difference in that feature between the two sample sets, and the feature is a unique feature of the corresponding sample set.
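A minimal sketch of the stability index computation, using the symbol definitions above and the standard population stability index form, might look like the following. Several details are assumptions not stated in the text: bucket edges are taken as equal-frequency cut points of the target (expected) data, empty buckets are smoothed with a small `eps` to avoid division by zero, and the function names are hypothetical.

```python
import math


def equal_frequency_edges(values, n_buckets):
    """Cut points that split the sorted values into roughly equal-count buckets."""
    ordered = sorted(values)
    return [ordered[(len(ordered) * k) // n_buckets] for k in range(1, n_buckets)]


def bucket_shares(values, edges):
    """Proportion of values falling into each bucket defined by the edges."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        index = sum(1 for e in edges if v > e)  # number of edges below v
        counts[index] += 1
    return [c / len(values) for c in counts]


def psi(source_values, target_values, n_buckets=15, eps=1e-6):
    """Stability index of one continuous feature between the two sample sets."""
    edges = equal_frequency_edges(target_values, n_buckets)  # expected = target
    y = bucket_shares(source_values, edges)      # actual distribution (source)
    y_e = bucket_shares(target_values, edges)    # expected distribution (target)
    total = 0.0
    for y_i, y_ei in zip(y, y_e):
        y_i, y_ei = max(y_i, eps), max(y_ei, eps)  # smoothing is an assumption
        total += (y_i - y_ei) * math.log(y_i / y_ei)
    return total


# Made-up feature values: identical data versus a shifted distribution.
target = [float(v + 50) for v in range(100)]
same_psi = psi(target, target, n_buckets=10)
shifted_psi = psi([float(v) for v in range(100)], target, n_buckets=10)
```

Each term of the sum is non-negative, so the index is zero only when the two bucketed distributions coincide, and it grows as the distributions drift apart.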

Step 2: Classify each of the continuous features based on the size of the stability index of each of the continuous features.

For any continuous feature, its corresponding stability index can reflect the difference in the distribution of the source domain sample set and the target domain sample set, so the continuous features can be classified based on the size of the stability index of the continuous feature.

The following describes the process of classifying each continuous feature based on the size of the stability index of each continuous feature. The process specifically includes the following three steps:

One is to classify the continuous features whose stability index is less than a first threshold into a common feature set of the source domain sample set and the target domain sample set.

A continuous feature whose stability index is less than the first threshold differs little between the two sample sets and is a common feature of both. Such continuous features are therefore classified into the common feature set of the source domain sample set and the target domain sample set.

Specifically, the size of the first threshold may be determined based on service requirements, which is not specifically limited in this embodiment. Optionally, the first threshold is 0.25, that is, all continuous features whose stability index is less than 0.25 are classified into a common feature set.

Exemplarily, Table 2 shows, after the stability index has been calculated, which continuous features are common features of the two sample sets and which are non-common features. For each non-common feature, it is further necessary to determine which sample set it is a unique feature of.

Table 2

Feature	PSI value	Feature classification
F1 (ATM withdrawals)	0.23	shared
F2 (deposit amount / 10,000 yuan)	0.25	non-shared
F3 (number of night transactions)	0.001	shared
F4 (amount received at night)	0.004	shared
F5 (1-day debit transaction amount)	0.3	non-shared
F6 (3-day total transaction amount)	0.123	shared
F7 (total number of transactions in 3 days)	0.03	shared
F8 (10-day loan amount ratio)	0.02	shared

The second is to classify the continuous features involved in the source domain sample set whose stability index is not less than the first threshold into the unique feature set of the source domain sample set.

A continuous feature involved in the source domain sample set whose stability index is not less than the first threshold differs greatly between the two sample sets. It is a unique feature of the source domain sample set and is therefore classified into the unique feature set of the source domain sample set.

Exemplarily, the first threshold is 0.25, and all continuous features involved in the source domain sample set whose stability index is greater than or equal to 0.25 are classified into the unique feature set of the source domain sample set.

The third is to classify the continuous features involved in the target domain sample set whose stability index is not less than the first threshold into the unique feature set of the target domain sample set.

A continuous feature involved in the target domain sample set whose stability index is not less than the first threshold differs greatly between the two sample sets. It is a unique feature of the target domain sample set and is therefore classified into the unique feature set of the target domain sample set.

Exemplarily, the first threshold is 0.25, and all continuous features involved in the target domain sample set whose stability index is greater than or equal to 0.25 are classified into the unique feature set of the target domain sample set.

Further, the features involved in the source domain sample set and the target domain sample set include discrete features as well as continuous features. In addition to steps 1 and 2 above, the classification process therefore also includes the following: classifying the discrete features involved in the source domain sample set into the unique feature set of the source domain sample set, and classifying the discrete features involved in the target domain sample set into the unique feature set of the target domain sample set. The discrete features involved in the two sample sets are basically user features, which are vertically isolated and unique to their respective sample sets, so the discrete features involved in each sample set are simply classified directly into the corresponding unique feature set.
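The classification rules above, including the handling of discrete features, can be sketched as follows. The function name and argument layout are hypothetical; the PSI values are those listed in Table 2, while the assignment of features to domains and the discrete feature `branch_number` are made-up inputs for illustration. Raising an error when no common feature exists is one possible way to signal that the sample sets should be reselected.

```python
def classify_features(psi_by_feature, source_continuous, target_continuous,
                      source_discrete, target_discrete, threshold=0.25):
    """Split features into common, source-unique and target-unique sets."""
    common = set()
    # Discrete features go straight to the unique set of their own sample set.
    source_unique, target_unique = set(source_discrete), set(target_discrete)
    for name, value in psi_by_feature.items():
        if value < threshold:
            common.add(name)
            continue
        # Not less than the threshold: unique to whichever sample set involves it.
        if name in source_continuous:
            source_unique.add(name)
        if name in target_continuous:
            target_unique.add(name)
    if not common:
        raise ValueError("no common features: reselect the source and target sample sets")
    return common, source_unique, target_unique


# PSI values from Table 2; every continuous feature is assumed to occur in both sets.
psi_by_feature = {"F1": 0.23, "F2": 0.25, "F3": 0.001, "F4": 0.004,
                  "F5": 0.3, "F6": 0.123, "F7": 0.03, "F8": 0.02}
continuous = set(psi_by_feature)
common, source_unique, target_unique = classify_features(
    psi_by_feature, continuous, continuous,
    source_discrete={"branch_number"}, target_discrete={"branch_number"},
)
```

Note that F2, whose PSI equals the threshold of 0.25 exactly, is treated as non-shared, which matches its classification in Table 2.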

103. Uniformly encode the features in the source domain sample set and the features in the target domain sample set into the feature space corresponding to the union of the common feature set of the source domain sample set and the target domain sample set, the unique feature set of the source domain sample set, and the unique feature set of the target domain sample set.

For the anti-money laundering model to learn both the features in the source domain sample set and the features in the target domain sample set, the features in both sample sets must be uniformly encoded into the feature space corresponding to the union of the common feature set, the unique feature set of the source domain sample set, and the unique feature set of the target domain sample set. The anti-money laundering model can then learn the existing knowledge in the source domain sample set and the new knowledge in the target domain sample set at the same time, which accumulates and consolidates existing knowledge while acquiring new knowledge and thereby improves the anti-money laundering identification effect of the model.

The data required by the anti-money laundering model is numeric, because calculations can only be performed on numeric types. The various features therefore need to be encoded accordingly, which is also a quantization process. During encoding, a preset encoding mechanism uniformly encodes the features in the source domain sample set and the features in the target domain sample set into the feature space corresponding to the union of the common feature set, the unique feature set of the source domain sample set, and the unique feature set of the target domain sample set. The encoding mechanism may be determined according to business requirements and is not specifically limited in this embodiment. Optionally, the encoding mechanism can be one-hot encoding.

Specifically, when encoding the common features of the source domain sample set and the target domain sample set, such as transaction behavior and customer demographic attributes, the samples of the two sample sets can be encoded as unified features. That is, no separate encoding is performed per sample set: the samples are merged directly and enter the feature extraction operator together.

Specifically, when encoding the unique features of the two sample sets that take the form of discrete features, the discrete features of the two sample sets, such as the branch to which the customer belongs and the ATM number used in the transaction, are vertically isolated from each other, so a feature is left blank for any sample in which it has no value.

Specifically, when encoding the unique features of the two sample sets that take the form of continuous features, separate positions in the feature space are allocated to the unique features of the source domain sample set and the unique features of the target domain sample set: a unique feature of the source domain sample set occupies one position, and a unique feature of the target domain sample set occupies another.

Exemplarily, Table 3 shows an example of the data formed after feature encoding.

Table 3

After the above feature encoding, a feature space is obtained that includes the features in the common feature set of the source domain sample set and the target domain sample set, the features in the unique feature set of the source domain sample set, and the features in the unique feature set of the target domain sample set. This feature space provides the data basis for the subsequent training of the anti-money laundering model.
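The encoding into the union feature space can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the implementation of the application: the function `encode_to_union_space`, the feature names (`txn_count`, `avg_amount`, `branch`, `atm_id`), and the use of `None` for a blank slot are all hypothetical. Values are carried over unchanged here; discrete values would still need a numeric encoding such as one-hot afterwards.

```python
def encode_to_union_space(samples, domain, common, source_unique, target_unique):
    """Encode one domain's samples as rows over the union feature space.

    Columns are ordered as: common features, source-only slots, target-only
    slots. A slot that does not belong to the sample's domain is left blank.
    """
    columns = (
        [("common", f) for f in common]
        + [("source", f) for f in source_unique]
        + [("target", f) for f in target_unique]
    )
    rows = []
    for sample in samples:
        row = []
        for owner, feature in columns:
            if owner in ("common", domain):
                row.append(sample.get(feature))
            else:
                row.append(None)  # blank: feature has no value in this domain
        rows.append(row)
    return columns, rows


# Hypothetical feature sets: avg_amount is unique in both domains, so it
# receives one position per domain.
common = ["txn_count"]
source_unique = ["avg_amount", "branch"]
target_unique = ["avg_amount", "atm_id"]

source = [{"txn_count": 3, "avg_amount": 120.0, "branch": "B1"}]
target = [{"txn_count": 5, "avg_amount": 80.0, "atm_id": "A7"}]

columns, src_rows = encode_to_union_space(
    source, "source", common, source_unique, target_unique)
_, tgt_rows = encode_to_union_space(
    target, "target", common, source_unique, target_unique)
```

Because both domains share the same column layout, their rows can later be concatenated without further alignment.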

104. Combine the uniformly encoded source domain sample set and target domain sample set.

As noted above, the anti-money laundering model requires numeric data. Once the various features have been encoded, feature quantization is complete, and the uniformly encoded source domain sample set and target domain sample set can be combined to form the training data for the anti-money laundering model.

105. Train an anti-money laundering model based on the combined sample set.

The anti-money laundering model is used to identify money laundering activity in the data generated during financial transactions, that is, to determine whether the data represents money laundering or legal behavior, so the anti-money laundering model is a binary classification model. In practical applications, the specific type of the anti-money laundering model may be determined based on business requirements and is not specifically limited in this embodiment. Optionally, the anti-money laundering model is GBDT (Gradient Boosting Decision Tree) or LR (Logistic Regression).
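To illustrate the LR option, the following is a minimal pure-Python logistic regression trained by stochastic gradient descent on rows from a combined sample set. It is a sketch under assumptions: the function names, the toy data, the learning rate, and the choice to fill blank slots with 0.0 are all hypothetical, and label 1 is used here to mark a suspicious transaction purely for illustration.

```python
import math


def fill_blanks(rows, value=0.0):
    """Replace blank slots (None) with a constant (assumed imputation)."""
    return [[value if v is None else v for v in row] for row in rows]


def train_logistic_regression(rows, labels, lr=0.5, epochs=500):
    """Fit weights and bias with per-sample gradient descent on log loss."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            z = bias + sum(w * v for w, v in zip(weights, x))
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - y  # gradient of the log loss with respect to z
            weights = [w - lr * err * v for w, v in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def predict_proba(weights, bias, x):
    """Return the predicted probability of the label-1 class."""
    z = bias + sum(w * v for w, v in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


# Toy combined rows with blanks left by the unified encoding.
rows = fill_blanks([[0.1, None], [0.2, 0.1], [0.9, 0.8], [0.8, None]])
labels = [0, 0, 1, 1]  # 1 = suspicious (illustrative convention)
weights, bias = train_logistic_regression(rows, labels)
```

A tree-based model such as GBDT could be substituted for the classifier; the data preparation would stay the same.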

How the anti-money laundering model is trained based on the combined sample set depends on which samples are input into the model for training, and includes at least the following two approaches:

The first is to input all the samples in the combined sample set into the anti-money laundering model for training.

Because all the data in the sample set is used, the features input into the model are rich. The anti-money laundering model can therefore learn both the existing knowledge in the source domain sample set and the new knowledge in the target domain sample set, accumulating and retaining existing knowledge while acquiring new knowledge, which can improve the anti-money laundering recognition effect of the model.

The second is to extract a set number of samples from the combined sample set and input the extracted samples into the anti-money laundering model for training.

Specifically, the features involved in the extracted samples include the common feature set of the source domain sample set and the target domain sample set, the unique feature set of the source domain sample set, and the unique feature set of the target domain sample set. Because only a set number of samples is extracted, the anti-money laundering model can be trained with less computing power while still learning both the existing knowledge in the source domain sample set and the new knowledge in the target domain sample set. Existing knowledge is thus accumulated and retained while new knowledge is acquired, which can improve the anti-money laundering recognition effect of the model.
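The second approach can be sketched as a simple random draw from the combined sample set. The function `extract_samples`, the fixed seed, and the `domain` tag on each sample are hypothetical choices for illustration; the application does not specify how the set number of samples is chosen.

```python
import random


def extract_samples(combined, k, seed=0):
    """Draw k samples without replacement from the combined sample set."""
    if k > len(combined):
        raise ValueError("k exceeds the size of the combined sample set")
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return rng.sample(combined, k)


# Toy combined set: six source-domain and four target-domain samples.
combined = (
    [{"domain": "source", "id": i} for i in range(6)]
    + [{"domain": "target", "id": i} for i in range(4)]
)
subset = extract_samples(combined, k=5)
```

In practice one may also want to confirm that the extracted subset still contains samples from both domains, so that all three feature sets are represented.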

Exemplarily, Table 4 shows samples selected from the combined sample set for training the anti-money laundering model.

Table 4

In the anti-money laundering model training method provided by the embodiment of the present application, a source domain sample set and a target domain sample set are first obtained; the features involved in the source domain sample set and the target domain sample set are classified; and the common feature set of the source domain sample set and the target domain sample set, the unique feature set of the source domain sample set, and the unique feature set of the target domain sample set are determined. The features in the source domain sample set and the features in the target domain sample set are uniformly encoded into the feature space corresponding to the union of these three feature sets. The uniformly encoded source domain sample set and target domain sample set are then combined, and an anti-money laundering model is trained based on the combined sample set. It can be seen that the solution provided by the embodiment of the present application completes the training task of the anti-money laundering model for the target domain sample set by introducing the features of the source domain sample set, so that the model can learn both the existing knowledge in the source domain sample set and the new knowledge in the target domain sample set. Existing knowledge is accumulated while new knowledge is learned, which can improve the anti-money laundering recognition effect of the model.

Further, according to the method shown in FIG. 1, another embodiment of the present application also provides a training method for an anti-money laundering model. As shown in FIG. 2, the method mainly includes:

201. Obtain a source domain sample set and a target domain sample set, where the source domain sample and the target domain sample are both transaction samples used for training an anti-money laundering model.

202. Determine whether there are discrete features of a preset category among the features involved in the source domain sample set; if yes, go to 203; if not, go to 204.

If the source domain sample set includes discrete features of preset categories, such as static customer attributes, the IP used in transactions, transaction regions, and counterparty accounts, the efficiency of anti-money laundering model training is affected. Moreover, in the anti-money laundering scenario, the distributions of these discrete features in the source domain sample set and the target domain sample set differ greatly. If these discrete features in the source domain sample set are applied directly to the target domain sample set, they become invalid: the trained anti-money laundering model cannot learn them, resulting in a poor anti-money laundering effect. Therefore, so that the anti-money laundering model can learn the discrete features of the preset categories in the source domain sample set, it is necessary to determine whether such discrete features exist among the features involved in the source domain sample set.

If it is determined that there are discrete features of the preset category among the features involved in the source domain sample set, step 203 is executed to convert the discrete features of the preset category into continuous features, so as to ensure that these discrete features can be learned by the anti-money laundering model.

If it is determined that there are no discrete features of the preset category among the features involved in the source domain sample set, no feature conversion is required, the features in the source domain sample set can be learned by the anti-money laundering model, and step 204 can be executed.

203. Convert the discrete features of the preset category into continuous features.

In order to carry the discrete features of the preset category in the source domain sample set over to the target domain sample set, a discrete-to-continuous transformation is performed on them. The process of converting the discrete features of the preset category into continuous features includes the following steps 1 and 2:

Step 1: Count the sample conditions associated with each discrete feature of the preset category in the source domain sample set.

Counting the sample conditions associated with each discrete feature of the preset category in the source domain sample set serves two main purposes. The first is to determine through what kind of relationship a suspicious risk is propagated, and to whom it is propagated. The second is to determine how close a given relationship is, and how large the risk propagated to an individual through that relationship is.

The specific process of counting the sample conditions associated with each discrete feature of the preset category in the source domain sample set is as follows. For each discrete feature of the preset category, count the samples related to the discrete feature within a preset time period, and determine the counted result as the sample condition associated with the discrete feature.

Specifically, the sample condition includes at least one of the following: the number of positive transaction samples, the number of negative transaction samples, the proportion of negative transaction samples in the source domain sample set, the proportion of positive transaction samples in the source domain sample set, and the proportion of the number of transactions of any individual in the total number of transactions of that individual in the source domain sample set. In the source domain sample set, samples whose transaction type is legal behavior are positive transaction samples, and samples whose transaction type is suspicious behavior are negative transaction samples.

Exemplarily, within a period of time, the number of negative samples associated with the discrete feature X in the source domain sample set, or the proportion of such negative samples in the total samples, is counted, and this number or proportion is determined as the sample condition of the discrete feature X.

Step 2: Determine the sample condition of each discrete feature of the preset category as the continuous feature corresponding to that discrete feature.
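Steps 1 and 2 can be sketched in Python as follows, using the proportion of negative (suspicious) samples per discrete value as the sample condition. The function names, the `counterparty` feature, and the `is_suspicious` label key are hypothetical; filtering to a preset time period is omitted for brevity and would be applied to `samples` beforehand.

```python
from collections import defaultdict


def negative_ratio_by_value(samples, feature, label_key="is_suspicious"):
    """Step 1: per discrete value, the share of associated negative samples."""
    totals = defaultdict(int)
    negatives = defaultdict(int)
    for s in samples:
        value = s[feature]
        totals[value] += 1
        if s[label_key]:
            negatives[value] += 1
    return {v: negatives[v] / totals[v] for v in totals}


def replace_with_ratio(samples, feature, ratios):
    """Step 2: use the counted ratio as the continuous feature value."""
    return [{**s, feature: ratios[s[feature]]} for s in samples]


# Hypothetical source-domain samples keyed by counterparty account.
samples = [
    {"counterparty": "C1", "is_suspicious": True},
    {"counterparty": "C1", "is_suspicious": False},
    {"counterparty": "C2", "is_suspicious": False},
]
ratios = negative_ratio_by_value(samples, "counterparty")
converted = replace_with_ratio(samples, "counterparty", ratios)
```

The same pattern applies to the other listed sample conditions, such as raw counts of positive or negative samples.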

204. Classify the features involved in the source domain sample set and the target domain sample set, and determine the common feature set of the source domain sample set and the target domain sample set, the unique feature set of the source domain sample set, and the unique feature set of the target domain sample set.

205. Uniformly encode the features in the source domain sample set and the features in the target domain sample set into the feature space corresponding to the union of three sets: the common feature set of the source domain sample set and the target domain sample set, the unique feature set of the source domain sample set, and the unique feature set of the target domain sample set.

206. Merge the uniformly coded source domain sample set and the target domain sample set.

207. Train an anti-money laundering model based on the combined sample set.

Further, according to the above method embodiment, another embodiment of the present application further provides a training device for an anti-money laundering model. As shown in FIG. 3, the device includes:

The obtaining unit 31 is configured to obtain a source domain sample set and a target domain sample set, wherein the source domain sample and the target domain sample are transaction samples used for training an anti-money laundering model;

The classification unit 32 is configured to classify the features involved in the source domain sample set and the target domain sample set, and determine the common feature set of the source domain sample set and the target domain sample set, the unique feature set of the source domain sample set, and the unique feature set of the target domain sample set;

The encoding unit 33 is configured to uniformly encode the features in the source domain sample set and the features in the target domain sample set into the feature space corresponding to the union of the common feature set of the source domain sample set and the target domain sample set, the unique feature set of the source domain sample set, and the unique feature set of the target domain sample set;

The merging unit 34 is configured to merge the uniformly encoded source domain sample set and the target domain sample set;

The training unit 35 is configured to train an anti-money laundering model based on the combined sample set.

The apparatus for training an anti-money laundering model provided by the embodiment of the present application first obtains a source domain sample set and a target domain sample set, classifies the features involved in the source domain sample set and the target domain sample set, and determines the common feature set of the source domain sample set and the target domain sample set, the unique feature set of the source domain sample set, and the unique feature set of the target domain sample set. The features in the source domain sample set and the features in the target domain sample set are uniformly encoded into the feature space corresponding to the union of these three feature sets. The uniformly encoded source domain sample set and target domain sample set are then combined, and an anti-money laundering model is trained based on the combined sample set. It can be seen that the solution provided by the embodiment of the present application completes the training task of the anti-money laundering model for the target domain sample set by introducing the features of the source domain sample set, so that the model can learn both the existing knowledge in the source domain sample set and the new knowledge in the target domain sample set. Existing knowledge is accumulated while new knowledge is learned, which can improve the anti-money laundering recognition effect of the model.

Optionally, as shown in Figure 4, the classification unit 32 includes:

A determination module 321, configured to determine the stability index of each continuous feature involved in the source domain sample set and the target domain sample set;

The first classification module 322 is configured to classify each of the continuous features based on the size of the stability index of each of the continuous features.

Optionally, as shown in FIG. 4 , the determining module 321 is configured to determine the stability index of each continuous feature involved in the source domain sample set and the target domain sample set through the following formula;

The formula is:

Optionally, as shown in FIG. 4, the first classification module 322 is configured to classify the continuous features whose stability index is less than the first threshold into the common feature set of the source domain sample set and the target domain sample set; classify the continuous features involved in the source domain sample set whose stability index is not less than the first threshold into the unique feature set of the source domain sample set; and classify the continuous features involved in the target domain sample set whose stability index is not less than the first threshold into the unique feature set of the target domain sample set.
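The threshold rule can be sketched in Python as follows. The formula for the stability index is not reproduced in this text, so the sketch assumes a population stability index (PSI) computed over shared bins; this is an assumption, not the formula of the application. The function names, bin edges, the threshold of 0.25, and the smoothing constant `eps` are likewise hypothetical. A feature at or above the threshold is placed in both unique feature sets, on the reading that it then occupies one position per domain.

```python
import math


def stability_index(source_values, target_values, edges, eps=1e-6):
    """PSI-style index over shared bins (assumed form of the stability index)."""
    def proportions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[sum(1 for e in edges if v >= e)] += 1
        # eps avoids log(0) for empty bins
        return [max(c / len(values), eps) for c in counts]

    p = proportions(source_values)
    q = proportions(target_values)
    return sum((a - b) * math.log(a / b) for a, b in zip(p, q))


def classify_continuous(features, source, target, edges, threshold):
    """Split continuous features into common and domain-unique sets."""
    common, source_unique, target_unique = [], [], []
    for f in features:
        if stability_index(source[f], target[f], edges[f]) < threshold:
            common.append(f)
        else:
            source_unique.append(f)
            target_unique.append(f)
    return common, source_unique, target_unique


# Toy data: txn_count is distributed identically, avg_amount is shifted.
source = {"txn_count": [1, 2, 3, 4], "avg_amount": [10, 20, 30, 40]}
target = {"txn_count": [1, 2, 3, 4], "avg_amount": [100, 200, 300, 400]}
edges = {"txn_count": [2.5], "avg_amount": [50]}

common, source_unique, target_unique = classify_continuous(
    ["txn_count", "avg_amount"], source, target, edges, threshold=0.25)
```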

Optionally, as shown in Figure 4, the classification unit 32 includes:

The second classification module 323 is configured to classify the discrete features involved in the source domain sample set into the unique feature set of the source domain sample set, and classify the discrete features involved in the target domain sample set into the unique feature set of the target domain sample set.

Optionally, as shown in Figure 4, the device further includes:

The judging unit 36 is configured to, before the classification unit 32 classifies the features involved in the source domain sample set and the target domain sample set, determine whether there are discrete features of a preset category among the features involved in the source domain sample set, and if so, trigger the conversion unit 37;

The conversion unit 37 is configured to convert the discrete features of the preset category into continuous features when triggered by the judging unit 36.

Optionally, as shown in FIG. 4, the conversion unit 37 is configured to count the sample conditions associated with each discrete feature of the preset category in the source domain sample set, and to determine the sample condition of each discrete feature of the preset category as the continuous feature corresponding to that discrete feature.

Optionally, as shown in FIG. 4, the sample condition includes at least one of the following: the number of positive transaction samples, the number of negative transaction samples, the proportion of negative transaction samples in the source domain sample set, the proportion of positive transaction samples in the source domain sample set, and the proportion of the number of transactions of any individual in the total number of transactions of that individual in the source domain sample set. In the source domain sample set, samples whose transaction type is legal behavior are positive transaction samples, and samples whose transaction type is suspicious behavior are negative transaction samples.

For a detailed explanation of the methods used in the operation of each functional module of the training device of the anti-money laundering model provided by the embodiment of the present application, refer to the detailed explanation of the corresponding methods in the method embodiments of FIG. 1 and FIG. 2, which will not be repeated here.

Further, according to the above embodiment, another embodiment of the present application further provides a computer-readable storage medium. The storage medium includes a stored program, wherein, when the program runs, the device where the storage medium is located is controlled to execute the training method of the anti-money laundering model described in FIG. 1 or FIG. 2.

Further, according to the above embodiment, another embodiment of the present application further provides a storage management device, wherein the storage management device includes:

A memory, configured to store a program;

A processor, coupled to the memory and configured to run the program to perform the training method of the anti-money laundering model described in FIG. 1 or FIG. 2.

This application discloses the following:

A1. A training method for an anti-money laundering model, comprising:

Obtain the source domain sample set and the target domain sample set, wherein the source domain sample and the target domain sample are both transaction samples used to train the anti-money laundering model;

Classify the features involved in the source domain sample set and the target domain sample set, and determine the common feature set of the source domain sample set and the target domain sample set, the unique feature set of the source domain sample set, and the unique feature set of the target domain sample set;

Uniformly encode the features in the source domain sample set and the features in the target domain sample set into the feature space corresponding to the union of the common feature set of the source domain sample set and the target domain sample set, the unique feature set of the source domain sample set, and the unique feature set of the target domain sample set;

merging the uniformly encoded source domain sample set and the target domain sample set;

Train an anti-money laundering model based on the combined sample set.

A2. The method according to A1, wherein classifying the features involved in the source domain sample set and the target domain sample set includes:

determining the stability index of each continuous feature involved in the source domain sample set and the target domain sample set;

Each of the continuous features is classified based on the magnitude of the stability index of each of the continuous features.

A3. The method according to A2, wherein determining the stability index of each continuous feature involved in the source domain sample set and the target domain sample set includes:

Determine the stability index of each continuous feature involved in the source domain sample set and the target domain sample set by the following formula;

The formula is:

A4. The method according to A2, wherein classifying each of the continuous features based on the size of the stability index of each of the continuous features includes:

classifying the continuous features whose stability index is less than the first threshold into a common feature set of the source domain sample set and the target domain sample set;

classifying the continuous features involved in the source domain sample set whose stability index is not less than the first threshold into a unique feature set of the source domain sample set;

Classifying the continuous features involved in the target domain sample set whose stability index is not less than the first threshold into a unique feature set of the target domain sample set.

A5. The method according to A1 or A2, wherein classifying the features involved in the source domain sample set and the target domain sample set includes:

classifying the discrete features involved in the source domain sample set into a unique feature set of the source domain sample set;

The discrete features involved in the target domain sample set are classified into a unique feature set of the target domain sample set.

A6. The method according to A1, before classifying the features involved in the source domain sample set and the target domain sample set, the method further includes:

Judging whether there are discrete features of a preset category in the features involved in the source domain sample set;

If it exists, convert the discrete features of the preset category into continuous features.

A7. The method according to A6, converting the discrete features of the preset category into continuous features, including:

Counting the sample conditions associated with the discrete features of each of the preset categories in the source domain sample set;

The sample conditions of the discrete features of each preset category are determined as continuous features corresponding to the discrete features of each preset category.

A8. The method according to A7, wherein the sample conditions include at least one of the following: the number of positive transaction samples, the number of negative transaction samples, the proportion of negative transaction samples in the source domain sample set, the proportion of positive transaction samples in the source domain sample set, and the proportion of the number of transactions of any individual in the total number of transactions of that individual in the source domain sample set; wherein, in the source domain sample set, samples whose transaction type is legal behavior are positive transaction samples, and samples whose transaction type is suspicious behavior are negative transaction samples.

B1. An anti-money laundering model training device, comprising:

an obtaining unit, configured to obtain a source domain sample set and a target domain sample set, wherein the source domain sample and the target domain sample are transaction samples used for training an anti-money laundering model;

a classification unit, configured to classify the features involved in the source domain sample set and the target domain sample set, and determine the common feature set of the source domain sample set and the target domain sample set, the unique feature set of the source domain sample set, and the unique feature set of the target domain sample set;

an encoding unit, configured to uniformly encode the features in the source domain sample set and the features in the target domain sample set into the feature space corresponding to the union of the common feature set of the source domain sample set and the target domain sample set, the unique feature set of the source domain sample set, and the unique feature set of the target domain sample set;

a merging unit, configured to merge the uniformly encoded source domain sample set and the target domain sample set;

A training unit configured to train an anti-money laundering model based on the combined sample set.

B2. The apparatus according to B1, the classification unit comprises:

a determination module, configured to determine the stability index of each continuous feature involved in the source domain sample set and the target domain sample set;

The first classification module is configured to classify each of the continuous features based on the size of the stability index of each of the continuous features.

B3. The apparatus according to B2, wherein the determining module is configured to determine the stability index of each continuous feature involved in the source domain sample set and the target domain sample set by the following formula;

The formula is:

B4. The apparatus according to B2, wherein the first classification module is configured to classify the continuous features whose stability index is less than a first threshold into the common feature set of the source domain sample set and the target domain sample set; classify the continuous features involved in the source domain sample set whose stability index is not less than the first threshold into the unique feature set of the source domain sample set; and classify the continuous features involved in the target domain sample set whose stability index is not less than the first threshold into the unique feature set of the target domain sample set.

B5. The apparatus according to B1 or B2, wherein the classification unit comprises:

a second classification module, configured to classify the discrete features involved in the source domain sample set into the unique feature set of the source domain sample set, and classify the discrete features involved in the target domain sample set into the unique feature set of the target domain sample set.

B6. The apparatus according to B1, further comprising:

a judgment unit, configured to judge, before the classification unit classifies the features involved in the source domain sample set and the target domain sample set, whether discrete features of a preset category exist among the features involved in the source domain sample set, and if so, trigger the conversion unit;

a conversion unit, configured to convert the discrete features of the preset category into continuous features when triggered by the judgment unit.

B7. The apparatus according to B6, wherein the conversion unit is configured to count the sample conditions associated with each discrete feature of the preset category in the source domain sample set, and to determine the sample condition of each discrete feature of the preset category as the continuous feature corresponding to that discrete feature.

B8. The apparatus according to B7, wherein the sample conditions include at least one of the following: the number of positive transaction samples, the number of negative transaction samples, the proportion of negative transaction samples in the source domain sample set, the proportion of positive transaction samples in the source domain sample set, and the proportion of the number of transactions of any individual in the total number of transactions of that individual in the source domain sample set; wherein, in the source domain sample set, samples whose transaction type is legal behavior are positive transaction samples, and samples whose transaction type is suspicious behavior are negative transaction samples.

C1. A computer-readable storage medium, the storage medium comprising a stored program, wherein, when the program is run, a device where the storage medium is located is controlled to perform the training method of the anti-money laundering model according to any one of A1 to A8.

D1. A storage management device, the storage management device comprising:

a memory, configured to store a program;

a processor, coupled to the memory and configured to run the program to perform the training method of the anti-money laundering model according to any one of A1 to A8.

In the above-mentioned embodiments, the description of each embodiment has its own emphasis. For parts that are not described in detail in a certain embodiment, reference may be made to the relevant descriptions of other embodiments.

It can be understood that the relevant features in the above-mentioned methods and apparatuses may refer to each other. In addition, "first", "second", etc. in the above-mentioned embodiments are used to distinguish each embodiment, and do not represent the advantages and disadvantages of each embodiment.

Those skilled in the art can clearly understand that, for the convenience and brevity of description, the specific working process of the system, device and unit described above may refer to the corresponding process in the foregoing method embodiments, which will not be repeated here.

The algorithms and displays provided herein are not inherently related to any particular computer, virtual system, or other device. Various general-purpose systems may also be used with the teachings herein. The structure required to construct such a system is apparent from the above description. Furthermore, this application is not directed to any particular programming language. It should be understood that the content of the application described herein can be implemented using a variety of programming languages, and that the above descriptions of specific languages are intended to disclose the best mode of the application.

In the description provided herein, numerous specific details are set forth. It will be understood, however, that the embodiments of the present application may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.

Similarly, it will be appreciated that in the above description of example embodiments of the application, various features of the application are sometimes grouped together into a single embodiment, figure, or description thereof. This disclosure, however, should not be interpreted as reflecting an intention that the claimed application requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single foregoing disclosed embodiment. Thus, the claims following the Detailed Description are hereby expressly incorporated into this Detailed Description, with each claim standing on its own as a separate embodiment of this application.

Those skilled in the art will understand that the modules in the device of an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. The modules or units or components in the embodiments may be combined into one module or unit or component, and may further be divided into multiple sub-modules or sub-units or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Each feature disclosed in this specification (including the accompanying claims, abstract and drawings) may be replaced by an alternative feature serving the same, equivalent or similar purpose, unless expressly stated otherwise.

Furthermore, those skilled in the art will appreciate that although some of the embodiments described herein include certain features included in other embodiments but not others, combinations of features of different embodiments are intended to be within the scope of the present application and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.

Various component embodiments of the present application may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art should understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components in the method, apparatus, and framework for running the deep neural network model according to the embodiments of the present application. The present application can also be implemented as a device or device program (e.g., a computer program and a computer program product) for performing part or all of the methods described herein. Such a program implementing the present application may be stored on a computer-readable medium, or may be in the form of one or more signals. Such signals may be downloaded from Internet sites, provided on carrier signals, or provided in any other form.

It should be noted that the above-described embodiments illustrate rather than limit the application, and alternative embodiments may be devised by those skilled in the art without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The application can be implemented by means of hardware comprising several different elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. does not denote any order. These words can be interpreted as names.

Industrial Applicability

The solution provided in this application accomplishes the task of training an anti-money laundering model for the target domain sample set by introducing the features of the source domain sample set. The anti-money laundering model can thereby learn the existing knowledge in the source domain sample set and the new knowledge in the target domain sample set at the same time, accumulating and consolidating existing knowledge while acquiring new knowledge, which can improve the anti-money laundering identification performance of the model.

Claims

An anti-money laundering model training method, comprising:

obtaining a source domain sample set and a target domain sample set, wherein the source domain samples and the target domain samples are both transaction samples used to train an anti-money laundering model;

classifying the features involved in the source domain sample set and the target domain sample set, and determining a common feature set of the source domain sample set and the target domain sample set, a unique feature set of the source domain sample set, and a unique feature set of the target domain sample set;

uniformly encoding the features in the source domain sample set and the features in the target domain sample set into a feature space corresponding to the union of the common feature set of the source domain sample set and the target domain sample set, the unique feature set of the source domain sample set, and the unique feature set of the target domain sample set;

merging the uniformly encoded source domain sample set and target domain sample set; and

training the anti-money laundering model based on the merged sample set.
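As an illustration of the encoding and merging steps of claim 1, the following is a minimal Python sketch, not the applicant's implementation. It assumes that each sample is a dictionary mapping feature names to numeric values, that a feature classified as unique is kept in a separate column per domain (the `src::` and `tgt::` prefixes are hypothetical), and that slots a sample does not populate are filled with 0.0. The claims specify none of these choices. The training step is omitted.

```python
from typing import Dict, List, Set, Tuple

Sample = Dict[str, float]  # feature name -> numeric value (assumed layout)


def build_feature_space(common: Set[str], src_only: Set[str],
                        tgt_only: Set[str]) -> List[str]:
    """Columns of the union space: common, source-unique, target-unique."""
    columns = sorted(common)
    columns += ["src::" + name for name in sorted(src_only)]  # hypothetical prefix
    columns += ["tgt::" + name for name in sorted(tgt_only)]  # hypothetical prefix
    return columns


def encode(sample: Sample, domain: str, common: Set[str], own_unique: Set[str],
           columns: List[str], fill: float = 0.0) -> List[float]:
    """Project one sample onto the unified columns.

    Slots belonging to the other domain's unique features receive `fill`
    (an assumption; the claims do not name a fill value).
    """
    row: Dict[str, float] = {}
    for name, value in sample.items():
        if name in common:
            row[name] = value
        elif name in own_unique:
            row[domain + "::" + name] = value
    return [row.get(column, fill) for column in columns]


def encode_and_merge(source: List[Sample], target: List[Sample],
                     common: Set[str], src_only: Set[str], tgt_only: Set[str],
                     fill: float = 0.0) -> Tuple[List[str], List[List[float]]]:
    """Encode both sample sets into one space and concatenate them."""
    columns = build_feature_space(common, src_only, tgt_only)
    merged = [encode(s, "src", common, src_only, columns, fill) for s in source]
    merged += [encode(t, "tgt", common, tgt_only, columns, fill) for t in target]
    return columns, merged


columns, merged = encode_and_merge(
    source=[{"amount": 100.0, "balance": 5.0}],
    target=[{"amount": 20.0, "balance": 7.0, "age": 30.0}],
    common={"amount"}, src_only={"balance"}, tgt_only={"balance", "age"},
)
print(columns)  # ['amount', 'src::balance', 'tgt::age', 'tgt::balance']
print(merged)   # [[100.0, 5.0, 0.0, 0.0], [20.0, 0.0, 30.0, 7.0]]
```

In this sketch, a feature that both domains record but that was classified as unique (here `balance`) occupies two columns, so a downstream model can weight it differently per domain, while a common feature shares a single column.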
The method according to claim 1, wherein classifying the features involved in the source domain sample set and the target domain sample set comprises:

determining the stability index of each continuous feature involved in the source domain sample set and the target domain sample set; and

classifying each of the continuous features based on the magnitude of the stability index of each of the continuous features.
The method according to claim 2, wherein determining the stability index of each continuous feature involved in the source domain sample set and the target domain sample set comprises:

determining the stability index of each continuous feature involved in the source domain sample set and the target domain sample set by the following formula:

$$\mathrm{PSI}(Y_e, Y; B)_j = \sum_{i=1}^{B} \left(y_{ij} - y_{eij}\right) \ln\frac{y_{ij}}{y_{eij}}$$

where $\mathrm{PSI}(Y_e, Y; B)_j$ represents the stability index of the $j$-th continuous feature among the continuous features involved in the source domain sample set and the target domain sample set; $Y_e$ represents the expected distribution, the expected distribution being the full data of the target domain sample set; $Y$ represents the actual distribution, the actual distribution being the full data of the source domain sample set; $B$ represents the preset number of buckets; $y_{ij}$ represents the proportion of the $j$-th continuous feature in the $i$-th bucket of the source domain sample set; and $y_{eij}$ represents the proportion of the $j$-th continuous feature in the $i$-th bucket of the target domain sample set.
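For illustration, a stability index over the quantities named in claim 3 can be sketched in Python as follows. The sketch uses the conventional population stability index, that is, the sum over buckets of (actual share minus expected share) times the natural logarithm of their ratio. Several choices are assumptions not taken from the claims: equal-width buckets derived from the expected (target domain) values, clamping of out-of-range values into the first or last bucket, and a small floor `eps` that keeps empty buckets from producing a division by zero. The function names are hypothetical.

```python
import bisect
import math
from typing import List, Sequence


def equal_width_edges(expected: Sequence[float], n_buckets: int) -> List[float]:
    """B + 1 bucket edges spanning the range of the expected values (assumed scheme)."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_buckets or 1.0  # avoid zero width for constant features
    return [lo + i * width for i in range(n_buckets + 1)]


def bucket_proportions(values: Sequence[float], edges: Sequence[float],
                       eps: float = 1e-6) -> List[float]:
    """Share of values per bucket; bucket i covers [edges[i], edges[i + 1])."""
    counts = [0] * (len(edges) - 1)
    for value in values:
        index = bisect.bisect_right(edges, value) - 1
        index = min(max(index, 0), len(counts) - 1)  # clamp out-of-range values
        counts[index] += 1
    # Floor at eps so that empty buckets do not break the logarithm below.
    return [max(count / len(values), eps) for count in counts]


def psi(expected: Sequence[float], actual: Sequence[float],
        n_buckets: int = 10, eps: float = 1e-6) -> float:
    """Stability index of one continuous feature.

    expected: values of the feature in the target domain sample set (Y_e)
    actual:   values of the feature in the source domain sample set (Y)
    """
    edges = equal_width_edges(expected, n_buckets)
    y_e = bucket_proportions(expected, edges, eps)
    y = bucket_proportions(actual, edges, eps)
    return sum((a - e) * math.log(a / e) for a, e in zip(y, y_e))


target_values = [1, 2, 3, 4, 5, 6, 7, 8]
source_values = [5, 6, 7, 8, 5, 6, 7, 8]
print(psi(target_values, target_values, n_buckets=2))  # 0.0
print(psi(target_values, source_values, n_buckets=2) > 0.25)  # True
```

Each term of the sum is non-negative, because the difference and the logarithm always share a sign, so the index is zero only when the two bucketed distributions coincide and grows as they diverge.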
The method according to claim 2, wherein classifying each of the continuous features based on the magnitude of the stability index of each of the continuous features comprises:

classifying the continuous features whose stability index is less than a first threshold into the common feature set of the source domain sample set and the target domain sample set;

classifying the continuous features involved in the source domain sample set whose stability index is not less than the first threshold into the unique feature set of the source domain sample set; and

classifying the continuous features involved in the target domain sample set whose stability index is not less than the first threshold into the unique feature set of the target domain sample set.
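The threshold rule of claim 4 can be sketched in Python as below. This is a hedged illustration: the function name, the dictionary of precomputed stability indices, and the default threshold of 0.1 are all hypothetical, since the claims give no value for the first threshold. The sketch also assumes that a continuous feature present in only one domain, for which no index can be computed, falls into that domain's unique set.

```python
from typing import Dict, Set, Tuple


def classify_continuous(psi_by_feature: Dict[str, float],
                        source_features: Set[str],
                        target_features: Set[str],
                        first_threshold: float = 0.1  # illustrative value only
                        ) -> Tuple[Set[str], Set[str], Set[str]]:
    """Split continuous features into common, source-unique and target-unique sets."""
    common: Set[str] = set()
    src_only: Set[str] = set()
    tgt_only: Set[str] = set()
    for name in source_features | target_features:
        score = psi_by_feature.get(name)
        in_both = name in source_features and name in target_features
        if in_both and score is not None and score < first_threshold:
            # Stable across domains: share one column.
            common.add(name)
            continue
        # Unstable (or single-domain, by assumption): keep per-domain copies.
        if name in source_features:
            src_only.add(name)
        if name in target_features:
            tgt_only.add(name)
    return common, src_only, tgt_only


common, src_only, tgt_only = classify_continuous(
    psi_by_feature={"amount": 0.02, "balance": 0.4},
    source_features={"amount", "balance", "tenure"},
    target_features={"amount", "balance"},
)
print(sorted(common), sorted(src_only), sorted(tgt_only))
# ['amount'] ['balance', 'tenure'] ['balance']
```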
The method according to claim 1 or 2, wherein classifying the features involved in the source domain sample set and the target domain sample set comprises:

classifying the discrete features involved in the source domain sample set into the unique feature set of the source domain sample set; and

classifying the discrete features involved in the target domain sample set into the unique feature set of the target domain sample set.
The method according to claim 1, wherein before classifying the features involved in the source domain sample set and the target domain sample set, the method further comprises:

determining whether discrete features of a preset category exist among the features involved in the source domain sample set; and

if so, converting the discrete features of the preset category into continuous features.
The method according to claim 6, wherein converting the discrete features of the preset category into continuous features comprises:

counting, in the source domain sample set, the sample conditions associated with the discrete features of each preset category; and

determining the sample conditions of the discrete features of each preset category as the continuous features corresponding to the discrete features of that preset category.
The method according to claim 7, wherein the sample conditions include at least one of the following: the number of positive transaction samples; the number of negative transaction samples; the proportion of negative transaction samples in the source domain sample set; the proportion of positive transaction samples in the source domain sample set; and the proportion of the number of transactions of any individual in the total number of transactions of individuals in the source domain sample set; wherein, in the source domain sample set, a sample whose transaction type is legal behavior is a positive transaction sample, and a sample whose transaction type is suspicious behavior is a negative transaction sample.
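As a rough illustration of claims 7 and 8, the Python sketch below replaces one discrete feature with count-based continuous features computed over the source domain sample set. The field names (`channel`, `is_suspicious`), the output key names, and the reading of "proportion in the source domain sample set" as a count divided by the total number of source samples are assumptions made for the example. Following claim 8, a legal transaction is treated as positive and a suspicious one as negative.

```python
from collections import defaultdict
from typing import Any, Dict, List

Sample = Dict[str, Any]


def category_statistics(samples: List[Sample], feature: str,
                        label_key: str = "is_suspicious") -> Dict[Any, Dict[str, float]]:
    """Per-category sample conditions over the whole source sample set."""
    counts = defaultdict(lambda: {"positive": 0, "negative": 0})
    for sample in samples:
        # Legal behavior -> positive sample; suspicious behavior -> negative sample.
        kind = "negative" if sample[label_key] else "positive"
        counts[sample[feature]][kind] += 1
    total = len(samples)
    return {
        category: {
            "n_positive": c["positive"],
            "n_negative": c["negative"],
            "positive_share": c["positive"] / total,
            "negative_share": c["negative"] / total,
        }
        for category, c in counts.items()
    }


def to_continuous(sample: Sample, feature: str,
                  stats: Dict[Any, Dict[str, float]]) -> Sample:
    """Drop the discrete feature and attach its statistics as continuous features."""
    row = {key: value for key, value in sample.items() if key != feature}
    for stat_name, value in stats[sample[feature]].items():
        row[feature + "_" + stat_name] = value
    return row


source = [
    {"channel": "atm", "is_suspicious": False},
    {"channel": "atm", "is_suspicious": True},
    {"channel": "web", "is_suspicious": False},
    {"channel": "web", "is_suspicious": False},
]
stats = category_statistics(source, "channel")
print(stats["atm"])
# {'n_positive': 1, 'n_negative': 1, 'positive_share': 0.25, 'negative_share': 0.25}
print(to_continuous(source[0], "channel", stats))
```

Once converted this way, the resulting columns are continuous and could be assessed with a stability index like any other continuous feature, rather than being assigned directly to a unique feature set as discrete features are.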
An anti-money laundering model training device, comprising:

an acquiring unit, configured to acquire a source domain sample set and a target domain sample set, wherein both the source domain sample and the target domain sample are transaction samples used for training an anti-money laundering model;

a classification unit, configured to classify the features involved in the source domain sample set and the target domain sample set, and determine a common feature set of the source domain sample set and the target domain sample set, a unique feature set of the source domain sample set, and a unique feature set of the target domain sample set;

an encoding unit, configured to uniformly encode the features in the source domain sample set and the features in the target domain sample set into a feature space corresponding to the union of the common feature set of the source domain sample set and the target domain sample set, the unique feature set of the source domain sample set, and the unique feature set of the target domain sample set;

a merging unit, configured to merge the uniformly encoded source domain sample set and target domain sample set; and

a training unit, configured to train the anti-money laundering model based on the merged sample set.
The apparatus of claim 9, wherein the classification unit comprises:

a determination module, configured to determine the stability index of each continuous feature involved in the source domain sample set and the target domain sample set;

a first classification module, configured to classify each of the continuous features based on the magnitude of the stability index of each of the continuous features.
The apparatus according to claim 10, wherein the determination module is configured to determine the stability index of each continuous feature involved in the source domain sample set and the target domain sample set by the following formula:

$$\mathrm{PSI}(Y_e, Y; B)_j = \sum_{i=1}^{B} \left(y_{ij} - y_{eij}\right) \ln\frac{y_{ij}}{y_{eij}}$$

where $\mathrm{PSI}(Y_e, Y; B)_j$ represents the stability index of the $j$-th continuous feature among the continuous features involved in the source domain sample set and the target domain sample set; $Y_e$ represents the expected distribution, the expected distribution being the full data of the target domain sample set; $Y$ represents the actual distribution, the actual distribution being the full data of the source domain sample set; $B$ represents the preset number of buckets; $y_{ij}$ represents the proportion of the $j$-th continuous feature in the $i$-th bucket of the source domain sample set; and $y_{eij}$ represents the proportion of the $j$-th continuous feature in the $i$-th bucket of the target domain sample set.
The apparatus according to claim 10, wherein the first classification module is configured to: classify the continuous features whose stability index is less than a first threshold into the common feature set of the source domain sample set and the target domain sample set; classify the continuous features involved in the source domain sample set whose stability index is not less than the first threshold into the unique feature set of the source domain sample set; and classify the continuous features involved in the target domain sample set whose stability index is not less than the first threshold into the unique feature set of the target domain sample set.
The apparatus according to claim 9 or 10, wherein the classification unit comprises:

a second classification module, configured to classify the discrete features involved in the source domain sample set into the unique feature set of the source domain sample set, and to classify the discrete features involved in the target domain sample set into the unique feature set of the target domain sample set.
The apparatus of claim 9, wherein the apparatus further comprises:

a judgment unit, configured to determine, before the classification unit classifies the features involved in the source domain sample set and the target domain sample set, whether discrete features of a preset category exist among the features involved in the source domain sample set, and if so, to trigger a conversion unit; and

the conversion unit, configured to convert the discrete features of the preset category into continuous features when triggered by the judgment unit.
The apparatus according to claim 14, wherein the conversion unit is configured to count, in the source domain sample set, the sample conditions associated with the discrete features of each preset category, and to determine the sample conditions of the discrete features of each preset category as the continuous features corresponding to the discrete features of that preset category.
The apparatus according to claim 15, wherein the sample conditions include at least one of the following: the number of positive transaction samples; the number of negative transaction samples; the proportion of negative transaction samples in the source domain sample set; the proportion of positive transaction samples in the source domain sample set; and the proportion of the number of transactions of any individual in the total number of transactions of individuals in the source domain sample set; wherein, in the source domain sample set, a sample whose transaction type is legal behavior is a positive transaction sample, and a sample whose transaction type is suspicious behavior is a negative transaction sample.
A computer-readable storage medium, the storage medium comprising a stored program, wherein when the program is run, a device on which the storage medium is located is controlled to execute the anti-money laundering model training method according to any one of claims 1 to 8.
A storage management device, comprising:

a memory, configured to store a program; and

a processor, coupled to the memory and configured to run the program to perform the anti-money laundering model training method according to any one of claims 1 to 8.