WO2022143431A1 - Method and apparatus for training anti-money laundering model - Google Patents


Info

Publication number
WO2022143431A1
WO2022143431A1 PCT/CN2021/140997 CN2021140997W WO2022143431A1 WO 2022143431 A1 WO2022143431 A1 WO 2022143431A1 CN 2021140997 W CN2021140997 W CN 2021140997W WO 2022143431 A1 WO2022143431 A1 WO 2022143431A1
Authority
WO
WIPO (PCT)
Prior art keywords
sample set
domain sample
source domain
features
target domain
Prior art date
Application number
PCT/CN2021/140997
Other languages
French (fr)
Chinese (zh)
Inventor
徐紫绮
朱晓丹
王萌
Original Assignee
第四范式(北京)技术有限公司
Application filed by 第四范式(北京)技术有限公司
Publication of WO2022143431A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/04 Trading; Exchange, e.g. stocks, commodities, derivatives or currency exchange
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques

Definitions

  • The present application relates to the field of computer technology, and in particular to a training method and device for an anti-money laundering model.
  • Money laundering is the process of concealing, disguising or investing illegally obtained income through legitimate activities or construction.
  • Money laundering therefore needs to be monitored on the Internet.
  • Such monitoring is mainly accomplished by using anti-money laundering models to analyze and identify Internet data.
  • Anti-money laundering models need to be trained on a large number of samples with known labels.
  • Sample labels come mainly from a rule system set up by professionals with high business literacy, and the quality of the labels varies. To train an anti-money laundering model that identifies money laundering well, a large amount of human resources must therefore be invested in label review over a long period. Label review, however, carries operational risk, and the experience of reviewers may become invalid.
  • The present application proposes a training method and device for an anti-money laundering model. Its main purpose is to complete the anti-money laundering model training task for the target domain sample set by introducing the features of the source domain sample set, thereby improving anti-money laundering identification.
  • the main technical solutions include:
  • The present application provides a training method for an anti-money laundering model. The method includes: obtaining a source domain sample set and a target domain sample set, wherein the source domain samples and the target domain samples are both transaction samples used for training the anti-money laundering model; classifying the features involved in the source domain sample set and the target domain sample set to determine the common feature set of the two sample sets, the unique feature set of the source domain sample set, and the unique feature set of the target domain sample set; uniformly encoding the features in the source domain sample set and the features in the target domain sample set into the feature space corresponding to the union of these three feature sets; merging the uniformly encoded source domain sample set and target domain sample set; and training an anti-money laundering model based on the merged sample set.
  • The present application provides a training device for an anti-money laundering model. The device includes: an acquisition unit configured to acquire a source domain sample set and a target domain sample set, wherein the source domain samples and the target domain samples are both transaction samples used for training the anti-money laundering model; a classification unit configured to classify the features involved in the source domain sample set and the target domain sample set and to determine the common feature set of the two sample sets, the unique feature set of the source domain sample set, and the unique feature set of the target domain sample set; an encoding unit configured to uniformly encode the features in the source domain sample set and the features in the target domain sample set into the feature space corresponding to the union of these three feature sets;
  • a merging unit configured to merge the uniformly encoded source domain sample set and target domain sample set; and
  • a training unit configured to train an anti-money laundering model based on the merged sample set.
  • The present application provides a computer-readable storage medium that includes a stored program, wherein, when the program runs, the device on which the storage medium is located is controlled to execute the training method for the anti-money laundering model of the first aspect.
  • The present application provides a storage management device comprising: a memory configured to store a program; and a processor coupled to the memory and configured to run the program to execute the training method for the anti-money laundering model of the first aspect.
  • The training method and device provided by the present application first obtain a source domain sample set and a target domain sample set, classify the features involved in the two sample sets, and determine the common feature set of the source domain sample set and the target domain sample set, the unique feature set of the source domain sample set, and the unique feature set of the target domain sample set.
  • The features in the source domain sample set and the features in the target domain sample set are then uniformly encoded into the feature space corresponding to the union of these three feature sets. The uniformly encoded sample sets are merged, and the anti-money laundering model is trained on the merged sample set.
  • The solution provided by this application thus completes the anti-money laundering model training task for the target domain sample set by introducing the features of the source domain sample set. The model learns the existing knowledge in the source domain sample set and the new knowledge in the target domain sample set at the same time, so existing knowledge is accumulated and consolidated while new knowledge is learned, which improves the anti-money laundering identification of the model.
  • FIG. 1 shows a flowchart of a training method for an anti-money laundering model provided by an embodiment of the present application
  • FIG. 2 shows a flowchart of a training method for an anti-money laundering model provided by another embodiment of the present application
  • FIG. 3 shows a schematic structural diagram of a training device for an anti-money laundering model provided by an embodiment of the present application
  • FIG. 4 shows a schematic structural diagram of a training device for an anti-money laundering model provided by another embodiment of the present application.
  • Money laundering is often hidden in financial transactions, so the data generated by transaction behavior in the financial field contains a large number of features related to money laundering, and these features can serve as the training basis for anti-money laundering models.
  • The embodiments of the present application therefore use a source domain sample set, which carries existing knowledge, together with the target domain sample set for learning new knowledge to train the anti-money laundering model.
  • A model trained in this way learns the existing knowledge in the source domain sample set and the new knowledge in the target domain sample set at the same time. Existing knowledge is accumulated and consolidated while new knowledge is learned, which improves the anti-money laundering identification of the model.
  • an embodiment of the present application provides a training method for an anti-money laundering model, and the method mainly includes:
  • the source domain sample set and the target domain sample set are both datasets oriented to the financial field.
  • The source domain samples in the source domain sample set and the target domain samples in the target domain sample set are both transaction samples for training anti-money laundering models. Each transaction sample has a corresponding binary label that indicates whether the transaction sample is money laundering or legal.
  • The transaction samples in the source domain sample set and in the target domain sample set are determined in basically the same way. The difference is that the knowledge involved in the source domain sample set is existing knowledge, while the target domain sample set involves new knowledge that needs to be learned.
  • Step 1: Determine the transaction samples and define their labels.
  • The time granularity is one day.
  • The dates on which a customer has transactions are selected from the financial institution's transaction records to form customer-day transaction records, and the transaction records generated by one customer in one day are determined as one candidate sample.
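  • The customer-day grouping described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the record fields `customer_id`, `timestamp` and `amount`, and the function name `build_candidate_samples`, are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def build_candidate_samples(transactions):
    """Group raw transaction records into customer-day candidate samples.

    Field names ("customer_id", "timestamp") are assumptions for illustration.
    """
    samples = defaultdict(list)
    for record in transactions:
        # Truncate the timestamp to a calendar day (day granularity).
        day = datetime.fromisoformat(record["timestamp"]).date()
        samples[(record["customer_id"], day)].append(record)
    return dict(samples)


transactions = [
    {"customer_id": "C1", "timestamp": "2021-01-05T09:30:00", "amount": 120.0},
    {"customer_id": "C1", "timestamp": "2021-01-05T23:10:00", "amount": 900.0},
    {"customer_id": "C1", "timestamp": "2021-01-06T10:00:00", "amount": 50.0},
    {"customer_id": "C2", "timestamp": "2021-01-05T14:00:00", "amount": 75.0},
]
candidates = build_candidate_samples(transactions)
```

  • Each key of `candidates` is one (customer, day) pair, so a customer with transactions on two days yields two candidate samples.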
  • The transaction samples are screened from the candidate samples.
  • The screening process includes the following operation: first, determine whether the customer-day transaction records contain records for which the date of the money laundering report differs greatly from the date of the money laundering activity. If so, the candidate samples corresponding to those records are excluded and are not selected as transaction samples.
  • The date of the money laundering report is the date of manual reporting.
  • The date of the money laundering activity is the date reported by the financial institution's anti-money laundering rule system.
  • the anti-money laundering rule system of a financial institution such as a bank reports money laundering triggered by a certain customer
  • the transaction records on the reporting date corresponding to the customer and N before the reporting date (N is greater than or equal to 1, exemplarily, N
  • The transaction samples in the source domain sample set and in the target domain sample set come from different sources: the knowledge involved in the source domain sample set is existing knowledge, while the target domain sample set involves new knowledge that needs to be learned.
  • Exemplarily, the source domain sample set consists of the transaction records generated by financial institution A in January, whose features are already known, and the target domain sample set consists of the transaction records generated by financial institution A in February, which include new knowledge that needs to be learned.
  • Both sample sets need to be obtained so that, by means of transfer learning, the anti-money laundering model can learn the existing knowledge involved in the source domain sample set and also the new knowledge involved in the target domain sample set.
  • Step 2: Perform feature splicing on the transaction samples.
  • The features of transaction samples mainly include user-type features and user-behavior-type features.
  • User-type features mainly describe information about the user, such as age, gender, deposit balance, and number of family members.
  • User-behavior-type features mainly describe information related to the user's transaction behavior, such as the amount of the user's late-night transfers, the number of the user's ATM withdrawals, and the number of the user's counter deposits within a week.
  • Feature splicing is mainly used to enrich the features of transaction samples, so that the anti-money laundering model can learn more useful anti-money laundering information.
  • Feature splicing in effect derives new features from the existing features of the transaction samples. Exemplarily, if a transaction sample includes the number of the user's counter deposits in one week and the amount of each of those deposits, the feature "total amount of user counter deposits in one week" can be derived.
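  • The derivation in the example above can be sketched as a one-line aggregation. The key names below (`counter_deposit_amounts_7d` and so on) are hypothetical and chosen only for illustration.

```python
def derive_weekly_deposit_total(sample):
    """Return a copy of the sample with a derived weekly total-deposit feature."""
    enriched = dict(sample)
    # Derived feature: total amount of counter deposits in one week.
    enriched["counter_deposit_total_7d"] = sum(sample["counter_deposit_amounts_7d"])
    return enriched


sample = {
    "counter_deposit_count_7d": 3,
    "counter_deposit_amounts_7d": [1000.0, 2500.0, 500.0],
}
enriched = derive_weekly_deposit_total(sample)
```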
  • Table-1 is a transaction sample after feature splicing.
  • the source domain sample set and the target domain sample set can be stored in the database in a multi-copy manner according to the daily partition (slice table) or the full scale (zipper table).
  • Classifying the features involved in the source domain sample set and the target domain sample set serves two main purposes. First, it checks whether the two sample sets share some parameters of the anti-money laundering model, where the parameters include model parameters or model hyperparameters. If the two sample sets are found to share parameters, the anti-money laundering model can be trained on the source domain sample set and the target domain sample set by means of transfer learning.
  • Checking whether the source domain sample set and the target domain sample set share some parameters of the anti-money laundering model is essentially determining whether the two sample sets have a common feature set.
  • Second, on the basis of the shared parameters, it identifies the common parameters and the unique parameters of the source domain sample set and the target domain sample set in their respective anti-money laundering tasks.
  • the following describes the process of classifying the features involved in the source domain sample set and the target domain sample set.
  • the process specifically includes the following steps 1 and 2:
  • Step 1: Determine the stability index of each continuous feature involved in the source domain sample set and the target domain sample set.
  • The following formula can be used to determine the stability index of each continuous feature involved in the source domain sample set and the target domain sample set.
  • For each continuous feature, the stability index reflects the difference between its distribution in the source domain sample set and its distribution in the target domain sample set. Based on this difference, it can be determined whether the continuous feature is a common feature of the two sample sets or a unique feature of one sample set.
  • The formula for determining the stability index of each continuous feature involved in the source domain sample set and the target domain sample set is:
  • PSI(Y_e, Y; B)_j denotes the stability index of the j-th continuous feature among the continuous features involved in the source domain sample set and the target domain sample set; Y_e denotes the expected distribution, which is the full data of the target domain sample set; Y denotes the actual distribution, which is the full data of the source domain sample set; B denotes the preset number of buckets; y_ij denotes the proportion of the j-th continuous feature in the i-th bucket of the source domain sample set; y_eij denotes the proportion of the j-th continuous feature in the i-th bucket of the target domain sample set.
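  • The formula itself is not reproduced in the text above. Assuming the index follows the standard population stability index, which matches the symbols just defined, it would read as below. This is a reconstruction under that assumption, not a quotation from the application.

```latex
\mathrm{PSI}(Y_e, Y; B)_j \;=\; \sum_{i=1}^{B} \left( y_{ij} - y_{eij} \right) \ln \frac{y_{ij}}{y_{eij}}
```

  • Every term of the sum is non-negative, so the index is zero only when the two bucketed distributions coincide and grows as they diverge.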
  • The number of buckets may be determined based on service requirements and is not specifically limited in this embodiment. Note that if there are too many buckets, each bucket may contain too few samples to be statistically meaningful, and if there are too few buckets, the calculated result is less accurate. The total number of samples in the source domain sample set and the target domain sample set should therefore be taken into account when choosing the number of buckets.
  • Bucketing can be done with equal-sized buckets. Exemplarily, the number of buckets is 15.
  • The smaller the stability index of a continuous feature, the smaller the difference in that feature between the two sample sets, indicating that it is a common feature of the two sample sets.
  • The larger the stability index of a continuous feature, the greater the difference in that feature between the two sample sets, indicating that it is a unique feature of the corresponding sample set.
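  • A stability index of this kind can be computed as in the sketch below. It assumes the standard population stability index form, equal-width buckets over the pooled value range, and a small floor `eps` for empty buckets; none of these choices is stated in the source, and the function names are hypothetical.

```python
import math


def bucket_proportions(values, lo, hi, n_buckets):
    """Share of values in each of n_buckets equal-width buckets over [lo, hi]."""
    counts = [0] * n_buckets
    width = (hi - lo) / n_buckets
    for v in values:
        idx = int((v - lo) / width) if width > 0 else 0
        # The maximum value would index one past the end, so clamp it.
        counts[min(idx, n_buckets - 1)] += 1
    return [c / len(values) for c in counts]


def psi(expected, actual, n_buckets=15, eps=1e-6):
    """Population stability index of one continuous feature.

    expected: feature values in the target domain sample set (Y_e)
    actual:   feature values in the source domain sample set (Y)
    """
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    y_e = bucket_proportions(expected, lo, hi, n_buckets)
    y = bucket_proportions(actual, lo, hi, n_buckets)
    total = 0.0
    for p_actual, p_expected in zip(y, y_e):
        # Floor empty buckets so the logarithm stays defined (an assumption).
        p_actual = max(p_actual, eps)
        p_expected = max(p_expected, eps)
        total += (p_actual - p_expected) * math.log(p_actual / p_expected)
    return total


target_values = [float(v) for v in range(100)]
source_values = [v + 50.0 for v in target_values]
stability = psi(target_values, source_values)
```

  • Identical distributions give an index of zero, while the shifted distribution in the example gives a large value, which is the behaviour the classification step relies on.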
  • Step 2: Classify each continuous feature based on the size of its stability index.
  • Because the stability index of a continuous feature reflects the difference between its distribution in the source domain sample set and in the target domain sample set, the continuous features can be classified based on the size of their stability indexes.
  • The following describes the process of classifying each continuous feature based on the size of its stability index.
  • The process specifically includes the following three steps:
  • First, classify the continuous features whose stability index is less than a first threshold into the common feature set of the source domain sample set and the target domain sample set.
  • A stability index below the first threshold indicates that the feature differs little between the two sample sets, so these continuous features are classified into the common feature set of the source domain sample set and the target domain sample set.
  • the size of the first threshold may be determined based on service requirements, which is not specifically limited in this embodiment.
  • Exemplarily, the first threshold is 0.25; that is, all continuous features whose stability index is less than 0.25 are classified into the common feature set.
  • Feature classification:

      Feature                               Stability index   Classification
      F1 (ATM withdrawals)                  0.23              shared
      F2 (deposit amount/10,000 yuan)       0.25              non-shared
      F3 (number of night transactions)     0.001             shared
      F4 (amount received at night)         0.004             shared
      F5 (1-day debit transaction amount)   0.3               non-shared
      F6 (3-day total transaction amount)   0.123             shared
      F7 (total number of transactions in 3 days)  0.03       shared
      F8 (10-day loan amount ratio)         0.02              shared
  • Second, classify the continuous features involved in the source domain sample set whose stability index is not less than the first threshold into the unique feature set of the source domain sample set.
  • A stability index not less than the first threshold indicates that the feature differs greatly between the two sample sets, so such a continuous feature is a unique feature of the source domain sample set and is classified into the unique feature set of the source domain sample set.
  • Exemplarily, the first threshold is 0.25, and all continuous features involved in the source domain sample set whose stability index is greater than or equal to 0.25 are classified into the unique feature set of the source domain sample set.
  • Third, classify the continuous features involved in the target domain sample set whose stability index is not less than the first threshold into the unique feature set of the target domain sample set.
  • A stability index not less than the first threshold indicates that the feature differs greatly between the two sample sets, so such a continuous feature is a unique feature of the target domain sample set and is classified into the unique feature set of the target domain sample set.
  • Exemplarily, the first threshold is 0.25, and all continuous features involved in the target domain sample set whose stability index is greater than or equal to 0.25 are classified into the unique feature set of the target domain sample set.
  • In addition to the steps above, classifying the features involved in the source domain sample set and the target domain sample set also includes: classifying the discrete features involved in the source domain sample set into the unique feature set of the source domain sample set, and classifying the discrete features involved in the target domain sample set into the unique feature set of the target domain sample set. Because the discrete features involved in the two sample sets are basically user-type features, which are vertically isolated and unique to their respective sample sets, the discrete features of each sample set are classified directly into its own unique feature set.
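  • The threshold rule for continuous features can be sketched as follows, using the stability indexes from the example table and the exemplary threshold of 0.25. The function name and the assumption that every feature appears in both sample sets are illustrative only; discrete features would simply be added to the unique set of their own sample set.

```python
def classify_continuous_features(psi_by_feature, source_features,
                                 target_features, threshold=0.25):
    """Split continuous features into common and domain-unique sets by PSI."""
    common, source_unique, target_unique = set(), set(), set()
    for feature, value in psi_by_feature.items():
        if value < threshold:
            # Small distribution difference: shared by both sample sets.
            common.add(feature)
        else:
            # Not less than the threshold: unique to each set that has it.
            if feature in source_features:
                source_unique.add(feature)
            if feature in target_features:
                target_unique.add(feature)
    return common, source_unique, target_unique


psi_by_feature = {"F1": 0.23, "F2": 0.25, "F3": 0.001, "F4": 0.004,
                  "F5": 0.3, "F6": 0.123, "F7": 0.03, "F8": 0.02}
features = set(psi_by_feature)
common, source_unique, target_unique = classify_continuous_features(
    psi_by_feature, features, features)
```

  • F2 sits exactly at 0.25 and is therefore non-shared, because the rule is "not less than the first threshold".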
  • The features in the source domain sample set and the features in the target domain sample set are uniformly encoded into the feature space corresponding to the union of the common feature set of the two sample sets, the unique feature set of the source domain sample set, and the unique feature set of the target domain sample set.
  • For the anti-money laundering model to learn both the features in the source domain sample set and the features in the target domain sample set, the features of both sample sets must be uniformly encoded into this feature space, so that the model can learn the existing knowledge in the source domain sample set as well as the new knowledge in the target domain sample set.
  • Because the anti-money laundering model learns existing knowledge and new knowledge at the same time, existing knowledge is accumulated and consolidated while new knowledge is learned, which improves the anti-money laundering identification of the model.
  • The data required by the anti-money laundering model is numeric, because calculations can be performed only on numeric types. Each kind of feature therefore needs to be encoded accordingly, which is also a quantization process.
  • Encoding maps the features of both sample sets into the feature space corresponding to the union of the common feature set, the unique feature set of the source domain sample set, and the unique feature set of the target domain sample set.
  • The encoding mechanism may be determined according to service requirements and is not specifically limited in this embodiment.
  • Exemplarily, the encoding mechanism can be one-hot encoding.
  • The samples of the source domain sample set and the target domain sample set can be feature-encoded uniformly; that is, the feature spaces are encoded separately, the samples are merged directly, and the merged samples enter the feature extraction operator together.
  • the unique feature of the source domain sample set is a location
  • the unique feature of the target domain sample set is a location
  • After unified encoding, a feature space is obtained that includes the features in the common feature set of the source domain sample set and the target domain sample set, the features in the unique feature set of the source domain sample set, and the features in the unique feature set of the target domain sample set. This feature space provides the data basis for the subsequent training of the anti-money laundering model.
  • Because the anti-money laundering model requires numeric data, feature quantization is complete once the various features have been encoded, and the uniformly encoded source domain sample set and target domain sample set can be merged to form the training data for the anti-money laundering model.
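  • One possible reading of unified encoding and merging is sketched below: each sample is projected onto a single column layout made of the common features, the source-unique features and the target-unique features, with a fill value where a column does not apply to the sample's domain. The column layout, the zero fill and the function names are assumptions; the source leaves the encoding mechanism open.

```python
def build_feature_space(common, source_unique, target_unique):
    """Ordered columns for the union of the three feature sets."""
    columns = [("common", f) for f in sorted(common)]
    columns += [("source", f) for f in sorted(source_unique)]
    columns += [("target", f) for f in sorted(target_unique)]
    return columns


def encode(sample, domain, columns, fill=0.0):
    """Project one numeric sample from `domain` onto the unified feature space."""
    row = []
    for group, feature in columns:
        if group in ("common", domain):
            row.append(float(sample.get(feature, fill)))
        else:
            # Column belongs to the other domain's unique feature set.
            row.append(fill)
    return row


columns = build_feature_space(
    {"night_txn_count"}, {"debit_amount_1d"}, {"debit_amount_1d"})
source_rows = [encode({"night_txn_count": 2, "debit_amount_1d": 500},
                      "source", columns)]
target_rows = [encode({"night_txn_count": 1, "debit_amount_1d": 80},
                      "target", columns)]
# Merging is a plain concatenation once both sets share one layout.
merged = source_rows + target_rows
```

  • With a shared layout, the common column carries values from both domains, while each unique column is populated only by samples of its own domain.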
  • The anti-money laundering model is used to identify money laundering in the data generated during financial transactions, that is, to identify whether the data represents money laundering or legal activity, so the anti-money laundering model is a binary classification model.
  • The specific type of the anti-money laundering model may be determined based on business requirements and is not specifically limited in this embodiment.
  • Exemplarily, the anti-money laundering model is a GBDT (Gradient Boosting Decision Tree) or LR (Logistic Regression) model.
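  • As a minimal illustration of the LR option, the sketch below trains a logistic regression classifier by stochastic gradient descent on a few already-encoded rows. The toy data, the learning rate, the epoch count and the convention that label 1 means money laundering are all assumptions made for the example.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_regression(rows, labels, lr=0.1, epochs=500):
    """Fit weights and bias by per-sample gradient descent on log loss."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            pred = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            error = pred - y  # gradient of log loss w.r.t. the logit
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


def predict(weights, bias, x):
    """Return 1 (assumed: money laundering) or 0 (assumed: legal)."""
    score = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
    return 1 if score >= 0.5 else 0


rows = [[0.1, 0.0], [0.2, 0.1], [0.9, 1.0], [0.8, 0.9]]
labels = [0, 0, 1, 1]
weights, bias = train_logistic_regression(rows, labels)
```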
  • How the anti-money laundering model is trained on the merged sample set depends on which samples are input to the model for training, and includes at least the following:
  • The first is to input all the samples in the merged sample set into the anti-money laundering model for training.
  • In this approach, the features input to the model are rich, so the anti-money laundering model learns the existing knowledge in the source domain sample set and the new knowledge in the target domain sample set at the same time. Existing knowledge is accumulated and consolidated while new knowledge is learned, which improves the anti-money laundering identification of the model.
  • The second is to extract a set number of samples from the merged sample set and input the extracted samples into the anti-money laundering model for training.
  • The features involved in the extracted samples include the common features of the source domain sample set and the target domain sample set, the unique features of the source domain sample set, and the unique features of the target domain sample set. Because only a set number of samples is extracted, the anti-money laundering model can be trained with less computing power while still learning both the existing knowledge in the source domain sample set and the new knowledge in the target domain sample set, which improves the anti-money laundering identification of the model.
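  • The second approach can be sketched as a random draw from the merged set. The version below splits the draw between the two domains in proportion to their sizes so that the extracted samples cover features from both; that allocation rule, the fixed seed and the function name are assumptions, since the source only says that a set number of samples is extracted.

```python
import random


def draw_training_subset(source_rows, target_rows, n_samples, seed=0):
    """Draw up to n_samples rows, split across domains in proportion to size."""
    rng = random.Random(seed)
    total = len(source_rows) + len(target_rows)
    n_samples = min(n_samples, total)
    # Allocate draws according to each domain's share of the merged set.
    n_source = min(len(source_rows),
                   round(n_samples * len(source_rows) / total))
    n_target = min(len(target_rows), n_samples - n_source)
    return rng.sample(source_rows, n_source) + rng.sample(target_rows, n_target)


source_rows = [("source", i) for i in range(6)]
target_rows = [("target", i) for i in range(4)]
subset = draw_training_subset(source_rows, target_rows, n_samples=5)
```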
  • Table-4 is a sample selected from the combined sample set for training an anti-money laundering model.
  • In the training method provided by this embodiment, a source domain sample set and a target domain sample set are first obtained, the features involved in the two sample sets are classified, and the common feature set of the source domain sample set and the target domain sample set, the unique feature set of the source domain sample set, and the unique feature set of the target domain sample set are determined.
  • The features in the source domain sample set and the features in the target domain sample set are then uniformly encoded into the feature space corresponding to the union of these three feature sets. The uniformly encoded sample sets are merged, and the anti-money laundering model is trained on the merged sample set.
  • The solution provided by this embodiment thus completes the anti-money laundering model training task for the target domain sample set by introducing the features of the source domain sample set. The model learns the existing knowledge in the source domain sample set and the new knowledge in the target domain sample set at the same time, so existing knowledge is accumulated while new knowledge is learned, which improves the anti-money laundering identification of the model.
  • Further, based on the method shown in FIG. 1, another embodiment of the present application also provides a training method for an anti-money laundering model. As shown in FIG. 2, the method mainly includes:
  • If the source domain sample set includes discrete features of preset categories, such as static customer attributes, IP addresses used in transactions, transaction regions, and counterparty accounts, the efficiency of anti-money laundering model training is affected. Moreover, in the anti-money laundering scenario, the distributions of these discrete features differ greatly between the source domain sample set and the target domain sample set. If these discrete features of the source domain sample set are applied directly to the target domain sample set, they become invalid, and the trained anti-money laundering model cannot learn them, which results in poor anti-money laundering performance. Therefore, for the anti-money laundering model to learn the discrete features of the preset categories in the source domain sample set, it is necessary to determine whether the features involved in the source domain sample set include discrete features of the preset categories.
  • If discrete features of the preset category exist, step 203 is executed to convert them into continuous features, which ensures that these discrete features can be learned by the anti-money laundering model.
  • Otherwise, step 204 can be executed.
  • the discrete-to-continuous transformation is performed on the discrete features of the preset category.
  • the process of converting the discrete features of the preset category into continuous features includes the following steps 1 to 2:
  • Step 1 Count the situation of the samples in the source domain sample set associated with the discrete features of each of the preset categories.
  • the main purpose of counting the sample conditions associated with the discrete features of each of the preset categories in the source domain sample set includes the following two points: First, through what kind of relationship the suspicious risk is propagated, and to whom the risk is propagated . The second is how close a certain relationship is, and how big is the risk of spreading to individuals through that relationship.
  • the specific process of counting the sample conditions associated with the discrete features of each preset category in the source domain sample set is as follows: performing for each discrete feature of the preset category: counting the offline features related to the offline feature within the preset time period; The characteristic condition is determined as the sample condition associated with the discrete feature.
  • the sample situation includes at least one of the following: the number of positive transaction samples, the number of negative transaction samples, the proportion of negative transaction samples in the source domain sample set, the proportion of positive transaction samples in the source domain sample set, and the proportion of the transaction count of any individual in the total transaction count of that individual in the source domain sample set; wherein, in the source domain sample set, samples whose transaction type is legal behavior are positive transaction samples, and samples whose transaction type is suspicious behavior are negative transaction samples.
  • the number of negative samples or the proportion of negative samples in the total samples associated with the discrete feature X in the source domain sample set is counted.
  • the number of negative samples or the proportion of negative samples in the total samples is determined as the sample situation of the discrete feature X.
  • Step 2: Determine the sample situation of the discrete features of each preset category as the continuous features corresponding to the discrete features of each preset category.
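Steps 1 and 2 can be illustrated with a minimal sketch that replaces each value of a discrete feature by the proportion of negative (suspicious) samples associated with it in the source domain sample set. The function names, the `counterparty` feature, and the convention that label `1` marks a negative sample are assumptions made for this example; the application does not prescribe them.

```python
from collections import defaultdict


def negative_ratio_encoding(samples, feature, label_key="label"):
    """Step 1: for each value of a discrete feature, count the share of
    negative (suspicious) samples among the samples carrying that value."""
    totals = defaultdict(int)
    negatives = defaultdict(int)
    for sample in samples:
        value = sample[feature]
        totals[value] += 1
        # Assumption: label 1 marks a negative (suspicious) transaction sample.
        if sample[label_key] == 1:
            negatives[value] += 1
    return {value: negatives[value] / totals[value] for value in totals}


def apply_encoding(samples, feature, mapping, default=0.0):
    """Step 2: use the counted statistic as the continuous feature value.
    Unseen values fall back to `default` (an assumption of this sketch)."""
    return [{**s, feature: mapping.get(s[feature], default)} for s in samples]


# Hypothetical source domain samples with one discrete feature.
source_samples = [
    {"counterparty": "A", "label": 1},
    {"counterparty": "A", "label": 0},
    {"counterparty": "B", "label": 0},
    {"counterparty": "B", "label": 0},
]
mapping = negative_ratio_encoding(source_samples, "counterparty")
encoded = apply_encoding(source_samples, "counterparty", mapping)
```

The same pattern works for the other statistics listed above (for example, the count of negative samples) by changing what the mapping stores.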
  • the feature set and the unique feature set of the target domain sample set are in the feature space corresponding to the union of the three.
  • another embodiment of the present application further provides a training device for an anti-money laundering model.
  • the device includes:
  • the obtaining unit 31 is configured to obtain a source domain sample set and a target domain sample set, wherein the source domain sample and the target domain sample are transaction samples used for training an anti-money laundering model;
  • the classification unit 32 is configured to classify the features involved in the source domain sample set and the target domain sample set, and determine the common feature set of the source domain sample set and the target domain sample set, the unique feature set of the source domain sample set, and the unique feature set of the target domain sample set;
  • the encoding unit 33 is configured to uniformly encode the features in the source domain sample set and the features in the target domain sample set into the feature space corresponding to the union of the common feature set of the source domain sample set and the target domain sample set, the unique feature set of the source domain sample set, and the unique feature set of the target domain sample set;
  • the merging unit 34 is configured to merge the uniformly encoded source domain sample set and the target domain sample set;
  • the training unit 35 is configured to train an anti-money laundering model based on the combined sample set.
  • An apparatus for training an anti-money laundering model first obtains a source domain sample set and a target domain sample set, classifies the features involved in the source domain sample set and the target domain sample set, and determines the common feature set of the source domain sample set and the target domain sample set, the unique feature set of the source domain sample set, and the unique feature set of the target domain sample set.
  • the features in the source domain sample set and the features in the target domain sample set are uniformly encoded into the feature space corresponding to the union of the common feature set of the source domain sample set and the target domain sample set, the unique feature set of the source domain sample set, and the unique feature set of the target domain sample set.
  • the solution provided by the embodiment of this application completes the anti-money laundering model training task for the target domain sample set by introducing the features of the source domain sample set, so that the anti-money laundering model can learn both the existing knowledge in the source domain sample set and the new knowledge in the target domain sample set; existing knowledge is accumulated while new knowledge is learned, which can improve the anti-money laundering recognition effect of the model.
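The unified encoding and merging performed by the encoding unit and the merging unit might be sketched as follows. Each sample is projected onto one column layout made of the common features, the source-only features, and the target-only features. The function name, the dictionary-based sample layout, and the choice to fill columns owned by the other domain with `0.0` are assumptions of this sketch, not details taken from the application.

```python
def unify_and_merge(source, target, common, source_only, target_only, fill=0.0):
    """Encode both sample sets into the union feature space and merge them."""
    # Column layout of the union feature space: common, then source-only,
    # then target-only features. Sorting only makes the order deterministic.
    columns = (
        [("common", name) for name in sorted(common)]
        + [("source", name) for name in sorted(source_only)]
        + [("target", name) for name in sorted(target_only)]
    )

    def encode(features, domain):
        row = []
        for owner, name in columns:
            # A sample only populates common columns and its own domain's
            # columns; everything else gets the fill value (an assumption).
            visible = owner == "common" or owner == domain
            row.append(features.get(name, fill) if visible else fill)
        return row

    merged = [(encode(s["features"], "source"), s["label"]) for s in source]
    merged += [(encode(s["features"], "target"), s["label"]) for s in target]
    return columns, merged


# Hypothetical feature sets and samples.
common = {"amount"}
source_only = {"freq"}
target_only = {"freq", "channel_risk"}
source = [{"features": {"amount": 5.0, "freq": 0.2}, "label": 1}]
target = [{"features": {"amount": 7.0, "freq": 0.9, "channel_risk": 0.4}, "label": 0}]

columns, merged = unify_and_merge(source, target, common, source_only, target_only)
```

Note that a feature name such as `freq` can appear in both unique sets; tagging each column with its owning domain keeps the two versions apart in the merged space.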
  • the classification unit 32 includes:
  • a determination module 321, configured to determine the stability index of each continuous feature involved in the source domain sample set and the target domain sample set;
  • the first classification module 322 is configured to classify each of the continuous features based on the size of the stability index of each of the continuous features.
  • the determining module 321 is configured to determine the stability index of each continuous feature involved in the source domain sample set and the target domain sample set through the following formula;
  • PSI(Y_e, Y; B)_j represents the stability index of the jth continuous feature among the continuous features involved in the source domain sample set and the target domain sample set; Y_e represents the expected distribution, and the expected distribution is the full data of the target domain sample set; Y represents the actual distribution, and the actual distribution is the full data of the source domain sample set; B represents the preset number of buckets; y_ij represents the proportion of the jth continuous feature in the ith bucket of the source domain sample set; y_eij represents the proportion of the jth continuous feature in the ith bucket of the target domain sample set.
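The formula itself is not reproduced in this text. Using the symbols defined above, the conventional population stability index would read as follows; this is the standard PSI definition and is assumed here rather than quoted from the application:

```latex
\mathrm{PSI}(Y_e, Y; B)_j \;=\; \sum_{i=1}^{B} \left( y_{ij} - y_{eij} \right) \ln\!\left( \frac{y_{ij}}{y_{eij}} \right)
```

Every term is non-negative, because the difference and the logarithm always share the same sign, so a value near zero indicates that the jth feature is distributed similarly in the source and target domains.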
  • the first classification module 322 is configured to classify the continuous features whose stability index is less than the first threshold into the common feature set of the source domain sample set and the target domain sample set; classify the continuous features involved in the source domain sample set whose stability index is not less than the first threshold into the unique feature set of the source domain sample set; and classify the continuous features involved in the target domain sample set whose stability index is not less than the first threshold into the unique feature set of the target domain sample set.
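The threshold rule applied by the first classification module can be sketched as below. The PSI computation uses the conventional definition with equal-width buckets, which is an assumption since the application's formula is not reproduced here. The threshold of `0.1`, the smoothing constant `eps`, the function names, and the treatment of a feature present in only one domain as unique to that domain are likewise assumptions of this example.

```python
import math


def bucket_shares(values, edges):
    """Share of (non-empty) values falling into each bucket given ascending edges."""
    counts = [0] * (len(edges) - 1)
    last = len(counts) - 1
    for v in values:
        for i in range(len(counts)):
            # The top edge is inclusive for the last bucket only.
            if edges[i] <= v < edges[i + 1] or (i == last and v == edges[-1]):
                counts[i] += 1
                break
    return [c / len(values) for c in counts]


def psi(source_values, target_values, n_buckets=10, eps=1e-6):
    """Conventional PSI over equal-width buckets; eps avoids log(0)."""
    lo = min(min(source_values), min(target_values))
    hi = max(max(source_values), max(target_values))
    if hi == lo:
        return 0.0  # a constant feature is identically distributed in both domains
    width = (hi - lo) / n_buckets
    edges = [lo + k * width for k in range(n_buckets)] + [hi]
    y = bucket_shares(source_values, edges)      # actual: source domain
    y_e = bucket_shares(target_values, edges)    # expected: target domain
    return sum((a - b) * math.log((a + eps) / (b + eps)) for a, b in zip(y, y_e))


def classify_continuous(source_cols, target_cols, threshold=0.1, n_buckets=10):
    """Split continuous features into common, source-unique, and target-unique sets."""
    common, source_only, target_only = set(), set(), set()
    for name in set(source_cols) | set(target_cols):
        in_s, in_t = name in source_cols, name in target_cols
        stable = in_s and in_t and psi(source_cols[name], target_cols[name], n_buckets) < threshold
        if stable:
            common.add(name)
        else:
            if in_s:
                source_only.add(name)
            if in_t:
                target_only.add(name)
    return common, source_only, target_only


# Hypothetical columns: "amount" is stable, "freq" shifts between domains.
source_cols = {"amount": [1.0, 2.0, 3.0, 4.0], "freq": [0.0, 0.1, 0.2, 0.3]}
target_cols = {"amount": [1.0, 2.0, 3.0, 4.0], "freq": [0.7, 0.8, 0.9, 1.0]}
common, source_only, target_only = classify_continuous(source_cols, target_cols)
```

An unstable feature lands in both unique sets, so each domain keeps its own version of that feature in the later unified encoding.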
  • the classification unit 32 includes:
  • the second classification module 323 is configured to classify the discrete features involved in the source domain sample set into the unique feature set of the source domain sample set, and to classify the discrete features involved in the target domain sample set into the unique feature set of the target domain sample set.
  • the device further includes:
  • the judging unit 36 is configured to, before the classification unit 32 classifies the features involved in the source domain sample set and the target domain sample set, determine whether discrete features of a preset category exist among the features involved in the source domain sample set, and, if so, trigger the conversion unit 37;
  • the converting unit 37 is configured to convert the discrete features of the preset category into continuous features under the triggering of the judging unit 36 .
  • the conversion unit 37 is configured to count the sample situation associated with the discrete features of each preset category in the source domain sample set, and to determine the sample situation of the discrete features of each preset category as the continuous features corresponding to those discrete features.
  • the sample situation includes at least one of the following: the number of positive transaction samples, the number of negative transaction samples, the proportion of negative transaction samples in the source domain sample set, the proportion of positive transaction samples in the source domain sample set, and the proportion of the transaction count of any individual in the total transaction count of that individual in the source domain sample set; wherein, in the source domain sample set, samples whose transaction type is legal behavior are positive transaction samples, and samples whose transaction type is suspicious behavior are negative transaction samples.
  • another embodiment of the present application further provides a computer-readable storage medium, the storage medium including a stored program, wherein when the program runs, the device where the storage medium is located is controlled to execute the training method of the anti-money laundering model described in FIG. 1 or FIG. 2.
  • another embodiment of the present application further provides a storage management device, wherein the storage management device includes:
  • memory configured to store programs
  • a processor coupled to the memory, is configured to run the program to perform the training method of the anti-money laundering model described in FIG. 1 or FIG. 2 .
  • a training method for an anti-money laundering model comprising:
  • the features in the source domain sample set and the features in the target domain sample set are uniformly encoded into the feature space corresponding to the union of the common feature set of the source domain sample set and the target domain sample set, the unique feature set of the source domain sample set, and the unique feature set of the target domain sample set;
  • A2 According to the method of A1, classify the features involved in the source domain sample set and the target domain sample set, including:
  • Each of the continuous features is classified based on the magnitude of the stability index of each of the continuous features.
  • A3 determine the stability index of each continuous feature involved in the source domain sample set and the target domain sample set, including:
  • PSI(Y_e, Y; B)_j represents the stability index of the jth continuous feature among the continuous features involved in the source domain sample set and the target domain sample set; Y_e represents the expected distribution, and the expected distribution is the full data of the target domain sample set; Y represents the actual distribution, and the actual distribution is the full data of the source domain sample set; B represents the preset number of buckets; y_ij represents the proportion of the jth continuous feature in the ith bucket of the source domain sample set; y_eij represents the proportion of the jth continuous feature in the ith bucket of the target domain sample set.
  • A4 classify each of the continuous features based on the size of the stability index of each of the continuous features, including:
  • A5. According to the method of A1 or 2, classify the features involved in the source domain sample set and the target domain sample set, including:
  • the discrete features involved in the target domain sample set are classified into a unique feature set of the target domain sample set.
  • the sample situation of the discrete features of each preset category is determined as the continuous features corresponding to the discrete features of each preset category.
  • the sample situation includes at least one of the following: the number of positive transaction samples, the number of negative transaction samples, the proportion of negative transaction samples in the source domain sample set, the proportion of positive transaction samples in the source domain sample set, and the proportion of the transaction count of any individual in the total transaction count of that individual in the source domain sample set; wherein, in the source domain sample set, samples whose transaction type is legal behavior are positive transaction samples, and samples whose transaction type is suspicious behavior are negative transaction samples.
  • An anti-money laundering model training device comprising:
  • an obtaining unit configured to obtain a source domain sample set and a target domain sample set, wherein the source domain sample and the target domain sample are transaction samples used for training an anti-money laundering model;
  • a classification unit configured to classify the features involved in the source domain sample set and the target domain sample set, and determine the common feature set of the source domain sample set and the target domain sample set, the unique feature set of the source domain sample set, and the unique feature set of the target domain sample set;
  • an encoding unit configured to uniformly encode the features in the source domain sample set and the features in the target domain sample set into the feature space corresponding to the union of the common feature set of the source domain sample set and the target domain sample set, the unique feature set of the source domain sample set, and the unique feature set of the target domain sample set;
  • a merging unit configured to merge the uniformly encoded source domain sample set and the target domain sample set
  • a training unit configured to train an anti-money laundering model based on the combined sample set.
  • the classification unit comprises:
  • a determination module configured to determine the stability index of each continuous feature involved in the source domain sample set and the target domain sample set
  • the first classification module is configured to classify each of the continuous features based on the size of the stability index of each of the continuous features.
  • PSI(Y_e, Y; B)_j represents the stability index of the jth continuous feature among the continuous features involved in the source domain sample set and the target domain sample set; Y_e represents the expected distribution, and the expected distribution is the full data of the target domain sample set; Y represents the actual distribution, and the actual distribution is the full data of the source domain sample set; B represents the preset number of buckets; y_ij represents the proportion of the jth continuous feature in the ith bucket of the source domain sample set; y_eij represents the proportion of the jth continuous feature in the ith bucket of the target domain sample set.
  • the first classification module is configured to classify the continuous features whose stability index is less than a first threshold into the common feature set of the source domain sample set and the target domain sample set; classify the continuous features involved in the source domain sample set whose stability index is not less than the first threshold into the unique feature set of the source domain sample set; and classify the continuous features involved in the target domain sample set whose stability index is not less than the first threshold into the unique feature set of the target domain sample set.
  • the second classification module is configured to classify the discrete features involved in the source domain sample set into the unique feature set of the source domain sample set, and to classify the discrete features involved in the target domain sample set into the unique feature set of the target domain sample set.
  • a judgment unit configured to judge, before the classification unit classifies the features involved in the source domain sample set and the target domain sample set, whether discrete features of a preset category exist among the features involved in the source domain sample set, and, if so, to trigger the conversion unit;
  • the converting unit is configured to convert the discrete features of the preset category into continuous features under the triggering of the judging unit.
  • the sample situation includes at least one of the following: the number of positive transaction samples, the number of negative transaction samples, the proportion of negative transaction samples in the source domain sample set, the proportion of positive transaction samples in the source domain sample set, and the proportion of the transaction count of any individual in the total transaction count of that individual in the source domain sample set; wherein, in the source domain sample set, samples whose transaction type is legal behavior are positive transaction samples, and samples whose transaction type is suspicious behavior are negative transaction samples.
  • a computer-readable storage medium comprising a stored program, wherein, when the program is run, a device where the storage medium is located is controlled to perform the training method of the anti-money laundering model described in any one of A1 to A8.
  • a storage management device comprising:
  • memory configured to store programs
  • a processor coupled to the memory, is configured to run the program to perform the training method of the anti-money laundering model of any one of A1 to A8.
  • modules in the device in the embodiment can be adaptively changed and arranged in one or more devices different from the embodiment.
  • the modules or units or components in the embodiments may be combined into one module or unit or component, and they may further be divided into multiple sub-modules or sub-units or sub-components. All features disclosed in this specification (including the accompanying claims, abstract and drawings), and all processes or units of any method or device so disclosed, may be combined in any combination, except where at least some of such features and/or processes or units are mutually exclusive.
  • Each feature disclosed in this specification may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
  • Various component embodiments of the present application may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof.
  • in practice, a microprocessor or a digital signal processor (DSP) may be used to implement some or all of the functions of some or all of the components in the method, apparatus, and framework for running the deep neural network model according to the embodiments of the present application.
  • the present application can also be implemented as a device or apparatus program (e.g., a computer program or a computer program product) for performing part or all of the methods described herein.
  • Such a program implementing the present application may be stored on a computer-readable medium, or may be in the form of one or more signals. Such signals may be downloaded from Internet sites, or provided on carrier signals, or in any other form.
  • the solution provided in this application completes the training task of the anti-money laundering model of the target domain sample set by introducing the features of the source domain sample set, so that the anti-money laundering model can not only learn the existing knowledge in the source domain sample set, but also learn the target domain sample set New knowledge, that is, the anti-money laundering model can learn both existing knowledge and new knowledge at the same time, realizing the accumulation and precipitation of existing knowledge and realizing the learning of new knowledge, which can improve the effect of anti-money laundering identification of the anti-money laundering model.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Physics & Mathematics (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Marketing (AREA)
  • Artificial Intelligence (AREA)
  • General Business, Economics & Management (AREA)
  • Strategic Management (AREA)
  • Economics (AREA)
  • Development Economics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Technology Law (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Financial Or Insurance-Related Operations Such As Payment And Settlement (AREA)

Abstract

Provided are a method and apparatus for training an anti-money laundering model, comprising: acquiring a source domain sample set and a target domain sample set, both source domain samples and target domain samples being transaction samples used to train an anti-money laundering model; classifying features involved in the source domain sample set and the target domain sample set, and determining a common feature set of the source domain sample set and the target domain sample set, a unique feature set of the source domain sample set, and a unique feature set of the target domain sample set; uniformly encoding the features in the source domain sample set and the features in the target domain sample set into a feature space corresponding to the union of the common feature set of the source domain sample set and the target domain sample set, the unique feature set of the source domain sample set, and the unique feature set of the target domain sample set; merging the source domain sample set and the target domain sample set that have been uniformly encoded; and training the anti-money laundering model on the basis of the merged sample set.

Description

Training method and device for an anti-money laundering model
Cross-reference to related applications
This application claims priority to Chinese patent application No. 202011625865.9, titled "A training method and device for an anti-money laundering model", filed by Fourth Paradigm (Beijing) Technology Co., Ltd. on December 30, 2020, the disclosure of which is incorporated herein by reference.
Technical field
The present application relates to the field of computer technology, and in particular to a training method and device for an anti-money laundering model.
Background
With the development of Internet technology, transactions in the financial field increasingly rely on the Internet, and the accompanying money laundering has gradually penetrated the Internet as well. Money laundering is the process of concealing, disguising, or investing illegally obtained income through legitimate activities or projects. To maintain social justice and combat economic crimes such as corruption, money laundering must be monitored on the Internet. This monitoring is mainly accomplished by analyzing Internet data with anti-money laundering models.
Traditional anti-money laundering methods usually use an anti-money laundering model to identify money laundering behavior. Such a model must be trained on a large number of samples with known labels. The sample labels are derived mainly from a rule system, which is set by professionals with a high level of business expertise, and the quality of the sample labels may vary. Therefore, training an anti-money laundering model that identifies money laundering well requires a long-term investment of substantial human resources in label review. Label review, however, carries operational risk, and the reviewers' experience may cease to apply.
Summary of the invention
In view of this, the present application proposes a training method and device for an anti-money laundering model. The main purpose is to complete the anti-money laundering model training task for the target domain sample set by introducing the features of the source domain sample set, so as to improve the effect of anti-money laundering identification. The main technical solutions include:
In a first aspect, the present application provides a training method for an anti-money laundering model, the method including: obtaining a source domain sample set and a target domain sample set, wherein the source domain samples and the target domain samples are both transaction samples used for training the anti-money laundering model; classifying the features involved in the source domain sample set and the target domain sample set, and determining the common feature set of the source domain sample set and the target domain sample set, the unique feature set of the source domain sample set, and the unique feature set of the target domain sample set; uniformly encoding the features in the source domain sample set and the features in the target domain sample set into the feature space corresponding to the union of the common feature set of the source domain sample set and the target domain sample set, the unique feature set of the source domain sample set, and the unique feature set of the target domain sample set; merging the uniformly encoded source domain sample set and target domain sample set; and training the anti-money laundering model based on the merged sample set.
In a second aspect, the present application provides a training device for an anti-money laundering model, the device including: an obtaining unit configured to obtain a source domain sample set and a target domain sample set, wherein the source domain samples and the target domain samples are both transaction samples used for training the anti-money laundering model; a classification unit configured to classify the features involved in the source domain sample set and the target domain sample set, and determine the common feature set of the source domain sample set and the target domain sample set, the unique feature set of the source domain sample set, and the unique feature set of the target domain sample set; an encoding unit configured to uniformly encode the features in the source domain sample set and the features in the target domain sample set into the feature space corresponding to the union of the common feature set of the source domain sample set and the target domain sample set, the unique feature set of the source domain sample set, and the unique feature set of the target domain sample set; a merging unit configured to merge the uniformly encoded source domain sample set and target domain sample set; and a training unit configured to train the anti-money laundering model based on the merged sample set.
In a third aspect, the present application provides a computer-readable storage medium, the storage medium including a stored program, wherein when the program runs, the device where the storage medium is located is controlled to perform the training of the anti-money laundering model described in the first aspect.
In a fourth aspect, the present application provides a storage management device, the storage management device including: a memory configured to store a program; and a processor, coupled to the memory, configured to run the program to perform the training of the anti-money laundering model described in the first aspect.
With the above technical solutions, the training method and device for an anti-money laundering model provided by the present application first obtain a source domain sample set and a target domain sample set, classify the features involved in the source domain sample set and the target domain sample set, and determine the common feature set of the source domain sample set and the target domain sample set, the unique feature set of the source domain sample set, and the unique feature set of the target domain sample set. The features in the source domain sample set and the features in the target domain sample set are uniformly encoded into the feature space corresponding to the union of these three feature sets. The uniformly encoded source domain sample set and target domain sample set are then merged, and the anti-money laundering model is trained on the merged sample set. The solution thus completes the anti-money laundering model training task for the target domain sample set by introducing the features of the source domain sample set, so that the model can learn both the existing knowledge in the source domain sample set and the new knowledge in the target domain sample set. Existing knowledge is accumulated while new knowledge is learned, which can improve the anti-money laundering recognition effect of the model.
The above description is only an overview of the technical solutions of the present application. So that the technical means of the present application can be understood more clearly and implemented in accordance with the content of the specification, and so that the above and other purposes, features, and advantages of the present application are more readily apparent, specific embodiments of the present application are set out below.
Description of drawings
To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the accompanying drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present application; those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 shows a flowchart of a training method for an anti-money laundering model provided by an embodiment of the present application;
FIG. 2 shows a flowchart of a training method for an anti-money laundering model provided by another embodiment of the present application;
FIG. 3 shows a schematic structural diagram of a training device for an anti-money laundering model provided by an embodiment of the present application;
FIG. 4 shows a schematic structural diagram of a training device for an anti-money laundering model provided by another embodiment of the present application.
Detailed description
下面将参照附图更加详细地描述本公开的示例性实施例。虽然附图中显示了本公开的示例性实施例,然而应当理解,可以以各种形式实现本公开而不应被这里阐述的实施例所限制。相反,提供这些实施例是为了能够更透彻地理解本公开,并且能够将本公开的范围完整的传达给本领域的技术人员。Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided so that the present disclosure will be more thoroughly understood, and will fully convey the scope of the present disclosure to those skilled in the art.
Money laundering is often hidden in financial transactions, so the data generated by transaction behavior in the financial field contains a large number of features related to money laundering, and these features can serve as the basis for training an anti-money laundering model. A small or newly established financial institution may not have enough data to train an anti-money laundering model that performs well. It is therefore necessary to train the model with both a source domain sample set that carries existing knowledge and a target domain sample set that carries new knowledge to be learned. A model trained in this way learns the existing knowledge in the source domain sample set as well as the new knowledge in the target domain sample set. It thus accumulates existing knowledge while learning new knowledge, which improves its ability to identify money laundering.
As shown in FIG. 1, an embodiment of the present application provides a training method for an anti-money laundering model. The method mainly includes:
101. Obtain a source domain sample set and a target domain sample set, where the source domain samples and the target domain samples are all transaction samples used for training the anti-money laundering model.
Money laundering is often hidden in financial transactions, so the data generated by transaction behavior in the financial field contains a large number of features related to money laundering, and these features can serve as the basis for training an anti-money laundering model. The source domain sample set and the target domain sample set that are obtained are therefore both data sets from the financial field. The source domain samples in the source domain sample set and the target domain samples in the target domain sample set are all transaction samples for training the anti-money laundering model. Each transaction sample has its own binary classification label, which indicates whether the transaction sample is money laundering or legitimate behavior.
The transaction samples in the source domain sample set and those in the target domain sample set are determined in essentially the same way. The only difference is that the source domain sample set involves existing knowledge, whereas the target domain sample set involves new knowledge that needs to be learned. The process of determining transaction samples is described below and includes the following Step 1 and Step 2:
Step 1: Determine the transaction samples and define their labels.
Customers generate a large number of transaction records during financial transactions, and these records are the basis for determining transaction samples. To determine transaction samples, a time granularity is defined first. The transaction records that a customer generates within that time granularity are then taken as candidate samples, and transaction samples are selected from the candidate samples. Because transaction samples are used to train the anti-money laundering model, each one must be clearly identifiable as either money laundering or legitimate behavior. Candidate samples that can be clearly identified as one or the other are therefore selected as transaction samples, while candidate samples that cannot be clearly identified are excluded.
For example, the time granularity is one day. At customer-day granularity, the dates on which each customer transacted are filtered out of the financial institution's transaction records to form customer-day transaction records, and the transaction records that one customer generates in one day are taken as one candidate sample. Transaction samples are then screened from the candidate samples through the following operations. First, determine whether the customer-day transaction records include records for which the money laundering report date differs greatly from the money laundering activity date; if so, the candidate samples corresponding to those records are excluded and not selected as transaction samples. Here the money laundering report date is the date of the manual report, and the money laundering activity date is the date reported by the financial institution's money laundering rule system; a large difference between the two means that it cannot be accurately determined whether the behavior is money laundering. Second, when the anti-money laundering rule system of a financial institution such as a bank triggers a money laundering report for a customer, the candidate samples corresponding to that customer's transaction records on the reporting date and within the N days before the reporting date (N is greater than or equal to 1; for example, N = 30) are each selected as transaction samples, regarded as suspicious behavior, and given the suspicious label, label = 1. Third, after the two operations above, all remaining candidate samples are selected as transaction samples, regarded as legitimate behavior, and given the legitimate label, label = 0.
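The three screening operations can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation: the function name `label_samples`, the alert dictionary keys, and the `max_gap_days` cutoff for a "large" difference between the two dates are all hypothetical, and the sketch assumes that an ambiguous alert excludes the same customer-day window that a clear alert would label as suspicious.

```python
from datetime import date, timedelta


def label_samples(candidates, alerts, n_days=30, max_gap_days=7):
    """Label customer-day candidate samples (hypothetical helper).

    candidates: iterable of (customer_id, day) keys, one per candidate sample.
    alerts: list of dicts with keys "customer_id", "report_date" (manual
        report) and "activity_date" (date reported by the rule system).
    Returns {(customer_id, day): label}; excluded samples are omitted.
    """
    excluded, suspicious = set(), set()
    for alert in alerts:
        gap = abs((alert["report_date"] - alert["activity_date"]).days)
        # The rule-system date plus the n_days before it.
        window = {
            (alert["customer_id"], alert["activity_date"] - timedelta(days=k))
            for k in range(n_days + 1)
        }
        if gap > max_gap_days:
            # Operation 1 (assumed scope): dates disagree too much, so drop.
            excluded |= window
        else:
            # Operation 2: reporting date and the N days before it.
            suspicious |= window

    labels = {}
    for key in candidates:
        if key in excluded:
            continue
        # Operation 3: everything left over is treated as legitimate.
        labels[key] = 1 if key in suspicious else 0
    return labels
```

Exclusion takes precedence over the suspicious label in this sketch, so a sample is never labeled from evidence that was judged ambiguous.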
It should be noted that the transaction samples in the source domain sample set and the target domain sample set come from different sources: the source domain sample set involves existing knowledge, while the target domain sample set involves new knowledge that needs to be learned. For example, the source domain sample set comes from the transaction records that financial institution A generated in January, whose features have already become existing knowledge, and the target domain sample set comes from the transaction records that financial institution A generated in February, which include new knowledge that needs to be learned. To make it easier to accumulate and pass on knowledge, both the source domain sample set and the target domain sample set are obtained, so that through transfer learning the anti-money laundering model can learn the existing knowledge involved in the source domain sample set as well as the new knowledge involved in the target domain sample set.
Step 2: Perform feature splicing on the transaction samples.
The features of a transaction sample mainly include user features and user behavior features. User features describe information about the user, such as age, gender, deposit balance, and number of family members. User behavior features describe information about the user's transaction behavior, such as the amount the user transfers late at night, the number of the user's ATM withdrawals, and the number of the user's over-the-counter deposits within a week.
Feature splicing enriches the features of the transaction samples so that the anti-money laundering model can learn more useful anti-money laundering information. In practice, feature splicing derives new features from the existing features of a transaction sample. For example, if a transaction sample contains the number of the user's over-the-counter deposits within a week and the amount of each of those deposits, the feature "total amount of the user's over-the-counter deposits within a week" can be derived.
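The derivation in the example above can be written as a one-step aggregation. The following is a small sketch only; the function name and the output keys are hypothetical and are not taken from the application.

```python
def derive_weekly_deposit_features(deposit_amounts):
    """Derive weekly counter-deposit features from per-deposit amounts.

    deposit_amounts: amounts of one user's over-the-counter deposits
    within one week (hypothetical input layout).
    """
    return {
        # Existing feature: number of deposits in the week.
        "weekly_counter_deposit_count": len(deposit_amounts),
        # Derived feature: total amount deposited in the week.
        "weekly_counter_deposit_total": sum(deposit_amounts),
    }
```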
For example, Table 1 shows transaction samples after feature splicing.
Table 1

| Customer ID | Transaction date | F1 (number of ATM withdrawals) | F2 (deposit amount, in units of 10,000 yuan) | F3 (branch number) | Transaction behavior |
| 123 | 2020.1.2 | 10000 | 0 | 203 | Suspicious |
| 124 | 2020.1.2 | 20000 | 20000000 | 304 | Legitimate |
| 125 | 2020.1.3 | 3000 | 33999 | 335 | Legitimate |
| 123 | 2020.1.3 | 30 | 44888 | 445 | Legitimate |
| 126 | 2020.1.3 | 100000 | 90189 | 515 | Legitimate |
| 122 | 2020.1.4 | 20000 | 1000000 | 895 | Legitimate |
| 128 | 2020.1.4 | 3000 | 55888 | 233 | Legitimate |
| 124 | 2020.1.4 | 43 | 32 | 452 | Suspicious |
After the source domain sample set and the target domain sample set are obtained, they can be stored in a database in multiple copies, organized either as daily partitions (slice tables) or as full tables (zipper tables).
102. Classify the features involved in the source domain sample set and the target domain sample set, and determine the common feature set of the source domain sample set and the target domain sample set, the unique feature set of the source domain sample set, and the unique feature set of the target domain sample set.
Classifying the features involved in the source domain sample set and the target domain sample set serves two main purposes. The first is to check whether the source domain sample set and the target domain sample set share some parameters of the anti-money laundering model, where the parameters include model parameters or model hyperparameters. If the check finds that the two sample sets have shared parameters, transfer learning can be used to train the anti-money laundering model with the source domain sample set and the target domain sample set. If the check finds that the two sample sets have no shared parameters, transfer learning cannot be used to train the model with these two sample sets, and business personnel are notified to select a different source domain sample set and target domain sample set. It should be noted that checking whether the two sample sets share some parameters of the anti-money laundering model is in essence determining whether the two sample sets have a common feature set. The second purpose is, once the two sample sets are found to share parameters, to identify the common parameters and the unique parameters of the source domain sample set and the target domain sample set on their respective money laundering tasks.
The process of classifying the features involved in the source domain sample set and the target domain sample set is described below and includes the following Step 1 and Step 2:
Step 1: Determine the stability index of each continuous feature involved in the source domain sample set and the target domain sample set.
Specifically, the stability index of each continuous feature involved in the two sample sets can be determined with the formula below. For any continuous feature, its stability index reflects the difference between its distribution in the source domain sample set and its distribution in the target domain sample set. This difference determines whether the continuous feature is common to the two sample sets or unique to one sample set.
The formula for determining the stability index of each continuous feature involved in the source domain sample set and the target domain sample set is:
Figure PCTCN2021140997-appb-000001
Here, PSI(Y_e, Y; B)_j denotes the stability index of the j-th continuous feature among the continuous features involved in the source domain sample set and the target domain sample set; Y_e denotes the expected distribution, which is the full data of the target domain sample set; Y denotes the actual distribution, which is the full data of the source domain sample set; B denotes the preset number of buckets; y_ij denotes the proportion of the j-th continuous feature that falls in the i-th bucket of the source domain sample set; and y_eij denotes the proportion of the j-th continuous feature that falls in the i-th bucket of the target domain sample set.
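The formula itself is an image that is not reproduced in this text. For orientation only, the conventional population stability index, written with the symbols defined above, has the following form. That the application uses exactly this form is an assumption, although the symbol definitions are consistent with it.

```latex
\mathrm{PSI}(Y_e, Y; B)_j \;=\; \sum_{i=1}^{B} \left( y_{ij} - y_{eij} \right) \ln\!\left( \frac{y_{ij}}{y_{eij}} \right)
```

Each term is non-negative, since the difference and the logarithm always share the same sign, so under this form the index is zero only when the two bucketed distributions coincide.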
Specifically, the number of buckets can be determined according to business requirements and is not specifically limited in this embodiment. Note that too many buckets may leave too few samples in each bucket, so that the result loses statistical significance, while too few buckets lower the precision of the result. The total number of samples in the source domain sample set and the target domain sample set should therefore be taken into account when choosing the number of buckets. Bucketing can be done with equal-quantity buckets. For example, the number of buckets is 15.
Specifically, the smaller the stability index of a continuous feature, the smaller the difference in that feature between the two sample sets, and the feature is common to the two sample sets. The larger the stability index, the larger the difference in that feature between the two sample sets, and the feature is unique to the sample set it belongs to.
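A stability index of this kind can be computed as sketched below. The sketch assumes the conventional PSI form, reads "equal-quantity buckets" as equal-frequency buckets whose cut points come from the expected (target domain) distribution, and floors empty-bucket proportions at a small `eps` so the logarithm stays defined. All three choices, and the function name `psi`, are assumptions rather than details given in the application.

```python
import math
from bisect import bisect_right


def psi(source_values, target_values, n_buckets=15, eps=1e-6):
    """Stability index of one continuous feature (conventional PSI form).

    source_values: the feature's values in the source domain (actual, Y).
    target_values: the feature's values in the target domain (expected, Y_e).
    """
    # Equal-frequency cut points taken from the expected distribution.
    ordered = sorted(target_values)
    cuts = [ordered[(len(ordered) * k) // n_buckets] for k in range(1, n_buckets)]

    def proportions(values):
        counts = [0] * n_buckets
        for v in values:
            counts[bisect_right(cuts, v)] += 1
        # Floor at eps so that empty buckets do not produce log(0).
        return [max(c / len(values), eps) for c in counts]

    y = proportions(source_values)    # y_ij
    y_e = proportions(target_values)  # y_eij
    return sum((a - e) * math.log(a / e) for a, e in zip(y, y_e))
```

With this sketch, a feature whose values are distributed identically in both sample sets scores zero, and the score grows as the two distributions drift apart.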
Step 2: Classify each continuous feature based on the magnitude of its stability index.
For any continuous feature, its stability index reflects the difference between its distribution in the source domain sample set and its distribution in the target domain sample set, so continuous features can be classified based on the magnitude of their stability indexes.
The process of classifying each continuous feature based on the magnitude of its stability index is described below and includes the following three steps:
First, classify the continuous features whose stability index is less than a first threshold into the common feature set of the source domain sample set and the target domain sample set.
A continuous feature whose stability index is less than the first threshold differs little between the two sample sets and is common to both, so such continuous features are classified into the common feature set of the source domain sample set and the target domain sample set.
Specifically, the first threshold can be determined according to business requirements and is not specifically limited in this embodiment. Optionally, the first threshold is 0.25; that is, all continuous features whose stability index is less than 0.25 are classified into the common feature set.
For example, Table 2 shows which continuous features are determined, after the stability index is calculated, to be common to the two sample sets and which are not. For the non-common features, it must further be determined which sample set each is unique to.
Table 2

| Feature | PSI value | Feature classification |
| F1 (number of ATM withdrawals) | 0.23 | Common |
| F2 (deposit amount, in units of 10,000 yuan) | 0.25 | Non-common |
| F3 (number of night transactions) | 0.001 | Common |
| F4 (amount received at night) | 0.004 | Common |
| F5 (1-day debit transaction amount) | 0.3 | Non-common |
| F6 (3-day total transaction amount) | 0.123 | Common |
| F7 (3-day total number of transactions) | 0.03 | Common |
| F8 (10-day debit-to-credit amount ratio) | 0.02 | Common |
Second, classify the continuous features involved in the source domain sample set whose stability index is not less than the first threshold into the unique feature set of the source domain sample set.
A continuous feature involved in the source domain sample set whose stability index is not less than the first threshold differs greatly between the two sample sets and is unique to the source domain sample set, so it is classified into the unique feature set of the source domain sample set.
For example, with a first threshold of 0.25, all continuous features involved in the source domain sample set whose stability index is greater than or equal to 0.25 are classified into the unique feature set of the source domain sample set.
Third, classify the continuous features involved in the target domain sample set whose stability index is not less than the first threshold into the unique feature set of the target domain sample set.
A continuous feature involved in the target domain sample set whose stability index is not less than the first threshold differs greatly between the two sample sets and is unique to the target domain sample set, so it is classified into the unique feature set of the target domain sample set.
For example, with a first threshold of 0.25, all continuous features involved in the target domain sample set whose stability index is greater than or equal to 0.25 are classified into the unique feature set of the target domain sample set.
Further, the features involved in the source domain sample set and the target domain sample set include discrete features as well as continuous features. In addition to the classification process shown in Step 1 and Step 2 above, classifying the features involved in the two sample sets therefore also includes the following: classify the discrete features involved in the source domain sample set into the unique feature set of the source domain sample set, and classify the discrete features involved in the target domain sample set into the unique feature set of the target domain sample set. The discrete features involved in the two sample sets are mostly user features, which are vertically isolated features unique to their own sample set, so the discrete features involved in each sample set are classified directly into that sample set's unique feature set.
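The classification rules for continuous and discrete features can be combined as in the following sketch. The function name, the argument layout, and the treatment of a continuous feature that has no stability index (for instance because it appears in only one sample set) are assumptions made for illustration.

```python
def classify_features(psi_by_feature, source_features, target_features,
                      discrete_features, threshold=0.25):
    """Split features into common, source-unique and target-unique sets.

    psi_by_feature: {feature name: stability index} for continuous features.
    source_features / target_features: feature names involved in each set.
    discrete_features: names of the discrete features.
    """
    common, source_unique, target_unique = set(), set(), set()
    for name in source_features | target_features:
        score = psi_by_feature.get(name)
        is_continuous = name not in discrete_features
        if is_continuous and score is not None and score < threshold:
            # Stable continuous feature: common to both sample sets.
            common.add(name)
            continue
        # Discrete features, and continuous features at or above the
        # threshold, go to the unique set of each sample set they occur in.
        if name in source_features:
            source_unique.add(name)
        if name in target_features:
            target_unique.add(name)
    return common, source_unique, target_unique
```

Using the PSI values of F1, F2 and F5 from Table 2, F1 (0.23) lands in the common set, while F2 (0.25, not less than the threshold) and F5 (0.3) land in the unique sets.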
103. Uniformly encode the features in the source domain sample set and the features in the target domain sample set into the feature space corresponding to the union of the common feature set of the source domain sample set and the target domain sample set, the unique feature set of the source domain sample set, and the unique feature set of the target domain sample set.
For the anti-money laundering model to learn the features in both the source domain sample set and the target domain sample set, the features in the two sample sets need to be uniformly encoded into the feature space corresponding to the union of the common feature set, the unique feature set of the source domain sample set, and the unique feature set of the target domain sample set. With this processing, the model can learn the existing knowledge in the source domain sample set as well as the new knowledge in the target domain sample set; that is, it learns existing knowledge and new knowledge at the same time, accumulating the former while acquiring the latter, which improves its ability to identify money laundering.
The data required by the anti-money laundering model is numeric, because only numeric types can be used in computation. Every kind of feature therefore needs to be encoded accordingly, which is also a quantization process. During encoding, a preset encoding mechanism uniformly encodes the features in the source domain sample set and the features in the target domain sample set into the feature space corresponding to the union of the three feature sets. The encoding mechanism can be determined according to business requirements and is not specifically limited in this embodiment. Optionally, the encoding mechanism can be one-hot encoding.
Specifically, when encoding common features: for features common to the source domain sample set and the target domain sample set, such as transaction behavior and customers' demographic attributes, the samples of the two sets can be feature-encoded together. That is, the encoding is done once for the feature space, and after the samples are directly merged they enter the feature extraction operator together.
Specifically, when encoding the unique features of the two sample sets that are discrete features: discrete features are vertically isolated between the two sample sets, for example the branch to which a customer belongs or the number of the ATM used for a transaction. Such a feature is encoded where it has a value and left empty where it has no value.
Specifically, when encoding the unique features of the two sample sets that are continuous features: the unique features of the source domain sample set and of the target domain sample set are assigned separate positions in the feature space. A unique feature of the source domain sample set occupies one position, and a unique feature of the target domain sample set occupies another.
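The placement of features in the union feature space can be sketched as follows. This sketch only assigns positions and leaves missing values empty; it does not perform the numeric or one-hot encoding that would follow. The function name, the `src:` and `tgt:` column prefixes, and the sample layout are hypothetical.

```python
def encode_sample(sample, domain, common, source_unique, target_unique):
    """Place one sample into the union feature space.

    sample: dict of feature name -> value, plus a "label" entry.
    domain: "source" or "target".
    """
    row = {}
    # Common features share one position for both domains.
    for name in sorted(common):
        row[name] = sample.get(name)
    # Unique features get domain-specific positions; a position that
    # belongs to the other domain is left empty (None).
    for name in sorted(source_unique):
        row["src:" + name] = sample.get(name) if domain == "source" else None
    for name in sorted(target_unique):
        row["tgt:" + name] = sample.get(name) if domain == "target" else None
    row["label"] = sample["label"]
    return row
```

Because every encoded row has the same columns, the encoded source domain rows and target domain rows can simply be concatenated into one training table.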
For example, Table 3 shows the data formed after feature encoding.
Table 3
Figure PCTCN2021140997-appb-000002
The feature encoding above yields a feature space that contains the features in the common feature set of the source domain sample set and the target domain sample set, the features in the unique feature set of the source domain sample set, and the features in the unique feature set of the target domain sample set. This feature space provides the data basis for the subsequent training of the anti-money laundering model.
104. Merge the uniformly encoded source domain sample set and target domain sample set.
The data required by the anti-money laundering model is numeric, because only numeric types can be used in computation. Once the various features have been encoded, feature quantization is complete, and merging the uniformly encoded source domain sample set and target domain sample set forms the training data for the anti-money laundering model.
105. Train the anti-money laundering model based on the merged sample set.
The anti-money laundering model identifies money laundering activity in the data generated during financial transactions; it determines whether the data represents money laundering or legitimate behavior, so it is a binary classification model. In practice, the specific type of the anti-money laundering model can be determined according to business requirements and is not specifically limited in this embodiment. Optionally, the anti-money laundering model is GBDT (gradient boosting decision tree) or LR (logistic regression).
How the anti-money laundering model is trained on the merged sample set depends on which samples are fed into the model for training, and includes at least the following approaches:
In the first approach, all samples in the merged sample set are fed into the anti-money laundering model for training.
Because this approach uses all the data in the sample set, the features fed into the model are rich. The model can learn the existing knowledge in the source domain sample set as well as the new knowledge in the target domain sample set; that is, it learns existing knowledge and new knowledge at the same time, accumulating the former while acquiring the latter, which improves its ability to identify money laundering.
In the second approach, a set number of samples is extracted from the merged sample set, and the extracted samples are fed into the anti-money laundering model for training.
Specifically, the features involved in the extracted samples include the common features of the source domain sample set and the target domain sample set, the unique feature set of the source domain sample set, and the unique feature set of the target domain sample set. Because only a set number of samples is extracted, the model can be trained with less computing power, and it can still learn the existing knowledge in the source domain sample set as well as the new knowledge in the target domain sample set; that is, it learns existing knowledge and new knowledge at the same time, accumulating the former while acquiring the latter, which improves its ability to identify money laundering.
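One way to extract a set number of samples while keeping both domains represented is sketched below. The application does not say how the samples are drawn, so the proportional-quota rule, the `domain` key on each row, and the function name are all assumptions.

```python
import random


def draw_training_subset(merged, n_samples, seed=0):
    """Draw up to n_samples rows, with at least one row from each domain.

    merged: list of dict rows, each carrying a "domain" key with the
    value "source" or "target" (hypothetical layout).
    """
    rng = random.Random(seed)
    source = [r for r in merged if r["domain"] == "source"]
    target = [r for r in merged if r["domain"] == "target"]
    if n_samples < 2 or not source or not target:
        raise ValueError("need at least one sample from each domain")
    n_samples = min(n_samples, len(merged))
    # Quota proportional to domain size, clamped so each domain keeps >= 1.
    n_source = round(n_samples * len(source) / len(merged))
    n_source = max(1, min(n_source, n_samples - 1, len(source)))
    n_target = min(n_samples - n_source, len(target))
    return rng.sample(source, n_source) + rng.sample(target, n_target)
```

Drawing from both domains is what keeps common features, source-unique features, and target-unique features all populated in the extracted subset.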
示例性的,如表-4所示,该表-4为从合并后的样本集中选取的用于训练反洗钱模型的样本。Exemplarily, as shown in Table-4, Table-4 is a sample selected from the combined sample set for training an anti-money laundering model.
Table-4
Figure PCTCN2021140997-appb-000003
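The second approach can be sketched in Python as follows. This is only an illustration under stated assumptions: each merged sample is a dictionary carrying a hypothetical `domain` key, and the helper name `sample_for_training` does not come from the application. Guaranteeing at least one sample per domain is one possible way to make sure that source-specific and target-specific features are both represented; the application does not prescribe a sampling scheme.

```python
import random

def sample_for_training(merged, n, seed=0):
    """Draw n distinct samples from the merged set, at least one per domain."""
    rng = random.Random(seed)
    source = [row for row in merged if row["domain"] == "source"]
    target = [row for row in merged if row["domain"] == "target"]
    if not source or not target:
        raise ValueError("both domains must be present in the merged set")
    if not 2 <= n <= len(merged):
        raise ValueError("n must be between 2 and len(merged)")
    # Assumption: reserve one sample from each domain, then fill up at random.
    first, second = rng.choice(source), rng.choice(target)
    rest = [row for row in merged if row is not first and row is not second]
    return [first, second] + rng.sample(rest, n - 2)

# Hypothetical merged set: five source-domain and three target-domain samples.
merged = [{"id": i, "domain": "source"} for i in range(5)]
merged += [{"id": i, "domain": "target"} for i in range(5, 8)]
picked = sample_for_training(merged, 4)
```

A fixed seed keeps the draw reproducible, which is convenient when comparing models trained on different sample sizes.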
In the training method for an anti-money laundering model provided by the embodiments of the present application, a source domain sample set and a target domain sample set are first obtained. The features involved in the two sample sets are classified to determine the common feature set of the source domain sample set and the target domain sample set, the unique feature set of the source domain sample set, and the unique feature set of the target domain sample set. The features in the source domain sample set and the features in the target domain sample set are uniformly encoded into the feature space corresponding to the union of these three feature sets. The uniformly encoded source domain sample set and target domain sample set are merged, and the anti-money laundering model is trained on the merged sample set.
It can be seen that the solution provided by the embodiments of the present application completes the training task for the target domain sample set by introducing the features of the source domain sample set. The anti-money laundering model can therefore learn both the existing knowledge in the source domain sample set and the new knowledge in the target domain sample set. That is, it learns existing knowledge and new knowledge at the same time, which accumulates existing knowledge while acquiring new knowledge and thereby improves the model's recognition of money laundering.
Further, based on the method shown in FIG. 1, another embodiment of the present application provides a training method for an anti-money laundering model. As shown in FIG. 2, the method mainly includes:
201. Obtain a source domain sample set and a target domain sample set, where the source domain samples and the target domain samples are all transaction samples used for training the anti-money laundering model.
202. Determine whether the features involved in the source domain sample set include discrete features of a preset category; if yes, perform step 203; if no, perform step 204.
If the source domain sample set contains discrete features of preset categories, such as static customer attributes, the IP address used for a transaction, the transaction region, or the counterparty account, the efficiency of training the anti-money laundering model is affected. Moreover, in anti-money laundering scenarios the distributions of these discrete features differ greatly between the source domain sample set and the target domain sample set. If these discrete features of the source domain sample set were applied directly to the target domain sample set, they would become ineffective: the trained model could not learn them, and its anti-money laundering performance would be poor. Therefore, so that the anti-money laundering model can learn the discrete features of the preset category in the source domain sample set, it is necessary to determine whether the features involved in the source domain sample set include such features.
If it is determined that the features involved in the source domain sample set include discrete features of the preset category, step 203 is performed to convert them into continuous features, which ensures that the anti-money laundering model can learn them.
If it is determined that the features involved in the source domain sample set do not include discrete features of the preset category, no feature conversion is needed. All features in the source domain sample set can be learned by the anti-money laundering model, and step 204 is performed directly.
203. Convert the discrete features of the preset category into continuous features.
To carry the discrete features of the preset category in the source domain sample set over to the target domain sample set, a discrete-to-continuous transformation is applied to them. The conversion includes the following step one and step two:
Step one: Count the sample conditions in the source domain sample set that are associated with each discrete feature of the preset category.
These statistics serve two main purposes. The first is to identify through which association a suspicious risk is propagated, and to whom it is propagated. The second is to measure how close a given association is, and how much risk is propagated to an individual through it.
The specific process is as follows. For each discrete feature of the preset category, count the feature condition related to that discrete feature within a preset time period, and determine that feature condition as the sample condition associated with the discrete feature.
Specifically, the sample condition includes at least one of the following: the number of positive transaction samples; the number of negative transaction samples; the proportion of negative transaction samples in the source domain sample set; the proportion of positive transaction samples in the source domain sample set; and the ratio of any individual's transaction amount or number of transactions to the total transaction amount or number of transactions of the individuals in the source domain sample set. In the source domain sample set, a sample whose transaction type is lawful behavior is a positive transaction sample, and a sample whose transaction type is suspicious behavior is a negative transaction sample.
For example, over a period of time, count the number of negative samples associated with a discrete feature X in the source domain sample set, or the proportion of negative samples among all samples. The number of negative samples, or that proportion, is determined as the sample condition of the discrete feature X.
Step two: Determine the sample condition of each discrete feature of the preset category as the continuous feature corresponding to that discrete feature.
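Steps one and two can be sketched as below, under several assumptions that are not taken from the application: samples are dictionaries, the hypothetical `label` key is 1 for a suspicious (negative) transaction sample and 0 for a lawful (positive) one, the samples have already been restricted to the preset time period, and the chosen sample condition is the share of negative samples among the samples carrying each value of the discrete feature.

```python
from collections import defaultdict

def negative_ratio_by_value(samples, feature):
    """Step one: per value of a discrete feature, the share of negative samples."""
    totals = defaultdict(int)
    negatives = defaultdict(int)
    for sample in samples:
        value = sample[feature]
        totals[value] += 1
        if sample["label"] == 1:  # assumption: 1 marks a suspicious transaction
            negatives[value] += 1
    return {value: negatives[value] / totals[value] for value in totals}

def to_continuous(samples, feature):
    """Step two: replace each discrete value by its sample condition."""
    ratios = negative_ratio_by_value(samples, feature)
    return [{**sample, feature: ratios[sample[feature]]} for sample in samples]

# Hypothetical source-domain samples keyed by the IP address used.
source_samples = [
    {"ip": "a", "label": 1},
    {"ip": "a", "label": 0},
    {"ip": "b", "label": 0},
]
ratios = negative_ratio_by_value(source_samples, "ip")
converted = to_continuous(source_samples, "ip")
```

Swapping the ratio for a raw negative count, or for a share of transaction amount, only changes what `negative_ratio_by_value` accumulates; the replacement in step two stays the same.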
204. Classify the features involved in the source domain sample set and the target domain sample set to determine the common feature set of the source domain sample set and the target domain sample set, the unique feature set of the source domain sample set, and the unique feature set of the target domain sample set.
205. Uniformly encode the features in the source domain sample set and the features in the target domain sample set into the feature space corresponding to the union of the common feature set, the unique feature set of the source domain sample set, and the unique feature set of the target domain sample set.
206. Merge the uniformly encoded source domain sample set and target domain sample set.
207. Train the anti-money laundering model based on the merged sample set.
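Steps 205 and 206 can be illustrated with the following sketch. It assumes that the three feature sets from step 204 are already known, that feature names are distinct across the three sets, and that a feature absent from a sample is filled with a placeholder value (`fill`). The fill strategy and the helper name `unify_and_merge` are assumptions made for illustration; this excerpt of the application does not specify them.

```python
def unify_and_merge(source, target, common, source_only, target_only, fill=0.0):
    """Encode both sample sets into the union feature space, then merge them."""
    # Feature space = common features + source-unique + target-unique features.
    space = [*common, *source_only, *target_only]
    merged = []
    for rows in (source, target):
        for row in rows:
            # Assumption: features a sample does not have get the fill value.
            encoded = {name: row.get(name, fill) for name in space}
            encoded["label"] = row["label"]
            merged.append(encoded)
    return space, merged

# Hypothetical one-sample domains with one common and one unique feature each.
source = [{"amount": 10.0, "src_ip_risk": 0.5, "label": 1}]
target = [{"amount": 3.0, "tgt_channel": 2.0, "label": 0}]
space, merged = unify_and_merge(
    source, target,
    common=["amount"], source_only=["src_ip_risk"], target_only=["tgt_channel"],
)
```

After this step every sample has the same columns, so the merged list can be passed to any tabular learner for step 207.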
Further, according to the above method embodiments, another embodiment of the present application provides a training apparatus for an anti-money laundering model. As shown in FIG. 3, the apparatus includes:
an obtaining unit 31, configured to obtain a source domain sample set and a target domain sample set, where the source domain samples and the target domain samples are all transaction samples used for training the anti-money laundering model;
a classification unit 32, configured to classify the features involved in the source domain sample set and the target domain sample set to determine the common feature set of the source domain sample set and the target domain sample set, the unique feature set of the source domain sample set, and the unique feature set of the target domain sample set;
an encoding unit 33, configured to uniformly encode the features in the source domain sample set and the features in the target domain sample set into the feature space corresponding to the union of the common feature set, the unique feature set of the source domain sample set, and the unique feature set of the target domain sample set;
a merging unit 34, configured to merge the uniformly encoded source domain sample set and target domain sample set; and
a training unit 35, configured to train the anti-money laundering model based on the merged sample set.
The training apparatus for an anti-money laundering model provided by the embodiments of the present application first obtains a source domain sample set and a target domain sample set. It classifies the features involved in the two sample sets to determine the common feature set of the source domain sample set and the target domain sample set, the unique feature set of the source domain sample set, and the unique feature set of the target domain sample set. The features in the source domain sample set and the features in the target domain sample set are uniformly encoded into the feature space corresponding to the union of these three feature sets. The uniformly encoded source domain sample set and target domain sample set are merged, and the anti-money laundering model is trained on the merged sample set.
It can be seen that the solution provided by the embodiments of the present application completes the training task for the target domain sample set by introducing the features of the source domain sample set. The anti-money laundering model can therefore learn both the existing knowledge in the source domain sample set and the new knowledge in the target domain sample set. That is, it learns existing knowledge and new knowledge at the same time, which accumulates existing knowledge while acquiring new knowledge and thereby improves the model's recognition of money laundering.
Optionally, as shown in FIG. 4, the classification unit 32 includes:
a determination module 321, configured to determine the stability index of each continuous feature involved in the source domain sample set and the target domain sample set; and
a first classification module 322, configured to classify each continuous feature based on the magnitude of its stability index.
Optionally, as shown in FIG. 4, the determination module 321 is configured to determine the stability index of each continuous feature involved in the source domain sample set and the target domain sample set by the following formula.
The formula is:
Figure PCTCN2021140997-appb-000004
where $\mathrm{PSI}(Y_e, Y; B)_j$ denotes the stability index of the $j$-th continuous feature among the continuous features involved in the source domain sample set and the target domain sample set; $Y_e$ denotes the expected distribution, which is the full data of the target domain sample set; $Y$ denotes the actual distribution, which is the full data of the source domain sample set; $B$ denotes the preset number of buckets; $y_{ij}$ denotes the proportion of the $j$-th continuous feature that falls in the $i$-th bucket of the source domain sample set; and $y_{eij}$ denotes the proportion of the $j$-th continuous feature that falls in the $i$-th bucket of the target domain sample set.
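The formula itself survives here only as an image reference. For orientation, the population stability index as it is commonly defined, written with the symbols explained above, takes the following form. This is the standard textbook definition and is offered as an assumption about what the figure shows, not as a transcription of it.

```latex
\mathrm{PSI}(Y_e, Y; B)_j = \sum_{i=1}^{B} \left( y_{ij} - y_{eij} \right) \ln \frac{y_{ij}}{y_{eij}}
```

Each term is non-negative, because the difference and the logarithm always share the same sign. The index is therefore zero when the two bucket distributions coincide and grows as they diverge.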
Optionally, as shown in FIG. 4, the first classification module 322 is configured to: classify continuous features whose stability index is less than a first threshold into the common feature set of the source domain sample set and the target domain sample set; classify continuous features involved in the source domain sample set whose stability index is not less than the first threshold into the unique feature set of the source domain sample set; and classify continuous features involved in the target domain sample set whose stability index is not less than the first threshold into the unique feature set of the target domain sample set.
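The stability-based classification can be sketched as follows. Several details are assumptions rather than content of the application: the PSI is computed with the commonly used definition, bucket edges are supplied per feature, empty buckets are clamped to a small `eps` to avoid division by zero, the first threshold is set to 0.25 purely for illustration, and a feature present in only one domain is placed in that domain's unique set.

```python
import math

def bucket_shares(values, edges):
    """Share of values in each bucket delimited by ascending edges."""
    counts = [0] * (len(edges) + 1)
    for value in values:
        counts[sum(1 for edge in edges if value >= edge)] += 1
    return [count / len(values) for count in counts]

def psi(actual, expected, edges, eps=1e-6):
    """Population stability index between two samples of one feature."""
    total = 0.0
    for a, e in zip(bucket_shares(actual, edges), bucket_shares(expected, edges)):
        a, e = max(a, eps), max(e, eps)  # assumption: clamp empty buckets
        total += (a - e) * math.log(a / e)
    return total

def classify(source_cols, target_cols, edges, threshold=0.25):
    """Split continuous features into common, source-unique and target-unique."""
    common, source_only, target_only = [], [], []
    for name in sorted(set(source_cols) | set(target_cols)):
        shared = name in source_cols and name in target_cols
        if shared and psi(source_cols[name], target_cols[name], edges[name]) < threshold:
            common.append(name)
            continue
        if name in source_cols:
            source_only.append(name)
        if name in target_cols:
            target_only.append(name)
    return common, source_only, target_only

# Hypothetical columns: "amount" is stable across domains, "age" is not.
source_cols = {"amount": [1, 2, 3, 4], "age": [1, 1, 1, 1]}
target_cols = {"amount": [1, 2, 3, 4], "age": [9, 9, 9, 9]}
edges = {"amount": [2.5], "age": [5]}
result = classify(source_cols, target_cols, edges)
```

In this sketch an unstable feature that exists in both domains lands in both unique sets under the same name. Before unified encoding, such a feature would presumably need a distinct column per domain.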
Optionally, as shown in FIG. 4, the classification unit 32 includes:
a second classification module 323, configured to classify the discrete features involved in the source domain sample set into the unique feature set of the source domain sample set, and to classify the discrete features involved in the target domain sample set into the unique feature set of the target domain sample set.
Optionally, as shown in FIG. 4, the apparatus further includes:
a judging unit 36, configured to determine, before the classification unit 32 classifies the features involved in the source domain sample set and the target domain sample set, whether the features involved in the source domain sample set include discrete features of a preset category, and if so, to trigger a conversion unit 37; and
the conversion unit 37, configured to convert the discrete features of the preset category into continuous features when triggered by the judging unit 36.
Optionally, as shown in FIG. 4, the conversion unit 37 is configured to count the sample conditions in the source domain sample set that are associated with each discrete feature of the preset category, and to determine the sample condition of each discrete feature of the preset category as the continuous feature corresponding to that discrete feature.
Optionally, as shown in FIG. 4, the sample condition includes at least one of the following: the number of positive transaction samples; the number of negative transaction samples; the proportion of negative transaction samples in the source domain sample set; the proportion of positive transaction samples in the source domain sample set; and the ratio of any individual's transaction amount or number of transactions to the total transaction amount or number of transactions of the individuals in the source domain sample set. In the source domain sample set, a sample whose transaction type is lawful behavior is a positive transaction sample, and a sample whose transaction type is suspicious behavior is a negative transaction sample.
For details of the methods used by each functional module of the training apparatus provided by the embodiments of the present application, refer to the corresponding descriptions in the method embodiments of FIG. 1 and FIG. 2, which are not repeated here.
Further, according to the above embodiments, another embodiment of the present application provides a computer-readable storage medium, characterized in that the storage medium includes a stored program, and that, when the program runs, the device on which the storage medium resides is controlled to execute the training method for an anti-money laundering model described with reference to FIG. 1 or FIG. 2.
Further, according to the above embodiments, another embodiment of the present application provides a storage management device, characterized in that the storage management device includes:
a memory, configured to store a program; and
a processor, coupled to the memory and configured to run the program to execute the training method for an anti-money laundering model described with reference to FIG. 1 or FIG. 2.
The present application discloses the following:
A1. A training method for an anti-money laundering model, comprising:
obtaining a source domain sample set and a target domain sample set, where the source domain samples and the target domain samples are all transaction samples used for training the anti-money laundering model;
classifying the features involved in the source domain sample set and the target domain sample set to determine the common feature set of the source domain sample set and the target domain sample set, the unique feature set of the source domain sample set, and the unique feature set of the target domain sample set;
uniformly encoding the features in the source domain sample set and the features in the target domain sample set into the feature space corresponding to the union of the common feature set, the unique feature set of the source domain sample set, and the unique feature set of the target domain sample set;
merging the uniformly encoded source domain sample set and target domain sample set; and
training the anti-money laundering model based on the merged sample set.
A2. The method according to A1, wherein classifying the features involved in the source domain sample set and the target domain sample set comprises:
determining the stability index of each continuous feature involved in the source domain sample set and the target domain sample set; and
classifying each continuous feature based on the magnitude of its stability index.
A3. The method according to A2, wherein determining the stability index of each continuous feature involved in the source domain sample set and the target domain sample set comprises:
determining the stability index of each continuous feature involved in the source domain sample set and the target domain sample set by the following formula.
The formula is:
Figure PCTCN2021140997-appb-000005
where $\mathrm{PSI}(Y_e, Y; B)_j$ denotes the stability index of the $j$-th continuous feature among the continuous features involved in the source domain sample set and the target domain sample set; $Y_e$ denotes the expected distribution, which is the full data of the target domain sample set; $Y$ denotes the actual distribution, which is the full data of the source domain sample set; $B$ denotes the preset number of buckets; $y_{ij}$ denotes the proportion of the $j$-th continuous feature that falls in the $i$-th bucket of the source domain sample set; and $y_{eij}$ denotes the proportion of the $j$-th continuous feature that falls in the $i$-th bucket of the target domain sample set.
A4. The method according to A2, wherein classifying each continuous feature based on the magnitude of its stability index comprises:
classifying continuous features whose stability index is less than a first threshold into the common feature set of the source domain sample set and the target domain sample set;
classifying continuous features involved in the source domain sample set whose stability index is not less than the first threshold into the unique feature set of the source domain sample set; and
classifying continuous features involved in the target domain sample set whose stability index is not less than the first threshold into the unique feature set of the target domain sample set.
A5. The method according to A1 or A2, wherein classifying the features involved in the source domain sample set and the target domain sample set comprises:
classifying the discrete features involved in the source domain sample set into the unique feature set of the source domain sample set; and
classifying the discrete features involved in the target domain sample set into the unique feature set of the target domain sample set.
A6. The method according to A1, wherein, before classifying the features involved in the source domain sample set and the target domain sample set, the method further comprises:
determining whether the features involved in the source domain sample set include discrete features of a preset category; and
if so, converting the discrete features of the preset category into continuous features.
A7. The method according to A6, wherein converting the discrete features of the preset category into continuous features comprises:
counting the sample conditions in the source domain sample set that are associated with each discrete feature of the preset category; and
determining the sample condition of each discrete feature of the preset category as the continuous feature corresponding to that discrete feature.
A8. The method according to A7, wherein the sample condition includes at least one of the following: the number of positive transaction samples; the number of negative transaction samples; the proportion of negative transaction samples in the source domain sample set; the proportion of positive transaction samples in the source domain sample set; and the ratio of any individual's transaction amount or number of transactions to the total transaction amount or number of transactions of the individuals in the source domain sample set; where, in the source domain sample set, a sample whose transaction type is lawful behavior is a positive transaction sample, and a sample whose transaction type is suspicious behavior is a negative transaction sample.
B1. A training apparatus for an anti-money laundering model, comprising:
an obtaining unit, configured to obtain a source domain sample set and a target domain sample set, where the source domain samples and the target domain samples are all transaction samples used for training the anti-money laundering model;
a classification unit, configured to classify the features involved in the source domain sample set and the target domain sample set to determine the common feature set of the source domain sample set and the target domain sample set, the unique feature set of the source domain sample set, and the unique feature set of the target domain sample set;
an encoding unit, configured to uniformly encode the features in the source domain sample set and the features in the target domain sample set into the feature space corresponding to the union of the common feature set, the unique feature set of the source domain sample set, and the unique feature set of the target domain sample set;
a merging unit, configured to merge the uniformly encoded source domain sample set and target domain sample set; and
a training unit, configured to train the anti-money laundering model based on the merged sample set.
B2. The apparatus according to B1, wherein the classification unit comprises:
a determination module, configured to determine the stability index of each continuous feature involved in the source domain sample set and the target domain sample set; and
a first classification module, configured to classify each continuous feature based on the magnitude of its stability index.
B3. The apparatus according to B2, wherein the determination module is configured to determine the stability index of each continuous feature involved in the source domain sample set and the target domain sample set by the following formula.
The formula is:
Figure PCTCN2021140997-appb-000006
where $\mathrm{PSI}(Y_e, Y; B)_j$ denotes the stability index of the $j$-th continuous feature among the continuous features involved in the source domain sample set and the target domain sample set; $Y_e$ denotes the expected distribution, which is the full data of the target domain sample set; $Y$ denotes the actual distribution, which is the full data of the source domain sample set; $B$ denotes the preset number of buckets; $y_{ij}$ denotes the proportion of the $j$-th continuous feature that falls in the $i$-th bucket of the source domain sample set; and $y_{eij}$ denotes the proportion of the $j$-th continuous feature that falls in the $i$-th bucket of the target domain sample set.
B4.根据B2所述的装置,所述第一分类模块,被配置为将所述稳定性指标小于第一阈值的连续特征,分类至所述源域样本集和所述目标域样本集的共有特征集;将所述稳定性指标不小于所述第一阈值的所述源域样本集所涉及的连续特征,分类至所述源域样本集的特有特征集;将所述稳定性指标不小于所述第一阈值的所述目标域样本集所涉及的连续特征,分类至所述目标域样本集的特有特征集。B4. The apparatus according to B2, wherein the first classification module is configured to classify the continuous features whose stability index is less than a first threshold into the common features of the source domain sample set and the target domain sample set feature set; classify the continuous features involved in the source domain sample set whose stability index is not less than the first threshold into the unique feature set of the source domain sample set; classify the stability index not less than The continuous features involved in the target domain sample set of the first threshold are classified into a unique feature set of the target domain sample set.
B5.根据B1或B2所述的装置,所述分类单元包括:B5. The apparatus according to B1 or B2, wherein the classification unit comprises:
第二分类模块,被配置为将所述源域样本集所涉及的离散特征分类至所述源域样本集的特有特征集;将所述目标域样本集所涉及的离散特征分类至所述目标域样本集的特有特征集。The second classification module is configured to classify the discrete features involved in the source domain sample set into a unique feature set of the source domain sample set; classify the discrete features involved in the target domain sample set into the target A set of features specific to the domain sample set.
B6. The apparatus according to B1, further comprising:
a judgment unit configured to determine, before the classification unit classifies the features involved in the source domain sample set and the target domain sample set, whether the features involved in the source domain sample set include discrete features of a preset category, and, if so, to trigger a conversion unit;
the conversion unit, configured to convert the discrete features of the preset category into continuous features when triggered by the judgment unit.
B7. The apparatus according to B6, wherein the conversion unit is configured to: compile statistics on the sample conditions associated with each discrete feature of the preset category in the source domain sample set; and determine the sample conditions of each discrete feature of the preset category as the continuous feature corresponding to that discrete feature.
B8. The apparatus according to B7, wherein the sample conditions comprise at least one of the following: the number of positive transaction samples; the number of negative transaction samples; the proportion of negative transaction samples in the source domain sample set; the proportion of positive transaction samples in the source domain sample set; and the proportion of any individual's transaction amount/transaction count in that individual's total transaction amount/transaction count in the source domain sample set; wherein, in the source domain sample set, a sample whose transaction type is legitimate behavior is a positive transaction sample, and a sample whose transaction type is suspicious behavior is a negative transaction sample.
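By way of illustration only, the conversion described in B7 and B8 may be sketched as follows. The sketch assumes that each sample is a dictionary, that a label of 0 marks a positive (legitimate) transaction and a label of 1 marks a negative (suspicious) transaction, and that the proportions are taken relative to the whole source domain sample set. The function name `discrete_to_continuous` and the keys `pos_count`, `neg_count`, `pos_share`, and `neg_share` are hypothetical and do not appear in the application.

```python
from collections import defaultdict


def discrete_to_continuous(samples, feature, label_key="label"):
    """Map each value of a discrete feature to per-value sample statistics.

    Assumed label encoding: 0 = positive (legitimate) transaction,
    1 = negative (suspicious) transaction.
    """
    counts = defaultdict(lambda: [0, 0])  # value -> [positive, negative]
    for sample in samples:
        slot = 1 if sample[label_key] == 1 else 0
        counts[sample[feature]][slot] += 1

    total = len(samples)
    stats = {}
    for value, (pos, neg) in counts.items():
        stats[value] = {
            "pos_count": pos,
            "neg_count": neg,
            # Shares are relative to the whole source domain sample set
            # (one possible reading of "proportion in the sample set").
            "pos_share": pos / total,
            "neg_share": neg / total,
        }
    return stats


# Hypothetical source domain samples with one discrete feature, "channel".
source = [
    {"channel": "atm", "label": 0},
    {"channel": "atm", "label": 1},
    {"channel": "web", "label": 0},
    {"channel": "web", "label": 0},
]
stats = discrete_to_continuous(source, "channel")
```

Each discrete value can then be replaced in a sample by one or more of its statistics, which yields continuous features that can enter the stability-index comparison.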
C1. A computer-readable storage medium comprising a stored program, wherein, when the program runs, the device on which the storage medium resides is controlled to execute the training method for an anti-money laundering model according to any one of A1 to A8.
D1. A storage management device, comprising:
a memory configured to store a program;
a processor, coupled to the memory, configured to run the program to execute the training method for an anti-money laundering model according to any one of A1 to A8.
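By way of illustration only, the stability-index computation and the threshold-based classification of continuous features described in the embodiments above may be sketched as follows. Because the formula image is not reproduced in this text, the sketch assumes the conventional population stability index, that is, the sum over the B buckets of (y_ij - y_eij) * ln(y_ij / y_eij). The bucket edges, the smoothing constant `eps`, the threshold value 0.25, and all function names are hypothetical. Treating a continuous feature that appears in only one domain as unique to that domain is likewise an assumption.

```python
import math


def bucket_shares(values, edges):
    """Share of values per bucket; `edges` are ascending interior cut points."""
    if not values:
        raise ValueError("values must not be empty")
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[sum(1 for e in edges if v >= e)] += 1
    return [c / len(values) for c in counts]


def psi(source_values, target_values, edges, eps=1e-6):
    """Conventional PSI: source = actual (y_ij), target = expected (y_eij)."""
    actual = bucket_shares(source_values, edges)
    expected = bucket_shares(target_values, edges)
    total = 0.0
    for a, e in zip(actual, expected):
        a, e = max(a, eps), max(e, eps)  # avoid log(0) for empty buckets
        total += (a - e) * math.log(a / e)
    return total


def classify_continuous(source, target, edges_by_feature, threshold=0.25):
    """Split continuous features into common, source-unique, target-unique."""
    common, source_only, target_only = set(), set(), set()
    for name, edges in edges_by_feature.items():
        in_s, in_t = name in source, name in target
        if in_s and in_t and psi(source[name], target[name], edges) < threshold:
            common.add(name)
            continue
        # Index not less than the threshold (or feature missing in one domain).
        if in_s:
            source_only.add(name)
        if in_t:
            target_only.add(name)
    return common, source_only, target_only


# Hypothetical feature columns, keyed by feature name.
source = {"amount": [1, 2, 3, 4], "freq": [1, 1, 1, 1], "src_only": [0, 1]}
target = {"amount": [1, 2, 3, 4], "freq": [9, 9, 9, 9]}
edges = {"amount": [2.5], "freq": [5], "src_only": [0.5]}
common, source_only, target_only = classify_continuous(source, target, edges)
```

In this example `amount` has identical bucket shares in both domains and is classified as a common feature, whereas `freq` is distributed differently and is placed in the unique feature set of each domain.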
In the above embodiments, the description of each embodiment has its own emphasis. For parts not detailed in one embodiment, refer to the relevant descriptions of the other embodiments.
It can be understood that the relevant features of the above methods and apparatuses may be cross-referenced. In addition, "first", "second", and the like in the above embodiments are used to distinguish the embodiments and do not indicate that any embodiment is better or worse than another.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses, and units described above may be found in the corresponding processes of the foregoing method embodiments and are not repeated here.
The algorithms and displays provided herein are not inherently related to any particular computer, virtual system, or other device. Various general-purpose systems may also be used with the teachings herein. From the above description, the structure required to construct such systems is apparent. Furthermore, this application is not directed to any particular programming language. It should be understood that the content of this application described herein may be implemented in a variety of programming languages, and that the above descriptions of specific languages are provided to disclose the best mode of this application.
Numerous specific details are set forth in the description provided herein. It will be understood, however, that the embodiments of this application may be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline this disclosure and aid the understanding of one or more of the various application aspects, the features of this application are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments. This method of disclosure, however, should not be interpreted as reflecting an intention that the claimed application requires more features than are expressly recited in each claim. Rather, as the following claims reflect, the application aspects lie in fewer than all features of a single foregoing disclosed embodiment. Thus, the claims following the Detailed Description are hereby expressly incorporated into the Detailed Description, with each claim standing on its own as a separate embodiment of this application.
Those skilled in the art will understand that the modules of the device in an embodiment may be adaptively changed and placed in one or more devices different from that embodiment. The modules, units, or components of the embodiments may be combined into a single module, unit, or component, and may furthermore be divided into multiple sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings), and all processes or units of any method or device so disclosed, may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, equivalent, or similar purpose.
Furthermore, those skilled in the art will understand that, although some embodiments described herein include certain features that are included in other embodiments but not other features, combinations of features of different embodiments are meant to be within the scope of this application and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The component embodiments of this application may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art should understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the method, apparatus, and framework for running a deep neural network model according to the embodiments of this application. This application may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the method described herein. Such a program implementing this application may be stored on a computer-readable medium or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit this application, and that those skilled in the art may design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. This application may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not denote any order. These words may be interpreted as names.
Industrial Applicability
The solution provided in this application completes the anti-money laundering model training task for the target domain sample set by introducing the features of the source domain sample set. The anti-money laundering model can therefore learn both the existing knowledge in the source domain sample set and the new knowledge in the target domain sample set. Existing knowledge is accumulated and retained while new knowledge is learned, which can improve the model's effectiveness at identifying money laundering.
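By way of illustration only, the unified encoding and merging steps on which this solution relies may be sketched as follows. The sketch assumes that samples are dictionaries, that the three feature sets have already been determined, and that a column belonging to the other domain is filled with a default value of 0.0. The fill value, the `(owner, name)` column representation, and all function names are hypothetical and are not taken from the application.

```python
def build_feature_space(common, source_only, target_only):
    """Ordered columns for the union of the three feature sets.

    Each column is tagged with its owner so that a feature placed in both
    unique sets yields two distinct columns.
    """
    columns = [("common", f) for f in sorted(common)]
    columns += [("source", f) for f in sorted(source_only)]
    columns += [("target", f) for f in sorted(target_only)]
    return columns


def encode(sample, domain, columns, fill=0.0):
    """Encode one sample from `domain` ("source" or "target")."""
    row = []
    for owner, name in columns:
        if owner in ("common", domain):
            row.append(sample.get(name, fill))
        else:
            row.append(fill)  # column belongs to the other domain
    return row


def merge(source_samples, target_samples, columns, label_key="label"):
    """Encode both sample sets into the shared space and concatenate them."""
    features, labels = [], []
    for domain, samples in (("source", source_samples), ("target", target_samples)):
        for sample in samples:
            features.append(encode(sample, domain, columns))
            labels.append(sample[label_key])
    return features, labels


# Hypothetical feature sets and samples.
columns = build_feature_space({"amount"}, {"channel_risk"}, {"device_score"})
source_samples = [{"amount": 10.0, "channel_risk": 0.3, "label": 0}]
target_samples = [{"amount": 20.0, "device_score": 0.9, "label": 1}]
X, y = merge(source_samples, target_samples, columns)
```

The merged rows `X` and labels `y` can then be passed to whichever classifier is chosen as the anti-money laundering model; the choice of classifier is outside the scope of this sketch.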

Claims (18)

  1. A training method for an anti-money laundering model, comprising:
    obtaining a source domain sample set and a target domain sample set, wherein the source domain samples and the target domain samples are both transaction samples used for training the anti-money laundering model;
    classifying the features involved in the source domain sample set and the target domain sample set to determine a common feature set of the source domain sample set and the target domain sample set, a unique feature set of the source domain sample set, and a unique feature set of the target domain sample set;
    uniformly encoding the features in the source domain sample set and the features in the target domain sample set into the feature space corresponding to the union of the common feature set of the source domain sample set and the target domain sample set, the unique feature set of the source domain sample set, and the unique feature set of the target domain sample set;
    merging the uniformly encoded source domain sample set and target domain sample set; and
    training the anti-money laundering model based on the merged sample set.
  2. The method according to claim 1, wherein classifying the features involved in the source domain sample set and the target domain sample set comprises:
    determining a stability index of each continuous feature involved in the source domain sample set and the target domain sample set; and
    classifying each continuous feature based on the magnitude of its stability index.
  3. The method according to claim 2, wherein determining the stability index of each continuous feature involved in the source domain sample set and the target domain sample set comprises:
    determining the stability index of each continuous feature involved in the source domain sample set and the target domain sample set by the following formula;
    the formula being:
    Figure PCTCN2021140997-appb-100001
    where PSI(Y_e, Y; B)_j denotes the stability index of the j-th continuous feature among the continuous features involved in the source domain sample set and the target domain sample set; Y_e denotes the expected distribution, which is the full data of the target domain sample set; Y denotes the actual distribution, which is the full data of the source domain sample set; B denotes the preset number of buckets; y_ij denotes the proportion of the j-th continuous feature that falls in the i-th bucket of the source domain sample set; and y_eij denotes the proportion of the j-th continuous feature that falls in the i-th bucket of the target domain sample set.
  4. The method according to claim 2, wherein classifying each continuous feature based on the magnitude of its stability index comprises:
    classifying continuous features whose stability index is less than a first threshold into the common feature set of the source domain sample set and the target domain sample set;
    classifying continuous features involved in the source domain sample set whose stability index is not less than the first threshold into the unique feature set of the source domain sample set; and
    classifying continuous features involved in the target domain sample set whose stability index is not less than the first threshold into the unique feature set of the target domain sample set.
  5. The method according to claim 1 or 2, wherein classifying the features involved in the source domain sample set and the target domain sample set comprises:
    classifying the discrete features involved in the source domain sample set into the unique feature set of the source domain sample set; and
    classifying the discrete features involved in the target domain sample set into the unique feature set of the target domain sample set.
  6. The method according to claim 1, wherein, before classifying the features involved in the source domain sample set and the target domain sample set, the method further comprises:
    determining whether the features involved in the source domain sample set include discrete features of a preset category; and
    if so, converting the discrete features of the preset category into continuous features.
  7. The method according to claim 6, wherein converting the discrete features of the preset category into continuous features comprises:
    compiling statistics on the sample conditions associated with each discrete feature of the preset category in the source domain sample set; and
    determining the sample conditions of each discrete feature of the preset category as the continuous feature corresponding to that discrete feature.
  8. The method according to claim 7, wherein the sample conditions comprise at least one of the following: the number of positive transaction samples; the number of negative transaction samples; the proportion of negative transaction samples in the source domain sample set; the proportion of positive transaction samples in the source domain sample set; and the proportion of any individual's transaction amount/transaction count in that individual's total transaction amount/transaction count in the source domain sample set; wherein, in the source domain sample set, a sample whose transaction type is legitimate behavior is a positive transaction sample, and a sample whose transaction type is suspicious behavior is a negative transaction sample.
  9. A training apparatus for an anti-money laundering model, comprising:
    an obtaining unit configured to obtain a source domain sample set and a target domain sample set, wherein the source domain samples and the target domain samples are both transaction samples used for training the anti-money laundering model;
    a classification unit configured to classify the features involved in the source domain sample set and the target domain sample set to determine a common feature set of the source domain sample set and the target domain sample set, a unique feature set of the source domain sample set, and a unique feature set of the target domain sample set;
    an encoding unit configured to uniformly encode the features in the source domain sample set and the features in the target domain sample set into the feature space corresponding to the union of the common feature set of the source domain sample set and the target domain sample set, the unique feature set of the source domain sample set, and the unique feature set of the target domain sample set;
    a merging unit configured to merge the uniformly encoded source domain sample set and target domain sample set; and
    a training unit configured to train the anti-money laundering model based on the merged sample set.
  10. The apparatus according to claim 9, wherein the classification unit comprises:
    a determination module configured to determine a stability index of each continuous feature involved in the source domain sample set and the target domain sample set; and
    a first classification module configured to classify each continuous feature based on the magnitude of its stability index.
  11. The apparatus according to claim 10, wherein the determination module is configured to determine the stability index of each continuous feature involved in the source domain sample set and the target domain sample set by the following formula;
    the formula being:
    Figure PCTCN2021140997-appb-100002
    where PSI(Y_e, Y; B)_j denotes the stability index of the j-th continuous feature among the continuous features involved in the source domain sample set and the target domain sample set; Y_e denotes the expected distribution, which is the full data of the target domain sample set; Y denotes the actual distribution, which is the full data of the source domain sample set; B denotes the preset number of buckets; y_ij denotes the proportion of the j-th continuous feature that falls in the i-th bucket of the source domain sample set; and y_eij denotes the proportion of the j-th continuous feature that falls in the i-th bucket of the target domain sample set.
  12. The apparatus according to claim 10, wherein the first classification module is configured to: classify continuous features whose stability index is less than a first threshold into the common feature set of the source domain sample set and the target domain sample set; classify continuous features involved in the source domain sample set whose stability index is not less than the first threshold into the unique feature set of the source domain sample set; and classify continuous features involved in the target domain sample set whose stability index is not less than the first threshold into the unique feature set of the target domain sample set.
  13. The apparatus according to claim 9 or 10, wherein the classification unit comprises:
    a second classification module configured to classify the discrete features involved in the source domain sample set into the unique feature set of the source domain sample set, and to classify the discrete features involved in the target domain sample set into the unique feature set of the target domain sample set.
  14. The apparatus according to claim 9, further comprising:
    a judgment unit configured to determine, before the classification unit classifies the features involved in the source domain sample set and the target domain sample set, whether the features involved in the source domain sample set include discrete features of a preset category, and, if so, to trigger a conversion unit;
    the conversion unit, configured to convert the discrete features of the preset category into continuous features when triggered by the judgment unit.
  15. The apparatus according to claim 14, wherein the conversion unit is configured to: compile statistics on the sample conditions associated with each discrete feature of the preset category in the source domain sample set; and determine the sample conditions of each discrete feature of the preset category as the continuous feature corresponding to that discrete feature.
  16. The apparatus according to claim 15, wherein the sample conditions comprise at least one of the following: the number of positive transaction samples; the number of negative transaction samples; the proportion of negative transaction samples in the source domain sample set; the proportion of positive transaction samples in the source domain sample set; and the proportion of any individual's transaction amount/transaction count in that individual's total transaction amount/transaction count in the source domain sample set; wherein, in the source domain sample set, a sample whose transaction type is legitimate behavior is a positive transaction sample, and a sample whose transaction type is suspicious behavior is a negative transaction sample.
  17. A computer-readable storage medium comprising a stored program, wherein, when the program runs, the device on which the storage medium resides is controlled to execute the training method for an anti-money laundering model according to any one of claims 1 to 8.
  18. A storage management device, comprising:
    a memory configured to store a program;
    a processor, coupled to the memory, configured to run the program to execute the training method for an anti-money laundering model according to any one of claims 1 to 8.
PCT/CN2021/140997 2020-12-30 2021-12-23 Method and apparatus for training anti-money laundering model WO2022143431A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011625865.9 2020-12-30
CN202011625865.9A CN112634048B (en) 2020-12-30 2020-12-30 Training method and device for money backwashing model

Publications (1)

Publication Number Publication Date
WO2022143431A1 true WO2022143431A1 (en) 2022-07-07

Family

ID=75290309

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/140997 WO2022143431A1 (en) 2020-12-30 2021-12-23 Method and apparatus for training anti-money laundering model

Country Status (2)

Country Link
CN (1) CN112634048B (en)
WO (1) WO2022143431A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112634048B (en) * 2020-12-30 2023-06-13 第四范式(北京)技术有限公司 Training method and device for money backwashing model
CN113781052A (en) * 2021-09-07 2021-12-10 上海浦东发展银行股份有限公司 Anti-money laundering monitoring method, device, equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109214421A (en) * 2018-07-27 2019-01-15 阿里巴巴集团控股有限公司 A kind of model training method, device and computer equipment
CN110782349A (en) * 2019-10-25 2020-02-11 支付宝(杭州)信息技术有限公司 Model training method and system
US20200104726A1 (en) * 2018-09-29 2020-04-02 VII Philip Alvelda Machine learning data representations, architectures, and systems that intrinsically encode and represent benefit, harm, and emotion to optimize learning
CN111951050A (en) * 2020-08-14 2020-11-17 中国工商银行股份有限公司 Financial product recommendation method and device
CN112634048A (en) * 2020-12-30 2021-04-09 第四范式(北京)技术有限公司 Anti-money laundering model training method and device

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103729648B (en) * 2014-01-07 2017-01-04 中国科学院计算技术研究所 Domain-adaptive mode identification method and system
US20180024968A1 (en) * 2016-07-22 2018-01-25 Xerox Corporation System and method for domain adaptation using marginalized stacked denoising autoencoders with domain prediction regularization
US10776693B2 (en) * 2017-01-31 2020-09-15 Xerox Corporation Method and system for learning transferable feature representations from a source domain for a target domain
CN107909101B (en) * 2017-11-10 2019-07-12 清华大学 Semi-supervised transfer learning character identifying method and system based on convolutional neural networks
CN108197643B (en) * 2017-12-27 2021-11-30 佛山科学技术学院 Transfer learning method based on unsupervised clustering and metric learning
CN108304876B (en) * 2018-01-31 2021-07-06 国信优易数据股份有限公司 Classification model training method and device and classification method and device
CN109902798A (en) * 2018-05-31 2019-06-18 华为技术有限公司 The training method and device of deep neural network
CN110659744B (en) * 2019-09-26 2021-06-04 支付宝(杭州)信息技术有限公司 Training event prediction model, and method and device for evaluating operation event
CN110852446A (en) * 2019-11-13 2020-02-28 腾讯科技(深圳)有限公司 Machine learning model training method, device and computer readable storage medium
CN111444951B (en) * 2020-03-24 2024-02-20 腾讯科技(深圳)有限公司 Sample recognition model generation method, device, computer equipment and storage medium
CN111724083B (en) * 2020-07-21 2023-10-13 腾讯科技(深圳)有限公司 Training method and device for financial risk identification model, computer equipment and medium
CN111814977B (en) * 2020-08-28 2020-12-18 支付宝(杭州)信息技术有限公司 Method and device for training event prediction model
CN112116025A (en) * 2020-09-28 2020-12-22 北京嘀嘀无限科技发展有限公司 User classification model training method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ANONYMOUS: "Research and Implementation of Anti-money Laundering Modeling Based on Transfer Learning Technology", FINANCIAL COMPUTER OF CHINA, no. 10, 7 October 2020 (2020-10-07), pages 48 - 53, XP055948878, ISSN: 1001-0734 *

Also Published As

Publication number Publication date
CN112634048B (en) 2023-06-13
CN112634048A (en) 2021-04-09

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 21914168; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 21914168; Country of ref document: EP; Kind code of ref document: A1)