CN113094407A - Anti-money laundering identification method, device and system based on horizontal federated learning

Publication number
CN113094407A
Authority
CN
China
Prior art keywords
data
sample
feature
money laundering
characteristic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110264163.0A
Other languages
Chinese (zh)
Other versions
CN113094407B (en)
Inventor
武润鹏
李衡
张岩
邹杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Gf Securities Co ltd
Original Assignee
Gf Securities Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Gf Securities Co ltd
Priority to CN202110264163.0A
Publication of CN113094407A
Application granted
Publication of CN113094407B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2474 Sequence data queries, e.g. querying versioned data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/02 Banking, e.g. interest calculation or account maintenance
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/04 Trading; Exchange, e.g. stocks, commodities, derivatives or currency exchange

Abstract

The invention discloses an anti-money laundering identification method, device and system based on horizontal federated learning. The method comprises: first, performing feature alignment on the data features provided by each participating node and extracting the basic data features for constructing an anti-money laundering model; performing sample synchronization according to the user ID and the sample generation time of each data sample uploaded by each participating node; issuing a time sequence feature construction instruction to each participating node, constructing the final feature value of the required time sequence feature, and issuing the final feature value to each participating node, so that each participating node constructs an anti-money laundering identification model through horizontal federated learning according to the acquired time sequence feature value and the feature values of its own data features, and finally performs anti-money laundering identification according to the constructed model. By implementing the embodiment of the invention, the accuracy of anti-money laundering identification can be improved.

Description

Anti-money laundering identification method, device and system based on horizontal federated learning
Technical Field
The invention relates to the technical field of computers, in particular to an anti-money laundering identification method, device and system based on horizontal federated learning.
Background
In existing anti-money laundering judgment based on machine learning, each securities company independently trains a model on its own transaction data and then carries out anti-money laundering judgment. In the anti-money laundering model construction process, the required data is mainly divided into two types. One type is a single feature, whose value depends only on the current record, such as the age or occupation of the customer. The other is a time sequence feature, which relies on multiple records. For example, the number of transactions of a certain client in the last month must be obtained by summarizing all the transaction records of the client in the last month. However, different transaction data of the same client may exist in different companies, and the data of different companies is confidential and cannot be shared. If the anti-money laundering model is constructed only from the data of a single company, the constructed time sequence features are inaccurate because the data is incomplete, and the accuracy of the model is low. In addition, the number of historical money laundering cases of a single company is small, so a model constructed from the data of only one company tends to overfit and has a large error.
Disclosure of Invention
The embodiment of the invention provides an anti-money laundering identification method, device and system based on horizontal federated learning, which can improve the accuracy of anti-money laundering identification.
An embodiment of the present invention provides an anti-money laundering identification method based on horizontal federated learning, including:
performing feature alignment on each data feature in the sample data tables of the plurality of participating nodes to generate basic data features for constructing an anti-money laundering model; each sample data table comprises a plurality of data samples, and each data sample is provided with a user ID and sample generation time;
carrying out sample synchronization on the sample data table of each participating node according to the user ID and the sample generation time; when sample synchronization is carried out, the user ID and the sample generation time of a selected data sample in a current participating node are sent to the participating node which does not own the selected data sample but owns the data sample with the same user ID as the selected data sample;
issuing a time sequence feature construction instruction to each participating node, so that each participating node calculates a basic feature value of the time sequence feature to be constructed based on a sample data table after sample synchronization according to statistical time dimension information, a feature name and a calculation mode of the required basic data feature, which are contained in the time sequence feature construction instruction, when receiving the time sequence feature construction instruction; calculating a final characteristic value of the time sequence characteristic according to each basic characteristic value;
and issuing the final characteristic value of the time sequence characteristic to each participating node, so that each participating node generates an anti-money laundering identification model through horizontal federated learning according to the final characteristic value of the time sequence characteristic and the characteristic value of the data characteristic of the participating node, and performs anti-money laundering identification according to the anti-money laundering identification model.
Further, the performing feature alignment on each data feature in the sample data table of the plurality of participating nodes to generate a basic data feature for constructing an anti-money laundering model specifically includes:
taking the feature intersection of each data feature in the sample data table of each participating node to obtain a plurality of first basic data features;
calculating the global effective rate of each data characteristic except the first basic data characteristic one by one; taking the data characteristic with the global effective rate exceeding a first preset threshold value as a second basic data characteristic;
and taking all the first basic data features and all the second basic data features as the basic data features for constructing the anti-money laundering model.
Further, the global efficiency of a data feature is calculated by the following formula:
$$g_r = \frac{\sum_{m=1}^{M} Ir_m \, n_m}{\sum_{m=1}^{M} n_m}$$
wherein $g_r$ is the global effective rate of a data feature, $M$ is the number of participating nodes, $Ir_m$ is the local effective rate of the data feature at the $m$-th participating node, and $n_m$ is the number of data samples of the $m$-th participating node.
On the basis of the above method item embodiments, the present invention correspondingly provides apparatus item embodiments.
The invention provides an anti-money laundering identification device based on horizontal federated learning, which comprises a feature alignment module, a sample synchronization module, a time sequence feature construction module and an anti-money laundering identification module;
the characteristic alignment module is used for performing characteristic alignment on each data characteristic in the sample data table of the plurality of participating nodes to generate a basic data characteristic for constructing an anti-money laundering model; each sample data table comprises a plurality of data samples, and each data sample is provided with a user ID and sample generation time;
the sample synchronization module is used for carrying out sample synchronization on the sample data table of each participating node according to the user ID and the sample generation time; when sample synchronization is carried out, the user ID and the sample generation time of a selected data sample in a current participating node are sent to the participating node which does not own the selected data sample but owns the data sample with the same user ID as the selected data sample;
the time sequence feature construction module is used for issuing a time sequence feature construction instruction to each participating node, so that each participating node, on receiving the time sequence feature construction instruction, calculates the basic feature value of the time sequence feature to be constructed based on the sample data table after sample synchronization, according to the statistical time dimension information, the feature name of the required basic data feature and the calculation mode contained in the time sequence feature construction instruction; and calculating a final characteristic value of the time sequence characteristic according to each basic characteristic value;
and the anti-money laundering identification module is used for issuing the final characteristic value of the time sequence characteristic to each participating node, so that each participating node generates an anti-money laundering identification model through horizontal federated learning according to the final characteristic value of the time sequence characteristic and the characteristic value of the data characteristic of the participating node, and carries out anti-money laundering identification according to the anti-money laundering identification model.
Further, the feature alignment module performs feature alignment on each data feature in the sample data table of the plurality of participating nodes to generate a basic data feature for constructing an anti-money laundering model, and specifically includes:
taking the feature intersection of each data feature in the sample data table of each participating node to obtain a plurality of first basic data features;
calculating the global effective rate of each data characteristic except the first basic data characteristic one by one; taking the data characteristic with the global effective rate exceeding a first preset threshold value as a second basic data characteristic;
and taking all the first basic data features and all the second basic data features as the basic data features for constructing the anti-money laundering model.
Further, the feature alignment module calculates a global efficiency of a data feature by the following formula:
$$g_r = \frac{\sum_{m=1}^{M} Ir_m \, n_m}{\sum_{m=1}^{M} n_m}$$
wherein $g_r$ is the global effective rate of a data feature, $M$ is the number of participating nodes, $Ir_m$ is the local effective rate of the data feature at the $m$-th participating node, and $n_m$ is the number of data samples of the $m$-th participating node.
On the basis of the device item embodiments, the invention provides an anti-money laundering identification system based on horizontal federated learning, which comprises a central node and a plurality of participating nodes; the central node comprises the anti-money laundering identification device based on horizontal federated learning of the invention.
The embodiment of the invention has the following beneficial effects:
The embodiment of the invention provides an anti-money laundering identification method, device and system based on horizontal federated learning. The method first performs feature alignment on the data features provided by each participating node and extracts the basic data features for constructing an anti-money laundering model. It then performs sample synchronization according to the user ID and sample generation time of the data samples uploaded by each participating node, and issues a time sequence feature construction instruction to each participating node, so that each participating node calculates the basic feature value of the required time sequence feature based on its sample data table after sample synchronization. The final feature value is calculated from the basic feature values and issued to each participating node. Each participating node then constructs an anti-money laundering identification model through horizontal federated learning, combining the acquired time sequence feature value with the feature values of its own data features, and finally performs anti-money laundering identification according to the constructed model. Compared with the prior art, the time sequence features are constructed from the data of all participating nodes, which solves the problem of inaccurate time sequence features caused by incomplete data; horizontal federated learning increases the number of samples, which improves the accuracy of the constructed anti-money laundering model, so that anti-money laundering identification can be carried out more accurately.
Drawings
Fig. 1 is a schematic flow chart of an anti-money laundering identification method based on horizontal federated learning according to an embodiment of the present invention.
Fig. 2 is a schematic structural diagram of an anti-money laundering identification device based on horizontal federated learning according to an embodiment of the present invention.
Fig. 3 is a system architecture diagram of an anti-money laundering identification system based on horizontal federated learning according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in Fig. 1, an embodiment of the present invention provides an anti-money laundering identification method based on horizontal federated learning, which at least includes:
Step S101: performing feature alignment on each data feature in the sample data tables of the plurality of participating nodes to generate basic data features for constructing an anti-money laundering model; each sample data table comprises a plurality of data samples, and each data sample is provided with a user ID and a sample generation time.
Step S102: carrying out sample synchronization on the sample data table of each participating node according to the user ID and the sample generation time; when the sample synchronization is carried out, the user ID and the sample generation time of a selected data sample in the current participating node are sent to the participating node which does not own the selected data sample but owns the data sample with the same user ID as the selected data sample.
Step S103: issuing a time sequence feature construction instruction to each participating node, so that each participating node calculates a basic feature value of the time sequence feature to be constructed based on a sample data table after sample synchronization according to statistical time dimension information, a feature name and a calculation mode of the required basic data feature, which are contained in the time sequence feature construction instruction, when receiving the time sequence feature construction instruction; and calculating a final characteristic value of the time sequence characteristic according to each basic characteristic value.
Step S104: issuing the final characteristic value of the time sequence characteristic to each participating node, so that each participating node generates an anti-money laundering identification model through horizontal federated learning according to the final characteristic value of the time sequence characteristic and the characteristic value of the data characteristic of the participating node, and performs anti-money laundering identification according to the anti-money laundering identification model.
It should be noted that the anti-money laundering identification method based on horizontal federated learning is suitable for being run at a central node.
In step S101, in a preferred embodiment, the performing feature alignment on each data feature in the sample data table of the plurality of participating nodes to generate a basic data feature for constructing an anti-money laundering model specifically includes:
taking the feature intersection of each data feature in the sample data table of each participating node to obtain a plurality of first basic data features; calculating the global effective rate of each data characteristic except the first basic data characteristic one by one; taking the data characteristic with the global effective rate exceeding a first preset threshold value as a second basic data characteristic; and taking all the first basic data features and all the second basic data features as the basic data features for constructing the anti-money laundering model.
Specifically, in the anti-money laundering scenario of the securities industry, each participating node often faces the problem of an insufficient sample size, and horizontal federated learning is introduced to solve it. In horizontal federated learning, the participating nodes tend to hold different samples, but the features held by each party overlap significantly. Therefore, before federated learning is carried out, feature alignment is performed across the participating nodes, and the features common to all parties are screened out for training. In the classical horizontal federated learning scenario, however, feature alignment directly uses the feature intersection of the participating nodes. As a result, a feature that exists only at some of the participating nodes is discarded even if its fill rate is high. To address this problem, the present invention employs a new data feature alignment method, which is described in detail below.
First, the sample data table of each participating node comprises a plurality of data samples, and each data sample records a plurality of data items (namely the data features); each data item comprises a data item name and a corresponding value. The specific data items held by each participating node may vary, but generally include basic information of the user, historical trading information of the user and historical non-trading information of the user. Basic information of the user includes, for example, the user's age, job position, annual income, gender, nationality and place of residence. Historical trading information is the user's historical securities order records, such as order price and underlying security. Historical non-trading information is a record of non-trading behaviors performed by the user at the securities company, such as records of changing the depository bank and records of fund transfers in and out.
When data features are aligned, each participating node uploads the field name of each data item in its sample data table to the central node. After receiving them, the central node first calculates the intersection of the data items uploaded by the participating nodes and takes the data items held by all participating nodes as the first basic data features.
For the remaining data items, the global effective rate of each is calculated from its local effective rate at each participating node. The data items whose global effective rate exceeds the preset threshold are extracted as the second basic data features. The first and second basic data features are combined to obtain the basic data features finally used for constructing the anti-money laundering model, which completes feature alignment.
The local effective rate of a data item at a single participating node can be characterized by the fill rate of the data item at that node: the higher the fill rate, the higher the local effective rate. If the data item is not in the sample data table of a participating node, its local effective rate at that node is 0.
The local effective rate $Ir$ characterizes the efficiency of a data feature held by a single participating node.
The global effective rate $g_r$ represents the overall efficiency of the data feature across all participating nodes and determines whether the feature participates in the subsequent federated learning training process.
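The local effective rate described above can be sketched as a fill-rate computation. This is a minimal illustration only: the function name `local_effective_rate` and the choice to treat `None` and empty strings as unfilled values are assumptions, not details given in the source.

```python
from typing import Optional, Sequence


def local_effective_rate(column: Optional[Sequence[object]]) -> float:
    """Fill rate of one data item at one participating node.

    `column` holds the item's value for every local data sample, or is None
    when the node's sample data table does not contain the item at all.
    Treating None and "" as unfilled is an assumption of this sketch.
    """
    if column is None or len(column) == 0:
        return 0.0  # item absent at this node -> local effective rate 0
    filled = sum(1 for value in column if value is not None and value != "")
    return filled / len(column)


ages = [34, None, 51, 28]          # hypothetical values of one data item
print(local_effective_rate(ages))  # 0.75
print(local_effective_rate(None))  # 0.0
```

Only this rate and the local sample count would need to leave the node, not the underlying values.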
The global effective rate may be computed as follows:
$$g_r = \frac{\sum_{m=1}^{M} Ir_m \, n_m}{\sum_{m=1}^{M} n_m}$$
In the formula, $g_r$ is the global effective rate of a data feature, $M$ is the number of participating nodes, $Ir_m$ is the local effective rate of the data feature at the $m$-th participating node, and $n_m$ is the number of data samples of the $m$-th participating node.
The global effective rate of each remaining data item is calculated with the above formula. If the global effective rate $g_r$ is greater than the first preset threshold $g_{th}$, the data feature is taken as a second basic data feature for subsequent training of the anti-money laundering model. The feature alignment method provided by the invention can improve the effect of the anti-money laundering identification model. The first preset threshold $g_{th}$ may be determined as a hyper-parameter in the subsequent horizontal federated learning model training.
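The alignment procedure at the central node can be sketched as follows. The sketch assumes the global effective rate is the sample-count-weighted average of the local effective rates; the function name `align_features`, the input layout and all example numbers are hypothetical.

```python
from typing import Dict, List, Set


def align_features(
    local_rates: List[Dict[str, float]],  # per node: field name -> local effective rate
    sample_counts: List[int],             # per node: number of data samples n_m
    g_th: float,                          # first preset threshold
) -> Set[str]:
    """Return the basic data features selected by the central node."""
    # First basic data features: fields present at every participating node.
    first = set.intersection(*(set(rates) for rates in local_rates))
    # Remaining fields exist only at some of the nodes.
    remaining = set().union(*(set(rates) for rates in local_rates)) - first
    total = sum(sample_counts)
    second = set()
    for name in remaining:
        # A node that lacks the field contributes a local effective rate of 0.
        weighted = sum(r.get(name, 0.0) * n for r, n in zip(local_rates, sample_counts))
        if weighted / total > g_th:  # assumed weighted-average form of g_r
            second.add(name)
    return first | second


rates = [
    {"age": 0.9, "income": 0.8, "occupation": 0.9},
    {"age": 1.0, "income": 0.7},
    {"age": 0.95, "occupation": 0.8},
]
print(sorted(align_features(rates, [1000, 3000, 6000], g_th=0.5)))
# ['age', 'occupation']
```

In this example `occupation` is kept although one node lacks it (global rate 0.57), while `income` is dropped (global rate 0.29); a plain intersection would have kept only `age`.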
For step S102: the anti-money laundering scenario is a typical time-series scenario, and a constructed data sample usually carries time information (i.e., the sample generation time), because the same customer may have different money laundering risks at different times. Thus the same customer may be treated as different samples at different times. The same client may trade at different participating nodes. Therefore, the sample data tables of different participating nodes may contain the same data sample (taking the user ID and the sample generation time as the criterion, two data samples are considered the same if both their user IDs and their sample generation times are identical), and they may also contain samples with the same user ID but different sample generation times.
Table 1 shows the data samples held by three different participating nodes:
Figure BDA0002971293450000081
TABLE 1
As can be seen from Table 1, participating nodes $m_1$ and $m_3$ hold identical data samples, i.e. [Figure BDA0002971293450000082] and [Figure BDA0002971293450000083]. The data samples held by participating nodes $m_1$ and $m_2$ differ from one another, but the samples [Figure BDA0002971293450000084] and [Figure BDA0002971293450000085] both belong to user U1. Because the clients of the participating nodes overlap, the data of the same client held by other participating nodes can be used when constructing the features of a participating node, which improves the model effect. Therefore, in modeling the anti-money laundering scene in the securities industry, not only feature alignment but also sample synchronization is carried out. The sample synchronization method is as follows:
First, each participating node sends its data samples to the central node in an ins_sync message; the data samples sent comprise only the user ID and the sample generation time. After receiving the ins_sync messages, the central node integrates them and then synchronizes the samples of the participating nodes. The specific synchronization mode is as follows: for a given sample of a given participating node, the user ID and the sample generation time of that sample are sent to each participating node which does not own the sample but owns a sample with the same user ID. The samples held by each participating node after sample synchronization are shown in Table 2:
Figure BDA0002971293450000091
TABLE 2
From Table 2, it can be seen that after sample synchronization, participating node $m_1$ gains the user ID and sample generation time of [Figure BDA0002971293450000092], participating node $m_2$ gains the user ID and sample generation time of [Figure BDA0002971293450000093], and participating node $m_3$ gains the user ID and sample generation time of [Figure BDA0002971293450000094]. It should be noted that, when sample synchronization is performed, only the user ID and the sample generation time are synchronized; the values of the data items in the data sample are not synchronized. For example, after sample synchronization, a data sample with user ID U1 and sample generation time 20201010 is added to the sample data table of participating node $m_1$, but the values of its data items are all null.
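The synchronization rule can be sketched as below. The node names, the example keys and the function name `synchronize` are hypothetical and do not reproduce Table 1 or Table 2; the sketch only shows which (user ID, sample generation time) pairs the central node would forward to which node.

```python
from typing import Dict, Set, Tuple

SampleKey = Tuple[str, str]  # (user ID, sample generation time)


def synchronize(samples: Dict[str, Set[SampleKey]]) -> Dict[str, Set[SampleKey]]:
    """For each node, collect the sample keys the central node sends to it.

    A key is sent to a node that does not own that sample but does own
    at least one sample with the same user ID.
    """
    to_send: Dict[str, Set[SampleKey]] = {node: set() for node in samples}
    for src, keys in samples.items():
        for key in keys:
            user_id = key[0]
            for dst, dst_keys in samples.items():
                if dst == src or key in dst_keys:
                    continue  # same node, or the sample is already held there
                if any(k[0] == user_id for k in dst_keys):
                    to_send[dst].add(key)
    return to_send


# Hypothetical ins_sync contents collected by the central node.
collected = {
    "m1": {("U1", "20201001"), ("U2", "20201005")},
    "m2": {("U1", "20201010")},
    "m3": {("U2", "20201005"), ("U3", "20201111")},
}
print(synchronize(collected))
```

Here `m1` and `m2` exchange their U1 keys, while `m3` receives nothing because it holds no U1 sample and already owns the shared U2 sample. Only keys are exchanged, so the synchronized rows would carry null data item values at the receiving node.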
For step S103: after feature alignment and sample synchronization, construction of the time sequence features required for training the anti-money laundering model can begin. As mentioned in the Background, constructing a time sequence feature requires historical data, and the historical data of the same client may be scattered across the participating nodes. Features of this type therefore have to be constructed by the participating nodes with the aid of the central node. In the following, a communication protocol is designed for some common time sequence features so that they can be constructed while the security of the underlying data is ensured. Other, more complex features can be constructed by combining or modifying these common constructions.
The following details, for a given sample s, the communication protocol used to construct each type of feature on data column c within time window w.
1. Summation class time sequence feature construction (e.g. a user's transaction amount over the last month)
Taking the construction of the w_sum_trx_amt_3m feature as an example, the feature means the total transaction amount of the customer within the three months before the sample date. To construct the feature, the central node sends a summation class time sequence feature construction instruction to each participating node through a window_sum_cal message. The format of the window_sum_cal message is shown in Table 3:
Field                        Value
proto_type (protocol type)   window_sum_cal
fe_name (feature name)       w_sum_trx_amt_3m
W (time window length)       3 months
C (data column)              Trx_amt
TABLE 3
In Table 3, proto_type (protocol type) corresponds to the calculation mode contained in the time sequence feature construction instruction of the present invention, W (time window length) corresponds to the statistical time dimension information contained in the instruction, and C (data column) corresponds to the feature name of the required basic data feature contained in the instruction.
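As an illustration, the window_sum_cal instruction of Table 3 could be modeled as a small immutable record. The class name `WindowSumCal`, the attribute names and the encoding of the window length as an integer number of months are assumptions of this sketch; the source specifies only the four fields and their example values.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class WindowSumCal:
    """Time sequence feature construction instruction (summation class)."""
    fe_name: str    # feature name, e.g. w_sum_trx_amt_3m
    w_months: int   # W: time window length, encoded in months (assumed encoding)
    c: str          # C: data column the statistic is computed on
    proto_type: str = "window_sum_cal"  # protocol type, i.e. the calculation mode


msg = WindowSumCal(fe_name="w_sum_trx_amt_3m", w_months=3, c="Trx_amt")
print(asdict(msg))
```

A frozen dataclass keeps the instruction read-only once issued, which suits a message that is broadcast unchanged to every participating node.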
After each participating node receives the window_sum_cal message, it directly calculates, for each sample it holds, the sum of the values of data column c within the corresponding time window, [Figure BDA0002971293450000101], and sends the result to the central node through a window_sum_result message. The window_sum_result message format is shown in Table 4:
Figure BDA0002971293450000111
TABLE 4
After receiving the window_sum_result messages from the participating nodes, the central node directly sums the basic characteristic values of the participating nodes to obtain the final characteristic value of the feature. The final characteristic value is then sent back to each participating node via a window_sum_notify message.
For example: suppose at this time Trx _ amt is the transaction amount and the data sample is<ID1,20201225>(ii) a Then join node m1、m2、m3When the message w _ sum _ trx _ amt _3m is received, the data sample is pointed to<ID1,20201225>Based on the sample data table, extracting the data value of the data item of the "transaction amount" of the client ID1 in the time period 20200925-20201225, and summing the data values to obtain the summed value (i.e. the basic characteristic value of the time sequence characteristic required to be constructed); then each participating node sends the summed value, the user ID of the corresponding sample and the sample generation time to the central node; and the central node sums the summed values of all the participating nodes again to obtain a final value (namely, a final characteristic value of the time sequence characteristic). This final value is the value of the sum of the transaction amounts for the customer ID1 within the first 3 months of 2020/12/25 (i.e., 2020/09/25-2020/12/25). After calculating this value, the central node sends the final characteristic value back to the respective participating nodes.
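The worked example above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the table contents, the helper names `local_window_sum` and `central_sum`, and the choice to treat both window boundaries as inclusive are assumptions.

```python
from datetime import date

# Hypothetical local transaction tables: node -> list of (user_id, trx_date, trx_amt).
NODE_TABLES = {
    "m1": [("ID1", date(2020, 10, 3), 100.0), ("ID1", date(2020, 8, 1), 999.0)],
    "m2": [("ID1", date(2020, 12, 20), 50.0), ("ID2", date(2020, 11, 1), 70.0)],
    "m3": [("ID1", date(2020, 9, 25), 25.0)],
}


def local_window_sum(rows, user_id, start, end):
    """Participating-node step: basic feature value for one sample (inclusive window assumed)."""
    return sum(amt for uid, d, amt in rows if uid == user_id and start <= d <= end)


def central_sum(basic_values):
    """Central-node step: add up the basic feature values reported by all nodes."""
    return sum(basic_values)


# Sample <ID1, 20201225> with a three-month window.
start, end = date(2020, 9, 25), date(2020, 12, 25)
basic = {m: local_window_sum(rows, "ID1", start, end) for m, rows in NODE_TABLES.items()}
final_value = central_sum(basic.values())
```

Only the per-node sums leave each node; the individual transaction records stay local, which is the point of splitting the computation this way.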
2. Constructing maximum/minimum-type time sequence features: such features can be constructed in a way similar to the summation-type features described above. For each sample, every participating node holding data of the sample calculates the maximum/minimum value of data column c within the time window w and sends the result to the central node.
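A minimal sketch of the maximum/minimum case follows. The text only says the construction is similar to the summation case, so the central-node step shown here (taking the maximum of the local maxima, or the minimum of the local minima) and the handling of nodes without records are assumptions; the function names are hypothetical.

```python
def local_extreme(values, kind):
    """Participating-node step: max or min of column c in the window, or None if no records."""
    if not values:
        return None
    return max(values) if kind == "max" else min(values)


def central_extreme(local_results, kind):
    """Central-node step: combine per-node results, skipping nodes that had no records."""
    present = [v for v in local_results if v is not None]
    if not present:
        return None
    return max(present) if kind == "max" else min(present)


# Hypothetical values of column c inside the window, per node, for one sample.
window_values = {"m1": [100.0, 40.0], "m2": [], "m3": [250.0]}
global_max = central_extreme([local_extreme(v, "max") for v in window_values.values()], "max")
global_min = central_extreme([local_extreme(v, "min") for v in window_values.values()], "min")
```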
3. Constructing average-type time sequence features: take the construction of the w_avg_trx_amt_3m feature as an example. The feature means the average transaction amount (total transaction amount divided by total number of transactions) of the customer within three months before the sample date. To construct this feature, the central node first issues an instruction to each participating node through a window_avg_cal message. The message format is shown in Table 5.
Figure BDA0002971293450000121
TABLE 5
After receiving the window_avg_cal message, each participating node calculates, for each sample, the sum of data column c within the time window w
Figure BDA0002971293450000122
Then each participating node sends the number of data records of each sample within the time window w
Figure BDA0002971293450000123
and the calculated
Figure BDA0002971293450000124
to the central node through a window_avg_result message. The window_avg_result message format is shown in Table 6:
Figure BDA0002971293450000125
Figure BDA0002971293450000131
TABLE 6
After the central node receives the data sent by each participating node, it calculates the average value of data column c of each sample within the time window w
Figure BDA0002971293450000132
using the following formula, and then sends the average back to the respective participating nodes.
Figure BDA0002971293450000133
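The average-type construction can be illustrated as follows. The formula itself is only available as an image, so this sketch follows the textual definition given for the feature (total amount divided by total number of records across nodes); the function names and sample values are hypothetical.

```python
def local_sum_and_count(values):
    """Participating-node step: report the sum and the record count for one sample."""
    return sum(values), len(values)


def central_average(reports):
    """Central-node step: global average = total of sums / total of counts."""
    total = sum(s for s, _ in reports)
    count = sum(n for _, n in reports)
    return total / count if count else None


# Hypothetical values of column c inside the window, per node, for one sample.
window_values = {"m1": [100.0, 50.0], "m2": [30.0], "m3": [20.0]}
reports = [local_sum_and_count(v) for v in window_values.values()]
global_avg = central_average(reports)
```

Averaging the per-node averages would give a different (generally wrong) result when nodes hold different numbers of records, which is why both the sum and the count are reported.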
4. Constructing standard deviation-type time sequence features: take the construction of the w_std_trx_amt_3m feature as an example. The feature means the standard deviation of the customer's transaction amounts within three months before the sample date and is used to characterize the dispersion of the customer's recent transaction amounts. Given a sample s, in order to obtain the standard deviation feature of data column c within the time window w
Figure BDA0002971293450000134
the central node first issues an instruction to each participating node through a window_std_cal message. The message format is shown in Table 7:
Figure BDA0002971293450000135
TABLE 7
Next, through the average-type feature construction process described above, the central node obtains the global average of data column c of sample s within the time window w
Figure BDA0002971293450000136
The central node then sends this average to all participating nodes holding the sample through a window_mss_cal message. The message format is shown in Table 8.
Figure BDA0002971293450000141
TABLE 8
After receiving the message, each participating node calculates the MSS value by the following formula.
Figure BDA0002971293450000142
In the above formula,
Figure BDA0002971293450000143
represents the set of data records in participating node m that the sample s contains in its time window w, and V_{m,r,c} represents the value of column c of data record r in participating node m. Each participating node then sends its own
Figure BDA0002971293450000144
and the calculated
Figure BDA0002971293450000145
value to the central node through a window_mss_result message. The message format is shown in Table 9:
Figure BDA0002971293450000146
TABLE 9
Based on the received data, the central node calculates the feature value of each sample according to the following formula and sends the feature value to the participating node that originally holds the sample.
Figure BDA0002971293450000151
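The two-round standard deviation construction can be sketched as below. The MSS and final formulas are only available as images, so two things here are assumptions: that MSS is the sum of squared deviations of a node's records from the global average, and that the central node computes a population standard deviation (dividing by the total record count). The function names are hypothetical.

```python
import math


def local_mss(values, global_mean):
    """Participating-node step: record count and sum of squared deviations from the global mean."""
    return len(values), sum((v - global_mean) ** 2 for v in values)


def central_std(reports):
    """Central-node step: population standard deviation from the pooled (count, MSS) pairs."""
    n = sum(c for c, _ in reports)
    ss = sum(s for _, s in reports)
    return math.sqrt(ss / n) if n else None


# Hypothetical values of column c inside the window, per node, for one sample.
window_values = {"m1": [100.0, 50.0], "m2": [30.0], "m3": [20.0]}

# Round 1 (average-type construction) yields the global mean.
global_mean = sum(sum(v) for v in window_values.values()) / sum(len(v) for v in window_values.values())

# Round 2: nodes report counts and MSS values; the central node combines them.
reports = [local_mss(v, global_mean) for v in window_values.values()]
std = central_std(reports)
```

Because the squared deviations are taken around the global mean rather than each node's local mean, the per-node values can simply be added at the central node.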
Each time sequence feature and its final feature value are constructed according to the construction method described above for its type.
For step S104, the central node issues the final feature value of each time sequence feature to each participating node. Each participating node trains a preliminary anti-money laundering identification model on the issued final feature values combined with the values of its own data items, and sends the resulting gradient information to the central node. The central node aggregates the gradient information sent by all participating nodes into joint gradient information and issues it to each participating node, so that each participating node iteratively updates its preliminary anti-money laundering identification model according to the joint gradient information to obtain the final anti-money laundering identification model. Anti-money laundering identification is then carried out with the finally trained model.
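The gradient exchange in step S104 can be illustrated with a toy example. The patent does not name the model type or the aggregation rule, so this sketch assumes a logistic regression model and a sample-weighted average of the node gradients; all function names, data values and hyperparameters are hypothetical.

```python
import math


def predict(weights, x):
    """Logistic model output for one feature vector."""
    z = sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def local_gradient(weights, samples):
    """Participating-node step: mean logistic-loss gradient over local (features, label) pairs."""
    grad = [0.0] * len(weights)
    for x, y in samples:
        p = predict(weights, x)
        for j, xi in enumerate(x):
            grad[j] += (p - y) * xi
    return [g / len(samples) for g in grad], len(samples)


def aggregate(reports):
    """Central-node step: sample-weighted average of the node gradients (joint gradient)."""
    total = sum(n for _, n in reports)
    dim = len(reports[0][0])
    return [sum(g[j] * n for g, n in reports) / total for j in range(dim)]


def train(node_data, dim, rounds=500, lr=0.5):
    """Iterate: nodes compute gradients, the central node aggregates, all nodes apply the update."""
    weights = [0.0] * dim
    for _ in range(rounds):
        reports = [local_gradient(weights, s) for s in node_data.values()]
        joint = aggregate(reports)
        weights = [w - lr * g for w, g in zip(weights, joint)]
    return weights


# Features: [bias term, scaled time sequence feature]; label 1 = suspicious (toy data).
node_data = {
    "m1": [([1.0, 0.1], 0), ([1.0, 0.9], 1)],
    "m2": [([1.0, 0.2], 0), ([1.0, 0.8], 1), ([1.0, 0.7], 1)],
}
weights = train(node_data, dim=2)
```

In this arrangement only gradients and the shared weights cross node boundaries, never the underlying samples.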
It should be noted that the central node and each participating node in the present invention can be understood as a server.
On the basis of the above method embodiments, the invention correspondingly provides device embodiments.
As shown in fig. 2, an embodiment of the present invention provides an anti-money laundering identification device based on horizontal federal learning, including: a feature alignment module, a sample synchronization module, a time sequence feature construction module and an anti-money laundering identification module;
the characteristic alignment module is used for performing characteristic alignment on each data characteristic in the sample data table of the plurality of participating nodes to generate a basic data characteristic for constructing an anti-money laundering model; each sample data table comprises a plurality of data samples, and each data sample is provided with a user ID and sample generation time;
the sample synchronization module is used for carrying out sample synchronization on the sample data table of each participating node according to the user ID and the sample generation time; when sample synchronization is carried out, the user ID and the sample generation time of a selected data sample in a current participating node are sent to the participating node which does not own the selected data sample but owns the data sample with the same user ID as the selected data sample;
the time sequence feature construction module is used for issuing a time sequence feature construction instruction to each participating node so that each participating node calculates a basic feature value of the time sequence feature to be constructed according to the statistical time dimension information contained in the time sequence feature construction instruction, the feature name of the required basic data feature and the calculation mode when receiving the time sequence feature construction instruction; calculating a final characteristic value of the time sequence characteristic based on a sample data table after sample synchronization according to each basic characteristic value;
and the anti-money laundering identification module is used for issuing the final feature value of the time sequence feature to each participating node, so that each participating node generates an anti-money laundering identification model through horizontal federal learning according to the final feature value of the time sequence feature and the feature values of its own data features, and carries out anti-money laundering identification according to the anti-money laundering identification model.
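The rule used by the sample synchronization module (forward a sample's user ID and sample generation time to nodes that lack that sample but hold other samples of the same user) can be sketched as follows. The data layout, the function name `synchronize` and the decision to show the routing as a single function are assumptions made for illustration.

```python
def synchronize(node_samples):
    """node_samples: node -> set of (user_id, sample_time) keys.

    Returns a new mapping in which every node that already holds some sample of a
    user also holds the keys of that user's samples found on other nodes.
    """
    synced = {m: set(keys) for m, keys in node_samples.items()}
    for src, keys in node_samples.items():
        for user_id, sample_time in keys:
            for dst, dst_keys in node_samples.items():
                if dst == src:
                    continue
                owns_user = any(u == user_id for u, _ in dst_keys)
                if owns_user and (user_id, sample_time) not in dst_keys:
                    synced[dst].add((user_id, sample_time))
    return synced


# Hypothetical sample keys per node before synchronization.
before = {
    "m1": {("ID1", "20201225")},
    "m2": {("ID1", "20201101"), ("ID2", "20201201")},
    "m3": {("ID3", "20201010")},
}
after = synchronize(before)
```

Here m1 and m2 end up with both ID1 samples, while m3, which holds no ID1 data, receives nothing.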
In a preferred embodiment, the feature alignment module performs feature alignment on each data feature in the sample data table of the plurality of participating nodes to generate a basic data feature for constructing an anti-money laundering model, and specifically includes: taking the feature intersection of each data feature in the sample data table of each participating node to obtain a plurality of first basic data features; calculating the global effective rate of each data characteristic except the first basic data characteristic one by one; taking the data characteristic with the global effective rate exceeding a first preset threshold value as a second basic data characteristic; and taking all the first basic data features and all the second basic data features as the basic data features for constructing the anti-money laundering model.
In a preferred embodiment, the feature alignment module calculates the global effective rate of a data feature by the following formula:
Figure BDA0002971293450000161
where g_r is the global effective rate of a data feature, M is the number of participating nodes, lr_m is the local effective rate of the data feature at the m-th participating node, and n_m is the number of data samples of the m-th participating node.
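The feature alignment procedure can be sketched as below. The formula for the global effective rate is only available as an image; this sketch assumes it is the sample-weighted average of the local effective rates, which is one natural reading of the variable definitions but is not confirmed by the text. Treating a node that lacks a feature as having a local rate of 0.0, and all names and numbers, are likewise assumptions.

```python
def global_effective_rate(local_rates, sample_counts):
    """Assumed formula: sum(lr_m * n_m) / sum(n_m) over all participating nodes."""
    total = sum(sample_counts)
    return sum(r * n for r, n in zip(local_rates, sample_counts)) / total


def align_features(node_features, local_rates, sample_counts, threshold):
    """Return the basic data features: the intersection plus features above the threshold.

    node_features: node -> set of feature names
    local_rates:   feature -> {node: local effective rate}
    sample_counts: node -> number of data samples
    """
    nodes = list(node_features)
    first = set.intersection(*(node_features[m] for m in nodes))      # first basic features
    others = set.union(*(node_features[m] for m in nodes)) - first
    second = set()
    for f in others:
        rates = [local_rates.get(f, {}).get(m, 0.0) for m in nodes]   # 0.0 if node lacks f
        counts = [sample_counts[m] for m in nodes]
        if global_effective_rate(rates, counts) > threshold:
            second.add(f)                                             # second basic features
    return first | second


node_features = {
    "m1": {"trx_amt", "age", "branch"},
    "m2": {"trx_amt", "age"},
    "m3": {"trx_amt", "fax"},
}
local_rates = {"age": {"m1": 0.9, "m2": 0.8}, "branch": {"m1": 0.5}, "fax": {"m3": 1.0}}
sample_counts = {"m1": 600, "m2": 300, "m3": 100}
basic_features = align_features(node_features, local_rates, sample_counts, threshold=0.6)
```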
It should be noted that the above device embodiments correspond to the method embodiments of the present invention and can implement any one of the anti-money laundering identification methods based on horizontal federal learning of the present invention. In addition, the described device embodiments are merely illustrative: the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. In addition, in the drawings of the device embodiments provided by the present invention, a connection between modules indicates that there is a communication connection between them, which may be implemented as one or more communication buses or signal lines. One of ordinary skill in the art can understand and implement it without inventive effort.
On the basis of the above device embodiments, the present invention correspondingly provides a system embodiment;
as shown in fig. 3, an embodiment of the present invention provides an anti-money laundering identification system based on horizontal federal learning, which includes a central node and a plurality of participating nodes; wherein, the central node comprises any one of the above mentioned anti-money laundering identification devices based on horizontal federal learning.
The embodiment of the invention has the following beneficial effects:
The embodiment of the invention performs feature alignment and sample synchronization on the data of each participating node and then combines the data of all participating nodes to construct the time sequence features. This avoids inaccurate time sequence features caused by incomplete data and thereby improves the accuracy of the anti-money laundering identification model. Horizontal federal learning also enlarges the number of samples, which further improves the accuracy of the constructed model. As a result, the constructed model can be used to carry out anti-money laundering identification more accurately.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims (7)

1. An anti-money laundering identification method based on horizontal federal learning is characterized by comprising the following steps:
performing feature alignment on each data feature in the sample data tables of the plurality of participating nodes to generate basic data features for constructing an anti-money laundering model; each sample data table comprises a plurality of data samples, and each data sample is provided with a user ID and sample generation time;
carrying out sample synchronization on the sample data table of each participating node according to the user ID and the sample generation time; when sample synchronization is carried out, the user ID and the sample generation time of a selected data sample in a current participating node are sent to the participating node which does not own the selected data sample but owns the data sample with the same user ID as the selected data sample;
issuing a time sequence feature construction instruction to each participating node, so that each participating node calculates a basic feature value of the time sequence feature to be constructed based on a sample data table after sample synchronization according to statistical time dimension information, a feature name and a calculation mode of the required basic data feature, which are contained in the time sequence feature construction instruction, when receiving the time sequence feature construction instruction; calculating a final characteristic value of the time sequence characteristic according to each basic characteristic value;
and issuing the final characteristic value of the time sequence characteristic to each participating node, so that each participating node generates an anti-money laundering identification model through horizontal federal learning according to the final characteristic value of the time sequence characteristic and the characteristic value of the data characteristic of the participating node, and performs anti-money laundering identification according to the anti-money laundering identification model.
2. The anti-money laundering identification method based on horizontal federal learning of claim 1, wherein the generating of the basic data features for constructing the anti-money laundering model by performing feature alignment on each data feature in the sample data table of the plurality of participating nodes specifically comprises:
taking the feature intersection of each data feature in the sample data table of each participating node to obtain a plurality of first basic data features;
calculating the global effective rate of each data characteristic except the first basic data characteristic one by one; taking the data characteristic with the global effective rate exceeding a first preset threshold value as a second basic data characteristic;
and taking all the first basic data features and all the second basic data features as the basic data features for constructing the anti-money laundering model.
3. The anti-money laundering identification method based on horizontal federal learning of claim 2, wherein the global effective rate of a data feature is calculated by the following formula:
Figure FDA0002971293440000021
where g_r is the global effective rate of a data feature, M is the number of participating nodes, lr_m is the local effective rate of the data feature at the m-th participating node, and n_m is the number of data samples of the m-th participating node.
4. An anti-money laundering recognition apparatus based on horizontal federal learning, comprising: the system comprises a feature alignment module, a sample synchronization module, a time sequence feature construction module and an anti-money laundering identification module;
the characteristic alignment module is used for performing characteristic alignment on each data characteristic in the sample data table of the plurality of participating nodes to generate a basic data characteristic for constructing an anti-money laundering model; each sample data table comprises a plurality of data samples, and each data sample is provided with a user ID and sample generation time;
the sample synchronization module is used for carrying out sample synchronization on the sample data table of each participating node according to the user ID and the sample generation time; when sample synchronization is carried out, the user ID and the sample generation time of a selected data sample in a current participating node are sent to the participating node which does not own the selected data sample but owns the data sample with the same user ID as the selected data sample;
the time sequence feature construction module is used for issuing a time sequence feature construction instruction to each participating node so that each participating node calculates a basic feature value of the time sequence feature to be constructed according to the statistical time dimension information contained in the time sequence feature construction instruction, the feature name of the required basic data feature and the calculation mode when receiving the time sequence feature construction instruction; calculating a final characteristic value of the time sequence characteristic based on a sample data table after sample synchronization according to each basic characteristic value;
and the anti-money laundering identification module is used for issuing the final characteristic value of the time sequence characteristic to each participating node, so that each participating node generates an anti-money laundering identification model through horizontal federal learning according to the final characteristic value of the time sequence characteristic and the characteristic value of the data characteristic of the participating node, and carries out anti-money laundering identification according to the anti-money laundering identification model.
5. The anti-money laundering identification device based on horizontal federal learning of claim 4, wherein the feature alignment module performs feature alignment on each data feature in the sample data table of a plurality of participating nodes to generate a basic data feature for constructing an anti-money laundering model, and specifically comprises:
taking the feature intersection of each data feature in the sample data table of each participating node to obtain a plurality of first basic data features;
calculating the global effective rate of each data characteristic except the first basic data characteristic one by one; taking the data characteristic with the global effective rate exceeding a first preset threshold value as a second basic data characteristic;
and taking all the first basic data features and all the second basic data features as the basic data features for constructing the anti-money laundering model.
6. The anti-money laundering identification device based on horizontal federal learning of claim 5, wherein the feature alignment module calculates the global effective rate of a data feature by the following formula:
Figure FDA0002971293440000031
where g_r is the global effective rate of a data feature, M is the number of participating nodes, lr_m is the local effective rate of the data feature at the m-th participating node, and n_m is the number of data samples of the m-th participating node.
7. An anti-money laundering recognition system based on horizontal federal learning, comprising: a central node and a plurality of participating nodes; wherein the central node comprises the horizontal federal learning based anti-money laundering identification device as claimed in any one of claims 4 to 6.
CN202110264163.0A 2021-03-11 2021-03-11 Anti-money laundering identification method, device and system based on horizontal federal learning Active CN113094407B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110264163.0A CN113094407B (en) 2021-03-11 2021-03-11 Anti-money laundering identification method, device and system based on horizontal federal learning

Publications (2)

Publication Number Publication Date
CN113094407A true CN113094407A (en) 2021-07-09
CN113094407B CN113094407B (en) 2022-07-19

Family

ID=76667016

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110264163.0A Active CN113094407B (en) 2021-03-11 2021-03-11 Anti-money laundering identification method, device and system based on horizontal federal learning

Country Status (1)

Country Link
CN (1) CN113094407B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CB03 Change of inventor or designer information

Inventor after: Wu Runpeng, Xin Zhiyun, Li Heng, Zhang Yan, Zou Jie
Inventor before: Wu Runpeng, Li Heng, Zhang Yan, Zou Jie

Inventor before: Zou Jie