CN112214207A - Design method based on distributed and big data anti-money laundering batch processing architecture - Google Patents

Design method based on distributed and big data anti-money laundering batch processing architecture

Info

Publication number
CN112214207A
CN112214207A (application number CN202011120633.8A)
Authority
CN
China
Prior art keywords
data
task
calculation
job
distributed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011120633.8A
Other languages
Chinese (zh)
Inventor
李�真
张荣燕
杨富安
王维龙
赵新浪
杨章春
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianyi Electronic Commerce Co Ltd
Original Assignee
Tianyi Electronic Commerce Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianyi Electronic Commerce Co Ltd filed Critical Tianyi Electronic Commerce Co Ltd
Priority to CN202011120633.8A priority Critical patent/CN112214207A/en
Publication of CN112214207A publication Critical patent/CN112214207A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/20Software design
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25Integrating or interfacing systems involving database management systems
    • G06F16/252Integrating or interfacing systems involving database management systems between a Database Management System and a front-end application
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/018Certifying business or products
    • G06Q30/0185Product, service or business identity fraud
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/10Services
    • G06Q50/26Government or public services

Abstract

The invention discloses a design method based on a distributed and big data anti-money laundering batch processing architecture. The invention has the following advantages and effects: KAFKA distributed publish-subscribe messaging loads source data with high throughput, solving the source-data loading performance problem; a distributed anti-money laundering batch computing application combined with the Spark in-memory compute engine delivers high-performance data calculation, solving the batch computing performance problem; the TiKV storage engine of the TiDB distributed database provides multi-replica, distributed, efficient Key-Value storage, solving the data query performance and data backup problems; and the architecture can horizontally scale the TiDB distributed database servers, application servers, Spark servers, and KAFKA servers, so system performance can be improved simply by adding machine resources.

Description

Design method based on distributed and big data anti-money laundering batch processing architecture
Technical Field
The invention relates to the technical field of computer software application, in particular to a design method based on a distributed and big data anti-money laundering batch processing architecture.
Background
As crimes such as telecommunications fraud, illegal fundraising, smuggling, drug trafficking, and even terrorism grow increasingly rampant, money laundering activity disrupts the social order ever more severely. Facing massive data, a financial institution that wants to find suspicious transactions in time, obtain valuable information clues, and block criminal behavior can hardly do so through manual analysis and identification; it must build a scientific and effective anti-money laundering system to support this work. In the prior art, however, building such a system faces the following problems: (1) as the data volume of financial institutions grows, traditional relational database management systems cannot meet the timeliness requirements of anti-money laundering batch calculation over huge data volumes, which delays the reporting of large-amount and suspicious transactions; (2) most traditional anti-money laundering batch computing architectures are single-machine and stateful, so they cannot tolerate faults efficiently, and once the system fails it must be shut down for repair; (3) facing massive-data calculation of large-amount suspicious transactions, customer money laundering risk rating, list backtracking, and the like, the compute nodes of such systems cannot scale horizontally, and single-node batch processing increasingly shows strain.
Disclosure of Invention
The technical problem to be solved by the invention is to overcome the defects of the prior art and provide a design method based on a distributed and big data anti-money laundering batch processing architecture.
In order to solve the technical problems, the invention provides the following technical scheme:
the invention relates to a design method based on a distributed and big data anti-money laundering batch processing architecture, which comprises a data loader, a data calculator, a data memory, an operation executor and an operation controller, wherein the data loader is responsible for loading and warehousing source data, and the main component of the data loader is a KAFKA cluster; the data calculator is responsible for data processing, cleaning, conversion, suspicious transaction calculation, rating calculation and list backtracking calculation, and the main component of the data calculator is a Spark cluster; the data storage is used for storing calculation results, and the main component of the data storage is a TiDB database server cluster; the operation executor is responsible for the cooperative calculation and monitoring of work tasks among the data loader, the data calculator and the data memory, and the main component of the operation executor is an anti-money laundering batch processing calculation application program; the job controller is responsible for task scheduling within the job executor, and its main component is an anti-money laundering batch scheduling application. The specific implementation steps of each part of the data loader, the data calculator, the data memory, the operation executor and the operation controller are as follows:
s11, distributed job controller: the job controller issues the day's job tasks at midnight every day. The job tasks include data loading, cleaning, conversion, suspicious-indicator, rule, and model calculation, suspicious-case calculation, suspicious-report calculation, risk-rating calculation, list backtracking calculation, and historical-data archiving. Dependencies between job tasks are of two kinds: date dependence, meaning a task does not run until the previous day's instance of that task has completed, and task dependence, meaning the current task does not run until its predecessor task of the same day has completed. The first task depends only on the source data being ready. Job task dependencies are set manually, and job tasks are dispatched by the job controller;
s12, distributed job executor: when a batch job task satisfies its execution dependencies, the job controller schedules the job executor to run it. While running, the job executor records the job execution state and logs; if a task fails, the error location can be pinpointed from the log information and the cause diagnosed through log analysis. The data loader, data calculator, and data storage cooperate with the job executor to complete batch job execution;
s13, source data: the source data file is a text file exported with a fixed separator. Batch job tasks depend on whether the source data file is ready, so a timed task can be set to detect at midnight every day whether the source data file has been generated: if it has, the task runs; if not, the system keeps waiting for the source data. The job controller periodically checks the file server for source-data generation; once the source data is generated, the job controller schedules the job executor to pull the source file from the file server to the job executor's local machine for loading into the database;
s14, distributed data loader: a single-node job executor pulls one source file at a time and reads its content into the KAFKA cluster. Once the job executor cluster detects messages on the KAFKA queue, it begins consuming the queue content and loading it into the database in a distributed manner. Multiple job executor nodes form a distributed deployment and can pull and load different source data files simultaneously;
s15, distributed data calculator: for massive-data calculation, the job executor invokes the Spark calculation interface through SparkLauncher. After the Spark cluster master node receives the scheduling information, the Apache Livy component deployed on the master node fetches the jar package to be executed from the job executor and submits it to the Spark cluster for calculation; when the calculation completes, the Ti-Spark component writes the Spark calculation results to the TiKV servers of the TiDB database;
s16, distributed data storage: task state and log information from the job scheduler and job executor are written directly to the TiKV cluster over a JDBC connection to the TiDB distributed database, while Spark calculation results are written to the TiKV cluster through the Ti-Spark component's connection to the TiDB distributed database.
Compared with the prior art, the invention has the following beneficial effects:
(1) the data loading part of the architecture adopts KAFKA distributed publish-subscribe messaging to load source data with high throughput, solving the source-data loading performance problem;
(2) the data calculation part adopts a distributed anti-money laundering batch computing application together with the Spark in-memory compute engine, achieving high-performance data calculation and solving the batch computing performance problem;
(3) the data storage part adopts the TiKV storage engine of the TiDB distributed database, providing multi-replica, distributed, efficient Key-Value storage, which solves the data query performance and data backup problems; combining the TiDB distributed database management system with the distributed anti-money laundering batch computing application enables large-amount suspicious transactions to be reported at T+1 (T being the transaction date), guaranteeing reporting timeliness;
(4) the anti-money laundering batch computing application, batch scheduling application, Spark calculation service, and TiDB distributed database service are all deployed in a distributed manner, so if any single machine fails, the anti-money laundering system continues to run without shutdown;
(5) with this architecture, the batch processing performance of the anti-money laundering system can be improved by horizontally scaling the KAFKA servers, application servers, Spark servers, and TiDB distributed database servers.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
FIG. 1 is a relational diagram of a data loader, data calculator, data storage, job executor, job controller of the present invention;
FIG. 2 is a diagram of the data loading architecture of the present invention;
FIG. 3 is a diagram of the data computing architecture of the present invention;
fig. 4 is a flow chart of the present invention.
Detailed Description
The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation.
Example 1
The core of the invention is its distributed architecture design, which enables high-performance batch computation over massive data: the distributed job controller, job executor, data loader, data calculator, and data storage together achieve fast source-data loading and efficient data calculation and storage.
The invention provides a design method based on a distributed database and big data anti-money laundering monitoring and analysis system architecture, comprising a data loader, a data calculator, a data storage, a job executor, and a job controller. The data loader is responsible for loading source data into the database; its main component is a KAFKA cluster. The data calculator is responsible for data processing, cleaning, conversion, suspicious-transaction calculation, rating calculation, list backtracking calculation, and the like; its main component is a Spark cluster. The data storage is responsible for storing calculation results (including intermediate and final result data); its main component is a TiDB database server cluster. The job executor is responsible for coordinating and monitoring work tasks across the data loader, data calculator, and data storage; its main component is an anti-money laundering batch computing application. The job controller is responsible for task scheduling within the job executor; its main component is an anti-money laundering batch scheduling application. The relationship among the data loader, data calculator, data storage, job executor, and job controller is shown in FIG. 1;
(1) a data loader: a source data file in text format is stored on a file server (or an sftp server). The job executor first pulls the source file to its own server, then reads the file content and assembles it into messages for the KAFKA cluster, acting at this point as a KAFKA producer. The job executor also listens for KAFKA queue messages, acting as a KAFKA consumer, and consumes the KAFKA queue data and writes it into TiDB. Because the job executor is deployed in a distributed manner, the source file content is produced and consumed by multiple nodes simultaneously during this transfer, so the source data file is loaded into the database quickly; transfer efficiency can be further improved by horizontally scaling the KAFKA servers. The data loading architecture is shown in FIG. 2;
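The parallel loading idea above can be sketched in plain JDK code (no Kafka dependency; class and method names are hypothetical): each record is assigned to a partition by hashing its key, the way a keyed Kafka producer would, so that several executor nodes can consume disjoint partitions and write to the database concurrently.

```java
import java.util.*;
import java.util.stream.*;

// Illustrative stdlib-only simulation of keyed fan-out across partitions.
// It is not the Kafka client API; it only demonstrates why multiple
// consumer nodes can load one file's records in parallel.
public class LoaderSketch {
    // Assign a record to a partition by hash of its key modulo partition count,
    // mirroring a keyed producer's default partitioning behavior.
    static int partitionFor(String key, int partitions) {
        return Math.floorMod(key.hashCode(), partitions);
    }

    // Group file lines by target partition; each group would be consumed
    // and written to the database by a different executor node.
    static Map<Integer, List<String>> fanOut(List<String> lines, int partitions) {
        Map<Integer, List<String>> byPartition = new HashMap<>();
        for (int i = 0; i < lines.size(); i++) {
            int p = partitionFor("txn-" + i, partitions);
            byPartition.computeIfAbsent(p, k -> new ArrayList<>()).add(lines.get(i));
        }
        return byPartition;
    }

    public static void main(String[] args) {
        List<String> lines = IntStream.range(0, 10)
                .mapToObj(i -> "record-" + i).collect(Collectors.toList());
        Map<Integer, List<String>> groups = fanOut(lines, 3);
        // Every line lands in exactly one partition, so the per-node
        // workloads are disjoint and can proceed simultaneously.
        int total = groups.values().stream().mapToInt(List::size).sum();
        System.out.println(groups.size() + " partitions, " + total + " records");
    }
}
```

In the real system the producer and consumer roles would use the Kafka client library against the KAFKA cluster; the sketch only captures the partitioning logic that makes horizontal scaling effective.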
(2) a data calculator: the data calculator processes and cleans source data and outputs standard interface data; computes large-amount and suspicious indicators, rules, and models to output suspected money laundering cases; computes risk-rating features to output customer money laundering risk levels; and performs backtracking against Dow Jones and political-figure watch lists to output list alerts, among other functions. The job executor first sends a calculation request to Spark; on learning of the request, the Apache Livy component deployed on the Spark cluster master node fetches the jar package to be executed from the job executor and submits it to the Spark cluster. When the Spark cluster finishes computing, the Ti-Spark component writes the Spark calculation results to the TiKV servers of the TiDB database. Because the job executor is deployed in a distributed manner, multiple nodes can request Spark calculation simultaneously, achieving efficient calculation with results stored in the database; computing capacity can be improved by horizontally scaling the Spark servers. The data computing architecture is shown in FIG. 3;
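To make the kind of rule the data calculator evaluates concrete, here is a minimal large-amount suspicious-transaction check in plain JDK streams. The class name, field names, and the threshold are illustrative assumptions, not values from the patent; the production system would run the equivalent aggregation as a Spark job.

```java
import java.util.*;
import java.util.stream.*;

// Sketch of a large-amount rule: flag customers whose total daily
// transaction amount exceeds a threshold. Amounts are in minor units
// (e.g. fen) to avoid floating point; all names here are hypothetical.
public class SuspiciousRuleSketch {
    static class Txn {
        final String customer;
        final long amountFen;
        Txn(String customer, long amountFen) {
            this.customer = customer;
            this.amountFen = amountFen;
        }
    }

    // Group transactions by customer, sum amounts, keep those over threshold.
    static Set<String> largeAmountCustomers(List<Txn> txns, long thresholdFen) {
        return txns.stream()
                .collect(Collectors.groupingBy(t -> t.customer,
                        Collectors.summingLong(t -> t.amountFen)))
                .entrySet().stream()
                .filter(e -> e.getValue() > thresholdFen)
                .map(Map.Entry::getKey)
                .collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        List<Txn> txns = Arrays.asList(
                new Txn("C1", 600_000L), new Txn("C1", 500_000L),
                new Txn("C2", 200_000L));
        // C1 totals 1,100,000 and exceeds the 1,000,000 threshold; C2 does not.
        System.out.println(largeAmountCustomers(txns, 1_000_000L));
    }
}
```

The same group-sum-filter shape maps directly onto a Spark DataFrame `groupBy`/`agg`/`filter` pipeline, which is why the rule scales by adding Spark servers.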
(3) a data storage: the job controller and job executor connect to the TiDB distributed database through JDBC, and the Ti-Spark component connects Spark to the TiKV component of the TiDB database for data storage. The TiDB database is deployed in a distributed manner and stores data in multiple replicas, so no separate database backup is needed, and multi-node storage improves query efficiency. Storage capacity can be increased by horizontally scaling the TiKV servers, and query efficiency can be improved by horizontally scaling the PD servers of the TiDB database;
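The Key-Value structure that the storage layer relies on can be illustrated with a tiny ordered-map sketch: in the spirit of how TiDB lays rows out on TiKV, each row key combines a table prefix and a row id, so one table's rows form a contiguous key range that can be scanned and split across servers. The string encoding below is a simplified assumption for illustration, not TiDB's actual binary codec.

```java
import java.util.*;

// Minimal Key-Value row store sketch: key = table prefix + zero-padded
// row id, value = encoded column data. An ordered map stands in for TiKV.
public class KvStoreSketch {
    private final NavigableMap<String, String> store = new TreeMap<>();

    // Zero-padding keeps lexicographic order equal to numeric row order.
    static String rowKey(long tableId, long rowId) {
        return String.format("t%d_r%019d", tableId, rowId);
    }

    void putRow(long tableId, long rowId, String encodedColumns) {
        store.put(rowKey(tableId, rowId), encodedColumns);
    }

    // Range scan of one table: all keys sharing the table prefix
    // (upper bound "_s" sorts just after every "_r..." key).
    Collection<String> scanTable(long tableId) {
        return store.subMap("t" + tableId + "_r", true,
                            "t" + tableId + "_s", false).values();
    }

    public static void main(String[] args) {
        KvStoreSketch kv = new KvStoreSketch();
        kv.putRow(42, 1, "alice|1000");
        kv.putRow(42, 2, "bob|2500");
        kv.putRow(43, 1, "other-table");
        System.out.println(kv.scanTable(42)); // only table 42's rows
    }
}
```

Because one table occupies one key range, adding TiKV servers lets the range be split and served by more nodes, which is the mechanism behind the horizontal-scaling claim above.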
(4) a job executor: the job executor is a Java program for batch computing that can be deployed in a distributed manner. It is mainly responsible for coordinating and monitoring work tasks across the data loader, data calculator, and data storage; its development mainly involves a Spring Boot + MyBatis + Dubbo + ZooKeeper technology stack;
(5) a job controller: the job controller is a Java program for batch scheduling that can be deployed in a distributed manner. It is mainly responsible for generating daily batch task instances, task scheduling management, task logging, and the like; its development mainly involves a Spring Boot + MyBatis + Dubbo + ZooKeeper technology stack.
The specific implementation steps are as follows:
s11, distributed job controller: the method comprises the following steps that an operation controller issues an operation task of the day at zero point of every day, the operation task comprises data loading, cleaning, conversion, suspicious indexes, rules, model calculation, suspicious case calculation, suspicious report calculation, risk rating calculation, list backtracking calculation, historical data filing and the like, the dependency relationship among the operation tasks has date dependence and task dependence, the task of the day is not executed and completed until the task of the day is executed and completed, the task of the day is executed until the task of the day is executed and not completed, the current task is executed until the task of the day is executed and completed, the current task is executed, the first task is prepared by relying on source data, the operation task dependency relationship (task execution sequence) is manually set, and the operation task is dispatched by the operation controller;
s12, distributed operation executor: when the batch processing job task meets the task execution dependency relationship, the job controller can schedule the job executor to execute the current job task, when the job executor works, the job execution state and the log can be recorded, if the task execution error is reported, the error place can be positioned through log information, the positioning reason can be assisted through log analysis, and the data loader, the data calculator and the data memory cooperate with the job executor to finish the batch processing job task execution;
s13, source data (file): the source data file is a text file derived according to a fixed separator (such as a comma separator), batch processing job tasks depend on whether the source data file is ready or not, a timing task can be set, whether the source data file is generated or not is detected at zero point every day, if the source data file is generated, the task is executed, if the source data file is not generated, the source data is continuously waited, a job controller can regularly go to a file server (such as an stfp server) to detect the generation condition of the source data, and if the source data is generated, the job controller can schedule a job executor to pull a source file from the file server to a local machine of the job executor to be loaded and put in storage;
s14, distributed data loader: the method comprises the steps that a single-node operation executor pulls a source file each time, the content of the source file is read to a KAFKA cluster, the KAFKA queue content starts to be consumed and loaded into a warehouse in a distributed mode after the KAFKA queue information is monitored by the operation executor cluster, and multiple node operation executors form distributed deployment and can pull different source data files and load the source data files into the warehouse at the same time;
s15, distributed data calculator: when mass data is calculated, a Spark calculation interface is called by a job executor through a Spark launcher, after a Spark cluster master node acquires scheduling information, an Apache Livy component deployed on the master node goes to the job executor to take a jar package program to be executed and submits the jar package program to Spark cluster calculation, and after the calculation is completed, a Ti-Spark component writes a calculation result of Spark into a TiDB database TiKV server;
s16, distributed data storage: the job scheduler, the job executor task state information, the log information and the like are directly connected with the TiDB distributed database through JDBC and written into the TiKV cluster for storage, and the Spark calculation result is connected with the TiDB distributed database through a Ti-Spark component and written into the TiKV cluster for storage.
Compared with the prior art, the method remedies the prior art's shortcomings in anti-money laundering batch computing performance and has the following beneficial effects:
(1) the data loading part of the architecture adopts KAFKA distributed publish-subscribe messaging to load source data with high throughput, solving the source-data loading performance problem;
(2) the data calculation part adopts a distributed anti-money laundering batch computing application together with the Spark in-memory compute engine, achieving high-performance data calculation and solving the batch computing performance problem;
(3) the data storage part adopts the TiKV storage engine of the TiDB distributed database, providing multi-replica, distributed, efficient Key-Value storage, which solves the data query performance and data backup problems; combining the TiDB distributed database management system with the distributed anti-money laundering batch computing application enables large-amount suspicious transactions to be reported at T+1 (T being the transaction date), guaranteeing reporting timeliness;
(4) the anti-money laundering batch computing application, batch scheduling application, Spark calculation service, and TiDB distributed database service are all deployed in a distributed manner, so if any single machine fails, the anti-money laundering system continues to run without shutdown;
(5) with this architecture, the batch processing performance of the anti-money laundering system can be improved by horizontally scaling the KAFKA servers, application servers, Spark servers, and TiDB distributed database servers.
Finally, it should be noted that: although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that changes may be made in the embodiments and/or equivalents thereof without departing from the spirit and scope of the invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (1)

1. A design method based on a distributed and big data anti-money laundering batch processing architecture, characterized by comprising a data loader, a data calculator, a data storage, a job executor, and a job controller, wherein the data loader is responsible for loading source data into the database, and its main component is a KAFKA cluster; the data calculator is responsible for data processing, cleaning, conversion, suspicious-transaction calculation, rating calculation, and list backtracking calculation, and its main component is a Spark cluster; the data storage is used for storing calculation results, and its main component is a TiDB database server cluster; the job executor is responsible for coordinating and monitoring work tasks across the data loader, data calculator, and data storage, and its main component is an anti-money laundering batch computing application; the job controller is responsible for task scheduling within the job executor, and its main component is an anti-money laundering batch scheduling application;
the specific implementation steps for the data loader, data calculator, data storage, job executor, and job controller are as follows:
s11, distributed job controller: the job controller issues the day's job tasks at midnight every day. The job tasks include data loading, cleaning, conversion, suspicious-indicator, rule, and model calculation, suspicious-case calculation, suspicious-report calculation, risk-rating calculation, list backtracking calculation, and historical-data archiving. Dependencies between job tasks are of two kinds: date dependence, meaning a task does not run until the previous day's instance of that task has completed, and task dependence, meaning the current task does not run until its predecessor task of the same day has completed. The first task depends only on the source data being ready. Job task dependencies are set manually, and job tasks are dispatched by the job controller;
s12, distributed job executor: when a batch job task satisfies its execution dependencies, the job controller schedules the job executor to run it. While running, the job executor records the job execution state and logs; if a task fails, the error location can be pinpointed from the log information and the cause diagnosed through log analysis. The data loader, data calculator, and data storage cooperate with the job executor to complete batch job execution;
s13, source data: the source data file is a text file exported with a fixed separator. Batch job tasks depend on whether the source data file is ready, so a timed task can be set to detect at midnight every day whether the source data file has been generated: if it has, the task runs; if not, the system keeps waiting for the source data. The job controller periodically checks the file server for source-data generation; once the source data is generated, the job controller schedules the job executor to pull the source file from the file server to the job executor's local machine for loading into the database;
s14, distributed data loader: a single-node job executor pulls one source file at a time and reads its content into the KAFKA cluster. Once the job executor cluster detects messages on the KAFKA queue, it begins consuming the queue content and loading it into the database in a distributed manner. Multiple job executor nodes form a distributed deployment and can pull and load different source data files simultaneously;
s15, distributed data calculator: for massive-data calculation, the job executor invokes the Spark calculation interface through SparkLauncher. After the Spark cluster master node receives the scheduling information, the Apache Livy component deployed on the master node fetches the jar package to be executed from the job executor and submits it to the Spark cluster for calculation; when the calculation completes, the Ti-Spark component writes the Spark calculation results to the TiKV servers of the TiDB database;
s16, distributed data storage: task state and log information from the job scheduler and job executor are written directly to the TiKV cluster over a JDBC connection to the TiDB distributed database, while Spark calculation results are written to the TiKV cluster through the Ti-Spark component's connection to the TiDB distributed database.

Publications (1)

Publication Number Publication Date
CN112214207A (en) 2021-01-12

Family

ID=74055879


Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112783927A (en) * 2021-01-27 2021-05-11 浪潮云信息技术股份公司 Database query method and system



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20210112