CN114741171A

CN114741171A - Data mining method and device, electronic equipment, medium and computer product

Info

Publication number: CN114741171A
Application number: CN202210350899.4A
Authority: CN
Inventors: 陈淑婷; 李鹏; 邓杨; 周琳琳; 曾垂鑫; 蔡朴锐; 廖清碧; 陈俊; 蒋日溪; 陈新添
Original assignee: China Construction Bank Corp
Current assignee: China Construction Bank Corp
Priority date: 2022-04-02
Filing date: 2022-04-02
Publication date: 2022-07-12

Abstract

The invention discloses a data mining method, a data mining device, electronic equipment, a medium and a computer product, and relates to the technical field of big data. The method comprises the following steps: acquiring a service type in a mining task; generating real-time online reasoning tasks or offline batch reasoning tasks corresponding to the mining tasks according to the service types; and acquiring data to be mined corresponding to the mining task in the data storage layer, and executing real-time online reasoning tasks or offline batch reasoning tasks based on the data to be mined to obtain a data mining result. The technical scheme of the invention can realize data mining for different types of services by adopting reasoning tasks with different timeliness, not only meets the requirement of monitoring the fund flow in time, but also ensures the normal development of the financial institution services, and solves the problem that the abnormal condition in the fund flow cannot be monitored in time due to the lag of the supervision mode.

Description

Data mining method and device, electronic equipment, medium and computer product

Technical Field

The embodiment of the invention relates to the technical field of big data, in particular to a data mining method, a data mining device, electronic equipment, a medium and a computer product.

Background

To meet the supervision requirements on fund flows, autonomous monitoring has become the basis of the fund supervision work of each financial institution.

Because financial transactions are currently frequent, generate huge volumes of transaction data, and involve relatively complex fund flows, supervising fund flows occupies a large amount of computing resources. To ensure the normal operation of financial institution business, supervision processing of fund flows is usually scheduled for periods when financial transactions are less active, and it is then determined whether the fund flows are abnormal. However, this supervision mode lags behind, so abnormal situations in the fund flow cannot be monitored in time.

Disclosure of Invention

The embodiment of the invention provides a data mining method, a data mining device, electronic equipment, a medium and a computer product, which can solve the problems that the conventional supervision mode is lagged and abnormal conditions in a fund flow cannot be monitored in time.

In a first aspect, an embodiment of the present invention provides a data mining method, including:

acquiring a service type in a mining task;

generating real-time online reasoning tasks or offline batch reasoning tasks corresponding to the mining tasks according to the service types;

and acquiring data to be mined corresponding to the mining task in the data storage layer, and executing the real-time online reasoning task or the offline batch reasoning task based on the data to be mined to obtain a data mining result.

In a second aspect, an embodiment of the present invention further provides a data mining apparatus, where the apparatus includes:

a mining task acquisition module, configured to acquire the service type in the mining task;

a reasoning task generation module, configured to generate, according to the service type, a real-time online reasoning task or an offline batch reasoning task corresponding to the mining task;

and a reasoning task execution module, configured to acquire the data to be mined corresponding to the mining task in the data storage layer, and to execute the real-time online reasoning task or the offline batch reasoning task based on the data to be mined to obtain a data mining result.

In a third aspect, an embodiment of the present invention further provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor executes the computer program to implement the data mining method according to any one of the embodiments of the present invention.

In a fourth aspect, the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the data mining method according to any one of the embodiments of the present invention.

In a fifth aspect, the present invention further provides a computer program product, including a computer program, which when executed by a processor implements the data mining method according to any one of the embodiments of the present invention.

In the embodiment of the invention, the real-time online reasoning task or the offline batch reasoning task is respectively generated by the service type in the mining task, and the real-time online reasoning task or the offline batch reasoning task is executed based on the data to be mined, so that the data mining of different types of services by adopting reasoning tasks with different timeliness is realized, the requirement of monitoring the fund flow in time is met, the normal development of the financial institution service is ensured, and the problem that the abnormal condition in the fund flow cannot be monitored in time due to the lag of the supervision mode is solved.

Drawings

To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present invention and therefore should not be regarded as limiting its scope; those skilled in the art can obtain other related drawings from these drawings without inventive effort.

Fig. 1 is a flowchart of a data mining method according to an embodiment of the present invention;

Fig. 2 is a flowchart of another data mining method according to an embodiment of the present invention;

Fig. 3 is a flowchart of another data mining method according to an embodiment of the present invention;

Fig. 4 is a flowchart of another data mining method according to an embodiment of the present invention;

Fig. 5 is a flowchart of another data mining method according to an embodiment of the present invention;

Fig. 6a is a block diagram of a data mining component according to an embodiment of the present invention;

Fig. 6b is a diagram of the functional dependency relationship between other components of a data mining process and a fund flow monitoring platform according to an embodiment of the present invention;

Fig. 7 is a schematic structural diagram of a data mining device according to an embodiment of the present invention;

Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not to be construed as limiting the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.

It should be noted that like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, in the description of the present invention, the terms "first", "second", and the like are used only to distinguish the descriptions, and are not to be construed as indicating or implying relative importance. In the technical solution of the present invention, the acquisition, storage, use, processing and the like of data comply with relevant national laws and regulations.

Fig. 1 is a flowchart of a data mining method according to an embodiment of the present invention. The method can be executed by a data mining device, the data mining device can be realized in a hardware and/or software mode, and the data mining device can be configured in electronic equipment. As shown in fig. 1, the method includes:

S110, acquiring the service type in the mining task.

The mining task is a data mining task configured by an authenticated user. Specifically, after a successful user login is detected, user role information is obtained according to the user identifier, and the user is authenticated according to the user role information. The conditions for executing data mining configured by the authenticated user are then acquired, and a mining task is generated when it is detected that the conditions are met.

The service type is the type of service data on which data mining is to be performed. For example, the service types may include public service and private service. Public service, namely a bank's corporate banking business, mainly includes electronic banking, corporate deposit business, credit business, institutional business, international business, fund clearing, asset recommendation, fund custody and the like. Private service, namely a bank's personal banking business, mainly includes personal savings, personal consumer credit, personal housing loans, personal investment and wealth management business, bank card business, POS transfer business, personal electronic banking and the like.

In one case, before acquiring the service type in the mining task, the method further comprises creating an inference service of a data mining model, and publishing a real-time online inference service to the API gateway or publishing an offline batch inference service to the task center. The data mining model is a model for performing data mining operations. Specifically, the data mining model is constructed based on a notebook programmable modeling environment or a visual drag-and-drop modeling environment. A notebook is essentially a Web application that makes it easy to create and share literate programming documents; it supports live code, mathematical equations, visualization and Markdown, and is mainly used for data cleaning and transformation, numerical simulation, statistical modeling, machine learning and the like.

The inference service refers to deploying a trained data mining model to a production environment so as to provide data mining functions there. Specifically, creating the inference service of the data mining model includes performing model conversion on the trained data mining model, and then creating the inference service with the inference parameters configured according to the inference requirements. The inference parameters include the model package name, the computing node specification, the number of computing nodes, environment variables and the like.

After the inference service is created, a real-time online inference service is published to the API gateway, or an offline batch inference service is published to the task center. The API gateway is the gateway in a microservice architecture and handles load balancing, caching, routing, access control, service proxying, monitoring, logging and the like. The task center is used for monitoring the various resource scheduling tasks of data mining, including batch prediction tasks, retraining tasks, model export tasks, Spark application logs and the like. The attribute information and the execution result of a resource scheduling task can be displayed through the task center. The attribute information includes the job name, project name, start time, end time, status information and the like.
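As a rough sketch of the creation and publication steps described above, the following Python models an inference service by its inference parameters and chooses where it is published. The class name `InferenceService`, the function `publish_target`, the mode strings and the example values are all hypothetical; the text names only the kinds of parameters and the two publication targets.

```python
from dataclasses import dataclass, field


@dataclass
class InferenceService:
    """Inference parameters of one service (all field names are illustrative)."""
    model_package: str          # model package name
    node_spec: str              # computing node specification
    node_count: int             # number of computing nodes
    env: dict = field(default_factory=dict)  # environment variables
    mode: str = "realtime_online"            # or "offline_batch"


def publish_target(service: InferenceService) -> str:
    """Return the component the service would be published to."""
    if service.mode == "realtime_online":
        return "api_gateway"    # real-time online services go to the API gateway
    if service.mode == "offline_batch":
        return "task_center"    # offline batch services go to the task center
    raise ValueError(f"unknown inference mode: {service.mode!r}")
```

Keeping the mode on the service object means the publication target can be derived from the service itself rather than passed separately.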

Illustratively, acquiring the service type in the mining task includes acquiring a service field in the mining task and determining the service type according to the field value of the service field.

S120, generating a real-time online reasoning task or an offline batch reasoning task corresponding to the mining task according to the service type.
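A minimal sketch of this lookup is given below. The field name `service_field` and the field values `"public"` and `"private"` are assumptions made for illustration; the text does not specify how the service field is encoded.

```python
from enum import Enum


class ServiceType(Enum):
    PUBLIC = "public"    # public (corporate) service
    PRIVATE = "private"  # private (personal) service


def get_service_type(mining_task: dict) -> ServiceType:
    """Determine the service type from the task's service field (hypothetical name)."""
    field_value = mining_task["service_field"]
    # ServiceType(value) raises ValueError when the field value is unknown.
    return ServiceType(field_value)
```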

The real-time online reasoning task is used for realizing online reasoning of the data to be mined through the online reasoning service to obtain a reasoning result. The real-time online reasoning tasks may be generated based on a real-time online reasoning service. Specifically, a set number of real-time online reasoning services are selected, and mining parameters in the mining tasks are configured to the set number of real-time online reasoning services to generate the real-time online reasoning tasks.

The offline batch reasoning task is used for realizing offline batch reasoning of the data to be mined through the offline batch reasoning service to obtain a reasoning result. The offline batch inference tasks may be generated based on an offline batch inference service. Specifically, a set number of offline batch reasoning services are selected, mining parameters in the mining tasks are configured to the set number of offline batch reasoning services, and the offline batch reasoning tasks are generated.

Illustratively, if the business type is the private business, the offline batch reasoning task corresponding to the mining task is generated based on the offline batch reasoning service. Specifically, mining parameters in the mining tasks are obtained, and the offline batch reasoning tasks corresponding to the mining tasks are generated based on the mining parameters and the offline batch reasoning services. The mining parameters comprise the parameters of regions, capital flow directions, time and the like.

If the service type is the public service, a real-time online reasoning task corresponding to the mining task is generated based on the real-time online reasoning service. Specifically, the mining parameters in the mining task are obtained, and the real-time online reasoning task corresponding to the mining task is generated based on the mining parameters and the real-time online reasoning service. The mining parameters include the region, the fund flow direction, the time and the like.
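The two branches above can be sketched as a single function that maps the service type to the kind of reasoning task. This is only an illustration: the `ReasoningTask` class, the service names and the string encoding of the service type are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReasoningTask:
    mode: str         # "offline_batch" or "realtime_online"
    service: str      # inference service the task is based on (hypothetical name)
    parameters: dict  # mining parameters: region, fund flow direction, time


def generate_reasoning_task(service_type: str, mining_parameters: dict) -> ReasoningTask:
    """Private service -> offline batch task; public service -> real-time online task."""
    params = dict(mining_parameters)  # copy so the task does not alias the caller's dict
    if service_type == "private":
        return ReasoningTask("offline_batch", "offline-batch-inference", params)
    if service_type == "public":
        return ReasoningTask("realtime_online", "realtime-online-inference", params)
    raise ValueError(f"unknown service type: {service_type!r}")
```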

S130, acquiring data to be mined corresponding to the mining tasks in the data storage layer, and executing the real-time online reasoning tasks or the offline batch reasoning tasks based on the data to be mined to obtain data mining results.

The data storage layer is used for managing business data. Specifically, the data storage layer includes the data warehouse tool Hive, object storage (COS), an Oracle database, a Massively Parallel Processing (MPP) database, and the like.

The data to be mined are business data meeting mining parameters in a mining task. For example, the data to be mined can be business data of which the business type, region, fund flow direction and time meet the requirements of mining parameters.

Exemplarily, the region information, the fund flow direction information and the time in the mining task are obtained; the corresponding data to be mined is acquired from the data storage layer according to the region information, the fund flow direction information and the time; and the real-time online reasoning task or the offline batch reasoning task is executed on a set cluster based on the data to be mined to obtain a data mining result.
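The selection of data to be mined can be pictured as a filter over stored business records. The sketch below assumes, purely for illustration, that each record is a dictionary with `region`, `flow_direction` and `date` keys and that the time parameter is a closed date range; none of these names come from the text.

```python
from datetime import date


def select_data_to_mine(records, region, flow_direction, start, end):
    """Keep the records whose region, fund flow direction and date match the mining parameters."""
    return [
        r for r in records
        if r["region"] == region
        and r["flow_direction"] == flow_direction
        and start <= r["date"] <= end
    ]
```

In a real deployment this filter would more likely be pushed down into a query against the data storage layer than applied in application code.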

Specifically, if the mining task is a stand-alone data mining task, the real-time online reasoning task or the offline batch reasoning task is executed on the container cluster Kubernetes based on the data to be mined to obtain a data mining result. Kubernetes, K8S for short, is a distributed architecture solution based on container technology and is a container cluster management system open-sourced by Google. K8S provides a unified access entry (an internal IP address and a DNS name) for multiple containers and load-balances across all associated containers. A Kubernetes cluster typically includes a Master node and a plurality of Node nodes, and a Node may be a physical machine or a virtual machine.

If the mining task is a distributed data mining task, the real-time online reasoning task or the offline batch reasoning task is executed on the distributed cluster Spark based on the data to be mined to obtain a data mining result. When a Spark application runs on a cluster, it consists of a set of independent processes that are coordinated by the SparkContext object in the driver program. The SparkContext can connect to several kinds of cluster managers (including Spark's own standalone cluster manager, Mesos or YARN). Once connected to the cluster manager, Spark acquires executors on the cluster nodes for the application to perform computing tasks and store data.

In the embodiment of the invention, the real-time online reasoning task or the offline batch reasoning task is respectively generated through the service type in the mining task, the data to be mined corresponding to the mining task in the data storage layer is obtained, and the real-time online reasoning task or the offline batch reasoning task is executed based on the data to be mined, so that the data mining of different types of services by adopting reasoning tasks with different timeliness is realized, the requirement of monitoring the fund flow in time is met, the normal development of the financial institution service is ensured, and the problem that the abnormal condition in the fund flow cannot be monitored in time due to the lag of the supervision mode is solved.

Fig. 2 is a flowchart of another data mining method according to an embodiment of the present invention, in which the feature of storing public service data in the data storage layer is added on the basis of the foregoing embodiment. As shown in fig. 2, the method includes:

S210, when the data acquisition condition is met, acquiring the public service data of the public component through the data acquisition component OGG, and sending the public service data to the distributed publish-subscribe messaging system Kafka.

The data acquisition condition may be a pre-configured trigger condition for collecting data from the source database. For example, the data acquisition condition may be scheduled acquisition, periodic acquisition, manually triggered acquisition, or the like.

The data acquisition component OGG (Oracle GoldenGate) is log-based structured data replication software. It can capture, transform and deliver large volumes of transaction data in real time, synchronize a source database with a target database, and keep the data latency at the sub-second level. It obtains the incremental changes of the data by parsing the online log or the archive log of the source database, and then applies these changes to the target database, thereby synchronizing the source database and the target database. OGG can replicate large volumes of data in real time at the sub-second level between heterogeneous IT infrastructures (including almost all common operating system platforms and database platforms), and can therefore be applied in many scenarios such as emergency systems, online reporting, real-time data warehouse provisioning, transaction tracking, data synchronization, centralization/distribution, disaster recovery, database upgrade and migration, and dual-service centers. OGG also supports a variety of flexible topologies, such as one-to-one, broadcast (one-to-many), aggregation (many-to-one), bidirectional, point-to-point, and cascade.

The public component is a component for realizing public service. Specifically, the common component includes a deposit component, a credit component, and the like.

The distributed publish-subscribe messaging system Kafka is an open source stream processing platform developed by the Apache Software Foundation and written in Scala and Java. Kafka is a high-throughput, distributed publish-subscribe messaging system that can handle all the activity stream data of consumers on a website. Such activity (web browsing, searching and other user actions) is a key factor in many social functions on the modern web. Because of throughput requirements, these data are usually handled through log processing and log aggregation. For systems such as Hadoop that handle log data and offline analysis but are also required to support real-time processing, Kafka is a viable solution. Kafka aims to unify online and offline message processing through Hadoop's parallel loading mechanism, and to provide real-time messages through a cluster.

Illustratively, the data acquisition component OGG uses an extraction process (Extract Process) to read the Online Redo Log or Archive Log of the source database, parses it, and extracts only the data change information, such as DML operations (insert, delete and update). It converts the extracted information into the intermediate format defined by GoldenGate and stores it in a queue file (trail file). A transmission process then sends the queue file (trail file) to the distributed publish-subscribe messaging system Kafka over TCP/IP.
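To make the extraction step concrete, the sketch below filters parsed log entries down to DML changes and converts them into a simple intermediate record list standing in for the trail file. It does not use or imitate the real GoldenGate formats or APIs: the entry layout (`op`, `table`, `row`) and the function name `extract_changes` are assumptions for illustration only.

```python
# DML operations whose changes are captured; other log entries are skipped.
DML_OPS = {"INSERT", "UPDATE", "DELETE"}


def extract_changes(log_entries):
    """Return intermediate-format records for the DML entries, in log order."""
    trail = []  # stands in for the queue (trail) file
    for entry in log_entries:
        if entry["op"] not in DML_OPS:
            continue  # ignore anything that is not a data change
        trail.append({"table": entry["table"], "op": entry["op"], "row": entry["row"]})
    return trail
```

Because only the log is read, a process of this kind places no query load on the source tables themselves, which is the property the embodiment relies on.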

S220, processing the public service data in Kafka through the streaming computing engine Flink according to the supervision submission requirement, and storing the processed public service data in the data storage layer through Kafka according to region and fund flow direction.

The supervision submission requirement is the requirement of the supervision department on the data submitted by the financial institution. The supervision submission requirement is analyzed, summarized and otherwise processed to obtain a supervision calculation rule, and the supervision calculation rule is configured in the streaming computing engine Flink.

The streaming computing engine Flink is an open source big data computing engine that supports both batch processing and stream processing and can be used for event-based applications. First, Flink is a pure streaming computing engine whose underlying data model is the data stream. A stream may be unbounded, which is stream processing in the usual sense, or it may be bounded, which corresponds to batch processing. Flink therefore supports both stream processing and batch processing with a single architecture. Second, one advantage of Flink is its support for stateful computation. Processing is said to be stateless if the result of processing an event (or a piece of data) depends only on the content of the event itself; conversely, if the result also depends on previously processed events, the processing is said to be stateful. Even slightly more complex data processing, such as basic aggregation or the association between data streams, is stateful. In the field of supervision reporting, unbounded data can be understood as transaction flow data that never terminates: the data is generated at any time and in any place, its volume is huge, it must be processed promptly and effectively, and supervision requirements demand that it be reported in time. The embodiment of the present application therefore takes full account of the characteristics of this application scenario and uses the streaming computing engine Flink to perform real-time calculation on the data in the distributed publish-subscribe messaging system Kafka based on the supervision calculation rule, so as to achieve real-time processing. The processed public service data is then grouped by the dimensions of region and fund flow direction, and each group of service data is sent through the distributed publish-subscribe messaging system Kafka to the data storage layer for storage.
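The difference between stateless and stateful processing described above can be shown with a few lines of plain Python. This is not Flink code; the record keys `account` and `amount`, the threshold check and the class name `RunningTotal` are hypothetical and serve only to illustrate the two notions.

```python
from collections import defaultdict


def is_large(txn: dict, threshold: float) -> bool:
    """Stateless: the result depends only on the event itself."""
    return txn["amount"] >= threshold


class RunningTotal:
    """Stateful: the result also depends on previously processed events."""

    def __init__(self):
        self.totals = defaultdict(float)  # per-account state kept between events

    def process(self, txn: dict) -> float:
        self.totals[txn["account"]] += txn["amount"]
        return self.totals[txn["account"]]
```

A streaming engine such as Flink manages the equivalent of `totals` on behalf of the application, so that such state can survive failures and be partitioned by key.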

S230, acquiring the service type in the mining task.

S240, generating a real-time online reasoning task or an offline batch reasoning task corresponding to the mining task according to the service type.

S250, acquiring data to be mined corresponding to the mining tasks in the data storage layer, and executing the real-time online reasoning tasks or the offline batch reasoning tasks based on the data to be mined to obtain data mining results.

In the embodiment of the invention, the data acquisition component OGG collects the public service data generated in real time. Because the collection is based on logs rather than on reading data from the database, it does not increase the processing pressure on the database; and the streaming computing engine Flink ensures that the service data is processed in real time, so that real-time transactions are not affected.

Fig. 3 is a flowchart of another data mining method according to an embodiment of the present invention, in which the feature of storing private business data in the data storage layer is added on the basis of the foregoing embodiment. As shown in fig. 3, the method includes:

S310, when the data acquisition condition is met, acquiring private business data of the personal payment settlement component at regular intervals through the data transmission component NFT, and transmitting the private business data to a batch processing calculation area.

The data transmission component NFT (Network File Transfer) refers to the process of receiving or sending files or data over a local or global network (e.g., the Internet) using the native transfer protocols of the network. Private business data such as personal settlement business data can be collected through the data transmission component NFT. A transfer can be started in two different ways: a pull-based transfer is started when the receiver initiates a download or transfer, and a push-based transfer is started when the sender initiates the transfer. The transmission speed depends mainly on the capacity of the network and, to a lesser extent, on the protocol used. A transfer may take place explicitly on the network, for example through FTP, or when a user requests a particular web site or downloads a file via HTTP. It may also occur transparently, for example when a node communicates with other nodes without actively notifying the user of the transfer. The protocol used for the transfer describes the conventions for how the transfer is carried out between two endpoints, such as how the bit stream that makes up the file and other relevant metadata (e.g., file name, size and timestamp) are sent, and any headers needed for the transfer to begin or complete successfully.

The personal payment settlement component is a component that implements the personal payment settlement service. For example, the personal payment settlement component includes a deposit component, a credit card component, a credit component, and the like.

The batch processing calculation area is a database that processes calculation tasks in parallel. For example, the batch processing calculation area may be implemented based on a Greenplum database. Greenplum mainly consists of a Master node, Segment nodes and an interconnect. The Greenplum master is the entry point to the Greenplum database system: it accepts SQL statements and distributes the workload to the other database instances (segment instances), which store and process the data. The Greenplum interconnect is responsible for communication between the different PostgreSQL instances. Greenplum segments are independent PostgreSQL databases, each storing a portion of the data, and most query processing is done by the segments. The Master node does not store any user data; it only performs access control for clients and stores the metadata of the table distribution logic. The Segment nodes are responsible for data storage, and the distribution key can be tuned to make full use of the IO performance of the Segment nodes and thereby scale the IO performance of the whole cluster.
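The role of the distribution key can be illustrated with a toy model in which each row is assigned to a segment by hashing the key. This is a sketch only: it uses CRC32 as a stand-in hash rather than Greenplum's own hash function, and the function names and row layout are hypothetical.

```python
import zlib


def segment_for(distribution_key: str, num_segments: int) -> int:
    """Map a distribution key to a segment index with a deterministic hash."""
    return zlib.crc32(distribution_key.encode("utf-8")) % num_segments


def distribute(rows, key, num_segments):
    """Assign each row to a segment according to the value of its distribution key."""
    segments = {i: [] for i in range(num_segments)}
    for row in rows:
        segments[segment_for(str(row[key]), num_segments)].append(row)
    return segments
```

A key with many distinct values tends to spread rows evenly, so every segment contributes IO; a key with few distinct values concentrates rows on a few segments and leaves the rest idle.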

Illustratively, when it is detected that the data acquisition time is met or the time interval from the last data acquisition meets the set interval threshold, the data transmission component NFT acquires the private business data in the personal payment settlement component, and sends the private business data to the batch processing calculation area.
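The trigger described above is a disjunction of two checks, which can be sketched as follows. The function and parameter names are hypothetical, and the use of `datetime` and `timedelta` values is an assumption about how the acquisition time and interval threshold would be represented.

```python
from datetime import datetime, timedelta


def should_collect(now: datetime, scheduled_time: datetime,
                   last_collected: datetime, interval_threshold: timedelta) -> bool:
    """Collect when the scheduled time has arrived or enough time has passed since the last run."""
    time_reached = now >= scheduled_time
    interval_reached = (now - last_collected) >= interval_threshold
    return time_reached or interval_reached
```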

S320, processing the private business data through the batch processing calculation area according to the supervision submission requirement, and storing the processed private business data in the data storage layer according to region and fund flow direction.

The supervision submission requirement is a requirement of a supervision department on the submission data of the financial institution, a series of processing such as analysis, summarization and the like is carried out on the supervision submission requirement to obtain a supervision calculation rule, and the supervision calculation rule is configured in the batch processing calculation area.

Illustratively, the private business data sent by the data transmission component NFT is processed by the batch processing calculation area based on the supervisory calculation rule, the processed private business data are grouped according to the dimensions of the region and the capital flow direction, and each group of the processed private business data is stored in the data storage layer.
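The grouping by region and fund flow direction can be sketched as below. The record keys `region` and `flow_direction` are hypothetical, and the application of the supervision calculation rule is left out because the text does not describe its content.

```python
from collections import defaultdict


def group_by_region_and_flow(records):
    """Group processed records by the (region, fund flow direction) pair."""
    groups = defaultdict(list)
    for record in records:
        groups[(record["region"], record["flow_direction"])].append(record)
    return dict(groups)
```

Each resulting group could then be written to the data storage layer as one unit, which matches the per-group storage described above.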

S330, acquiring the service type in the mining task.

S340, generating a real-time online reasoning task or an offline batch reasoning task corresponding to the mining task according to the service type.

S350, acquiring data to be mined corresponding to the mining tasks in the data storage layer, and executing the real-time online reasoning tasks or offline batch reasoning tasks based on the data to be mined to obtain data mining results.

In the embodiment of the invention, the data transmission component NFT acquires the private business data of the personal payment settlement component at regular intervals and transmits it to the batch processing calculation area; the batch processing calculation area processes the private business data according to the supervision submission requirement and stores the processed private business data in the data storage layer according to region and fund flow direction, thereby meeting the need for batch processing of private business data with low timeliness requirements.

Fig. 4 is a flowchart of another data mining method according to an embodiment of the present invention, in which the feature of training a data mining model is added on the basis of the foregoing embodiment. As shown in fig. 4, the method includes:

S401, matching the data set authority control information with a user role and a project name, and determining a target data set that matches the user role and the project name.

The data set authority control information is used for representing authority control information of each data set according to user roles and items. The requirement of data compliance is met through the authority control on the data set. In addition, the data in the data set can be classified and managed according to the data types, and the data can be labeled or subjected to data sharing processing according to model training requirements. The data in the data set may be data uploaded locally to the object store or data from a large data platform within the system. The data set provides basic data needed for training for machine learning.

Exemplarily, user role information is obtained according to a user identifier, and a project name is obtained. The user role information and the project name are matched against the data set authority control information to determine the data sets that the current user may use in the current project, which are recorded as the target data set.
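The matching in S401 can be sketched as a lookup over per-data-set rules. The rule layout (a set of allowed roles and a set of allowed projects per data set), the data set names, and the function name are hypothetical; the embodiment only states that authority is controlled by user role and project.

```python
def find_target_datasets(permission_info, user_role, project_name):
    """Return the names of data sets usable by this role within this project."""
    return sorted(
        name
        for name, rule in permission_info.items()
        if user_role in rule["roles"] and project_name in rule["projects"]
    )


# Hypothetical data set authority control information.
PERMISSIONS = {
    "corporate_transfers": {
        "roles": {"analyst", "admin"},
        "projects": {"fund_flow_monitoring"},
    },
    "personal_settlements": {
        "roles": {"admin"},
        "projects": {"fund_flow_monitoring"},
    },
}

targets = find_target_datasets(PERMISSIONS, "analyst", "fund_flow_monitoring")
```

A role or project that appears in no rule yields an empty list, so a user without authority has no data set available for training.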

S402, training a data mining model based on the target data set and the set data mining framework.

The set data mining framework may be a currently mainstream data mining framework. For example, the set data mining framework includes scikit-learn, XGBoost, LightGBM, Spark MLlib, and the like.

In particular, a data mining model may be trained based on the target data set using a Notebook modeling environment. The Notebook modeling environment provides the above data mining frameworks, includes mainstream data mining algorithms, has various images built in, and provides a rich SDK (Software Development Kit).

In one case, the trained data mining model automatically performs model management, and the user can compare different models in the same experiment and view model meta information of the model. The model management defines a set of model meta-information composition standards, which comprise a model file model.

S403, creating an inference service of the data mining model, and issuing a real-time online inference service to the API gateway, or issuing an offline batch inference service to the task center.

S404, acquiring a service field in the mining task, and determining a service type according to the field value of the service field, wherein the service type comprises a private service and a public service.

S405, if the service type is a private service, generating an offline batch reasoning task corresponding to the mining task based on the offline batch reasoning service.

S406, if the service type is the public service, generating a real-time online reasoning task corresponding to the mining task based on the real-time online reasoning service.

S407, acquiring region information, fund flow direction information and time in the mining task.

S408, acquiring corresponding data to be mined from the data storage layer according to the region information, the fund flow direction information and the time.

S409, if the mining task is a stand-alone data mining task, executing the real-time online reasoning task or the offline batch reasoning task through a Kubernetes container cluster based on the data to be mined to obtain a data mining result.

S410, if the mining task is a distributed data mining task, executing the real-time online reasoning task or the offline batch reasoning task through a Spark distributed cluster based on the data to be mined to obtain a data mining result.
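The two decisions made in S404 to S406 and in S409 to S410 can be summarized in a small sketch: the service field selects the reasoning mode, and the task scale selects the cluster. The field values `"private"` and `"public"`, the `MiningTask` class, and the returned labels are assumptions for illustration, and no Kubernetes or Spark API is called here.

```python
from dataclasses import dataclass


@dataclass
class MiningTask:
    service_field: str  # hypothetical field values: "private" or "public"
    distributed: bool   # True for a distributed data mining task


def plan_inference(task):
    """Return (reasoning mode, execution cluster) for a mining task."""
    if task.service_field == "private":
        mode = "offline_batch"      # private service: low timeliness requirement
    elif task.service_field == "public":
        mode = "realtime_online"    # public service: real-time monitoring
    else:
        raise ValueError(f"unknown service field value: {task.service_field!r}")
    cluster = "spark" if task.distributed else "kubernetes"
    return mode, cluster


plan = plan_inference(MiningTask(service_field="public", distributed=False))
```

The two choices are independent, so either reasoning mode can run on either cluster, as the steps above allow.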

In the embodiment of the invention, the user role and the project name are matched against the data set authority control information to determine the target data set, and the data mining model is trained based on the target data set and the set data mining framework. This realizes authority control over the data sets required for training the data mining model and meets the requirement of data compliance.

Fig. 5 is a flowchart of another data mining method according to an embodiment of the present invention. On the basis of the foregoing embodiments, this embodiment adds the feature of constructing a directed acyclic graph of the data mining model. As shown in fig. 5, the method includes:

S501, obtaining dragging operation information and parameter configuration information of the components in the graphical interface.

The dragging operation information represents the operations performed on components in the graphical interface. For example, the dragging operation information may include the dragged component, the end position of the dragging operation, and the like. The parameter configuration information represents the parameters configured for the dragged component.

Illustratively, in a drag-and-drop visual modeling environment, various components are displayed in a toolbar of the graphical interface, and a user selects a component and drags it to the model creation interface. The dragging operation from a component in the toolbar to the model creation interface is detected, the corresponding component is displayed at the end position of the dragging operation in the model creation interface, and the parameter configuration information of each component in the model creation interface is acquired from the user.

Specifically, displaying the corresponding component at the end position of the dragging operation in the model creation interface includes: taking the end position of the dragging operation in the model creation interface as the center of a vertex in the directed acyclic graph, and connecting the vertices through edges. In one case, vertices in the directed acyclic graph represent events, edges between the vertices represent activities, and parameter configuration information of the activities is obtained.

S502, constructing a directed acyclic graph of the data mining model according to the dragging operation information and the parameter configuration information through the vertex linking algorithm component.

The vertex link algorithm component is a component that encapsulates an algorithm for linking the vertices of the directed acyclic graph.

A graph is composed of a finite, non-empty set of vertices and a set of edges between the vertices, usually expressed as G(V, E), where G represents the graph, V is the set of vertices in G, and E is the set of edges in G. Graphs are classified into undirected graphs and directed graphs according to whether the edges have direction. If, in a directed graph, it is not possible to start from any vertex and return to that vertex through a sequence of edges, then the graph is a directed acyclic graph (DAG).

Illustratively, a directed acyclic graph of the data mining model is constructed by the vertex link algorithm component based on each component in the model creation interface and the corresponding parameter configuration information. Specifically, modeling requirements can be met by dragging the machine learning algorithms built into the visual modeling environment.
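As a rough sketch of S502, the components placed on the canvas can be treated as vertices and the links between them as directed edges, with an acyclicity check before the graph is accepted. The component names, the parameter dictionaries, and the `build_dag` function are hypothetical, and the check uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later) rather than any algorithm disclosed in the embodiment.

```python
from graphlib import CycleError, TopologicalSorter


def build_dag(components, links):
    """Build a predecessor map from dragged components and their links.

    components: dict mapping component name -> parameter configuration
    links: iterable of (source, target) edges between component names
    Returns (predecessors, execution_order); raises ValueError on a cycle.
    """
    predecessors = {name: set() for name in components}
    for source, target in links:
        if source not in components or target not in components:
            raise KeyError(f"link references an unknown component: {source} -> {target}")
        predecessors[target].add(source)
    try:
        # static_order() raises CycleError if the edges contain a cycle.
        order = list(TopologicalSorter(predecessors).static_order())
    except CycleError as exc:
        raise ValueError("links form a cycle, so the graph is not a DAG") from exc
    return predecessors, order


# Hypothetical components and parameter configuration from the canvas.
components = {
    "read_data": {"dataset": "corporate_transfers"},
    "split_data": {"test_ratio": 0.2},
    "train_model": {"algorithm": "gradient_boosting"},
}
links = [("read_data", "split_data"), ("split_data", "train_model")]
graph, order = build_dag(components, links)
```

The topological order doubles as a valid execution order for the workflow, since every component appears after all of the components it depends on.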

S503, establishing inference service of the data mining model, and issuing real-time online inference service to the API gateway, or issuing offline batch inference service to the task center.

S504, acquiring a service field in the mining task, and determining a service type according to a field value of the service field, wherein the service type comprises a private service and a public service.

S505, if the service type is a private service, generating an offline batch reasoning task corresponding to the mining task based on the offline batch reasoning service.

S506, if the service type is the public service, generating a real-time online reasoning task corresponding to the mining task based on the real-time online reasoning service.

S507, acquiring region information, fund flow direction information and time in the mining task.

S508, acquiring corresponding data to be mined from the data storage layer according to the region information, the fund flow direction information and the time.

S509, if the mining task is a stand-alone data mining task, executing the real-time online reasoning task or the offline batch reasoning task through a Kubernetes container cluster based on the data to be mined to obtain a data mining result.

S510, if the mining task is a distributed data mining task, executing the real-time online reasoning task or the offline batch reasoning task through a Spark distributed cluster based on the data to be mined to obtain a data mining result.
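The retrieval in S507 and S508 can be sketched as a keyed lookup followed by a time filter. The dictionary keyed by (region, fund flow direction), the `time` field, and the half-open time window are assumptions; the embodiment does not state how the time in the mining task is expressed.

```python
from datetime import datetime


def fetch_data_to_mine(storage, region, fund_flow, start, end):
    """Return records for one region and fund flow direction with start <= time < end."""
    rows = storage.get((region, fund_flow), [])
    return [row for row in rows if start <= row["time"] < end]


# Hypothetical in-memory stand-in for the data storage layer.
storage = {
    ("north", "outbound"): [
        {"time": datetime(2022, 4, 1, 9, 0), "amount": 120.0},
        {"time": datetime(2022, 4, 2, 9, 0), "amount": 45.5},
    ],
    ("north", "inbound"): [
        {"time": datetime(2022, 4, 1, 10, 0), "amount": 80.0},
    ],
}

selected = fetch_data_to_mine(
    storage, "north", "outbound",
    start=datetime(2022, 4, 1), end=datetime(2022, 4, 2),
)
```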

In the embodiment of the invention, the dragging operation information and the parameter configuration information of the components in the graphical interface are acquired, and the directed acyclic graph of the data mining model is created from them through the vertex link algorithm component, so that the data mining model is constructed visually.

In an optional implementation, the data mining method provided by the embodiment of the present invention is implemented by a data mining component. Fig. 6a is a block diagram of a data mining component according to an embodiment of the present invention. As shown in fig. 6a, the data mining component 600 includes a user interface layer 610, a component service layer 620, and a computing resource layer 630. The user interface layer 610 includes a Notebook integrated development environment, a drag-and-drop workflow development environment, a console, and the like. The component service layer 620 includes a data set management module, an experiment management module, a model management module, a Notebook management module, a workflow management module, an inference service management module and a retraining task management module. The computing resource layer 630 includes a data storage layer. The user interface layer 610 submits data mining tasks to the component service layer 620. The component service layer 620 sends information to the computing resource layer 630, such as Notebook creation, workflow run submission, inference service and task publishing, or retraining task submission. The data storage layer is used to store model files and to provide data in response to data access requests. It mainly provides access to training data and offline inference data, supports various heterogeneous data sources, and provides a standard data interface for model training and offline batch reasoning through the SDK.

Fig. 6b is a diagram of functional dependency relationship between the data mining process and other components of the fund flow monitoring platform according to the embodiment of the present invention, and as shown in fig. 6b, the data mining module includes a data set management module, a Notebook management module in model development, a workflow management module in model development, a model management module, an inference service module, and a task center.

The data set management module supports uploading local data to object storage, and also supports adding existing data from a big data platform to a data set, so that the basic data required for training is provided for machine learning. The data set supports data access authority control according to user roles and projects, is classified and managed according to data type, and supports data labeling and data sharing functions.

The Notebook management module in model development provides mainstream data mining frameworks, such as scikit-learn, XGBoost, LightGBM, Spark MLlib and the like, covers mainstream algorithms for the full data mining flow, has various images built in, and provides a rich SDK.

The workflow management module in model development provides a way to construct a model DAG through graphical interface dragging, algorithm component linking, parameter configuration and the like. Rich machine learning algorithms are built in, so that modeling requirements can be met.

The model management module defines a set of model meta-information composition standards and manages the models registered through model development. A user can compare different models under the same experiment and view the meta-information of the models.

The reasoning service module publishes services externally through the inference services of the model, including real-time online reasoning and offline batch reasoning. The real-time online reasoning service supports automatic scaling and realizes high-concurrency real-time access. The offline batch reasoning service supports offline prediction on massive data and automatically splits and merges the data.
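The splitting and merging attributed to the offline batch reasoning service can be illustrated with a minimal single-process sketch: the input is split into fixed-size chunks, each chunk is scored, and the per-chunk results are merged in the original order. The function names, the chunk size, and the threshold rule standing in for a trained model are all assumptions; a real deployment would distribute the chunks across a cluster.

```python
def batch_predict(rows, predict_chunk, chunk_size):
    """Split rows into chunks, score each chunk, and merge results in input order."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    merged = []
    for start in range(0, len(rows), chunk_size):
        chunk = rows[start:start + chunk_size]
        merged.extend(predict_chunk(chunk))
    return merged


def flag_large_amounts(chunk):
    """Hypothetical stand-in for a trained model: flag amounts above 100."""
    return [amount > 100 for amount in chunk]


results = batch_predict([50, 150, 99, 101, 100], flag_large_amounts, chunk_size=2)
```

Because each chunk is scored independently, the merged output has one result per input row regardless of the chunk size chosen.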

The task center is mainly used for monitoring the various resource scheduling tasks of data mining, including batch prediction tasks, retraining tasks, model export tasks, Spark application logs and the like. The task center allows the status and output of each task to be viewed visually, helping users manage tasks efficiently.

Specifically, the data set management module is connected with the data management component to add data sets. The Notebook management module or the workflow management module in model development reads data through the data service or object storage for modeling. In single-machine operation mode, tasks such as model development, inference service or retraining run on K8S. In cluster operation mode, tasks such as model development, inference service or retraining run on Spark. The model management module is connected with model development, which performs modeling and generates the model. The reasoning service module is connected with the model management module and is used for creating inference services. Online prediction services are published to the API gateway. Batch prediction services are published to the batch prediction management of the task center. The batch prediction management is connected with the data development component. The retraining task management of the task center periodically sends retraining tasks to the model management module so that the model is updated regularly.

Fig. 7 is a schematic structural diagram of a data mining device according to an embodiment of the present invention. The data mining device may be implemented in hardware and/or software, and may be configured in an electronic device. As shown in fig. 7, the apparatus includes: a mining task obtaining module 710, an inference task generating module 720 and an inference task executing module 730.

A mining task obtaining module 710, configured to obtain a service type in a mining task;

the inference task generating module 720 is configured to generate a real-time online inference task or an offline batch inference task corresponding to the mining task according to the service type;

and the inference task execution module 730 is configured to acquire the data to be mined corresponding to the mining task in the data storage layer, and execute the real-time online inference task or the offline batch inference task based on the data to be mined to obtain a data mining result.

Optionally, the apparatus further comprises:

and the service issuing module is used for establishing inference service of the data mining model before acquiring the service type in the mining task and issuing real-time online inference service to the API gateway or issuing offline batch inference service to the task center.

Optionally, the mining task obtaining module 710 is specifically configured to perform:

and acquiring a service field in the mining task, and determining a service type according to the field value of the service field, wherein the service type comprises a private service and a public service.

Optionally, the inference task generating module 720 is specifically configured to perform:

if the service type is a private service, generating an offline batch reasoning task corresponding to the mining task based on the offline batch reasoning service;

and if the service type is the public service, generating a real-time online reasoning task corresponding to the mining task based on the real-time online reasoning service.

Optionally, the apparatus further comprises:

the model training module is used for, before the inference service of the data mining model is created, matching the user role and the project name against data set authority control information to determine a target data set matched with the user role and the project name; and training a data mining model based on the target data set and a set data mining framework.

Optionally, the apparatus further comprises:

the graph building module is used for acquiring dragging operation information and parameter configuration information of components in the graphical interface before reasoning service of the data mining model is established; and constructing a directed acyclic graph of the data mining model according to the dragging operation information and the parameter configuration information through a vertex link algorithm component.

Optionally, the apparatus further comprises:

the public service data storage module is used for acquiring the public service data of the public component through the data acquisition component OGG when the data acquisition condition is met, and sending the public service data to the distributed publish-subscribe message system Kafka; and processing the public service data in the Kafka according to a supervision reporting requirement through a stream computing engine Flink, and storing the processed public service data in the data storage layer through the Kafka according to regions and fund flow directions.

Optionally, the apparatus further comprises:

the private data storage module is used for collecting private business data of the personal payment settlement component at regular intervals through the data transmission component NFT when the data acquisition condition is met, and transmitting the private business data to the batch processing calculation area; and processing the private business data according to a supervision reporting requirement through the batch processing calculation area, and storing the processed private business data into the data storage layer according to regions and fund flow directions.

Optionally, the inference task execution module 730 includes:

the information acquisition submodule is used for acquiring the region information, the fund flow direction information and the time in the mining task;

the data acquisition submodule is used for acquiring corresponding data to be mined from the data storage layer according to the region information, the fund flow direction information and the time;

and the reasoning task execution submodule is used for executing the real-time online reasoning task or the offline batch reasoning task through a set cluster based on the data to be mined to obtain a data mining result.

Optionally, the inference task execution submodule is specifically configured to execute:

if the mining task is a stand-alone data mining task, the real-time online reasoning task or the offline batch reasoning task is executed through a container cluster Kubernetes based on the data to be mined to obtain a data mining result;

and if the mining task is a distributed data mining task, executing the real-time online reasoning task or the offline batch reasoning task through a Spark distributed cluster based on the data to be mined to obtain a data mining result.

The data mining device provided by the embodiment of the invention can execute the data mining method provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method.

Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, as shown in fig. 8, the electronic device includes a processor 810 and a memory 820; the number of the processors 810 in the electronic device may be one or more, and one processor 810 is taken as an example in fig. 8; the processor 810 and the memory 820 in the electronic device may be connected by a bus or other means, and the connection by the bus is exemplified in fig. 8.

The memory 820 may be used as a computer-readable storage medium for storing software programs, computer-executable programs, and modules, such as program instructions/modules corresponding to the data mining method in the embodiments of the present invention (e.g., the mining task obtaining module 710, the inference task generating module 720, and the inference task executing module 730 in the data mining apparatus). The processor 810 executes various functional applications of the electronic device and data processing by executing software programs, instructions and modules stored in the memory 820, that is, implements the data mining method described above.

The memory 820 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required for at least one function; the data storage area may store data created according to the use of the terminal, and the like. Further, the memory 820 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, the memory 820 may further include memory located remotely from the processor 810, which may be connected to the electronic device through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

Embodiments of the present invention also provide a storage medium containing computer-executable instructions, which when executed by a computer processor, are configured to perform a method of data mining, the method comprising:

acquiring a service type in a mining task;

Of course, the storage medium provided by the embodiment of the present invention contains computer-executable instructions, and the computer-executable instructions are not limited to the method operations described above, and may also perform related operations in the data mining method provided by any embodiment of the present invention.

From the above description of the embodiments, it is obvious for those skilled in the art that the present invention can be implemented by software and necessary general hardware, and certainly can be implemented by hardware, but the former is a better embodiment in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, and includes instructions for enabling an electronic device (which may be a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.

Embodiments of the present invention further provide a computer program product, including a computer program, which when executed by a processor implements the data mining method provided in any of the embodiments of the present application.

In implementing the computer program product, the computer program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including object oriented programming languages such as Java, Smalltalk and C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).

It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments illustrated herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims

1. A method of data mining, comprising:

acquiring a service type in a mining task;

2. The method of claim 1, prior to acquiring the service type in the mining task, further comprising:

and establishing inference service of the data mining model, and issuing real-time online inference service to the API gateway, or issuing offline batch inference service to the task center.

3. The method of claim 2, wherein the acquiring the service type in the mining task comprises:

4. The method according to claim 3, wherein the generating of the real-time online reasoning task or the offline batch reasoning task corresponding to the mining task according to the service type includes:

5. The method of claim 2, further comprising, prior to creating the inference service of the data mining model:

adopting user role and project name matching data set authority control information to determine a target data set matched with the user role and the project name;

training a data mining model based on the target dataset and a set data mining framework.

6. The method of claim 2, prior to creating the inference service of the data mining model, further comprising:

obtaining dragging operation information and parameter configuration information of a component in a graphical interface;

and constructing a directed acyclic graph of the data mining model according to the dragging operation information and the parameter configuration information through a vertex link algorithm component.

7. The method of claim 1, further comprising, upon satisfaction of a data acquisition condition:

acquiring the public service data of the public component through the data acquisition component OGG, and sending the public service data to the distributed publishing and subscribing message system Kafka;

and processing the public service data in the Kafka through a stream computing engine Flink according to a supervision reporting requirement, and storing the processed public service data to the data storage layer through the Kafka according to regions and fund flow directions.

8. The method of claim 1, further comprising, upon satisfaction of a data acquisition condition:

collecting private business data of a personal payment settlement component at regular intervals through a data transmission component NFT, and transmitting the private business data to a batch processing calculation area;

and processing the private business data according to a supervision submission requirement through the batch processing calculation area, and storing the processed private business data into the data storage layer according to regions and fund flow directions.

9. The method according to any one of claims 1 to 8, wherein the obtaining of data to be mined corresponding to the mining task in a data storage layer, and the executing of the real-time online reasoning task or the offline batch reasoning task based on the data to be mined to obtain a data mining result comprises:

acquiring region information, fund flow direction information and time in the mining task;

acquiring corresponding data to be mined from the data storage layer according to the region information, the fund flow direction information and the time;

and executing the real-time online reasoning task or the offline batch reasoning task through a set cluster based on the data to be mined to obtain a data mining result.

10. The method of claim 9, wherein the executing the real-time online reasoning task or the offline batch reasoning task through a set cluster based on the data to be mined to obtain a data mining result comprises:

11. A data mining device, comprising:

12. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the data mining method of any one of claims 1-10.

13. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the data mining method of any one of claims 1 to 10.

14. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements a data mining method according to any one of claims 1-10.