CN116737803B - Visual data mining arrangement method based on directed acyclic graph

Info

Publication number: CN116737803B
Application number: CN202311004734.2A
Authority: CN
Inventors: 王德鑫; 谭炜波; 吴国勇; 蒋旭; 李涛; 柴力伟; 张国楠; 王超; 焦瑞松; 周文平; 汪欢
Original assignee: TIANJIN SHENZHOU GENERAL DATA TECHNOLOGY CO LTD
Current assignee: TIANJIN SHENZHOU GENERAL DATA TECHNOLOGY CO LTD
Priority date: 2023-08-10
Filing date: 2023-08-10
Publication date: 2023-11-17
Anticipated expiration: 2043-08-10
Also published as: CN116737803A

Abstract

The invention provides a visual data mining arrangement method based on a directed acyclic graph. The user drags one or more components from a preset data processing component library onto the canvas and connects them to form a directed acyclic graph, which is displayed on the interface. After the user sets the parameter information of the components, the parameters are automatically loaded into the directed acyclic graph. Once the flow has been arranged, the generated data is transmitted to the background for processing. While the flow is running, state monitoring and log viewing functions are provided. The directed acyclic graph also has version management capability. By combining front-end visualization technology with the directed acyclic graph, the invention addresses the complexity, high entry threshold, and steadily increasing cost of use of traditional data mining tools.

Description

Visual data mining arrangement method based on directed acyclic graph

Technical Field

The invention relates to the field of visual data mining, in particular to a visual data mining arrangement method based on a directed acyclic graph.

Background

In the data mining task, the application of the directed acyclic graph provides an intuitive and flexible way for a user to construct a complex data processing flow. By splitting the task into separate nodes, each representing a particular data processing component, the user more clearly understands and designs the overall data mining process.

First, a user selects a component suitable for the requirements from a preset data processing component library. These components may include various functions such as data cleansing, feature selection, model training, and the like. By dragging the selected components into the canvas and establishing the connection relationship of the directed edges, the user explicitly specifies the transfer direction and the processing order of the data in the flow. Such a visualization operation makes construction of the data mining task more intuitive and easy to operate.

Second, the user sets specific parameter information for each component. These parameters include data input, data output, model parameters, and the like. When the user sets the parameters of the components, the information is automatically loaded into the corresponding nodes in the directed acyclic graph, so that each node is ensured to have correct configuration. Such a design allows a user to more conveniently customize and adjust the functionality of each component to meet different data mining requirements.

In addition to node-level parameter settings, the directed acyclic graph also introduces the concept of global parameters. The user sets global parameters in the graph that are shared by all components in the graph. The modification of the global parameters only needs to be carried out in one place, so that the whole data mining flow can be influenced at the same time. The global parameter mechanism greatly simplifies the complexity of parameter adjustment, reduces the workload of repeatedly modifying parameters, and improves the efficiency of debugging and optimization.

Once the data mining flow is arranged, the nodes and the connection relations in the graph are converted into JSON format and transmitted to the background for processing. The background executes the components in the order determined by the dependency relationships among the nodes, and performs the corresponding processing as the data flows through each component. This manner of data transfer ensures that data flows and is handled correctly throughout the flow.
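The hand-off described above can be illustrated with a minimal sketch. The JSON field names ("nodes", "edges", "id", "type", "source", "target") and both function names are assumptions made for illustration; the original text does not specify them. The standard-library graphlib module stands in for whatever ordering logic the background actually uses.

```python
import json
from graphlib import TopologicalSorter

# Hypothetical arranged flow; all field names are illustrative assumptions.
flow = {
    "nodes": [
        {"id": "load", "type": "SQL"},
        {"id": "clean", "type": "Python"},
        {"id": "train", "type": "Spark"},
    ],
    "edges": [
        {"source": "load", "target": "clean"},
        {"source": "clean", "target": "train"},
    ],
}

def to_payload(flow):
    """Front-end side: serialize the arranged flow to JSON text."""
    return json.dumps(flow, sort_keys=True)

def execution_order(payload):
    """Background side: parse the JSON and order nodes by dependency."""
    data = json.loads(payload)
    sorter = TopologicalSorter({node["id"]: set() for node in data["nodes"]})
    for edge in data["edges"]:
        # The target of an edge depends on its source.
        sorter.add(edge["target"], edge["source"])
    return list(sorter.static_order())

order = execution_order(to_payload(flow))
```

In this sketch a node is scheduled only after every node it depends on, which is the property the paragraph above relies on.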

However, because data mining tasks are complex and specialized, expertise and skills are often required to carry them out effectively. This results in a relatively high threshold for data mining, making it difficult for general data mining tools to serve non-professionals. Conventional data mining tools may be confusing and frustrating to those without in-depth knowledge of data mining techniques.

Data mining becomes increasingly complex with the proliferation and variety of data processing components. In conventional data mining tools, users may need to be familiar with a variety of different tools and techniques, as well as the interactions and application scenarios between them. This is a challenge for non-professionals because they need to invest a great deal of time and effort in learning and understanding these complex concepts and methods.

In addition, the use of conventional data mining tools also increases the cost of use. As data processing components increase, users need to purchase and learn more tools and techniques, which can result in additional economic and time investment. Also, it may be necessary for non-professionals to hire professionals or receive training in order to be able to properly use and understand these tools and techniques. This further increases the cost and complexity of the use of data mining.

Conventional data mining tools have some limitations in addressing the above-described issues. They often lack intuitiveness and ease of use, and do not provide adequate support and guidance to enable non-professionals to easily perform data mining tasks. Therefore, to solve these problems, a more intelligent and user-friendly data mining tool is needed, which can reduce the threshold of data mining, so that non-professional personnel can effectively perform data mining tasks.

Disclosure of Invention

In view of the above-identified problems in the prior art, the present invention provides a visual data mining arrangement method based on a directed acyclic graph that overcomes or at least partially solves the above-identified problems. The method constructs a directed acyclic graph using front-end visualization technology so as to complete the data mining arrangement, and comprises the following steps:

Step 1, a directed acyclic graph is customized, and a data mining strategy is determined;

step 2, setting node parameters of nodes in the directed acyclic graph;

step 3, setting global parameters of the directed acyclic graph;

step 4, executing the directed acyclic graph, wherein the node can dynamically display the execution state and display the execution log;

step 5, carrying out version management on the directed acyclic graph.

Preferably, in the step 1, the step of customizing the directed acyclic graph, determining a data mining strategy specifically includes:

the data mining strategy is formulated according to the type of the nodes in the directed acyclic graph, the priority of the node execution and the inter-node connection strategy.

Preferably, the types of the nodes in the directed acyclic graph comprise SQL, Python, Spark and/or Flink;

the priority of the node execution comprises five levels, from low to high: LOWEST, LOW, MEDIUM, HIGH and HIGHEST.

Preferably, in the step 2, node parameters of nodes in the directed acyclic graph are set, and the method specifically includes:

different types of nodes set different parameter contents, the parameters all follow a consistent JSON structure, and node attributes are set to specified fields.
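As an illustration of a consistent JSON structure shared by all node types, the following sketch wraps type-specific parameters in a common envelope. The field names ("id", "type", "priority", "params") and the helper make_node are hypothetical; the original text only states that the structure is consistent and that node attributes go into specified fields.

```python
import json

NODE_TYPES = {"SQL", "Python", "Spark", "Flink"}
# Assumed envelope fields; the source does not name them.
REQUIRED_FIELDS = ("id", "type", "priority", "params")

def make_node(node_id, node_type, priority="MEDIUM", **params):
    """Build a node whose envelope is identical for every node type."""
    if node_type not in NODE_TYPES:
        raise ValueError(f"unknown node type: {node_type}")
    return {"id": node_id, "type": node_type,
            "priority": priority, "params": params}

# Different node types carry different parameter contents...
sql_node = make_node("load", "SQL", sql="SELECT * FROM orders")
py_node = make_node("clean", "Python", script="clean.py", timeout_s=60)

# ...but both serialize to the same top-level structure.
encoded = json.dumps([sql_node, py_node])
```

Keeping the type-specific content under a single field lets the background validate every node with the same check before dispatching on the node type.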

Preferably, in the step 3, global parameters of the directed acyclic graph are set, and the method specifically includes:

the global parameters are set once and passed to all nodes in the directed acyclic graph for their use.
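One possible way to pass global parameters to every node is sketched below. The function resolve_params and the rule that a node-level value overrides a global value of the same name are assumptions; the original text does not state how conflicts are resolved.

```python
def resolve_params(global_params, node):
    """Return the parameters a node sees at run time.

    Assumption: globals are applied first and node-level values win
    when both define the same key.
    """
    merged = dict(global_params)           # copy so the globals stay unchanged
    merged.update(node.get("params", {}))  # node-level settings on top
    return merged

global_params = {"run_date": "2023-08-10", "db": "warehouse"}
node = {"id": "clean", "type": "Python", "params": {"db": "staging"}}

resolved = resolve_params(global_params, node)
```

Because the globals are copied rather than modified, changing a global value in one place affects every node the next time its parameters are resolved.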

Preferably, in the step 4, the executing the directed acyclic graph, the node may dynamically display an execution state and display an execution log, and specifically includes:

step 4-1, in the execution process, the running node displays different icons according to the running state fed back by the background; wherein the operating state comprises RUNNING, PAUSE, STOP, KILL, FAILURE or SUCCESS;

step 4-2, generating and displaying an execution log for recording the execution state and related information of the node; the execution log specifically comprises an operation result log and a system log; the operation result log shows whether the node is successfully executed and, if not, displays the log entries that caused the failure; the system log shows all logs in the execution process.
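Steps 4-1 and 4-2 can be sketched as a small mapping from the state reported by the background to what the interface shows. The icon names, the status dictionary layout, and the convention that failure-causing log lines start with "ERROR" are all illustrative assumptions.

```python
from enum import Enum

class RunState(Enum):
    RUNNING = "RUNNING"
    PAUSE = "PAUSE"
    STOP = "STOP"
    KILL = "KILL"
    FAILURE = "FAILURE"
    SUCCESS = "SUCCESS"

# Placeholder icon names; the source only says the icons differ per state.
ICONS = {
    RunState.RUNNING: "spinner",
    RunState.PAUSE: "pause",
    RunState.STOP: "stop",
    RunState.KILL: "killed",
    RunState.FAILURE: "cross",
    RunState.SUCCESS: "check",
}

def render_node(status):
    """Turn a background status report into the data shown on a node."""
    state = RunState(status["state"])
    system_log = list(status.get("log", []))  # every line of the run
    if state is RunState.FAILURE:
        # Assumed convention: failure causes are the lines tagged ERROR.
        result_log = [line for line in system_log if line.startswith("ERROR")]
    else:
        result_log = []
    return {
        "icon": ICONS[state],
        "succeeded": state is RunState.SUCCESS,
        "result_log": result_log,
        "system_log": system_log,
    }

view = render_node({"state": "FAILURE",
                    "log": ["INFO start", "ERROR table not found"]})
```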

Preferably, in the step 5, version management is performed on the directed acyclic graph, and the method specifically includes:

setting a plurality of versions for the directed acyclic graph according to service requirements, wherein each version corresponds to a matched data mining strategy; the version states corresponding to the versions include: in development, debugging passed, online, or under maintenance.
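A minimal in-memory sketch of version management follows. The class FlowVersions, its method names, and the state identifiers are hypothetical; a real system would presumably persist versions rather than hold them in a dictionary.

```python
import copy

# Assumed identifiers for the four version states named above.
VERSION_STATES = ("IN_DEVELOPMENT", "DEBUG_PASSED", "ONLINE", "MAINTENANCE")

class FlowVersions:
    """Keeps independent snapshots of a flow, each with a version state."""

    def __init__(self):
        self._versions = {}

    def save(self, name, flow, state="IN_DEVELOPMENT"):
        if state not in VERSION_STATES:
            raise ValueError(f"unknown version state: {state}")
        # Deep copy so later edits to the canvas do not alter the snapshot.
        self._versions[name] = {"state": state, "flow": copy.deepcopy(flow)}

    def load(self, name):
        return copy.deepcopy(self._versions[name]["flow"])

    def state(self, name):
        return self._versions[name]["state"]

    def compare(self, old, new):
        """Report which node ids were added or removed between versions."""
        before = {n["id"] for n in self._versions[old]["flow"]["nodes"]}
        after = {n["id"] for n in self._versions[new]["flow"]["nodes"]}
        return {"added": sorted(after - before),
                "removed": sorted(before - after)}

versions = FlowVersions()
flow = {"nodes": [{"id": "load"}, {"id": "clean"}]}
versions.save("v1", flow, state="ONLINE")
flow["nodes"].append({"id": "train"})
versions.save("v2", flow)
changes = versions.compare("v1", "v2")
```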

Preferably, in the step 1, a directed acyclic graph is customized, and a data mining strategy is determined, where the data mining strategy is formulated according to a type of a node in the directed acyclic graph, and specifically includes:

step 1-1a, a custom directed acyclic graph: using visualization tools to customize the structure and nodes of the directed acyclic graph;

step 1-2a, determining the type of the node: the types of nodes include SQL, Python, Spark and/or Flink;

step 1-3a, formulating a data mining strategy according to the type of the node in the directed acyclic graph.

Preferably, in the step 1, a directed acyclic graph is customized, and a data mining policy is determined, where the data mining policy is formulated according to a priority executed by a node, and specifically includes:

step 1-1b, a custom directed acyclic graph: using visualization tools to customize the structure and nodes of the directed acyclic graph; determining node types, node numbers and connection relationships among nodes in the directed acyclic graph according to specific data mining tasks and requirements;

step 1-2b, the node performs priority setting: setting execution priority for each node in the directed acyclic graph; dividing the nodes into LOWEST, LOW, MEDIUM, HIGH, HIGHEST five types, and setting the execution priority of the nodes according to the types and the functions of the nodes;

Step 1-3b, formulating a data mining strategy: according to the execution priority of the nodes, a corresponding data mining strategy is formulated; determining processing logic and operations of each node according to the priority of the nodes, and determining the order of the processing logic and the operations in the directed acyclic graph;

step 1-4b, node processing sequence and data transfer: determining the processing sequence and the data transmission mode of the nodes according to the execution priority of the nodes.

Preferably, in the step 1, a directed acyclic graph is customized, and a data mining policy is determined, where the data mining policy is formulated according to an inter-node connection policy, and the method specifically includes:

step 1-1c, self-defining a directed acyclic graph: using visualization tools to customize the structure and nodes of the directed acyclic graph; determining node types, node numbers and connection relationships among nodes in the directed acyclic graph according to specific data mining tasks and requirements;

step 1-2c, formulating a connection strategy between nodes: according to the node type and the data processing logic, a connection strategy between the nodes is formulated;

step 1-3c, formulating a data mining strategy: formulating a corresponding data mining strategy according to the connection relation and the data transmission mode between the nodes.

The beneficial effects of the application include: through the visual interface and the dragging operation, the threshold of a non-professional using a data mining tool is reduced, so that the data mining flow is more visual and easier to understand. The user freely constructs the directed acyclic graph, and the data mining strategy is determined according to specific requirements, so that the data mining task is more flexible and customizable. The user sets the parameter information of each node, and realizes parameter sharing through global parameters, so that the parameter configuration process is simplified, and the efficiency and consistency of the data mining task are improved. By dynamically displaying the execution state of the node and displaying the execution log, a user monitors the execution progress and the result of the data mining task in real time, and the task is convenient to debug and optimize. The version management function is provided, so that a user is helped to manage and compare data mining tasks of different versions, and iteration and reproduction of the tasks are facilitated.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate the application and, together with the description, serve to explain it:

FIG. 1 is a framework diagram of a directed acyclic graph based visual data mining orchestration method of the present application.

Detailed Description

The present application will now be described in detail with reference to the drawings and the specific embodiments thereof, wherein the exemplary embodiments and the description are for the purpose of illustrating the application only and are not to be construed as limiting the application.

With reference to FIG. 1, the application discloses a visual data mining arrangement method based on a directed acyclic graph, which realizes the construction and operation of the directed acyclic graph through front-end visualization technology, thereby completing the data mining arrangement.

The application uses front-end visualization technology to display the flow of the data mining task in the form of a directed acyclic graph, so that a user can intuitively construct and manage the data mining flow. The user customizes the structure and the connection mode of the directed acyclic graph according to specific data mining requirements, thereby determining the data mining strategy. The user sets parameters for each node in the directed acyclic graph to control the behavior of the node and the way it processes data. The directed acyclic graph also supports setting global parameters, which are shared among all nodes in the graph, providing greater flexibility and uniformity.

The application discloses a visual data mining arrangement method based on a directed acyclic graph, which comprises the following steps:

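step 1, customizing a directed acyclic graph and determining a data mining strategy;

step 2, setting node parameters of nodes in the directed acyclic graph;

step 3, setting global parameters of the directed acyclic graph;

step 4, executing the directed acyclic graph, wherein the nodes dynamically display the execution state and display the execution log;

step 5, carrying out version management on the directed acyclic graph.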

Preferably, in the step 1, the step of customizing the directed acyclic graph, determining a data mining strategy specifically includes: the data mining strategy is formulated according to the type of the nodes in the directed acyclic graph, the priority of the node execution and the inter-node connection strategy.

The method comprises the following specific operations:

a. A user-friendly visual interface is built using front-end development techniques such as HTML, CSS, and JavaScript to display the directed acyclic graph and the operations associated with it.

b. The user drags the preset data processing components onto the canvas and connects them to construct a directed acyclic graph; the acyclic nature of the graph is verified using graph theory algorithms such as topological sorting, and a data mining strategy is determined according to the structure and the connection mode of the directed acyclic graph.

c. The user sets parameter information of each node through interface operation, including input data, data processing method, output result and the like, and the directed acyclic graph also supports setting global parameters which are shared among all nodes in the graph, and the parameter information is loaded into the directed acyclic graph for subsequent execution.

d. The directed acyclic graph is executed, namely, the data processing operation of each node is executed one by one according to the connection relation and parameter setting of the node. In the executing process, the interface dynamically displays the executing state of the node and displays an executing log so that a user can monitor the progress of the task and check errors.

e. The directed acyclic graph provides version management functionality, i.e., the user saves different versions of the graph structure and parameter settings, and switches and compares. Thus, the user can conveniently trace back and manage the data mining tasks in different stages.
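The acyclicity check mentioned in operation b can be sketched with Kahn's topological sorting algorithm. This is one standard way to perform such a check, not necessarily the one used by the application; the function name and the (source, target) edge representation are assumptions.

```python
from collections import deque

def is_acyclic(nodes, edges):
    """Return True if the directed graph has no cycle (Kahn's algorithm).

    nodes: iterable of node ids; edges: iterable of (source, target) pairs.
    """
    nodes = list(nodes)
    indegree = {n: 0 for n in nodes}
    successors = {n: [] for n in nodes}
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1

    # Start from nodes that have no incoming connection.
    ready = deque(n for n in nodes if indegree[n] == 0)
    visited = 0
    while ready:
        current = ready.popleft()
        visited += 1
        for nxt in successors[current]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)

    # Nodes on a cycle never reach indegree 0, so they are never visited.
    return visited == len(nodes)

valid = is_acyclic(["load", "clean", "train"],
                   [("load", "clean"), ("clean", "train")])
looped = is_acyclic(["a", "b"], [("a", "b"), ("b", "a")])
```

A front end could run such a check each time the user draws a connection and reject any line that would close a loop.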

By implementing the technical implementation scheme, the visualized data mining arrangement method based on the directed acyclic graph can reduce the threshold of data mining, improve the visualization and flexibility of the data mining task and enable non-professional staff to participate in the method. Compared with the traditional data mining tool, the method has the characteristics of more intuitionism, flexibility and high efficiency, is hopeful to promote the development of the data mining technology, and plays an important role in practical application.

In the directed acyclic graph based visual data mining orchestration method, the node type is a key element. Different node types (e.g., SQL, Python, Spark, Flink, etc.) are combined into different data mining policies.

SQL node: used for executing SQL queries and operating on databases; it processes SQL statements using an existing SQL parser and execution engine and passes the results to other nodes for subsequent processing.

Python node: used for executing Python code; it runs the Python script provided by the user with a Python interpreter to perform tasks such as data processing, feature extraction, and model training.

Spark node: used for performing large-scale data processing and analysis tasks based on the Spark framework; it processes large-scale data sets using the distributed computing power of Spark and performs complex data mining operations.

Flink node: used for executing streaming data processing tasks based on the Flink framework; it processes real-time data streams using the stream computing power of Flink and performs streaming data mining and analysis.
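The relationship between node types and their execution engines can be sketched as a dispatch table. The handlers below are placeholders that only describe what a real engine would be asked to do; they do not call SQL, Spark, or Flink, and every name in the sketch (register, dispatch, the parameter keys) is hypothetical.

```python
HANDLERS = {}

def register(node_type):
    """Decorator that associates a node type with its handler function."""
    def decorator(func):
        HANDLERS[node_type] = func
        return func
    return decorator

@register("SQL")
def run_sql(params):
    return f"SQL engine would execute: {params['sql']}"

@register("Python")
def run_python(params):
    return f"Python interpreter would run: {params['script']}"

@register("Spark")
def run_spark(params):
    return f"Spark would run job: {params['job']}"

@register("Flink")
def run_flink(params):
    return f"Flink would run stream job: {params['job']}"

def dispatch(node):
    """Look up the handler for the node type and pass it the node params."""
    handler = HANDLERS.get(node["type"])
    if handler is None:
        raise ValueError(f"no handler for node type: {node['type']}")
    return handler(node["params"])

message = dispatch({"type": "SQL", "params": {"sql": "SELECT 1"}})
```

With this layout, supporting an additional component type amounts to registering one more handler, without changing the dispatch logic.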

The data mining strategy is formulated according to the types of nodes in the directed acyclic graph, and the specific steps are as follows:

1. Determining the type of the node: first, the type of each node in the directed acyclic graph needs to be determined, including SQL, Python, Spark and/or Flink. These nodes represent different data processing and mining modes, with different functions and characteristics.

Data mining policy for SQL nodes:

Data source: the data source is determined according to the node with the SQL node type, and can be a relational database, a data warehouse or other data storage supporting SQL query.

Query statement: SQL query statements are written, and appropriate query operations, such as selection, filtering, aggregation, connection, etc., are selected according to the specific data mining task.

Data preprocessing: in SQL nodes, some data preprocessing operations, such as missing value processing, data conversion, normalization, etc., may be performed.

Result output: an appropriate way of outputting the query result is selected according to task requirements, such as saving it to a file, exporting it to another database, or displaying it visually.

Data mining policy for Python nodes:

Data loading: for a node whose type is Python, a suitable data loading library or API is selected and the data source is loaded.

Data preprocessing: according to the characteristics of the data and the requirements of the mining task, preprocessing operations such as data cleaning, conversion, and standardization are performed using Python libraries.

Feature engineering: meaningful features are extracted using Python data processing and feature engineering libraries, and feature selection, transformation, and construction are performed.

Model training and evaluation: an appropriate machine learning or deep learning library is selected, the model is trained, and its performance and accuracy are evaluated.

Result analysis and output: the mining result is analyzed and visually displayed according to the task requirements, and a report or chart is output.

Data mining policy for Spark nodes:

Data loading: for a node whose type is Spark, the data source is loaded using the Spark framework; it may be a file system, a database, or another supported data source.

Data preprocessing: operations such as data cleaning, conversion, and normalization are carried out using the functions, operations, and transformation methods provided by Spark.

Distributed computing: large-scale data mining tasks such as distributed clustering and distributed machine learning are performed using the distributed computing power of Spark.

Model training and evaluation: model training and evaluation are performed using Spark's machine learning library (MLlib), and appropriate algorithms and parameters are selected for model training and tuning.

Result analysis and output: the mining result is analyzed and displayed using the distributed data processing and visualization tools of Spark, and a report or chart is output.

Data mining policy for Flink nodes:

Data loading: for a node whose type is Flink, the data source is loaded using the Flink framework; it may be a file system, a message queue, or another supported data source.

Stream processing: the data stream is processed in real time using the stream processing function of Flink, and real-time data mining operations such as real-time clustering and real-time prediction are performed.

Window calculation: the data streams are grouped and aggregated using the window computing function of Flink, and windowed data mining tasks are performed.

Model training and evaluation: during stream processing, incremental model training and evaluation can be performed, with model training and tuning using the machine learning library of Flink (FlinkML).

Result analysis and output: the real-time mining result is analyzed and displayed using the stream data processing and visualization tools of Flink, and a report or chart is output.

By customizing the directed acyclic graph and formulating data mining strategies according to the types of the nodes, suitable data processing and analysis methods can be selected for nodes of different types, which supports the accuracy and efficiency of the whole data mining process. Therefore, according to the requirements of the task, tools and technologies such as SQL, Python, Spark, and Flink can be fully utilized to complete various data mining tasks.

By providing multiple types of node components, users are free to combine and customize data mining policies according to actual needs. The policy supports the integration of new component types into the node, thereby extending the available data processing capacity. The relationship between the structure and the nodes of the directed acyclic graph is displayed through the graphical interface, so that the data mining task is more visual and easier to understand. Each node is provided with a parameter configuration interface, a user flexibly sets parameters, and meanwhile, the sharing and unified management of the parameters are realized through global parameters. The execution state and the execution log of the node are displayed in real time, so that a user is helped to monitor the task progress and conduct error checking. Through the version management function, a user manages and compares the data mining tasks of different versions, and iteration and reproduction of the tasks are facilitated.

That is to say, the visual data mining arrangement method based on the directed acyclic graph realizes flexible, extensible and visual data mining task arrangement. The node components of different types provide rich data processing capability, so that non-professionals can also construct a data mining flow according to own requirements, and parameter configuration and task monitoring are performed through a visual interface. The method has higher efficiency, flexibility and maintainability, and is helpful for promoting the development of data mining technology and improving the usability of data mining tasks.

Furthermore, in the visualized data mining orchestration method based on directed acyclic graphs, the execution priority of the nodes is an important concept. The execution priority of the nodes determines the execution sequence and importance of each node in the data mining strategy.

The associated priority is defined as:

LOWEST: the lowest priority, which means that the node has the lowest execution priority and is at the last execution position in the whole data mining flow.

LOW: low priority indicates that the node has a lower execution priority, but is still ahead of the general executing node.

MEDIUM: medium priority, meaning that the execution priority of the node is centered, typically between the higher and lower priority nodes.

HIGH: high priority indicates that the node has a higher priority of execution than most nodes.

HIGHEST: the highest priority, which means that the node has the highest execution priority and is typically the node that is executed first in the entire data mining flow.

The priority of node execution may affect the data mining policy. The directed acyclic graph is customized, a data mining strategy is formulated according to the priority of node execution in the directed acyclic graph, and the following detailed steps are explained:

1. Customizing the directed acyclic graph: using visualization tools, the structure and nodes of the directed acyclic graph are customized. The node types, the number of nodes, and the connection relations among the nodes in the directed acyclic graph are determined according to specific data mining tasks and requirements. The directed acyclic graph can be constructed on the canvas by dragging components or by manually writing code.

2. Setting of node execution priority: an execution priority is set for each node in the directed acyclic graph. Nodes are classified into five categories, LOWEST, LOW, MEDIUM, HIGH, and HIGHEST, to represent the urgency and importance of the node's execution. The execution priority of each node is set according to its type and function so as to ensure that the data mining tasks are executed in the correct sequence.

3. Formulating a data mining strategy: a corresponding data mining strategy is formulated according to the execution priority of the nodes. The processing logic and operations of each node are determined based on the priority of the nodes, and their order in the directed acyclic graph is determined.

LOWEST priority: these nodes are typically preconditions or dependencies of other nodes, so their execution needs to be ensured first. Operations such as data loading, data preprocessing, or data preparation are performed in these nodes so as to ensure the availability and correctness of subsequent nodes.

LOW priority: these nodes are typically data processing operations, such as feature engineering, data conversion, or data cleansing, that can be performed after the preceding nodes have completed their execution. In these nodes, the data may be processed and prepared using appropriate methods and algorithms.

MEDIUM priority: these nodes are typically model training and optimization operations based on the results of previous nodes. In these nodes, suitable machine learning or deep learning algorithms can be selected and model training and parameter tuning performed based on the characteristics of the data.

HIGH priority: these nodes are typically the steps of model evaluation and verification. In these nodes, the performance and accuracy of the model can be evaluated using appropriate evaluation indexes and techniques, and necessary adjustments and improvements can be made according to the evaluation results.

HIGHEST priority: these nodes are typically the final result output and report generation steps. In these nodes, an appropriate manner and format may be selected to output the data mining results according to business needs, such as generating reports, displaying results visually, or exporting data files.

4. Node processing order and data transfer: the processing sequence and the data transmission mode of the nodes are determined according to the execution priority of the nodes. This ensures that nodes execute in the correct order in the directed acyclic graph and that data flows correctly from one node to another.

The order of execution of the nodes is arranged according to their priorities to ensure that nodes with lower priorities are executed first and then nodes with higher priorities are executed.

The input data of each node must come from the output data of its preceding nodes, which ensures the integrity and correctness of the data. The connection relations between the nodes ensure that the data is transferred and processed in the correct sequence.
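The ordering rule in step 4 can be sketched as a priority-aware topological sort. The sketch assumes that dependencies always take precedence and that, among nodes whose predecessors have all finished, the one with the lower priority level runs first, as stated above; ties are broken by node id. The numeric ranks and the function name schedule are illustrative assumptions.

```python
import heapq

# Assumed numeric ranks for the five priority levels, low to high.
PRIORITY = {"LOWEST": 0, "LOW": 1, "MEDIUM": 2, "HIGH": 3, "HIGHEST": 4}

def schedule(priorities, edges):
    """Order nodes so that dependencies are respected; among ready nodes,
    the lower priority level is executed first (ties broken by node id).

    priorities: dict node id -> priority name; edges: (source, target) pairs.
    """
    indegree = {n: 0 for n in priorities}
    successors = {n: [] for n in priorities}
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1

    ready = [(PRIORITY[p], n) for n, p in priorities.items() if indegree[n] == 0]
    heapq.heapify(ready)
    order = []
    while ready:
        _, current = heapq.heappop(ready)
        order.append(current)
        for nxt in successors[current]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                heapq.heappush(ready, (PRIORITY[priorities[nxt]], nxt))

    if len(order) != len(priorities):
        raise ValueError("graph contains a cycle")
    return order

priorities = {"load": "LOWEST", "clean": "LOW", "train": "MEDIUM",
              "evaluate": "HIGH", "report": "HIGHEST", "audit": "HIGH"}
edges = [("load", "clean"), ("clean", "train"), ("train", "evaluate"),
         ("evaluate", "report"), ("load", "audit")]
order = schedule(priorities, edges)
```

In this example "audit" becomes ready as soon as "load" finishes, but it is held back until the lower-priority "clean" and "train" nodes have run.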

The directed acyclic graph is customized, and a data mining strategy is formulated according to the priority of node execution, so that the nodes can be ensured to execute according to the correct sequence, and the data is reasonably processed and mining operation is performed according to the priority and the dependency relationship of the nodes. Therefore, the execution flow of the data mining task can be effectively optimized, and the data processing efficiency and the result accuracy are improved.

Through the execution priority of the nodes, a user flexibly controls the execution sequence and importance of the data mining tasks, and the requirements under different scenes are met. By reasonably setting the priority of the nodes, the execution sequence of the tasks is optimized, and the overall execution efficiency is improved. The priority of the nodes is allowed to be dynamically adjusted by a user in the task execution process so as to adapt to real-time requirement change. The structure of the directed acyclic graph is clear and definite, the priority of the node is set as one of the attributes of the node, and the structure of the whole data mining task is more clear and visible.

That is to say, the visualized data mining arrangement method based on the directed acyclic graph realizes task scheduling and control based on node priority. The user sets the execution sequence of the nodes according to the demands and priorities of the tasks, so that the data mining strategy is flexibly customized. Meanwhile, features such as dynamic adjustment and parallel execution improve the execution efficiency and flexibility of the tasks, so that the data mining tasks can be performed more controllably and efficiently.

Finally, in the visual data mining arrangement method based on the directed acyclic graph, the connection strategy between nodes is an important factor in determining data flow direction and dependencies. Different connection strategies combine to form different data mining strategies.

The connection types are defined as follows:

Direct connection: indicates a direct data flow and dependency between nodes. The output of one node serves as the input of the next, and the data is transferred directly.

Conditional connection: indicates that the connection between nodes is subject to a condition. The output of a node is passed selectively to other nodes according to conditions based on data attributes, calculation results, user-defined rules, and the like.

Parallel connection: indicates that the output of a node is delivered to multiple nodes simultaneously, and that those nodes execute in parallel rather than sequentially.
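
The three connection types can be illustrated with a small data structure. This is a minimal sketch under stated assumptions: the class names `ConnectionType` and `Connection`, the `condition` callback, and the `route` function are hypothetical names chosen for the example, and the sketch only decides which targets receive an output, leaving actual parallel execution to whatever executor is used.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, Optional


class ConnectionType(Enum):
    DIRECT = "direct"
    CONDITIONAL = "conditional"
    PARALLEL = "parallel"


@dataclass
class Connection:
    source: str
    target: str
    kind: ConnectionType = ConnectionType.DIRECT
    # Used only by conditional connections: decides whether the
    # source output is passed on to the target.
    condition: Optional[Callable[[Any], bool]] = None


def route(output, connections):
    """Return the names of the targets that should receive `output`."""
    targets = []
    for conn in connections:
        if conn.kind is ConnectionType.CONDITIONAL:
            # Forward only when a condition is set and it holds.
            if conn.condition is not None and conn.condition(output):
                targets.append(conn.target)
        else:
            # Direct and parallel connections always forward the output;
            # an executor may run parallel targets concurrently.
            targets.append(conn.target)
    return targets
```

With one conditional connection to `alert` (condition: more than 100 rows) and two parallel connections to `report` and `archive`, a three-row output is routed to `report` and `archive` only.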

The steps for customizing the directed acyclic graph and formulating a data mining strategy according to the inter-node connection strategy are explained in detail below:

2. Inter-node connection strategy: in a directed acyclic graph, the connections between nodes reflect how data flows and how it is transferred. Connection strategies are formulated from the specific node types and data processing logic to ensure that data flows correctly from one node to another.

Data transmission mode: determine the data transmission mode, which may be a direct connection, a data pipeline, a data stream, and so on. Select a mode suited to the characteristics and processing requirements of the data to ensure that data flows correctly between nodes.

Connection direction: determine the direction of each connection between nodes, that is, the direction of data flow. The direction of each connection follows from the data processing logic and dependencies, so that data flows in the correct order and neither loops endlessly nor is lost.

Connection type: different connection types can be defined according to how data is processed and how nodes depend on one another. For example, data dependency connections, data synchronization connections, and control flow connections may be defined to indicate the purpose and manner of data transfer between nodes.

3. Formulating a data mining strategy: formulate the data mining strategy according to the connections and data transmission modes between nodes. The processing logic and operations of each node are determined from the paths and rules by which data passes from one node to another.

Data dependency relationship: determine the data dependencies from the connections between nodes. Each node takes its input from the output of its upstream node, which preserves the integrity and correctness of the data.

Node operation and data processing: formulate node operations and data processing strategies according to the functions of the nodes and the data transmission modes. Select suitable algorithms, methods, and tools for each node type, and perform operations such as data preprocessing, feature engineering, and model training.

Data flow control: determine how the data flow is controlled according to the connections between nodes. Techniques such as conditional branching and loop control allow the data flow to be controlled flexibly to meet the requirements of the data mining task.

By customizing the directed acyclic graph and formulating a data mining strategy according to the inter-node connection strategy, the flow and sequence of data processing can be defined flexibly, and data is transferred and processed along the correct path in the directed acyclic graph. Customized data mining tasks can thus be realized, and the data flow can be controlled and optimized according to specific requirements.

By defining different connection types, the user can flexibly combine nodes according to the required data flow direction and dependencies to form different data mining strategies. Conditional connections allow the output of a node to be passed selectively according to the set conditions, providing finer-grained data flow control. Parallel connections allow the output of a node to be passed to several nodes at once and executed in parallel, which improves the execution efficiency of the task.
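
The requirement that connection directions never produce an endless loop can be enforced at the moment a user draws a connection. The sketch below is one possible check, not the method's own: the claims verify acyclicity with topological sorting, whereas this example uses a depth-first reachability test on a single candidate edge. The function name `creates_cycle` and the edge-list representation are assumptions.

```python
def creates_cycle(edges, source, target):
    """Return True if adding the edge source -> target would close a cycle.

    edges: list of existing (upstream, downstream) pairs
    """
    if source == target:
        return True

    successors = {}
    for src, dst in edges:
        successors.setdefault(src, []).append(dst)

    # Depth-first search from `target`: if it can already reach `source`,
    # the new edge would complete a loop.
    stack, seen = [target], set()
    while stack:
        node = stack.pop()
        if node == source:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(successors.get(node, []))
    return False
```

Given existing connections `a -> b -> c`, a new connection `c -> a` is rejected, while `a -> c` is accepted because it only adds a second path in the same direction.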

In the visual data mining arrangement method based on the directed acyclic graph, the setting of node parameters and the application of global parameters play an important role.

Different node types take different parameters according to their functions and requirements. For example, an SQL node contains parameters such as the query statement and database connection information; a Python node contains parameters such as the script path and the input and output file paths; a Spark node contains the configuration parameters of the Spark job.

All node parameters follow a consistent JSON structure, which ensures a uniform format and uniform data types. The attributes specific to each node type are set through designated fields, so different node types can be described flexibly within the same structure.
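
A consistent JSON structure of this kind might look like the following. The source does not give the actual schema, so the field names `name`, `type`, and `params`, the keys inside `params`, and the function `parse_node` are all hypothetical; only the four node types are taken from the document.

```python
import json

# Hypothetical common envelope shared by every node type.
REQUIRED_FIELDS = {"name", "type", "params"}
NODE_TYPES = {"SQL", "Python", "Spark", "Flink"}


def parse_node(text):
    """Parse a node definition and check that it follows the common structure."""
    node = json.loads(text)
    missing = REQUIRED_FIELDS - node.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if node["type"] not in NODE_TYPES:
        raise ValueError(f"unknown node type: {node['type']}")
    if not isinstance(node["params"], dict):
        raise ValueError("params must be a JSON object")
    return node


# Type-specific settings live inside "params"; the envelope stays the same.
sql_node = parse_node("""
{
  "name": "load_orders",
  "type": "SQL",
  "params": {
    "query": "SELECT * FROM orders",
    "connection": "warehouse"
  }
}
""")
```

Keeping type-specific settings inside one nested object means the front end and back end can validate every node with the same code path, whatever its type.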

Global parameters are parameters available to all nodes in the directed acyclic graph. They are used to share data or configuration information throughout the data mining process.

Node parameters use global parameters by pass-through: a parameter of a node references a global parameter in order to obtain or pass on its value. This makes it easy to transfer data or share configuration information between nodes.

Global parameters can be changed before each execution. By modifying the value of a global parameter before a run, the user affects the whole data mining flow. This provides flexibility and adjustability, enabling the user to adjust global settings dynamically according to demand.
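
Pass-through of global parameters can be sketched as placeholder substitution performed just before each run. The `${name}` placeholder syntax and the function name `resolve` are assumptions for illustration; the source does not specify how a node parameter references a global parameter.

```python
import re

# Hypothetical reference syntax: ${global_parameter_name}
PLACEHOLDER = re.compile(r"\$\{(\w+)\}")


def resolve(value, global_params):
    """Recursively replace ${name} placeholders with global parameter values.

    Raises KeyError if a placeholder names an undefined global parameter.
    """
    if isinstance(value, str):
        return PLACEHOLDER.sub(lambda m: str(global_params[m.group(1)]), value)
    if isinstance(value, dict):
        return {k: resolve(v, global_params) for k, v in value.items()}
    if isinstance(value, list):
        return [resolve(v, global_params) for v in value]
    return value  # numbers, booleans, None pass through unchanged
```

Because substitution happens at run time, the same node parameters produce different concrete values when the user changes a global parameter (for example a run date) between two executions.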

During execution, each node displays its running state according to feedback from the back end. Common node running states include RUNNING, PAUSE, STOP, KILL, FAILURE, and SUCCESS.

To visually demonstrate the operational state of a node, different icons or visual cues are used to represent the different states. For example, the states of the nodes are displayed by using icons, progress bars, prompt messages and the like with different colors, so that a user is helped to know the execution progress and the result of the data mining task.
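
A mapping from state to visual cue could be as simple as the lookup below. The six state names come from the text; the specific colours and icon names, and the fallback for an unrecognized state, are assumptions chosen for the example.

```python
from enum import Enum


class NodeState(Enum):
    RUNNING = "RUNNING"
    PAUSE = "PAUSE"
    STOP = "STOP"
    KILL = "KILL"
    FAILURE = "FAILURE"
    SUCCESS = "SUCCESS"


# Hypothetical (colour, icon) pairs; any consistent scheme would do.
STATE_STYLE = {
    NodeState.RUNNING: ("blue", "spinner"),
    NodeState.PAUSE: ("yellow", "pause"),
    NodeState.STOP: ("grey", "stop"),
    NodeState.KILL: ("orange", "cross"),
    NodeState.FAILURE: ("red", "warning"),
    NodeState.SUCCESS: ("green", "check"),
}


def style_for(state_name):
    """Map a state string reported by the back end to a (colour, icon) pair."""
    try:
        state = NodeState(state_name)
    except ValueError:
        # Unknown states get a neutral style instead of breaking the UI.
        return ("grey", "question")
    return STATE_STYLE[state]
```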

Finally, the execution log is a key component of the visual data mining arrangement method based on the directed acyclic graph. It records the execution state and related information of the nodes.

The running result log shows the execution state of each node, so that the user can determine whether the node executed successfully. Typically, the execution state of a node is marked as successful or failed. If a node fails, the running result log provides detailed information about the cause of the failure, such as error messages and exception stacks, so that the user can quickly locate and resolve the problem.

The system log records all operations and events in the execution process to provide more detailed execution information. These logs include start-up of the node, run time, resource consumption, input and output data, etc.

The detailed record of the system log is helpful for the user to know each link in the execution process, and is convenient for debugging and troubleshooting. For example, if a node executes for too long, the system log displays its run time and the resources consumed, helping the user to optimize execution efficiency.

The execution log should be persisted for ease of subsequent review and analysis. This means that the log is saved in a persistent storage medium (for example, a database or a log file), not just in memory. With persisted logs, the user can review the log information of an execution at any time and carry out operations such as troubleshooting and performance analysis. The persisted log also serves as a record and as a basis for auditing.
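
Log persistence can be sketched with Python's standard `sqlite3` module. The table name `execution_log`, its columns, and the helper functions are hypothetical; the source only requires that logs be written to durable storage such as a database or a file. Passing a file path to `open_log_store` keeps the log across restarts.

```python
import sqlite3
import time


def open_log_store(path):
    """Open (or create) a log database at `path` and ensure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS execution_log (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               node TEXT NOT NULL,
               kind TEXT NOT NULL,
               status TEXT,
               message TEXT,
               created_at REAL NOT NULL
           )"""
    )
    return conn


def write_log(conn, node, kind, message, status=None):
    """Append one entry; `kind` is 'result' or 'system' in this sketch."""
    conn.execute(
        "INSERT INTO execution_log (node, kind, status, message, created_at) "
        "VALUES (?, ?, ?, ?, ?)",
        (node, kind, status, message, time.time()),
    )
    conn.commit()  # flush to storage so the entry survives a crash


def failures(conn):
    """Return (node, message) for every failed node, oldest first."""
    rows = conn.execute(
        "SELECT node, message FROM execution_log "
        "WHERE kind = 'result' AND status = 'FAILURE' ORDER BY id"
    )
    return rows.fetchall()
```

Storing result entries and system entries in one table with a `kind` column keeps review queries simple; a deployment could equally use separate tables or plain log files.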

In the visual data mining arrangement method based on the directed acyclic graph, support for multiple versions is an important function. The user can create several different versions of a directed acyclic graph according to business requirements, each version corresponding to a specific data mining strategy.

Multi-version management allows users to create multiple directed acyclic graph versions in the same data mining project to meet different business needs. Each version represents a different data mining strategy or scheme.

The user can modify and extend an existing version, or create a new version, as required. Different versions can therefore be developed and compared in parallel within the same project.
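
Creating a version from an existing one and comparing two versions can be sketched with independent snapshots. The class `VersionStore`, its method names, and the graph layout (a dict with a `nodes` mapping from node name to parameters) are assumptions for illustration, not the storage format of the described system.

```python
import copy


class VersionStore:
    """Keeps independent snapshots of a graph, keyed by version name."""

    def __init__(self):
        self._versions = {}

    def save(self, name, graph):
        # Deep-copy so later edits to `graph` do not alter the snapshot.
        self._versions[name] = copy.deepcopy(graph)

    def fork(self, base, new_name):
        """Create a new version from an existing one; return an editable copy."""
        if new_name in self._versions:
            raise ValueError(f"version {new_name!r} already exists")
        self._versions[new_name] = copy.deepcopy(self._versions[base])
        return copy.deepcopy(self._versions[new_name])

    def changed_nodes(self, a, b):
        """Return sorted names of nodes that differ between two versions."""
        nodes_a = self._versions[a]["nodes"]
        nodes_b = self._versions[b]["nodes"]
        names = set(nodes_a) | set(nodes_b)
        return sorted(n for n in names if nodes_a.get(n) != nodes_b.get(n))
```

A typical use is to fork `v1` into `v2`, change one node's parameters or add a node, save `v2`, and then call `changed_nodes("v1", "v2")` to see exactly which nodes the two strategies disagree on.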

Each directed acyclic graph version has a corresponding version state. These states describe the different phases in which the versions are located, facilitating project management and team collaboration.

Common version states include:

In development: the version is in the development and design phase and has not yet been debugged or tested.

Debugging: the version is being debugged, and its correctness and feasibility are being verified by running it.

Debugging passed: the version has been verified by debugging, achieves the expected effect, and is ready to enter the next stage.

Ready for release: the version has completed debugging and is ready to be brought online and applied to actual data mining tasks.

Maintenance: the version is online and in use in the actual application, and requires subsequent maintenance and management.

Through multi-version management and version state setting, a user can easily compare and switch different data mining strategies. The change of the version state also reflects the life cycle and progress condition of the data mining project, and has important significance for project management and team cooperation.
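
The five version states listed above lend themselves to a small state machine. The source names the states but does not specify which transitions are permitted, so the `ALLOWED` table below, like the identifiers `VersionState` and `advance`, is purely an assumption for illustration.

```python
from enum import Enum


class VersionState(Enum):
    IN_DEVELOPMENT = 1
    DEBUGGING = 2
    DEBUG_PASSED = 3
    READY_FOR_RELEASE = 4
    MAINTENANCE = 5


# Hypothetical transition table: forward through the life cycle, with a
# way back to development when debugging reveals problems.
ALLOWED = {
    VersionState.IN_DEVELOPMENT: {VersionState.DEBUGGING},
    VersionState.DEBUGGING: {VersionState.IN_DEVELOPMENT, VersionState.DEBUG_PASSED},
    VersionState.DEBUG_PASSED: {VersionState.IN_DEVELOPMENT, VersionState.READY_FOR_RELEASE},
    VersionState.READY_FOR_RELEASE: {VersionState.MAINTENANCE},
    VersionState.MAINTENANCE: set(),
}


def advance(current, target):
    """Return `target` if the transition is allowed, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.name} to {target.name}")
    return target
```

Restricting transitions in this way prevents, for example, a version that is still in development from being marked ready for release without passing through debugging.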

The beneficial effects of the invention include the following. The visual interface and drag-and-drop operation lower the barrier for non-specialists using a data mining tool and make the data mining flow more intuitive and easier to understand. The user constructs the directed acyclic graph freely and determines the data mining strategy according to specific requirements, which makes data mining tasks more flexible and customizable. The user sets the parameter information of each node and shares parameters through global parameters, which simplifies parameter configuration and improves the efficiency and consistency of data mining tasks. By dynamically displaying node execution states and execution logs, the method lets the user monitor the progress and results of a data mining task in real time, which makes tasks easier to debug and optimize. The version management function helps the user manage and compare different versions of a data mining task, which facilitates iteration and reproduction.

The foregoing description is only of the preferred embodiments of the invention, and all changes and modifications that come within the meaning and range of equivalency of the structures, features and principles of the invention are therefore intended to be embraced therein.

Claims

1. A visual data mining arrangement method based on a directed acyclic graph is characterized in that: the method for realizing data mining arrangement by utilizing a front-end visualization technology to construct a directed acyclic graph comprises the following steps:

Step 1, customizing a directed acyclic graph and determining a data mining strategy; specifically, determining the type of the node: first, the type of each node in the directed acyclic graph is determined, the types including SQL, Python, Spark and/or Flink; the node types represent different data processing and mining modes and have different functions and characteristics;

the priority of node execution comprises five levels, LOWEST, LOW, MEDIUM, HIGH and HIGHEST, in order from low to high; then,

for a node whose type is SQL, determining a data source, wherein the data source is a relational database and a data warehouse supporting SQL queries;

writing SQL query statements, and selecting suitable query operations according to the specific data mining task, wherein the operations comprise selection, filtering, aggregation and joining;

performing data preprocessing in the SQL node, comprising missing value handling, data conversion and standardization;

and finally, outputting a result: selecting a proper mode to output a query result according to task requirements, storing the query result in a file, exporting the query result to other databases, and displaying the query result in a visual way;

a. wherein, by constructing a user-friendly visual interface, the directed acyclic graph and the operation related to the directed acyclic graph are displayed;

b. the user drags preset data processing components and connects them through connection relationships to construct a directed acyclic graph, the acyclic nature of the graph is verified by using topological sorting, a graph theory algorithm, and a data mining strategy is determined according to the structure and the connection mode of the directed acyclic graph;

c. setting parameter information of each node through interface operation by a user, wherein the parameter information comprises input data, a data processing method and an output result, the directed acyclic graph also supports setting global parameters, the parameters are shared among all nodes in the graph, and the parameter information is loaded into the directed acyclic graph for subsequent execution;

d. in the execution process, the interface dynamically displays the execution state of the nodes and displays an execution log so as to facilitate a user to monitor the progress of tasks and check errors;

e. providing version management function for directed acyclic graphs, and storing graph structures and parameter settings of different versions by a user, and switching and comparing;

different icons or visual prompts are used for representing different states, and icons, progress bars and prompt information modes with different colors are used for displaying the states of the nodes, so that a user is helped to know the execution progress and the result of a data mining task;

Step 2, setting node parameters of nodes in the directed acyclic graph;

step 3, setting global parameters of the directed acyclic graph;

step 5, carrying out version management on the directed acyclic graph;

in step 1, customizing the directed acyclic graph and determining the data mining strategy specifically comprises formulating the data mining strategy according to the types of the nodes in the directed acyclic graph, the priority of node execution, and the inter-node connection strategy,

the data mining strategy for the Python nodes is as follows:

data loading: for a node whose type is Python, selecting a suitable data loading library or API and loading the data source;

data preprocessing: according to the characteristics of data and the requirements of mining tasks, performing data cleaning, conversion and normalized preprocessing operations by using a Python library;

feature engineering: extracting meaningful features by using Python data processing and feature engineering libraries, and performing feature selection, transformation and construction;

model training and evaluation: selecting a proper machine learning or deep learning library, training a model and evaluating the performance and accuracy of the model;

result analysis and output: analyzing and visually displaying the mining result according to task demands, and outputting a report or a chart;

the data mining strategy for the Spark node is as follows:

data loading: for a node whose type is Spark, loading the data source by using the Spark framework;

data preprocessing: performing data cleaning, conversion and normalization operations by using the data processing capability of Spark, including the functions, operations and transformation methods provided by Spark;

distributed computing: performing large-scale data mining tasks, including distributed clustering and distributed machine learning, by using the distributed computing power of Spark;

model training and evaluation: model training and evaluation are carried out by using a Spark machine learning library, and proper algorithms and parameters are selected for model training and tuning;

result analysis and output: analyzing and displaying the mining result by using Spark distributed data processing and visualization tools, and outputting a report or a chart;

the data mining strategy for the Flink node is as follows:

data loading: for a node whose type is Flink, loading the data source by using the Flink framework;

stream processing: processing the data stream in real time by utilizing the stream processing function of the Flink, and performing real-time data mining operation;

window calculation: grouping and aggregating the data streams by utilizing the window computing function of the Flink, and performing windowed data mining tasks;

model training and evaluation: performing incremental model training and evaluation during stream processing, and performing model training and tuning by using the machine learning library of Flink;

2. The method for arranging the visual data mining based on the directed acyclic graph according to claim 1, wherein the step 2 is provided with node parameters of nodes in the directed acyclic graph, and specifically comprises the following steps:

3. The method for arranging visual data mining based on the directed acyclic graph according to claim 1, wherein the step 3 is provided with global parameters of the directed acyclic graph, and specifically includes:

4. The method for arranging visual data mining based on directed acyclic graph according to claim 1, wherein, said step 4, executing said directed acyclic graph, a node can dynamically show execution status and display execution log, specifically comprising:

step 4-2, generating and displaying an execution log for recording the execution state and related information of the node; the execution log specifically comprises an operation result log and a system log; the operation result log shows whether the node executed successfully and, if execution failed, displays the log that caused the failure; the system log presents all logs produced during execution.

5. The method for arranging visual data mining based on directed acyclic graph according to claim 1, wherein the step 5 is implemented for version management of the directed acyclic graph, and specifically includes:

6. The method for arranging the visual data mining based on the directed acyclic graph according to claim 1, wherein the step 1 is to customize the directed acyclic graph, determine a data mining strategy, and the data mining strategy is formulated according to the types of nodes in the directed acyclic graph, and specifically comprises the following steps:

7. The method for arranging the visual data mining based on the directed acyclic graph according to claim 1, wherein the step 1 is to customize the directed acyclic graph, determine a data mining strategy, and the data mining strategy is formulated according to the priority executed by the node, and specifically comprises the following steps:

8. The method for arranging the visual data mining based on the directed acyclic graph as claimed in claim 1, wherein the step 1 is to customize the directed acyclic graph, determine a data mining strategy, and the data mining strategy is formulated according to an inter-node connection strategy, and specifically comprises the following steps: