CN111078094A

CN111078094A - Distributed machine learning visualization device

Info

Publication number: CN111078094A
Application number: CN201911225636.5A
Authority: CN
Inventors: 鄂海红; 宋美娜; 刘芳; 周康; 王晓晖
Original assignee: Beijing University of Posts and Telecommunications
Current assignee: Beijing University of Posts and Telecommunications
Priority date: 2019-12-04
Filing date: 2019-12-04
Publication date: 2020-04-28
Anticipated expiration: 2039-12-04
Also published as: CN111078094B

Abstract

The invention discloses a distributed machine learning visualization device comprising a component module, a machine learning work module, a configuration module, a log module and a report module. The component module provides draggable components and a component for viewing and editing reports; the machine learning work module provides a work area for machine learning, allows the draggable components to be dragged into it, and connects them into a flow diagram; the configuration module provides component configuration content and updates it dynamically according to the current configuration; the log module provides the current running state; and the report module provides, when a report is generated, the details of each node in the current work area and visualized content of the operation results. The device can provide a distributed machine learning platform with a low entry threshold and a high degree of visualization for non-machine-learning professionals, can effectively handle massive data and high-precision machine learning, and is simple and easy to implement.

Description

Distributed machine learning visualization device

Technical Field

The invention relates to the technical field of big data machine learning, in particular to a distributed machine learning visualization device.

Background

Most current machine learning visualization systems use step-by-step configuration and step-by-step modeling: each step of modeling is configured and run separately, and the result is passed on to the next step. Only a few platforms have advanced integrated modeling in the machine learning visualization process. For example, one modeling method for a visual machine learning training model implements visual flow design, visual model verification and visual viewing of intermediate results, so that a data analyst can perform machine learning training without writing code. In the flow designer, graphical algorithm components are dragged to establish the data flow among the algorithms, and a flow description language is generated; a flow analyzer then parses the flow description language generated by the flow designer and creates the corresponding learning components and a Spark learning pipeline; finally, a flow scheduler submits the Spark learning pipeline to a Spark cluster for model training, thereby achieving high-quality machine learning modeling.

It is easy to see that most existing technologies handle machine learning visualization step by step and rarely use pipelined flow techniques. The few flow-based technologies also concentrate on modeling and model evaluation of data analysis methods, and do not fully bring data exploration, data preprocessing, feature engineering and other parts that provide strategy and data guarantees for data analysis into an integrated modeling flow. Moreover, when the data size is large, machine learning algorithms take longer to run, so when a problem occurs during a run, restarting from the beginning wastes time repeatedly. Finally, there is no good channel for summarizing after modeling and model evaluation: the user can only check and observe repeatedly, and no visual report that is easy to edit and to extract information from is provided, so the modeling process, model results and similar content are not well summarized, improved or reused.

Disclosure of Invention

The present invention is directed to solving, at least to some extent, one of the technical problems in the related art.

Therefore, the invention aims to provide a distributed machine learning visualization device, which can provide a distributed machine learning platform with low threshold and high visualization degree for non-machine learning professionals, can effectively solve the problems of mass data and high-precision machine learning, and is simple and easy to implement.

In order to achieve the above object, an embodiment of the present invention provides a distributed machine learning visualization apparatus, including: a component module for providing draggable components and a component for viewing and editing reports, wherein the draggable components comprise a data source component, an algorithm component, a model component and a project component; a machine learning work module for providing a work area for machine learning, allowing the draggable components to be dragged into the machine learning work module and connected into a flow diagram, and providing the functions of configuring component parameters, viewing the result of the current node, starting to run the subsequent machine learning flow from the current node, saving a model and generating a report; a configuration module for providing component configuration content and updating it dynamically according to the current configuration; a log module for providing the current running state; and a report module for providing, when a report is generated, the details of each node in the current work area and visualized content of the operation results, the report being editable both when it is generated and again when it is later viewed.

The distributed machine learning visualization device integrates data exploration, data preprocessing, feature engineering, a machine learning algorithm, a model generation process and a modeling process pipeline of model evaluation; the distributed machine learning pipeline integration is realized by combining two stages of verification execution and full data execution in the machine learning process; dynamically generating editable visual reports for contents such as a modeling process, a modeling result and the like; therefore, a distributed machine learning platform with low threshold and high visualization degree is provided for non-machine learning professionals, the problems of mass data and high-precision machine learning can be effectively solved, and the method is simple and easy to implement.

In addition, the distributed machine learning visualization apparatus according to the above embodiment of the present invention may further have the following additional technical features:

Further, in an embodiment of the present invention, the data source component supports importing CSV and Excel data, connecting to a database to import data, and using existing data; the algorithm component comprises modules for data exploration, data preprocessing, feature engineering, data analysis, model evaluation and the like, each module contains its own algorithms, and each algorithm has preset configuration and operation logic; the model component comprises models saved by the user; and the project component includes new projects and existing projects.

Further, in an embodiment of the present invention, the machine learning work module is specifically used for machine learning modeled pipeline process construction, machine learning modeled pipeline process monitoring, machine learning modeled pipeline process translation, machine learning modeled pipeline process operation, machine learning modeled pipeline process saving, machine learning modeled pipeline process reporting, and distributed machine learning pipeline process.

Further, in an embodiment of the present invention, the node id in a pipeline flow is used as the key in a data object, and the start node is placed in startNode as the starting point for executing the machine learning flow, wherein the type of the current node is determined by the pre and next fields of each node, pre represents the forward nodes of the current node, and next represents the backward nodes of the current node.

Further, in an embodiment of the present invention, when the length of pre is greater than 1, the current node is an aggregation node, and the current node may be executed only after the upstream processes of all nodes in pre are executed; when the length of the next is larger than 1, the current node is a separated node, and after the current node is executed, all nodes in the next are executed in parallel; when the lengths of pre and next are both 1, the current node is linear, the current node is executed after the execution of the unique node in pre is finished, and the unique nodes in the next are sequentially executed after the execution of the current node is finished; when the pre length is 0, the current node is represented as a start node, and when the next length is 0, the current node is represented as an end node.

Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.

Drawings

The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:

FIG. 1 is a schematic structural diagram of a distributed machine learning visualization apparatus according to an embodiment of the present invention;

FIG. 2 is a process flow diagram of a distributed machine learning visualization apparatus according to an embodiment of the present invention;

FIG. 3 is a simplified exemplary diagram of a machine learning pipeline process according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of data flow according to an embodiment of the invention.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.

The present application is based on the recognition and discovery by the inventors of the following problems:

Against the background of the rapid development of Internet application technology, human society is increasingly active on the Internet, and massive data is created on websites and social applications, which has driven the vigorous development of machine learning technology. However, in terms of performance, as data volumes grow and the accuracy required of machine learning algorithms rises, machine learning on a single machine can no longer meet the requirements for computation speed and storage capacity. In terms of practical operation, machine learning is complex to operate and has a high entry threshold, making it difficult for users to master; meanwhile, the results it produces are obscure and hard for users to understand. To address these problems, many techniques that combine distributed machine learning with visualization have begun to emerge.

These machine learning visualization techniques still have some problems. First, model construction in machine learning comprises many steps, such as data exploration, data preprocessing, feature engineering, data analysis, model generation, model evaluation, model tuning and data prediction; during this process the user inevitably needs to adjust parameters, rebuild the model and so on, which forces the tedious operation of repeatedly discarding the model and rebuilding it from scratch. Second, when the user completes model construction, only a single model parameter result is presented, and there is no complete, highly visual presentation of the model as a whole, of model evaluation, or of comparisons between multiple models.

In view of the above problems, embodiments of the present invention are intended to solve the following: a single-machine system struggles with massive data and high-precision machine learning, and machine learning has a high entry threshold, complex operation and abstract results. The embodiment of the invention therefore provides a visualization technology for distributed machine learning, comprising: an integrated machine learning model construction flow covering data exploration, data preprocessing, feature engineering, machine learning algorithms, model generation and model evaluation; a highly editable, highly visual report generation technology; and a Spark-based distributed machine learning technology. The aim is to provide non-machine-learning professionals with a distributed machine learning platform that has a low entry threshold and a high degree of visualization.

The following describes a distributed machine learning visualization apparatus proposed according to an embodiment of the present invention with reference to the drawings.

Fig. 1 is a schematic structural diagram of a distributed machine learning visualization apparatus according to an embodiment of the present invention.

As shown in FIG. 1, the distributed machine learning visualization apparatus 10 includes: a component module 100, a machine learning work module 200, a configuration module 300, a log module 400, and a report module 500.

The component module 100 is used for providing draggable components and a component for viewing and editing reports, wherein the draggable components comprise a data source component, an algorithm component, a model component and a project component. The machine learning work module 200 is used to provide a work area for machine learning, allow the draggable components to be dragged into the module and connected into a flow diagram, and provide the functions of configuring component parameters, viewing the current node result, starting to run the subsequent machine learning flow from the current node, saving a model and generating a report. The configuration module 300 is used to provide component configuration content and update it dynamically according to the current configuration. The log module 400 is used to provide the current running state. The report module 500 is configured to provide, when a report is generated, the details of each node in the current work area and visualized content of the operation results; the report is editable when generated and can be edited again when it is later viewed. The device 10 of the embodiment of the invention can provide a distributed machine learning platform with a low entry threshold and a high degree of visualization for non-machine-learning professionals, can effectively handle massive data and high-precision machine learning, and is simple and easy to implement.

It will be appreciated that the processing flow of the apparatus 10 of the embodiment of the present invention is shown in FIG. 2. The component module 100 provides four types of draggable components, namely data sources (new data can be uploaded and existing data can be used), algorithms (data exploration, data preprocessing, feature engineering and data analysis), models, and projects (new projects and existing projects), together with a component for viewing and editing reports. The machine learning work module 200 provides a work area for machine learning, allows components to be dragged into the module and connected into a flow diagram, and supports configuring component parameters, viewing the result of the current node, starting to run the subsequent machine learning flow from the current node, saving a model, generating a report and so on. The configuration module 300 provides component configuration content and updates it dynamically according to the current configuration. The log module 400 provides the current running state. The report module 500 provides, when a report is generated, the details of each node in the current work area and visualized content of the operation results, and supports editing the report.

The embodiment of the invention combines the distributed machine learning technology and the visualization technology, and is divided into five modules in total, and each module is explained in detail below.

(I) Component module

The component module 100 provides five kinds of components: data source, algorithm, model, project and report. The report component can be clicked to view an existing report and edit it again, and the other four kinds of components can be dragged into the machine learning work module to take part in the machine learning modeling flow. All components read data from the database to dynamically update their content.

(1) The data source component supports importing CSV and Excel data, connecting to a database to import data, and using existing data. (2) The algorithm component comprises data exploration, data preprocessing, feature engineering, data analysis, model evaluation and other parts, each containing its own algorithms, such as global statistics and frequency analysis for data exploration, or K-means clustering and association analysis for data analysis; each algorithm has preset configuration and operation logic. (3) The model component includes models that the user has saved. (4) The project component comprises new projects and existing projects: a user can create a brand-new project or reuse an existing one, in which case the project flow chart is reproduced in the machine learning work module and the project can be re-edited, reducing the tedious work of repeated modeling.

(II) machine learning work module

The machine learning work module 200 provides a work area for machine learning, and is a core module of the embodiment of the present invention. Data sources, algorithms, models and project components in the component module 100 can be dragged into a work area, and the process modeling and the operation on each component node are completed in the work area.

The machine learning work module 200 uses drag-and-drop and the jsPlumb library to build the machine learning modeling flow. The working process is as follows:

1. Machine learning modeling pipeline flow construction: when a component (a data source, model or algorithm component) is dragged into the work area, a node is generated with a unique node id as its identifier. Clicking a node opens its configuration; right-clicking it offers deleting the node, viewing the node data (result), starting to run from the current node, and so on. Using jsPlumb, a connecting line is generated by dragging from the bottom endpoint of one node to the top endpoint of another node, and a connecting line can be deleted by clicking it. The machine learning pipeline is thus built from nodes and lines, and the data flow is displayed (see the example shown in FIG. 3).

1.1 Machine learning modeling pipeline flow reproduction: when a project component is dragged into the work area, the pipeline flow as it was when the project was saved is redrawn in the work area; the operation result and configuration parameters of each node can be viewed directly, and the flow can be edited and run again.

2. Machine learning modeling pipeline flow monitoring: additions and deletions of nodes and connecting lines in the work area, as well as modifications of configuration content, are monitored through click events, and changes to the machine learning execution flow and to the configuration content are updated dynamically.

3. Machine learning modeling pipeline flow translation: the machine learning flow chart is translated according to the data flow direction into a data structure that can be interpreted and used to determine the execution order.

4. Machine learning modeling pipeline flow operation: when the machine learning model constructed in the work area is run, its running state is displayed in the log module. If a failure occurs, the failure reason is displayed in the log, and after adjustment the user can choose to restart from the failed node. During the run, the operation results of completed nodes can be viewed at any time, and the results are displayed visually as charts using ECharts.

5. Machine learning modeling pipeline flow saving: the pipeline flow can be saved as a model or as a project; the saved content includes the pipeline flow diagram of the work area, the node configurations, the node operation results, and so on.

6. Machine learning modeling pipeline flow report: report modules are generated dynamically according to the current modeling flow and each result.

7. Distributed machine learning pipeline flow: after the flow has been verified through the steps above, the pipeline flow is submitted to the Spark cluster to be run on the full data.
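The translation in step 3 and the restart from a failed node in step 4 can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the function names `translate_flow` and `nodes_to_rerun`, the node ids, and the exact dictionary layout are assumptions; only the `pre`, `next` and `startNode` names come from the description.

```python
def translate_flow(node_ids, connections):
    """Translate nodes and (source, target) connecting lines into a pre/next structure.

    Assumed layout: each node id is a key mapping to its forward ("pre") and
    backward ("next") nodes, and "startNode" lists the nodes with no forward node.
    A real system would need to make sure no node id is literally "startNode".
    """
    flow = {node_id: {"pre": [], "next": []} for node_id in node_ids}
    for source, target in connections:
        flow[source]["next"].append(target)
        flow[target]["pre"].append(source)
    flow["startNode"] = [n for n in node_ids if not flow[n]["pre"]]
    return flow


def nodes_to_rerun(flow, failed_node):
    """Return the failed node and every node downstream of it, in visit order."""
    ordered, stack = [], [failed_node]
    while stack:
        current = stack.pop()
        if current not in ordered:
            ordered.append(current)
            stack.extend(flow[current]["next"])
    return ordered


# Hypothetical four-node linear flow, used only to exercise the functions.
flow = translate_flow(
    ["data1", "stats", "split_column", "normalize"],
    [("data1", "stats"), ("stats", "split_column"), ("split_column", "normalize")],
)
rerun = nodes_to_rerun(flow, "split_column")
```

Restarting from `split_column` here re-runs only that node and `normalize`; the results of `data1` and `stats` would be reused, which is the time saving the description attributes to restarting from the failed node.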

Further, the translation of the pipeline flow according to the data flow direction, mentioned above, is described in detail here. The embodiment of the invention divides the nodes of the flow into three types: linear, aggregation and split; the three types of data flow are shown in FIG. 4. A linear node has only one input and one output, an aggregation node has multiple inputs, and a split node has multiple outputs.

The node type of a complete machine learning pipeline process is necessarily in the three data flow directions, so that a data structure can be defined to judge the data flow direction, and the pipeline process is translated into the data structure which can be used for judging the execution sequence. The data structure generated by FIG. 3 is as follows:

In the above data structure, the node id in the pipeline flow is used as the key in the data object, and the start node is placed in startNode, which serves as the starting point for executing the machine learning flow. The type of the current node is determined by the pre and next fields of each node, with pre representing the forward nodes of the current node (which may also be understood as parent nodes in a tree structure) and next representing the backward nodes of the current node (which may also be understood as child nodes in a tree structure). The type is determined as follows:

(1) When the length of pre is greater than 1, the current node is an aggregation node, and it can be executed only after the upstream flows of all nodes in pre have finished executing.

(2) When the length of next is greater than 1, the current node is a split node, and after it has been executed, all nodes in next should be executed in parallel.

(3) When the lengths of pre and next are both 1, the current node is linear: it can be executed only after the single node in pre has finished, and the single node in next is executed only after the current node has finished.

(4) When the length of pre is 0, the node is a start node, and when the length of next is 0, the node is an end node.
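The four rules above can be expressed as a small classification function. This is a sketch under the assumption that a node is represented as a dictionary with `pre` and `next` lists; the function name `node_roles` and the role labels are illustrative. A set is returned because the rules are not mutually exclusive: a start node can also be a split node, for example.

```python
def node_roles(node):
    """Classify a node from the lengths of its pre and next lists."""
    pre_count, next_count = len(node["pre"]), len(node["next"])
    roles = set()
    if pre_count > 1:
        roles.add("aggregation")   # rule (1): wait for all upstream flows
    if next_count > 1:
        roles.add("split")         # rule (2): run all next nodes in parallel
    if pre_count == 1 and next_count == 1:
        roles.add("linear")        # rule (3): one input, one output
    if pre_count == 0:
        roles.add("start")         # rule (4)
    if next_count == 0:
        roles.add("end")           # rule (4)
    return roles


# Hypothetical nodes used only to exercise the rules.
start_and_split = node_roles({"pre": [], "next": ["a", "b"]})
middle = node_roles({"pre": ["a"], "next": ["c"]})
merge_and_end = node_roles({"pre": ["b", "c"], "next": []})
```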

Taking FIG. 3 as an example, the data1 node is placed in startNode, and execution starts from data1. According to the translated data structure, data1 is a split node, so full-table statistics and frequency statistics are executed in parallel. Full-table statistics is a linear node, so only data column splitting is executed after it; frequency statistics, running in parallel, is itself a split node, so the two algorithm nodes after it, null-value filling and filtering, are executed in parallel. Data integration is an aggregation node and is executed only after data column splitting and null-value filling have both completed; the remaining nodes are linear and are executed sequentially.
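The walk-through above can be reproduced with a short scheduling sketch. The graph below is reconstructed only from the node names mentioned in the text; FIG. 3 itself is not reproduced here, so the exact edges, the identifiers, the order of the trailing linear nodes, and the choice to treat filtering as an end node are assumptions. The function `execution_waves` is a hypothetical helper that groups nodes into waves whose members could run in parallel, using a level-by-level variant of Kahn's topological sort.

```python
# Assumed edges, inferred from the textual walk-through of FIG. 3.
edges = [
    ("data1", "full_table_statistics"),
    ("data1", "frequency_statistics"),
    ("full_table_statistics", "data_column_splitting"),
    ("frequency_statistics", "fill_null_values"),
    ("frequency_statistics", "filtering"),
    ("data_column_splitting", "data_integration"),
    ("fill_null_values", "data_integration"),
    ("data_integration", "normalization"),
    ("normalization", "association_analysis"),
    ("association_analysis", "model_evaluation"),
]

# Build the pre/next structure keyed by node id, plus the startNode entry.
flow = {"startNode": ["data1"]}
for source, target in edges:
    flow.setdefault(source, {"pre": [], "next": []})["next"].append(target)
    flow.setdefault(target, {"pre": [], "next": []})["pre"].append(source)


def execution_waves(flow):
    """Group nodes into waves; a node joins a wave once all its pre nodes are done."""
    nodes = {key: value for key, value in flow.items() if key != "startNode"}
    waiting_on = {node_id: len(node["pre"]) for node_id, node in nodes.items()}
    wave, waves = list(flow["startNode"]), []
    while wave:
        waves.append(wave)
        next_wave = []
        for node_id in wave:
            for child in nodes[node_id]["next"]:
                waiting_on[child] -= 1
                if waiting_on[child] == 0:  # aggregation nodes wait for every input
                    next_wave.append(child)
        wave = next_wave
    return waves


waves = execution_waves(flow)
for index, wave in enumerate(waves, start=1):
    print(index, wave)
```

Under these assumptions the second wave contains the two statistics nodes, the third contains data column splitting, null-value filling and filtering, and data integration appears alone in the fourth wave because it waits for both of its inputs.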

In addition, the machine learning work module 200 also involves data interaction with the component module 100, the configuration module 300, the log module 400, and the report module 500. The system is implemented based on Vue.js, so Vuex is used to manage and exchange the complex data in the system.

(III) configuration module

Each node in the machine learning pipeline flow first has its own basic configuration mode; second, its configuration content may be affected by the configuration of each node in the upstream flow; and third, its configuration content may in turn affect nodes in the downstream flow. For example, in FIG. 3, data column splitting is applied to data1 and new data fields ["year", "month", "day"] are generated according to the configured delimiter. When the algorithm nodes after data column splitting are configured, the data fields offered in their configuration content must therefore include the 3 new fields generated by the split; and if the splitting node is later deleted, downstream nodes whose configuration uses the generated fields must be reconfigured. The strategy adopted by the module for configuring each node is therefore as follows:

1. The default configuration content of the current node is acquired according to the preset configuration mode.

2. Tracing is performed on the data structure obtained by the pipeline flow translation in the machine learning work module, that is, searching upward from the current node to the start node to obtain all paths from the start node to the current node; the node configurations on each path are then read downward from the start node to generate the latest configuration content.

3. After the configuration is finished, the configurations of existing nodes in the downstream flow that are affected are flagged to the user and their configuration content is updated.

Taking FIG. 3 as an example, configuring normalization requires looking up all paths: normalization -> data column splitting -> full-table statistics -> data1; and normalization -> null-value filling -> frequency statistics -> data1. The configurations along the two paths are then merged step by step from the start node, and the configuration content is updated. Likewise, when the configuration content of normalization changes, the configuration content of the association analysis node and the model evaluation node after it is updated, and a prompt is given if an existing configuration is affected.
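The tracing strategy can be sketched as an upward path search followed by a downward merge. The graph below contains only the two paths listed in the example; the functions `paths_to_start` and `available_fields`, the `added_fields` mapping, and the original fields of data1 ("date", "amount") are hypothetical, while the fields "year", "month" and "day" come from the description.

```python
flow = {
    "data1": {"pre": [], "next": ["full_table_statistics", "frequency_statistics"]},
    "full_table_statistics": {"pre": ["data1"], "next": ["data_column_splitting"]},
    "frequency_statistics": {"pre": ["data1"], "next": ["fill_null_values"]},
    "data_column_splitting": {"pre": ["full_table_statistics"], "next": ["normalization"]},
    "fill_null_values": {"pre": ["frequency_statistics"], "next": ["normalization"]},
    "normalization": {"pre": ["data_column_splitting", "fill_null_values"], "next": []},
}

# Fields each node contributes to downstream configuration (illustrative values).
added_fields = {
    "data1": ["date", "amount"],
    "data_column_splitting": ["year", "month", "day"],
}


def paths_to_start(flow, node_id):
    """Search upward through pre; each path runs from node_id back to a start node."""
    parents = flow[node_id]["pre"]
    if not parents:
        return [[node_id]]
    return [[node_id] + path for parent in parents for path in paths_to_start(flow, parent)]


def available_fields(flow, node_id, added_fields):
    """Merge upstream contributions, reading each path downward from the start node."""
    fields = []
    for path in paths_to_start(flow, node_id):
        for upstream in reversed(path[1:]):  # skip node_id itself, start node first
            for field in added_fields.get(upstream, []):
                if field not in fields:
                    fields.append(field)
    return fields


paths = paths_to_start(flow, "normalization")
fields = available_fields(flow, "normalization", added_fields)
```

If the data column splitting node were deleted, recomputing `available_fields` would no longer return "year", "month" and "day", which is the situation in which downstream nodes using those fields would need a prompt and reconfiguration.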

(IV) Log Module

The log module 400 tracks the running status of the machine learning flow by polling and converts it into user-friendly text displayed in the log area. In particular, when a problem occurs while the model is running, the log module provides error information to guide the user through the fix, after which the user can continue running from the failed node. The module acquires and displays the log as follows:

1. After the machine learning work module submits a run, the current running state, the currently running node and any new operation results generated since the last query are acquired by polling.

2. The state, node, execution time and similar content are displayed in the log area; if the run fails, the failure reason and the error location are displayed there as well.

3. The display of the machine learning work area is changed according to the current running state, with the four different states, including run succeeded, run failed and not run, shown with different visual effects.
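The polling behaviour described in these steps might look like the following sketch. The status format, the state names ("running", "success", "failed"), and the function `poll_run` are assumptions made for illustration; in a real system `fetch_status` would query the backend, whereas here it is any callable returning a dictionary.

```python
import time


def poll_run(fetch_status, interval_s=1.0, max_polls=100):
    """Poll a run until it succeeds or fails, collecting readable log lines.

    fetch_status is assumed to return a dict with "state" and "node" keys,
    plus an "error" key when the state is "failed".
    """
    log_lines = []
    for _ in range(max_polls):
        status = fetch_status()
        line = f"[{status['node']}] {status['state']}"
        if status["state"] == "failed":
            line += f": {status.get('error', 'unknown error')}"
        log_lines.append(line)
        if status["state"] in ("success", "failed"):
            break  # terminal state reached, stop polling
        time.sleep(interval_s)
    return log_lines


# Simulated backend responses, used only to exercise the loop.
responses = iter([
    {"state": "running", "node": "data1"},
    {"state": "running", "node": "normalization"},
    {"state": "failed", "node": "normalization", "error": "missing field 'year'"},
])
log_lines = poll_run(lambda: next(responses), interval_s=0)
```

The last log line names the failed node, which is the information a user would need in order to fix the configuration and restart from that node.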

(V) report module

The report module 500 provides the user with a highly editable, visual report generation tool. For each node in the machine learning pipeline flow, the platform generates a basic report module from items such as the node's configuration content and operation result, and visualizes the operation result in the report module using charts and other forms. This yields a machine learning report that is illustrative, conclusive and easy to learn from. The module generates the report as follows:

1. A report template is preset for each kind of component, and ECharts is applied in the template to visualize results using charts and other forms.

2. Information on all nodes, node configurations, node operation results and so on is acquired from the machine learning work module. The correct report template is selected according to the node type to generate a report module, and all modules are assembled in order into an initial report.

3. The user adds and deletes modules in the report as needed, and can also add summary text, charts and other content, to generate and save the final report.

4. When a saved report is edited again, the existing modules can still be added and deleted, and the charts and text that the user added can still be modified.
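The template selection and editing steps can be illustrated with plain-text templates. This is a simplified sketch: the embodiment renders charts with ECharts, whereas the templates here are only format strings, and the names `TEMPLATES`, `build_initial_report`, `add_user_section` and `delete_section`, as well as the sample nodes and results, are invented for illustration.

```python
# One preset template per component type (text only; charts are omitted here).
TEMPLATES = {
    "data": "Data source '{name}': {result}",
    "algorithm": "Algorithm '{name}' with configuration {config}: {result}",
    "model": "Model '{name}': {result}",
}


def build_initial_report(nodes):
    """Pick a template by node type and assemble the modules in node order."""
    return [
        {"source": "generated", "node": node["name"],
         "text": TEMPLATES[node["type"]].format(**node)}
        for node in nodes
    ]


def add_user_section(report, text, position=None):
    """Insert user-written summary content; it stays editable like any module."""
    section = {"source": "user", "node": None, "text": text}
    report.insert(len(report) if position is None else position, section)
    return report


def delete_section(report, index):
    """Remove a module the user does not want in the final report."""
    del report[index]
    return report


# Made-up nodes and results, used only to exercise the functions.
nodes = [
    {"name": "data1", "type": "data", "config": {}, "result": "1000 rows"},
    {"name": "kmeans", "type": "algorithm", "config": {"k": 3}, "result": "3 clusters"},
]
report = build_initial_report(nodes)
add_user_section(report, "Summary: three customer groups were found.")
```

Keeping a `source` marker on each module is one possible way to distinguish generated modules from content the user added when a saved report is opened for editing again.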

According to the distributed machine learning visualization device provided by the embodiment of the invention, data exploration, data preprocessing, feature engineering, a machine learning algorithm, a model generation and a modeling process pipeline of model evaluation are integrated; the distributed machine learning pipeline integration is realized by combining two stages of verification execution and full data execution in the machine learning process; dynamically generating editable visual reports for contents such as a modeling process, a modeling result and the like; therefore, a distributed machine learning platform with low threshold and high visualization degree is provided for non-machine learning professionals, the problems of mass data and high-precision machine learning can be effectively solved, and the method is simple and easy to implement.

Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.

In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.

Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims

1. A distributed machine learning visualization device, comprising:

the component module is used for providing a dragging component and a viewable editing component of the report, wherein the dragging component comprises a data source component, an algorithm component, a model component and a project component;

the machine learning work module is used for providing a work area for machine learning and allowing the dragging component to be dragged into the machine learning work module and connected into a flow diagram, and provides the functions of: configuring component parameters, checking the result of the current node, running the subsequent machine learning process starting from the current node, saving a model and generating a report;

the configuration module is used for providing component configuration content and dynamically updating according to the current configuration;

the log module is used for providing a current running state;

and the report module is used for providing, when a report is generated, the details of each node in the current working area and the visualized content of the operation result, wherein the report is editable and can be re-edited when it is viewed.

2. The apparatus of claim 1, wherein

the data source component supports importing csv and excel data, importing data through a database connection, and using existing data;

the algorithm component comprises modules such as data exploration, data preprocessing, feature engineering, data analysis and model evaluation, each module contains its own algorithms, and each algorithm has preset configuration and operation logic;

the model component comprises a model saved by a user;

the project components include new projects and existing projects.

3. The device according to claim 1, wherein the machine learning work module is specifically used for construction, monitoring, translation, operation, saving and reporting of the machine learning modeling pipeline process, and for the distributed machine learning pipeline process.

4. The apparatus of claim 2, wherein a node id in a pipeline flow is used as a key in a data object, and the initial node is placed in startNode as the starting point for executing the machine learning flow, wherein the type of the current node is determined by the pre and next fields of each node, wherein pre represents the forward nodes of the current node, and wherein next represents the backward nodes of the current node.
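By way of illustration only, the data object described in claim 4 might be sketched as follows. The claim fixes only the ideas of id-keyed nodes, a startNode entry, and pre and next fields; the "nodes" wrapper, the node ids, and the helper check_links are hypothetical names introduced for this sketch and are not part of the claimed device.

```python
from typing import List

# Hypothetical flow object. Node ids serve as keys; "pre" lists the forward
# (upstream) node ids and "next" lists the backward (downstream) node ids.
# The "nodes" wrapper and the ids themselves are assumptions for illustration.
flow = {
    "startNode": "load_csv",
    "nodes": {
        "load_csv": {"pre": [], "next": ["clean"]},
        "clean": {"pre": ["load_csv"], "next": ["train"]},
        "train": {"pre": ["clean"], "next": []},
    },
}


def check_links(flow: dict) -> List[str]:
    """Return a list of consistency problems; an empty list means pre/next agree."""
    nodes = flow["nodes"]
    problems = []
    start = flow["startNode"]
    if start not in nodes:
        problems.append(f"startNode {start!r} is not a known node")
    elif nodes[start]["pre"]:
        problems.append("startNode has predecessors")
    for node_id, node in nodes.items():
        # Every downstream link must be mirrored by a pre entry on the target.
        for nxt in node["next"]:
            if nxt not in nodes or node_id not in nodes[nxt]["pre"]:
                problems.append(f"{node_id} -> {nxt} has no matching pre entry")
        # Every upstream link must be mirrored by a next entry on the source.
        for prev in node["pre"]:
            if prev not in nodes or node_id not in nodes[prev]["next"]:
                problems.append(f"{prev} -> {node_id} has no matching next entry")
    return problems
```

Storing both pre and next on every node is redundant, so a check of this kind is one possible way to detect a flow diagram whose connections were only partially saved.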

5. The apparatus of claim 4, wherein

when the length of pre is greater than 1, the current node is an aggregation node, and the current node can be executed only after the upstream processes of all nodes in pre have been executed;

when the length of next is greater than 1, the current node is a split node, and after the current node is executed, all nodes in next are executed in parallel;

when the lengths of pre and next are both 1, the current node is a linear node, the current node is executed after the unique node in pre has finished executing, and the unique node in next is executed after the current node has finished executing;

when the length of pre is 0, the current node is a start node, and when the length of next is 0, the current node is an end node.
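By way of illustration only, the rules of claim 5 might be sketched as follows. The label strings, the function names node_kinds and execution_waves, the example node ids, and the grouping of nodes into "waves" are all assumptions made for this sketch; the claims state the ordering constraints but do not specify a scheduler. Nodes placed in the same wave have had all of their pre nodes executed and could therefore be run in parallel.

```python
from typing import Dict, List, Set


def node_kinds(node: dict) -> Set[str]:
    """Classify a node from the lengths of its pre and next lists (labels are illustrative)."""
    n_pre, n_next = len(node["pre"]), len(node["next"])
    kinds = set()
    if n_pre == 0:
        kinds.add("start")
    if n_next == 0:
        kinds.add("end")
    if n_pre > 1:
        kinds.add("aggregation")  # must wait for every node in pre
    if n_next > 1:
        kinds.add("split")  # its next nodes may run in parallel
    if n_pre == 1 and n_next == 1:
        kinds.add("linear")
    return kinds


def execution_waves(nodes: Dict[str, dict], start: str) -> List[List[str]]:
    """Group node ids into waves; a node joins a wave only once all its pre nodes are done."""
    done: Set[str] = set()
    waves: List[List[str]] = []
    ready = [start]
    while ready:
        waves.append(ready)
        done.update(ready)
        candidates: List[str] = []
        for node_id in ready:
            for nxt in nodes[node_id]["next"]:
                if nxt in done or nxt in candidates:
                    continue
                # Aggregation rule: every upstream node must have finished.
                if all(p in done for p in nodes[nxt]["pre"]):
                    candidates.append(nxt)
        ready = candidates
    return waves


# Hypothetical flow: one split after "explore" and one aggregation at "train".
nodes = {
    "source": {"pre": [], "next": ["explore"]},
    "explore": {"pre": ["source"], "next": ["scale", "encode"]},
    "scale": {"pre": ["explore"], "next": ["train"]},
    "encode": {"pre": ["explore"], "next": ["train"]},
    "train": {"pre": ["scale", "encode"], "next": []},
}

waves = execution_waves(nodes, "source")
```

In this example, "scale" and "encode" fall into the same wave because both depend only on "explore", while "train" is deferred until both have finished. A node may carry more than one label, for instance an aggregation node that is also an end node.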