WO2020035076A1 - Method and system for visualizing data processing steps of a machine learning process - Google Patents

Method and system for visualizing data processing steps of a machine learning process

Info

Publication number
WO2020035076A1
WO2020035076A1 PCT/CN2019/101444 CN2019101444W WO2020035076A1 WO 2020035076 A1 WO2020035076 A1 WO 2020035076A1 CN 2019101444 W CN2019101444 W CN 2019101444W WO 2020035076 A1 WO2020035076 A1 WO 2020035076A1
Authority
WO
WIPO (PCT)
Prior art keywords
information
data
display control
data processing
node
Prior art date
Application number
PCT/CN2019/101444
Inventor
方荣
杨博文
黄亚建
杨慧斌
詹镇江
Original Assignee
第四范式(北京)技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 第四范式(北京)技术有限公司
Publication of WO2020035076A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/451 Execution arrangements for user interfaces
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Definitions

  • the present disclosure relates to the field of machine learning, and more particularly, to a method and system for visualizing data processing steps of a machine learning process.
  • data as the raw material of the machine learning process is of great significance to the effect of machine learning models.
  • it is often necessary to perform corresponding processing on the data, such as data cleaning, data filling, data splicing, or feature extraction.
  • the data processing process can be implemented by running the code written by the programmer, or by the machine learning platform according to the user-entered script, configuration, and / or interactive operations.
  • the entire data processing process often involves a large amount of data or complex processing operations.
  • the interaction between existing machine learning platforms and users is poor.
  • General users cannot intuitively understand the logic and working details of the data processing process, which makes it difficult for them to quickly determine which data processing step has an exception or error when problems occur in the machine learning process. This greatly hinders the application and adoption of machine learning technology.
  • a method for visualizing data processing steps of a machine learning process, including: parsing data processing steps of a predefined machine learning process to obtain profile information of the data processing steps, wherein the profile information includes data information and / or processing information of the data processing steps; generating, based on the obtained profile information, an understanding view for describing the data processing steps of the machine learning process; and displaying the understanding view graphically.
  • a computer-readable medium for visualizing data processing steps of a machine learning process, wherein a computer program for executing, by one or more processors, the aforementioned method of visualizing data processing steps of a machine learning process is recorded on the computer-readable medium.
  • a computing device for visualizing data processing steps of a machine learning process.
  • the computing device may include one or more storage devices and one or more processors.
  • a computer-executable instruction set is stored in the one or more storage devices, and when the one or more processors execute the computer-executable instruction set, the aforementioned method of visualizing data processing steps of a machine learning process is performed.
  • a system for visualizing data processing steps of a machine learning process, including: an interpretation device for parsing data processing steps of a predefined machine learning process to obtain profile information of the data processing steps, wherein the profile information includes data information and / or processing information of the data processing steps; a view generating device for generating, based on the obtained profile information, an understanding view for describing the data processing steps of the machine learning process; and a display device for graphically displaying the understanding view.
  • FIG. 1 is an example of configuring a machine learning process by constructing a directed acyclic graph (DAG) in a prior art machine learning platform.
  • FIG. 2 illustrates a system for visualizing data processing steps of a machine learning process according to an exemplary embodiment of the present disclosure.
  • FIG. 3 illustrates a flowchart of a method of visualizing data processing steps of a machine learning process according to an exemplary embodiment of the present disclosure.
  • FIG. 4 illustrates an understanding view of data processing steps for depicting a machine learning process according to an exemplary embodiment of the present disclosure.
  • FIG. 5 illustrates another understanding view of data processing steps for depicting a machine learning process according to an exemplary embodiment of the present disclosure.
  • FIG. 6 illustrates a flowchart of a method of visualizing a generation process of a specific feature according to an exemplary embodiment of the present disclosure.
  • FIG. 7 illustrates a process presentation view for describing a generation process of a specific feature according to an exemplary embodiment of the present disclosure.
  • Machine learning including deep learning
  • machine learning is an inevitable product of the development of artificial intelligence to a certain stage. It is committed to using computing methods to mine valuable potential information from massive data and use experience to improve the performance of the system itself.
  • experience usually exists in the form of "data".
  • models can be generated from data: empirical data is provided to a machine learning algorithm, which generates a model based on that data. When faced with a new situation, the model provides a corresponding judgment, that is, a prediction result.
  • Machine learning may be implemented in the form of "supervised learning", "unsupervised learning", or "semi-supervised learning", and it should be noted that the exemplary embodiments of the present disclosure do not specifically limit specific machine learning algorithms.
  • the data processing process is at least a part of a process from introducing original data to outputting a sample, and the entire process may also be referred to as feature engineering.
  • the data processing process may include one or more data processing steps. According to an exemplary embodiment of the present disclosure, details of the data processing steps may be obtained through analysis.
  • the present disclosure proposes a method and system for visualizing a machine learning process.
  • the method and system can visualize data processing steps of the machine learning process, so that users can quickly and intuitively understand the data processing process.
  • FIG. 1 is an example of configuring a machine learning process by constructing a directed acyclic graph (DAG) in a prior art machine learning platform.
  • each module can represent the relevant steps in the machine learning process.
  • the data processing process selected by the thick line frame occupies a larger part of the work.
  • the data fields on which the sample features are based may come from a wide table generated by splicing multiple data tables. For example, a bank that uses a machine learning model to identify fraudulent transactions may splice the user information table, the bank card information table, and the transaction record table into a wide table for processing.
  • the raw data records in the data table may need to undergo a series of operations such as cleaning, format conversion (for example, date format conversion), and time-series stitching.
  • although the related processing is modularized by showing each step as a node in the DAG, this approach does not help the user quickly understand the overall idea of the data processing process or the specific work done. A user who wants to know the details has to proactively add a description for each module or check the specific content of each module one by one, which increases the user's burden. Moreover, in many cases the specific content of these modules is simply the raw code that performs the corresponding processing, so the user needs a certain level of expertise to understand the data processing process through the code.
  • FIG. 2 illustrates a system 100 for visualizing data processing steps of a machine learning process according to an exemplary embodiment of the present disclosure.
  • the system 100 includes an interpretation device 101, a view generating device 102, and a display device 103.
  • the interpretation device 101 may analyze data processing steps of a predefined machine learning process to obtain profile information of the data processing steps, where the profile information includes data information and / or processing information of the data processing steps.
  • the analysis processing may be performed on the corresponding data processing steps before, simultaneously with, or after the machine learning process is run, so that information such as input, output, intermediate results, processing details, etc. on the data processing steps can be obtained.
  • the view generating device 102 may generate an understanding view for describing data processing steps of the machine learning process based on the acquired profile information.
  • the view generating device 102 may form, based on the parsed information of each data processing step, an understanding view that on the one hand reflects the relationships among the data processing steps and on the other hand reflects the data information and / or processing information of each data processing step.
  • the display device 103 may display the understanding view in a graphical manner.
  • the display device 103 may display the understanding view to the user through an output device such as a display (not shown).
  • the display device 103 may display the understanding view in a specific form or with a specific effect to help the user understand the related data processing process through the displayed understanding view.
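  • The division of labor among the three devices described above can be sketched as three functions chained together. This is only an illustrative sketch: the `StepProfile` structure, the function names, and the dictionary-based input format are assumptions made here, and plain text output stands in for a graphical display.

```python
from dataclasses import dataclass, field


@dataclass
class StepProfile:
    """Profile information of one data processing step (hypothetical structure)."""
    name: str
    depends_on: list = field(default_factory=list)
    details: dict = field(default_factory=dict)


def interpret(process):
    """Role of interpretation device 101: parse step definitions into profiles."""
    return [
        StepProfile(step["name"], list(step.get("inputs", [])), dict(step.get("info", {})))
        for step in process
    ]


def generate_view(profiles):
    """Role of view generating device 102: nodes plus dependency edges."""
    nodes = {p.name: p.details for p in profiles}
    edges = [(dep, p.name) for p in profiles for dep in p.depends_on]
    return {"nodes": nodes, "edges": edges}


def display(view):
    """Role of display device 103: plain text stands in for a graphical display."""
    lines = [f"[{name}] {info}" for name, info in view["nodes"].items()]
    lines += [f"{src} -> {dst}" for src, dst in view["edges"]]
    return "\n".join(lines)


# Assumed input format: a list of step dictionaries.
process = [
    {"name": "import", "info": {"rows": 3}},
    {"name": "clean", "inputs": ["import"], "info": {"rows": 2}},
]
text = display(generate_view(interpret(process)))
print(text)
```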
  • FIG. 3 illustrates a flowchart of a method of visualizing data processing steps of a machine learning process according to an exemplary embodiment of the present disclosure.
  • the machine learning process is set by a user of the machine learning platform.
  • the machine learning process may be expressed as a directed acyclic graph (DAG) generated by a user by dragging a node module, wherein the user may configure data and / or operations corresponding to each node module.
  • the machine learning process may be represented as a computer program code manually written by a user.
  • one or more data processing steps are usually required to operate on the original data. These operations will bring changes to the fields, and the operations performed by the data processing steps can be quickly perceived from the changes in the fields.
  • the predefined machine learning process may include one or more data processing steps.
  • the one or more data processing steps may include a data introduction step, a data cleaning step, a data splicing step, a time-series aggregation step, a feature extraction step, and so on.
  • the processing results of these data processing steps can be considered as output tables (for example, data tables or sample tables).
  • the data processing steps may be selectively visualized according to user needs or predetermined settings. As shown in FIG. 3, the method for visualizing the data processing steps may include steps S11, S12, and S13.
  • in step S11, the data processing steps of the predefined machine learning process are parsed to obtain profile information of the data processing steps, where the profile information may include data information and / or processing information of the data processing steps.
  • the data processing step may include various steps such as a data introduction step, a data cleaning step, a data table splicing step, a time series aggregation step, and / or a feature extraction step.
  • the data processing step may be one or more data processing steps that the user wants to know in the machine learning process, and may be all or a part of the data processing steps in the machine learning process.
  • the profile information may include at least one of: a name of the data processing step, a name of an output table of the data processing step, a number of rows of the output table, a number of columns of the output table, field names of the output table, a processing procedure, and a step description added by the user.
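  • As a rough illustration of the profile information listed above, the following sketch derives it from a step's output table. The `ProfileInfo` class, the `profile_step` helper, the list-of-dictionaries table format, and the example field names are all assumptions made for this sketch, not details given in the disclosure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProfileInfo:
    step_name: str
    output_table: str
    n_rows: int
    n_cols: int
    field_names: list
    procedure: Optional[str] = None    # e.g. the script or SQL text of the step
    description: Optional[str] = None  # step description added by the user


def profile_step(step_name, output_table, rows, procedure=None, description=None):
    """Derive profile information from a step's output table.

    `rows` is a list of dicts, one per record (an assumed in-memory format).
    """
    field_names = list(rows[0].keys()) if rows else []
    return ProfileInfo(step_name, output_table, len(rows), len(field_names),
                       field_names, procedure, description)


rows = [{"trx_id": 1, "amount": 9.5}, {"trx_id": 2, "amount": 20.0}]
info = profile_step("data introduction", "transactions", rows,
                    description="Import transaction table")
print(info)
```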
  • in step S12, the view generating device 102 generates an understanding view for describing the data processing steps based on the acquired profile information.
  • the view generating device 102 may generate an understanding view capable of reflecting input data or output data of each data processing step itself, and / or at least a part of a processing method involved in each data processing step. In this understanding view, the dependency of each data processing step on the execution order can be further reflected.
  • in step S13, the display device 103 graphically displays the understanding view.
  • the display device 103 may display the understanding view in any graphical form (for example, a flowchart, a structure diagram, a table, an item, a graphic, etc.), so that a user can easily get an overview of the various data processing steps from the understanding view.
  • the understanding view may be a flowchart representing the data processing steps, wherein the nodes in the flowchart respectively correspond to each data processing step, and the nodes of each data processing step are related to each other according to a dependency relationship.
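  • One possible way to lay out such a flowchart is to order the nodes so that every step appears after the steps it depends on. The sketch below uses Python's standard `graphlib.TopologicalSorter` for this; the step names are hypothetical, and the disclosure does not prescribe any particular ordering algorithm.

```python
from graphlib import TopologicalSorter

# Hypothetical step names; each key maps to the set of steps it depends on.
dependencies = {
    "import_a": set(),
    "import_b": set(),
    "join": {"import_a", "import_b"},
    "extract_features": {"join"},
}

# An execution-compatible ordering of the nodes (ties may be ordered arbitrarily).
order = list(TopologicalSorter(dependencies).static_order())

# Directed edges of the flowchart, from a prerequisite step to the dependent step.
edges = [(dep, step) for step, deps in dependencies.items() for dep in sorted(deps)]

print(order)
print(edges)
```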
  • each node may have a corresponding display control.
  • the display control may be a display frame having any of various shapes. At least a part of the profile information may be further displayed in or around the display frame. It should be noted that the profile information can be displayed directly in or around the display frame; alternatively, the profile information can be hidden, so that the related content is shown only after the user performs a corresponding trigger operation (for example, clicking on the display control).
  • the process of displaying the understanding view in a graphical manner by using the display device 103 may include: displaying the data processing steps of the machine learning process by listing the profile information of the corresponding data processing step in the display control of each node.
  • the machine learning platform can preset which profile information will be listed in the display controls of each node, and can also set or adjust the profile information to be listed in each display control according to the user's selection.
  • the processing of listing the profile information of the corresponding data processing step in the display control of each node may include: using the display device 103, listing by default, in the display control of each node, the primary display information among the profile information of the corresponding data processing step; and further listing, in the display control, the supplementary display information among the profile information of the corresponding data processing step in response to the user's operation on the display control.
  • the profile information can be listed in the display control in levels or layers.
  • the content to be displayed in levels or layers may be determined in advance by the machine learning platform; which profile information is included in the primary display information and / or in the supplementary display information may also be determined according to the user's settings.
  • the supplementary display information may be displayed all at once, or may itself be displayed in levels or layers.
  • the operation of the display control by the user may be any operation performed on the display control by the user in order to further understand the data processing steps corresponding to the display control.
  • a user may click a display control of a node on a user interaction interface of a machine learning platform to further understand supplementary display information of a data processing step corresponding to the node.
  • the primary display information includes at least one of the name of the data processing step, the name of the output table, the number of rows in the output table, the number of columns in the output table, and the step description added by the user, and / or the supplementary display information includes at least part of the field names of the output table and / or at least part of the processing procedure of the data processing step.
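  • The split between primary and supplementary display information can be sketched as a display control that lists the primary items by default and adds the supplementary items after a click. The `DisplayControl` class, the key names, and the text rendering are assumptions made for illustration only.

```python
# Which profile items belong to which tier; a platform or user setting could change this.
PRIMARY_KEYS = ("step_name", "output_table", "n_rows", "n_cols", "description")
SUPPLEMENTARY_KEYS = ("field_names", "procedure")


class DisplayControl:
    """Hypothetical display control for one node of the understanding view."""

    def __init__(self, profile):
        self.profile = profile
        self.expanded = False

    def on_click(self):
        """Toggle between the default listing and the expanded listing."""
        self.expanded = not self.expanded

    def lines(self):
        """Text lines currently listed in the control."""
        keys = PRIMARY_KEYS + (SUPPLEMENTARY_KEYS if self.expanded else ())
        return [f"{key}: {self.profile[key]}" for key in keys if key in self.profile]


control = DisplayControl({"step_name": "join", "n_rows": 80000, "field_names": ["a", "b"]})
collapsed = control.lines()
control.on_click()
expanded = control.lines()
print(collapsed)
print(expanded)
```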
  • the size of the display control is adaptively adjusted to further list the supplementary display information of the corresponding data processing step.
  • the processing of using the display device 103 to list the profile information of the corresponding data processing step in the display control of each node may further include: adaptively adjusting the size of the display control according to the content of the profile information listed in the display control.
  • here, adaptively adjusting the size of the display control according to the content of the profile information listed in it may include adjusting the size according to the amount of that content. That is, the size of each display control depends on the content of the profile information to be displayed therein.
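  • A minimal sketch of such adaptive sizing, assuming a fixed-width font: the width follows the longest listed line and the height follows the number of lines. The function name and the pixel metrics are illustrative assumptions, not values from the disclosure.

```python
def control_size(lines, char_width=8, line_height=18, padding=10):
    """Return (width, height) in pixels for a control listing `lines`.

    The pixel metrics are assumed defaults for a fixed-width font.
    """
    longest = max((len(line) for line in lines), default=0)
    width = longest * char_width + 2 * padding
    height = len(lines) * line_height + 2 * padding
    return width, height


print(control_size(["abc"]))
print(control_size(["abc", "abcdef"]))
```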
  • the processing of listing the profile information of the corresponding data processing step in the display control of each node may further include: using the display device 103 to list, with a prominent visual effect in the display control of each node, the newly generated field names among the field names of the output table of the corresponding data processing step.
  • the data processing step may be a step such as data table splicing, in which case at least one field of the data table being spliced in (here, a field corresponds to a column of the data table) may be spliced into the original data table and become a newly generated field in the output table of the data processing step.
  • the prominent visual effects may include, but are not limited to: an enlarged font, a font format different from that of other field names, a specific font style (e.g., bold, italic, and / or underlined), and / or a specific font color.
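  • Newly generated fields can be found by comparing the field names of a step's output table with those of its input table. In the sketch below, the helper names, the example fields, and the use of `**...**` as a stand-in for a prominent visual effect are assumptions made for illustration.

```python
def mark_new_fields(input_fields, output_fields):
    """Pair each output field name with a flag telling whether the step created it."""
    known = set(input_fields)
    return [(name, name not in known) for name in output_fields]


def render(marked):
    """Wrap new field names in ** ** as a text stand-in for bold or colored fonts."""
    return [f"**{name}**" if is_new else name for name, is_new in marked]


before = ["trx_id", "amount"]        # fields of the original table (hypothetical)
after = ["trx_id", "amount", "flag"]  # fields after splicing in a label column
print(render(mark_new_fields(before, after)))
```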
  • the profile information listed in the display control of each node may include all field names of the output table of the corresponding data processing step, where field names having the same initial source table are arranged together.
  • the data processing step may be a step such as data table splicing.
  • each initial data table may undergo multiple data table splicing processes at different stages to obtain an output table as a result of splicing. That is, an output table may be a result obtained by continually splicing at least a part of or all fields from different other data tables on the basis of the original data table.
  • field names originating from the same initial source table can be arranged together in the display control of the node.
  • the initial source table can indicate a data table that was originally introduced into a machine learning system (for example, a machine learning platform) without going through any data processing steps. The splicing process of the data tables can be traced to identify field names that have the same initial source table.
  • the field names with the same initial source table are listed in the display controls of all nodes according to the same visual effect.
  • the visual effects may include, but are not limited to: font size, font format, font style (e.g., bold, italic, and / or underlined), and / or font color, and the like.
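  • Tracing fields back to their initial source table can be sketched by carrying a field-to-source map through each splicing step, then grouping by source and assigning one visual effect per source. The table names, field names, color palette, and helper functions below are hypothetical and chosen only for illustration.

```python
def import_table(table_name, field_names):
    """Lineage of a freshly introduced table: every field originates from it."""
    return {name: table_name for name in field_names}


def splice(left, right):
    """Lineage after splicing: each field keeps the initial source it started with."""
    merged = dict(left)
    for name, source in right.items():
        merged.setdefault(name, source)  # on a name clash the left field is kept
    return merged


def group_by_source(lineage):
    """Arrange field names so that those sharing an initial source table sit together."""
    groups = {}
    for name, source in lineage.items():
        groups.setdefault(source, []).append(name)
    return groups


def assign_colors(sources, palette=("blue", "green", "orange")):
    """One color per initial source table, reusable by every node's display control."""
    return {source: palette[i % len(palette)] for i, source in enumerate(sources)}


transactions = import_table("transactions", ["trx_id", "ip", "amount"])
fraud = import_table("fraud", ["trx_id", "flag"])
ip_mapping = import_table("ip_mapping", ["ip", "city"])

wide = splice(splice(transactions, fraud), ip_mapping)
groups = group_by_source(wide)
colors = assign_colors(groups)
print(groups)
print(colors)
```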
  • a single data processing step may indicate one or more further processing steps performed on all fields or at least one of the fields of a data record or a sample record.
  • the further processing steps included in a single data processing step may be referred to as sub-steps.
  • one or more sub-steps may be obtained and represented as a sub-flowchart.
  • the data processing steps are divided into data introduction steps and non-data introduction steps, wherein a data introduction step may refer to a step in which data (for example, a data file, a data table, etc.) is initially introduced into a machine learning system (for example, a machine learning platform), and may also refer to a step of introducing data into a specific machine learning process (for example, applying data that already exists in a machine learning system to a specific machine learning process).
  • the above two steps can also be a unified single step.
  • the non-data introduction step includes steps other than the data introduction step in the data processing step, for example, a data cleaning step, a data table splicing step, a time series aggregation step, a feature extraction step, and the like.
  • the display control of the node corresponding to the data introduction step and the display control of the node corresponding to the non-data introduction step have different respective forms; for example, at least one of the shape of the display control, the border line type, the border color, the background color, the background pattern, and the font format, font style (for example, bold, italic, and / or underlined), and font color in the display control may differ according to the type of data processing step. A specific description is given below with reference to FIGS. 4 and 5.
  • FIG. 4 illustrates an example of an understanding view for describing data processing steps of a machine learning process according to an exemplary embodiment of the present disclosure.
  • the data processing steps depicted in the understanding view shown in FIG. 4 include, in order of execution, data introduction steps, data table splicing steps, and a feature extraction step; however, the present disclosure is not limited to this, and an understanding view can be provided for the various data processing steps of any data processing process.
  • the method shown in FIG. 3 is used to generate and display the understanding view shown in FIG. 4.
  • the understanding view is formed by connecting the display controls 201, 202, 203, 204, 205, and 206 in accordance with the execution order of the data processing steps of the corresponding nodes.
  • the arrows between the display controls indicate the execution order between the data processing steps corresponding to the display controls.
  • the data processing step corresponding to the display control 201 is the first data introduction step.
  • the first data introduction step is used to introduce a data table named cmb0404_app_trx_detail into the machine learning process. Accordingly, the execution result of the first data introduction step is the output table cmb0404_app_trx_detail.
  • the name (cmb0404_app_trx_detail) of the output table and the number of rows and columns of the output table (80,000 rows and 16 columns) in the overview information of the first data introduction step are listed in the display control 201.
  • other data information and / or processing information of the first data introduction step may be listed in the display control 201 so that the user can understand other details of the first data introduction step.
  • the display control 201 may also list the step description "Import Transaction Table" added by the user (not shown in FIG. 4).
  • the data processing step corresponding to the display control 202 is a second data introduction step.
  • the second data introduction step is used to introduce a data table named cmb0404_fraud into the machine learning process.
  • the execution result of the second data introduction step is the output table cmb0404_fraud.
  • the name (cmb0404_fraud) of the output table and the number of rows and columns of the output table (822 rows and 1 column) in the overview information of the second data introduction step are listed in the display control 202.
  • other data information and / or processing information of the second data introduction step may be listed in the display control 202 so that the user can understand other details of the second data introduction step.
  • the display control 202 may also list a step description added by the user describing "introducing a determined risk transaction table" (not shown in FIG. 4).
  • the data processing step corresponding to the display control 203 is a first data table splicing step.
  • the first data table splicing step is used to splice the output table of the first data introduction step and the output table of the second data introduction step to generate and output a data table named sql: 01_join_fraud (the name of the data table can be the operation name obtained through parsing, or a default name provided in another way, and the user can also actively modify or adjust the name).
  • the name (sql: 01_join_fraud) of the output table and the number of rows and columns of the output table (80,000 rows and 17 columns) in the overview information of the first data table splicing step are listed in the display control 203.
  • the display control 203 also lists the user-added step description in the profile information of the first data table splicing step (splicing the transaction table and the determined risk transaction table, generating a label field flag), so that the user can understand the function of the first data table splicing step.
  • other data information and / or processing information of the first data table splicing step may be listed in the display control 203.
  • the first data table splicing step can be parsed to obtain profile information of the first data table splicing step. The profile information can include at least one of the name of the output table of the first data table splicing step, the number of rows in the output table, the number of columns in the output table, the field names of the output table, the processing procedure, and the step description added by the user.
  • the method may further determine the data source of each field of the output table of the first data table splicing step, that is, the source data table.
  • the data processing step corresponding to the display control 204 is a third data introduction step.
  • the third data introduction step is used to introduce a data table named cmb0404_ip_mapping into the machine learning process.
  • the execution result of the third data introduction step is an output table cmb0404_ip_mapping.
  • the name (cmb0404_ip_mapping) of the output table and the number of rows and columns of the output table (79999 rows and 3 columns) in the overview information of the third data introduction step are listed in the display control 204.
  • other data information and / or processing information of the third data introduction step may be listed in the display control 204.
  • the data processing step corresponding to the display control 205 is a second data table splicing step, which is used to splice the output table of the first data table splicing step with the output table of the third data introduction step to generate and output a data table named sql: 02_join_ip_mapping.
  • the name (sql: 02_join_ip_mapping) of the output table in the summary information of the second data table splicing step is listed in the display control 205.
  • other data information and / or processing information of the second data table splicing step may be listed in the display control 205, so that the user can understand the details of the second data table splicing step.
  • the data processing step corresponding to the display control 206 is a feature extraction step.
  • the feature extraction step is used to extract features from each data record of the output table of the second data table splicing step to generate and output a corresponding feature table.
  • the name fe of the feature extraction step is listed in the display control 206 (the name of the feature extraction step may be an operation name obtained through parsing, or a default name provided in another way, and the user can also actively modify or adjust the name).
  • other data information and / or processing information of the feature extraction step may be listed in the display control 206 so that the user can understand the details of the feature extraction step. For example, a sub-flow chart (not shown in FIG. 4) for describing the processing procedure of the feature extraction step may also be listed in the display control 206.
  • the sizes of the display controls 201, 202, 203, 204, 205, and 206 are adaptively adjusted according to the contents listed.
  • the data processing steps corresponding to the display controls 201, 202, and 204 are data introduction steps
  • the data processing steps corresponding to the display controls 203, 205, and 206 are non-data introduction steps.
  • the background colors of the display controls 201, 202, and 204 corresponding to the data introduction step are different from the background colors of the display controls 203, 205, and 206 corresponding to the non-data introduction step.
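  • The understanding view of FIG. 4 as described above can be written down as a small graph and emitted as Graphviz DOT text. The node names, row and column counts, and edges follow the description of display controls 201 to 206; the choice of DOT as output format, the `to_dot` helper, and the two background colors are assumptions made for this sketch.

```python
INTRO_COLOR = "lightblue"    # assumed background for data introduction steps
OTHER_COLOR = "lightyellow"  # assumed background for non-data introduction steps

# control id -> (listed name, listed table size or "", is data introduction step)
nodes = {
    201: ("cmb0404_app_trx_detail", "80000 rows, 16 columns", True),
    202: ("cmb0404_fraud", "822 rows, 1 column", True),
    203: ("sql: 01_join_fraud", "80000 rows, 17 columns", False),
    204: ("cmb0404_ip_mapping", "79999 rows, 3 columns", True),
    205: ("sql: 02_join_ip_mapping", "", False),
    206: ("fe", "", False),
}
# Arrows follow the execution order between the corresponding steps.
edges = [(201, 203), (202, 203), (203, 205), (204, 205), (205, 206)]


def to_dot(nodes, edges):
    """Render the view as DOT text; box nodes stand in for display controls."""
    lines = ["digraph understanding_view {", "  node [shape=box, style=filled];"]
    for node_id, (name, size, is_intro) in nodes.items():
        label = f"{name}\\n{size}" if size else name  # \n is a DOT line break
        color = INTRO_COLOR if is_intro else OTHER_COLOR
        lines.append(f'  n{node_id} [label="{label}", fillcolor="{color}"];')
    lines += [f"  n{src} -> n{dst};" for src, dst in edges]
    lines.append("}")
    return "\n".join(lines)


dot = to_dot(nodes, edges)
print(dot)
```

  • The printed text can be passed to a Graphviz renderer to obtain a picture; because box nodes grow with their labels, this also loosely mirrors the adaptive sizing of the display controls.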
  • the shape of the display control corresponding to the data introduction step may be different from the shape of the display control corresponding to the non-data introduction step.
  • each display control can also be different according to the type of profile information.
  • the shape of the display control corresponding to the feature extraction step can also be different from the shape of the display control corresponding to the data table splicing step.
  • the user can also actively modify or add any additional information in the display control.
  • some specific profile information can also be displayed hierarchically in the display control.
  • listed in the display control of each node of the understanding view shown in FIG. 4 may be the primary display information among the profile information of the corresponding data processing step.
  • the supplementary display information among the profile information of the corresponding data processing step may be further listed in the display control.
  • FIG. 5 illustrates another example of an understanding view for describing data processing steps of a machine learning process according to an exemplary embodiment of the present disclosure, the example being a further understanding view based on at least a part of the data processing steps in the understanding view shown in FIG. 4.
  • the display controls 201 and 301 correspond to the same node
  • the display controls 202 and 302 correspond to the same node
  • the display controls 203 and 303 correspond to the same node
  • the display controls 204 and 304 correspond to the same node
  • the display controls 205 and 305 correspond to the same node.
  • the nodes corresponding to the display control 206 are omitted in FIG. 5.
  • in response to the user's operation on the display control 201, the display control 201 changes to the form of the display control 301; that is, all the field names of the output table of the first data introduction step are further listed in the display control 301 (optionally, only at least a part of the field names may be listed).
  • in response to the user's operation on the display control 202, the display control 202 changes to the form of the display control 302; that is, all the field names of the output table of the second data introduction step are further listed in the display control 302 (optionally, only at least a part of the field names may be listed).
  • in response to the user's operation on the display control 204, the display control 204 changes to the form of the display control 304; that is, all the field names, or at least a part of the field names, of the output table of the third data introduction step are further listed in the display control 304.
  • the display control 203 changes to the form of the display control 303; that is, all the field names of the output table of the first data table splicing step are further listed in the display control 303 (optionally, only at least a part of the field names may be listed), wherein field names having the same initial source table are arranged together, and / or field names having the same initial source table are listed with the same visual effect.
  • the field name flag of the newly generated field is listed with a prominent visual effect.
  • in response to the user's operation on the display control 205, the display control 205 changes to the form of the display control 305; that is, all the field names of the output table of the first data table splicing step are further listed in the display control 305 (optionally, only at least a part of the field names may be listed), wherein field names having the same initial source table are arranged together, and / or field names having the same initial source table are listed with the same visual effect. Optionally, the field names ip_city and ip_country of the newly generated fields are listed with a prominent visual effect.
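  • The arrangement of field names described above can be sketched as follows. This is only an illustrative sketch: the function `arrange_fields`, the `PALETTE` of visual effects, and the table names `table_a` and `table_b` are assumptions made for the example, and only the field names ip_city and ip_country come from the text.

```python
# Assumed visual effects, one per initial source table.
PALETTE = ["blue", "green", "orange"]


def arrange_fields(fields):
    """Arrange (field_name, initial_source_table) pairs for listing.

    Fields with the same initial source table are placed together and share
    one visual effect. A source of None marks a newly generated field, which
    is listed last with a prominent effect.
    """
    groups = {}
    for name, source in fields:
        groups.setdefault(source, []).append(name)

    arranged = []
    sources = [source for source in groups if source is not None]
    for index, source in enumerate(sources):
        style = PALETTE[index % len(PALETTE)]
        arranged.extend((name, style) for name in groups[source])
    arranged.extend((name, "highlight") for name in groups.get(None, []))
    return arranged


fields = [
    ("trx_date", "table_a"),       # hypothetical source assignment
    ("ip", "table_b"),             # hypothetical field and table
    ("register_date", "table_a"),  # hypothetical source assignment
    ("ip_city", None),             # newly generated field
    ("ip_country", None),          # newly generated field
]
arranged = arrange_fields(fields)
```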
  • not only can each display control be individually caused to display its corresponding supplementary display information, but a unified trigger mechanism (for example, a click on any one display control or a click on a specially provided button) can also cause all display controls to display their respective supplementary display information simultaneously.
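  • A minimal sketch of the primary and supplementary display behaviour might look like the following. The class `DisplayControl`, the function `expand_all`, and the table and field names are hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DisplayControl:
    """Hypothetical display control holding two tiers of profile information."""

    node_id: int
    primary: List[str]
    supplementary: List[str] = field(default_factory=list)
    expanded: bool = False

    def toggle(self) -> None:
        # Individual trigger, e.g. the user clicks this one control.
        self.expanded = not self.expanded

    def lines(self) -> List[str]:
        # Supplementary information is listed only while expanded.
        return self.primary + (self.supplementary if self.expanded else [])


def expand_all(controls: List[DisplayControl]) -> None:
    # Unified trigger, e.g. a dedicated button expands every control at once.
    for control in controls:
        control.expanded = True


controls = [
    DisplayControl(201, ["table A"], ["field_1", "field_2"]),
    DisplayControl(202, ["table B"], ["field_3"]),
]
controls[0].toggle()  # only control 201 now shows its field names
```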
  • an understanding view can be generated and displayed for any type and / or any number of data processing steps in the machine learning process, so as to facilitate user understanding.
  • each data processing step is directed to the entire data table or feature table.
  • the method and system for visualizing data processing steps of a machine learning process may further visualize the generation process of a specific feature in the understanding view, that is, perform feature tracing.
  • the method may display a process presentation view for describing a specific feature in the understanding view for describing a generation process of the specific feature.
  • A detailed description is given below with reference to FIGS. 6 and 7.
  • FIG. 6 illustrates a flowchart of a method of visualizing a generation process of a specific feature according to an exemplary embodiment of the present disclosure.
  • the method includes steps S21, S22, S23, and S24.
  • the interpretation device 101 may determine a specific feature in the understanding view.
  • a specific feature in the understanding view may also be determined by a separate determining device (not shown).
  • the user may select a display control corresponding to the specific feature in the understanding view.
  • the interpretation device 101 may determine the specific feature in response to a user's selection operation at step S21 to analyze the generation process of the specific feature.
  • the interpretation device 101 analyzes at least one data processing step for generating the specific feature in the machine learning process to obtain generation process information of the specific feature, wherein the generation process information includes data information and / or processing information of the at least one data processing step.
  • the machine learning process may be expressed as a directed acyclic graph (DAG) generated by a user by dragging node modules, wherein the user may configure the data and / or operations corresponding to each node module.
  • the machine learning process may be represented as a computer program code manually written by a user.
  • the at least one data processing step for generating a specific feature may include a data introduction step, a data cleaning step, a data splicing step, a time-series aggregation step, a feature extraction step, and the like.
  • the processing results of these data processing steps may be fields related to the extraction process of the specific feature or a complete output table including the fields.
  • the parsing processing may be performed on the corresponding at least one data processing step before, simultaneously with, or after the machine learning process is run, so as to obtain information such as the input, output, intermediate results, and processing details of each step.
  • the at least one data processing step parsed by the interpretation device 101 is traced from the perspective of generating the specific feature; that is, a processing object or a processing result targeted by the at least one data processing step can be used directly or indirectly to generate the specific feature.
  • the at least one data processing step may involve a feature extraction process for generating the specific feature, and here, the feature extraction process may indicate an extraction process for generating only the specific feature (not involving extraction processing of other features).
  • the at least one data processing step may involve a splicing process for splicing a data table (the data table may be a direct source data table or an indirect source data table of a field on which the specific feature depends), and the data information related to the splicing process may relate to all fields in the data table, or only to fields related to the generation of the specific feature.
  • the data processing steps related to the feature of interest can be selected from the complex data processing steps of the entire machine learning process to help users understand the meaning of the features more clearly.
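  • Selecting only the steps that contribute to one feature can be illustrated as an upstream traversal of the DAG. The sketch below assumes the machine learning process is available as a dependency mapping. The function `related_steps` and all node names are hypothetical, and the disclosure does not prescribe this particular traversal.

```python
def related_steps(depends_on, target):
    """Return every node that directly or indirectly contributes to target.

    depends_on maps a node to the list of nodes it is computed from.
    """
    related, stack = set(), [target]
    while stack:
        node = stack.pop()
        if node in related:
            continue
        related.add(node)
        stack.extend(depends_on.get(node, []))
    return related


# Hypothetical process: two features with disjoint upstream steps.
depends_on = {
    "feature_x": ["extract_x"],
    "extract_x": ["joined_table"],
    "joined_table": ["join_step"],
    "join_step": ["table_a", "table_b"],
    "feature_y": ["extract_y"],
    "extract_y": ["table_c"],
}
steps = related_steps(depends_on, "feature_x")
```

  • Steps that feed only `feature_y` are left out of the result, which mirrors how the process presentation view omits steps unrelated to the feature of interest.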
  • the view generating device 102 may generate a process presentation view for describing a generating process of the specific feature based on the generating process information.
  • the view generating device 102 may form, based on the parsed information of each data processing step, a process display view that reflects the relationship between the data processing steps on the one hand, and the data information and / or processing information of each data processing step on the other hand.
  • the display device 103 may graphically display the process display view at step S24.
  • the display device 103 may display the process display view to the user through an output device such as a display (not shown).
  • the display device 103 may display the process display view in a specific form or effect to help the user understand the process of generating the specific feature through the process display view.
  • the data information of the at least one data processing step may include information about input items and / or output items of the at least one data processing step
  • the processing information of the at least one data processing step may include information about the processing process of the at least one data processing step.
  • the input items or output items of the at least one data processing step may only involve fields related to the extraction operation of a specific feature, or may include a complete output table including the above-mentioned fields.
  • the processing information of the at least one data processing step may relate to a processing process of each data processing step, and the processing process may include at least one sub-step.
  • the information of each sub-step may be obtained through analysis processing.
  • the process display view may be a flowchart representing the generation process of the specific feature, wherein the nodes in the flowchart may represent input items, output items, and / or processes of corresponding data processing steps, respectively.
  • the process of graphically displaying the process display view may include: the display device 103 may display information about input items, output items, and / or processing procedures of corresponding data processing steps in a display control of each node.
  • each node may have a corresponding display control, and the display control may be a display frame having various shapes. Information about input items, output items, and / or processing processes may be further displayed in or around the display frame.
  • the above information can be directly displayed in or around the display box; in addition, the above information can also be displayed in a hidden manner, so that the related content is shown only after the user performs a corresponding trigger operation (for example, clicking on the display control).
  • which information is listed in the display control of each node may be set in advance by the machine learning platform, or may be set or adjusted according to the user's selection.
  • the at least one data processing step may include a feature extraction step for generating the specific feature.
  • the data information of the feature extraction step may include information about input items and / or output items of the feature extraction step
  • the processing information of the feature extraction step may include information about the processing process of the feature extraction step.
  • the feature extraction step refers to a process of obtaining corresponding features for a corresponding data table by processing one or more source fields according to a specific extraction method.
  • the extraction method here includes, but is not limited to: arithmetic means such as rounding or taking the logarithm of numeric fields; conversion means such as directly using the complete field as a feature or truncating part of a field (for example, the year portion of a complete date field); means such as discretization of continuous-valued features; feature operation means that combine different features; and so on.
  • the data information may include information about source fields, information about output characteristics or intermediate results, and / or information about data tables including source fields, and the like.
  • the processing information may include information about each feature extraction means or its further refinement operation.
  • the flowchart in the process display view may include: a node representing a source field as an input item of the feature extraction step, a node representing an extraction process as a processing process of the feature extraction step, and / or a node representing the specific feature as an output item of the feature extraction step.
  • the process of graphically displaying the process display view may further include: the display device 103 may display the name of the source field in the display control of the node representing the source field, and in the display control of the node representing the extraction process. The name and / or process information of the extraction process is displayed, and / or the name of the specific feature is displayed in a display control of a node representing the specific feature.
  • separate nodes may be set in the process display view to represent the corresponding input items, output items, and processing procedures, respectively. That is, in order to more clearly trace the key information involved in the generation of specific features, separate display controls can be set for the key information corresponding to a single data processing step. In the display control, the name and / or process information of the key information may be further listed.
  • the flow information of the extraction processing process may include the names of one or more processing methods applied in the extraction processing process, and the nodes representing the extraction processing process include child nodes that may respectively represent the one or more processing methods.
  • the process of graphically displaying the process display view may further include: the display device 103 may separately display the names of the one or more processing methods in the display control of the child node.
  • the extraction process may involve one or more processing methods, for example, the operation of rounding a numeric field first and then taking a logarithm.
  • processing methods can generally correspond to a sub-flow chart, wherein each processing method can correspond to a child node, the connection relationship between the child nodes reflects the dependency relationship between the various processing methods, and the display control of the child node can be List the names of the corresponding processing methods.
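  • The rounding-then-logarithm example can be sketched as a small sub-flow whose child nodes are the processing methods and whose edges are their dependencies. The functions `build_subflow` and `run_subflow` are hypothetical names, and the sketch assumes the simplest case of a linear chain of methods.

```python
import math


def build_subflow(methods):
    """Return the child nodes and dependency edges for an ordered method list."""
    nodes = [name for name, _ in methods]
    # Each method depends on the output of the method before it.
    edges = list(zip(nodes, nodes[1:]))
    return nodes, edges


def run_subflow(methods, value):
    """Apply the processing methods to a value in execution order."""
    for _, func in methods:
        value = func(value)
    return value


# Round a numeric field first, then take its natural logarithm.
methods = [("round", round), ("log", math.log)]
nodes, edges = build_subflow(methods)
result = run_subflow(methods, 2.718)
```

  • The node names in `nodes` are what the display controls of the child nodes would list, while `edges` gives the connections drawn between them.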
  • the flowchart may further include: a node representing a source data table of the source field.
  • the process of graphically displaying the process display view may further include: the display device 103 may display the name of the source data table in a display control representing a node of the source data table.
  • the flowchart may further introduce a node for the data table in which the source field of the feature is located. That is, in the exemplary embodiment of the present disclosure, the display of input items can be accomplished by using multiple nodes having an inclusive relationship or a progressive relationship.
  • a data table (for example, the data table in which the source field is located) can be further displayed as an indirect source of the feature.
  • the name and / or other relevant information of the source data table may be listed in the display control of the source data table.
  • the at least one data processing step may further include an upstream processing step of the feature extraction step, wherein the upstream processing step may be used to generate a source data table of the source field.
  • the flowchart can further include steps other than the feature extraction step. These steps mainly import or splice data to obtain the data table in which the source field of the feature is located.
  • the upstream processing step may include one or more data table splicing steps.
  • the data information of the one or more data table splicing steps may include information about input items and / or output items of the one or more data table splicing steps, and the processing information of the one or more data table splicing steps may include information about a processing procedure of the one or more data table splicing steps.
  • the source data table in which the source field of the feature is located may be the final output of one or more data table splicing operations. In this case, the at least one data processing step displayed in the process presentation view may further include a data table splicing step corresponding to each splicing operation.
  • the parsing process for the data table splicing step can obtain the name of the spliced data table, the fields actually spliced in the data table, the name of the data table generated after splicing, the fields included in the data table generated after splicing, etc.
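  • As an illustration of the information such parsing might yield, the sketch below performs a simple left join on in-memory tables and records the names and fields involved. The function `splice_tables`, the table layout, and every table and field name are assumptions made for the example, not the disclosed implementation.

```python
def splice_tables(left, right, key, out_name):
    """Left-join two tables and return (output_table, parsed_info).

    Each table is a dict: {"name": str, "fields": [str], "rows": [dict]}.
    """
    # Fields actually brought over from the right-hand table.
    added = [f for f in right["fields"] if f != key and f not in left["fields"]]
    lookup = {row[key]: row for row in right["rows"]}

    rows = []
    for row in left["rows"]:
        match = lookup.get(row[key], {})
        merged = dict(row)
        for f in added:
            merged[f] = match.get(f)  # None when no matching row exists
        rows.append(merged)

    output = {"name": out_name, "fields": left["fields"] + added, "rows": rows}
    info = {
        "input_tables": [left["name"], right["name"]],
        "spliced_fields": {right["name"]: added},
        "output_table": out_name,
        "output_fields": output["fields"],
    }
    return output, info


transactions = {
    "name": "transactions",
    "fields": ["trx_id", "trx_date"],
    "rows": [{"trx_id": 1, "trx_date": "2019-03-15"},
             {"trx_id": 2, "trx_date": "2019-03-16"}],
}
fraud_labels = {
    "name": "fraud_labels",
    "fields": ["trx_id", "flag"],
    "rows": [{"trx_id": 2, "flag": 1}],
}
output, info = splice_tables(transactions, fraud_labels, "trx_id", "trx_with_flag")
```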
  • the flowchart may further include: a node representing an input data table that is an input item of the one or more data table splicing steps, and / or a node representing a splicing process that is a processing procedure of the one or more data table splicing steps.
  • the process of graphically displaying the process display view may further include: the display device 103 may display the name of the input data table in the display control of the node representing the input data table, and / or the display device 103 may display the name of the splicing process in the display control of the node representing the splicing process.
  • the data information and / or processing information of the data table splicing step may be displayed in various ways similar to the feature extraction step.
  • the display control of the node corresponding to the specific feature, the display control of the node corresponding to the feature extraction step, the display control of the node corresponding to the source field, the display control of the node corresponding to the splicing process, the display control of the node corresponding to the source data table, and / or the display control of the node corresponding to the input data table each have their own form. For example, at least one of the shape, border line style, border color, background color, background pattern, font format, font style (for example, bold, italic, and / or underlined), and font color of the display control can differ depending on the content to which the node corresponds.
  • the process of graphically displaying the process display view may further include: the display device 103 may, in response to a user's selection operation on a specific display control in the process display view, list detailed information about the input items, output items, and / or processing procedures displayed in the specific display control in a detail display control corresponding to the specific display control.
  • further details about the input items, output items, and / or processing procedures of each step listed in a flowchart node can be displayed in a dedicated detail display control.
  • the detail display control can be set around the corresponding display control, or can be arranged at any position in the entire interface.
  • the detailed display control may be extended from the original display control. For example, when the user selects a specific display control, the specific display control is further expanded to accommodate detailed information to be displayed.
  • the detailed information about the input item and / or output item may include a name corresponding to the input item and / or output item, a description added by the user, the number of rows of the data table, the number of columns of the data table, and fields of the data table At least one of a name, a field type of the data table, at least a part of the data in the data table, and statistical analysis information of the data in the data table.
  • the detailed information about the processing process may include at least one of a name corresponding to the processing process, a description added by the user, code information, and a transformation process of the sample data.
  • the detailed information about the data content may include not only attribute information or statistical information about the data, but also at least a part of the sample data itself.
  • the detailed information about the processing process may involve code content such as configuration or script related to the data processing process, or may further include a processing process demonstration of at least a part of the sample data.
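  • One possible way to hold the detailed information listed in a detail display control is sketched below. The classes `TableDetail` and `ProcessDetail`, the helper `detail_lines`, and the example values are hypothetical, and only a subset of the items named above is modelled.

```python
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class TableDetail:
    """Details about an input or output item (here, a data table)."""

    name: str
    description: Optional[str] = None
    row_count: Optional[int] = None
    column_count: Optional[int] = None
    field_names: Optional[List[str]] = None


@dataclass
class ProcessDetail:
    """Details about a processing process."""

    name: str
    description: Optional[str] = None
    code: Optional[str] = None


def detail_lines(detail):
    """List only the items that are actually available for this detail."""
    return [f"{key}: {value}"
            for key, value in asdict(detail).items()
            if value is not None]


table = TableDetail(name="example_table", description="example description",
                    row_count=100, column_count=5)
lines = detail_lines(table)
```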
  • FIG. 7 illustrates an example of a process presentation view for describing a generation process of a specific feature according to an exemplary embodiment of the present disclosure.
  • the method for visualizing the generation process of specific features according to the present disclosure is used to generate the process display view shown in FIG. 7.
  • the flowchart on the left in FIG. 7 is a flowchart in which the respective display controls are connected according to the dependency relationship between the corresponding generation process elements, wherein the arrows between the display controls are used to indicate the dependency relationship between the display controls.
  • the generation process element includes various elements involved in the generation process of the specific feature, for example, the specific feature, processing process, processing method in processing process, source field, source data table, and input data table.
  • the feature name f_trxdate_registerdate_diff of the specific feature that the user is interested in is listed in the display control 401.
  • the generation process information of the specific features can be obtained.
  • according to the generation process information, it can be determined that the specific feature is generated through a feature extraction step.
  • data information and / or processing information of the feature extraction step can be obtained.
  • the data information of the feature extraction step may include information about input items and / or output items of the feature extraction step.
  • the processing information of the feature extraction step may include information on how to generate a feature f_trxdate_registerdate_diff based on a source field.
  • the data information of the feature extraction step may include the names trx_date and register_date of the source fields that are the input items of the feature extraction step, and the feature name f_trxdate_registerdate_diff of the specific feature that is the output item of the feature extraction step.
  • the processing information of the feature extraction step may include information about an extraction process of the feature extraction step, that is, it may include a name and / or process information of the feature extraction step.
  • datediff, lineartrans, ("0.01", "0"), and discrete are the names of the processing methods applied during the extraction process, and the execution order of the processing methods is datediff → lineartrans, ("0.01", "0") → discrete.
  • This information may be included in the process information of the feature extraction step.
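  • To make the execution order concrete, the following sketch chains three functions named after the processing methods. The disclosure gives only the method names and the parameters ("0.01", "0"), so the semantics used here are assumptions: datediff as a difference in days, lineartrans as scale times value plus offset, and discrete as rounding down to an integer bucket. The argument order and the sample dates are also assumed.

```python
import math
from datetime import date


def datediff(later, earlier):
    # Assumed semantics: number of days between two dates.
    return (later - earlier).days


def lineartrans(value, scale, offset):
    # Assumed semantics: linear transformation scale * value + offset,
    # with the parameters given as strings as in ("0.01", "0").
    return float(scale) * value + float(offset)


def discrete(value):
    # Assumed semantics: discretize by rounding down to an integer bucket.
    return math.floor(value)


def extract_feature(trx_date, register_date):
    """Apply datediff, then lineartrans("0.01", "0"), then discrete."""
    days = datediff(trx_date, register_date)
    scaled = lineartrans(days, "0.01", "0")
    return discrete(scaled)


# Hypothetical sample record: 438 days between registration and transaction.
f_trxdate_registerdate_diff = extract_feature(date(2019, 3, 15), date(2018, 1, 1))
```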
  • the name (FE) and flow information of the extraction process are displayed in the display control 402.
  • the process information may be displayed through a sub-flow chart composed of display controls of the child nodes.
  • the names of the corresponding processing methods are displayed in the display controls 402a, 402b, and 402c, respectively.
  • the names of the corresponding source fields can also be displayed in the display control 402 or the display controls 403 and 404 upstream of the display control 402a, respectively.
  • the source data table and / or generation process of the source field may be further displayed.
  • the source data table is generated through a data table splicing step.
  • the input data table and the splicing process of the data table splicing step may be further displayed.
  • the source data table of the source field may be an output table of the data table splicing step (not shown in the example of FIG. 7).
  • the name of the splicing process of the data table splicing step is displayed as sql: 01_join_fraud.
  • the names of the input data tables cmb0404_app_trx_detail and cmb0404_fraud of the data table splicing step are displayed, respectively.
  • the above display controls may have different forms according to different types of corresponding generation process elements.
  • the display controls 406 and 407 correspond to the input data table and can be displayed as oval controls;
  • the display control 405 corresponds to the stitching process and can be displayed as a rectangular control;
  • the display controls 403 and 404 correspond to the source fields and can be displayed as parallelogram controls;
  • the display control 402 corresponds to the extraction process and includes display controls 402a, 402b, and 402c corresponding to the processing method. Therefore, the display control 402 can be displayed as a rectangular control embedded with multiple oval controls.
  • the multiple elliptical controls are display controls 402a, 402b, and 402c, respectively;
  • the display control 401 corresponds to a specific feature and can be displayed as a rounded rectangular control.
  • the difference in the form is not limited to the shape of the display control, and it may include the shape of the display control, the border line style, the border color, the background color, the background pattern, the font format in the display control, the font style (for example, bold, italic, and / or underlined), the font color, and so on.
  • the process presentation view according to the present disclosure may include only the left-side flowchart in FIG. 7. Additionally, as an optional manner, in response to a user's selection operation on a specific display control in the flowchart, detailed information about the input items, output items, and / or processing procedures displayed in the specific display control is listed in the detail display control corresponding to the specific display control.
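  • The correspondence between generation process elements and control shapes in the FIG. 7 example can be written down as a simple lookup. The dictionary names, the key spellings, and the function `shape_of` are illustrative assumptions. The shapes themselves follow the description above, with the embedded controls 402a, 402b, and 402c modelled as separate oval entries.

```python
# Shape used for each type of generation process element.
SHAPES = {
    "input_data_table": "ellipse",
    "splicing_process": "rectangle",
    "source_field": "parallelogram",
    "extraction_process": "rectangle",
    "processing_method": "ellipse",
    "specific_feature": "rounded rectangle",
}

# Element type of each display control in the FIG. 7 example.
CONTROLS = {
    "401": "specific_feature",
    "402": "extraction_process",
    "402a": "processing_method",
    "402b": "processing_method",
    "402c": "processing_method",
    "403": "source_field",
    "404": "source_field",
    "405": "splicing_process",
    "406": "input_data_table",
    "407": "input_data_table",
}


def shape_of(control_id):
    """Look up the shape of a display control from its element type."""
    return SHAPES[CONTROLS[control_id]]
```

  • Other aspects of the form, such as border style or background color, could be added to the same lookup keyed by element type.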
  • the details display control 506 lists the name of the input data table (cmb0404_app_trx_detail) corresponding to the display control 406, the description (transaction table) added by the user, the number of rows and columns of the input data table (80,000 rows and 18 columns).
  • a corresponding detailed display control 505 can be generated and displayed.
  • the detailed display control 505 lists the name (sql: 01_join_fraud5) of the splicing process corresponding to the display control 405, the description added by the user (splicing the transaction table with the confirmed risk transactions and generating the label field flag), the code information (lines 1-6 of the code), and the number of rows and columns of the output data table (80,000 rows and 18 columns).
  • the detailed display control 503 lists the data statistical analysis information of the source field corresponding to the display control 403.
  • the data statistical analysis information may include information such as summary, statistics, and high-frequency values.
  • the detailed display control 502a lists the transformation process of the sample data due to the processing method (named datediff) corresponding to the display control 402a.
  • the detailed display control 502a lists the transformation from the input sample data (the data of the trx_date and register_date fields, respectively) to the output sample data (corresponding to the processing result of the DateDiff processing method), wherein the field type of the output sample data is integer (Int).
  • the process of processing the data by a processing method (named datediff) is schematically illustrated.
  • a part of the example data records may be displayed through a transformation process of part or all of the feature extraction steps.
  • an entry for quickly entering the data preview and / or an entry for the program configuration of the processing process can be set in each detailed display control.
  • the process display view according to the present disclosure is not limited to the example shown in FIG. 7.
  • more or less generated process information may be displayed for specific features according to user needs or settings.
  • relevant information about the process of directly generating specific features can be shown, information about the entire generation process from the introduction of raw data to the generation of specific features can be shown, or the relevant information of part of the entire generation process can be shown in detail while the relevant information of the remaining generation process is simplified or omitted.
  • each device included in the system 100 for visualizing data processing steps of a machine learning process may also be implemented by hardware, software, firmware, middleware, microcode, or any combination thereof.
  • the program code or code segments for performing the corresponding operations may be stored in a computer-readable medium such as a storage medium, so that the processor can read and run the corresponding program code or code segments to perform the corresponding operations.
  • the exemplary embodiments of the present disclosure may be implemented as a computer-readable storage medium for visualizing data processing steps of a machine learning process, wherein a computer program to be executed by one or more processors is recorded on the computer-readable storage medium.
  • the processor may be implemented as a computing device.
  • the present disclosure provides a computer-readable storage medium storing instructions, wherein when the instructions are executed by at least one computing device, the at least one computing device is caused to execute the relevant steps for visualizing the data processing steps of a machine learning process.
  • the exemplary embodiment of the present disclosure may also be implemented as a computing device that visualizes data processing steps of a machine learning process, the computing device including one or more storage devices and one or more processors, wherein a computer-executable instruction set is stored in the one or more storage devices, and when the one or more processors execute the computer-executable instruction set, the method of visualizing data processing steps of a machine learning process is performed.
  • the processor may be implemented as a computing device, and accordingly, the solution of the present disclosure may be implemented as a system including at least one computing device and at least one storage device storing instructions, wherein the instructions are executed by the at least one When the computing device is running, the at least one computing device is caused to execute related steps for visualizing data processing steps of the machine learning process.
  • the computing device may be deployed in a server or a client, or may be deployed on a node device in a distributed network environment.
  • the computing device may be a PC computer, a tablet device, a personal digital assistant, a smart phone, a web application, or other device capable of executing the above-mentioned instruction set.
  • the computing device does not have to be a single computing device, but may also be an assembly of any device or circuit capable of individually or jointly executing the above instructions (or instruction set).
  • the computing device may also be part of an integrated control system or system manager, or a portable electronic device that may be configured to interface with a local or remote (e.g., via wireless transmission) interface.
  • the processor may include a central processing unit (CPU), a graphics processor (GPU), a programmable logic device, a special-purpose processor system, a microcontroller, or a microprocessor.
  • processors may also include analog processors, digital processors, microprocessors, multi-core processors, processor arrays, network processors, and the like.
  • Some operations described in the method for visualizing data processing steps of a machine learning process according to an exemplary embodiment of the present disclosure may be implemented in software, some may be implemented in hardware, and these operations may also be implemented by a combination of software and hardware.
  • the processor may execute instructions or code stored in one of the storage devices, wherein the storage device may also store data. Instructions and data can also be sent and received over a network via a network interface device, which can employ any known transmission protocol.
  • the storage device may be integrated with the processor, for example, the RAM or the flash memory is arranged in an integrated circuit microprocessor or the like.
  • the storage device may include a stand-alone device, such as an external disk drive, a storage array, or other storage device usable by any database system.
  • the storage device and the processor may be operatively coupled, or may communicate with each other, for example, through an I/O port, a network connection, or the like, so that the processor can read a file stored in the storage device.
  • the computing device may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the computing device may be connected to each other via a bus and/or a network.
  • Operations involved in a method of visualizing data processing steps of a machine learning process may be described as various interconnected or coupled function blocks or function diagrams. However, these functional blocks or functional diagrams may be equally integrated into a single logical device or operated on imprecise boundaries.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • User Interface Of Digital Computer (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

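The present disclosure provides a method and system for visualizing data processing steps of a machine learning process. The method includes: parsing data processing steps of a predefined machine learning process to obtain profile information of the data processing steps, where the profile information includes data information and/or processing information of the data processing steps; generating, based on the obtained profile information, an understanding view that depicts the data processing steps of the machine learning process; and displaying the understanding view graphically.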

Description

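Method and System for Visualizing Data Processing Steps of a Machine Learning Process
Technical Field
The present disclosure relates to the field of machine learning and, more particularly, to a method and system for visualizing data processing steps of a machine learning process.
Background
With the arrival of the big-data era, many industries generate massive amounts of data, and the variety, scale, and dimensionality of that data keep growing. Machine learning is increasingly used to extract knowledge and value from such data.
Data is the raw material of the machine learning process and strongly affects the performance of machine learning models. Before data can be used for machine learning, it usually has to be processed, for example by data cleaning, data filling, data table joining, or feature extraction.
In practice, data processing may be implemented by running code written by programmers, or by a machine learning platform according to scripts, configurations, and/or interactive operations supplied by the user. The overall process often involves a huge volume of data or complex operations. Existing machine learning platforms interact poorly with users: an ordinary user cannot intuitively follow the logic and working details of the data processing, so that, for example, when a problem arises in the machine learning process, the user cannot quickly determine which data processing step was abnormal or failed. This hinders the application and adoption of machine learning technology.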
发明内容
根据本公开的示例性实施例,提供一种对机器学习过程的数据处理步骤进行可视化的方法,所述方法包括:对预先定义的机器学习过程的数据处理步骤进行解析,以获取所述数据处理步骤的概况信息,其中,所述概况信息包括数据处理步骤的数据信息和/或处理信息;基于获取的概况信息来生成用于描绘所述机器学习过程的数据处理步骤的理解视图;以及以图形化方式展示所述理解视图。
根据本公开的另一示例性实施例,提供一种对机器学习过程的数据处理步骤进行可视化的计算机可读介质,其中,在所述计算机可读介质上记录有用于由一个或多个处理器执行前述的对机器学习过程的数据处理步骤进行可视化的方法的计算机程序。
根据本公开的另一示例性实施例,提供一种对机器学习过程的数据处理步骤进行可视化的计算装置,所述计算装置可以包括一个或多个存储装置和一个或多个处理器,其中,在所述一个或多个存储装置中存储有计算机可执行指令集合,当所述一个或多个处理器执行所述计算机可执行指令集合时,执行前述的对机器学习过程的数据处理步骤进行可视化的方法。
根据本公开的另一示例性实施例,提供一种对机器学习过程的数据处理步骤进行可视化的系统,所述系统包括:解释装置,用于对预先定义的机器学习过程的数据处理步骤进行解析,以获取所述数据处理步骤的概况信息,其中,所述概况信息包括数据处理步骤的数据信息和/或处理信息;视图生成装置,用于基于获取的概况信息来生成用于描绘所述机器学习过程的数据处理步骤的理解视图;以及展示装置,用于以图形化方式展示所述理解视图。
有益效果
通过应用根据本公开的示例性实施例的对机器学习过程的数据处理步骤进行可视化的方法和系统,可以方便用户可视化地使用机器学习平台,直观地了解机器学习过程的数据处理步骤的具体情况,增强机器学习平台与用户之间的交互,从而便于用户控制机器学习过程,帮助用户迅速发现机器学习过程中出现的问题。
将在接下来的描述中部分阐述本公开总体构思另外的方面和/或优点,还有一部分通过描述将是清楚的,或者可以经过本公开总体构思的实施而得知。
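Brief Description of the Drawings
The above and other objects and features of the exemplary embodiments of the present disclosure will become clearer from the following description, taken in conjunction with the accompanying drawings that illustrate the embodiments by way of example, in which:
FIG. 1 is an example of configuring a machine learning process by building a directed acyclic graph (DAG) in a prior-art machine learning platform.
FIG. 2 shows a system for visualizing data processing steps of a machine learning process according to an exemplary embodiment of the present disclosure.
FIG. 3 shows a flowchart of a method for visualizing data processing steps of a machine learning process according to an exemplary embodiment of the present disclosure.
FIG. 4 shows an understanding view depicting data processing steps of a machine learning process according to an exemplary embodiment of the present disclosure.
FIG. 5 shows another understanding view depicting data processing steps of a machine learning process according to an exemplary embodiment of the present disclosure.
FIG. 6 shows a flowchart of a method for visualizing the generation process of a specific feature according to an exemplary embodiment of the present disclosure.
FIG. 7 shows a process display view depicting the generation process of a specific feature according to an exemplary embodiment of the present disclosure.
Hereinafter, the present disclosure is described in detail with reference to the drawings; throughout the drawings, the same or similar elements are denoted by the same or similar reference numerals.
Detailed Description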
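The following description with reference to the drawings is provided to assist in a comprehensive understanding of the exemplary embodiments of the present disclosure as defined by the claims and their equivalents. The description includes various specific details to aid understanding, but these are to be regarded as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications may be made to the embodiments described herein without departing from the scope and spirit of the present disclosure. In addition, descriptions of well-known functions and constructions may be omitted for clarity and conciseness.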
在此需要说明的是,在本公开中出现的“并且/或者”、“和/或”均表示包含三种并列的情况。例如“包括A和/或B”表示包括A和B中的至少一个,即包括如下三种并列的情况:(1)包括A;(2)包括B;(3)包括A和B。类似地,“包括A、B和/或C”表示包括A、B和C中的至少一个。又例如“执行步骤一并且/或者步骤二”表示执行步骤一和步骤二中的至少一个,即表示如下三种并列的情况:(1)执行步骤一;(2)执行步骤二;(3)执行步骤一和步骤二。
随着海量数据的出现,人工智能技术迅速发展。机器学习(包括深度学习)等是人工智能发展到一定阶段的必然产物,其致力于通过计算的手段,从海量数据中挖掘有价值的潜在信息,利用经验来改善系统自身的性能。在计算机系统中,“经验”通常以“数据”形式存在,通过机器学习算法,可从数据中产生“模型”,也就是说,将经验数据提供给机器学习算法,从而能基于这些经验数据产生模型,在面对新的情况时,模型会提供相应的判断,即,预测结果。机器学习可被实现为“有监督学习”、“无监督学习”或“半监督学习”的形式,应注意,本公开的示例性实施例对具体的机器学习算法并不进行特定限制。在本公开的实施例中,数据处理过程是从引入原始数据到输出样本的过程中的至少一部分,整个过程也可称为特征工程。所述数据处理过程可以包括一个或多个数据处理步骤,根据本公开的示例性实施例,所述数据处理步骤的细节可通过解析而获得。
本公开提出了用于对机器学习过程进行可视化的方法和系统,所述方法和系统可以对机器学习过程的数据处理步骤进行可视化,以便于用户迅速地直观理解数据处理过程。
图1是在现有技术的机器学习平台中通过构建有向无环图(DAG)来配置机器学习过程的示例。
在图1所示的示例中,每个模块可代表机器学习过程中的相关步骤,可以看出,粗线框选出的数据处理过程占据了较大部分的工作。这是因为,在机器学习过程中,样本特征所基于的数据字段可能来自于多个数据表拼接之后产生的宽表,比如,在银行利用机器学习模型来判断欺诈交易的情况下,会将用户信息表、银行卡信息表和交易记录表拼接为一张宽表去处理。此外,作为示例,数据表中的原始数据记录会需要经过清洗、格式转换(例如,日期格式转换)、时序拼接等一系列操作。
尽管在图1所示的示例中,通过将各步骤展示为DAG中的节点来将相关处理模块化,然而该方式并不能帮助用户迅速了解数据处理过程的整体思路或具体做了哪些工作。如果用户希望了解详情,则不得不主动为每一个模块添加说明或一一查看各个模块的具体内容,这样会加重用户的使用负担。而且,在很多情况下,这些模块的具体内容又完全是相应处理的原始代码,用户不得不具备一定的知识能力才能够通过代码来理解数据处理过程。
图2示出根据本公开示例性实施例的用于对机器学习过程的数据处理步骤进行可视化的系统100。所述系统100包括解释装置101、视图生成装置102和展示装置103。
解释装置101可以对预先定义的机器学习过程的数据处理步骤进行解析,以获取所述数据处理步骤的概况信息,其中,所述概况信息包括数据处理步骤的数据信息和/或处理信息。这里,可根据实际情况,在所述机器学习过程运行之前、运行同时或运行之后对相应的数据处理步骤执行解析处理,使得能够获取关于数据处理步骤的诸如输入、输出、中间结果、处理细节等的信息。
视图生成装置102可以基于获取的概况信息来生成用于描绘所述机器学习过程的数据处理步骤的理解视图。这里,视图生成装置102可基于解析出的各个数据处理步骤自身的信息,形成能够一方面反映出数据处理步骤之间的依赖关系,另一方面反映出每个数据处理步骤自身的数据信息和/或处理信息的理解视图。
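The display device 103 may display the understanding view graphically. Here, the display device 103 may present the understanding view to the user through an output device such as a monitor (not shown). As an example, the display device 103 may present the understanding view in a particular form or with particular effects, so that the user can understand the related data processing through the displayed view.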
在下文中,结合图3至图7详细说明所述系统100对机器学习过程的数据处理步骤进行可视化的处理。
图3示出根据本公开示例性实施例的对机器学习过程的数据处理步骤进行可视化的方法的流程图。在本公开的实施例中,机器学习过程由机器学习平台的用户来设置。例如,所述机器学习过程可表现为用户通过拖拽节点模块的方式生成的有向无环图(DAG),其中,用户可配置每个节点模块所对应的数据和/或操作。又例如,所述机器学习过程可表现为用户手动编写的计算机程序代码。在所述机器学习过程中,通常需要利用一个或多个数据处理步骤对原始数据进行操作,这些操作都会带来字段的变化,从字段的变化可以快速地感知数据处理步骤执行的操作。
在根据本公开的实施例中,预先定义的机器学习过程可以包括一个或多个数据处理步骤,作为示例,所述一个或多个数据处理步骤可以包括数据引入步骤、数据清洗步骤、数据拼接步骤、时序聚合步骤或特征抽取步骤等等。这些数据处理步骤的处理结果可被视为输出表(例如,数据表或样本表)。在所述机器学习过程运行前、正在运行时或者在所述机器学习过程已经结束之后,可以根据用户需求或预定设置而选择性地对数据处理步骤进行可视化。如图3所示,对所述数据处理步骤进行可视化的方法可以包括步骤S11、S12和S13。
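In step S11, the data processing steps of the predefined machine learning process are parsed to obtain profile information of the data processing steps, where the profile information may include data information and/or processing information of the data processing steps. In embodiments of the present disclosure, the data processing steps may include various kinds of steps, such as a data import step, a data cleaning step, a data table join step, a time-series aggregation step, and/or a feature extraction step. The data processing steps may be one or more steps of the machine learning process that the user wants to understand, and may be all or only some of the data processing steps of the machine learning process. Optionally, the profile information may include at least one of: the name of the data processing step, the name of the step's output table, the number of rows of the output table, the number of columns of the output table, the field names of the output table, the processing procedure, and a step description added by the user.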
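The profile information listed above can be pictured as a small record per step. The following is only a sketch under stated assumptions: the `StepProfile` class, the `parse_step` function, and the dictionary layout of a step definition are hypothetical and are not described in the disclosure; only the kinds of items collected and the sample values for `sql:01_join_fraud` come from the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StepProfile:
    """Profile information of one data processing step (hypothetical layout)."""
    step_name: str
    output_table: str
    row_count: Optional[int] = None
    column_count: Optional[int] = None
    field_names: List[str] = field(default_factory=list)
    process: Optional[str] = None        # processing information, e.g. code
    description: Optional[str] = None    # step description added by the user


def parse_step(step: dict) -> StepProfile:
    """Build a profile from a step definition given as a plain dict (assumed format)."""
    fields = list(step.get("fields", []))
    return StepProfile(
        step_name=step["name"],
        output_table=step.get("output_table", step["name"]),
        row_count=step.get("rows"),
        # Prefer the real field list when it is known, else the declared count.
        column_count=len(fields) if fields else step.get("columns"),
        field_names=fields,
        process=step.get("process"),
        description=step.get("description"),
    )


# Values taken from the description of display control 203 in FIG. 4.
profile = parse_step({
    "name": "sql:01_join_fraud",
    "rows": 80000,
    "columns": 17,
    "description": "Join the transaction table with confirmed risky "
                   "transactions; generate the label field flag",
})
```

Every item is optional except the name, which matches the "at least one of" wording: a parser may fill in only what it can determine before, during, or after the run.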
在解释装置101执行步骤S11之后,在步骤S12,视图生成装置102基于获取的概况信息来生成用于描绘所述数据处理步骤的理解视图。这里,视图生成装置102可产生能够体现各个数据处理步骤自身的输入数据或输出数据、和/或各个数据处理步骤所涉及的至少一部分处理方法等的理解视图。在该理解视图中,还可进一步反映出各个数据处理步骤在执行顺序上的依赖关系。
然后,在步骤S13,展示装置103以图形化方式展示所述理解视图。这里,展示装置103可按照任何图形化的形式(例如,流程图、结构图、表格、项目、图形等)来展示所述理解视图,使得用户能够在所述理解视图上容易地查看到各个数据处理步骤的概况。
可选地,所述理解视图可以是表示所述数据处理步骤的流程图,其中,所述流程图中的节点分别对应于每个数据处理步骤,并且各个数据处理步骤的节点按照依赖关系而相互连接,这里,每个节点可具有对应的显示控件,该显示控件可以是具有各种形状的显示框,在显示框内或显示框周围可进一步展示至少一部分概况信息。应注意,概况信息可以被直接显示在显示框内或显示框周围;此外,概况信息也可以采取隐藏方式来进行显示,使得相关内容在用户执行相应的触发操作(例如,点击显示控件)之后才显示出来。
可选地,利用展示装置103以图形化方式展示所述理解视图的处理可以包括:通过在每个节点的显示控件中列出对应的数据处理步骤的概况信息来展示所述机器学习过程的数据处理步骤。这里,作为示例,可由机器学习平台预先设置在每个节点的显示控件中会列出哪些概况信息,也可根据用户的选择来设置或调整将在各个显示控件中列出的概况信息。
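Optionally, listing the profile information of the corresponding data processing step in the display control of each node may include: using the display device 103 to list, by default, the primary display information from the profile information of the corresponding data processing step in the display control of each node; and, in response to a user operation on a display control, further listing in that display control the supplementary display information from the profile information of the corresponding data processing step.
Specifically, because a display control can show only a limited amount of content, or because users care about different profile items to different degrees or in a different order, the profile information may be listed in the display control in layers or levels. As an example, the machine learning platform may predetermine what is shown at each layer or level, or the user's settings may determine which profile items belong to the primary display information and/or which belong to the supplementary display information. The supplementary display information may be shown all at once, or may itself be shown in further layers or levels. In this example, the user operation on the display control may be any operation the user performs on the display control in order to learn more about the corresponding data processing step. In this embodiment, the user may click the display control of a node in the user interface of the machine learning platform to see the supplementary display information of the data processing step corresponding to that node.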
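The layered display described above can be sketched as a control that shows the primary items by default and adds the supplementary items after a click. This is a minimal illustration, not the platform's implementation: the `DisplayControl` class, the key names, and the placeholder field name `field_a` are hypothetical; the table name and shape are those given for display control 202.

```python
# Which profile items belong to which layer (an assumed default split).
PRIMARY_KEYS = ("step_name", "output_table", "shape", "description")
SUPPLEMENTARY_KEYS = ("field_names", "process")


class DisplayControl:
    """A node's display control with two layers of profile information."""

    def __init__(self, profile: dict):
        self.profile = profile
        self.expanded = False

    def on_click(self) -> None:
        """Toggle between the primary layer and the primary + supplementary layers."""
        self.expanded = not self.expanded

    def visible_lines(self) -> list:
        keys = PRIMARY_KEYS + (SUPPLEMENTARY_KEYS if self.expanded else ())
        return [f"{key}: {self.profile[key]}" for key in keys if key in self.profile]


control = DisplayControl({
    "output_table": "cmb0404_fraud",
    "shape": "822 rows x 1 column",
    "field_names": ["field_a"],  # hypothetical placeholder, not from the source
})
collapsed = control.visible_lines()
control.on_click()
expanded = control.visible_lines()
```

Keeping the split in two tuples makes it easy for either the platform or the user to move an item from one layer to the other.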
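Optionally, the primary display information includes at least one of the name of the data processing step, the name of the output table, the number of rows of the output table, the number of columns of the output table, and the step description added by the user; and/or the supplementary display information includes at least some of the field names of the output table and/or at least part of the processing procedure of the data processing step.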
这里,在显示控件列出对应的数据处理步骤的首要展示信息之后,响应于用户对显示控件的操作,显示控件的大小被自适应地调整以进一步列出对应的数据处理步骤的补充展示信息。
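Optionally, using the display device 103 to list the profile information of the corresponding data processing step in the display control of each node may further include: adaptively adjusting the size of the display control according to the profile information listed in it. In embodiments of the present disclosure, this may include adaptively adjusting the size of the display control according to the amount of that content. In other words, the size of each display control depends on how much profile information it has to show.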
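One simple way to make a control's size follow the amount of content is to derive it from the number and length of the text lines it lists. The sketch below assumes a fixed-width font; the `control_size` function and all pixel constants are hypothetical choices for illustration.

```python
def control_size(lines, char_width=8, line_height=18, padding=10, min_width=120):
    """Return (width, height) in pixels for a control listing the given text lines.

    Assumes a monospaced font: width follows the longest line,
    height follows the number of lines.
    """
    longest = max((len(line) for line in lines), default=0)
    width = max(min_width, longest * char_width + 2 * padding)
    height = len(lines) * line_height + 2 * padding
    return width, height


small = control_size(["fe"])
large = control_size(["sql:01_join_fraud", "80000 rows x 17 columns", "flag"])
```

A control that lists only a step name stays at the minimum width, while one that also lists a shape and field names grows in both directions.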
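Optionally, listing the profile information of the corresponding data processing step in the display control of each node may further include: using the display device 103 to list, with a prominent visual effect, the newly generated field names among the field names of the output table of the corresponding data processing step. As an example, the data processing step may be a step such as a data table join. In that case, at least one field of the table being joined (a field corresponds to a column of the data table) is joined onto the existing data table and becomes a newly generated field in the output table of the data processing step. As an example, the prominent visual effect may include, but is not limited to: an enlarged font, a font format different from that of the other field names, or display in a particular style (for example, bold, italic, shaded, and/or underlined) and/or a particular font color.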
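Newly generated fields can be found by comparing the output table's field names with those of the table that was being extended. The sketch below is illustrative only: the function names are hypothetical, bold markup stands in for whatever visual effect the platform uses, and it is an assumption that `trx_date` and `register_date` are already present before the join that adds `flag`.

```python
def mark_new_fields(input_fields, output_fields):
    """Pair each output field with a flag telling whether the step created it."""
    known = set(input_fields)
    return [(name, name not in known) for name in output_fields]


def render(marked):
    """Render new fields in bold (one possible 'prominent visual effect')."""
    return [f"**{name}**" if is_new else name for name, is_new in marked]


before_join = ["trx_date", "register_date"]          # assumed existing fields
after_join = ["trx_date", "register_date", "flag"]   # flag is added by the join
lines = render(mark_new_fields(before_join, after_join))
```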
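Optionally, the profile information listed in the display control of each node may include all field names of the output table of the corresponding data processing step, with field names that share the same initial source table arranged together. As an example, the data processing step may be a step such as a data table join. According to exemplary embodiments of the present disclosure, the initial data tables may pass through several joins at different stages before the output table is obtained. That is, an output table may be the result of starting from an original data table and successively joining some or all of the fields of other data tables onto it. Accordingly, field names in the output table that come from the same initial source table may be arranged together in the node's display control. Here, an initial source table is a data table as first imported into the machine learning system (for example, a machine learning platform), before any data processing step has been applied; field names with the same initial source table can be found by tracing the join flow of the data tables.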
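Tracing the join flow can be sketched by carrying a "field to initial source table" map through every step. The helper names below are hypothetical, and it is an assumption, made only for the example, that `trx_date` and `register_date` come from `cmb0404_app_trx_detail` and that `ip_city` and `ip_country` come from `cmb0404_ip_mapping`.

```python
def import_table(table, fields):
    """A data import step: every field originates from the imported table."""
    return {name: table for name in fields}


def join_tables(left_origin, right_origin, joined_fields):
    """A join step: keep the left table's fields and add the selected right fields,
    preserving each field's initial source table."""
    merged = dict(left_origin)
    for name in joined_fields:
        merged[name] = right_origin[name]
    return merged


def group_by_origin(origin):
    """Arrange field names so that fields from the same initial table sit together."""
    groups = {}
    for name, table in origin.items():
        groups.setdefault(table, []).append(name)
    return groups


trx = import_table("cmb0404_app_trx_detail", ["trx_date", "register_date"])
ip = import_table("cmb0404_ip_mapping", ["ip_city", "ip_country"])
joined = join_tables(trx, ip, ["ip_city", "ip_country"])
groups = group_by_origin(joined)
```

Because the map survives any number of joins, the same grouping (and the same visual effect per group) can be reused in every node's display control.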
可选地,在所有节点的显示控件中按照相同的视觉效果来列出具有相同初始来源表的字段名称。所述视觉效果可包括但不限于:字体大小、字体格式、字体样式(例如,加粗、斜体、加底纹和/或加下划线)和/或字体颜色等等。
可选地,所述数据处理步骤的处理过程在节点的显示控件中通过子流程图的形式被列出。根据本公开的示例性实施例,单个数据处理步骤可指示针对数据记录或样本记录的全部字段或其中的至少一个字段执行的一个或多个进一步处理步骤。为了便于描述,可将单个数据处理步骤中所包含的进一步处理步骤称为子步骤,相应地,通过对数据处理步骤执行的处理方法级别的解析处理,可得到由一个或多个子步骤所组成的子流程图。
可选地,数据处理步骤被划分为数据引入步骤和非数据引入步骤,其中,数据引入步骤可指示将数据(例如,数据文件、数据表等)最初引入机器学习系统(例如,机器学习平台)的步骤,也可指示将数据引入特定机器学习流程的步骤(例如,将已经存在于机器学习系统的数据应用于某个特定的机器学习流程),这里,上述两种步骤也可以是统一的单个步骤。另外,非数据引入步骤包括所述数据处理步骤中除了数据引入步骤以外的其它步骤,例如,数据清洗步骤、数据表拼接步骤、时序聚合步骤、特征抽取步骤等。
这里,对应于数据引入步骤的节点的显示控件和对应于非数据引入步骤的节点的显示控件分别具有各自的形态,例如,显示控件的形状、边框线型、边框颜色、背景颜色、背景图案、显示控件中的字体格式、字体样式(例如,加粗、斜体和/或加下划线)、字体颜色等中的至少一项可以根据不同类型的数据处理步骤而不同。下面将参照图4和图5进行具体说明。
图4示出根据本公开示例性实施例的用于描绘机器学习过程的数据处理步骤的理解视图的示例。为了简洁,图4示出的理解视图所描绘的数据处理步骤包括按照执行顺序的数据引入步骤、数据表拼接步骤和特征抽取步骤,但是本公开不限于此,而是可以针对与任意数据处理过程相应的各种数据处理步骤提供理解视图。
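The understanding view shown in FIG. 4 is generated and displayed using the method shown in FIG. 3. It is a flowchart in which display controls 201, 202, 203, 204, 205, and 206 are connected according to the execution order of the data processing steps of their corresponding nodes. In FIG. 4, the arrows between display controls indicate the execution order of the corresponding data processing steps.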
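The flowchart of FIG. 4 can be represented as a dependency map from each step to the steps whose output it consumes, using the step and table names given for display controls 201 to 206. The sketch below uses the standard-library `graphlib` module (Python 3.9 or later) to derive one valid execution order; the variable names are illustrative, and the disclosure does not say how the platform stores the graph.

```python
from graphlib import TopologicalSorter

# step -> steps it depends on (names as listed for display controls 201-206)
deps = {
    "cmb0404_app_trx_detail": set(),                                    # 201
    "cmb0404_fraud": set(),                                             # 202
    "sql:01_join_fraud": {"cmb0404_app_trx_detail", "cmb0404_fraud"},   # 203
    "cmb0404_ip_mapping": set(),                                        # 204
    "sql:02_join_ip_mapping": {"sql:01_join_fraud",
                               "cmb0404_ip_mapping"},                   # 205
    "fe": {"sql:02_join_ip_mapping"},                                   # 206
}

# One execution order in which every step follows all of its inputs.
order = list(TopologicalSorter(deps).static_order())

# Arrows to draw between display controls: (upstream step, downstream step).
edges = [(src, dst) for dst, srcs in deps.items() for src in sorted(srcs)]
```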
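The data processing step corresponding to display control 201 is a first data import step, which imports the data table named cmb0404_app_trx_detail into the machine learning process; accordingly, the result of the first data import step is the output table cmb0404_app_trx_detail. As shown in FIG. 4, display control 201 lists, from the profile information of the first data import step, the name of the output table (cmb0404_app_trx_detail) and its row and column counts (80000 rows, 16 columns). Display control 201 may also list other data information and/or processing information of the first data import step so that the user can understand further details of that step. For example, display control 201 may also list the user-added step description "Import transaction table" (not shown in FIG. 4).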
显示控件202对应的数据处理步骤为第二数据引入步骤,第二数据引入步骤用于将名称为cmb0404_fraud的数据表引入机器学习过程中,第二数据引入步骤的执行结果为输出表cmb0404_fraud。如图4所示,在显示控件202中列出了第二数据引入步骤的概况信息中的输出表的名称(cmb0404_fraud)和输出表的行列数(822行1列)。此外,在显示控件202中还可列出第二数据引入步骤的其它数据信息和/或处理信息,以便于用户理解第二数据引入步骤的其它细节。例如,在显示控件202中还可列出用户添加的步骤描述“引入确定的风险交易表”(图4中未示出)。
显示控件203对应的数据处理步骤为第一数据表拼接步骤,第一数据表拼接步骤用于将第一数据引入步骤的输出表与第二数据引入步骤的输出表进行拼接,以生成并输出名称为sql:01_join_fraud的数据表(该数据表的名称可以是通过解析处理而获取的操作名称,或者,可采用其它方式来提供默认名称,此外,用户也可主动修改或调整该名称)。如图4所示,在显示控件203中列出了第一数据表拼接步骤的概况信息中的输出表的名称(sql:01_join_fraud)和输出表的行列数(80000行17列)。此外,在显示控件203中还列出了第一数据表拼接步骤的概况信息中的用户添加的步骤描述(拼接交易表和确定的风险交易,生成label字段flag),以便于用户理解第一数据表拼接步骤的功能。此外,在显示控件203中还可列出第一数据表拼接步骤的其它数据信息和/或处理信息。
在这个示例中,利用根据本公开的对机器学习过程的数据处理步骤进行可视化的方法可以对第一数据表拼接步骤进行解析,以获取第一数据表拼接步骤的概况信息,所述概况信息可以包括第一数据表拼接步骤输出表的名称、输出表的行数、输出表的列数、输出表的字段名称、处理过程和用户添加的步骤描述之中的至少一项。在对第一数据表拼接步骤进行解析的过程中,所述方法可以进一步确定第一数据表拼接步骤的输出表的各个字段的数据来源,即,来源数据表。
显示控件204对应的数据处理步骤为第三数据引入步骤,第三数据引入步骤用于将名称为cmb0404_ip_mapping的数据表引入机器学习过程中,第三数据引入步骤的执行结果为输出表cmb0404_ip_mapping。如图4所示,在显示控件204中列出了第三数据引入步骤的概况信息中的输出表的名称(cmb0404_ip_mapping)和输出表的行列数(79999行3列)。此外,在显示控件204中还可列出第三数据引入步骤的其它数据信息和/或处理信息。
显示控件205对应的数据处理步骤为第二数据表拼接步骤,第二数据表拼接步骤用于将第一数据表拼接步骤的输出表与第三数据引入步骤的输出表进行拼接,以生成并输出名称为sql:02_join_ip_mapping的数据表。如图4所示,在显示控件205中列出了第二数据表拼接步骤的概况信息中的输出表的名称(sql:02_join_ip_mapping)。此外,在显示控件205中还可列出第二数据表拼接步骤的其它数据信息和/或处理信息,以便于用户理解第二数据表拼接步骤的细节。
显示控件206对应的数据处理步骤为特征抽取步骤,特征抽取步骤用于从第二数据表拼接步骤的输出表的各条数据记录中抽取特征,以生成并输出相应的特征表。如图4所示,在显示控件206中列出了特征抽取步骤的名称fe(该特征抽取步骤的名称可以是通过解析处理而获取的操作名称,或者,可采用其它方式来提供默认名称,此外,用户也可主动修改或调整该名称)。此外,在显示控件206中还可列出特征抽取步骤的其它数据信息和/或处理信息,以便于用户理解特征抽取步骤的细节。例如,在显示控件206中还可列出用于描绘特征抽取步骤的处理过程的子流程图(图4中未示出)。
如图4所示,显示控件201、202、203、204、205和206的大小根据所列出的内容多少而自适应地调整。显示控件201、202和204所对应的数据处理步骤为数据引入步骤,显示控件203、205和206所对应的数据处理步骤为非数据引入步骤。为了增加视觉效果,数据引入步骤所对应的显示控件201、202和204的背景颜色不同于非数据引入步骤所对应的显示控件203、205和206的背景颜色。附加地或可选地,数据引入步骤所对应的显示控件的形状可不同于非数据引入步骤所对应的显示控件的形状。此外,每个显示控件中的字体格式和/或颜色也可根据概况信息类型的不同而不同,特征抽取步骤所对应的显示控件的形态也可不同于数据表拼接步骤所对应的显示控件的形态。用户还可在显示控件中主动修改或添加任何附加信息。
作为示例,在显示控件中还可分层次地展示对应的某些具体概况信息。例如,在图4所示的理解视图的每个节点的显示控件中列出的可以是对应的数据处理步骤的概况信息之中的首要展示信息。相应地,响应于用户对显示控件的操作,还可以在显示控件中进一步列出对应的数据处理步骤的概况信息之中的补充展示信息。下面参照图4和图5进行详细说明。
图5示出根据本公开示例性实施例的用于描绘机器学习过程的数据处理步骤的理解视图的另一示例,该示例为基于图4所示的理解视图之中的至少一部分数据处理步骤而进一步展示的理解视图。
参照图4和图5,显示控件201和301对应于同一节点,显示控件202和302对应于同一节点,显示控件203和303对应于同一节点,显示控件204和304对应于同一节点,显示控件205和305对应于同一节点。为了简洁,在图5中省略了与显示控件206对应的节点。
响应于用户对显示控件201的操作,显示控件201变为显示控件301的形态,即,在显示控件301中进一步列出第一数据引入步骤的输出表的所有字段名称(可选地,也可仅列出至少一部分字段名称)。响应于用户对显示控件202的操作,显示控件202变为显示控件302的形态,即,在显示控件302中进一步列出第二数据引入步骤的输出表的所有字段名称(可选地,也可仅列出至少一部分字段名称)。类似地,响应于用户对显示控件204的操作,显示控件204变为显示控件304的形态,即,在显示控件304中进一步列出第三数据引入步骤的输出表的所有字段名称或至少一部分字段名称。
响应于用户对显示控件203的操作,显示控件203变为显示控件303的形态,即,在显示控件303中进一步列出第一数据表拼接步骤的输出表的所有字段名称(可选地,也可仅列出至少一部分字段名称),其中,具有相同初始来源表的字段名称被排列在一起,并且/或者,按照相同的视觉效果列出具有相同初始来源表的字段名称。可选地,以突出的视觉效果列出新生成字段的字段名称flag。
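Similarly, in response to a user operation on display control 205, display control 205 changes to the form of display control 305; that is, display control 305 further lists all field names of the output table of the second data table join step (optionally, only some of the field names may be listed), with field names that share the same initial source table arranged together and/or listed with the same visual effect. Optionally, the field names ip_city and ip_country of the newly generated fields are listed with a prominent visual effect.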
图4和图5示出的理解视图仅是示例,但是本公开不限于此,例如,在上述示例中,不仅可分别促使每个显示控件在其中显示相应的补充展示信息,还可通过统一的触发机制(例如,对任意一个显示控件的点击或对专门设置的按钮的点击)而促使所有显示控件同时显示各自对应的补充展示信息。此外,根据用户需求或预定设置,可以针对机器学习过程中的任意类型和/或任意数量的数据处理步骤生成并展示理解视图,以便于用户理解。
在根据本公开的上述实施例中,向用户展示的理解视图中展示了机器学习过程的多个数据处理步骤,这里,作为示例,每个数据处理步骤所针对的是整个数据表或特征表。
进一步地,为了帮助用户了解特定特征的生成过程,根据本公开的对机器学习过程的数据处理步骤进行可视化的方法和系统还可以进一步对理解视图中的特定特征的生成过程进行可视化,即,特征追溯。
基于根据本公开的实施例的理解视图,所述方法可以针对理解视图中的特定特征展示用于描绘该特定特征的生成过程的过程展示视图。下面参照图6和图7进行详细描述。
图6示出根据本公开示例性实施例的对特定特征的生成过程进行可视化的方法的流程图。
如图6所示,所述方法包括步骤S21、S22、S23和S24。在步骤S21,解释装置101可以确定所述理解视图中的特定特征。可选地,也可以由单独的确定装置(未示出)来确定所述理解视图中的特定特征。在本公开的实施例中,作为示例,如果用户想要了解所述理解视图中的特定特征的生成过程,则用户可以在所述理解视图中选择与特定特征对应的显示控件。解释装置101可以在步骤S21响应于用户的选择操作而确定所述特定特征,以对所述特定特征的生成过程进行解析。
然后,在步骤S22,解释装置101对所述机器学习过程中用于生成所述特定特征的至少一个数据处理步骤进行解析,以获取所述特定特征的生成过程信息,其中,所述生成过程信息包括所述至少一个数据处理步骤的数据信息和/或处理信息。根据本公开的示例性实施例,机器学习过程可表现为用户通过拖拽节点模块的方式生成的有向无环图(DAG),其中,用户可配置每个节点模块所对应的数据和/或操作。又例如,所述机器学习过程可表现为用户手动编写的计算机程序代码。相应地,用于生成特定特征的所述至少一个数据处理步骤可包括数据引入步骤、数据清洗步骤、数据拼接步骤、时序聚合步骤和/或特征抽取步骤等等。这些数据处理步骤的处理结果可以是与所述特定特征的抽取过程相关的字段或包括所述字段的完整输出表。
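Here, depending on the actual situation, the at least one data processing step may be parsed before, while, or after the machine learning process runs, so that information such as the inputs, outputs, intermediate results, and processing details of the at least one data processing step can be obtained. Note that the at least one data processing step parsed by the interpretation device 101 is traced from the perspective of generating the specific feature; that is, the objects processed or the results produced by the at least one data processing step are used, directly or indirectly, to generate the specific feature. For example, the at least one data processing step may involve the feature extraction process that generates the specific feature, where the feature extraction process may refer only to the extraction that generates the specific feature (and not to the extraction of other features). As another example, the at least one data processing step may involve a join process that produces a data table (which may be the direct or indirect source data table of the fields the specific feature depends on); the data information related to the join process may cover all fields of the data table, or only the fields related to generating the specific feature. In this way, the data processing steps related to the feature of interest can be picked out from the complex data processing steps of the whole machine learning process, helping the user understand the meaning of the feature more clearly.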
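Picking out the steps that contribute to one feature amounts to walking the dependency graph backwards from that feature. The sketch below is a hedged illustration: the `upstream_of` function and the dictionary layout are hypothetical, and the graph mixes fields, steps, and tables as nodes using the names that appear in the FIG. 4 and FIG. 7 examples.

```python
def upstream_of(deps, target):
    """Return the target plus everything it depends on, directly or indirectly."""
    seen = set()
    stack = [target]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(deps.get(node, ()))
    return seen


# node -> nodes it is produced from (features, fields, steps and tables)
deps = {
    "f_trxdate_registerdate_diff": ["trx_date", "register_date"],
    "trx_date": ["sql:01_join_fraud"],
    "register_date": ["sql:01_join_fraud"],
    "sql:01_join_fraud": ["cmb0404_app_trx_detail", "cmb0404_fraud"],
    "ip_city": ["sql:02_join_ip_mapping"],
    "sql:02_join_ip_mapping": ["sql:01_join_fraud", "cmb0404_ip_mapping"],
}

relevant = upstream_of(deps, "f_trxdate_registerdate_diff")
```

In this example the second join and the IP mapping table are left out, because nothing the feature depends on passes through them.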
在步骤S23,视图生成装置102可以基于所述生成过程信息来生成用于描绘所述特定特征的生成过程的过程展示视图。这里,视图生成装置102可基于解析出的各个数据处理步骤自身的信息,形成能够一方面反映出数据处理步骤之间的依赖关系,另一方面反映出每个数据处理步骤自身的数据信息和/或处理信息的过程展示视图。
展示装置103可以在步骤S24以图形化方式展示所述过程展示视图。这里,展示装置103可借由显示器(未示出)等输出装置向用户展示所述过程展示视图,作为示例,展示装置103可通过特定的形式或效果来展示所述过程展示视图,以帮助用户通过展示的过程展示视图来了解特定特征的生成过程。
可选地,所述至少一个数据处理步骤的数据信息可以包括关于所述至少一个数据处理步骤的输入项和/或输出项的信息,所述至少一个数据处理步骤的处理信息可以包括关于所述至少一个数据处理步骤的处理过程的信息。这里,如上所述,所述至少一个数据处理步骤的输入项或输出项可仅涉及与特定特征的抽取操作相关的字段,也可涉及包括上述字段的完整输出表。另外,所述至少一个数据处理步骤的处理信息可涉及每个数据处理步骤各自的处理过程,该处理过程可包括至少一个子步骤,这里,可通过解析处理来获取各个子步骤的信息。
可选地,所述过程展示视图可以是表示所述特定特征的生成过程的流程图,其中,所述流程图中的节点可以分别表示对应的数据处理步骤的输入项、输出项和/或处理过程。相应地,以图形化方式展示所述过程展示视图的处理可以包括:展示装置103可以在每个节点的显示控件中展示关于对应的数据处理步骤的输入项、输出项和/或处理过程的信息。这里,每个节点可具有对应的显示控件,该显示控件可以是具有各种形状的显示框,在显示框内或显示框周围可进一步展示关于输入项、输出项和/或处理过程的信息。应注意,上述信息可以被直接显示在显示框内或显示框周围;此外,上述信息也可以采取隐藏方式来进行显示,使得相关内容在用户执行相应的触发操作(例如,点击显示控件)之后才显示出来。这里,作为示例,可由机器学习平台预先设置在每个节点的显示控件中会列出哪些信息,也可根据用户的选择来设置或调整将在各个显示控件中列出的信息。
可选地,所述至少一个数据处理步骤可以包括用于生成所述特定特征的特征抽取步骤。所述特征抽取步骤的数据信息可以包括关于所述特征抽取步骤的输入项和/或输出项的信息,所述特征抽取步骤的处理信息可以包括关于所述特征抽取步骤的处理过程的信息。这里,所述特征抽取步骤是指对于相应的数据表,针对其中的一个或多个来源字段按照特定的抽取方法进行处理,以得到特征的过程。作为示例,这里的抽取方法包括但不限于:诸如针对数值型字段进行取整、取对数的算术运算手段、诸如直接将完整字段作为特征、截取部分字段(例如,完整日期字段中的年份部分)的转换手段、诸如对连续值特征进行离散化、将不同特征进行组合的特征运算手段等等。相应地,数据信息可包括关于来源字段的信息、关于输出特征或中间结果的信息和/或关于包括来源字段的数据表的信息等。处理信息可包括关于各个特征抽取手段或其进一步细化操作的信息。
可选地,所述过程展示视图中的流程图可以包括:表示作为所述特征抽取步骤的输入项的来源字段的节点、表示作为所述特征抽取步骤的处理过程的抽取处理过程的节点和/或表示作为所述特征抽取步骤的输出项的所述特定特征的节点。相应地,以图形化方式展示所述过程展示视图的处理还可以包括:展示装置103可以在表示来源字段的节点的显示控件中展示来源字段的名称,在表示抽取处理过程的节点的显示控件中展示抽取处理过程的名称和/或流程信息,并且/或者,在表示所述特定特征的节点的显示控件中展示所述特定特征的名称。根据本公开的示例性实施例,可在所述过程展示视图中设置单独的节点来分别代表相应的输入项、输出项和处理过程。也就是说,为了更加清楚地追溯特定特征的生成过程中所涉及的关键信息,可为与单个数据处理步骤相应的关键信息设置单独的显示控件。在所述显示控件中,可进一步列出关键信息的名称和/或流程信息。
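Optionally, the flow information of the extraction process may include the names of one or more processing methods applied in the extraction process, and the node representing the extraction process may include child nodes that respectively represent the one or more processing methods. Accordingly, displaying the process display view graphically may further include: the display device 103 displaying the names of the one or more processing methods in the display controls of the respective child nodes. Here, the extraction process may involve one or more processing methods, for example, rounding a numeric field and then taking its logarithm. Taken together, these processing methods may correspond to a sub-flowchart in which each processing method corresponds to one child node, the connections between child nodes reflect the dependencies between the processing methods, and the display control of each child node lists the name of the corresponding processing method.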
可选地,所述流程图还可以包括:表示所述来源字段的来源数据表的节点。相应地,以图形化方式展示所述过程展示视图的处理还可以包括:展示装置103可以在表示所述来源数据表的节点的显示控件中展示所述来源数据表的名称。这里,为了更清楚地了解特征生成过程中所涉及的数据,流程图中可进一步引入表示特征的来源字段所在的数据表的节点。也就是说,在本公开的示例性实施例中,对于输入项的展示,可借助具有包含关系或递进关系的多个节点来完成,例如,在流程图中除了显示作为特征的直接来源的来源字段的节点之外,还可进一步显示作为特征的间接来源的数据表(例如,来源字段所在的数据表)。这里,可在来源数据表的显示控件中列出该来源数据表的名称和/或其它相关信息。
可选地,所述至少一个数据处理步骤还可以包括特征抽取步骤的上游处理步骤,其中,所述上游处理步骤可以用于生成所述来源字段的来源数据表。这里,为了更清楚地追溯特征生成的根本来源,所述流程图还可进一步包括除了特征抽取步骤以外的其它步骤,这些步骤可主要通过引入或拼接的方式来得到特征的来源字段所在的数据表。
可选地,所述上游处理步骤可以包括一个或多个数据表拼接步骤。相应地,所述一个或多个数据表拼接步骤的数据信息可以包括关于所述一个或多个数据表拼接步骤的输入项和/或输出项的信息,所述一个或多个数据表拼接步骤的处理信息可以包括关于所述一个或多个数据表拼接步骤的处理过程的信息。根据本公开的示例性实施例,作为示例,特征的来源字段所在的来源数据表可以是一次或多次数据拼表的最终输出结果,在这种情况下,过程展示视图中所显示的至少一个数据处理步骤可进一步包括与每次拼表操作对应的数据表拼接步骤。针对数据表拼接步骤的解析处理可得到被拼接的数据表的名称、数据表中被实际拼接的字段、拼接后生成的数据表的名称、拼接后生成的数据表所包括的字段等,此外,还可得到关于具体拼接过程的信息,例如,两个或以上数据表拼接时的主从拼接关系、对齐字段等。
可选地,所述流程图还可以包括:表示作为所述一个或多个数据表拼接步骤的输入项的输入数据表的节点和/或表示作为所述一个或多个数据表拼接步骤的处理过程的拼接处理过程的节点。相应地,以图形化方式展示所述过程展示视图的处理还可以包括:展示装置103可以在表示输入数据表的节点的显示控件中分别展示输入数据表的名称,并且/或者,展示装置103可以在表示拼接处理过程的节点的显示控件中分别展示拼接处理过程的名称。根据本公开的示例性实施例,可按照与特征抽取步骤类似的各种方式来展示数据表拼接步骤的数据信息和/或处理信息。此外,作为示例,针对多次数据拼接的情况,为了避免重复,可仅设置与输入项对应的节点,而不设置与输出项对应的节点。这是因为,在某些情况下,后续数据拼接步骤的输入表同时也是先前数据拼接步骤的输出表,因此,上述方式可避免出现指示同一数据表的重复节点。
可选地,对应于所述特定特征的节点的显示控件、对应于特征抽取步骤的节点的显示控件、对应于来源字段的节点的显示控件、对应于拼接处理过程的节点的显示控件、对应于来源数据表的节点的显示控件和/或对应于输入数据表的节点的显示控件分别具有各自的形态。例如,显示控件的形状、边框线型、边框颜色、背景颜色、背景图案、显示控件中的字体格式、字体样式(例如,加粗、斜体和/或加下划线)、字体颜色等中的至少一项可以根据对应于不同内容的节点而不同。
可选地,以图形化方式展示所述过程展示视图的处理还可以包括:展示装置103可以响应于用户对过程展示视图中的特定显示控件的选择操作,在与所述特定显示控件对应的详情显示控件中列出关于所述特定显示控件中展示的输入项、输出项和/或处理过程的详情信息。根据本公开的示例性实施例,除了借助以上描述的流程图节点来展示各个相关数据处理步骤的至少一部分信息之外,还可在专门的详情显示控件中进一步展示关于流程图节点中列出的各步骤的输入项、输出项和/或处理过程的详情信息。这里,详情显示控件可设置在对应的显示控件的周围,也可排列在整个界面中的任意位置。此外,作为另一示例,详情显示控件还可由原来的显示控件扩充而得,例如,当用户选择了特定显示控件时,该特定显示控件会进一步扩大以容纳需要显示的详情信息。
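Optionally, the detail information about an input item and/or output item may include at least one of: the name corresponding to the input item and/or output item, a description added by the user, the number of rows of the data table, the number of columns of the data table, the field names of the data table, the field types of the data table, at least part of the data in the data table, and statistical analysis information about the data in the data table. The detail information about a processing procedure may include at least one of: the name corresponding to the processing procedure, a description added by the user, code information, and the transformation of example data. Here, detail information about data content may include not only attribute or statistical information about the data but also at least part of the example data itself. In addition, detail information about a processing procedure may cover code content related to the data processing, such as configurations or scripts, and may further include a demonstration of how at least part of the example data is processed. Showing the detail information for each displayed item on top of the process display view helps the user grasp, intuitively and from all angles, the details involved in the whole feature generation process, and thus design or run the machine learning process more effectively.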
为了更直观地描述过程展示视图,假设在根据本公开的一个实施例中,用户对理解视图中展示的特定特征f_trxdate_registerdate_diff感兴趣,希望进一步了解所述特定特征的生成过程。下面将参照图7详细描述用于描绘所述特定特征的生成过程的过程展示视图,但是本公开不限于此,所述特定特征可以是理解视图中展示的任何一个或多个特征。
图7示出根据本公开示例性实施例的用于描绘特定特征的生成过程的过程展示视图的示例。利用根据本公开的对特定特征的生成过程进行可视化的方法生成图7所示的过程展示视图。
图7中的左侧流程图是各个显示控件按照对应的生成过程元素之间的依赖关系连接而成的流程图,其中,利用显示控件之间的箭头指示显示控件之间的依赖关系。在本文中,所述生成过程元素包括所述特定特征的生成过程中涉及的各种元素,例如,所述特定特征、处理过程、处理过程中的处理方法、来源字段、来源数据表和输入数据表。
如图7所示,在显示控件401中列出了用户感兴趣的特定特征的特征名称f_trxdate_registerdate_diff。
通过对机器学习过程中用于生成所述特定特征的数据处理步骤进行解析,可以获取所述特定特征的生成过程信息。根据所述生成过程信息可以确定所述特定特征是通过特征抽取步骤产生的。通过对所述特征抽取步骤进行解析,可以获取所述特征抽取步骤的数据信息和/或处理信息。所述特征抽取步骤的数据信息可以包括关于所述特征抽取步骤的输入项和/或输出项的信息。所述特征抽取步骤的处理信息可以包括如何基于来源字段来生成特征f_trxdate_registerdate_diff的信息。
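In the embodiment shown in FIG. 7, the data information of the feature extraction step may include the names trx_date and register_date of the source fields that are the input items of the feature extraction step, and the feature name f_trxdate_registerdate_diff of the specific feature that is its output item. In addition, the processing information of the feature extraction step may include information about the extraction process of the feature extraction step, that is, the name and/or flow information of the feature extraction step. In this embodiment, parsing the feature extraction step determines the extraction process used to generate the specific feature:
f_trxdate_registerdate_diff=discrete(lineartrans(datediff(trx_date,register_date),"0.01","0"))
Here, datediff, lineartrans (with the parameters "0.01" and "0"), and discrete are the names of the processing methods applied in the extraction process, and the methods are executed in the order datediff → lineartrans("0.01","0") → discrete. This information may be included in the flow information of the feature extraction step.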
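The execution order of the processing methods can be read off the nested expression by visiting the innermost call first. The sketch below relies on an assumption: the right-hand side of the expression happens to be valid Python call syntax, so the standard `ast` module can parse it. The platform's real feature-expression grammar is not described in the disclosure, and `extraction_chain` is a hypothetical name.

```python
import ast

EXPR = 'discrete(lineartrans(datediff(trx_date, register_date), "0.01", "0"))'


def extraction_chain(expr):
    """Return (method names in execution order, source field names)."""
    methods, fields = [], []

    def visit(node):
        if isinstance(node, ast.Call):
            for arg in node.args:       # arguments are evaluated first ...
                visit(arg)
            methods.append(node.func.id)  # ... then the call itself
        elif isinstance(node, ast.Name):
            fields.append(node.id)      # bare names are source fields
        # string constants such as "0.01" are method parameters and are skipped

    visit(ast.parse(expr, mode="eval").body)
    return methods, fields


methods, fields = extraction_chain(EXPR)
```

The two returned lists correspond to the child nodes of the extraction process (one per method) and to the source-field nodes upstream of it.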
如图7所示,在显示控件402中展示抽取处理过程的名称(FE)和流程信息。可选地,所述流程信息可以通过由子节点的显示控件构成的子流程图来展示。在显示控件402a、402b和402c中分别展示对应的处理方法的名称。此外,还可以在显示控件402或显示控件402a上游的显示控件403和404中分别展示对应的来源字段的名称。
可选地,还可进一步展示所述来源字段的来源数据表和/或生成过程。根据所述特定特征的生成过程信息,所述来源数据表是通过数据表拼接步骤生成的。可选地,可进一步展示所述数据表拼接步骤的输入数据表和拼接处理过程。所述来源字段的来源数据表可以为所述数据表拼接步骤的输出表(在图7的示例中未示出)。
如图7所示,在显示控件403和404上游的显示控件405中,展示所述数据表拼接步骤的拼接处理过程的名称sql:01_join_fraud。在显示控件405上游的显示控件406和407中,分别展示所述数据表拼接步骤的输入数据表的名称cmb0404_app_trx_detail和cmb0404_fraud。
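The flowchart on the left of FIG. 7 gives an intuitive picture of how the specific feature (named f_trxdate_registerdate_diff) is generated. Two input data tables (named cmb0404_app_trx_detail and cmb0404_fraud) are fed into the data table join step (whose join process is named sql:01_join_fraud) and joined. The feature extraction step runs after the data table join step, and only two fields of the join step's output table (named trx_date and register_date) are associated with the extraction of the specific feature; these two fields may be called source fields. The specific feature is then generated by applying several processing methods to the source fields in the extraction process of the feature extraction step (datediff → lineartrans("0.01","0") → discrete).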
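Optionally, the display controls above may take different forms depending on the type of generation-process element they correspond to. For example, as shown in FIG. 7, display controls 406 and 407 correspond to input data tables and may be shown as elliptical controls; display control 405 corresponds to the join process and may be shown as a rectangular control; display controls 403 and 404 correspond to source fields and may be shown as parallelogram controls; display control 402 corresponds to the extraction process and contains display controls 402a, 402b, and 402c, which correspond to processing methods, so display control 402 may be shown as a rectangular control with several elliptical controls (402a, 402b, and 402c) embedded in it; and display control 401 corresponds to the specific feature and may be shown as a rounded-rectangle control.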
可选地,所述形态的不同不仅限于显示控件的形状的不同,其可以包括显示控件的形状、边框线型、边框颜色、背景颜色、背景图案、显示控件中的字体格式、字体样式(例如,加粗、斜体和/或加下划线)、字体颜色等中的至少一项的不同。
根据本公开的过程展示视图可仅包括图7中的左侧流程图。附加地,作为可选方式,还可响应于用户对所述流程图中的特定显示控件的选择操作,在与所述特定显示控件对应的详情显示控件中列出关于所述特定显示控件中展示的输入项、输出项和/或处理过程的详情信息。
如图7所示,如果用户点击显示控件406,则可以生成并展示对应的详情显示控件506。在详情显示控件506中列出与显示控件406对应的输入数据表的名称(cmb0404_app_trx_detail)、用户添加的描述(交易表)、输入数据表的行数和列数(80000行18列)。
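If the user clicks display control 405, the corresponding detail display control 505 may be generated and displayed. Detail display control 505 lists the name of the join process corresponding to display control 405 (sql:01_join_fraud), the description added by the user (join the transaction table with confirmed risky transactions; generate the label field flag), the code information (code lines 1 to 6), and the row and column counts of the output data table (80000 rows, 18 columns).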
如果用户点击显示控件403,则可以生成并展示对应的详情显示控件503。在详情显示控件503中列出与显示控件403对应的来源字段的数据统计分析信息,所述数据统计分析信息可以包括概要、统计、高频取值等信息。
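If the user clicks display control 402a, the corresponding detail display control 502a may be generated and displayed. Detail display control 502a lists the transformation of example data produced by the processing method corresponding to display control 402a (named datediff). For example, it lists the transformation from the input example data (the data of the trx_date and register_date fields) to the output example data (the result of the DateDiff processing method), where the field type of the output example data is integer (Int). This illustrates schematically how the processing method named datediff processes the data. It should be understood that the transformation of some example data records through some or all of the feature extraction steps may be displayed.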
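A transformation preview of this kind can be sketched as a small table of example rows with the method's integer result appended. Everything specific here is an assumption: the disclosure does not define datediff, so the sketch takes it to be the number of whole days from register_date to trx_date, and the sample dates are invented purely for illustration.

```python
from datetime import date


def datediff(later: date, earlier: date) -> int:
    """Assumed semantics: whole days between the two dates (integer result)."""
    return (later - earlier).days


# Hypothetical example records; not data from the disclosure.
examples = [
    {"trx_date": date(2018, 4, 4), "register_date": date(2018, 1, 1)},
    {"trx_date": date(2018, 4, 4), "register_date": date(2018, 4, 1)},
]

# Input columns plus the Int output column, as a detail control might list them.
preview = [
    {**row, "datediff": datediff(row["trx_date"], row["register_date"])}
    for row in examples
]
```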
此外,在各个详情显示控件中还可以设置快速进入数据预览的入口和/或快速进入处理过程的程序配置的入口。
根据本公开的过程展示视图不仅限于图7示出的示例,在根据本公开的过程展示视图中可以根据用户需求或设置来针对特定特征展示更多或更少的生成过程信息。例如,可以仅展示直接生成特定特征的处理过程的相关信息,可以展示从引入原始数据开始一直到生成特定特征为止的整个生成过程的相关信息,或者可以详细展示所述整个生成过程中的部分生成过程的相关信息而剩余的生成过程的相关信息可以被简化或省略。
另一方面,根据本公开示例性实施例的对机器学习过程的数据处理步骤进行可视化的系统100所包括的各个装置也可以通过硬件、软件、固件、中间件、微代码或其任意组合来实现。
当以软件、固件、中间件或微代码实现时,用于执行相应操作的程序代码或者代码段可以存储在诸如存储介质的计算机可读存储介质中,使得处理器可通过读取并运行相应的程序代码或者代码段来执行相应的操作。例如,本公开的示例性实施例可以实现为对机器学习过程的数据处理步骤进行可视化的计算机可读存储介质,其中,在所述计算机可读介质上记录有用于由一个或多个处理器执行对机器学习过程的数据处理步骤进行可视化的方法的计算机程序。所述处理器可实现为计算装置。换句话来讲,本公开提供了一种存储指令的计算机可读存储介质,其中,当所述指令被至少一个计算装置运行时,促使所述至少一个计算装置执行用于实现为对机器学习过程的数据处理步骤进行可视化的相关步骤。
作为另一示例,本公开的示例性实施例还可以实现为对机器学习过程的数据处理步骤进行可视化的计算装置,该计算装置包括一个或多个存储装置和一个或多个处理器,其中,在所述一个或多个存储装置中存储有计算机可执行指令集合,当所述一个或多个处理器执行所述计算机可执行指令集合时,执行用于执行对机器学习过程的数据处理步骤进行可视化的方法。
作为示例,所述处理器可实现为计算装置,相应地,本公开的方案可实现为包括至少一个计算装置和至少一个存储指令的存储装置的系统,其中,所述指令在被所述至少一个计算装置运行时,促使所述至少一个计算装置执行用于实现为对机器学习过程的数据处理步骤进行可视化的相关步骤。
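Specifically, the computing device may be deployed in a server or a client, or on a node device in a distributed network environment. In addition, the computing device may be a personal computer, a tablet device, a personal digital assistant, a smartphone, a web application, or another device capable of executing the above instruction set.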
这里,所述计算装置并非必须是单个的计算装置,还可以是任何能够单独或联合执行上述指令(或指令集)的装置或电路的集合体。计算装置还可以是集成控制系统或系统管理器的一部分,或者可被配置为与本地或远程(例如,经由无线传输)以接口互联的便携式电子装置。
在所述计算装置中,处理器可包括中央处理器(CPU)、图形处理器(GPU)、可编程逻辑装置、专用处理器系统、微控制器或微处理器。作为示例而非限制,处理器还可包括模拟处理器、数字处理器、微处理器、多核处理器、处理器阵列、网络处理器等。
根据本公开示例性实施例的对机器学习过程的数据处理步骤进行可视化的方法中所描述的某些操作可通过软件方式来实现,某些操作可通过硬件方式来实现,此外,还可通过软硬件结合的方式来实现这些操作。
处理器可运行存储在存储装置之一中的指令或代码,其中,所述存储装置还可以存储数据。指令和数据还可经由网络接口装置而通过网络被发送和接收,其中,所述网络接口装置可采用任何已知的传输协议。
存储装置可与处理器集成为一体,例如,将RAM或闪存布置在集成电路微处理器等之内。此外,存储装置可包括独立的装置,诸如,外部盘驱动、存储阵列或任何数据库系统可使用的其他存储装置。存储装置和处理器可在操作上进行耦合,或者可例如通过I/O端口、网络连接等互相通信,使得处理器能够读取存储在存储装置中的文件。
此外,所述计算装置还可包括视频显示器(诸如,液晶显示器)和用户交互接口(诸如,键盘、鼠标、触摸输入装置等)。计算装置的所有组件可经由总线和/或网络而彼此连接。
根据本公开示例性实施例的对机器学习过程的数据处理步骤进行可视化的方法所涉及的操作可被描述为各种互联或耦合的功能块或功能示图。然而,这些功能块或功能示图可被均等地集成为单个的逻辑装置或按照非确切的边界进行操作。
以上描述了本公开的各示例性实施例,应理解,上述描述仅是示例性的,并非穷尽性的,本公开不限于所披露的各示例性实施例。在不偏离本公开的范围和精神的情况下,对于本技术领域的普通技术人员来说许多修改和变更都是显而易见的。因此,本公开的保护范围应该以权利要求的范围为准。

Claims (50)

  1. 一种由至少一个计算装置执行的对机器学习过程的数据处理步骤进行可视化的方法,包括:
    对预先定义的机器学习过程的数据处理步骤进行解析,以获取所述数据处理步骤的概况信息,其中,所述概况信息包括数据处理步骤的数据信息和处理信息中的至少一个;
    基于获取的概况信息来生成用于描绘所述机器学习过程的数据处理步骤的理解视图;以及
    以图形化方式展示所述理解视图。
  2. 如权利要求1所述的方法,其中,所述概况信息包括所述数据处理步骤的名称、所述数据处理步骤的输出表的名称、输出表的行数、输出表的列数、输出表的字段名称、处理过程和用户添加的步骤描述之中的至少一项。
  3. 如权利要求2所述的方法,其中,所述理解视图为表示所述机器学习过程的数据处理步骤的流程图,其中,所述流程图中的节点分别对应于每个数据处理步骤;并且,
    以图形化方式展示所述理解视图的处理包括:通过在每个节点的显示控件中列出对应的数据处理步骤的概况信息来展示所述机器学习过程的数据处理步骤。
  4. 如权利要求3所述的方法,其中,在每个节点的显示控件中列出对应的数据处理步骤的概况信息的处理包括:
    在每个节点的显示控件中默认列出对应的数据处理步骤的概况信息之中的首要展示信息;以及
    响应于用户对显示控件的操作,在显示控件中进一步列出对应的数据处理步骤的概况信息之中的补充展示信息。
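  5. The method of claim 4, wherein the primary display information includes at least one of the name of the data processing step, the name of the output table, the number of rows of the output table, the number of columns of the output table, and the step description added by the user, and the supplementary display information includes at least one of at least some of the field names of the output table and at least part of the processing procedure of the data processing step.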
  6. 如权利要求3所述的方法,其中,在每个节点的显示控件中列出对应的数据处理步骤的概况信息的处理包括:
    根据显示控件中列出的概况信息的内容来自适应地调整显示控件的大小。
  7. 如权利要求3所述的方法,其中,在每个节点的显示控件中列出对应的数据处理步骤的概况信息的处理还包括:在每个节点的显示控件中以突出的视觉效果列出对应的数据处理步骤的输出表的字段名称之中的新生成的字段名称。
  8. 如权利要求3所述的方法,其中,在每个节点的显示控件中列出的概况信息包括对应的数据处理步骤的输出表的所有字段名称,其中,具有相同初始来源表的字段名称被排列在一起。
  9. 如权利要求8所述的方法,其中,以图形化方式展示所述理解视图的处理还包括:在所有节点的显示控件中按照相同的视觉效果来列出具有相同初始来源表的字段名称。
  10. 如权利要求3所述的方法,其中,所述数据处理步骤的处理过程在节点的显示控件中通过子流程图的形式被列出。
  11. 如权利要求3所述的方法,其中,数据处理步骤被划分为数据引入步骤和非数据引入步骤,并且,对应于数据引入步骤的节点的显示控件和对应于非数据引入步骤的节点的显示控件分别具有各自的形态。
  12. 如权利要求1所述的方法,所述方法还包括:
    确定所述理解视图中的特定特征;
    对所述机器学习过程中用于生成所述特定特征的至少一个数据处理步骤进行解析,以获取所述特定特征的生成过程信息,其中,所述生成过程信息包括所述至少一个数据处理步骤的数据信息和处理信息中的至少一个;
    基于所述生成过程信息来生成用于描绘所述特定特征的生成过程的过程展示视图;以及
    以图形化方式展示所述过程展示视图。
  13. 如权利要求12所述的方法,其中,所述至少一个数据处理步骤的数据信息包括关于所述至少一个数据处理步骤的输入项和输出项中的至少一个的信息,
    所述至少一个数据处理步骤的处理信息包括关于所述至少一个数据处理步骤的处理过程的信息。
  14. 如权利要求13所述的方法,其中,所述过程展示视图为表示所述特定特征的生成过程的流程图,其中,所述流程图中的节点分别表示对应的数据处理步骤的输入项、输出项和处理过程中的至少一个;并且,
    以图形化方式展示所述过程展示视图的处理包括:在每个节点的显示控件中展示关于对应的数据处理步骤的输入项、输出项和处理过程中的至少一个的信息。
  15. 如权利要求14所述的方法,其中,所述至少一个数据处理步骤包括用于生成所述特定特征的特征抽取步骤,并且,
    所述特征抽取步骤的数据信息包括关于所述特征抽取步骤的输入项和输出项中的至少一个的信息,
    所述特征抽取步骤的处理信息包括关于所述特征抽取步骤的处理过程的信息。
  16. 如权利要求15所述的方法,其中,所述流程图包括:
    表示作为所述特征抽取步骤的输入项的来源字段的节点、表示作为所述特征抽取步骤的处理过程的抽取处理过程的节点和表示作为所述特征抽取步骤的输出项的所述特定特征的节点中的至少一个,并且,
    以图形化方式展示所述过程展示视图的处理还包括:在表示来源字段的节点的显示控件中展示来源字段的名称,在表示抽取处理过程的节点的显示控件中展示抽取处理过程的名称和流程信息中的至少一个,并且,在表示所述特定特征的节点的显示控件中展示所述特定特征的名称。
  17. 如权利要求16所述的方法,其中,抽取处理过程的流程信息包括抽取处理过程中应用的一个或多个处理方法的名称,
    表示抽取处理过程的节点包括分别表示所述一个或多个处理方法的子节点,
    以图形化方式展示所述过程展示视图的处理还包括:在所述子节点的显示控件中分别展示所述一个或多个处理方法的名称。
  18. 如权利要求17所述的方法,其中,所述流程图还包括:表示所述来源字段的来源数据表的节点,并且,
    以图形化方式展示所述过程展示视图的处理还包括:在表示所述来源数据表的节点的显示控件中展示所述来源数据表的名称。
  19. 如权利要求18所述的方法,其中,所述至少一个数据处理步骤还包括特征抽取步骤的上游处理步骤,其中,所述上游处理步骤用于生成所述来源字段的来源数据表。
  20. 如权利要求19所述的方法,其中,所述上游处理步骤包括一个或多个数据表拼接步骤,并且,
    所述一个或多个数据表拼接步骤的数据信息包括关于所述一个或多个数据表拼接步骤的输入项和输出项中的至少一个的信息,
    所述一个或多个数据表拼接步骤的处理信息包括关于所述一个或多个数据表拼接步骤的处理过程的信息。
  21. 如权利要求20所述的方法,其中,
    所述流程图还包括:表示作为所述一个或多个数据表拼接步骤的输入项的输入数据表的节点和表示作为所述一个或多个数据表拼接步骤的处理过程的拼接处理过程的节点中的至少一个,并且,
    以图形化方式展示所述过程展示视图的处理还包括:在表示输入数据表的节点的显示控件中分别展示输入数据表的名称,并且,在表示拼接处理过程的节点的显示控件中分别展示拼接处理过程的名称。
  22. 如权利要求21所述的方法,其中,对应于所述特定特征的节点的显示控件、对应于特征抽取步骤的节点的显示控件、对应于来源字段的节点的显示控件、对应于拼接处理过程的节点的显示控件、对应于来源数据表的节点的显示控件和对应于输入数据表的节点的显示控件中的至少一个分别具有各自的形态。
  23. 如权利要求14所述的方法,其中,以图形化方式展示所述过程展示视图的处理还包括:
    响应于用户对过程展示视图中的特定显示控件的选择操作,在与所述特定显示控件对应的详情显示控件中列出关于所述特定显示控件中展示的输入项、输出项和处理过程中的至少一个的详情信息。
  24. 如权利要求23所述的方法,其中,关于输入项和输出项中的至少一个的详情信息包括与所述输入项和输出项中的至少一个对应的名称、用户添加的描述、数据表的行数、数据表的列数、数据表的字段名称、数据表的字段类型、数据表中的至少一部分数据、数据表中的数据的统计分析信息以及字段的数据统计分析信息中的至少一项,
    关于处理过程的详情信息包括与处理过程对应的名称、用户添加的描述、代码信息和示例数据的变换过程中的至少一项。
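  25. A computer-readable storage medium storing instructions, wherein the instructions, when run by at least one computing device, cause the at least one computing device to perform the method of any one of claims 1 to 24.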
  26. 一种对机器学习过程的数据处理步骤进行可视化的系统,所述系统包括:
    解释装置,用于对预先定义的机器学习过程的数据处理步骤进行解析,以获取所述数据处理步骤的概况信息,其中,所述概况信息包括数据处理步骤的数据信息和处理信息中的至少一个;
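    a view generation device configured to generate, based on the obtained profile information, an understanding view depicting the data processing steps of the machine learning process; and
    a display device configured to display the understanding view graphically.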
  27. 一种包括至少一个计算装置和至少一个存储指令的存储装置的系统,其中,所述指令在被所述至少一个计算装置运行时,促使所述至少一个计算装置执行用于对机器学习过程的数据处理步骤进行可视化的以下步骤:
    对预先定义的机器学习过程的数据处理步骤进行解析,以获取所述数据处理步骤的概况信息,其中,所述概况信息包括数据处理步骤的数据信息和处理信息中的至少一个;
    基于获取的概况信息来生成用于描绘所述机器学习过程的数据处理步骤的理解视图;以及
    以图形化方式展示所述理解视图。
  28. 如权利要求27所述的系统,其中,所述概况信息包括所述数据处理步骤的名称、所述数据处理步骤的输出表的名称、输出表的行数、输出表的列数、输出表的字段名称、处理过程和用户添加的步骤描述之中的至少一项。
  29. 如权利要求28所述的系统,其中,所述理解视图为表示所述机器学习过程的数据处理步骤的流程图,其中,所述流程图中的节点分别对应于每个数据处理步骤;并且,
    以图形化方式展示所述理解视图的处理步骤包括:通过在每个节点的显示控件中列出对应的数据处理步骤的概况信息来展示所述机器学习过程的数据处理步骤。
  30. 如权利要求29所述的系统,其中,在每个节点的显示控件中列出对应的数据处理步骤的概况信息的处理步骤包括:
    在每个节点的显示控件中默认列出对应的数据处理步骤的概况信息之中的首要展示信息;以及
    响应于用户对显示控件的操作,在显示控件中进一步列出对应的数据处理步骤的概况信息之中的补充展示信息。
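  31. The system of claim 30, wherein the primary display information includes at least one of the name of the data processing step, the name of the output table, the number of rows of the output table, the number of columns of the output table, and the step description added by the user, and the supplementary display information includes at least one of at least some of the field names of the output table and at least part of the processing procedure of the data processing step.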
  32. 如权利要求29所述的系统,其中,在每个节点的显示控件中列出对应的数据处理步骤的概况信息的处理步骤包括:
    根据显示控件中列出的概况信息的内容来自适应地调整显示控件的大小。
  33. 如权利要求29所述的系统,其中,在每个节点的显示控件中列出对应的数据处理步骤的概况信息的处理步骤还包括:在每个节点的显示控件中以突出的视觉效果列出对应的数据处理步骤的输出表的字段名称之中的新生成的字段名称。
  34. 如权利要求29所述的系统,其中,在每个节点的显示控件中列出的概况信息包括对应的数据处理步骤的输出表的所有字段名称,其中,具有相同初始来源表的字段名称被排列在一起。
  35. 如权利要求34所述的系统,其中,以图形化方式展示所述理解视图的处理步骤还包括:在所有节点的显示控件中按照相同的视觉效果来列出具有相同初始来源表的字段名称。
  36. 如权利要求29所述的系统,其中,所述数据处理步骤的处理过程在节点的显示控件中通过子流程图的形式被列出。
  37. 如权利要求29所述的系统,其中,数据处理步骤被划分为数据引入步骤和非数据引入步骤,并且,对应于数据引入步骤的节点的显示控件和对应于非数据引入步骤的节点的显示控件分别具有各自的形态。
  38. The system of claim 27, wherein the steps further include:
    确定所述理解视图中的特定特征;
    对所述机器学习过程中用于生成所述特定特征的至少一个数据处理步骤进行解析,以获取所述特定特征的生成过程信息,其中,所述生成过程信息包括所述至少一个数据处理步骤的数据信息和处理信息中的至少一个,
    基于所述生成过程信息来生成用于描绘所述特定特征的生成过程的过程展示视图,并且,
    以图形化方式展示所述过程展示视图。
  39. 如权利要求38所述的系统,其中,所述至少一个数据处理步骤的数据信息包括关于所述至少一个数据处理步骤的输入项和输出项中的至少一个的信息,
    所述至少一个数据处理步骤的处理信息包括关于所述至少一个数据处理步骤的处理过程的信息。
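  40. The system of claim 39, wherein the process display view is a flowchart representing the generation process of the specific feature, wherein nodes in the flowchart respectively represent at least one of the input item, the output item, and the processing procedure of the corresponding data processing step; and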
    以图形化方式展示所述过程展示视图的处理步骤包括:在每个节点的显示控件中展示关于对应的数据处理步骤的输入项、输出项和处理过程中的至少一个的信息。
  41. 如权利要求40所述的系统,其中,所述至少一个数据处理步骤包括用于生成所述特定特征的特征抽取步骤,并且,
    所述特征抽取步骤的数据信息包括关于所述特征抽取步骤的输入项和输出项中的至少一个的信息,
    所述特征抽取步骤的处理信息包括关于所述特征抽取步骤的处理过程的信息。
  42. 如权利要求41所述的系统,其中,
    所述流程图包括:表示作为所述特征抽取步骤的输入项的来源字段的节点、表示作为所述特征抽取步骤的处理过程的抽取处理过程的节点和表示作为所述特征抽取步骤的输出项的所述特定特征的节点中的至少一个,并且,
    以图形化方式展示所述过程展示视图的处理步骤还包括:在表示来源字段的节点的显示控件中展示来源字段的名称,在表示抽取处理过程的节点的显示控件中展示抽取处理过程的名称和流程信息中的至少一个,并且,在表示所述特定特征的节点的显示控件中展示所述特定特征的名称。
  43. 如权利要求42所述的系统,其中,抽取处理过程的流程信息包括抽取处理过程中应用的一个或多个处理方法的名称,
    表示抽取处理过程的节点包括分别表示所述一个或多个处理方法的子节点,
    以图形化方式展示所述过程展示视图的处理步骤还包括:在所述子节点的显示控件中分别展示所述一个或多个处理方法的名称。
  44. 如权利要求43所述的系统,其中,所述流程图还包括:表示所述来源字段的来源数据表的节点,并且,
    以图形化方式展示所述过程展示视图的处理步骤还包括:在表示所述来源数据表的节点的显示控件中展示所述来源数据表的名称。
  45. 如权利要求44所述的系统,其中,所述至少一个数据处理步骤还包括特征抽取步骤的上游处理步骤,其中,所述上游处理步骤用于生成所述来源字段的来源数据表。
  46. 如权利要求45所述的系统,其中,所述上游处理步骤包括一个或多个数据表拼接步骤,并且,
    所述一个或多个数据表拼接步骤的数据信息包括关于所述一个或多个数据表拼接步骤的输入项和输出项中的至少一个的信息,
    所述一个或多个数据表拼接步骤的处理信息包括关于所述一个或多个数据表拼接步骤的处理过程的信息。
  47. 如权利要求46所述的系统,其中,所述流程图还包括:
    表示作为所述一个或多个数据表拼接步骤的输入项的输入数据表的节点和表示作为所述一个或多个数据表拼接步骤的处理过程的拼接处理过程的节点中的至少一个,并且,
    以图形化方式展示所述过程展示视图的处理步骤还包括:在表示输入数据表的节点的显示控件中分别展示输入数据表的名称,并且,在表示拼接处理过程的节点的显示控件中分别展示拼接处理过程的名称。
  48. 如权利要求47所述的系统,其中,对应于所述特定特征的节点的显示控件、对应于特征抽取步骤的节点的显示控件、对应于来源字段的节点的显示控件、对应于拼接处理过程的节点的显示控件、对应于来源数据表的节点的显示控件和对应于输入数据表的节点的显示控件中的至少一个分别具有各自的形态。
  49. 如权利要求40所述的系统,其中,以图形化方式展示所述过程展示视图的处理步骤还包括:
    响应于用户对过程展示视图中的特定显示控件的选择操作,在与所述特定显示控件对应的详情显示控件中列出关于所述特定显示控件中展示的输入项、输出项和处理过程中的至少一个的详情信息。
  50. 如权利要求49所述的系统,其中,关于输入项和输出项中的至少一个的详情信息包括与所述输入项和输出项中的至少一个对应的名称、用户添加的描述、数据表的行数、数据表的列数、数据表的字段名称、数据表的字段类型、数据表中的至少一部分数据、数据表中的数据的统计分析信息以及字段的数据统计分析信息中的至少一项,
    关于处理过程的详情信息包括与处理过程对应的名称、用户添加的描述、代码信息和示例数据的变换过程中的至少一项。
PCT/CN2019/101444 2018-08-17 2019-08-19 对机器学习过程的数据处理步骤进行可视化的方法和系统 WO2020035076A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810940269.6 2018-08-17
CN201810940269.6A CN110188886B (zh) 2018-08-17 2018-08-17 对机器学习过程的数据处理步骤进行可视化的方法和系统

Publications (1)

Publication Number Publication Date
WO2020035076A1 true WO2020035076A1 (zh) 2020-02-20

Family

ID=67713849

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/101444 WO2020035076A1 (zh) 2018-08-17 2019-08-19 对机器学习过程的数据处理步骤进行可视化的方法和系统

Country Status (2)

Country Link
CN (1) CN110188886B (zh)
WO (1) WO2020035076A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117131071A (zh) * 2023-10-26 2023-11-28 中国证券登记结算有限责任公司 Data processing method and apparatus, electronic device, and computer-readable medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107169575A (zh) * 2017-06-27 2017-09-15 北京天机数测数据科技有限公司 Modeling system and method for visualizing machine learning training models
US20180060404A1 (en) * 2016-08-29 2018-03-01 Linkedin Corporation Schema abstraction in data ecosystems
CN107844837A (zh) * 2017-10-31 2018-03-27 第四范式(北京)技术有限公司 Method and system for tuning algorithm parameters of a machine learning algorithm

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5923328A (en) * 1996-08-07 1999-07-13 Microsoft Corporation Method and system for displaying a hierarchical sub-tree by selection of a user interface element in a sub-tree bar control
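JP2003248676A (ja) * 2002-02-22 2003-09-05 Communication Research Laboratory Solution data editing processing device, solution data editing processing method, automatic summarization processing device, and automatic summarization processing method
CN100373855C (zh) * 2002-05-24 2008-03-05 中兴通讯股份有限公司 Interface display system and method compatible with multiple devices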
US20040153445A1 (en) * 2003-02-04 2004-08-05 Horvitz Eric J. Systems and methods for constructing and using models of memorability in computing and communications applications
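CN100535913C (zh) * 2006-06-29 2009-09-02 中国科学院上海生命科学研究院 Visual analysis and display method for chip data analysis
CN101504736A (zh) * 2009-02-27 2009-08-12 江汉大学 Method for implementing neural network algorithms based on Delphi software
JP6558364B2 (ja) * 2014-05-22 2019-08-14 ソニー株式会社 Information processing device, information processing method, and program
CN104021460B (zh) * 2014-06-27 2018-07-10 北京太格时代自动化系统设备有限公司 Workflow management system and workflow processing method
CN106021245A (zh) * 2015-03-18 2016-10-12 华为技术有限公司 Data visualization method and apparatus
CN104978947B (zh) * 2015-07-17 2018-06-05 京东方科技集团股份有限公司 Display state adjustment method, display state adjustment device, and display device
CN105892633A (zh) * 2015-11-18 2016-08-24 乐视致新电子科技(天津)有限公司 Gesture recognition method and virtual reality display output device
CN106802792B (zh) * 2016-12-09 2020-01-03 合肥国为电子有限公司 Method for processing an interactive interface operation request queue based on a BP neural network
CN108279890B (zh) * 2017-01-06 2021-12-24 阿里巴巴集团控股有限公司 Component publishing method, component construction method, and graphical machine learning algorithm platform
CN108228861B (zh) * 2018-01-12 2020-09-01 第四范式(北京)技术有限公司 Method and system for performing feature engineering for machine learning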

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
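CN117131071A (zh) * 2023-10-26 2023-11-28 中国证券登记结算有限责任公司 Data processing method and apparatus, electronic device, and computer-readable medium
CN117131071B (zh) * 2023-10-26 2024-01-26 中国证券登记结算有限责任公司 Data processing method and apparatus, electronic device, and computer-readable medium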

Also Published As

Publication number Publication date
CN110188886B (zh) 2021-08-20
CN110188886A (zh) 2019-08-30

Similar Documents

Publication Publication Date Title
US10254848B2 (en) Cross-platform data visualizations using common descriptions
US20220392144A1 (en) Image rendering method and apparatus, electronic device, and storage medium
JP4812337B2 (ja) Method and apparatus for generating a form using a form type
Kasyanov et al. Information visualisation based on graph models
US10768904B2 (en) System and method for a computational notebook interface
KR101773574B1 (ko) Method for chart visualization of a data table
US8229735B2 (en) Grammar checker for visualization
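CN110968294B (zh) System and method for building a business domain model
WO2023284312A1 (zh) Workflow construction method and apparatus, device, computer storage medium, and computer program product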
US20130080879A1 (en) Methods and apparatus providing document elements formatting
CN110050270A (zh) System and method for visual traceability of requirements for products
US11321885B1 (en) Generating visualizations of analytical causal graphs
WO2019133224A1 (en) Interactive learning tool
CN110209902B (zh) Method and system for visualizing a feature generation process in a machine learning process
Kasyanov Methods and tools for structural information visualization
US20130191809A1 (en) Graphical representation of an order of operations
WO2020035076A1 (zh) Method and system for visualizing data processing steps of a machine learning process
US10394529B2 (en) Development platform of mobile native applications
Bolte et al. Vis-a-Vis: Visual exploration of visualization source code evolution
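CN112100069A (zh) Visualization method for discrete event simulation event queues oriented to the SIMSCRIPT language
TW201506654A (zh) Rule visualization using an expression tree architecture
JP7151146B2 (ja) Computer program, information processing method, and computer
CN114217794A (zh) Page design method, client device, readable medium, and program product
CN109766093B (zh) Method and apparatus for collaborative real-time editing, electronic device, and storage medium
WO2013170525A1 (zh) Method and apparatus for displaying or drawing the organizational logic flow relationships of a computer application in the form of a logic flow diagram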

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application. Ref document number: 19849729; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase. Ref country code: DE
122 EP: PCT application non-entry in European phase. Ref document number: 19849729; Country of ref document: EP; Kind code of ref document: A1