CN110956272B - Method and system for realizing data processing
- Publication number: CN110956272B (application number CN201911061020.9A)
- Authority
- CN
- China
- Prior art keywords
- feature
- model
- node
- machine learning
- directed acyclic
- Prior art date
- Legal status: Active
Classifications
- G06N20/00: Computing arrangements based on specific computational models; machine learning
- G06F16/9027: Information retrieval; indexing; data structures therefor; storage structures; trees
- G06F16/903: Information retrieval; details of database functions independent of the retrieved data types; querying
- G06F9/451: Arrangements for program control; arrangements for executing specific programs; execution arrangements for user interfaces
Abstract
The invention discloses a method and a system for realizing data processing. The method comprises: in response to a user operation of generating a directed acyclic graph, generating the corresponding directed acyclic graph; and in response to an operation of running the directed acyclic graph, executing the data processing flow corresponding to the directed acyclic graph. The scheme of the invention provides a method and a system for realizing data processing that are widely applicable to different scenarios, and meets users' increasingly complex machine learning scenario requirements.
Description
Technical Field
The present invention relates to the field of data processing for machine learning, and more particularly, to a method and a system for implementing data processing.
Background
With the advent of massive data, artificial intelligence technology has developed rapidly. Machine learning is an inevitable product of the development of artificial intelligence to a certain stage; it aims to mine valuable potential information from large amounts of data by means of computation.
In the field of machine learning, machine learning models are often trained by providing empirical data to machine learning algorithms, and a trained machine learning model can then be applied to new prediction data to provide corresponding prediction results. However, many of the tasks involved in the machine learning process (e.g., feature preprocessing and selection, model algorithm selection, hyper-parameter tuning, etc.) often require both computer (especially machine learning) expertise and business experience specific to the prediction scenario, and therefore incur significant labor costs. To lower the threshold for using machine learning technology, many machine learning systems (e.g., machine learning platforms) have appeared. However, existing machine learning platforms are limited to training a corresponding model (or implementing corresponding model management) based on accumulated data, so the functionality such a platform can support is very limited and fixed. In addition, an existing platform is often aimed at a relatively single machine learning scenario and can hardly meet users' increasingly complex machine learning scenario requirements.
Disclosure of Invention
The invention aims to provide a method and a system for realizing data processing that can be widely applied to different scenarios.
One aspect of the present invention provides a method of implementing data processing, comprising:
in response to a user operation of generating a directed acyclic graph, generating the corresponding directed acyclic graph; and
in response to an operation of running the directed acyclic graph, executing the data processing flow corresponding to the directed acyclic graph.
Another aspect of the invention provides a computer-readable storage medium storing instructions that, when executed by at least one computing device, cause the at least one computing device to perform a method as described above.
Yet another aspect of the invention provides a system comprising at least one computing device and at least one storage device storing instructions that, when executed by the at least one computing device, cause the at least one computing device to perform the method as described above.
In yet another aspect of the present invention, there is provided a data processing system, wherein the system comprises:
an operation unit adapted to generate a corresponding directed acyclic graph in response to a user operation of generating the directed acyclic graph; and
a running unit adapted to execute, in response to an operation of running the directed acyclic graph, the data processing flow corresponding to the directed acyclic graph.
The scheme of the invention provides a method and a system widely applicable to different scenarios, and meets users' increasingly complex machine learning scenario requirements.
Drawings
The foregoing and other objects and features of the invention will become more apparent from the following detailed description taken in conjunction with the accompanying drawings in which:
FIG. 1 illustrates a flow chart of a method of implementing data processing according to an embodiment of the invention;
FIG. 2 shows a schematic diagram of a first graphical user interface according to an embodiment of the invention;
FIG. 3 illustrates an example of a search tree for generating combined features according to an exemplary embodiment of the present invention;
FIG. 4 shows a schematic diagram of a system implementing data processing according to an embodiment of the invention.
Detailed Description
Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.
FIG. 1 shows a flow chart of a method of implementing data processing according to an embodiment of the invention. As shown in FIG. 1, the method includes:
Step S110: in response to a user operation of generating a directed acyclic graph, generating the corresponding directed acyclic graph.
In this step, the user operation of generating the directed acyclic graph (Directed Acyclic Graph, DAG) may take various forms; for example, the user may input a scripting language for generating the directed acyclic graph, or may construct the directed acyclic graph in a drag-and-drop manner based on a provided graphical user interface. The directed acyclic graph generated here represents a data processing flow: the nodes in the directed acyclic graph are either data themselves or computational logic for processing data, and the connections between nodes represent the logical relationships and order of the data processing procedure.
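As an illustration only (not the patented implementation), the following Python sketch shows one way such a directed acyclic graph could be represented and executed: each node holds either data or computational logic, connections record the upstream dependencies, and running the graph means evaluating the nodes in topological order. All class and function names here are assumptions introduced for the example.

```python
from collections import deque

class Node:
    def __init__(self, name, func=None, data=None):
        self.name = name      # display name of the node in the canvas area
        self.func = func      # computational logic (operator node), or None for a data node
        self.data = data      # payload for data / sample / model nodes
        self.inputs = []      # names of upstream nodes

class DAG:
    def __init__(self):
        self.nodes = {}

    def add_node(self, node):
        self.nodes[node.name] = node

    def connect(self, src, dst):
        # a connection expresses the logical order: dst consumes the output of src
        self.nodes[dst].inputs.append(src)

    def run(self):
        # execute nodes in topological order so each operator sees its upstream results
        indegree = {name: len(node.inputs) for name, node in self.nodes.items()}
        ready = deque(name for name, deg in indegree.items() if deg == 0)
        results = {}
        while ready:
            name = ready.popleft()
            node = self.nodes[name]
            upstream = [results[i] for i in node.inputs]
            results[name] = node.func(*upstream) if node.func else node.data
            for other_name, other in self.nodes.items():
                if name in other.inputs:
                    indegree[other_name] -= other.inputs.count(name)
                    if indegree[other_name] == 0:
                        ready.append(other_name)
        return results

# usage: a two-node flow "raw data -> row filter"
dag = DAG()
dag.add_node(Node("raw", data=[1, -2, 3]))
dag.add_node(Node("filter", func=lambda rows: [r for r in rows if r > 0]))
dag.connect("raw", "filter")
print(dag.run()["filter"])  # [1, 3]
```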
Step S120: in response to an operation of running the directed acyclic graph, executing the data processing flow corresponding to the directed acyclic graph.
In this step, the operation of running the directed acyclic graph may be an operation triggered by the user in real time, or an operation triggered at a scheduled time set by the user. The data processing flow corresponding to the directed acyclic graph is a data processing flow related to machine learning.
In the method of FIG. 1, by generating the DAG and executing the data processing flow corresponding to the DAG, the construction and editing of a big data processing process become easier, and the threshold for using machine learning technology is lowered.
In one embodiment of the present invention, step S110 in the method of FIG. 1, generating the corresponding directed acyclic graph in response to the user operation of generating the directed acyclic graph, includes: displaying a first graphical user interface comprising a node presentation area and a canvas area, wherein the node types in the node presentation area include data, samples, models and operators; in response to an operation of selecting a node in the node presentation area, displaying the corresponding node in the canvas area; and, in response to an operation of connecting nodes, generating a connection between the corresponding nodes in the canvas area to generate the directed acyclic graph.
In one embodiment of the invention, the node presentation area includes an element list and an operator list, the element list including data, samples and models, and the operator list including various data processing operators related to machine learning; the node presentation area also includes a file list including directed acyclic graphs. In other words, the nodes of the node presentation area also include directed acyclic graphs.
FIG. 2 shows a schematic diagram of a first graphical user interface according to an embodiment of the invention. Referring to FIG. 2, the left side of the first graphical user interface is the node presentation area, including a file list, an element list and an operator list. The nodes in a list can be further presented by selecting that list; for example, the operator list is selected in FIG. 2, so the various operators in the operator list are presented. The middle portion of the first graphical user interface is the canvas area; a constructed DAG is illustrated in FIG. 2.
In one embodiment of the invention, the method depicted in FIG. 1 further comprises at least one of the following:
1) Responsive to an operation of selecting a directed acyclic graph in the node presentation area, displaying the selected directed acyclic graph in the canvas area for direct execution or for modification and editing; for example, selecting the file list in the interface shown in FIG. 2 further reveals the files in the file list, including the DAGs saved as files, and when one of the DAGs is further selected it is displayed in the canvas area;
2) Responsive to an operation of saving the directed acyclic graph in the canvas area, saving the directed acyclic graph and adding the saved directed acyclic graph to the node presentation area; for example, in the interface shown in FIG. 2, the saved DAG is added to the file list;
3) Responsive to an operation of exporting the directed acyclic graph, outputting the corresponding directed acyclic graph to a specified export location; for example, in the interface shown in FIG. 2, the user may select a DAG in the file list, right-click to bring up a download control, and select the download control to download the DAG to the specified location.
In one embodiment of the invention, the method depicted in FIG. 1 further comprises at least one of the following (where the element is data, a sample, or a model):
1) In response to an operation of importing an element from outside, storing the corresponding element and adding it to the node presentation area; for example, referring to the interface shown in FIG. 2, the imported element is added to the element list.
2) Saving elements generated in the process of executing the data processing flow corresponding to the directed acyclic graph, and adding the saved elements to the node presentation area; for example, referring to the interface shown in FIG. 2, the intermediate data, samples and models generated by running the DAG are added to the element list.
3) Providing a management page for managing the elements generated in the process of executing the data processing flow corresponding to the directed acyclic graph, so that the user can view and delete these intermediate elements through the management page; that is, the intermediate data, samples and models generated by running the DAG are managed separately for easy viewing and deletion by the user.
4) In response to an operation of exporting an element, outputting the corresponding element to a designated export location; for example, in the interface shown in FIG. 2, the user may select an element in the element list, right-click to bring up a download control, and select the download control to download the element to the designated location.
In one embodiment of the invention, the method depicted in FIG. 1 further comprises at least one of the following:
1) In response to an operation of importing an operator from outside, storing the code corresponding to the operator and adding the operator to the node presentation area; for example, referring to FIG. 2, the imported operator is added to the operator list.
2) Providing an operator code editing interface, acquiring and storing the code input through the interface, and adding the corresponding operator to the node presentation area; for example, referring to FIG. 2, the corresponding operator is added to the operator list.
In one embodiment of the present invention, the method of fig. 1 further comprises:
in response to an operation of selecting a node in the canvas area, displaying a configuration interface of the node, and completing the related configuration of the node according to the configuration operations performed on the configuration interface; for example, referring to FIG. 2, when the "data splitting" operator in the canvas area is selected, the rightmost side of the first graphical user interface displays a configuration interface of the "data splitting" operator, on which the "data splitting" operator can be configured;
when a node has not been given its required configuration, or the configured parameters do not meet the preset requirements, displaying a prompt identifier at the corresponding node in the canvas area; for example, referring to FIG. 2, an exclamation mark may be displayed on nodes in the canvas area that lack required configuration or are configured incorrectly, to alert the user.
In one embodiment of the invention, the method depicted in FIG. 1 further comprises at least one of the following:
1) Displaying, in the first graphical user interface, a graphical control for running the directed acyclic graph, and, in response to an operation of triggering the graphical control, executing the data processing flow corresponding to the directed acyclic graph according to the nodes in the directed acyclic graph and the connection relationships among them; for example, a "run" control may be displayed in an appropriate location, or, as shown in FIG. 2, a "play" button may be displayed in the upper left corner of the canvas area for starting the run of the DAG.
2) Displaying a timer on the first graphical user interface, the timer timing in real time the execution of the data processing flow corresponding to the directed acyclic graph; for example, a timer may be displayed in the canvas area so that the user can view the running time of the DAG.
In one embodiment of the invention, the method depicted in FIG. 1 further comprises at least one of the following:
1) During execution of the data processing flow corresponding to the directed acyclic graph, displaying, on each node of the directed acyclic graph in the first graphical user interface, information representing the execution progress of that node; for example, the progress of the node being executed may be displayed as a progress bar on the node, or the current percentage of progress may be displayed in real time.
2) During execution of the data processing flow corresponding to the directed acyclic graph, displaying a running identifier on each node of the directed acyclic graph in the first graphical user interface, and displaying an executed identifier on a node when the data processing flow corresponding to that node has been executed; for example, while the DAG is running, a funnel icon is displayed on each node as the running identifier, and when the data processing flow corresponding to a node has been executed, the funnel is replaced by a green check mark as the executed identifier.
3) In response to an operation of viewing the running result of a node in the directed acyclic graph, acquiring and displaying the running result data corresponding to the node; for example, the operation of viewing the running result of a node may be that the user right-clicks a node and selects the state-viewing control that appears; alternatively, in response to the user clicking a node, an icon corresponding to the type of the running result may be displayed near the node, and, in response to the user clicking the icon, the running result data corresponding to the node is displayed.
In one embodiment of the invention, the method described in FIG. 1 includes one or more of the following:
1) The data, samples and models in the canvas area all support one or more of the following operations: copying, deleting and previewing;
2) Operators in the canvas area support one or more of the following operations: copying, deleting, previewing, running the current task, running from the current task onward, running up to the current task, viewing logs, and viewing task details. For example, after selecting an operator with the mouse, right-clicking shows the operations supported by that operator. Copying copies and pastes an operator, with or without configured parameters, into any canvas for reuse; deleting removes an operator already dragged into the canvas; renaming renames an operator to facilitate marking and identification; running the current task means that, if the data accessed from upstream meets the conditions, the current task alone can be run; running from the current task onward means running from the currently selected operator to the end; running up to the current task means running from the starting unit up to the current operator; previewing previews the running result of the operator; viewing logs is used to locate more detailed error information when an operator run fails, and can be done after the operator run finishes; viewing task details jumps to the Yarn cluster task link / PWS Console page.
3) For a directed acyclic graph in the canvas area whose run has completed, in response to clicking one of its operators, displaying product type marks corresponding to the types of products output by that operator; in response to clicking a product type mark, displaying a product information interface, which includes: a control for previewing the product, a control for exporting the product, a control for importing the product into the element list, basic information of the product, and the path where the product is stored. The product types output by an operator include: data, samples, models and reports.
In one embodiment of the present invention, the node presentation area in the method illustrated in FIG. 1 includes the following operators:
1) Data splitting operator: data splitting refers to a computing unit that splits a data set into two pieces of data. Typically, one piece is used as data for training the model and the other is used for model evaluation, i.e., validating the model.
The data splitting operator is configured through its configuration interface. Data splitting has one input connection point and two output connection points; the output of the left output point is named dataset 1 and the output of the right output point is named dataset 2. Each of the two resulting data tables can be connected to multiple downstream operations, and a data table can be split again multiple times to obtain more data tables. The splitting methods provided in the configuration interface of the data splitting operator include one or more of proportional splitting, splitting by rule, and sorting-then-splitting.
When proportional splitting is selected, proportional sequential splitting, proportional random splitting and proportional stratified splitting may be further selected. When random splitting is selected, an input area for setting a random seed parameter is further provided on the configuration interface; the random seed can be any integer between 0 and 999999999, and the system generates random numbers according to the given seed. The purpose of setting the random seed is to make the random splitting result reproducible if necessary. When stratified splitting is selected, an input area for setting the field used as the stratification basis is further provided on the configuration interface; stratified splitting refers to splitting after dividing the data set into ten strata according to the designated field. Whether to enable stratified splitting can be customized, and selecting stratified splitting requires designating a stratification field. The reason for using stratified splitting is that some data sets are unevenly distributed with respect to a certain field, and the bad influence of this uneven distribution on the modeling effect should be avoided. Suppose a voting prediction has to be made for a presidential election and the data set is highly correlated with the demographics of the country; in this case it may be desirable to split by population stratum so that the split matches the practical problem. Example: if the data table is stratified by a gender field that takes the two values 0 and 1, and the split ratio is set to 0.5, then 50% of the samples with value 0 are sampled, 50% of the samples with value 1 are sampled, the two sampled parts together form dataset 1, and the rest forms dataset 2.
When splitting by rule is selected, an input area for entering the splitting rule is provided; the splitting rule expression may be written using SQL syntax. Data that complies with the rule is output to dataset 1, and data that does not comply with the rule or for which the rule returns NULL is output to dataset 2.
When sorting-then-splitting is selected, a split ratio selection item, an input area for setting the sorting field, and a sorting direction selection item are further provided on the configuration interface. Sorting-then-splitting refers to sorting the data table by a certain field and then splitting it by ratio; the sorting direction may be ascending or descending. Generally, when modeling a scenario in which future data must be predicted from historical data, samples should be split after being ordered by time to avoid "time crossing". (A code sketch of proportional, seeded and stratified splitting is given after this operator list.)
2) Data cleaning operator: the data in each column can be processed using data cleaning, which is performed by defining a configuration function for a column in the data table. The input is a data table; after processing by the configuration function the output is still a data table, with the original cleaned column replaced by the column output after processing.
3) Data table statistics operator: the data table statistics operator can be used to compute statistics on the field information in a data table; the statistical items include the number of NULL values, the mean, the variance, the median, the mode, and so on. Viewing the data distribution and the statistics of a data table facilitates subsequent feature processing and modeling work.
4) Feature extraction operator: the configuration interface of the feature extraction operator provides an interface for adding an input source and a script editing entry, and also provides at least one of a sample random ordering option, a feature exact statistics option, an option for whether to compress the output samples, a plaintext output option, a label type option, and an output result storage type option. The feature extraction script can be input through the script editing entry; alternatively, the data source can be displayed automatically, a quick configuration script box is displayed in response to triggering a configuration generation button, and the user can quickly generate the script configuration simply by entering the target value; further, in response to an operation of an anomaly verification button, the script content in the editing box is checked and error prompts are given. Not all algorithms need the samples to be shuffled after feature extraction, and shuffling itself consumes resources, so a "sample random order" switch controls whether random shuffling is performed after each feature extraction; it is turned on by default and can be turned off if shuffling is not needed. During feature extraction, the system automatically counts the feature dimensions and then displays a profile of the features in the produced sample data. However, since counting feature dimensions requires de-duplicating the entire set of features, there is a large performance cost in large-scale data scenarios; the platform therefore approximates the feature dimensions by default and ensures that the relative error of the statistics is less than 0.01. If exact, strict statistics are required, the feature exact statistics switch can be turned on. Samples are among the files that occupy the most space on the platform, so it is suggested to compress the output text file to improve space utilization. During feature extraction, the processed features are encoded and the original text information corresponding to the features is lost; to facilitate later processing of the model (such as model debugging), a plaintext output switch controls whether related plaintext information is output along with the samples. When it is turned on, the plaintext information is stored under a system path, which brings certain storage overhead. Feature extraction supports target value labeling for three scenarios: binary classification, multi-class classification and regression, corresponding to the three label methods binary_label, multiclass_label and regression_label respectively; script editing is required after the advanced selection.
5) Feature importance analysis operator: the input of the feature importance analysis operator is a sample table with target values, and the output is a feature importance evaluation report that contains the importance coefficient of each feature and one or more of the number of features, the number of samples, and basic statistics of each feature. Feature importance analysis refers to analyzing the importance relationship (independent of any particular model) between each feature in the sample table and the target value; this relationship can be represented by an importance coefficient, so that abnormal behavior of features can be analyzed and the model features adjusted and optimized.
6) Automatic feature combination operator: the configuration interface of the automatic feature combination operator provides at least one of a feature selection item, a scoring index selection item, a learning rate setting item and a termination condition selection item, where the feature selection item is used to determine the features to be combined, and the termination condition selection item includes the maximum number of feature pools to run and the maximum number of output features. Feature combination is a method for enhancing the descriptive power of features and improving the personalized prediction effect. The automatic feature combination operator can perform various feature combination analyses on a sample table and is mainly used to generate features produced by the combination method and to evaluate their importance.
7) Automatic parameter tuning operator: the automatic parameter tuning operator is used to search for suitable parameters within a given parameter range according to a parameter tuning algorithm, train a model with the found parameters, and perform model evaluation. The configuration interface of the automatic parameter tuning operator provides at least one of a feature selection setting item, a parameter tuning method option and a parameter tuning count setting item; all features or a user-defined feature selection can be chosen in the feature selection setting item, and random search or grid search can be chosen in the parameter tuning method option.
8) TensorFlow operator: the TensorFlow operator is used to run TensorFlow code written by the user, and its configuration interface provides an input source setting item and a TensorFlow code file path setting item. The TensorFlow operator gives the user full flexibility, which in turn requires the user to be quite familiar with TensorFlow in order to write the TensorFlow code. If distributed TensorFlow is desired, the user needs to write the distributed TensorFlow code as well.
9) Custom script operators: an interface is provided for the user to customize a script operator using a specific scripting language; a custom script operator provides an input source setting and a script editing entry in its configuration. Several custom script operators may be provided, each corresponding to a different scripting language, such as SQL or other custom scripting languages commonly used in the industry.
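As referenced in the data splitting operator above, the following is a hedged Python sketch of proportional random splitting with a reproducible random seed and an optional stratification field; the function name, parameter names and ratio semantics are illustrative assumptions rather than the platform's actual interface.

```python
import random

def split(rows, ratio=0.5, seed=None, stratify_field=None):
    """Split a list of row dicts into (dataset1, dataset2); `ratio` is the share going to dataset1."""
    rng = random.Random(seed)                  # a fixed seed makes the split reproducible
    if stratify_field is None:
        shuffled = rows[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * ratio)
        return shuffled[:cut], shuffled[cut:]
    # stratified splitting: sample `ratio` of the rows inside every value of the field,
    # so dataset 1 keeps roughly the same distribution over that field as the full table
    groups = {}
    for row in rows:
        groups.setdefault(row[stratify_field], []).append(row)
    set1, set2 = [], []
    for group in groups.values():
        rng.shuffle(group)
        cut = int(len(group) * ratio)
        set1.extend(group[:cut])
        set2.extend(group[cut:])
    return set1, set2

rows = [{"gender": g % 2, "x": g} for g in range(10)]
train, test = split(rows, ratio=0.5, seed=42, stratify_field="gender")
print(len(train), len(test))  # 4 6: each 5-row gender stratum contributes int(5 * 0.5) = 2 rows to dataset 1
```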
In an embodiment of the invention, the aforementioned feature importance analysis operator determines the importance of a feature in at least one of the following three ways:
The first way of determining feature importance is: training at least one feature pool model based on a sample set, where a feature pool model refers to a machine learning model that provides a prediction result for a machine learning problem based on at least a part of the features contained in a sample; obtaining the effect of the at least one feature pool model; and determining the importance of the features according to the obtained effect of the at least one feature pool model, wherein the feature pool model is trained by performing a discretization operation on at least one continuous feature among the at least a part of the features.
Specifically, this comprises the following steps: (A) obtaining history data records, where a history data record includes a label for the machine learning problem and at least one attribute information item used to generate each feature of a machine learning sample; (B) training at least one feature pool model using the acquired history data records, where a feature pool model refers to a machine learning model that provides a prediction result for the machine learning problem based on at least a part of the features; (C) acquiring the effect of the at least one feature pool model, and determining the importance of each feature according to the acquired effect of the at least one feature pool model; wherein in step (B) a feature pool model is trained by performing a discretization operation on at least one continuous feature among the at least a part of the features.
Here, it is assumed that a history data record has attribute information {p_1, p_2, …, p_m} and a corresponding label (where m is a positive integer). Based on this, machine learning samples corresponding to a machine learning problem may be generated and applied to model training and/or testing for that problem. The feature portion of a machine learning sample may be expressed as {f_1, f_2, …, f_n} (where n is a positive integer), and exemplary embodiments of the present invention aim at determining the importance of each feature in {f_1, f_2, …, f_n}. At least a part of the features is selected from {f_1, f_2, …, f_n} as the features of a training sample of the feature pool model, and the label of the corresponding history data record is used as the label of the training sample. Some or all of the continuous features among the selected features are subjected to the discretization operation. One or more feature pool models can be trained. On the one hand, the importance of a target feature can be derived from the difference in prediction effect of the same feature pool model (which may be based on all or a part of the features of the machine learning sample) on an original test data set and on a transformed test data set, where the transformed test data set is obtained by transforming the values of the target feature in the original test data set; the difference in prediction effect thus reflects the predictive contribution, i.e. the importance, of the target feature. Alternatively, the importance of features can be derived from the difference in prediction effect of different feature pool models on the same test data set (i.e., the original test data set), where the different feature pool models may be designed based on different feature combinations, so that the differences in prediction effect reflect the respective predictive contributions, i.e., the importance, of different features; in particular, a single-feature model may be trained for each feature of the machine learning sample, and accordingly the prediction effect of the single-feature model represents the importance of the feature on which it is based. The two ways of measuring feature importance described above may be used alone or in combination.
Wherein in step (C) the importance of a respective feature on which the feature pool model is based is determined from the difference between the effects of the feature pool model on the original test dataset and on the transformed test dataset, where the transformed test dataset refers to a dataset obtained by replacing the values of the target feature whose importance is to be determined in the original test dataset with one of the following: zero values, random values, or the original values of the target feature in shuffled order. The at least one feature pool model may include one overall feature model, where the overall feature model refers to a machine learning model that provides a prediction result for the machine learning problem based on all of the features.
Wherein the at least one feature pool model includes a plurality of machine learning models that provide predictions about machine learning problems based on different feature sets; wherein in step (C) the importance of the individual features is determined from the differences between the effects of the at least one feature pool model on the original test dataset.
Wherein the at least one feature pool model includes one or more main feature pool models and at least one sub feature pool model corresponding to each main feature pool model, respectively, wherein the sub feature pool model refers to a machine learning model that provides a prediction result regarding a machine learning problem based on the remaining features other than a target feature whose importance is to be determined among the features on which the corresponding main feature pool model is based; wherein in step (C) the importance of the respective target feature is determined from the difference between the effects of the main feature pool model and the respective sub-feature pool model corresponding thereto on the original test dataset.
Wherein the at least one feature pool model includes a plurality of single feature models, where a single feature model refers to a machine learning model that provides a prediction result regarding the machine learning problem based on a target feature whose importance is to be determined; wherein in step (C) the importance of each target feature is determined from the differences between the effects of the single feature models on the original test dataset.
For example, assume that the features on which a feature pool model is based are three features {f_1, f_3, f_5} from the feature portion {f_1, f_2, …, f_n} of the machine learning sample, and that the continuous feature f_1 is discretized in the training samples of the feature pool model; accordingly, the AUC of this feature pool model on the test data set reflects the predictive power of the feature combination {f_1, f_3, f_5}. In addition, assume that another feature pool model is based on the two features {f_1, f_3}, likewise with the continuous feature f_1 discretized; accordingly, the AUC of that feature pool model on the test data set reflects the predictive power of the feature combination {f_1, f_3}. On this basis, the difference between the two AUCs can be used to reflect the importance of feature f_5.
For another example, assume that the features on which a feature pool model is based are the three features {f_1, f_3, f_5} from the feature portion {f_1, f_2, …, f_n} of the machine learning sample, with the continuous feature f_1 discretized in the training samples; accordingly, the AUC of the feature pool model on the original test data set reflects the predictive power of the feature combination {f_1, f_3, f_5}. Here, in order to determine the importance of the target feature f_5, a transformed test data set is obtained by processing the value of feature f_5 in each test sample of the original test data set, and the AUC of the feature pool model on the transformed test data set is then obtained. On this basis, the difference between the two AUCs can be used to reflect the importance of the target feature f_5. As an example, in the transformation, the value of feature f_5 in each original test sample may be replaced with a zero value or a random number, or obtained by shuffling the order of the original values of feature f_5.
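A minimal Python sketch of the importance measure just described, under assumed interfaces: the trained feature pool model is treated as a callable that scores a sample row, the AUC is computed on the original test set and on a transformed test set in which the target feature is zeroed, randomized or shuffled, and the AUC drop is read as the feature's importance. The helper names and the rank-based AUC computation are illustrative, not the patent's actual code.

```python
import random

def auc(labels, scores):
    """Rank-based AUC (Mann-Whitney U), with average ranks for tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1              # average 1-based rank of a tied block
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    pos = sum(labels)
    neg = len(labels) - pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - pos * (pos + 1) / 2) / (pos * neg)

def feature_importance(model, X, y, target_col, mode="shuffle", seed=0):
    """AUC drop of `model` (a callable scoring one row) when `target_col` is disturbed."""
    base_auc = auc(y, [model(row) for row in X])
    rng = random.Random(seed)
    column = [row[target_col] for row in X]
    if mode == "zero":
        new_col = [0.0] * len(column)
    elif mode == "random":
        new_col = [rng.random() for _ in column]
    else:                                       # "shuffle": permute the original values
        new_col = column[:]
        rng.shuffle(new_col)
    X_t = [row[:target_col] + [v] + row[target_col + 1:] for row, v in zip(X, new_col)]
    transformed_auc = auc(y, [model(row) for row in X_t])
    return base_auc - transformed_auc           # larger drop -> more important feature

# toy usage: a "model" that only looks at feature 0
model = lambda row: row[0]
X = [[0.9, 5.0], [0.1, 7.0], [0.8, 1.0], [0.2, 3.0]]
y = [1, 0, 1, 0]
print(feature_importance(model, X, y, target_col=0, mode="zero"))  # 0.5: zeroing the used feature hurts
print(feature_importance(model, X, y, target_col=1, mode="zero"))  # 0.0: the model ignores feature 1
```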
Wherein the discretization operation includes a basic binning operation and at least one additional operation. The at least one additional operation includes at least one of the following types of operations: a logarithmic operation, an exponential operation, an absolute value operation, and a Gaussian transformation operation. The at least one additional operation may comprise an additional binning operation that uses the same binning mode as the basic binning operation but different binning parameters; alternatively, the at least one additional operation may include an additional binning operation whose binning mode differs from that of the basic binning operation. The basic binning operation and the additional binning operations may correspond to equal-width binning operations of different widths or equal-depth binning operations of different depths, respectively, where the different widths or different depths numerically form a geometric or arithmetic sequence. The step of performing the basic binning operation and/or an additional binning operation may comprise: additionally providing an outlier bin, such that continuous features having outliers are placed into the outlier bin.
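For illustration, a simple Python sketch of an equal-width basic binning operation with a dedicated outlier bin, as described above; additional binning operations could reuse the same helper with bin counts forming, e.g., a geometric sequence. The function and parameter names are assumptions.

```python
import math

def equal_width_bin(values, num_bins, low, high):
    """Map continuous values to bin ids 0..num_bins-1; id num_bins is a dedicated outlier bin."""
    width = (high - low) / num_bins
    bins = []
    for v in values:
        if v is None or math.isnan(v) or v < low or v > high:
            bins.append(num_bins)            # missing / abnormal values go to the outlier bin
        else:
            bins.append(min(int((v - low) / width), num_bins - 1))
    return bins

values = [0.1, 0.5, 2.3, 9.9, 150.0, float("nan")]
print(equal_width_bin(values, num_bins=10, low=0.0, high=10.0))   # [0, 0, 2, 9, 10, 10]

# additional binning operations: bin widths forming a geometric sequence (5.0, 2.5, 1.25)
multi_scale = [equal_width_bin(values, num_bins=2 ** k, low=0.0, high=10.0) for k in range(1, 4)]
```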
Wherein in step (B), the feature pool model is trained based on a logistic regression algorithm.
Wherein the effect of a feature pool model includes the AUC of the feature pool model. The original test dataset is made up of the acquired history data records. In step (B), the acquired history data records are divided into groups of history data records to train each feature pool model progressively, and step (B) further comprises: performing prediction on the next group of history data records using the feature pool model trained on the current group of history data records to obtain the grouped AUC corresponding to the next group, and combining all the grouped AUCs to obtain the AUC of the feature pool model; and, after the grouped AUC corresponding to the next group of history data records has been obtained, continuing to train the feature pool model, already trained on the current group, with the next group of history data records.
Wherein in step (B), when the feature pool model trained on the current group of history data records is used to perform prediction on the next group, and the next group includes missing history data records that lack the attribute information needed to generate at least a part of the features on which the feature pool model is based, the grouped AUC corresponding to the next group is obtained by one of the following: calculating the grouped AUC using only the prediction results of the history data records in the next group other than the missing ones; calculating the grouped AUC using the prediction results of all history data records in the next group, with the prediction results of the missing records set to a default value, the default value being determined based on the value range of the prediction results or on the label distribution of the acquired history data records; or multiplying the AUC calculated from the prediction results of the history data records other than the missing ones by the proportion of those records in the next group to obtain the grouped AUC.
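A rough Python sketch of the progressive evaluation just described, assuming a model object with an incremental `partial_fit` and a `predict` method (placeholder interfaces, not the patent's API): each group is predicted by the model trained on the preceding groups, yielding a grouped AUC, before the model is further trained on that group; the grouped AUCs are combined here by a simple average as one possible synthesis.

```python
def progressive_auc(groups, model, auc_fn):
    """groups: list of (features, labels) pairs in chronological order;
    model: placeholder object with partial_fit(X, y) and predict(X)."""
    group_aucs = []
    for i in range(1, len(groups)):
        X_prev, y_prev = groups[i - 1]
        model.partial_fit(X_prev, y_prev)      # (continue to) train on the groups seen so far
        X_next, y_next = groups[i]
        scores = model.predict(X_next)         # predict the next group before training on it
        group_aucs.append(auc_fn(y_next, scores))
    return sum(group_aucs) / len(group_aucs)   # combine the grouped AUCs (simple average as one option)
```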
Wherein in step (B), when training the feature pool model based on a logistic regression algorithm, the regularization term set for continuous features is different from the regularization term set for discontinuous features.
Wherein step (B) further comprises: providing the user with an interface for configuring at least one of the following items of the feature pool model: the at least a part of the features on which the feature pool model is based, the algorithm type of the feature pool model, the algorithm parameters of the feature pool model, the operation type of the discretization operation, and the operation parameters of the discretization operation; in step (B), the feature pool model is then trained according to the items configured by the user through the interface. In step (B), the interface may be provided to the user in response to the user's instruction to determine feature importance.
Wherein the method further comprises: (D) graphically presenting the determined importance of each feature to the user. In step (D), the features may be presented in order of importance, and/or a part of the features may be highlighted, where that part includes important features corresponding to high importance, unimportant features corresponding to low importance, and/or abnormal features corresponding to abnormal importance.
The second way of determining feature importance: determining a basic feature subset of the sample and determining a plurality of target feature subsets whose importance is to be determined; for each of the plurality of target feature subsets, acquiring a corresponding composite machine learning model, where the composite machine learning model comprises a basic sub-model and an additional sub-model trained according to a boosting framework, the basic sub-model being trained based on the basic feature subset and the additional sub-model being trained based on the respective target feature subset; and determining the importance of the plurality of target feature subsets based on the effects of the composite machine learning models.
Specifically, this comprises the following steps: (A1) determining a basic feature subset of the machine learning sample, where the basic feature subset includes at least one basic feature; (B1) determining a plurality of target feature subsets of the machine learning sample whose importance is to be determined, where each target feature subset includes at least one target feature; (C1) for each of the plurality of target feature subsets, acquiring a corresponding composite machine learning model, where the composite machine learning model comprises a basic sub-model and an additional sub-model trained according to a boosting framework, the basic sub-model being trained based on the basic feature subset and the additional sub-model being trained based on the respective target feature subset; and (D1) determining the importance of the plurality of target feature subsets according to the effects of the composite machine learning models.
Wherein in step (D1), the importance of the plurality of target feature subsets is determined from differences between effects of the composite machine learning model on the same dataset.
Wherein the effect of the composite machine learning model comprises an AUC of the composite machine learning model.
Wherein the target feature is generated based on the base feature.
Wherein the target feature is a combined feature obtained by combining at least one basic feature.
Wherein in step (C1), a composite machine learning model corresponding to each of the target feature subsets is obtained by training a plurality of composite machine learning models in parallel.
Wherein the target feature subset comprises one combined feature obtained by combining at least one basic feature, and the method further comprises: (E1) The importance of the determined individual combined features is presented to the user graphically.
Wherein in step (C1) a corresponding composite machine learning model is obtained by training the additional sub-model with the already trained base sub-model fixed.
Wherein the basic sub-model and the additional sub-model are of the same type.
According to an exemplary embodiment of the present invention, for each target feature subset a corresponding composite machine learning model is acquired. Here, training of the composite machine learning model may be performed by the system itself, or a composite machine learning model that has already been trained may be obtained from outside. The composite machine learning model includes a basic sub-model and an additional sub-model trained according to a boosting framework (e.g., a gradient boosting framework), where the basic sub-model and the additional sub-model may be models of the same type, for example both linear models (e.g., logistic regression models), or they may be of different types. The boosting framework of each composite machine learning model may be identical, i.e., each composite machine learning model has the same type of basic sub-model and the same type of additional sub-model, differing only in the target feature subset on which the additional sub-model is based.
Assume that a single composite machine learning model is denoted as F, where F is composed of a basic sub-model f_base and an additional sub-model f_add. Assume further that an input training data record is denoted as x; after corresponding feature processing according to the determined basic feature subset and target feature subset, the sample portion corresponding to the basic sub-model f_base is characterized by x_b, and the sample portion corresponding to the additional sub-model f_add is characterized by x_a. Accordingly, the composite machine learning model F may be constructed according to the following equation:
F(x) = f_base(x_b) + f_add(x_a)
It should be noted that the basic sub-model and the additional sub-model may be trained on the same set of training data records or on different sets. For example, both sub-models may be trained on the entire set of training data records, or on parts of it sampled from the whole. As an example, the basic sub-model and the additional sub-model may be assigned corresponding training data records according to a preset sampling strategy, e.g., more training data records may be assigned to the basic sub-model and fewer to the additional sub-model, where the training data records assigned to the different sub-models may partially overlap or not overlap at all. By determining the training data records used by each sub-model according to a sampling strategy, the effect of the overall machine learning model can be further improved.
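The following Python sketch illustrates, under toy assumptions, how a composite model F(x) = f_base(x_b) + f_add(x_a) can be obtained with the already trained basic sub-model fixed: the additional sub-model is fit against the residual signal that the basic feature subset does not explain, and composite models built with different target feature subsets can then be compared by their effect. The least-squares helpers are stand-ins for the sub-model types actually used, not the patent's training procedure.

```python
def fit_linear(xs, ys):
    """Toy sub-model: least-squares fit y ~ a*x + b on a single scalar feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    var = sum((x - mx) ** 2 for x in xs) or 1e-12
    a = sum((x - mx) * (yv - my) for x, yv in zip(xs, ys)) / var
    b = my - a * mx
    return lambda x: a * x + b

def fit_composite(base_x, target_x, y):
    # step 1: train the basic sub-model on the basic feature subset
    f_base = fit_linear(base_x, y)
    # step 2: keep f_base fixed and fit the additional sub-model on the residuals,
    # i.e. the part of the signal that the basic features do not explain
    residuals = [yi - f_base(xb) for xb, yi in zip(base_x, y)]
    f_add = fit_linear(target_x, residuals)
    # F(x) = f_base(x_b) + f_add(x_a)
    return lambda xb, xa: f_base(xb) + f_add(xa)

# comparing composite models built with different target feature subsets (single features here)
# gives a relative importance signal, as described in the text
base = [1.0, 2.0, 3.0, 4.0]
cand1 = [2.0, 4.0, 6.0, 8.0]   # redundant with the basic feature
cand2 = [0.0, 1.0, 0.0, 1.0]   # carries part of the signal the basic feature misses
y = [1.1, 2.9, 3.1, 4.9]
for name, cand in [("cand1", cand1), ("cand2", cand2)]:
    F = fit_composite(base, cand, y)
    err = sum((F(xb, xa) - yi) ** 2 for xb, xa, yi in zip(base, cand, y))
    print(name, round(err, 4))  # lower error -> the target feature adds more useful information
```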
The third way of determining feature importance: pre-ordering at least one candidate feature of the sample by importance, and screening a part of the candidate features from the at least one candidate feature according to the pre-ordering result to form a candidate feature pool; then re-ordering the candidate features in the candidate feature pool by importance, and selecting, according to the re-ordering result, at least one candidate feature of higher importance from the candidate feature pool as an important feature.
Specifically, this comprises the following steps: (A2) acquiring history data records, where a history data record includes a plurality of attribute information items; (B2) generating at least one candidate feature based on the plurality of attribute information items; (C2) pre-ordering the at least one candidate feature by importance, and screening a part of the candidate features from the at least one candidate feature according to the pre-ordering result to form a candidate feature pool; and (D2) re-ordering the candidate features in the candidate feature pool by importance, and selecting, according to the re-ordering result, at least one candidate feature of higher importance from the candidate feature pool as an important feature.
Wherein in step (C2) the pre-ordering is based on a first number of history data records; in step (D2), the reordering is performed based on a second number of history data records, and the second number is not less than the first number. The second number of history data records includes the first number of history data records.
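The two-stage screening of steps (C2) and (D2) can be sketched compactly. In the sketch below, `score_feature(feature, records)` is a hypothetical placeholder for any of the importance measures described in this section; the pre-ordering uses the smaller first number of records and the re-ordering uses the larger second number, which includes the first.

```python
# Illustrative two-stage importance screening (pre-ordering then re-ordering).
def two_stage_screening(candidates, records, first_n, second_n,
                        pool_size, top_k, score_feature):
    small = records[:first_n]    # first number of history data records
    large = records[:second_n]   # second number, a superset of the first
    # Pre-ordering on the smaller record set -> candidate feature pool.
    pre = sorted(candidates, key=lambda f: score_feature(f, small), reverse=True)
    pool = pre[:pool_size]
    # Re-ordering of the pool on the larger record set -> important features.
    reorder = sorted(pool, key=lambda f: score_feature(f, large), reverse=True)
    return reorder[:top_k]
```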
In step (C2), candidate features with higher importance are screened from the at least one candidate feature according to the pre-ordering result to form a candidate feature pool.
Wherein, in step (C2), the pre-ordering may be performed as follows: for each candidate feature, a pre-ordered single-feature machine learning model is obtained, and the importance of each candidate feature is determined based on the effect of each pre-ordered single-feature machine learning model, wherein each pre-ordered single-feature machine learning model corresponds to one candidate feature. As an example, assume that there are N (N is an integer greater than 1) candidate features f_n, where n ∈ [1, N]. Accordingly, the pre-ordering apparatus 300 may utilize at least a portion of the historical data records to construct N pre-ordered single-feature machine learning models, where each such model predicts for the machine learning problem based on its respective single candidate feature f_n. The effect of the N pre-ordered single-feature machine learning models is then measured on the same test data set, e.g., by AUC (Area Under the ROC (Receiver Operating Characteristic) Curve) or MAE (Mean Absolute Error), and the order of importance of the individual candidate features is determined based on the ranking of these effects.
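As a concrete illustration of this single-feature pre-ordering, the sketch below trains one logistic-regression model per candidate feature column and ranks candidates by AUC on a shared test set. Scikit-learn is assumed, and column indices stand in for candidate features; this is a sketch, not the apparatus's actual implementation.

```python
# Illustrative single-feature importance ranking by AUC on one shared test set.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def rank_by_single_feature_auc(X_train, y_train, X_test, y_test):
    scores = {}
    for n in range(X_train.shape[1]):
        # One model per candidate feature f_n.
        model = LogisticRegression(max_iter=1000)
        model.fit(X_train[:, [n]], y_train)
        pred = model.predict_proba(X_test[:, [n]])[:, 1]
        scores[n] = roc_auc_score(y_test, pred)
    # Higher AUC -> higher importance; return feature indices in descending order.
    return sorted(scores, key=scores.get, reverse=True)
```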
Wherein, in step (C2), the pre-ordering may instead be performed as follows: for each candidate feature, a pre-ordered overall machine learning model is obtained, and the importance of each candidate feature is determined based on the effect of each pre-ordered overall machine learning model, wherein each pre-ordered overall machine learning model corresponds to the pre-ordered basic feature subset together with one candidate feature. As an example, the pre-ordered overall machine learning model may be a logistic regression (LR) model; accordingly, the sample of the pre-ordered overall machine learning model is composed of the pre-ordered basic feature subset and the respective candidate feature. Assume that there are N candidate features f_n. Accordingly, at least a portion of the historical data records may be utilized to construct N pre-ordered overall machine learning models, where the sample features of each such model include the fixed pre-ordered basic feature subset and the corresponding candidate feature f_n. The effect of these N pre-ordered overall machine learning models (e.g., AUC, MAE, etc.) is then measured on the same test data set, and the order of importance of the individual candidate features is determined based on the ranking of these effects.
Wherein, in step (C2), the pre-ordering may also be performed as follows: for each candidate feature, a pre-ordered composite machine learning model is obtained, and the importance of each candidate feature is determined based on the effect of each pre-ordered composite machine learning model, wherein the pre-ordered composite machine learning model includes a pre-ordered basic sub-model and a pre-ordered additional sub-model based on the lifting framework, the pre-ordered basic sub-model corresponding to the pre-ordered basic feature subset and the pre-ordered additional sub-model corresponding to one candidate feature. As an example, assume that there are N candidate features f_n. Accordingly, at least a portion of the historical data records may be utilized to construct N pre-ordered composite machine learning models, where each such model predicts for the machine learning problem, in the manner of the lifting framework, based on the fixed pre-ordered basic feature subset and the corresponding candidate feature f_n. The effect of the N pre-ordered composite machine learning models (e.g., AUC, MAE, etc.) is then measured on the same test data set, and the order of importance of each candidate feature is determined based on the ranking of these effects. Preferably, in order to further improve operation efficiency and reduce resource consumption, the pre-ordering apparatus 300 may build each pre-ordered composite machine learning model by, with the pre-ordered basic sub-model fixed, training only the pre-ordered additional sub-model for each candidate feature f_n.
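One simple way to realize "fix the basic sub-model and train only the additional sub-model per candidate feature" is to treat the fixed base score as given and fit the additional model on its residual, as in a single gradient-boosting step with a squared-error objective. The sketch below is an illustrative assumption of that idea, not the patent's prescribed training procedure; the function names are hypothetical.

```python
# Illustrative fixed-base boosting step: fit a one-feature linear additional
# sub-model on the residual of the (already trained, frozen) basic sub-model.
import numpy as np

def fit_additional_on_residual(base_score_train, x_cand_train, y_train):
    residual = y_train - base_score_train
    X = np.column_stack([x_cand_train, np.ones_like(x_cand_train)])
    w, *_ = np.linalg.lstsq(X, residual, rcond=None)  # least-squares slope and intercept
    return w

def composite_predict(base_score, x_cand, w):
    # F(x) = f_base(x_b) + f_add(x_a), with f_add linear in the candidate feature.
    return base_score + (w[0] * x_cand + w[1])
```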
Wherein the pre-ordered basic feature subset includes unit features individually represented by at least one attribute information itself among the plurality of attribute information, and the candidate features include combined features combined from the unit features.
Wherein, in step (D2), the reordering is performed by: for each candidate feature in the candidate feature pool, a re-ordered single feature machine learning model is obtained, and the importance of each candidate feature is determined based on the effect of each re-ordered single feature machine learning model, wherein the re-ordered single feature machine learning model corresponds to each candidate feature.
Wherein, in step (D2), the reordering is performed by: for each candidate feature in the candidate feature pool, a re-ordered overall machine learning model is obtained, and the importance of each candidate feature is determined based on the effect of each re-ordered overall machine learning model, wherein the re-ordered overall machine learning model corresponds to the re-ordered basic feature subset and each candidate feature.
Wherein, in step (D2), the reordering is performed by: for each candidate feature in the candidate feature pool, a re-ordering composite machine learning model is obtained, the importance of each candidate feature is determined based on the effect of each re-ordering composite machine learning model, wherein the re-ordering composite machine learning model comprises a re-ordering basic sub-model and a re-ordering additional sub-model based on the lifting frame, wherein the re-ordering basic sub-model corresponds to the re-ordering basic feature subset, and the re-ordering additional sub-model corresponds to each candidate feature.
Wherein the re-ordered basic feature subset includes unit features individually represented by at least one attribute information itself among the plurality of attribute information, and the candidate features include combined features combined from the unit features.
The method further comprises the following steps: (E2) It is checked whether the important features are suitable as features of a machine learning sample.
Wherein, in step (E2), whether the important feature is suitable as a feature of a machine learning sample is checked using the change in effect, after the important feature is introduced, of a machine learning model based on the unit features individually represented by at least one attribute information itself among the plurality of attribute information.
Wherein, in the case that the check result is that the important feature is not suitable as a feature of the machine learning sample, another part of the candidate features is screened from the at least one candidate feature according to the pre-ordering result to form a new candidate feature pool, and step (D2) and step (E2) are re-executed.
In an exemplary embodiment of the present invention, generating at least one candidate feature based on the plurality of attribute information in step (B2) may specifically proceed as follows: for at least a portion of the attribute information of the history data record, a corresponding continuous feature may be generated, where a continuous feature is the opposite of a discrete feature (e.g., a category feature) and can take values with a certain continuity, such as distance, age, or amount. In contrast, as an example, the values of discrete features have no continuity; they may be unordered categorical features such as "from Beijing", "from Shanghai", "from Tianjin", "sex male", or "sex female". Some discrete-value attribute information in the history data record may be used directly as the corresponding discrete feature, or some attribute information (e.g., continuous-value and/or discrete-value attribute information) in the history data record may be processed to obtain the corresponding discrete feature.
For example, certain continuous-value attribute information in the history data record may be used directly as the corresponding continuous feature, e.g., attribute information such as distance, age, or amount. Alternatively, certain attribute information (e.g., continuous-value and/or discrete-value attribute information) in the history data record may be processed to obtain corresponding continuous features, e.g., the ratio of height to weight may be used as a continuous feature. Continuous-value attribute information may be discretized and/or discrete-value attribute information may be made continuous as required, and further operations or combinations may be performed on the original or processed attribute information. Even further, arbitrary combinations or operations between features may be performed, e.g., a Cartesian combination of discrete features.
According to one embodiment of the invention, the aforementioned automatic feature combination operator performs feature combination by at least one of the following means:
the first way of performing automatic feature combining: performing at least one binning operation for each continuous feature in the sample to obtain a binning group feature composed of at least one binning feature, wherein each binning operation corresponds to one binning feature; and generating combined features of the machine learning sample by feature combination between the binning features and/or other discrete features in the sample;
Specifically, the method comprises the following steps: (A3) acquiring a data record, wherein the data record comprises a plurality of attribute information; (B3) performing at least one binning operation for each continuous feature generated based on the plurality of attribute information to obtain a binning group feature composed of at least one binning feature, wherein each binning operation corresponds to one binning feature; and (C3) generating combined features of the machine learning sample by feature combination between the binning features and/or other discrete features generated based on the plurality of attribute information.
Here, at least one binning operation may be performed so that a plurality of discrete features characterizing certain properties of the original data record from different angles or at different scales/levels can be obtained simultaneously. Here, a binning operation refers to a specific way of discretizing a continuous feature, i.e., dividing the value range of the continuous feature into a plurality of sections (i.e., a plurality of bins), and determining the corresponding binning feature value based on the divided bins. Binning operations can be broadly divided into supervised binning and unsupervised binning, each of which includes specific binning modes; for example, supervised binning includes minimum-entropy binning, minimum-description-length binning, etc., and unsupervised binning includes equal-width binning, equal-depth binning, k-means-clustering-based binning, etc. In each binning mode, corresponding binning parameters, such as width or depth, may be set. It should be noted that, according to the exemplary embodiment of the present invention, the binning operation performed by the binning feature generating apparatus 200 is limited neither in the kind of binning mode nor in the binning parameters, and the specific representation of the correspondingly generated binning features is not limited either. Taking unsupervised equal-width binning as an example, assume that the value range of a continuous feature is [0,100] and the corresponding binning parameter (i.e., the width) is 50; then 2 bins are obtained, and a continuous feature value of 61.5 falls into the 2nd bin, so if the two bins are numbered 0 and 1, the bin corresponding to that value is numbered 1. Alternatively, with a bin width of 10, 10 bins are obtained, and a value of 61.5 falls into the 7th bin; if the ten bins are numbered 0 through 9, the corresponding bin is numbered 6. Alternatively, with a bin width of 2, 50 bins are obtained, and a value of 61.5 falls into the 31st bin; if the fifty bins are numbered 0 through 49, the corresponding bin is numbered 30.
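The equal-width example above (value range [0,100], widths 50/10/2, value 61.5 falling in bins numbered 1, 6 and 30) can be reproduced with a few lines. This is a plain illustration of unsupervised equal-width binning under those assumptions, not the apparatus's internal implementation.

```python
# Illustrative equal-width binning: 0-based index of the bin containing `value`.
def equal_width_bin(value, low, high, width):
    n_bins = int((high - low) // width)
    idx = int((value - low) // width)
    return min(max(idx, 0), n_bins - 1)  # clamp edge values into the valid range

for width in (50, 10, 2):
    print(width, equal_width_bin(61.5, 0, 100, width))
# Prints: 50 1, 10 6, 2 30 -- matching the bin numbers in the text above.
```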
Before step (B3), further comprising: (D3) The at least one binning operation is selected from a predetermined number of binning operations such that the importance of the binning feature corresponding to the selected binning operation is not lower than the importance of the binning feature corresponding to the non-selected binning operation.
In step (D3), a single feature machine learning model is constructed for each of the binning features corresponding to the predetermined number of binning operations, the importance of each of the binning features is determined based on the effect of each of the single feature machine learning models, and the at least one binning operation is selected based on the importance of each of the binning features, wherein a single feature machine learning model corresponds to the each of the binning features.
Alternatively, in step (D3), a composite machine learning model is constructed for each of the binning features corresponding to the predetermined number of binning operations, the importance of each binning feature is determined based on the effect of each composite machine learning model, and the at least one binning operation is selected based on the importance of each binning feature, wherein the composite machine learning model includes a basic sub-model and an additional sub-model based on the lifting frame, wherein the basic sub-model corresponds to the basic feature subset and the additional sub-model corresponds to each of the binning features. Wherein the combined features of the machine learning sample are generated in an iterative manner according to a search strategy with respect to the combined features. Wherein step (D3) is performed for each round of iterations to update the at least one binning operation, and the combined features generated in each round of iterations are added as new discrete features to the basic feature subset. Wherein each composite machine learning model is constructed by separately training additional sub-models with the base sub-model fixed.
In step (C3), feature combination is performed between the binning group features and/or the other discrete features according to a Cartesian product.
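Combining discrete features "according to a Cartesian product" amounts to crossing their values: the combined feature of a sample is the pair of its discrete values, drawn from the Cartesian product of the two value sets. A minimal sketch under that reading, assuming string-valued features:

```python
# Illustrative per-sample feature cross of two discrete (or binned) features.
def cross_features(values_a, values_b):
    return [f"{a}&{b}" for a, b in zip(values_a, values_b)]

# Example: a binned age feature crossed with a city feature (toy values).
print(cross_features(["age_bin1", "age_bin0", "age_bin2"],
                     ["Beijing", "Shanghai", "Beijing"]))
```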
Wherein the at least one binning operation corresponds to equal-width binning operations of different widths or equal-depth binning operations of different depths, respectively. The different widths or different depths numerically form a geometric progression or an arithmetic progression.
The binning feature indicates which bin the continuous feature is binned into according to the corresponding binning operation.
Each of the continuous features is formed by continuous value attribute information itself among the plurality of attribute information, or by continuously transforming discrete value attribute information among the plurality of attribute information. The continuous transformation indicates that the discrete value attribute information is counted.
Wherein each composite machine learning model is constructed by separately training additional sub-models with the base sub-model fixed.
In embodiments of the present invention, a single discrete feature may be considered a first-order feature, and according to exemplary embodiments of the present invention, higher-order feature combinations of second order, third order, etc., may be performed until a predetermined cutoff condition is met. As an example, the combined features of the machine learning sample may be generated in an iterative manner according to a search strategy with respect to the combined features.
Fig. 3 illustrates an example of a search tree for generating combined features according to an exemplary embodiment of the present invention. According to an exemplary embodiment of the present invention, the search tree may be based on a heuristic search strategy, such as beam search, wherein one layer of the search tree may correspond to feature combinations of a particular order.
Referring to Fig. 3, assume that the discrete features that can be combined include feature A, feature B, feature C, feature D, and feature E; these may be discrete features formed from the discrete-value attribute information of the data record itself, and, as an example, feature D and feature E may be binning features converted from continuous features.
According to the search strategy, in the first round of iteration, two nodes, namely feature B and feature E, are selected as first-order features. Here, the nodes may be ranked using, for example, feature importance as the index, and a portion of the nodes may be selected to continue expanding at the next layer.
In the next iteration, the second-order combined features, namely feature BA, feature BC, feature BD, feature BE, feature EA, feature EB, feature EC, and feature ED, are generated based on feature B and feature E, and feature BC and feature EA among them are further selected based on the ranking index. As an example, feature BE and feature EB can be considered the same combined feature.
The iteration continues in the manner described above until certain cutoff conditions, e.g., an order limit, are met. Here, the nodes selected in each layer (shown in solid lines) may be used as combined features for subsequent processing, e.g., as finally adopted features or for further importance evaluation, while the remaining features (shown in dashed lines) are pruned.
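A compact sketch of the beam search behind Fig. 3: at each layer, every kept node is expanded with each unit feature, the expansions are ranked by an importance function, and only the top beam-width nodes survive while the rest are pruned. Here `importance` is a hypothetical placeholder for any of the measures in this document, and BE/EB are treated as one combined feature by sorting each combination.

```python
# Illustrative beam search over feature combinations (combinations are sorted tuples).
def beam_search_combinations(unit_features, importance, beam_width=2, max_order=3):
    # First layer: rank unit features and keep the top beam_width (e.g. B and E in Fig. 3).
    beam = sorted(((f,) for f in unit_features), key=importance, reverse=True)[:beam_width]
    selected = list(beam)
    for _ in range(max_order - 1):
        # Expand every kept node with every unit feature (B -> BA, BC, BD, BE, ...).
        candidates = {tuple(sorted(set(node) | {f}))
                      for node in beam for f in unit_features if f not in node}
        beam = sorted(candidates, key=importance, reverse=True)[:beam_width]
        selected.extend(beam)
    return selected

# Usage with a toy importance function (hypothetical):
# beam_search_combinations(["A", "B", "C", "D", "E"],
#                          importance=lambda combo: sum(map(ord, "".join(combo))) % 10)
```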
A second way of performing automatic feature combining: feature combining is performed stage by stage between at least one feature of the sample in accordance with a heuristic search strategy to generate candidate combined features, wherein for each stage, a target combined feature is selected from the set of candidate combined features as a combined feature of the machine-learned sample.
Specifically, a method for generating combined features of a machine learning sample in the invention comprises the following steps: (A4) Acquiring a historical data record, wherein the historical data record comprises a plurality of attribute information; and (B4) performing feature combination step by step between at least one feature generated based on the plurality of attribute information in accordance with a heuristic search strategy to generate candidate combined features; wherein for each stage, a target combined feature is selected from the set of candidate combined features as a combined feature of the machine learning sample. The heuristic search strategy here refers to fig. 3 and will not be repeated here.
Wherein the at least one feature is at least one discrete feature, wherein the discrete feature is generated by processing at least one continuous value attribute information and/or discrete value attribute information among the plurality of attribute information; alternatively, the at least one feature is at least one continuous feature, wherein the continuous feature is generated by processing at least one continuous value attribute information and/or discrete value attribute information among the plurality of attribute information.
Wherein candidate combined features of the next stage are generated by combining the target combined feature selected in the current stage with the at least one feature under the heuristic search strategy.
Under the heuristic search strategy, candidate combination features of the next stage are generated by combining target combination features selected in the current stage and the previous stage in pairs.
Wherein the set of candidate combined features comprises candidate combined features generated in the current stage.
Wherein the set of candidate combined features comprises the candidate combined features generated in the current stage and all candidate combined features generated in the previous stage that were not selected as target combined features.
Wherein the set of candidate combined features comprises the candidate combined features generated in the current stage and a portion of the candidate combined features generated in the previous stage that were not selected as target combined features. The part of the candidate combination features are candidate combination features of higher importance among the candidate combination features generated in the previous stage that are not selected as the target combination feature.
The target combination feature is a candidate combination feature with higher importance in the candidate combination feature set.
A third way of performing automatic feature combining: obtaining unit features capable of being combined in a sample; providing a graphical interface for setting feature combination configuration items to a user, wherein the feature combination configuration items are used for limiting how feature combinations are performed among unit features; receiving input operation which is executed on a graphical interface by a user for setting feature combination configuration items, and acquiring the feature combination configuration items set by the user according to the input operation; and combining the features to be combined among the unit features based on the acquired feature combination configuration items to generate combined features of the machine learning sample.
Specifically, a method for generating combined features of a machine learning sample in the invention comprises the following steps: (A5) acquiring unit features capable of being combined; (B5) Providing a graphical interface for setting feature combination configuration items to a user, wherein the feature combination configuration items are used for limiting how feature combinations are performed among unit features; (C5) Receiving input operation which is executed on a graphical interface by a user for setting feature combination configuration items, and acquiring the feature combination configuration items set by the user according to the input operation; and (D5) combining the features to be combined among the unit features based on the acquired feature combination configuration items to generate combined features of the machine learning sample. Here, the unit feature is the minimum unit in which feature combination is possible.
Wherein the feature combination configuration item includes at least one of: a feature configuration item for specifying the features to be combined among the unit features, such that the specified features to be combined are combined in step (D5); an evaluation index configuration item for specifying the evaluation index of the combined features, such that in step (D5) the effect of the machine learning models corresponding to the various combined features is measured according to the specified evaluation index to determine the combination manner of the features to be combined; and a training parameter configuration item for specifying training parameters of the machine learning model, such that in step (D5) the combination manner of the features to be combined is determined by measuring the effect of the machine learning models corresponding to the various combined features obtained under the specified training parameters. The feature combination configuration item may further include a binning operation configuration item for specifying one or more binning operations to be respectively performed on at least one continuous feature among the features to be combined, such that in step (D5) the specified one or more binning operations are respectively performed on the at least one continuous feature to obtain one or more corresponding binning features, and the obtained binning features are combined as a whole with the other features to be combined. The binning operation configuration item may be used to specify one or more binning operations for each continuous feature separately; alternatively, the binning operation configuration item may be used to specify one or more binning operations for all continuous features.
As an example, a machine learning model corresponding to a particular combined feature may indicate that the sample of that machine learning model includes the particular combined feature. According to an exemplary embodiment of the present invention, when combining unit features, whether to adopt a combined feature may be determined by measuring the effect of the machine learning model corresponding to that combined feature. Here, the set evaluation index may be used to measure the effect of the machine learning models corresponding to the various combined features; the higher the evaluation index of a machine learning model, the more readily the combined feature corresponding to that model is determined as a combined feature of the machine learning sample. As an example, the training parameter configuration item may include configuration items for one or more different training parameters, e.g., a learning rate configuration item and/or a tuning number configuration item. For each continuous feature, each binning operation performed on it produces a binning feature; accordingly, a feature composed of all of its binning features may participate in the automatic combination between the features to be combined in place of the original continuous feature. As an example, the binning operation configuration item may further include a binning mode configuration item and/or a binning parameter configuration item. The binning mode configuration item is used to specify the binning mode used by the binning operation, and the binning parameter configuration item is used to specify the binning parameters of that mode. For example, an equal-width or equal-depth binning mode may be designated by the binning mode configuration item, and the number of bins, the bin width, the bin depth, or the like may be designated by the binning parameter configuration item. Here, the user may manually input or select the values of the binning parameter configuration item, and in particular may be prompted to set the respective widths/depths of the equal-width/equal-depth bins in a geometric or arithmetic progression.
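To make the configuration items above concrete, the following is a hypothetical configuration object mirroring them; every key name is illustrative only and does not reflect any actual product schema.

```python
# Hypothetical feature-combination configuration (keys are illustrative).
feature_combination_config = {
    "features_to_combine": ["city", "age", "amount"],   # feature configuration item
    "evaluation_metric": "AUC",                          # evaluation index configuration item
    "training_params": {"learning_rate": 0.1,            # training parameter configuration item
                        "tuning_rounds": 20},
    "binning": {                                         # binning operation configuration item
        "apply_to": ["age", "amount"],                   # continuous features to bin
        "mode": "equal_width",                            # binning mode configuration item
        "widths": [5, 10, 20],                            # binning parameters (a geometric progression)
    },
}
```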
After step (D5), the method further comprises: (E5) displaying the generated combined feature to a user. In step (E5), an evaluation value of each of the combined features with respect to the evaluation index is also displayed to the user.
After step (D5), the method further comprises: (F5) The generated combined features are directly applied to a subsequent machine learning step.
After step (E5), the method further comprises: (G5) The combined feature selected by the user from the displayed combined features is applied to a subsequent machine learning step.
After step (D5), the method further comprises: (H5) saving the combination manner of the combined features generated in step (D5) in the form of a configuration file.
After step (G5), the method further comprises: (I5) saving the combination manner of the combined features selected by the user in step (G5) in the form of a configuration file.
In step (A5), the unit feature is obtained by performing feature processing on the attribute information of the data record.
According to an exemplary embodiment of the present invention, a machine learning process may be performed in the form of a directed acyclic graph (DAG graph), which may encompass all or part of the steps for performing machine learning model training, testing, or prediction. For example, a DAG graph including a history data import step, a data splitting step, a feature extraction step, an automatic feature combination step may be established for feature automatic combination. That is, the various steps described above may be performed as nodes in a DAG graph.
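A DAG of this kind can be represented simply as nodes plus predecessor edges and executed in topological order. The sketch below is illustrative only: the node names follow the example steps above, the step functions are placeholders, and Python's standard-library `graphlib` (3.9+) is assumed for the ordering.

```python
# Illustrative DAG of machine learning steps executed in topological order.
from graphlib import TopologicalSorter

dag = {
    "data_split": {"data_import"},
    "feature_extraction": {"data_split"},
    "auto_feature_combination": {"feature_extraction"},
}

def run(dag, steps):
    # Run each node's step once all of its predecessors have completed.
    for node in TopologicalSorter(dag).static_order():
        steps[node]()

# Usage with placeholder step functions:
# run(dag, {name: (lambda n=name: print("running", n))
#           for name in ["data_import", "data_split",
#                        "feature_extraction", "auto_feature_combination"]})
```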
Fourth way of automatic feature combination: iteratively performing feature combinations between at least one discrete feature of the sample in accordance with a search strategy to generate candidate combined features, and selecting a target combined feature from the generated candidate combined features as a combined feature; and for each round of iteration, pre-ordering the importance of each candidate combination feature in the candidate combination feature set, screening a part of candidate combination features from the candidate combination feature set according to a pre-ordering result to form a candidate combination feature pool, re-ordering the importance of each candidate combination feature in the candidate combination feature pool, and selecting at least one candidate combination feature with higher importance from the candidate combination feature pool according to the re-ordering result as a target combination feature.
Specifically, a method of generating combined features of a machine learning sample includes: (A6) acquiring a historical data record, wherein the historical data record comprises a plurality of attribute information; and (B6) iteratively performing feature combination between at least one discrete feature generated based on the plurality of attribute information in accordance with a search strategy to generate candidate combined features, and selecting a target combined feature from the generated candidate combined features as a combined feature of the machine learning sample. For each round of iteration, the method pre-orders the importance of each candidate combined feature in the candidate combined feature set; screens a part of the candidate combined features from the candidate combined feature set according to the pre-ordering result to form a candidate combined feature pool; re-orders the importance of each candidate combined feature in the candidate combined feature pool; and selects, according to the re-ordering result, at least one candidate combined feature with higher importance from the candidate combined feature pool as a target combined feature. The search strategy here refers to Fig. 3 and will not be repeated here.
Wherein the pre-ordering is based on a first number of history data records, the re-ordering is based on a second number of history data records, and the second number is not less than the first number.
Wherein the candidate combined features with higher importance are screened from the candidate combined feature set according to the pre-ordering result to form the candidate combined feature pool.
Wherein the candidate combined feature set comprises candidate combined features generated in a current round of iterations; alternatively, the set of candidate combined features includes candidate combined features generated in a current round of iterations and candidate combined features generated in a previous round of iterations that were not selected as target combined features.
Wherein candidate combined features for a next round of iterations are generated by combining the target combined feature selected in the current round of iterations with the at least one discrete feature; alternatively, candidate combined features for the next round of iterations are generated by pairwise combining between target combined features selected in the current round of iterations and the previous round of iterations.
Wherein the at least one discrete feature comprises a discrete feature converted from a continuous feature generated based on the plurality of attribute information by: for each continuous feature, performing at least one binning operation to generate a discrete feature comprised of at least one binning feature, wherein each binning operation corresponds to one binning feature. The at least one binning operation is selected from a predetermined number of binning operations for each iteration or for all iterations, wherein the importance of the binning feature corresponding to a selected binning operation is not lower than the importance of the binning feature corresponding to an unselected binning operation.
Specifically, the at least one binning operation may be selected as follows: for each of the binning features corresponding to the predetermined number of binning operations, a binned single-feature machine learning model is obtained, the importance of each binning feature is determined based on the effect of each binned single-feature machine learning model, and the at least one binning operation is selected based on the importance of each binning feature, wherein each binned single-feature machine learning model corresponds to one binning feature. Alternatively, the at least one binning operation is selected by: for each of the binning features corresponding to the predetermined number of binning operations, obtaining a binned overall machine learning model, determining the importance of each binning feature based on the effect of each binned overall machine learning model, and selecting the at least one binning operation based on the importance of each binning feature, wherein the binned overall machine learning model corresponds to the binning basic feature subset and the respective binning feature. Still alternatively, the at least one binning operation is selected by: for each of the binning features corresponding to the predetermined number of binning operations, obtaining a binned composite machine learning model, determining the importance of each binning feature based on the effect of each binned composite machine learning model, and selecting the at least one binning operation based on the importance of each binning feature, wherein the binned composite machine learning model includes a binning basic sub-model and a binning additional sub-model based on the lifting framework, the binning basic sub-model corresponding to the binning basic feature subset and the binning additional sub-model corresponding to the respective binning feature. Wherein the binning basic feature subset includes the target combined features selected prior to the current round of iteration.
Wherein the pre-ordering is performed by: and aiming at each candidate combined feature in the candidate combined feature set, obtaining a pre-ordered single feature machine learning model, and determining the importance of each candidate combined feature based on the effect of each pre-ordered single feature machine learning model, wherein the pre-ordered single feature machine learning model corresponds to each candidate combined feature.
Wherein the pre-ordering is performed by: and aiming at each candidate combined feature in the candidate combined feature set, obtaining a pre-ordered integral machine learning model, and determining the importance of each candidate combined feature based on the effect of each pre-ordered integral machine learning model, wherein the pre-ordered integral machine learning model corresponds to the pre-ordered basic feature subset and each candidate combined feature.
Wherein the pre-ordering is performed by: for each candidate combined feature in the candidate combined feature set, a pre-ordered composite machine learning model is obtained, the importance of each candidate combined feature is determined based on the effect of each pre-ordered composite machine learning model, wherein the pre-ordered composite machine learning model comprises a pre-ordered basic sub-model based on the lifting frame and a pre-ordered additional sub-model, wherein the pre-ordered basic sub-model corresponds to a subset of the pre-ordered basic features, and the pre-ordered additional sub-model corresponds to each candidate combined feature.
Wherein the pre-ordered basic feature subset comprises target combined features selected prior to the current round of iteration.
Wherein the reordering is performed by: for each candidate combined feature in the candidate combined feature pool, a re-ordered single feature machine learning model is obtained, and the importance of each candidate combined feature is determined based on the effect of each re-ordered single feature machine learning model, wherein the re-ordered single feature machine learning model corresponds to each candidate combined feature.
Wherein the reordering is performed by: for each candidate combined feature in the candidate combined feature pool, a re-ordered overall machine learning model is obtained, and the importance of each candidate combined feature is determined based on the effect of each re-ordered overall machine learning model, wherein the re-ordered overall machine learning model corresponds to the re-ordered basic feature subset and each candidate combined feature.
Wherein the reordering is performed by: for each candidate combined feature in the candidate combined feature pool, a re-ordering composite machine learning model is obtained, and the importance of each candidate combined feature is determined based on the effect of each re-ordering composite machine learning model, wherein the re-ordering composite machine learning model comprises a re-ordering basic sub-model and a re-ordering additional sub-model based on the lifting frame, wherein the re-ordering basic sub-model corresponds to the re-ordering basic feature subset, and the re-ordering additional sub-model corresponds to the each candidate combined feature.
Wherein the re-ordered basic feature subset comprises target combined features selected prior to the current round of iteration.
Wherein step (B6) further comprises: for each round of iteration, checking whether the selected target combined feature is suitable as a combined feature of the machine learning sample. In step (B6), whether the selected target combined feature is suitable may be checked based on the change in effect of a machine learning model, built on the target combined features that have already passed the check, after the selected target combined feature is introduced. If the check result is that the selected target combined feature is suitable as a combined feature of the machine learning sample, the selected target combined feature is taken as a combined feature of the machine learning sample and the next round of iteration is performed; if the check result is that it is not suitable, another part of the candidate combined features is screened from the candidate combined feature set according to the pre-ordering result to form a new candidate combined feature pool.
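Putting one iteration round together: generate candidates by the search strategy, pre-order them on a smaller record set, keep a pool, re-order the pool on a larger record set, select target combined features, and check them before moving on. The sketch below is schematic; `score_feature` and `passes_check` are hypothetical placeholders, and as a simplification the new pool is re-formed from the remaining pooled candidates rather than the full candidate set.

```python
# Illustrative single iteration round of the fourth automatic-combination way.
def one_iteration(candidate_set, records, first_n, second_n,
                  pool_size, top_k, score_feature, passes_check):
    small, large = records[:first_n], records[:second_n]
    # Pre-ordering on the first number of records -> candidate combined feature pool.
    pool = sorted(candidate_set, key=lambda c: score_feature(c, small),
                  reverse=True)[:pool_size]
    while pool:
        # Re-ordering on the second number of records -> target combined features.
        targets = sorted(pool, key=lambda c: score_feature(c, large),
                         reverse=True)[:top_k]
        if passes_check(targets):
            return targets  # adopted as combined features; the next round follows
        # Otherwise screen another part of the candidates into a new pool.
        pool = [c for c in pool if c not in targets]
    return []
```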
Here, the importance of the binning feature may be automatically determined in any suitable manner.
For example, a binned single-feature machine learning model may be obtained for each of the binning features corresponding to the predetermined number of binning operations, the importance of each binning feature determined based on the effect of each binned single-feature machine learning model, and the at least one binning operation selected based on the importance of each binning feature.
As an example, assume that for a continuous feature F there is a predetermined number M (M is an integer greater than 1) of binning operations, corresponding to M binning features f_m, where m ∈ [1, M]. Accordingly, at least a portion of the historical data records may be utilized to construct M binned single-feature machine learning models, where each such model predicts for the machine learning problem based on its respective single binning feature f_m. The effect of the M binned single-feature machine learning models is then measured on the same test data set, e.g., by AUC (Area Under the ROC (Receiver Operating Characteristic) Curve) or MAE (Mean Absolute Error), and the at least one binning operation to be finally performed is determined based on the ranking of these effects.
For another example, a binned overall machine learning model may be obtained for each of the binning features corresponding to the predetermined number of binning operations, wherein the binned overall machine learning model corresponds to the binning basic feature subset and the respective binning feature; the importance of each binning feature is determined based on the effect of each binned overall machine learning model, and the at least one binning operation is selected based on the importance of each binning feature. As an example, the binned overall machine learning model may be a logistic regression (LR) model; accordingly, a sample of the binned overall machine learning model is composed of the binning basic feature subset and the respective binning feature.
As an example, assume that for a continuous feature F there is a predetermined number M of binning operations, corresponding to M binning features f_m. Accordingly, at least a portion of the historical data records may be utilized to construct M binned overall machine learning models, where the sample features of each such model include the fixed binning basic feature subset and the corresponding binning feature f_m. The effects (e.g., AUC, MAE, etc.) of the M binned overall machine learning models on the same test data set are then measured, and the at least one binning operation to be finally performed is determined based on the ranking of these effects.
For another example, a binning composite machine learning model may be derived for each of the binning features corresponding to the predetermined number of binning operations, the importance of each of the binning features determined based on the effect of each of the binning composite machine learning models, and the at least one binning operation selected based on the importance of each of the binning features, wherein the binning composite machine learning model includes a binning base sub-model and a binning additional sub-model based on a lifting frame (e.g., a gradient lifting frame), wherein the binning base sub-model corresponds to a subset of the binning base features and the binning additional sub-model corresponds to each of the binning features.
As an example, assume that for a continuous feature F there is a predetermined number M of binning operations, corresponding to M binning features f_m. Accordingly, at least a portion of the historical data records may be utilized to construct M binned composite machine learning models, where each such model predicts for the machine learning problem, in the manner of the lifting framework, based on the fixed binning basic feature subset and the corresponding binning feature f_m. The effect (e.g., AUC, MAE, etc.) of the M binned composite machine learning models on the same test data set is then measured, and the at least one binning operation to be finally performed is determined based on the ranking of these effects. Preferably, in order to further improve operation efficiency and reduce resource consumption, each binned composite machine learning model may be built by, with the binning basic sub-model fixed, training only the binning additional sub-model for each binning feature f_m.
According to an exemplary embodiment of the present invention, the binning basic feature subset may be fixedly applied to the binning basic sub-model in all relevant binned overall machine learning models or binned composite machine learning models. Here, for the first round of iteration, the binning basic feature subset may be empty; alternatively, any feature generated based on the attribute information of the history data record may be taken as a binning basic feature, e.g., a part or all of the attribute information of the history data record may be taken directly as binning basic features. Further, as an example, the actual machine learning problem may be considered, and relatively important or essential features may be determined as binning basic features based on estimates or according to specifications from business personnel.
Here, the importance of each candidate combined feature in the candidate combined feature pool may be measured by any means of determining the importance of the feature.
For example, a re-ordered single feature machine learning model may be derived for each candidate combined feature in the pool of candidate combined features, the importance of each candidate combined feature being determined based on the effect of each re-ordered single feature machine learning model corresponding to the each candidate combined feature.
As an example, assume that the candidate combined feature pool includes 10 candidate combined features. Accordingly, at least a portion of the historical data records may be utilized to construct 10 re-ordered single feature machine learning models (wherein each re-ordered single feature machine learning model predicts machine learning problems based on a respective single candidate combined feature), then measure the effectiveness (e.g., AUC, MAE, etc.) of the 10 re-ordered single feature machine learning models on the same test data set, and determine an order of importance of individual candidate combined features in the pool of candidate combined features based on the ordering of the effectiveness.
For another example, a re-ordered overall machine learning model may be obtained for each candidate combined feature in the pool of candidate combined features, the importance of each candidate combined feature being determined based on the effect of each re-ordered overall machine learning model, wherein the re-ordered overall machine learning model corresponds to the re-ordered basic feature subset and the respective candidate combined feature. As an example, the re-ordered overall machine learning model may be an LR model; accordingly, a sample of the re-ordered overall machine learning model is composed of the re-ordered basic feature subset and the respective candidate combined feature.
As an example, assuming that the candidate combined feature pool includes 10 candidate combined features, accordingly, 10 reordered overall machine learning models (where the sample features of each reordered overall machine learning model include a fixed subset of the reordered base features and the corresponding candidate combined features) may be constructed using at least a portion of the historical data records, then the effectiveness of the 10 reordered overall machine learning models on the same test data set (e.g., AUC, MAE, etc.) is measured, and the order of importance of the individual candidate combined features in the candidate combined feature pool is determined based on the ordering of the effectiveness.
For another example, a re-ordered composite machine learning model may be derived for each candidate combined feature in the pool of candidate combined features, the importance of each candidate combined feature determined based on the effect of each re-ordered composite machine learning model, wherein the re-ordered composite machine learning model includes a re-ordered base sub-model based on a lifting frame (e.g., a gradient lifting frame) and a re-ordered additional sub-model, wherein the re-ordered base sub-model corresponds to a subset of the re-ordered base features and the re-ordered additional sub-model corresponds to the each candidate combined feature.
As an example, assuming that the candidate combined feature pool includes 10 candidate combined features, accordingly, 10 re-ordered composite machine learning models (where each re-ordered composite machine learning model predicts machine learning problems in terms of a lifting framework based on a fixed subset of re-ordered base features and corresponding candidate combined features) may be constructed using at least a portion of the historical data records, then the effectiveness (e.g., AUC, MAE, etc.) of the 10 re-ordered composite machine learning models on the same test data set is measured, and the order of importance of the individual candidate combined features in the candidate combined feature pool is determined based on the ranking of the effectiveness. Preferably, in order to further improve the operation efficiency and reduce the resource consumption, each re-ordered composite machine learning model is constructed by training a re-ordered additional sub-model for each candidate combined feature, respectively, in the case of a fixed re-ordered basic sub-model.
Fifth way of performing automatic feature combination: screening a plurality of key unit features from the features of the sample; obtaining at least one combined feature from the plurality of key unit features by using an automatic feature combination algorithm, wherein each combined feature is formed by combining a corresponding part of the plurality of key unit features; and taking the obtained at least one combined feature as an automatically generated combined feature.
Specifically, a method for automatically generating combined features includes: configuring a feature extraction step, wherein the feature extraction step is used for performing feature extraction processing on the attribute fields of each data record in an input data set according to a plurality of unit features; configuring an automatic feature combination step, wherein the automatic feature combination step is used for obtaining at least one combined feature by utilizing an automatic feature combination algorithm based on the feature extraction processing result; and running the configured feature extraction step and automatic feature combination step, wherein the obtained at least one combined feature is taken as an automatically generated combined feature.
In the embodiment of the present invention, the attribute field may be directly used as the unit feature, in addition to the above-described manner of obtaining the unit feature based on the feature processing of the attribute field of the data record. In the method for automatically generating the combined features, the combined features are obtained by running the pre-configured feature extraction step and the automatic feature combination step, so that the features can be automatically combined even if technicians do not deeply understand business scenes or the technicians do not have abundant industry practice experience in the process of obtaining the combined features, the use threshold of feature engineering is reduced, and the usability of the feature engineering is improved.
Wherein the automatic feature combination step is configured to include: screening a plurality of key unit features from the feature extraction processing result; and obtaining at least one combined feature from the plurality of key unit features by using an automatic feature combination algorithm, wherein each combined feature is formed by combining corresponding part of key unit features in the plurality of key unit features. And screening out a plurality of key unit features from the feature extraction processing result according to the feature importance, the feature relevance and/or the feature filling rate.
As an example, feature importance may be determined based on the effects of the machine learning model. For example, a machine learning model (for example, a sample of the machine learning model includes a fixed feature portion and an additional feature portion, wherein the additional feature portion is each unit feature) corresponding to each of a plurality of unit features obtained after the feature extraction process may be established, a plurality of unit features having a higher feature importance may be determined based on an effect of the machine learning model (for example, all the unit features are sorted in descending order of feature importance, the unit features being before a predetermined number), and the plurality of unit features having the higher feature importance may be regarded as a plurality of key unit features. In addition, feature relevance and feature fill rate may also be determined based on various statistical methods or data characteristics of the features themselves.
The automatic feature combination algorithm is used to generate various candidate combined features in a traversal manner, measure the importance of each candidate combined feature based on the effect of a machine learning model, and determine at least one candidate combined feature of high importance as a combined feature. For example, each candidate combined feature may be taken as an input feature of the machine learning model, the importance of each candidate combined feature determined based on the effect of that machine learning model, the candidate combined features sorted in descending order of importance, and the candidate combined features ranked before a predetermined number determined as combined features.
Wherein the automatic feature combination step is configured to include: and executing a plurality of processing flows corresponding to the automatic feature combination algorithm in parallel to obtain the at least one combined feature based on the feature extraction processing result.
Wherein the automatic feature combination step is configured to include: based on the feature extraction processing results corresponding to each subset of the data sets, a plurality of processing flows corresponding to the automatic feature combination algorithm are executed in parallel to obtain a combined feature corresponding to each subset. When a plurality of processing flows corresponding to the automatic feature combination algorithm are executed in parallel, there may be a repetition of the combined features obtained thereof, and the automatic feature combination step is configured to further include: and carrying out de-duplication treatment on the combined features corresponding to all the subsets, and taking the combined features obtained after the de-duplication treatment as the at least one combined feature. The feature extraction step corresponds to feature extraction nodes in a directed acyclic graph representing a machine learning process, and the automatic feature combination step corresponds to automatic feature combination nodes in the directed acyclic graph. The automatic feature combining step is configured with a configuration item of the automatic feature combining node.
Wherein the configuration item of the automatic feature combination node includes an option switch regarding whether to turn on a key feature screening function, wherein in case the option switch is turned on by a user, the automatic feature combination step is configured to include: screening a plurality of key unit features from the feature extraction processing result; and obtaining at least one combined feature from the plurality of key unit features by using an automatic feature combination algorithm, wherein each combined feature is formed by combining corresponding part of key unit features in the plurality of key unit features.
Wherein the configuration items of the automatic feature combination node include parallel operation configuration items related to executing a plurality of processing flows corresponding to the automatic feature combination algorithm in parallel, wherein the parallel operation configuration items relate to at least one of: the number of processing flows executed in parallel, and the super parameters of the automatic feature combination algorithm corresponding to each processing flow when training the machine learning model. The parallel operation configuration item also relates to at least one of the following: the number of subsets of the data set, and the data record extraction rule corresponding to each subset. The parallel operation configuration item has a default configuration value and/or a manual configuration value. The default configuration values of the parallel operation configuration items related to the super parameters enable the machine learning models trained in the automatic feature combination algorithms corresponding to different processing flows to have substantial differences. The default configuration values of the parallel operation configuration items related to the super parameters enable the super parameters of the training machine learning model in the automatic feature combination algorithm corresponding to different processing flows to have differences. The super-parameters comprise learning rate, and the default configuration values of the parallel operation configuration items related to the learning rate enable the super-parameters of the training machine learning model in the automatic feature combination algorithm corresponding to different processing flows to show a stepwise increasing trend.
Wherein the configuration item of the automatic feature combination node includes an option switch regarding whether to turn on a de-duplication function, wherein, in a case where the option switch is turned on by a user, the automatic feature combination step is configured to further include: performing de-duplication processing on the combined features corresponding to all the subsets, and taking the combined features obtained after the de-duplication processing as the at least one combined feature.
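For illustration only, the parallel execution and de-duplication described above might be organized as in the following Python sketch; the pairwise combination used inside run_feature_combination_flow and the representation of a combined feature as a sorted tuple of unit-feature names are assumptions made for this sketch and are not part of the disclosed algorithm.

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import combinations

def run_feature_combination_flow(args):
    """One processing flow of the automatic feature combination algorithm.

    For illustration only: combines the unit features of a subset pairwise.
    A real flow would train a machine learning model to search for useful
    combinations.
    """
    unit_features, order = args
    return {tuple(sorted(c)) for c in combinations(unit_features, order)}

def combine_features_in_parallel(subset_features, order=2, deduplicate=True):
    # One processing flow per subset, executed in parallel.
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(run_feature_combination_flow,
                                [(feats, order) for feats in subset_features]))
    if deduplicate:
        merged = set()              # identical combinations are kept only once
        for combos in results:
            merged |= combos
        return sorted(merged)
    return [c for combos in results for c in combos]

if __name__ == "__main__":
    subsets = [["age", "city", "device"], ["age", "city", "income"]]
    print(combine_features_in_parallel(subsets))
```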
According to one embodiment of the invention, the aforementioned auto-tuning operator performs auto-tuning by any one of the following means:
The first automatic parameter adjusting mode is as follows: the following steps are performed in each iteration: determining currently available resources; scoring a plurality of super-parameter tuning strategies respectively, and distributing the currently available resources to the plurality of super-parameter tuning strategies according to the scoring results, wherein each super-parameter tuning strategy is used for selecting a super-parameter combination for the machine learning model based on a super-parameter selection strategy corresponding to the super-parameter tuning strategy; and causing each super-parameter tuning strategy that is allocated resources to generate one or more super-parameter combinations based on its allocated resources.
Specifically, the method combines a plurality of super-parameter tuning strategies, each of which is used for selecting a super-parameter combination for the machine learning model based on its corresponding super-parameter selection strategy, and each round of the iterative process of the method comprises the following steps: determining currently available resources; distributing the currently available resources to the plurality of super-parameter tuning strategies; and causing each super-parameter tuning strategy that is allocated resources to generate one or more super-parameter combinations based on its allocated resources.
The currently available resources are the resources to be allocated for the current round. The resources mentioned in the invention can be of various types, such as computational resources (e.g. the number of CPUs or CPU cores), time resources representing a working duration, and task resources representing a number of tasks. In the invention, the number of tasks is the number of super-parameter combinations to be generated, and each super-parameter combination to be generated can be regarded as one task. As one example, the currently available resources may include the computing resources that are available for the current round, where the computing resources may include both computational resources and time resources. For example, "10 CPU cores" may be used as the currently available resources, or "10 CPU cores operating for 2 hours" may be used as the currently available resources. As another example, the currently available resources may include the number of hyper-parameter combinations that need to be generated for the current round. This number can be determined according to the currently available computing resources: when more computing resources are available, a larger number of super-parameter combinations to be generated can be determined, and when fewer are available, a smaller number can be determined. Therefore, allocating the number of hyper-parameter combinations to be generated for the current round is essentially also an allocation of computing resources.
Each super-parameter tuning strategy is used for selecting super-parameter combinations for the machine learning model based on the super-parameter selection strategy corresponding to that tuning strategy. A super-parameter tuning strategy can be an existing super-parameter tuning scheme, such as random search, grid search, an evolutionary algorithm, Bayesian optimization, and the like. The plurality of super-parameter tuning strategies may include one or more non-model-directed search strategies and/or one or more model-directed strategies. A non-model-directed search strategy selects hyper-parameter combinations from a hyper-parameter search space based on a predetermined search mode (such as random search, grid search, an evolutionary algorithm, and the like), where the hyper-parameter search space refers to the space of possible values of all hyper-parameters. A model-directed strategy selects hyper-parameter combinations based on a prediction model, wherein the prediction model may be trained based on at least a portion of the hyper-parameter combinations generated during the iterative process. Alternatively, the model-directed strategy may be a hyper-parameter optimization algorithm such as Bayesian optimization, the tree-structured Parzen estimator (Tree-structured Parzen Estimator, TPE), etc.
The plurality of super-parameter tuning strategies comprise: one or more non-model-directed search strategies for selecting hyper-parametric combinations from within a hyper-parametric search space based on a predetermined search pattern; and/or one or more model-directed strategies for selecting a superparameter combination based on a predictive model, wherein the predictive model is trained based on at least a portion of the superparameter combination generated in an iterative process.
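To make the distinction concrete, the two kinds of strategies could be sketched as below; the class names, the propose interface, the (combination, metric) history format, and the crude nearest-neighbour predictor standing in for a real model-directed method (such as Bayesian optimization or TPE) are all illustrative assumptions.

```python
import random

class RandomSearch:
    """Non-model-directed: samples directly from the search space."""
    def __init__(self, space):
        self.space = space          # {name: (low, high)}

    def propose(self, history):
        return {k: random.uniform(lo, hi) for k, (lo, hi) in self.space.items()}

class NearestNeighbourGuided:
    """Model-directed (toy): predicts a candidate's metric with the metric of
    its nearest already-evaluated combination and proposes the candidate with
    the best prediction. A real implementation would use Bayesian optimization
    or TPE instead; higher metric values are assumed to be better."""
    def __init__(self, space, n_candidates=32):
        self.space, self.n_candidates = space, n_candidates

    def _predict(self, cand, history):
        # history is a list of (hyper_param_dict, metric) pairs
        def dist(a, b):
            return sum((a[k] - b[k]) ** 2 for k in a)
        return min(history, key=lambda rec: dist(cand, rec[0]))[1]

    def propose(self, history):
        sampler = RandomSearch(self.space)
        cands = [sampler.propose(history) for _ in range(self.n_candidates)]
        if not history:             # prediction model not yet available
            return cands[0]
        return max(cands, key=lambda c: self._predict(c, history))
```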
Wherein the allocating currently available resources for the plurality of super parameter tuning policies includes: the currently available resources are evenly distributed to the plurality of super-parameter tuning strategies; or distributing the currently available resources to the plurality of super-parameter tuning strategies according to a preset proportion.
Or, the allocating the currently available resources for the plurality of super parameter tuning policies includes: scoring the plurality of super-parameter tuning strategies respectively; and distributing the currently available resources for the plurality of super-parameter tuning strategies according to the scoring result. Wherein, when the plurality of super-parameter tuning strategies includes one or more model-oriented strategies, during each round of iteration, the method further comprises: and acquiring evaluation indexes respectively corresponding to one or more super-parameter combinations generated in the round of iterative process, and adding the one or more super-parameter combinations and the evaluation indexes thereof into a super-parameter combination sample set of the machine learning model. The method further comprises the steps of: and model training is carried out on the model guiding strategy distributed to the resources in the round by taking at least partial hyper-parameter combinations in the current hyper-parameter combination sample set of the machine learning model as training samples so as to obtain the prediction model.
Wherein scoring the plurality of super-parameter tuning strategies respectively comprises at least one of: scoring the super-parameter tuning strategies according to the availability of the super-parameter tuning strategies; scoring the super-parameter tuning strategies according to the confidence degrees of the super-parameter tuning strategies; and respectively scoring the super-parameter tuning strategies according to evaluation indexes of super-parameter combinations generated by each super-parameter tuning strategy in the previous iteration or iterations.
Wherein scoring the plurality of super parameter tuning strategies according to the availability of each super parameter tuning strategy comprises: the availability of the non-model-oriented search strategy is a fixed constant, when the number of the super-parameter combinations generated in the iterative process is smaller than or equal to a preset threshold, the availability of the model-oriented strategy is zero, and when the number of the super-parameter combinations generated in the iterative process is larger than the preset threshold, the availability of the model-oriented strategy is in direct proportion to the number of the super-parameter combinations generated in the iterative process.
Wherein scoring the plurality of super-parameter tuning strategies according to the confidence degrees of the super-parameter tuning strategies comprises: the confidence of the non-model-oriented search strategy is a fixed constant; dividing the hyper-parameter combination generated in the iterative process into at least one pair of training set and test set, calculating the score of each model-oriented strategy under each pair of training set and test set, and normalizing after averaging the scores to obtain the confidence coefficient of each model-oriented strategy.
Wherein, the scoring the super parameter tuning strategies according to the evaluation indexes of the super parameter combinations generated by the super parameter tuning strategies in one or more previous rounds comprises the following steps: and respectively scoring the super-parameter tuning strategies according to the average ranking of the evaluation indexes of the super-parameter combinations generated by each super-parameter tuning strategy in the previous iteration or iterations in all the generated super-parameter combinations, wherein the scoring result is in direct proportion to the average ranking.
The step of distributing the currently available resources to the super-parameter tuning strategies according to the scoring result comprises the following steps: determining a probability value of each super-parameter tuning strategy according to a scoring result, wherein the probability value is in direct proportion to the scoring result; dividing the currently available resources into a plurality of parts; and sampling the super-parameter tuning strategies for a plurality of times based on the probability values to determine the super-parameter tuning strategy to which each resource belongs.
Wherein the currently available resources include: the number of super-parameter combinations required to be generated in the current round; or, the computing resources available for the current round.
The method further comprises the following steps: when the iteration termination condition is met, selecting the super-parameter combination with the best evaluation index from at least part of the super-parameter combinations generated in the iterative process as the final super-parameter combination of the machine learning model. For example, the iterative process may be terminated when the improvement of the evaluation index of the hyper-parameter combinations generated over a consecutive predetermined number of rounds is smaller than a predetermined threshold value; or when the evaluation index of a generated hyper-parameter combination reaches a predetermined target; or when the consumed resources exceed a predetermined resource threshold.
Specifically, in this embodiment, when the multiple super parameter tuning strategies are respectively scored, the scoring is mainly performed according to the state and the historical quality condition of each super parameter tuning strategy in the current round. As an example, in scoring a super-parametric tuning policy, one may refer to any one or more of the following three dimensions.
Availability of dimension 1 and super parameter tuning strategy
The availability of a super-parameter tuning strategy is used to represent whether, and to what degree, the strategy is currently able to select super-parameter combinations for the machine learning model. Taking the division of super-parameter tuning strategies into non-model-directed search strategies and model-directed strategies as an example:
the non-model-oriented search strategy is always available in the process of selecting the hyper-parameter combinations for the machine learning model without depending on the hyper-parameter combinations generated in the iterative process. Thus, the availability of the non-model-directed search strategy may be a fixed constant, such as may be 1;
model-directed strategies select hyper-parameter combinations for machine learning models based on predictive models, and generation of the predictive models depends on the hyper-parameter combinations generated in an iterative process. The number of hyper-parameter combinations generated in the iterative process is small at the beginning, for example, when the number is smaller than the minimum number of the hyper-parameter combinations required for training the predictive model, the predictive model cannot be trained, and at the moment, the model guiding strategy is not available. When the number of the super-parameter combinations generated in the iterative process is larger than the minimum value of the super-parameter combinations required for training the prediction model, a model guiding strategy is available, and the more the number of the super-parameter combinations is, the better the effect of the trained prediction model is, and the stronger the availability of the model guiding strategy is.
Thus, the availability of model-directed strategies is related to the number of hyper-parameter combinations generated during the iteration. Specifically, when the number of the super-parameter combinations generated in the iterative process is smaller than or equal to a preset threshold, the availability of the model-oriented strategy is 0. When the number of the super-parameter combinations generated in the iterative process is larger than a preset threshold, the availability of the model-oriented strategy is larger than zero, and the availability of the model-oriented strategy is proportional to the number of the super-parameter combinations generated in the iterative process. The preset threshold may be a minimum value of a hyper-parameter combination required for training a predictive model of a model-oriented policy, for example, a model-oriented policy TPE (Tree-structured Parzen Estimator) requires at least 10 sets of evaluated hyper-parameter combinations to start constructing the model, so the corresponding preset threshold of TPE may be set to 10.
When the super-parameter tuning strategies are respectively scored based on the availability of the super-parameter tuning strategies, the higher the availability of the super-parameter tuning strategies is, the higher the score is. For example, the availability of a super-parameter tuning strategy may be taken as a score in this dimension.
As an example, when the super parameter tuning policy i is scored based on availability:
If the super-parameter tuning strategy i is a non-model-directed search strategy, the score may be noted as a fixed constant, for example s_i^1(D) = 1;
if the super-parameter tuning strategy i is a model-directed strategy, the score can be noted as:
s_i^1(D) = -∞, if |D| < M_i; s_i^1(D) = f_i^1(|D|), if |D| ≥ M_i,
wherein s_i^1(D) represents the score of the super-parameter tuning strategy i under dimension 1, D is the super-parameter combination sample set, |D| is the number of super-parameter combinations in the sample set, and f_i^1 is a monotonically increasing function of |D|. The meaning of this expression is that the model-directed strategy requires at least M_i groups of super-parameters before its prediction model can be constructed: when |D| < M_i the score is minus infinity, which is equivalent to a final probability value of 0 for this strategy; when |D| ≥ M_i, the availability of the model-directed strategy increases as the super-parameter sample set grows, the degree of increase being determined by the monotonically increasing function f_i^1(|D|). The specific form of f_i^1(|D|) can be set according to the actual situation, e.g. f_i^1(|D|) = |D|^d, where d is a number greater than 0, for example 0.5.
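A minimal sketch of the dimension-1 availability score, assuming the reconstructed expression above (a constant for non-model-directed strategies, minus infinity below the threshold M_i, and |D|^d above it):

```python
import math

def availability_score(is_model_directed, num_samples, min_required=10, d=0.5):
    """Dimension-1 score of a tuning strategy.

    num_samples  : |D|, the number of evaluated hyper-parameter combinations
    min_required : M_i (e.g. 10 for TPE)
    d            : exponent greater than 0 controlling the growth rate
    """
    if not is_model_directed:
        return 1.0                       # non-model-directed: always available
    if num_samples < min_required:
        return -math.inf                 # prediction model cannot be built yet
    return num_samples ** d              # monotonically increasing in |D|
```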
Confidence of dimension 2, super parameter tuning strategy
The confidence of a super-parameter tuning strategy is used to represent how reliably the strategy selects super-parameter combinations for the machine learning model, i.e. how good the strategy's effect is. Again taking the division into non-model-directed search strategies and model-directed strategies as an example: the confidence of a non-model-directed search strategy may be considered a fixed constant, for example 1; a model-directed strategy selects hyper-parameter combinations for the machine learning model based on a prediction model, so its confidence depends on the model effect of that prediction model. Thus, the confidence of a model-directed strategy may be determined by evaluating the model effect of its prediction model.
As an example, the hyper-parameter combinations generated in the iterative process may be divided into at least one pair of training set and test set; for example, they may be divided based on a cross-validation approach to obtain multiple pairs of training set and test set. For convenience of explanation, take the case where 10 groups of super-parameter combinations have been generated in the iterative process: groups 1 to 9 may be taken as a training set and group 10 as a test set; groups 1 to 8 and 10 may be taken as a training set and group 9 as a test set; groups 1 to 7 and 9 to 10 may be taken as a training set and group 8 as a test set; and so on, so that 10 pairs of training set and test set are obtained. The score of each model-directed strategy (i.e. of its prediction model) under each pair of training set and test set may then be calculated: the prediction model is trained on the training set and then validated on the test set to obtain the strategy's score under that pair. Finally, the scores can be averaged and then normalized to the interval [0, 1], so as to obtain the confidence of each model-directed strategy.
When the multiple super-parameter tuning strategies are respectively scored based on the confidence degrees of the super-parameter tuning strategies, the higher the confidence degrees of the super-parameter tuning strategies are, the higher the scores are. For example, the confidence of the super-parameter tuning strategy can be taken as a score in this dimension.
As an example, when the super-parameter tuning strategy i is scored based on confidence: if the super-parameter tuning strategy i is a non-model-directed search strategy, the score may be taken as a fixed constant, for example s_i^2(D) = 1; if the super-parameter tuning strategy i is a model-directed strategy, the confidence calculated in the above manner can be used as its score, i.e. s_i^2(D) equals that confidence, where s_i^2(D) represents the score of the super-parameter tuning strategy i under dimension 2.
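A possible sketch of the dimension-2 confidence computation follows; the leave-one-out splitting, the fit/score callables, and the min-max normalization are assumptions standing in for whatever cross-validation splits and scoring the implementation actually uses.

```python
def strategy_confidence(samples, fit, score):
    """Averaged score of one model-directed strategy over train/test splits.

    samples : list of (hyper_param_dict, evaluation_metric) pairs
    fit     : callable(train_samples) -> predictor
    score   : callable(predictor, test_samples) -> float, higher is better
    """
    per_split = []
    for i in range(len(samples)):
        train = samples[:i] + samples[i + 1:]    # leave-one-out split
        test = [samples[i]]
        per_split.append(score(fit(train), test))
    return sum(per_split) / len(per_split)

def normalise(averaged_scores):
    """Min-max normalise the averaged scores of all model-directed
    strategies into [0, 1] to obtain their confidences."""
    lo, hi = min(averaged_scores), max(averaged_scores)
    if hi == lo:
        return [1.0 for _ in averaged_scores]
    return [(s - lo) / (hi - lo) for s in averaged_scores]
```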
Dimension 3, evaluation index of super parameter combination generated by each super parameter tuning strategy in previous iteration or iterations
When facing different machine learning models, the effect of each super-parameter tuning strategy differs to some extent. Therefore, in order to improve the accuracy and robustness of the scoring result obtained for the super-parameter tuning strategies, the invention proposes that the evaluation indexes of the super-parameter combinations generated by each super-parameter tuning strategy in one or more previous iterations can be monitored in real time, and each super-parameter tuning strategy is scored according to the evaluation indexes of the super-parameter combinations it generated in those previous iterations.
As an example, the plurality of hyperparametric tuning strategies may be scored according to an average ranking of the evaluation index of the hyperparametric combination generated in the previous iteration or iterations of each hyperparametric tuning strategy in all generated hyperparametric combinations, respectively, wherein the scoring result is proportional to the average ranking.
For example, the rank of each super-parameter combination generated by the super-parameter tuning strategy i among all generated super-parameter combinations can be calculated according to its evaluation index, a quantile can then be calculated from each rank (the higher the rank, the higher the quantile), and the average of the calculated quantiles is used as the score of the super-parameter tuning strategy i. As another example, the average ranking of the super-parameter combinations generated by the super-parameter tuning strategy i among all generated super-parameter combinations may be calculated according to their evaluation indexes, a quantile may then be calculated from this average ranking, and the resulting quantile is used as the score of the super-parameter tuning strategy i. In both cases the size of the quantile is proportional to the ranking: the higher the ranking, the larger the quantile. The score based on dimension 3 can be noted as s_i^3(D).
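The dimension-3 score could be computed roughly as follows; mapping a combination to the fraction of all generated combinations it is ranked above or equal to is an assumed reading of "quantile".

```python
def recent_quality_score(all_metrics, recent_metrics, higher_is_better=True):
    """Dimension-3 score of one tuning strategy.

    all_metrics    : evaluation metrics of every combination generated so far
    recent_metrics : metrics of the combinations this strategy generated in
                     the previous round(s)
    """
    n = len(all_metrics)
    quantiles = []
    for m in recent_metrics:
        # position of m when all generated combinations are ordered from
        # worst to best: a better metric yields a higher quantile
        rank = sum(1 for x in all_metrics
                   if (x <= m if higher_is_better else x >= m))
        quantiles.append(rank / n)
    return sum(quantiles) / len(quantiles) if quantiles else 0.0
```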
To sum up, for the super-parameter tuning strategy i, its score in one or more of the dimensions can be calculated, and the final score of the super-parameter tuning strategy i is then obtained from these per-dimension scores. The final score may be calculated in a variety of ways, such as summation, multiplication, weighted summation, and the like. As an example, when the score of the super-parameter tuning strategy i is calculated based on the three dimensions described above, the final score of the super-parameter tuning strategy i may be noted as s_i(D), for example s_i(D) = s_i^1(D) + s_i^2(D) + s_i^3(D).
The scoring result can represent the quality of each super-parameter tuning strategy in the current round, so the currently available resources can be distributed among the plurality of super-parameter tuning strategies according to the scoring result. As an example, the scoring result may be used to characterize the probability of allocating resources to a super-parameter tuning strategy: the higher the score of a super-parameter tuning strategy, the greater the probability of allocating resources to it. For example, the probability value of each super-parameter tuning strategy can be determined according to the scoring result, the currently available resources can be divided into a plurality of parts, and the plurality of super-parameter tuning strategies can be sampled multiple times based on the probability values, so as to determine the super-parameter tuning strategy to which each part of the resources belongs. Taking four super-parameter tuning strategies as an example, where the probability value corresponding to strategy 1 is 0.2, to strategy 2 is 0.8, to strategy 3 is 0.6 and to strategy 4 is 0.5, for each part of the resources one super-parameter tuning strategy is sampled from the four according to these probability values, and that part of the resources is allocated to the sampled super-parameter tuning strategy.
The probability value of a super-parameter tuning strategy is proportional to its scoring result. With the final score of the super-parameter tuning strategy i recorded as s_i(D), the corresponding probability value may, for example, be expressed as: q_i(D) = exp(s_i(D)) / Σ_{j=1}^{N} exp(s_j(D)), wherein q_i(D) represents the probability value of the super-parameter tuning strategy i and N represents the number of super-parameter tuning strategies.
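Turning the final scores into probabilities and sampling a strategy for each share of resources might look like the sketch below; the softmax form follows the reconstructed expression above but remains an assumption, as does the use of random.choices for the repeated sampling.

```python
import math
import random

def allocation_probabilities(final_scores):
    """Softmax over the final scores s_i(D); a score of minus infinity
    yields probability 0, as required for unavailable strategies."""
    exps = [0.0 if s == -math.inf else math.exp(s) for s in final_scores]
    total = sum(exps)
    return [e / total for e in exps]

def allocate_resources(final_scores, n_shares, rng=random):
    """Split the currently available resources into n_shares parts and
    sample, for each part, the strategy that receives it."""
    probs = allocation_probabilities(final_scores)
    strategies = list(range(len(final_scores)))
    return rng.choices(strategies, weights=probs, k=n_shares)

# Example: four strategies, 10 hyper-parameter combinations to generate.
# print(allocate_resources([0.9, 2.1, -math.inf, 1.4], n_shares=10))
```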
When dividing the currently available resources into multiple parts, the division may be performed according to various division standards, and the specific division manner of the resources may be set according to actual situations, which is only exemplified herein.
For example, the resources may be divided into multiple parts according to the working time length, for example, when the currently available resources are "10 CPU cores are working for 1 day", the resources may be divided into 24 parts, and each part of resources is "10 CPU cores are working for 1 hour". For another example, the resources may be divided according to the number of CPU cores, and if the currently available resources are "10 CPU cores operate for 1 day", the resources may be divided into 10 resources, and each resource is "1 CPU core operates for 1 day".
Also for example, when the currently available resources are "3 hyper-parameter combinations need to be generated", 3 parts of resources may be divided, and each part of resources is "1 hyper-parameter combination needs to be generated". Where each resource can be considered a task. The number of hyper-parameter combinations that currently need to be generated may be determined based on the currently available computing resources. For example, a greater number of hyper-parameter combinations may be determined to be generated when there are more currently available computing resources, and a lesser number of hyper-parameter combinations may be determined to be generated when there are fewer currently available computing resources. Therefore, the allocation of the number of hyper-parameter combinations that need to be generated for the current round is also essentially an allocation of computing resources.
After the currently available resources are allocated to the plurality of super-parameter tuning strategies according to the scoring result, each super-parameter tuning strategy allocated to the resources can generate one or more super-parameter combinations based on the allocated resources respectively. The super-parameter tuning strategy allocated to the resource can select a super-parameter combination for the machine learning model based on the super-parameter selection strategy corresponding to the super-parameter tuning strategy to generate one or more super-parameter combinations. The generation process of the hyper-parameter combinations is not described in detail.
It should be noted that the machine learning model hyper-parameter optimization process supports parallel computation. For example, during an optimization process, the super-parametric tuning policies assigned to resources may run in parallel to provide multiple sets of super-parametric combinations simultaneously. Thereby, the optimization rate can be greatly improved.
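Putting the pieces of the first automatic parameter-adjusting mode together, one round of the iteration could be organized as follows; the score/propose interface of the strategy objects, the evaluate callback, the allocate callable and the sample-set bookkeeping are assumptions of this sketch.

```python
def run_one_round(strategies, sample_set, n_new_combos, evaluate, allocate):
    """One iteration of the first automatic parameter-adjusting mode.

    strategies   : objects exposing .score(sample_set) and .propose(sample_set)
    sample_set   : list of (hyper_param_combo, evaluation_metric) pairs so far
    n_new_combos : currently available resources, expressed as the number of
                   hyper-parameter combinations to generate this round
    evaluate     : callable(combo) -> evaluation metric of the trained model
    allocate     : callable(scores, n_shares) -> list of strategy indices,
                   e.g. the allocate_resources sketch given earlier
    """
    scores = [s.score(sample_set) for s in strategies]   # score each strategy
    owners = allocate(scores, n_new_combos)              # one owner per resource share
    for idx in owners:                                   # each share yields one combo
        combo = strategies[idx].propose(sample_set)
        metric = evaluate(combo)
        sample_set.append((combo, metric))               # feed back for the next round
    return sample_set
```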
Considering that when a single strategy is used for machine learning model hyper-parameter optimization there is inevitably a risk that the effect is poor in some scenarios or that the search converges to a local optimum, this embodiment proposes that multiple hyper-parameter tuning strategies can be used simultaneously during hyper-parameter optimization of the machine learning model. In addition, considering that the resources usable in the hyper-parameter optimization process are limited, this embodiment further provides a resource scheduling scheme for the hyper-parameter optimization process, in which, in each iteration, the currently available resources are allocated to the plurality of super-parameter tuning strategies according to the state of each strategy in the current round and its historical quality. This resource scheduling scheme, which emulates survival of the fittest, can ensure that the best-performing combination of super-parameter tuning strategies is used throughout the hyper-parameter optimization process, so that, under limited resources, the convergence efficiency of parameter tuning can be effectively improved and the hyper-parameter optimization effect is improved.
The second automatic parameter adjusting mode is as follows. In the competition stage, corresponding competition models are trained according to a machine learning algorithm under a plurality of competition super-parameter combinations to obtain the competition model with the best effect, and the obtained competition model and the competition super-parameter combination corresponding to it enter the growth stage as the winning model and the winning super-parameter combination. In the growth stage, under the winning super-parameter combination obtained in the current round of the competition stage, training of the winning model obtained in that competition stage is continued and the effect of the winning model is obtained; if the effect of the winning model indicates that the model effect has stopped growing, the competition stage is restarted to train updated competition models according to the machine learning algorithm under a plurality of updated competition super-parameter combinations, otherwise training of the winning model continues; this is iterated repeatedly until a preset termination condition is met. The updated competition super-parameter combinations are obtained based on the winning super-parameter combination of the previous growth stage, and the updated competition model is the winning model obtained in the previous growth stage.
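At a high level, the alternation between the competition stage and the growth stage could be sketched as follows; the function names, the callable interfaces, and the point at which the preset termination condition is checked are assumptions, and the internals of each stage are elaborated in the steps below.

```python
def auto_tune_by_competition(initial_combos, train_competition_phase,
                             train_growth_phase, derive_updated_combos,
                             terminated):
    """Second automatic parameter-adjusting mode (outer loop only).

    train_competition_phase(combos, model) -> (winning_model, winning_combo)
    train_growth_phase(model, combo)       -> (model, growth_stopped: bool)
    derive_updated_combos(winning_combo)   -> new list of competition combos
    terminated()                           -> bool, the preset termination test
    """
    combos, model = initial_combos, None
    winning_model, winning_combo = train_competition_phase(combos, model)
    while not terminated():
        winning_model, stopped = train_growth_phase(winning_model, winning_combo)
        if stopped:
            # restart the competition stage, seeded from the previous winner
            combos = derive_updated_combos(winning_combo)
            winning_model, winning_combo = train_competition_phase(combos, winning_model)
    return winning_model
```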
In the competition stage, training the corresponding competition model according to a machine learning algorithm under a plurality of competition super-parameter combinations to obtain the competition model with the best effect, wherein the method comprises the following steps: when each competition training step among the competition training steps in the competition stage is finished, respectively acquiring the effect of a competition model trained under each competition super-parameter combination; based on the obtained effect of each competition model, adjusting the competition super-parameter combination entering the next competition training step and the corresponding competition model, and obtaining the competition model with the best effect when the last competition training step is finished; wherein at each of the competitive training steps, at least one gradient update of the competition model is performed based on a first predetermined number of training samples.
The step of adjusting the competition super-parameter combinations entering the next competition training step and their corresponding competition models based on the obtained effect of each competition model, and obtaining the competition model with the best effect at the end of the last competition training step, comprises: when the obtained effect of each competition model indicates that the competition models are not in a growth-stopped state, if the current competition training step is not the last competition training step, obtaining the competition super-parameter combinations entering the next competition training step and their corresponding competition models, and if the current competition training step is the last competition training step, obtaining the competition model with the best effect.
The step of obtaining the competition super-parameter combination and the corresponding competition model entering the next competition training step comprises the following steps: removing the second preset number of competition models with the worst effect, thereby obtaining competition super-parameter combinations entering the next competition training step and corresponding competition models; or alternatively
Replacing the second preset number of competition models with the worst effect by the third preset number of competition models with the best effect, and fine-tuning the competition super-parameter combination of each competition model with the third preset number of competition models with the best effect to obtain the competition super-parameter combination entering the next competition training step and the corresponding competition model; or alternatively
Randomly removing the second preset number of competition models, so as to obtain a competition super-parameter combination entering the next competition training step and a corresponding competition model thereof; or alternatively
Replacing a randomly selected second predetermined number of competition models with the third predetermined number of competition models with the best effect, and fine-tuning the competition super-parameter combination of each of the third predetermined number of best-performing competition models, so as to obtain the competition super-parameter combinations entering the next competition training step and their corresponding competition models; or alternatively
Removing the second preset number of competition models with the longest existence time, so as to obtain a competition super-parameter combination entering the next competition training step and a corresponding competition model; or alternatively
Replacing the second predetermined number of competition models that have existed for the longest time with the third predetermined number of competition models with the best effect, and fine-tuning the competition super-parameter combination of each of the third predetermined number of best-performing competition models, so as to obtain the competition super-parameter combinations entering the next competition training step and their corresponding competition models; or alternatively
And directly taking the competition super-parameter combination and the corresponding competition model of the current competition training step as the competition super-parameter combination and the corresponding competition model of the competition super-parameter combination entering the next competition training step.
Wherein the second predetermined number is greater than or equal to the third predetermined number; and/or the second predetermined number is set to a fixed value or a regularly varying value for each competition training step.
Wherein the method further comprises obtaining the number of competitive training steps, comprising: and acquiring the number of the competition training steps according to the number of the competition models, the total number of the training samples, the maximum iteration number of the training samples and the number of the training samples which can be trained in each competition training step.
The step of continuously training the winning model obtained in the round of competition under the winning super parameter combination obtained in the round of competition in the growing stage and obtaining the effect of the winning model comprises the following steps: in the growing stage, under the combination of the winning super parameters obtained in the current round of competition stage, continuing training the winning model obtained in the current round of competition stage according to the growing training step, and obtaining the effect of the winning model obtained in each growing training step; wherein at each of the growing training steps, at least one gradient update of the winning model is performed based on a fourth predetermined number of training samples.
The method further comprises: determining, based on the effects of the winning model obtained in a fifth predetermined number of consecutive growth training steps, whether the effect of the winning model indicates that the model effect has stopped growing. This determination comprises: determining that the effect of the winning model indicates that the model effect has stopped growing when the effects obtained in the consecutive fifth predetermined number of growth training steps show a downward trend, wherein whether a downward trend occurs is determined based on the degree of decrease and/or the degree of fluctuation of the effects obtained in those steps; or determining that the effect of the winning model indicates that the model effect has stopped growing when the effects obtained in the consecutive fifth predetermined number of growth training steps meet a preset attenuation condition, wherein the attenuation condition is that a sixth predetermined number of consecutive effects are each lower than the average of the seventh predetermined number of effects preceding them.
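The attenuation-based test for "model effect stops growing" could be implemented along the lines of the following sketch; reading "lower than the average of the preceding effects" element-wise, and the default values chosen for the sixth and seventh predetermined numbers, are assumptions.

```python
def growth_stopped(effects, recent=3, window=5):
    """Decide whether the winning model's effect has stopped growing.

    effects : effect values recorded at successive growth training steps
              (higher is better)
    recent  : the 'sixth predetermined number' of consecutive effects
    window  : the 'seventh predetermined number' of preceding effects
    """
    if len(effects) < recent + window:
        return False
    # average of the `window` effects immediately before the last `recent` ones
    baseline = sum(effects[-(recent + window):-recent]) / window
    # attenuation condition: every one of the last `recent` effects is below it
    return all(e < baseline for e in effects[-recent:])
```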
The method further comprises the steps of: obtaining, for each growth training step, the effect of the winning model obtained in a fifth predetermined number of consecutive growth training steps, and the continuing training of the winning model comprises: if the effect of the winning model indicates that the model effect does not appear and stops growing, continuing the next growing training step; alternatively, the method further comprises: obtaining, for each fifth predetermined number of growth training steps, an effect that the winning model obtained in a fifth predetermined number of growth training steps in succession, and the continuing training of the winning model comprising: if the effect of the winning model indicates that no model effect has occurred to stop growing, then the next fifth predetermined number of growing training steps is continued.
Wherein the preset termination condition indicates any one or more of: the effect of each competition model obtained at the end of each competition training step indicates that the competition models are in a growth-stopped state; the training time reaches the time limit; and the effect of the winning model reaches the expected value.
The method further comprises the steps of: determining whether the training time reaches a time limit when the effect of the winning model indicates that the model effect stops growing; and/or determining whether the effect of the winning model has reached an expected value when the effect of the winning model indicates that model effect growth is stopped.
Wherein the method further comprises: the trained machine learning model is derived based on at least one best performing model obtained when the preset termination condition is satisfied.
Wherein the competition super-parameter combination comprises at least one model super-parameter and at least one training super-parameter; alternatively, the competitive super parameter combination includes at least one training super parameter.
Wherein for each competition super-parameter combination of the first round of competition phases, at least one training super-parameter is obtained by: dividing the linear super-parameter space of each training super-parameter into a plurality of parts, and taking a plurality of points in the middle as a value candidate set of each training super-parameter; and for each competition model, selecting a numerical value from each value candidate set of the training super-parameters to form a set of configured training super-parameter combinations.
Wherein for each combination of updated contention super parameters of the restarted contention phase, at least one training super parameter is obtained by: obtaining the numerical value of each training super parameter from the winning super parameter combination of the previous growth stage; randomly setting the numerical value of each training super-parameter as the upper boundary or the lower boundary of the linear super-parameter space of each training super-parameter to obtain a new linear super-parameter space; dividing a new linear super-parameter space into a plurality of parts, and taking a plurality of points in the middle as a value candidate set of each training super-parameter; and for each updated competition model, selecting a numerical value from the value candidate set of each training hyper-parameter to form a set of configured training hyper-parameter combinations.
Wherein the training hyper-parameters include a learning rate, wherein the learning rate is obtained for each combination of the updated competition hyper-parameters of the restarted competition stage by: obtaining the numerical value of the learning rate from the winning super-parameter combination in the previous growth stage; setting the numerical value of the learning rate as the upper boundary of the learning rate super-parameter space to obtain an updated learning rate super-parameter space; acquiring an updated intermediate value of the learning rate super-parameter space as an alternative value, and storing the alternative value into an alternative value set; and when the number of the candidate values included in the candidate value set is greater than or equal to the number of the updated competition models, randomly distributing corresponding candidate values for each updated competition model as corresponding learning rates.
Wherein the method further comprises: when the number of the candidate values included in the candidate value set is smaller than the number of the updated competition models, taking the currently obtained candidate value as the lower boundary of the learning rate super-parameter space to obtain an updated learning rate super-parameter space; and re-executing the step of acquiring the updated intermediate value of the learning rate super-parameter space as an alternative value and storing the alternative value into an alternative value set.
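The learning-rate candidate generation for a restarted competition stage could be sketched as follows; treating the "intermediate value" as the arithmetic midpoint of the current interval is an assumption (a logarithmic midpoint would be an equally plausible reading), and the random assignment of candidates to models is realized here by shuffling.

```python
import random

def learning_rate_candidates(previous_winner_lr, space_lower, n_models, rng=random):
    """Generate one learning rate per updated competition model.

    The previous winning learning rate becomes the upper boundary of the
    search interval; midpoints are collected, each new midpoint becoming the
    lower boundary, until at least n_models candidate values exist.
    """
    upper = previous_winner_lr
    lower = space_lower
    candidates = []
    while len(candidates) < n_models:
        mid = (lower + upper) / 2.0   # intermediate value of the current space
        candidates.append(mid)
        lower = mid                   # newest candidate becomes the lower boundary
    rng.shuffle(candidates)           # random assignment to the competition models
    return candidates[:n_models]
```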
Wherein the preset termination condition indicates: the effect of each competition model obtained at the end of each competition training step indicates that the competition models are in a growth-stopped state; and/or the training time reaches a time limit. The method further comprises: determining, based on at least one best-performing model obtained when the preset termination condition is satisfied, whether a machine learning model with the desired effect can be obtained; outputting the machine learning model with the desired effect in the case where it can be obtained; and, in the case where a machine learning model with the desired effect cannot be obtained, resetting the model hyper-parameters and entering the competition phase again.
The training sample is composed of data, and the machine learning model is used for processing the data; wherein the data comprises at least image data, text data or voice data. The training sample is composed of image data, the machine learning model is a neural network model, the machine learning algorithm is a neural network algorithm, and the neural network model is used for processing images.
In this embodiment, a one-time training process of the model is divided into a plurality of competition phases and a plurality of growth phases, in the competition phases, a plurality of competition super-parameter combinations are adopted to train corresponding competition models simultaneously, the competition super-parameter combinations are continuously selected, eliminated and evolved, and the competition model with the best effect and the corresponding competition super-parameter combination thereof are selected to enter the growth phases, i.e. in the growth phases, only the winning competition super-parameter combination is utilized to train the corresponding competition model continuously.
In one example, the competition superparameter combination may include at least one model superparameter and at least one training superparameter.
The above model hyper-parameters are hyper-parameters used to define the model, such as, but not limited to, activation functions (e.g., identity functions, sigmoid functions, truncated ramp functions, etc.), hidden layer node numbers, convolutional layer channel numbers, and full-link layer node numbers, etc.
The above training hyper-parameters are hyper-parameters used to define the model training process, such as, but not limited to, learning rate, batch size, and number of iterations, etc.
In another example, at least one training hyper-parameter may be included in the competitive hyper-parameter combination.
In this embodiment, the model training starts to enter the competition phase, that is, the first round of competition phase, where, for each competition super parameter combination of the first round of competition phase, at least one training super parameter may be obtained through the following steps S1010 to S1020:
in step S1010, the linear superparameter space of each training superparameter is divided into a plurality of parts, and a plurality of points in the middle are taken as a candidate set for the value of each training superparameter.
In step S1010, if the competition phase is being entered for the first time (the first round competition phase), the linear super-parameter space of each training super-parameter may, for example, be divided into N+2 or N+4 parts, and the middle N values are taken as the candidate set of values for that training super-parameter. The size of N may be adaptively adjusted; for example, N may be set according to hardware resources (such as the number of CPU cores, the number of GPUs, and the number of clusters), or according to the current remaining training time, or N may be calculated according to the following formula (1):
N=2*n+1 (1)
The n may be the total number of the training superparameters and the model superparameters in the competitive superparameter combination, or may be only the number of the training superparameters in the competitive superparameter combination, which is not limited herein.
Step S1020, for each competition model, selecting a numerical value from the candidate set of values of each training hyper-parameter to form a set of configured training hyper-parameter combinations.
In step S1020, for each competition model, for example, a value may be randomly selected from the candidate set of values of each training hyper-parameter as the hyper-parameter value, or a value may be selected from the candidate set of values of each training hyper-parameter as the hyper-parameter value according to a predetermined probability distribution (such as, but not limited to, gaussian distribution or poisson distribution), and the like, and the values of different training hyper-parameters are respectively selected once, so as to form a set of configured training hyper-parameter combinations.
Here, it is sufficient to ensure that the training hyper-parameter combinations configured for the different competition models are not identical; for example, but not limited to, the values configured for the same training hyper-parameter may differ between competition models.
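Steps S1010 to S1020 could be sketched as follows; the discretization used here (the N interior grid points of an evenly divided interval) approximates the "N+2 parts, middle N values" description, n is taken as the number of training hyper-parameters only, and rejecting exact duplicates is an assumed way of keeping the configured combinations distinct.

```python
import random

def candidate_set(lower, upper, n_values):
    """Step S1010: keep n_values evenly spaced interior points of the
    linear hyper-parameter space [lower, upper]."""
    step = (upper - lower) / (n_values + 1)
    return [lower + step * k for k in range(1, n_values + 1)]

def configure_combinations(spaces, n_models, rng=random):
    """Step S1020: pick one value per training hyper-parameter for each
    competition model, rejecting exact duplicates of earlier combinations.

    spaces : {hyper_parameter_name: (lower, upper)}
    Assumes enough distinct combinations exist for n_models models.
    """
    n_values = 2 * len(spaces) + 1        # N = 2*n + 1, formula (1)
    candidates = {k: candidate_set(lo, hi, n_values)
                  for k, (lo, hi) in spaces.items()}
    combos = []
    while len(combos) < n_models:
        combo = {k: rng.choice(v) for k, v in candidates.items()}
        if combo not in combos:           # configured combinations must differ
            combos.append(combo)
    return combos

# Example: two training hyper-parameters, eight competition models.
# print(configure_combinations({"learning_rate": (1e-4, 1e-1),
#                               "momentum": (0.5, 0.99)}, 8))
```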
In one embodiment of the present invention, in the competition stage in step S2100, training the corresponding competition model according to the machine learning algorithm under a plurality of competition super parameter combinations to obtain the competition model with the best effect may further include the following steps S2110 to S2120:
Step S2110, when each of the plurality of competition training steps in the competition stage is finished, obtaining the effect of the competition model trained under each competition super-parameter combination.
In step S2110, for example, K competition training steps may be set in the competition stage, and for each competition model entering the current competition training step, the corresponding competition model is trained in the current training step through a respective set of competition super parameters, so as to obtain the effect of the trained competition model. The size of K can be adaptively adjusted, for example, the size of K can be manually specified, or the size of K can be calculated according to the number of competition models, the total number of training samples, the maximum iteration number of the training samples and the number of training samples that can be trained in each competition training step:
wherein N is the number of competition models, R is the maximum iteration number over the training samples, I is the total number of training samples, and i is the number of training samples that can be trained in each competition training step.
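The number of competition training steps K can be computed, for example, under the assumption that a budget of R passes over the I training samples is shared by the N competition models, each consuming i samples per competition training step; this is only one plausible reading of the relationship stated above.

```python
import math

def num_competition_steps(n_models, total_samples, max_epochs, samples_per_step):
    """Assumed form: K = ceil((R * I) / (N * i)).

    n_models         : N, the number of competition models
    total_samples    : I, the total number of training samples
    max_epochs       : R, the maximum iteration number over the training samples
    samples_per_step : i, training samples consumed per model per step
    """
    return math.ceil((max_epochs * total_samples) / (n_models * samples_per_step))
```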
Here, any index may be used to measure the effect of the competition model, for example, any one or more of the accuracy, the loss rate, the derivative of the accuracy, or the derivative of the loss rate of the trained competition model on the current verification data set may be used as a criterion, and after the competition model trained by each competition super-parameter combination is obtained at the end of the current competition training step, the effect of the trained competition model may be ranked.
In step S2120, the competition super parameter combination and the corresponding competition model entering the next competition training step are adjusted based on the obtained effect of each competition model, and the competition model with the best effect at the end of the last competition training step is obtained.
It can be seen that, according to the exemplary embodiments of the present invention, the competition phase uses a model parameter inheritance technique: after the effect of the competition model trained under each competition super-parameter combination is obtained at the end of the current competition training step, the competition super-parameter combinations entering the next competition training step and their corresponding competition models can be obtained based on these effects. In other words, no surrogate model is used during the model training process of the whole competition phase, so that, at the end of the competition phase, in addition to the optimal competition super-parameter combination, the competition model with the optimal effect, corresponding to that super-parameter combination, is also produced.
In one embodiment of the present invention, the step S2120 of adjusting the competition super-parameter combination and the corresponding competition model entering the next competition training step based on the obtained effect of each competition model, and obtaining the competition model having the best effect at the end of the last competition training step may further include the following step S2121:
In step S2121, when the obtained effect of each competition model indicates that the competition models are not in a growth-stopped state, if the current competition training step is not the last competition training step, the competition super-parameter combinations entering the next competition training step and their corresponding competition models are obtained; if the current competition training step is the last competition training step, the competition model with the best effect is obtained.
The competition models not being in a growth-stopped state means that the best effect among the effects obtained at the end of the current competition training step is better than the best effect obtained at the end of the previous competition training step.
In step S2121, at least one gradient update of the competition model is performed based on the first predetermined number of training samples in each competition training step.
The first predetermined number may be set according to a specific application scenario or a simulation experiment, and may be that a plurality of competition models use the same number of training samples to perform training of one competition training step at the same time, but specifically, training sample data of each competition model may be the same or different. For example, the plurality of competition models perform training in one competition training step by using 1000 training samples, respectively, but the 1000 training sample data used in each step may be the same or different.
In one embodiment of the present invention, the step S2121 of obtaining the competition super-parameter combinations and the corresponding competition models for entering the next competition training step may further include any one or more of the following steps S2121-1 to S2121-7:
step S2121-1, removing the second predetermined number of competition models with the worst effect, thereby obtaining the competition super-parameter combination and the corresponding competition model for entering the next competition training step.
The above second predetermined number may be set according to the specific application scenario or a simulation experiment. The second predetermined number may be set to a fixed value or a regularly varying value for each competition training step.
It will be appreciated that after the second predetermined number of competition models with the worst effect are removed, the number of competition models entering the next competition training step is smaller than the number that entered the current competition training step.
It will be appreciated that the removal of the worst-performing models may be arranged such that, after the second predetermined number of models is removed at the end of the last competition training step, only one model with the best effect remains.
Step S2121-2, replacing the second predetermined number of competition models with the worst effect by the third predetermined number of competition models with the best effect, and fine-tuning the competition super-parameter combination of each of the third predetermined number of best-performing competition models, so as to obtain the competition super-parameter combinations entering the next competition training step and their corresponding competition models.
The above third predetermined number may be set according to a specific application scenario or simulation experiment.
The above second predetermined number is greater than or equal to the third predetermined number.
In one example, in the case where the second predetermined number is equal to the third predetermined number, the number of competition models entering the current competition training step is the same as the number of competition models entering the next competition training step.
In another example, in the case where the second predetermined number is greater than the third predetermined number, the number of competition models entering the next competition training step is smaller than the number that entered the current competition training step.
In step S2121-2, the fine tuning method is not unique, for example, but not limited to, randomly increasing or decreasing the current value by z% for the competitive super parameter combination, and a new set of competitive super parameter combinations is obtained. Wherein z can be set according to specific application scenes and simulation experiments.
Step S2121-3, randomly removing the second predetermined number of competition models, thereby obtaining the competition super-parameter combination and the corresponding competition model for entering the next competition training step.
It will be appreciated that after the second predetermined number of competition models are randomly removed, the number of competition models entering the next competition training step is less than the number that entered the current competition training step.
Step S2121-4, replacing a randomly selected second predetermined number of competition models with the third predetermined number of competition models with the best effect, and fine-tuning the competition super-parameter combination of each of the third predetermined number of best-performing competition models, thereby obtaining the competition super-parameter combinations entering the next competition training step and their corresponding competition models.
Step S2121-5, removing the second predetermined number of competition models with the longest existence time, thereby obtaining the competition super-parameter combination and the corresponding competition model for entering the next competition training step.
It will be appreciated that after the second predetermined number of competition models with the longest existence time are removed, the number of competition models entering the next competition training step is smaller than the number of competition models that entered the current competition training step.
Step S2121-6, replacing the second predetermined number of competition models with the longest existence time by the third predetermined number of competition models with the best effect, and fine-tuning the super-parameter combination of each of these best-performing competition models, thereby obtaining the competition super-parameter combinations and the corresponding competition models for entering the next competition training step.
Step S2121-7, directly taking the competition super-parameter combination and the corresponding competition model of the current competition training step as the competition super-parameter combination and the corresponding competition model of the next competition training step.
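By way of a non-limiting illustration of steps S2121-1 and S2121-2, the following Python sketch removes the worst-performing competition models after a competition training step and refills the pool with perturbed copies of the best-performing ones. The function names and data layout are hypothetical, and the z% perturbation is only one of the fine-tuning methods mentioned above.

```python
import copy
import random

def perturb(hyperparams, z=10):
    """Randomly increase or decrease each hyper-parameter value by z percent,
    mirroring the fine-tuning strategy mentioned for step S2121-2."""
    return {name: value * (1 + random.choice([-1, 1]) * z / 100.0)
            for name, value in hyperparams.items()}

def next_competition_pool(pool, second_num, third_num, z=10):
    """One elimination/replacement step over a competition pool.

    `pool` is a list of (hyperparams, model, effect) tuples produced by the
    current competition training step; a higher effect is assumed to be better.
    """
    # Step S2121-1: drop the `second_num` worst-performing competition models.
    survivors = sorted(pool, key=lambda item: item[2],
                       reverse=True)[:len(pool) - second_num]

    # Step S2121-2: clone the `third_num` best-performing models with
    # fine-tuned hyper-parameter combinations to refill the vacated slots.
    replacements = [(perturb(hp, z), copy.deepcopy(model), effect)
                    for hp, model, effect in survivors[:third_num]]
    return survivors + replacements

# Usage: a pool of four entries; remove the two worst, clone the two best.
pool = [({"lr": 0.1, "l2": 0.01}, object(), e) for e in (0.61, 0.72, 0.55, 0.68)]
print(len(next_competition_pool(pool, second_num=2, third_num=2)))  # -> 4
```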
In this embodiment, after the competition model with the best effect is obtained according to the above step S2100, the obtained competition model and the competition super parameter combination corresponding to the competition model can be used as the winning model and the winning super parameter combination to enter the growth stage, so as to continuously use the winning super parameter combination to fully train the winning model in the growth stage.
Step S2200, in the growing stage, under the winning super parameter combination obtained in the competition stage of the round, continuing to train the winning model obtained in the competition stage of the round, and obtaining the effect of the winning model, if the effect of the winning model indicates that the model effect stops growing, restarting the competition stage to train the updated competition model according to the machine learning algorithm under the updated plurality of competition super parameter combinations, otherwise, continuing to train the winning model, and repeating the process until the preset termination condition is met.
The main effect of the growing stage is to fully train the winning model obtained in the current round of competition stage by using the winning super-parameter combination obtained in the current round of competition stage, and the number of training samples used in the competition stage is smaller than that used in the growing stage, so that extremely small calculation cost is used for super-parameter combination optimization in the competition stage, and the time-consuming gravity center of the model is located in the growing stage.
It can be seen that according to an exemplary embodiment of the present invention, the contention phase and the growth phase are iterated repeatedly, wherein, when the contention phase is returned from the growth phase to the contention phase, a plurality of updated contention super-parameter combinations are obtained based on the win super-parameter combinations of the previous growth phase, and the updated contention model is the win model obtained in the previous growth phase.
In one example, upon returning from the growing phase to the competing phase, since the updated plurality of competing super-parameter combinations are obtained based on the winning super-parameter combination of the previous round of growing phase, here, for each of the updated competing super-parameter combinations of the restarted competing phase, at least one of the training super-parameters can be obtained by the following steps S2011 to S2014:
in step S2011, the value of each training super parameter is obtained from the winning super parameter combination in the previous growing stage.
In step S2011, if the competition phase is currently entered from the growth phase, for example, the value of each training super-parameter in the winning super-parameter combination may be obtained from the winning super-parameter combination of the previous round of growth phase, and the linear super-parameter space of each training super-parameter may be updated according to the value of each training super-parameter.
In step S2012, the value of each training hyper-parameter is randomly set as the upper boundary or the lower boundary of the linear hyper-parameter space of each training hyper-parameter to obtain a new linear hyper-parameter space.
The linear superparameter space for each training superparameter may be scaled down by this step S2012.
Step S2013, dividing the new linear super-parameter space into a plurality of parts, and taking a plurality of points in the middle as a value candidate set of each training super-parameter.
In step S2013, the new linear super-parameter space of each training super-parameter may be divided into N+2 or N+4 parts, and the middle N values are taken as the candidate value set corresponding to that training super-parameter. The calculation of N has already been given in detail in step S1010 above and is not repeated here.
Step S2014, for each updated competition model, selecting a numerical value from the candidate set of values of each training hyper-parameter to form a set of configured training hyper-parameter combinations.
In step S2014, for each updated competition model, for example, a value may be randomly selected from the candidate set of values of each training hyper-parameter as a hyper-parameter value, or a value may be selected from the candidate set of values of each training hyper-parameter as a hyper-parameter value according to a predetermined probability distribution (for example, but not limited to, gaussian distribution or poisson distribution), and the values of different training hyper-parameters are respectively selected once, so as to form a set of configured training hyper-parameter combinations.
Here, it is sufficient to ensure that the training hyper-parameter combinations configured for the respective updated competition models differ from one another; for example, but not limited to, the values configured for the same training hyper-parameter may be mutually different across the updated competition models.
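A minimal Python sketch of steps S2011 to S2014 follows, assuming each training super-parameter has a linear space given as a (lower, upper) pair and that N has already been computed as in step S1010. All function names are hypothetical, and the way the "middle N values" are taken is only one possible reading of step S2013.

```python
import random

def candidate_values(winning_value, space, n):
    """Steps S2011-S2013: shrink the linear super-parameter space around the
    winning value and return N values from the middle of the shrunken space.

    `space` is the (lower, upper) linear space of one training super-parameter;
    the winning value randomly replaces one of the two boundaries (step S2012).
    Taking the midpoints of the N middle parts of an (N + 2)-way split is only
    one possible reading of step S2013."""
    lower, upper = space
    if random.random() < 0.5:
        lower = winning_value
    else:
        upper = winning_value
    if lower > upper:
        lower, upper = upper, lower
    step = (upper - lower) / (n + 2)
    return [lower + step * (i + 1.5) for i in range(n)]

def configure_competitors(winning_combo, spaces, num_models, n):
    """Step S2014: build one training super-parameter combination per updated
    competition model by drawing (here: uniformly) from each candidate set."""
    candidates = {name: candidate_values(winning_combo[name], spaces[name], n)
                  for name in winning_combo}
    return [{name: random.choice(values) for name, values in candidates.items()}
            for _ in range(num_models)]

# Usage: two training super-parameters, four updated competition models, N = 4.
winning = {"l2_reg": 0.01, "batch_fraction": 0.5}
spaces = {"l2_reg": (0.0001, 0.1), "batch_fraction": (0.1, 1.0)}
print(configure_competitors(winning, spaces, num_models=4, n=4))
```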
In another example, the training super-parameters include a learning rate, and the learning rate may be obtained by the following steps S2021 to S2024 for each competition super-parameter combination updated during the restarted competition phase according to the learning rate decreasing policy:
in step S2021, the learning rate value is obtained from the winning super-parameter combination in the previous growth stage.
In step S2021, if the competition phase is currently entered from the growth phase, the learning rate may be obtained from the winning super-parameter combination of the previous growth phase.
In step S2022, the value of the learning rate is set as the upper boundary of the learning rate super-parameter space, so as to obtain the updated learning rate super-parameter space.
By this step S2022, the super parameter space of the learning rate can be reduced.
Step S2023, acquiring the intermediate value of the updated learning rate super parameter space as an alternative value, and storing the alternative value into an alternative value set.
Step S2024, when the number of candidate values included in the candidate value set is greater than or equal to the number of updated competition models, randomly assigning a corresponding candidate value to each updated competition model as a corresponding learning rate.
In step S2024, when the number of candidate values included in the candidate value set is smaller than the number of updated competition models, the currently obtained candidate value is taken as the lower boundary of the learning-rate super-parameter space to obtain an updated learning-rate super-parameter space, and step S2023 is re-executed to obtain the intermediate value of the updated learning-rate super-parameter space as a new candidate value and store it in the candidate value set.
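The following is a minimal sketch of the learning-rate handling of steps S2021 to S2024, assuming the learning-rate super-parameter space is an interval with a fixed lower boundary; the function name and the lower-bound parameter are assumptions made for illustration only.

```python
import random

def learning_rate_candidates(winning_lr, lr_lower_bound, num_models):
    """Steps S2021-S2024: generate one candidate learning rate per updated
    competition model, all of them below the previous winning learning rate.

    `lr_lower_bound` stands in for the fixed lower boundary of the learning-rate
    super-parameter space; `winning_lr` becomes its upper boundary (step S2022).
    """
    lower, upper = lr_lower_bound, winning_lr
    candidates = []
    while len(candidates) < num_models:      # step S2024: not enough candidates yet
        middle = (lower + upper) / 2.0       # step S2023: take the intermediate value
        candidates.append(middle)
        lower = middle                       # shrink the space from below and repeat
    random.shuffle(candidates)               # random assignment to the updated models
    return candidates

# Usage: the previous winning learning rate was 0.1; configure three models.
print(learning_rate_candidates(winning_lr=0.1, lr_lower_bound=0.0, num_models=3))
# e.g. [0.075, 0.05, 0.0875] in some random order
```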
In one example, the preset termination condition for terminating the repeated iteration indicates any one or more of the following:
the effect of each competition model obtained at the end of each competition training step indicates that the competition model is in a stop state;
the training time reaches the time limit; and
the effect of the winning model reaches the expected value.
Here, a competition model being in the stop state means that the optimal effect among the effects obtained at the end of the current competition training step is not better than the optimal effect obtained at the end of the previous competition training step.
In this example, for example, when the effect of the winning model indicates that the model effect stops growing, it may be determined whether the training time reaches the time limit; for example, when the effect of the winning model indicates that the model effect stops growing, it may be determined whether or not the effect of the winning model reaches an expected value.
In this example, the trained machine learning model may be obtained based on at least one best-performing model obtained when the preset termination condition is satisfied, and the machine learning model may be output.
In one example, the preset termination condition for the repeated iteration termination may further indicate: the effect of each competition model obtained at the end of each competition training step indicates that the competition model is in a stop state; and/or the training time reaches a time limit.
In this example, whether a machine learning model with the desired effect can be obtained may be determined based on at least one best-performing model obtained when the preset termination condition is satisfied. If a machine learning model with the desired effect can be obtained, that machine learning model is output; if not, the model super-parameters are reset and the competition phase is entered again. That is, since a machine learning model with the desired effect cannot be obtained merely through repeated iteration of the competition phase and the growth phase on the basis of the initially set value of the at least one model hyper-parameter, the value of the at least one model hyper-parameter needs to be reset, the competition phase is entered again, and a machine learning model with the desired effect is then obtained through repeated iteration of the competition phase and the growth phase.
In one embodiment of the present invention, in the growing stage in step S2200, under the combination of the winning super parameters obtained in the present round of competition stage, continuing to train the winning model obtained in the present round of competition stage, and obtaining the effect of the winning model may further include:
and in the growing stage, continuously training the winning model obtained in the growing training step under the winning super-parameter combination obtained in the competition stage of the round, and obtaining the effect of the winning model obtained in each growing training step.
In this embodiment, at least one gradient update of the winning model is performed based on a fourth predetermined number of training samples at each growth training step.
The fourth predetermined number may be set according to a specific application scenario or simulation experiment, and the number of training samples may be the same or different for each growth training step. For example, each growth training step may include 10000 training samples; alternatively, 10000 samples may be used in the first growth training step, 8000 samples in the second growth training step, and so on.
In one embodiment of the present invention, it may be determined whether the effect of the winning model indicates that the model effect is stopped growing, based on the effect of the winning model obtained in the continuous fifth predetermined number of growing training steps, and if the effect of the winning model indicates that the model effect is stopped growing, the competition phase is restarted to continuously train the updated competition model according to the machine learning algorithm under the updated plurality of competition super-parameter combinations, respectively, otherwise, the winning model is continuously trained, and the above-mentioned process is iterated until a preset termination condition is satisfied.
In one example, whether the effect of the winning model indicates that the model effect has stopped growing may be determined, based on the effects obtained in the consecutive fifth predetermined number of growth training steps, in any one or more of the following modes:
in mode 1, when the winning model exhibits a tendency to slip down in the effect obtained in the fifth predetermined number of growth training steps in succession, it is determined that the effect of the winning model indicates that the model effect stops growing.
The fifth predetermined number may be set according to a specific application scenario or simulation experiment.
In this mode 1, whether the downward trend occurs may be determined based on the degree of degradation and/or the degree of jitter of the effects obtained in the consecutive fifth predetermined number of growth training steps.
In the mode 2, when the effect obtained by the winning model in the fifth predetermined number of continuous growth training steps meets the preset attenuation condition, it is determined that the effect of the winning model indicates that the model effect stops growing.
In this mode 2, the attenuation condition is satisfied when there are a sixth predetermined number of consecutive effects, each of which is lower than the average value of the seventh predetermined number of effects preceding it.
The sixth predetermined number and the seventh predetermined number may be set according to a specific application scenario or simulation experiment.
Illustratively, if the attenuation condition is that there are two consecutive effects each lower than the average of the three effects preceding it (i.e., the sixth predetermined number is 2 and the seventh predetermined number is 3), the attenuation condition may be expressed as:

$$v_i < \frac{v_{i-1} + v_{i-2} + v_{i-3}}{3} \quad \text{and} \quad v_j < \frac{v_{j-1} + v_{j-2} + v_{j-3}}{3} \tag{3}$$

where $v_i$ and $v_j$ denote the effects of any two consecutive growth training steps among the fifth predetermined number of growth training steps (i.e., $j = i + 1$), $v_{i-1}$, $v_{i-2}$, $v_{i-3}$ denote the three effects preceding $v_i$, and $v_{j-1}$, $v_{j-2}$, $v_{j-3}$ denote the three effects preceding $v_j$. When the attenuation condition of equation (3) is satisfied, it is determined that the model effect has stopped growing.
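As a small illustration of the attenuation condition of equation (3), the following hypothetical Python check returns True when two consecutive effects each fall below the average of their three predecessors.

```python
def effect_stopped_growing(effects):
    """Return True if two consecutive effects are each lower than the average
    of the three effects immediately preceding them, i.e. equation (3)."""
    for i in range(3, len(effects) - 1):
        mean_before_i = sum(effects[i - 3:i]) / 3.0
        mean_before_j = sum(effects[i - 2:i + 1]) / 3.0   # j = i + 1
        if effects[i] < mean_before_i and effects[i + 1] < mean_before_j:
            return True
    return False

# Usage: the effect rises and then two consecutive values fall below the
# averages of their three predecessors, so growth is considered stopped.
print(effect_stopped_growing([0.60, 0.62, 0.63, 0.55, 0.54]))  # True
```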
In one embodiment of the invention, the effect of the winning model over the consecutive fifth predetermined number of growth training steps is evaluated at every growth training step, and continuing to train the winning model includes:
if the effect of the winning model indicates that the model effect does not appear to stop growing, the next growing training step is continued.
In one embodiment of the invention, the effect of the winning model over the consecutive fifth predetermined number of growth training steps is evaluated once every fifth predetermined number of growth training steps, and continuing to train the winning model includes:
if the effect of the winning model indicates that no model effect has occurred to stop growing, then the next fifth predetermined number of growing training steps is continued.
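The growth-stage loop described above could, for instance, be sketched as follows; everything here (the function names, the callables standing in for training and evaluation, and the sliding window over recent effects) is a hypothetical simplification rather than the claimed implementation.

```python
def run_growth_phase(model, hyperparams, sample_batches, window,
                     train_one_step, evaluate, stopped_growing):
    """Train the winning model batch by batch, recording the effect of every
    growth training step, and stop as soon as the effects of the last `window`
    steps indicate stalled growth (signalling a return to the competition
    phase). `train_one_step`, `evaluate` and `stopped_growing` are placeholders
    for the actual training, evaluation and stall-detection logic."""
    effects = []
    for batch in sample_batches:                   # one growth training step
        train_one_step(model, hyperparams, batch)  # at least one gradient update
        effects.append(evaluate(model))
        if len(effects) >= window and stopped_growing(effects[-window:]):
            return effects, True                   # restart the competition phase
    return effects, False                          # training samples exhausted

# Usage with trivial stand-ins (no real training is performed):
batches = [[0] * 100 for _ in range(8)]
effects, stalled = run_growth_phase(
    model={}, hyperparams={"lr": 0.05}, sample_batches=batches, window=5,
    train_one_step=lambda m, h, b: None,
    evaluate=lambda m: 0.5,
    stopped_growing=lambda e: False)
print(len(effects), stalled)   # 8 False
```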
According to the method provided by the embodiment of the invention, a single training process of the model is divided, based on a model-parameter inheritance technique, into alternating competition phases and growth phases. In a competition phase, a plurality of competition super-parameter combinations are used to train the corresponding competition models simultaneously, and the competition super-parameter combinations are continuously selected, eliminated and evolved; when the competition phase ends, the best-performing competition super-parameter combination and its corresponding competition model are selected to enter the growth phase, so that super-parameter optimization is carried out in the competition phase. In the growth phase, the winning competition super-parameter combination is used to continue training the corresponding competition model, and the competition phase is re-entered for super-parameter optimization when the model stops growing. The competition phase and the growth phase alternate until the preset termination condition is met and training stops, so that end-to-end automatic parameter tuning is achieved without manual effort and with a small amount of calculation.
On the one hand, the competition stage and the growth stage are both in one training process, so that the optimal super-parameter combination and the optimal machine learning model can be obtained through one training period.
On the other hand, the model training process is discretized based on the model parameter inheritance technology, so that the calculation amount of the super parameter optimization from the initialization is reduced, and the optimization selection of a plurality of groups of super parameters can be completed in one training of the model.
In the third aspect, in the prior art, when a plurality of machine learning models are trained simultaneously with a plurality of super-parameter combinations, P iterative selections are needed, M models are trained in each iterative selection, and each of the M trainings must reach a preset threshold in order to obtain a super-parameter combination meeting the requirements, so the amount of calculation is huge and the final model is still not obtained. In the present application, the number of training samples used in the competition phase can be significantly smaller than that used in the growth phase, so that the time-consuming part of model training lies in the growth phase and the amount of calculation is only a small fraction of that of the existing scheme; moreover, the final model is directly generated at the end of training, so manual participation is avoided.
On the basis of the above embodiment, before performing step S2100, the method may further include steps S1100 to S1200 as follows:
step S1100, a setting entry for setting an application scenario of the machine learning model is provided.
The user can determine the specific application scene of the required machine learning model according to the self requirement, and input the application scene through the setting entry.
Step S1200, an application scenario input through the setting portal is acquired.
Then, step S1000 may further be: and acquiring a corresponding training sample set according to the input application scene.
Specifically, a training sample set corresponding to a plurality of application scenarios may be prestored in an electronic device executing an embodiment of the present invention, where the training sample is composed of data, for example, image data, text data, or voice data, and according to an application scenario input through a provided setting entry of the application scenario, a training sample set matched with the application scenario is obtained for machine learning training, so that the obtained final machine learning model may be suitable for the input application scenario, and the data may be processed accordingly.
After the final machine learning model is obtained through the above embodiment, the method may further include steps S1300 to S1500 as follows:
step S1300, determining an application scenario to which the final machine learning model is applicable.
Step S1400, find the application item matching the application scenario.
Step S1500, inputting the final machine learning model into the application item.
In this embodiment, the final machine learning model is input to the application item matched with the application scenario to which the final machine learning model is applied, so that sample information in the application item is processed in the corresponding application item by using the final machine learning model.
The third automatic parameter adjusting mode is as follows: respectively carrying out one round of super-parameter exploration training on a plurality of machine learning algorithms based on the same target data set, wherein in the one round of exploration, each machine learning algorithm explores at least M groups of super-parameters, and M is a positive integer greater than 1; calculating the current round of performance score of each machine learning algorithm based on model evaluation indexes respectively corresponding to a plurality of groups of super parameters respectively explored by the machine learning algorithms in the round, and calculating the future potential score of each machine learning algorithm; combining the current round of performance scores and future potential scores of each machine learning algorithm, and determining a resource allocation scheme for allocating available resources to each machine learning algorithm; and carrying out corresponding resource scheduling in the next round of super-parameter exploration training according to the resource allocation scheme.
Wherein said calculating the current-round performance score of each machine learning model comprises: determining the top K best model evaluation indexes from the model evaluation indexes respectively corresponding to the plurality of groups of super-parameters respectively explored by the machine learning models, wherein K is a positive integer; and, for each machine learning model, taking the proportion of these top K best model evaluation indexes that belong to the machine learning model as the current-round performance score of the machine learning model.
Wherein said calculating future potential scores for each machine learning model comprises: storing model evaluation indexes corresponding to a plurality of groups of super parameters explored by each machine learning model in the round in an array according to a sequence to obtain a plurality of arrays corresponding to the plurality of machine learning models; for each machine learning model, a monotonically enhanced array is extracted from an array corresponding to the machine learning model, and a ratio of a length of the monotonically enhanced array to a length of the array corresponding to the machine learning model is used as a future potential score of the machine learning model.
Wherein the plurality of machine learning models includes at least two of a logistic regression machine learning model with a super-parametric selection mechanism, a naive bayes machine learning model with a super-parametric selection mechanism, an ensemble learning model with a super-parametric selection mechanism, and a regression-related machine learning model with a super-parametric selection mechanism.
Wherein the resources include at least one of a central processor, a memory space, and threads.
The step of performing a round of super parameter exploration training on the plurality of machine learning models based on the same target data set respectively further comprises the following steps: determining whether at least one machine learning model of the plurality of machine learning models satisfies an early-stop condition, wherein training of the at least one machine learning model is stopped when the at least one machine learning model is determined to satisfy the early-stop condition, and the step of calculating the current round of performance score and the future potential score is not performed on the at least one machine learning model.
The early-stop condition includes: when the model evaluation indexes corresponding to the super-parameters explored by one machine learning model in the current round fail to reach a new best for I consecutive explorations, the machine learning model satisfies the early-stop condition; and/or when the J-th best model evaluation index among the super-parameters explored by one machine learning model in the current round is higher than the optimal evaluation index of another machine learning model in the current round, the other machine learning model satisfies the early-stop condition.
Wherein the array corresponding to the machine learning model sequentially comprises a first model evaluation index to an X model evaluation index, wherein X is an integer greater than or equal to M; the step of extracting a monotonically enhancing array from the array corresponding to the machine learning model comprises: extracting a first model evaluation index as a first value in a monotonically enhanced array; and extracting any model evaluation index from the second model evaluation index to the X model evaluation index as a new value in the monotonically enhanced array if the any model evaluation index is better than the maximum value in the current monotonically enhanced array.
Wherein the step of determining the resource allocation scheme comprises: calculating a composite score for each machine learning model based on the current round of performance scores and the future potential scores for each machine learning model; calculating the ratio of the comprehensive score of each machine learning model to the sum of all comprehensive scores as a resource allocation coefficient of each machine learning model; the resource allocation scheme is determined as the following resource allocation scheme: a resource corresponding to a product of the resource allocation coefficient of each machine learning model and the total resource to be allocated is determined as a resource to be allocated to each machine learning model.
The step of determining a resource corresponding to a product of the resource allocation coefficient of each machine learning model and the total resource to be allocated as a resource to be allocated to each machine learning model includes: from among all machine learning models except the machine learning model with the highest resource allocation coefficient, starting from the machine learning model with the lowest resource allocation coefficient, rounding down the product of the resource allocation coefficient of the machine learning model and the total resources to be allocated and determining the value after rounding down as the number of the resources to be allocated to the machine learning model; and determining the resources which are not allocated to the machine learning model in the total resources to be allocated as the resources which are allocated to the machine learning model with the highest resource allocation coefficient.
The step of determining a resource corresponding to a product of the resource allocation coefficient of each machine learning model and the total resource to be allocated as a resource to be allocated to each machine learning model further includes: when there are a value of zero and a value greater than one in the number of resources allocated to each machine learning model, sorting the number of resources of the machine learning model in which the number of resources allocated to the machine learning model is greater than one in ascending order; among the resources of the machine learning models ordered in the ascending order, starting from the machine learning model having the smallest resources, the resources of the machine learning model are reduced by one unit, and the reduced resources are allocated to one of the machine learning models having zero number of resources, and the step of ordering in the ascending order is returned until the resources of all the machine learning models are not zero.
The resource scheduling method further comprises the following steps: and stopping allocating resources to the machine learning model in response to a stop request of the user, the total training time reaching a predetermined total training time, or the total training round number reaching a predetermined total training round number.
The step of respectively carrying out one round of super-parameter exploration training on a plurality of machine learning models based on the same target data set comprises the following steps of: and respectively allocating the same number of resources to the plurality of machine learning models, and respectively performing one round of super-parameter exploration training on the plurality of machine learning models based on the same target data set by using the same number of resources.
In an exemplary embodiment, the model evaluation indexes respectively corresponding to the multiple sets of super parameters respectively explored in this round of the logistic regression machine learning model lr, the gradient lifting decision tree machine learning model gbdt and the deep sparse network machine learning model dsn may be expressed as follows:
lr:[0.2,0.4,0.5,0.3,0.6,0.1,0.7,0.3]
gbdt:[0.5,0.2,0.1,0.4,0.2,0.6]
dsn:[0.61,0.67,0.63,0.72,0.8]
wherein a single value in a single array may indicate a training effect of a machine learning model having a set of super parameters. For example, a single value (e.g., 0.2) in the array herein may indicate verification set accuracy, as an example. Furthermore, in this example, the logistic regression machine learning model lr is trained with eight sets of hyper-parameters, the gradient boost decision tree machine learning model gbdt is trained with six sets of hyper-parameters, and the deep sparse network machine learning model dsn is trained with five sets of hyper-parameters. In the present exemplary example, the optimal model evaluation index and 5 th (in this example, J is 5, however, the present invention is not limited thereto) optimal model evaluation index of the logistic regression machine learning model lr are 0.7 and 0.3, respectively, the optimal model evaluation index and 5 th optimal model evaluation index of the gradient-lifting decision tree machine learning model gbdt are 0.6 and 0.2, respectively, and the optimal model evaluation index and 5 th optimal model evaluation index of the deep sparse network machine learning model dsn are 0.8 and 0.61, respectively. Since the 5 th best model evaluation index 0.61 of the deep sparse network machine learning model dsn is greater than the best model evaluation index 0.6 of the gradient-lifting decision tree machine learning model gbdt, the gradient-lifting decision tree machine learning model gbdt is determined to satisfy the early-stop condition. Thus, the gradient-lifting decision tree machine learning model gbdt is no longer involved in model exploration. Therefore, by judging whether the machine learning model satisfies the early-stop condition and stopping the search for the machine learning model satisfying the early-stop condition, the waste of resources can be reduced and the search efficiency can be improved.
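The second early-stop condition illustrated here can be sketched in Python as follows; the function name is hypothetical, and higher evaluation indexes are assumed to be better. Run on the three arrays above with J = 5, it flags gbdt, matching the example.

```python
def early_stopped(scores_by_model, j=5):
    """Return the models satisfying the second early-stop condition: some other
    model's J-th best evaluation index of this round already exceeds this
    model's best evaluation index of this round."""
    stopped = set()
    for name, scores in scores_by_model.items():
        best = max(scores)
        for other, other_scores in scores_by_model.items():
            if other == name or len(other_scores) < j:
                continue
            jth_best = sorted(other_scores, reverse=True)[j - 1]
            if jth_best > best:
                stopped.add(name)
    return stopped

# Usage with the evaluation indexes listed above (J = 5): dsn's 5th best
# index 0.61 exceeds gbdt's best index 0.6, so gbdt is stopped early.
scores = {
    "lr":   [0.2, 0.4, 0.5, 0.3, 0.6, 0.1, 0.7, 0.3],
    "gbdt": [0.5, 0.2, 0.1, 0.4, 0.2, 0.6],
    "dsn":  [0.61, 0.67, 0.63, 0.72, 0.8],
}
print(early_stopped(scores))   # {'gbdt'}
```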
The current-round performance score of each machine learning model in the present exemplary embodiment is calculated as follows. Since the gradient boosting decision tree machine learning model gbdt satisfies the early-stop condition as described above, it does not subsequently participate in training exploration. In this case, the top 5 (here, K is 5; however, the present invention is not limited thereto) of all the model evaluation indexes of the logistic regression machine learning model lr and the deep sparse network machine learning model dsn are: 0.7, 0.67, 0.63, 0.72 and 0.8. Among these, "0.7" is a model evaluation index of the logistic regression machine learning model lr, so the proportion of the top 5 best model evaluation indexes "0.7, 0.67, 0.63, 0.72 and 0.8" belonging to the logistic regression machine learning model lr is 1/5. In contrast, "0.67", "0.63", "0.72" and "0.8" among the top 5 best model evaluation indexes are model evaluation indexes of the deep sparse network machine learning model dsn, so the proportion of the top 5 best model evaluation indexes belonging to the deep sparse network machine learning model dsn is 4/5. Thus, in the present exemplary embodiment, the current-round performance score of the logistic regression machine learning model lr corresponds to 1/5 and the current-round performance score of the deep sparse network machine learning model dsn corresponds to 4/5.
The future potential score of each machine learning model is calculated in the present exemplary embodiment as follows. As described above, the array corresponding to the model evaluation indexes of the logistic regression machine learning model lr is [0.2,0.4,0.5,0.3,0.6,0.1,0.7,0.3], and the array corresponding to the model evaluation indexes of the deep sparse network machine learning model dsn is [0.61,0.67,0.63,0.72,0.8]. Here, the monotonically enhanced array is not necessarily a monotonically increasing array. In one example, when the training effect indicates validation-set accuracy, the monotonically enhanced array may be a monotonically increasing array. In another example, when the training effect indicates a mean square error, the monotonically enhanced array may be a monotonically decreasing array. In other words, an enhancement of a value in the monotonically enhanced array indicates an enhancement or optimization of the training effect. For convenience of description, it is assumed below that the array corresponding to a machine learning model sequentially includes a first model evaluation index to an X-th model evaluation index, where X is an integer greater than or equal to M. The step of extracting a monotonically enhanced array from the array corresponding to the machine learning model may comprise: extracting the first model evaluation index as the first value in the monotonically enhanced array. For example, in the first illustrative example, the array corresponding to the model evaluation indexes of the logistic regression machine learning model lr is [0.2,0.4,0.5,0.3,0.6,0.1,0.7,0.3], and thus 0.2 is extracted as the first value in the monotonically enhanced array. Furthermore, the step of extracting the monotonically enhanced array from the array corresponding to the machine learning model may further comprise: extracting any model evaluation index from the second model evaluation index to the X-th model evaluation index as a new value in the monotonically enhanced array if that model evaluation index is better than the maximum value in the current monotonically enhanced array. For example, for the second model evaluation index 0.4 of the logistic regression machine learning model lr, since 0.4 is greater than the maximum value (i.e., 0.2) in the current monotonically enhanced array (which at this time includes only the first value), 0.4 is extracted as the new value (i.e., the second value) in the monotonically enhanced array, and the monotonically enhanced array becomes [0.2,0.4]. Next, for the third model evaluation index 0.5 of the logistic regression machine learning model lr, since 0.5 is greater than the maximum value (i.e., 0.4) in the current monotonically enhanced array (which at this time includes the first value and the second value), 0.5 is extracted as the new value (i.e., the third value) in the monotonically enhanced array, and the monotonically enhanced array becomes [0.2,0.4,0.5].
Next, for the fourth model evaluation index 0.3 of the logistic regression machine learning model lr, since 0.3 is smaller than the maximum value (i.e., 0.5) in the current monotonically enhanced array (which at this time includes the first value, the second value and the third value), 0.3 is not extracted as a new value in the monotonically enhanced array, and the monotonically enhanced array is still [0.2,0.4,0.5]. Subsequently, the fifth model evaluation index 0.6 to the eighth model evaluation index 0.3 are processed in the same way as the second model evaluation index 0.4 to the fourth model evaluation index 0.3, and the resulting monotonically enhanced array is [0.2,0.4,0.5,0.6,0.7]. In the present invention, the length of an array indicates the number of values included in the array. In the first example, the length of the resulting monotonically enhanced array of the logistic regression machine learning model lr is 5 and the length of the array [0.2,0.4,0.5,0.3,0.6,0.1,0.7,0.3] corresponding to the logistic regression machine learning model lr is 8, so the future potential score of the logistic regression machine learning model lr is 5/8. Similarly, the length of the resulting monotonically enhanced array [0.61,0.67,0.72,0.8] of the deep sparse network machine learning model dsn is 4 and the length of the array [0.61,0.67,0.63,0.72,0.8] corresponding to the deep sparse network machine learning model dsn is 5, so the future potential score of the deep sparse network machine learning model dsn is 4/5.
Thus, the composite score of the logistic regression machine learning model lr may be: 1/5+5/8=33/40; the comprehensive score of the deep sparse network machine learning model dsn may be: 4/5+4/5=8/5. The resource allocation coefficients of the logistic regression machine learning model lr may be calculated as: (33/40)/(33/40+8/5) =33/97, the resource allocation coefficient of the deep sparse network machine learning model dsn can be calculated as: (8/5)/(33/40+8/5) =64/97. Determining a resource corresponding to the product of the resource allocation coefficient 33/97 of the logistic regression machine learning model lr and the total resource to be allocated as a resource to be allocated to the logistic regression machine learning model lr; the resource corresponding to the product of the resource allocation coefficient 64/97 of the deep sparse network machine learning model dsn and the total resource to be allocated is determined as the resource to be allocated to the deep sparse network machine learning model dsn.
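The following hedged sketch reproduces this worked example end to end: the top-K current-round performance score, the monotonically-enhanced-array future potential score, the composite score and the resource allocation coefficients. Function names are hypothetical, higher evaluation indexes are assumed to be better, and exact fractions are used so the 33/97 and 64/97 coefficients can be checked directly.

```python
from fractions import Fraction

def performance_scores(scores_by_model, k=5):
    """Current-round performance score: each model's share of the top-K
    evaluation indexes over all (non-stopped) models."""
    pool = sorted(((s, name) for name, scores in scores_by_model.items()
                   for s in scores), reverse=True)[:k]
    return {name: Fraction(sum(1 for _, n in pool if n == name), k)
            for name in scores_by_model}

def potential_score(scores):
    """Future potential score: length of the monotonically enhanced array
    divided by the length of the model's evaluation-index array."""
    mono = []
    for s in scores:
        if not mono or s > mono[-1]:
            mono.append(s)
    return Fraction(len(mono), len(scores))

def allocation_coefficients(scores_by_model, k=5):
    """Composite score = performance + potential; coefficients are each
    composite score divided by the sum of all composite scores."""
    perf = performance_scores(scores_by_model, k)
    composite = {name: perf[name] + potential_score(scores)
                 for name, scores in scores_by_model.items()}
    total = sum(composite.values())
    return {name: c / total for name, c in composite.items()}

# Usage with the example above (gbdt was stopped early, so it is excluded):
scores = {"lr":  [0.2, 0.4, 0.5, 0.3, 0.6, 0.1, 0.7, 0.3],
          "dsn": [0.61, 0.67, 0.63, 0.72, 0.8]}
print(allocation_coefficients(scores))   # lr -> 33/97, dsn -> 64/97
```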
An example of the allocation compensation mechanism is as follows: for ease of explanation, the following description is given by taking an example in which six machine learning models a, b, c, d, e, f are assigned the number of tasks [1,0,0,0,2,7], however, the present invention is not limited thereto, and the number of machine learning models and the number of resources (for example, the number of tasks) specifically assigned may be any other number. Since at least one machine learning model (i.e., machine learning model b through machine learning model d) is assigned a task number of 0, the assignment compensation mechanism is triggered. Here, the number of the resources of the machine learning model, of which the number of the resources allocated to the machine learning model is greater than one, is ordered in ascending order. That is, in the present example, the number of resources of the machine learning models (i.e., the machine learning model e and the machine learning model f) whose number of resources is greater than one among the task numbers to which the machine learning models a to f are assigned are ordered in ascending order [2,7]. In the present invention, the number 1 of resources does not necessarily represent a single resource, and it may represent one unit of resources, where one unit of resources corresponds to a predetermined number of resources. The number of resources 2 of the machine learning model e is subtracted by 1, and the reduced number of resources 1 is allocated to one machine learning model (for example, the machine learning model b) of which the number of resources is 0. Since the number of the resources of the machine learning model e after subtracting 1 becomes 1, the number of the resources of the machine learning model e is subsequently kept at 1, that is, the resources are no longer allocated from the resources of the machine learning model e to the other machine learning models. At this time, since there are also two machine learning models (i.e., the machine learning models c and d) whose number of resources is 0, it is considered to continue to allocate resources from the other machine learning models to the machine learning model whose number of resources is 0. Since the number of resources of the machine learning model e has become 1, at most the number of resources of the starting next machine learning model (i.e., the machine learning model f) is reduced to 1. The number of resources of the machine learning model f may be reduced from 7 to 5, and the reduced resources may be allocated to the machine learning model c and the machine learning model d, respectively, such that the number of resources of the machine learning model c and the machine learning model d are both 1. By assigning the compensation mechanism, the number of resources of the machine learning model a to the machine learning model f eventually becomes [1,1,1,1,1,5]. Therefore, after the allocation compensation mechanism is adopted, the machine learning model a to the machine learning model f are allocated with resources, so that the strong person is kept constant, the weak person still has an opportunity, and the machine learning model which only has poor performance at one time is prevented from being stopped to be explored, so that the exploration accuracy is further improved.
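A minimal sketch of this allocation compensation mechanism is given below; it assumes resources are counted in whole units and redistributes one unit at a time from the smallest allocation that is still greater than one, reproducing the [1,0,0,0,2,7] to [1,1,1,1,1,5] example.

```python
def compensate(allocation):
    """While some model has zero resources, take one unit from the smallest
    allocation that is still greater than one and give it to a zero-resource
    model (the ascending-order rule described above)."""
    allocation = dict(allocation)
    while 0 in allocation.values():
        # models that still hold more than one unit, smallest allocation first
        donors = sorted((n for n, r in allocation.items() if r > 1),
                        key=lambda n: allocation[n])
        if not donors:
            break                      # nothing left to redistribute
        receiver = next(n for n, r in allocation.items() if r == 0)
        allocation[donors[0]] -= 1
        allocation[receiver] += 1
    return allocation

# Usage: six models a..f assigned [1, 0, 0, 0, 2, 7] tasks, as in the example.
tasks = dict(zip("abcdef", [1, 0, 0, 0, 2, 7]))
print(compensate(tasks))   # {'a': 1, 'b': 1, 'c': 1, 'd': 1, 'e': 1, 'f': 5}
```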
The foregoing has described several innovative operators in detail. The following describes the process of encapsulating the content generated during processing into new data, sample, model and operator nodes and adding them to the node display list.
Specifically, data generated in the process of executing a data processing flow corresponding to the directed acyclic graph are packaged into nodes with the type of the data and stored; encapsulating samples generated in the process of executing the data processing flow corresponding to the directed acyclic graph into nodes with the types of the samples, and storing the nodes; packaging a model generated in the process of executing a data processing flow corresponding to the directed acyclic graph into a node with the model type and storing the node; and encapsulating the data processing flow corresponding to the execution directed acyclic graph into the nodes with the types of operators and storing the nodes. And adding the packaged nodes into a node display list for subsequent editing or creating a directed acyclic graph.
Since the data and the samples are static data, they are stored in the format of data/sample nodes, and corresponding bound node symbols are generated and added to the node display list. These data or samples can then be dragged from the node list as nodes of a directed acyclic graph (DAG).
In the embodiment of the present invention, the model generated in the process of executing the data processing flow corresponding to the directed acyclic graph is packaged into the node with the model type, which includes the following three cases:
1) Writing the description information of the model into a file, and packaging the file into nodes with the model file type. Here the description information of the model itself includes: the name of the model and the parameters of the model itself (i.e., model parameters determined by model training).
When the description information of the model itself is written into a file and the file is packaged into a node of the model-file type, the code of the processing logic of the model itself is also packaged into a node. This node has two inputs: one input is the model file and the other input is a sample. That is, the processing logic of the model algorithm is packaged in the node, and the model file supplies the model parameters needed to execute the prediction processing logic of the trained model on the sample. In this case, it is generally required that the acquisition process of the input sample is consistent with the acquisition process of the samples used when training the model.
2) Writing description information of the model, description information of a data source input into the model and description information of preprocessing and feature extraction processing of the data source into a file, and packaging the file into nodes with the model file type.
Also here, the description information of the model itself includes: the name of the model and the parameters of the model itself (i.e., model parameters determined by model training). The data source description information is field information (schema) of one or more data tables used as input data, for example, a data table of online shopping behavior of a user, whose fields are purchase time, name of the purchased commodity, season, discount information of the purchased commodity, and the like. The preprocessing of the data source comprises: normalization, null filling, etc. The feature extraction processing comprises: from which fields features are extracted, how features are extracted, automatic feature combinations, etc.
When the description information of the model, the description information of the data source and the description information of the preprocessing and feature extraction processing performed on the data source are written into a file and the file is packaged into a node of the model-file type, the description information of the data source to be analyzed and the description information of the preprocessing and feature extraction processing are acquired, an analysis code corresponding to the data processing code is generated, and the analysis code and the code of the processing logic of the model are packaged into the node. This node has two inputs: one input is the model file and the other input is a data source. The node performs preprocessing and feature extraction on the input data source to obtain samples, combines the parameters in the model file with the processing logic of the model to obtain the trained model, and then predicts on the samples according to the trained model and outputs the result.
3) And packaging the description information of the model, the description information of the data source, the codes for preprocessing the data source and extracting the characteristics and the codes for executing the processing logic of the model into the nodes with the model type.
Also here, the description information of the model itself includes: the name of the model and the parameters of the model itself (i.e., model parameters determined by model training). The data source description information is field information (schema) of one or more data tables used as input data, for example, a data table of online shopping behavior of a user, whose fields are purchase time, name of the purchased commodity, season, discount information of the purchased commodity, and the like. The preprocessing of the data source comprises: normalization, null filling, etc. The feature extraction processing comprises: from which fields features are extracted, how features are extracted, automatic feature combinations, etc.
In case 3), the description information of the model, the description information of the data source, the code for preprocessing and feature extraction of the data source, and the code for executing the processing logic of the model are packaged into a node of the model type; this node has a single input, which is the data source.
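To make cases 1) and 2) more concrete, the following sketch packages a model's description information into a model-file node whose prediction logic combines the model file with a sample input. The class name, field names and the linear prediction logic are all hypothetical; the embodiment does not prescribe a concrete file format.

```python
import json
from dataclasses import dataclass, field

@dataclass
class ModelFileNode:
    """A node of type 'model': the model file carries the description
    information, and the node's processing logic combines it with a sample
    (or raw data source) input at prediction time."""
    name: str
    model_params: dict                                  # parameters determined by training
    data_schema: list = field(default_factory=list)     # case 2): input fields
    preprocessing: list = field(default_factory=list)   # case 2): e.g. "fill_null"

    def to_model_file(self, path):
        # write the model's description information into a file (case 1)
        with open(path, "w") as f:
            json.dump({"name": self.name,
                       "model_params": self.model_params,
                       "data_schema": self.data_schema,
                       "preprocessing": self.preprocessing}, f)

    def predict(self, sample):
        # placeholder prediction logic: a linear score over named features,
        # standing in for the prediction processing logic of a trained model
        return sum(self.model_params.get(k, 0.0) * v for k, v in sample.items())

# Usage: package a toy model and run its node on one sample.
node = ModelFileNode("toy_lr", {"price": 0.3, "season": -0.1},
                     data_schema=["price", "season"], preprocessing=["fill_null"])
node.to_model_file("toy_lr.model.json")
print(node.predict({"price": 2.0, "season": 1.0}))   # 0.5
```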
In one embodiment of the disclosure, each node in the directed acyclic graph corresponds to an execution body, and the description information of the data source and the description information or code of the data processing logic of each execution body are passed along according to the execution sequence, so that each execution body can output the description information of the data source and the description information or code of the data processing logic of itself and of all of its upstream execution bodies.
Alternatively, in one embodiment of the disclosure, the execution body corresponding to each node in the directed acyclic graph has a corresponding information record file; if a previous-level execution body exists, the content of the information record file of that previous execution body is read, and the execution body stores the description information or code of its own data processing logic, together with the read content, into its own information record file.
In one embodiment of the present invention, in response to the operation of running the directed acyclic graph, the executing the data processing flow corresponding to the directed acyclic graph in step S120 includes: acquiring configuration information of a directed acyclic graph; according to the configuration information of the directed acyclic graph, determining a previous-stage node and a next-stage node of each node in the directed acyclic graph, and further determining a connection relationship between the nodes; and executing the data processing flow corresponding to each node according to the connection relation among the nodes.
The configuration information of the directed acyclic graph comprises configuration information of each node, and the configuration information of each node comprises input slot information (input slot) and output slot information (output slot); the input slot information is used for describing information of a previous level node, identification of the previous level node, data information output by the previous level node and the like, and correspondingly, the output slot information is used for describing information of a next level node, identification of the next level node, data information output to the next level node and the like.
The determining the previous level node and the next level node of each node in the directed acyclic graph according to the configuration information of the directed acyclic graph comprises: and determining a previous-stage node and a next-stage node of each node in the directed acyclic graph according to the input slot information and the output slot information in the configuration information of each node in the directed acyclic graph.
The configuration information of each node also comprises information for determining the operation mode of the node, in particular an identifier for indicating stand-alone operation or an identifier for indicating distributed operation. The method shown in fig. 1 further comprises: according to the configuration information of each node, the node is determined to operate in a stand-alone mode or in a distributed mode. Correspondingly, when the data processing flow corresponding to each node is executed, the data processing flow is executed in a single machine mode or a distributed mode.
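As a hedged illustration of how the input-slot information could be turned into the connection relationship and an execution order, the following sketch performs a topological sort over hypothetical node configurations; the configuration keys used here are assumptions for illustration, not the actual configuration format.

```python
from collections import defaultdict, deque

def execution_order(node_configs):
    """Derive the connection relationship from each node's input-slot
    information (the identifiers of its previous-level nodes) and return a
    topological execution order for the directed acyclic graph."""
    indegree = {node_id: len(cfg.get("input_slots", []))
                for node_id, cfg in node_configs.items()}
    downstream = defaultdict(list)
    for node_id, cfg in node_configs.items():
        for upstream in cfg.get("input_slots", []):
            downstream[upstream].append(node_id)

    ready = deque(n for n, d in indegree.items() if d == 0)
    order = []
    while ready:
        node_id = ready.popleft()
        order.append(node_id)
        for nxt in downstream[node_id]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    return order

# Usage: data -> split -> feature extraction -> model training.
configs = {
    "data":    {"input_slots": []},
    "split":   {"input_slots": ["data"]},
    "feature": {"input_slots": ["split"]},
    "train":   {"input_slots": ["feature"]},
}
for node_id in execution_order(configs):
    # each node would be executed here, in stand-alone or distributed mode
    print("running", node_id)
```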
Fig. 4 shows a schematic diagram of a system implementing data processing according to an embodiment of the invention. As shown in fig. 4, the system 400 for implementing data processing includes:
an operation unit 401 adapted to generate a corresponding directed acyclic graph in response to a user's operation of generating the directed acyclic graph;
the execution unit 402 is adapted to execute a data processing flow corresponding to the directed acyclic graph in response to an operation of executing the directed acyclic graph.
Wherein the operation unit 401 is adapted to display a first graphical user interface comprising a node presentation area and a canvas area, wherein the node types in the node presentation area comprise data, samples, models and operators; responsive to the operation of exposing the region selection node at the node, the respective node is displayed at the canvas region, and responsive to the operation of connecting the nodes, a connection is generated between the respective nodes in the canvas region to generate the directed acyclic graph.
The node display area comprises an element list and an operator list, wherein the element list comprises data, samples and models, and the operator list comprises various data processing operators related to machine learning; the node presentation area also includes a file list including a directed acyclic graph.
The nodes of the node display area further comprise directed acyclic graphs; the operating unit 401 is further adapted to perform at least one of the following: responsive to an operation of selecting a directed acyclic graph at a node presentation area, displaying the selected directed acyclic graph at a canvas area for direct execution or modification editing; responsive to an operation to save the directed acyclic graph in the canvas area, saving the directed acyclic graph and adding the saved directed acyclic graph to the node-presentation area; in response to the operation of exporting the directed acyclic graph, the corresponding directed acyclic graph is output to a specified export location.
The operating unit 401 is further adapted to perform at least one of the following: storing corresponding elements in response to an operation of importing the elements from outside, and adding the elements into the node display area; saving elements generated in the process of executing a data processing flow corresponding to the directed acyclic graph, and adding the saved elements into the node display area; providing a management page for managing elements generated in the process of executing a data processing flow corresponding to the directed acyclic graph, so that a user can check and delete the intermediate elements through the management page; outputting the corresponding element to a specified export location in response to an operation to export the element; wherein the element is data, a sample, or a model.
The operating unit 401 is further adapted to perform at least one of the following: responding to the operation of importing operators from the outside, storing codes corresponding to the corresponding operators, and adding the corresponding operators into a node display area; providing an operator code editing interface, acquiring and storing an input code from the interface, and adding a corresponding operator into a node display area.
The operation unit 401 is further adapted to perform the following operations: responding to the operation of selecting a node in the canvas area, displaying a configuration interface of the node, and completing the related configuration of the corresponding node according to the configuration operations on the configuration interface; when the required configuration of a node has not been completed, or the configured parameters do not meet the preset requirements, displaying a prompt identifier at the corresponding node in the canvas area.
Wherein the operation unit 401 is further adapted to perform at least one of the following: displaying a graphic control running the directed acyclic graph in the first graphic user interface, responding to the operation of triggering the graphic control, and executing a data processing flow corresponding to the directed acyclic graph according to each node in the directed acyclic graph and the connection relation among the nodes; and displaying a timer on the first graphical user interface, wherein the timer is used for timing the time for executing the data processing flow corresponding to the directed acyclic graph in real time.
Wherein the operation unit 401 is further adapted to perform at least one of the following: in the process of executing the data processing flow corresponding to the directed acyclic graph, displaying information for representing the progress of executing the corresponding node on each node of the directed acyclic graph on the first graphical user interface; in the process of executing the data processing flow corresponding to the directed acyclic graph, displaying an in-operation identifier on each node of the directed acyclic graph on the first graphical user interface, and displaying the executed identifier on one node when the data processing flow corresponding to the node is executed; and responding to the operation of checking the operation result of the node in the directed acyclic graph, and acquiring and displaying operation result data corresponding to the node.
Wherein the operation unit 401 is further adapted to perform one or more of the following: the data, samples, and models in the canvas area all support one or more of the following operations: copying, deleting and previewing; operators in the canvas area support one or more of the following operations: copying, deleting, previewing, running the current task, starting running from the current task, running to the current task, checking logs, and checking task details.
For the directed acyclic graph with the completed operation in the canvas area, in response to clicking one of the operators, displaying product type marks respectively corresponding to the types of products output by the operator, in response to clicking the product type marks, displaying a product related information interface, wherein the product related information interface comprises: a control for previewing the product, a control for exporting the product, a control for importing the product into an element list, basic information of the product and path information for storing the product; wherein the product types of the operator outputs include: data, samples, models, and reports.
Wherein, the node display area comprises one or more of the following operators:
data splitting operator: the data splitting methods provided in the configuration interface of the data splitting operator comprise one or more of splitting by proportion, splitting by rule, and sorting before splitting; when splitting by proportion is selected, proportional sequential splitting, proportional random splitting or proportional stratified splitting can be further selected, an input area for setting a random seed parameter is further provided on the configuration interface when random splitting is selected, and an input area for setting the field used as the stratification basis is further provided on the configuration interface when stratified splitting is selected; an input area for inputting splitting rules is provided when splitting by rule is selected; a split proportion selection item, an input area for setting the sorting field, and a sorting direction selection item are provided on the configuration interface when sorting before splitting is selected (a minimal splitting sketch is given after this operator list);
feature extraction operator: an interface for adding an input source and a script editing entry are provided in the configuration interface of the feature extraction operator, and at least one of a sample random ordering option, an accurate feature statistics option, an option for whether the output samples are compressed, an output plaintext option, a tag type option and an output result storage type option is further provided;
feature importance analysis operator: the input of the feature importance analysis operator is a sample table with a target value, and the output is a feature importance evaluation report; the report comprises the importance coefficient of each feature, and also comprises one or more of the number of features, the number of samples, and basic statistical information of each feature;
automatic feature combination operator: at least one of a feature selection item, a grading index selection item, a learning rate setting item and a termination condition selection item is provided in a configuration interface of the automatic feature combination operator, wherein the feature selection item is used for determining each feature for feature combination, and the termination condition selection item comprises the maximum number of running feature pools and the maximum number of output features.
automatic parameter adjustment operator: the automatic parameter adjustment operator is used for searching out suitable parameters from a given parameter range according to a parameter adjustment algorithm, training a model by using the searched parameters, and carrying out model evaluation; at least one of a feature selection setting item, a parameter adjustment method option and a parameter adjustment times setting item is provided in the configuration interface of the automatic parameter adjustment operator, wherein all features or a user-defined feature selection can be selected in the feature selection setting item, and random search or grid search can be selected in the parameter adjustment method option;
TensorFlow operator: the TensorFlow operator is used for running TensorFlow codes written by a user, and an input source setting item and a TensorFlow code file path setting item are provided in a configuration interface of the TensorFlow operator;
custom script operator: used for providing an interface through which a user writes a custom operator in a specific scripting language; an input source setting item and a script editing entry are provided in the configuration of the custom script operator.
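For illustration only, the proportional splitting modes named for the data splitting operator above can be pictured with the following minimal Python sketch; it assumes a pandas DataFrame input, the function name split_by_ratio and its parameters are invented for this sketch rather than taken from the patent, and splitting by rule and sorting before splitting are omitted.

    import pandas as pd

    def split_by_ratio(df: pd.DataFrame, ratio: float = 0.8, mode: str = "sequential",
                       random_seed=None, stratify_field=None):
        """Split df into two parts; mode is "sequential", "random" or "stratified"."""
        if mode == "sequential":
            cut = int(len(df) * ratio)          # proportional sequential splitting
            return df.iloc[:cut], df.iloc[cut:]
        if mode == "random":
            part_a = df.sample(frac=ratio, random_state=random_seed)  # random seed parameter
            return part_a, df.drop(part_a.index)
        if mode == "stratified":
            # keep the distribution of the stratification field in both parts
            part_a = (df.groupby(stratify_field, group_keys=False)
                        .apply(lambda g: g.sample(frac=ratio, random_state=random_seed)))
            return part_a, df.drop(part_a.index)
        raise ValueError(f"unknown split mode: {mode}")

A call such as split_by_ratio(df, 0.8, mode="stratified", stratify_field="label") would correspond to proportional stratified splitting with the stratification field set in the configuration interface.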
Wherein the feature importance analysis operator determines the importance of a feature by at least one of:
training at least one feature pool model based on a sample set, wherein the feature pool model refers to a machine learning model for providing a prediction result about a machine learning problem based on at least a part of features contained in a sample, obtaining an effect of the at least one feature pool model, and determining importance of the features according to the obtained effect of the at least one feature pool model; wherein the feature pool model is trained by performing a discretization operation on at least one continuous feature among the at least a portion of the features;
determining a basic feature subset of the sample, and determining a plurality of target feature subsets whose importance is to be determined; for each target feature subset of the plurality of target feature subsets, acquiring a corresponding composite machine learning model, wherein the composite machine learning model comprises a basic sub-model trained according to a boosting framework and an additional sub-model, wherein the basic sub-model is trained based on the basic feature subset, and the additional sub-model is trained based on the corresponding target feature subset; and determining the importance of the plurality of target feature subsets according to the effect of the composite machine learning model;
Pre-ordering the importance of at least one candidate feature in the sample, and screening a part of the candidate features from the at least one candidate feature according to the pre-ordering result to form a candidate feature pool; and re-ordering the importance of each candidate feature in the candidate feature pool, and selecting at least one candidate feature of higher importance from the candidate feature pool as an important feature according to the re-ordering result (a two-stage screening sketch is given after this list).
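The pre-ordering and re-ordering approach in the last item above can be sketched as follows; this is a minimal illustration assuming a binary classification sample held in a NumPy feature matrix, and it uses mutual information as the cheap pre-ordering proxy and the cross-validated AUC of a single-feature logistic regression as the costlier re-ordering measure, both of which are assumptions of the sketch rather than choices prescribed by the patent.

    import numpy as np
    from sklearn.feature_selection import mutual_info_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    def screen_important_features(X, y, feature_names, pool_ratio=0.3, top_k=10):
        """Two-stage screening: cheap pre-ordering, then a costlier re-ordering."""
        # pre-ordering: rank all candidate features by mutual information with the target
        pre_scores = mutual_info_classif(X, y)
        order = np.argsort(pre_scores)[::-1]
        pool = order[: max(1, int(len(order) * pool_ratio))]   # candidate feature pool

        # re-ordering: score each pooled feature by a single-feature model's cross-validated AUC
        rescored = []
        for idx in pool:
            auc = cross_val_score(LogisticRegression(max_iter=1000),
                                  X[:, [idx]], y, cv=3, scoring="roc_auc").mean()
            rescored.append((feature_names[idx], auc))
        rescored.sort(key=lambda item: item[1], reverse=True)
        return rescored[:top_k]   # features of higher importance after re-ordering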
Wherein the automatic feature combination operator performs feature combination by at least one of:
performing at least one binning operation for each continuous feature in the sample to obtain a binning group feature composed of at least one binned feature, wherein each binning operation corresponds to one binned feature; and generating combined features of the machine learning sample by feature combination between the binned features and/or other discrete features in the sample (a binning sketch is given after this list);
performing feature combinations between at least one feature of the sample stage by stage in accordance with a heuristic search strategy to generate candidate combined features, wherein for each stage, a target combined feature is selected from a set of candidate combined features as a combined feature of the machine-learned sample;
Obtaining unit features capable of being combined in a sample; providing a graphical interface for setting feature combination configuration items to a user, wherein the feature combination configuration items are used for limiting how feature combinations are performed among unit features; receiving input operation which is executed on a graphical interface by a user for setting feature combination configuration items, and acquiring the feature combination configuration items set by the user according to the input operation; combining the features to be combined among the unit features based on the acquired feature combination configuration items to generate combined features of the machine learning sample;
iteratively performing feature combinations between at least one discrete feature of the sample in accordance with a search strategy to generate candidate combined features, and selecting a target combined feature from the generated candidate combined features as a combined feature; for each round of iteration, pre-ordering importance of each candidate combination feature in the candidate combination feature set, screening a part of candidate combination features from the candidate combination feature set according to a pre-ordering result to form a candidate combination feature pool, re-ordering importance of each candidate combination feature in the candidate combination feature pool, and selecting at least one candidate combination feature with higher importance from the candidate combination feature pool according to the re-ordering result as a target combination feature;
Screening a plurality of key unit features from the features of the sample; and obtaining at least one combined feature from the plurality of key unit features by using an automatic feature combination algorithm, wherein each combined feature is formed by combining corresponding part of key unit features in the plurality of key unit features; and taking the obtained at least one combined characteristic as an automatically generated combined characteristic.
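The binning-based combination described in the first item of the list above can be pictured with the following minimal pandas sketch; crossing columns by string concatenation is only one simple realization, and the column names and bin counts are illustrative assumptions.

    import pandas as pd

    def binning_combination(df, continuous_cols, discrete_cols, bin_counts=(5, 10)):
        """Discretize each continuous feature with several binning operations,
        then cross the binned features with the discrete features."""
        out = df.copy()
        binned_cols = []
        for col in continuous_cols:
            for k in bin_counts:   # each binning operation yields one binned feature
                name = f"{col}_bin{k}"
                out[name] = pd.qcut(out[col], q=k, duplicates="drop", labels=False)
                binned_cols.append(name)
        # feature combination between the binned features and the other discrete features
        for b in binned_cols:
            for d in discrete_cols:
                out[f"{b}_x_{d}"] = out[b].astype(str) + "_" + out[d].astype(str)
        return out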
Wherein the automatic parameter adjustment operator performs automatic parameter adjustment by any one of the following means:
the following steps are performed in each iteration: determining currently available resources; scoring a plurality of super-parameter tuning strategies respectively, and distributing currently available resources for the plurality of super-parameter tuning strategies according to scoring results, wherein each super-parameter tuning strategy is used for selecting a super-parameter combination for the machine learning model based on a super-parameter selection strategy corresponding to the super-parameter tuning strategy; acquiring one or more super-parameter combinations generated by each super-parameter tuning strategy allocated to the resource based on the allocated resource respectively;
in the competition stage, training corresponding competition models according to a machine learning algorithm under a plurality of competition super-parameter combinations to obtain the competition model with the best effect, and taking the obtained competition model and the competition super-parameter combination corresponding to it as the winning model and the winning super-parameter combination to enter the growth stage; in the growth stage, under the winning super-parameter combination obtained in the competition stage of this round, continuing to train the winning model obtained in the competition stage of this round and obtaining the effect of the winning model; if the effect of the winning model indicates that the model effect has stopped growing, restarting the competition stage to train updated competition models under an updated plurality of competition super-parameter combinations according to the machine learning algorithm, otherwise continuing to train the winning model, and repeating the iteration until a preset termination condition is met; the updated competition super-parameter combinations are obtained based on the winning super-parameter combinations of the previous growth stage, and the updated competition models are each the winning model obtained in the previous growth stage.
Respectively carrying out one round of super-parameter exploration training on a plurality of machine learning algorithms, wherein in this round of exploration, each machine learning algorithm explores at least M groups of super-parameters, and M is a positive integer greater than 1; calculating the current round of performance score of each machine learning algorithm based on model evaluation indexes respectively corresponding to a plurality of groups of super parameters respectively explored by the machine learning algorithms in this round, and calculating the future potential score of each machine learning algorithm; combining the current round of performance scores and future potential scores of each machine learning algorithm, and determining a resource allocation scheme for allocating available resources to each machine learning algorithm; and carrying out corresponding resource scheduling in the next round of super-parameter exploration training according to the resource allocation scheme (a sketch of the scoring and allocation is given after this list).
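A sketch of the scoring and allocation in the last item above, following the computation spelled out in claim 13 below: the current-round performance score is taken as an algorithm's share of the global top-K evaluation indexes, and the future potential score as the ratio of the length of a monotonically increasing subsequence of its evaluation indexes to the length of the full array; combining the two scores by simple addition and allocating the budget proportionally are assumptions of this sketch.

    from typing import Dict, List

    def monotone_increasing_length(values: List[float]) -> int:
        """Length of the longest non-decreasing subsequence of evaluation indexes."""
        best = [1] * len(values)
        for i in range(1, len(values)):
            for j in range(i):
                if values[i] >= values[j]:
                    best[i] = max(best[i], best[j] + 1)
        return max(best) if best else 0

    def allocate_budget(round_scores: Dict[str, List[float]], budget: int,
                        top_k: int = 10) -> Dict[str, int]:
        """Combine current-round performance and future potential scores to split a budget."""
        # current-round performance: each algorithm's share of the global top-K evaluation indexes
        flattened = sorted(((s, algo) for algo, scores in round_scores.items() for s in scores),
                           reverse=True)[:top_k]
        performance = {algo: sum(1 for _, a in flattened if a == algo) / top_k
                       for algo in round_scores}
        # future potential: monotonically increasing subsequence length / array length
        potential = {algo: monotone_increasing_length(scores) / max(1, len(scores))
                     for algo, scores in round_scores.items()}
        combined = {algo: performance[algo] + potential[algo] for algo in round_scores}
        total = sum(combined.values()) or 1.0
        return {algo: round(budget * combined[algo] / total) for algo in round_scores}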
Wherein the operation unit 401 is further adapted to perform one or more of the following:
packaging data generated in the process of executing a data processing flow corresponding to the directed acyclic graph into nodes with the type of data, and storing the nodes;
encapsulating samples generated in the process of executing the data processing flow corresponding to the directed acyclic graph into nodes with the types of the samples, and storing the nodes;
packaging a model generated in the process of executing a data processing flow corresponding to the directed acyclic graph into a node with the model type and storing the node;
And packaging the data processing flow corresponding to executing the directed acyclic graph into nodes with the type of operator, and storing the nodes.
The operation unit 401 is further adapted to add the encapsulated nodes to a node presentation list for subsequent editing or creation of a directed acyclic graph.
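The encapsulation and node-list registration described above can be pictured with a minimal data-structure sketch; the EncapsulatedNode class, its fields and the JSON record format are illustrative assumptions and not the storage format actually used by the system.

    import json
    import pathlib
    import uuid
    from dataclasses import dataclass, field

    @dataclass
    class EncapsulatedNode:
        node_type: str          # "data", "sample", "model" or "operator"
        payload_path: str       # where the artifact produced by the run is stored
        meta: dict = field(default_factory=dict)
        node_id: str = field(default_factory=lambda: uuid.uuid4().hex)

    def encapsulate(artifact_path: str, node_type: str, store_dir: str, **meta) -> EncapsulatedNode:
        """Wrap an artifact produced while running the directed acyclic graph into a typed,
        reusable node and persist its record so it can be added to the node display list."""
        node = EncapsulatedNode(node_type=node_type, payload_path=artifact_path, meta=meta)
        pathlib.Path(store_dir).mkdir(parents=True, exist_ok=True)
        record = pathlib.Path(store_dir) / f"{node.node_id}.json"
        record.write_text(json.dumps({"node_id": node.node_id, "node_type": node.node_type,
                                      "payload_path": node.payload_path, "meta": node.meta}))
        return node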
The operation unit 401 is adapted to write the description information of the model itself into a file, and encapsulate the file into a node with a type of model file; or writing description information of the model, description information of the data source and description information of preprocessing and feature extraction processing of the data source into a file, and packaging the file into nodes with the model file type; or, the description information of the model itself, the description information of the data source, the codes for preprocessing the data source and extracting the characteristics and the codes for executing the processing logic of the model itself are packaged into the nodes with the model type.
Wherein each node in the directed acyclic graph corresponds to an execution body, and the data source description information and the description information or code of the data processing logic of each execution body are passed through according to the execution sequence, so that each execution body can output the data source description information and the description information or code of the data processing logic of itself and of all of its upper-level execution bodies;
Alternatively, each execution body corresponding to a node in the directed acyclic graph is provided with a corresponding information record file; if an upper-level execution body exists, the content of the upper-level execution body's information record file is read, and the execution body stores the description information or code of its own data processing logic, together with the read content, into its own information record file.
Wherein the data source description information is field information of one or more data tables as input data.
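The second mechanism above, in which each execution body owns an information record file, could be sketched as follows; the JSON file format, the field names and the example step names are assumptions made only for illustration.

    import json
    import pathlib

    def update_record_file(own_record: str, own_logic: dict, upstream_record: str = None) -> None:
        """An execution body appends the description of its own data processing logic to the
        content read from the upstream execution body's information record file."""
        entries = []
        upstream = pathlib.Path(upstream_record) if upstream_record else None
        if upstream is not None and upstream.exists():
            entries = json.loads(upstream.read_text())
        entries.append(own_logic)   # e.g. {"step": "feature_extraction", "fields": ["f1", "f2"]}
        pathlib.Path(own_record).write_text(json.dumps(entries, indent=2))

    # usage: a feature-extraction body records its logic after reading the data-split body's file
    # update_record_file("feature_extraction.json",
    #                    {"step": "feature_extraction", "fields": ["f1", "f2"]},
    #                    upstream_record="data_split.json")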
The operation unit 401 is adapted to: in the case where the description information of the model itself is written into a file and the file is encapsulated into a node of the model file type, also encapsulate the code of the processing logic of the model itself into a node whose input is the model file; and in the case where the description information of the model itself, the data source description information of the input model, and the description information of the preprocessing and feature extraction processing of the data source are written into a file and the file is encapsulated into a node of the model file type, parse the data source description information and the description information of the preprocessing and feature extraction processing to generate corresponding data processing code, and encapsulate the generated code together with the code of the processing logic of the model itself into a node whose input is the model file.
Wherein the running unit 402 is adapted to obtain configuration information of the directed acyclic graph; according to the configuration information of the directed acyclic graph, determining a previous-stage node and a next-stage node of each node in the directed acyclic graph, and further determining a connection relationship between the nodes; and executing the data processing flow corresponding to each node according to the connection relation among the nodes.
The configuration information of the directed acyclic graph comprises configuration information of each node, and the configuration information of each node comprises input slot information and output slot information; the running unit 402 is adapted to determine a previous level node and a next level node of each node in the directed acyclic graph according to the input slot information and the output slot information in the configuration information of each node in the directed acyclic graph.
The configuration information of the directed acyclic graph comprises configuration information of each node; the running unit 402 is further adapted to determine, according to the configuration information of each node, whether the node is operated in a stand-alone manner or in a distributed manner.
The information used for determining the operation mode of each node in the configuration information of each node comprises the following steps: an identification indicating stand-alone operation or an identification indicating distributed operation.
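The slot-based wiring and per-node run mode described above could be realized along the following lines; this is a minimal sketch that assumes each node's configuration carries "id", "inputs", "outputs" and "mode" entries, which are hypothetical field names, and that run_node dispatches to a stand-alone or distributed runner.

    from collections import defaultdict, deque

    def build_links(node_configs):
        """Derive previous-level/next-level relations from input and output slot information.
        Each config is assumed to look like:
        {"id": "n2", "inputs": ["slot_a"], "outputs": ["slot_b"], "mode": "standalone"}."""
        producer_of = {}
        for cfg in node_configs:
            for slot in cfg.get("outputs", []):
                producer_of[slot] = cfg["id"]
        next_nodes, indegree = defaultdict(list), defaultdict(int)
        for cfg in node_configs:
            for slot in cfg.get("inputs", []):
                if slot in producer_of:
                    next_nodes[producer_of[slot]].append(cfg["id"])
                    indegree[cfg["id"]] += 1
        return next_nodes, indegree

    def run_dag(node_configs, run_node):
        """Execute the nodes in topological order; run_node(cfg) is expected to dispatch to a
        stand-alone or distributed runner according to cfg["mode"]."""
        next_nodes, indegree = build_links(node_configs)
        by_id = {cfg["id"]: cfg for cfg in node_configs}
        ready = deque(cfg["id"] for cfg in node_configs if indegree[cfg["id"]] == 0)
        while ready:
            node_id = ready.popleft()
            run_node(by_id[node_id])
            for nxt in next_nodes[node_id]:
                indegree[nxt] -= 1
                if indegree[nxt] == 0:
                    ready.append(nxt)

Kahn's algorithm is used here so that a node is executed only after all of its previous-level nodes have finished.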
A method and system for implementing data processing according to an exemplary embodiment of the present disclosure have been described above with reference to fig. 1 to 4. However, it should be understood that: the apparatus and systems shown in the figures may each be configured to include software, hardware, firmware, or any combination thereof to perform a particular function. For example, these systems, devices may correspond to application specific integrated circuits, and may also correspond to modules of software combined with hardware. Furthermore, one or more functions implemented by these systems or apparatuses may also be performed uniformly by components in a physical entity device (e.g., a processor, a client, a server, or the like).
Furthermore, the above-described method of implementing data processing may be implemented by instructions recorded on a computer-readable storage medium. For example, according to an exemplary embodiment of the present disclosure, a computer-readable storage medium storing instructions may be provided, wherein the instructions, when executed by at least one computing device, cause the at least one computing device to perform the process of implementing data processing disclosed in this document, for example the following steps: generating a corresponding directed acyclic graph in response to a user's operation of generating a directed acyclic graph; and executing the data processing flow corresponding to the directed acyclic graph in response to an operation of running the directed acyclic graph.
The instructions stored in the computer-readable storage medium described above may be run in an environment deployed on a computer device such as a client, host, proxy device, or server. It should be noted that the instructions may also be used to perform additional steps beyond those described above, or to perform more specific processing when the above steps are performed.
It should be noted that a system implementing data processing according to an exemplary embodiment of the present disclosure may completely rely on the execution of a computer program or instructions to implement the respective functions, i.e., the respective devices correspond to the respective steps in the functional architecture of the computer program, so that the entire system is called through a dedicated software package (e.g., lib library) to implement the respective functions.
On the other hand, when the system of the present invention and its functions are implemented by software, firmware, middleware or microcode, the program code or code segments to perform the corresponding operations may be stored in a computer-readable medium, such as a storage medium, so that at least one processor or at least one computing device can perform the corresponding operations by reading and executing the corresponding program code or code segments.
For example, in accordance with an exemplary embodiment of the present disclosure, a system may be provided that includes at least one computing device and at least one storage device storing instructions that, when executed by the at least one computing device, cause the at least one computing device to perform the disclosed process of implementing data processing, for example the following steps: generating a corresponding directed acyclic graph in response to a user's operation of generating a directed acyclic graph; and executing the data processing flow corresponding to the directed acyclic graph in response to an operation of running the directed acyclic graph.
The at least one computing device may include a Central Processing Unit (CPU), a Graphics Processor (GPU), a programmable logic device, a special purpose processor system, a microcontroller or microprocessor, a display device, or the like. By way of example and not limitation, the at least one computing device may also include an analog processor, a digital processor, a microprocessor, a multi-core processor, a processor array, a network processor, and the like. The computing device may execute instructions or code stored in one of the storage devices, wherein the storage devices may also store data. Instructions and data may also be transmitted and received over a network via a network interface device, which may employ any known transmission protocol.
The storage device may be integrated with the computing device, for example, with RAM or flash memory disposed within an integrated circuit microprocessor or the like. Further, the storage devices may include stand-alone devices, such as external disk drives, storage arrays, or other storage devices usable by any database system. The storage device and the computing device may be operatively coupled or may communicate with each other, such as through an I/O port, network connection, or the like, such that the computing device is capable of reading instructions stored in the storage device.
The foregoing description of exemplary embodiments of the present disclosure is intended to be illustrative only and not exhaustive, and the present disclosure is not limited to the exemplary embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the disclosure. Accordingly, the scope of the present disclosure should be determined by the scope of the claims.
Claims (50)
1. A method of implementing data processing, wherein the method comprises:
in response to a user's operation of generating a directed acyclic graph, generating a corresponding directed acyclic graph;
in response to an operation of running the directed acyclic graph, executing a data processing flow corresponding to the directed acyclic graph, wherein the data processing flow corresponding to the directed acyclic graph is a data processing flow related to machine learning,
wherein the generating the corresponding directed acyclic graph in response to the user's operation of generating the directed acyclic graph comprises:
displaying a first graphical user interface comprising a node presentation area and a canvas area, wherein node types in the node presentation area comprise data, samples, models and operators;
in response to an operation of selecting a node in the node presentation area, displaying the corresponding node in the canvas area, and in response to an operation of connecting nodes, generating a connection between the corresponding nodes in the canvas area to generate a directed acyclic graph,
Wherein the node presentation area comprises an element list and an operator list, the element list comprises data, samples and models, the samples are samples related to machine learning, and the operator list comprises various data processing operators related to machine learning;
the node display area comprises one or more of the following operators: a data splitting operator; a feature extraction operator; a feature importance analysis operator; an automatic feature combination operator; an automatic parameter adjustment operator; a TensorFlow operator; and a custom script operator.
2. The method of claim 1, wherein,
the node presentation area also includes a file list including a directed acyclic graph.
3. The method of claim 1, wherein the nodes of the node presentation area further comprise directed acyclic graphs;
the method further comprises at least one of the following:
responsive to an operation of selecting a directed acyclic graph at a node presentation area, displaying the selected directed acyclic graph at a canvas area for direct execution or modification editing;
responsive to an operation to save the directed acyclic graph in the canvas area, saving the directed acyclic graph and adding the saved directed acyclic graph to the node-presentation area;
In response to the operation of exporting the directed acyclic graph, the corresponding directed acyclic graph is output to a specified export location.
4. The method of claim 1, wherein the method further comprises at least one of:
storing corresponding elements in response to an operation of importing the elements from outside, and adding the elements into the node display area;
saving elements generated in the process of executing a data processing flow corresponding to the directed acyclic graph, and adding the saved elements into the node display area;
providing a management page for managing elements generated in the process of executing a data processing flow corresponding to the directed acyclic graph, so that a user can check and delete the intermediate elements through the management page;
outputting the corresponding element to a specified export location in response to an operation to export the element;
wherein the element is data, a sample, or a model.
5. The method of claim 1, wherein the method further comprises at least one of:
responding to the operation of importing operators from the outside, storing codes corresponding to the corresponding operators, and adding the corresponding operators into a node display area;
providing an operator code editing interface, acquiring and storing an input code from the interface, and adding a corresponding operator into a node display area.
6. The method of claim 1, wherein the method further comprises:
responding to the operation of selecting one node in the canvas area, displaying a configuration interface of the node, and completing the related configuration of the corresponding node according to the configuration operation on the configuration interface;
when a node has not completed its necessary configuration, or the configured parameters do not meet the preset requirements, a prompt identifier is displayed at the corresponding node in the canvas area.
7. The method of claim 1, wherein the method further comprises at least one of:
displaying a graphic control running the directed acyclic graph in the first graphic user interface, responding to the operation of triggering the graphic control, and executing a data processing flow corresponding to the directed acyclic graph according to each node in the directed acyclic graph and the connection relation among the nodes;
and displaying a timer on the first graphical user interface, wherein the timer is used for timing the time for executing the data processing flow corresponding to the directed acyclic graph in real time.
8. The method of claim 1, wherein the method further comprises at least one of:
in the process of executing the data processing flow corresponding to the directed acyclic graph, displaying information for representing the progress of executing the corresponding node on each node of the directed acyclic graph on the first graphical user interface;
In the process of executing the data processing flow corresponding to the directed acyclic graph, displaying an in-operation identifier on each node of the directed acyclic graph on the first graphical user interface, and displaying the executed identifier on one node when the data processing flow corresponding to the node is executed;
and responding to the operation of checking the operation result of the node in the directed acyclic graph, and acquiring and displaying operation result data corresponding to the node.
9. The method of claim 1, wherein the method comprises one or more of:
the data, samples, and models in the canvas area all support one or more of the following operations: copying, deleting and previewing;
operators in the canvas area support one or more of the following operations: copying, deleting, previewing, running the current task, starting running from the current task, running to the current task, checking logs and checking task details;
for the directed acyclic graph with the completed operation in the canvas area, in response to clicking one of the operators, displaying product type marks respectively corresponding to types of products output by the operator, in response to clicking the product type marks, displaying a product related information interface, wherein the product related information interface comprises: a control for previewing the product, a control for exporting the product, a control for importing the product into an element list, basic information of the product and path information for storing the product; wherein the product types of the operator outputs include: data, samples, models, and reports.
10. The method of claim 1, wherein,
the data splitting methods provided in the configuration interface of the data splitting operator comprise one or more of splitting by proportion, splitting by rule, and sorting before splitting; when splitting by proportion is selected, proportional sequential splitting, proportional random splitting or proportional stratified splitting can be further selected, an input area for setting a random seed parameter is further provided on the configuration interface when random splitting is selected, and an input area for setting the field used as the stratification basis is further provided on the configuration interface when stratified splitting is selected; an input area for inputting splitting rules is provided when splitting by rule is selected; a split proportion selection item, an input area for setting the sorting field, and a sorting direction selection item are provided on the configuration interface when sorting before splitting is selected;
providing an interface for adding an input source and a script editing entry in the configuration interface of the feature extraction operator, and providing at least one of a sample random ordering option, an accurate feature statistics option, an option for whether the output samples are compressed, an output plaintext option, a tag type option and an output result storage type option;
The input of the feature importance analysis operator is a sample table with a target value, and the output is a feature importance evaluation report; the report comprises the importance coefficient of each feature, and also comprises one or more of the number of features, the number of samples, and basic statistical information of each feature;
providing at least one of a feature selection item, a grading index selection item, a learning rate setting item and a termination condition selection item in a configuration interface of the automatic feature combination operator, wherein the feature selection item is used for determining each feature for feature combination, and the termination condition selection item comprises the maximum number of running feature pools and the maximum number of output features;
the automatic parameter adjustment operator is used for searching proper parameters from a given parameter range according to a parameter adjustment algorithm, training a model by using the searched parameters and carrying out model evaluation; providing at least one of a feature selection setting item, a parameter adjustment method option and a parameter adjustment times setting item in a configuration interface of the automatic parameter adjustment operator, wherein all features or user-defined feature selection can be selected in the feature selection setting item, and random search or grid search can be selected in the parameter adjustment method option;
the TensorFlow operator is used for running TensorFlow codes written by a user, and an input source setting item and a TensorFlow code file path setting item are provided in a configuration interface of the TensorFlow operator;
The custom script operator is used for providing an interface for a user to custom write the operator by using a specific script language, and an input source setting item and a script editing entry are provided in the configuration of the custom script operator.
11. The method of claim 10, wherein the feature importance analysis operator determines the importance of a feature by at least one of:
training at least one feature pool model based on a sample set, wherein the feature pool model refers to a machine learning model for providing a prediction result about a machine learning problem based on at least a part of features contained in a sample, obtaining an effect of the at least one feature pool model, and determining importance of the features according to the obtained effect of the at least one feature pool model; wherein the feature pool model is trained by performing a discretization operation on at least one continuous feature among the at least a portion of the features;
determining a basic feature subset of the sample, and determining a plurality of target feature subsets whose importance is to be determined; for each target feature subset of the plurality of target feature subsets, acquiring a corresponding composite machine learning model, wherein the composite machine learning model comprises a basic sub-model trained according to a boosting framework and an additional sub-model, wherein the basic sub-model is trained based on the basic feature subset, and the additional sub-model is trained based on the corresponding target feature subset; and determining the importance of the plurality of target feature subsets according to the effect of the composite machine learning model;
Pre-ordering the importance of at least one candidate feature in the sample, and screening a part of candidate features from the at least one candidate feature according to a pre-ordering result to form a candidate feature pool; and re-ordering the importance of each candidate feature in the candidate feature pool, and selecting at least one candidate feature with higher importance from the candidate feature pool as an important feature according to the re-ordering result.
12. The method of claim 10, wherein the automatic feature combination operator performs feature combination by at least one of:
performing at least one binning operation for each continuous feature in the sample to obtain a binning group feature composed of at least one binned feature, wherein each binning operation corresponds to one binned feature; and generating combined features of the machine learning sample by feature combination between the binned features and/or other discrete features in the sample;
performing feature combinations between at least one feature of the sample stage by stage in accordance with a heuristic search strategy to generate candidate combined features, wherein for each stage, a target combined feature is selected from a set of candidate combined features as a combined feature of the machine-learned sample;
Obtaining unit features capable of being combined in a sample; providing a graphical interface for setting feature combination configuration items to a user, wherein the feature combination configuration items are used for limiting how feature combinations are performed among unit features; receiving input operation which is executed on a graphical interface by a user for setting feature combination configuration items, and acquiring the feature combination configuration items set by the user according to the input operation; combining the features to be combined among the unit features based on the acquired feature combination configuration items to generate combined features of the machine learning sample;
iteratively performing feature combinations between at least one discrete feature of the sample in accordance with a search strategy to generate candidate combined features, and selecting a target combined feature from the generated candidate combined features as a combined feature; for each round of iteration, pre-ordering importance of each candidate combination feature in the candidate combination feature set, screening a part of candidate combination features from the candidate combination feature set according to a pre-ordering result to form a candidate combination feature pool, re-ordering importance of each candidate combination feature in the candidate combination feature pool, and selecting at least one candidate combination feature with higher importance from the candidate combination feature pool according to the re-ordering result as a target combination feature;
Screening a plurality of key unit features from the features of the sample; and obtaining at least one combined feature from the plurality of key unit features by using an automatic feature combination algorithm, wherein each combined feature is formed by combining corresponding part of key unit features in the plurality of key unit features; and taking the obtained at least one combined characteristic as an automatically generated combined characteristic.
13. The method of claim 10, wherein the automatic parameter adjustment operator performs automatic parameter adjustment by any one of:
the following steps are performed in each iteration: determining currently available resources; scoring a plurality of super-parameter tuning strategies respectively, and distributing currently available resources for the plurality of super-parameter tuning strategies according to scoring results, wherein each super-parameter tuning strategy is used for selecting a super-parameter combination for a machine learning model based on a super-parameter selection strategy corresponding to the super-parameter tuning strategy; acquiring one or more super-parameter combinations generated by each super-parameter tuning strategy allocated to the resource based on the allocated resource respectively;
in the competition stage, training corresponding competition models according to a machine learning algorithm under a plurality of competition super-parameter combinations to obtain the competition model with the best effect, and taking the obtained competition model and the competition super-parameter combination corresponding to it as the winning model and the winning super-parameter combination to enter the growth stage; in the growth stage, under the winning super-parameter combination obtained in the competition stage of this round, continuing to train the winning model obtained in the competition stage of this round and obtaining the effect of the winning model; if the effect of the winning model indicates that the model effect has stopped growing, restarting the competition stage to train updated competition models under an updated plurality of competition super-parameter combinations according to the machine learning algorithm, otherwise continuing to train the winning model, and repeating the iteration until the preset termination condition is met; the updated competition super-parameter combinations are obtained based on the winning super-parameter combinations of the previous growth stage, and the updated competition models are each the winning model obtained in the previous growth stage;
Respectively carrying out one round of super-parameter exploration training on a plurality of machine learning algorithms, wherein in the round of exploration, each machine learning algorithm explores at least M groups of super-parameters, and M is a positive integer greater than 1; calculating the current round of performance score of each machine learning algorithm based on model evaluation indexes respectively corresponding to a plurality of groups of super parameters respectively explored by the machine learning algorithms in the round, and calculating the future potential score of each machine learning algorithm; combining the current round of performance scores and future potential scores of each machine learning algorithm, and determining a resource allocation scheme for allocating available resources to each machine learning algorithm; and carrying out corresponding resource scheduling in the next round of super-parameter exploration training according to the resource allocation scheme, wherein the calculating the current round of performance score of each machine learning algorithm comprises the following steps: determining the first K optimal model evaluation indexes from the model evaluation indexes respectively corresponding to a plurality of groups of super parameters which are explored in the round by the plurality of machine learning algorithms, wherein K is a positive integer; for each machine learning algorithm, taking the proportion of the top K best model evaluation indexes that belong to the machine learning algorithm as the current round of performance score of the machine learning algorithm, wherein the calculating the future potential score of each machine learning algorithm comprises: storing model evaluation indexes respectively corresponding to a plurality of groups of super parameters explored by each machine learning algorithm in the round in an array according to a sequence to obtain a plurality of arrays respectively corresponding to the plurality of machine learning algorithms; for each machine learning algorithm, a monotonically increasing array is extracted from the array corresponding to the machine learning algorithm, and the ratio of the length of the monotonically increasing array to the length of the array corresponding to the machine learning algorithm is used as the future potential score of the machine learning algorithm.
14. The method of claim 1, wherein the method further comprises one or more of:
packaging data generated in the process of executing a data processing flow corresponding to the directed acyclic graph into nodes with the type of data, and storing the nodes;
encapsulating samples generated in the process of executing the data processing flow corresponding to the directed acyclic graph into nodes with the types of the samples, and storing the nodes;
packaging a model generated in the process of executing a data processing flow corresponding to the directed acyclic graph into a node with the model type and storing the node;
and packaging the data processing flow corresponding to executing the directed acyclic graph into nodes with the type of operator, and storing the nodes.
15. The method of claim 14, wherein the method further comprises:
and adding the packaged nodes into a node display list for subsequent editing or creating a directed acyclic graph.
16. The method of claim 14, wherein the encapsulating the model generated during the execution of the data processing flow corresponding to the directed acyclic graph into the model-like nodes comprises:
writing description information of the model into a file, and packaging the file into nodes with the model file type;
Or writing description information of the model, description information of the data source and description information of preprocessing and feature extraction processing of the data source into a file, and packaging the file into nodes with the model file type;
or, the description information of the model itself, the description information of the data source, the codes for preprocessing the data source and extracting the characteristics and the codes for executing the processing logic of the model itself are packaged into the nodes with the model type.
17. The method of claim 14, wherein,
and each node in the directed acyclic graph respectively corresponds to an execution body, and the description information of the data source and the description information or code of the data processing logic of each execution body are passed through according to the execution sequence, so that each execution body can output the description information of the data source and the description information or code of the data processing logic of itself and of all its upper-level execution bodies.
18. The method of claim 14, wherein,
each execution main body corresponding to each node in the directed acyclic graph is provided with a corresponding information record file;
and if the execution main body at the previous stage exists, reading the content in the information record file of the previous execution main body, and storing the description information or the code of the data processing logic of the self into the information record file of the self together with the read content.
19. The method of any of claims 16-18, wherein the data source description information is field information of one or more data tables as input data.
20. The method of claim 16, wherein the method further comprises:
when the description information of the model is written into a file, and the file is packaged into a node with the type of the model file, the code of the processing logic of the model is also packaged into the node, and the input of the node is the model file;
in the case where the description information of the model itself, the data source description information of the input model, and the description information of the preprocessing and feature extraction processing of the data source are written into a file and the file is packaged into a node of the model file type, parsing the data source description information and the description information of the preprocessing and feature extraction processing to generate corresponding data processing code, and packaging the generated code together with the code of the processing logic of the model itself into the node, wherein the input of the node is the model file.
21. The method of claim 1, wherein the executing the data processing flow corresponding to the directed acyclic graph in response to the operation of running the directed acyclic graph comprises:
Acquiring configuration information of a directed acyclic graph;
according to the configuration information of the directed acyclic graph, determining a previous-stage node and a next-stage node of each node in the directed acyclic graph, and further determining a connection relationship between the nodes;
and executing the data processing flow corresponding to each node according to the connection relation among the nodes.
22. The method of claim 21, wherein configuration information of the directed acyclic graph includes configuration information of each node, and configuration information of each node includes input slot information and output slot information;
the determining the previous level node and the next level node of each node in the directed acyclic graph according to the configuration information of the directed acyclic graph comprises: and determining a previous-stage node and a next-stage node of each node in the directed acyclic graph according to the input slot information and the output slot information in the configuration information of each node in the directed acyclic graph.
23. The method of claim 21, wherein the configuration information of the directed acyclic graph includes configuration information of each node;
the method further comprises the steps of: according to the configuration information of each node, the node is determined to operate in a stand-alone mode or in a distributed mode.
24. The method of claim 23, wherein the information in the configuration information of each node for determining the operation mode of the node comprises: an identification indicating stand-alone operation or an identification indicating distributed operation.
25. A computer-readable storage medium storing instructions that, when executed by at least one computing device, cause the at least one computing device to perform the method of any of claims 1-24.
26. A system comprising at least one computing device and at least one storage device storing instructions that, when executed by the at least one computing device, cause the at least one computing device to perform the method of any of claims 1-24.
27. A data processing system, wherein the system comprises:
an operation unit adapted to generate a corresponding directed acyclic graph in response to an operation of generating the directed acyclic graph by a user;
a running unit adapted to execute a data processing flow corresponding to the directed acyclic graph in response to an operation of running the directed acyclic graph, wherein the data processing flow corresponding to the directed acyclic graph is a data processing flow related to machine learning,
The operation unit is adapted to display a first graphical user interface comprising a node display area and a canvas area, wherein the node types in the node display area comprise data, samples, models and operators; in response to an operation of selecting a node in the node display area, display the corresponding node in the canvas area, and in response to an operation of connecting nodes, generate a connection between the corresponding nodes in the canvas area to generate a directed acyclic graph,
wherein the node presentation area comprises an element list and an operator list, the element list comprises data, samples and models, the samples are samples related to machine learning, and the operator list comprises various data processing operators related to machine learning;
the node display area comprises one or more of the following operators: a data splitting operator; a feature extraction operator; a feature importance analysis operator; an automatic feature combination operator; an automatic parameter adjustment operator; a TensorFlow operator; and a custom script operator.
28. The system of claim 27, wherein,
the node display area comprises an element list and an operator list, wherein the element list comprises data, samples and models, and the operator list comprises various data processing operators related to machine learning;
The node presentation area also includes a file list including a directed acyclic graph.
29. The system of claim 27, wherein the nodes of the node presentation area further comprise directed acyclic graphs;
the operating unit is further adapted to perform at least one of the following:
responsive to an operation of selecting a directed acyclic graph at a node presentation area, displaying the selected directed acyclic graph at a canvas area for direct execution or modification editing;
responsive to an operation to save the directed acyclic graph in the canvas area, saving the directed acyclic graph and adding the saved directed acyclic graph to the node-presentation area;
in response to the operation of exporting the directed acyclic graph, the corresponding directed acyclic graph is output to a specified export location.
30. The system of claim 27, wherein the operating unit is further adapted to perform at least one of:
storing corresponding elements in response to an operation of importing the elements from outside, and adding the elements into the node display area;
saving elements generated in the process of executing a data processing flow corresponding to the directed acyclic graph, and adding the saved elements into the node display area;
Providing a management page for managing elements generated in the process of executing a data processing flow corresponding to the directed acyclic graph, so that a user can check and delete the intermediate elements through the management page;
outputting the corresponding element to a specified export location in response to an operation to export the element;
wherein the element is data, a sample, or a model.
31. The system of claim 27, wherein the operating unit is further adapted to perform at least one of:
responding to the operation of importing operators from the outside, storing codes corresponding to the corresponding operators, and adding the corresponding operators into a node display area;
providing an operator code editing interface, acquiring and storing an input code from the interface, and adding a corresponding operator into a node display area.
32. The system of claim 27, wherein the operating unit is further adapted to perform the following operations:
responding to the operation of selecting one node in the canvas area, displaying a configuration interface of the node, and completing the related configuration of the corresponding node according to the configuration operation on the configuration interface;
when a node has not completed its necessary configuration, or the configured parameters do not meet the preset requirements, a prompt identifier is displayed at the corresponding node in the canvas area.
33. The system of claim 27, wherein the operating unit is further adapted to perform at least one of:
displaying a graphic control running the directed acyclic graph in the first graphic user interface, responding to the operation of triggering the graphic control, and executing a data processing flow corresponding to the directed acyclic graph according to each node in the directed acyclic graph and the connection relation among the nodes;
and displaying a timer on the first graphical user interface, wherein the timer is used for timing the time for executing the data processing flow corresponding to the directed acyclic graph in real time.
34. The system of claim 27, wherein the operating unit is further adapted to perform at least one of:
in the process of executing the data processing flow corresponding to the directed acyclic graph, displaying information for representing the progress of executing the corresponding node on each node of the directed acyclic graph on the first graphical user interface;
in the process of executing the data processing flow corresponding to the directed acyclic graph, displaying an in-operation identifier on each node of the directed acyclic graph on the first graphical user interface, and displaying the executed identifier on one node when the data processing flow corresponding to the node is executed;
And responding to the operation of checking the operation result of the node in the directed acyclic graph, and acquiring and displaying operation result data corresponding to the node.
35. The system of claim 27, wherein the operating unit is further adapted to perform one or more of:
the data, samples, and models in the canvas area all support one or more of the following operations: copying, deleting and previewing;
operators in the canvas area support one or more of the following operations: copying, deleting, previewing, running the current task, starting running from the current task, running to the current task, checking logs and checking task details;
for the directed acyclic graph with the completed operation in the canvas area, in response to clicking one of the operators, displaying product type marks respectively corresponding to the types of products output by the operator, in response to clicking the product type marks, displaying a product related information interface, wherein the product related information interface comprises: a control for previewing the product, a control for exporting the product, a control for importing the product into an element list, basic information of the product and path information for storing the product; wherein the product types of the operator outputs include: data, samples, models, and reports.
36. The system of claim 27, wherein,
the data splitting methods provided in the configuration interface of the data splitting operator comprise one or more of splitting by proportion, splitting by rule, and sorting before splitting; when splitting by proportion is selected, proportional sequential splitting, proportional random splitting, and proportional stratified splitting can further be selected, an input area for setting a random seed parameter is further provided on the configuration interface when random splitting is selected, and an input area for setting the field used as the stratification basis is further provided on the configuration interface when stratified splitting is selected; an input area for entering splitting rules is provided when splitting by rule is selected; and a split proportion selection item, an input area for setting the sorting field, and a sorting direction selection item are provided on the configuration interface when sorting before splitting is selected (an illustrative sketch of these splitting options follows this claim);
providing, in the configuration interface of the feature extraction operator, an interface for adding an input source and a script editing entry, and providing at least one of a sample random ordering option, a feature accurate statistics option, an option for whether to compress the output samples, an output plaintext option, a tag type option, and an output result storage type option;
the input of the feature importance analysis operator is a sample table with target values, and the output is a feature importance evaluation report, wherein the report comprises an importance coefficient for each feature and further comprises one or more of the number of features, the number of samples, and basic statistical information of each feature;
providing, in the configuration interface of the automatic feature combination operator, at least one of a feature selection item, a scoring metric selection item, a learning rate setting item, and a termination condition selection item, wherein the feature selection item is used for determining the features to be combined, and the termination condition selection item comprises the maximum number of feature pools to run and the maximum number of output features;
the automatic parameter tuning operator is used for searching for suitable parameters within a given parameter range according to a parameter tuning algorithm, training a model with the found parameters, and performing model evaluation; at least one of a feature selection setting item, a tuning method option, and a tuning times setting item is provided in the configuration interface of the automatic parameter tuning operator, wherein either all features or a user-defined set of features can be selected in the feature selection setting item, and either random search or grid search can be selected in the tuning method option;
the TensorFlow operator is used for running TensorFlow code written by the user, and an input source setting item and a TensorFlow code file path setting item are provided in the configuration interface of the TensorFlow operator;
the custom script operator provides an interface through which the user writes a custom operator in a specific scripting language, and an input source setting item and a script editing entry are provided in the configuration interface of the custom script operator.
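Illustrative sketch (not part of the claims) of the splitting options listed for the data splitting operator above. The function name, argument names, and defaults are assumptions rather than the operator's actual interface; pandas is used only as a convenient stand-in for the platform's table type.

```python
import pandas as pd

def split_table(df, method="random", ratio=0.8, seed=None,
                stratify_by=None, sort_field=None, ascending=True):
    """Return (part_a, part_b). 'sequential', 'random' and 'stratified'
    correspond to proportional splitting; 'sorted' sorts before splitting."""
    if method == "sequential":
        cut = int(len(df) * ratio)
        part_a = df.iloc[:cut]
    elif method == "random":
        # seed plays the role of the random-seed parameter in the configuration interface
        part_a = df.sample(frac=ratio, random_state=seed)
    elif method == "stratified":
        # stratify_by is the field used as the stratification basis
        part_a = df.groupby(stratify_by, group_keys=False).apply(
            lambda g: g.sample(frac=ratio, random_state=seed))
    elif method == "sorted":
        cut = int(len(df) * ratio)
        part_a = df.sort_values(sort_field, ascending=ascending).iloc[:cut]
    else:
        raise ValueError(f"unknown split method: {method}")
    return part_a, df.drop(part_a.index)

df = pd.DataFrame({"label": [0, 1] * 50, "score": range(100)})
train, test = split_table(df, method="stratified", ratio=0.8,
                          seed=42, stratify_by="label")
```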
37. The system of claim 36, wherein the feature importance analysis operator determines the importance of a feature by at least one of:
training at least one feature pool model based on a sample set, wherein the feature pool model refers to a machine learning model for providing a prediction result about a machine learning problem based on at least a part of features contained in a sample, obtaining an effect of the at least one feature pool model, and determining importance of the features according to the obtained effect of the at least one feature pool model; wherein the feature pool model is trained by performing a discretization operation on at least one continuous feature among the at least a portion of the features;
determining a basic feature subset of the sample, and determining a plurality of target feature subsets whose importance is to be determined; for each target feature subset of the plurality of target feature subsets, acquiring a corresponding composite machine learning model, wherein the composite machine learning model comprises a basic sub-model and an additional sub-model trained according to a boosting framework, the basic sub-model being trained based on the basic feature subset and the additional sub-model being trained based on each target feature subset; and determining the importance of the plurality of target feature subsets according to the effect of the composite machine learning model;
pre-ordering the importance of at least one candidate feature of the sample, and screening, according to the pre-ordering result, a part of the candidate features from the at least one candidate feature to form a candidate feature pool; and re-ordering the importance of each candidate feature in the candidate feature pool, and selecting, according to the re-ordering result, at least one candidate feature of relatively higher importance from the candidate feature pool as an important feature.
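Illustrative sketch (not part of the claims) of the pre-ordering / re-ordering alternative above: candidate features are first screened with a cheap filter score and the surviving pool is then re-ranked with a model-based importance. The choice of mutual information and a random forest is an assumption, since the claim does not fix the scoring functions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=500, n_features=30, n_informative=5,
                           random_state=0)

# Pre-ordering: rank all candidate features with a cheap filter score
# and keep the top fraction as the candidate feature pool.
pre_scores = mutual_info_classif(X, y, random_state=0)
pool = np.argsort(pre_scores)[::-1][:10]          # candidate feature pool

# Re-ordering: rank the pooled features with a model-based importance
# and keep the ones of relatively higher importance.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X[:, pool], y)
reordered = pool[np.argsort(forest.feature_importances_)[::-1]]
important_features = reordered[:5]
print(important_features)
```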
38. The system of claim 36, wherein the automatic feature combination operator performs feature combination by at least one of:
performing at least one binning operation for each continuous feature in the sample to obtain a binning group feature composed of at least one binning feature, wherein each binning operation corresponds to one binning feature; and generating combined features of the machine learning sample by feature combination between the binning features and/or other discrete features in the sample (see the binning sketch following this claim);
performing feature combinations between at least one feature of the sample stage by stage in accordance with a heuristic search strategy to generate candidate combined features, wherein for each stage a target combined feature is selected from a set of candidate combined features as a combined feature of the machine learning sample;
obtaining unit features capable of being combined in the sample; providing a graphical interface for setting feature combination configuration items to the user, wherein the feature combination configuration items are used for specifying how feature combinations are to be performed among the unit features; receiving input operations performed by the user on the graphical interface for setting the feature combination configuration items, and acquiring the feature combination configuration items set by the user according to the input operations; and combining the features to be combined among the unit features based on the acquired feature combination configuration items to generate combined features of the machine learning sample;
iteratively performing feature combinations between at least one discrete feature of the sample in accordance with a search strategy to generate candidate combined features, and selecting a target combined feature from the generated candidate combined features as a combined feature; for each round of iteration, pre-ordering importance of each candidate combination feature in the candidate combination feature set, screening a part of candidate combination features from the candidate combination feature set according to a pre-ordering result to form a candidate combination feature pool, re-ordering importance of each candidate combination feature in the candidate combination feature pool, and selecting at least one candidate combination feature with higher importance from the candidate combination feature pool according to the re-ordering result as a target combination feature;
screening a plurality of key unit features from the features of the sample; obtaining at least one combined feature from the plurality of key unit features by using an automatic feature combination algorithm, wherein each combined feature is formed by combining a corresponding part of the key unit features; and taking the obtained at least one combined feature as an automatically generated combined feature.
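Illustrative sketch (not part of the claims) of the binning-based alternative above, under assumed column names: a continuous feature is quantile-binned, and the resulting bin feature is crossed with an existing discrete feature to form a combined feature.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [23, 35, 47, 52, 61, 29, 44, 38],                   # continuous feature
    "city": ["bj", "sh", "bj", "gz", "sh", "gz", "bj", "sh"],   # discrete feature
})

# One binning operation on the continuous feature yields one binning feature;
# several operations with different bin counts would form a binning group feature.
df["age_bin_q4"] = pd.qcut(df["age"], q=4, labels=False)

# Feature combination between the binning feature and a discrete feature:
# the cross becomes a new discrete combined feature of the sample.
df["age_bin_q4_x_city"] = df["age_bin_q4"].astype(str) + "_" + df["city"]
print(df[["age_bin_q4", "age_bin_q4_x_city"]])
```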
39. The system of claim 36, wherein the automatic parameter tuning operator performs automatic parameter tuning by any one of:
the following steps are performed in each iteration: determining currently available resources; scoring a plurality of hyper-parameter tuning strategies respectively, and allocating the currently available resources to the plurality of hyper-parameter tuning strategies according to the scoring results, wherein each hyper-parameter tuning strategy is used for selecting hyper-parameter combinations for the machine learning model based on the hyper-parameter selection strategy corresponding to that tuning strategy; and acquiring the one or more hyper-parameter combinations generated, based on the allocated resources, by each hyper-parameter tuning strategy to which resources were allocated;
in the competition stage, training corresponding competing models according to a machine learning algorithm under a plurality of competing hyper-parameter combinations to obtain the competing model with the best effect, and taking the obtained competing model and its corresponding competing hyper-parameter combination as the winning model and the winning hyper-parameter combination entering the growth stage; in the growth stage, continuing to train, under the winning hyper-parameter combination obtained in the competition stage of the current round, the winning model obtained in that competition stage and obtaining its effect; if the effect of the winning model indicates that the model effect has stopped growing, restarting the competition stage to train updated competing models under an updated plurality of competing hyper-parameter combinations according to the machine learning algorithm, and otherwise continuing to train the winning model; iterating in this way until a preset termination condition is met, wherein the updated competing hyper-parameter combinations are obtained based on the winning hyper-parameter combination of the previous growth stage, and the updated competing models are all the winning model obtained in the previous growth stage;
respectively performing one round of hyper-parameter exploration training on a plurality of machine learning algorithms, wherein in this round each machine learning algorithm explores at least M groups of hyper-parameters, M being a positive integer greater than 1; calculating a current-round performance score for each machine learning algorithm based on the model evaluation metrics respectively corresponding to the groups of hyper-parameters explored by the machine learning algorithms in this round, and calculating a future potential score for each machine learning algorithm; determining, by combining the current-round performance score and the future potential score of each machine learning algorithm, a resource allocation scheme for allocating available resources to each machine learning algorithm; and performing corresponding resource scheduling in the next round of hyper-parameter exploration training according to the resource allocation scheme, wherein calculating the current-round performance score of each machine learning algorithm comprises: determining the top K best model evaluation metrics among the model evaluation metrics respectively corresponding to the groups of hyper-parameters explored by the plurality of machine learning algorithms in this round, K being a positive integer, and, for each machine learning algorithm, taking the proportion of those top K best model evaluation metrics contributed by that machine learning algorithm as its current-round performance score; and wherein calculating the future potential score of each machine learning algorithm comprises: storing the model evaluation metrics respectively corresponding to the groups of hyper-parameters explored by each machine learning algorithm in this round in an array in order, to obtain a plurality of arrays respectively corresponding to the plurality of machine learning algorithms, and, for each machine learning algorithm, extracting a monotonically increasing array from the array corresponding to that machine learning algorithm and taking the ratio of the length of the monotonically increasing array to the length of the array corresponding to that machine learning algorithm as the future potential score of that machine learning algorithm.
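Illustrative sketch (not part of the claims) of the third tuning scheme above: a current-round performance score (share of the overall top-K trial metrics) is combined with a future potential score (length of a monotonically increasing sequence extracted from the trial history, relative to the history length) to split the available workers. Reading the extracted monotonically increasing array as a longest non-decreasing subsequence is an interpretation, and all names and the weighting are assumptions.

```python
def current_round_scores(metrics_by_algo, k):
    """Current-round performance score: the share of the overall top-K
    evaluation metrics (higher is better, e.g. AUC) contributed by each algorithm."""
    tagged = [(metric, algo) for algo, metrics in metrics_by_algo.items()
              for metric in metrics]
    top_k = sorted(tagged, key=lambda t: t[0], reverse=True)[:k]
    return {algo: sum(1 for _, a in top_k if a == algo) / k
            for algo in metrics_by_algo}

def future_potential_scores(metrics_by_algo):
    """Future potential score: length of a longest non-decreasing subsequence of the
    chronological metric array divided by the array length (one possible reading)."""
    def lnds_len(xs):
        if not xs:
            return 0
        best = [1] * len(xs)
        for i in range(len(xs)):
            for j in range(i):
                if xs[j] <= xs[i]:
                    best[i] = max(best[i], best[j] + 1)
        return max(best)
    return {algo: lnds_len(metrics) / len(metrics)
            for algo, metrics in metrics_by_algo.items()}

def allocate_resources(metrics_by_algo, total_workers, k=5, alpha=0.5):
    """Combine the two scores and split the available workers proportionally."""
    perf = current_round_scores(metrics_by_algo, k)
    potential = future_potential_scores(metrics_by_algo)
    combined = {a: alpha * perf[a] + (1 - alpha) * potential[a]
                for a in metrics_by_algo}
    norm = sum(combined.values()) or 1.0
    return {a: round(total_workers * score / norm)  # rounding may shift a worker or two
            for a, score in combined.items()}

# Toy round: three algorithms, each explored M = 6 hyper-parameter groups (AUC per group).
history = {
    "logistic_regression": [0.71, 0.72, 0.74, 0.73, 0.75, 0.75],
    "gbdt":                [0.78, 0.80, 0.79, 0.81, 0.82, 0.83],
    "dnn":                 [0.65, 0.70, 0.69, 0.68, 0.72, 0.71],
}
print(allocate_resources(history, total_workers=20, k=5))
```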
40. The system of claim 27, wherein the operating unit is further adapted to perform one or more of:
packaging data generated in the process of executing the data processing flow corresponding to the directed acyclic graph into nodes whose type is data, and storing the nodes;
packaging samples generated in the process of executing the data processing flow corresponding to the directed acyclic graph into nodes whose type is sample, and storing the nodes;
packaging a model generated in the process of executing the data processing flow corresponding to the directed acyclic graph into a node whose type is model, and storing the node;
and packaging the data processing flow corresponding to the executed directed acyclic graph into a node whose type is operator, and storing the node.
41. The system of claim 40, wherein the operating unit is further adapted to add the encapsulated nodes to the node display list for use in subsequent editing or creation of a directed acyclic graph.
42. The system of claim 40, wherein the operating unit is adapted to
writing description information of the model itself into a file, and packaging the file into a node whose type is model file;
or writing the description information of the model itself, description information of the data source, and description information of the preprocessing and feature extraction processing of the data source into a file, and packaging the file into a node whose type is model file;
or packaging the description information of the model itself, the description information of the data source, the code for the preprocessing and feature extraction processing of the data source, and the code for executing the processing logic of the model itself into a node whose type is model.
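Illustrative sketch (not part of the claims) of the model-file packaging alternatives above: the model's own description, the data source description, and the preprocessing / feature extraction description are written into one file and wrapped as a node of type "model file". The field names and node layout are assumptions.

```python
import json
from pathlib import Path

def package_model_node(model_info, data_source_info=None, preprocess_info=None,
                       out_dir="artifacts", name="model_node"):
    """Write the description information into a file and wrap it as a
    node whose type is 'model file' (node layout is an assumption)."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    description = {"model": model_info}
    if data_source_info is not None:
        description["data_source"] = data_source_info          # e.g. field info of input tables
    if preprocess_info is not None:
        description["preprocess_and_feature_extraction"] = preprocess_info
    file_path = Path(out_dir) / f"{name}.json"
    file_path.write_text(json.dumps(description, indent=2))
    return {"type": "model file", "name": name, "path": str(file_path)}

node = package_model_node(
    model_info={"algorithm": "logistic_regression", "version": 1},
    data_source_info={"table": "samples", "fields": ["age", "city", "label"]},
    preprocess_info={"steps": ["fill_missing", "one_hot(city)"]},
)
print(node)
```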
43. The system of claim 40, wherein,
execution bodies respectively corresponding to the nodes in the directed acyclic graph pass on, according to the execution order, the description information of the data source and the description information or code of each execution body's data processing logic, so that each execution body can output the description information of the data source together with the description information or code of the data processing logic of itself and of all upstream execution bodies.
44. The system of claim 40, wherein,
each execution body corresponding to a node in the directed acyclic graph has a corresponding information record file;
and if a preceding execution body exists, each execution body reads the content of the preceding execution body's information record file and stores the description information or code of its own data processing logic, together with the read content, into its own information record file.
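Illustrative sketch (not part of the claims) of the record-file mechanism in claims 43 and 44: each execution body reads its predecessor's information record file, if any, and appends its own data processing logic description, so the final file carries the lineage of all upstream execution bodies. The file layout and node names are assumptions.

```python
import json
from pathlib import Path

def write_record(node_id, own_logic, record_dir="records", predecessor_id=None):
    """Read the predecessor's record (if it exists) and store its content
    together with this execution body's own logic description."""
    record_dir = Path(record_dir)
    record_dir.mkdir(parents=True, exist_ok=True)
    entries = []
    if predecessor_id is not None:
        prev_file = record_dir / f"{predecessor_id}.json"
        if prev_file.exists():
            entries = json.loads(prev_file.read_text())
    entries.append({"node": node_id, "logic": own_logic})
    (record_dir / f"{node_id}.json").write_text(json.dumps(entries, indent=2))
    return entries

write_record("read_table", "SELECT age, city, label FROM samples")
write_record("feature_extract", "one_hot(city); bucketize(age, 4)",
             predecessor_id="read_table")
full_lineage = write_record("train_lr", "logistic_regression(lr=0.1)",
                            predecessor_id="feature_extract")
print(full_lineage)   # contains the logic of all upstream execution bodies
```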
45. The system of any one of claims 42-44, wherein the data source description information is field information of one or more data tables as input data.
46. The system of claim 42, wherein the operating unit is adapted to
When the description information of the model is written into a file, and the file is packaged into a node with the type of the model file, the code of the processing logic of the model is also packaged into the node, and the input of the node is the model file;
writing description information of the model, data source description information of the input model and description information of preprocessing and feature extraction processing of the data source into a file, acquiring description information of the analysis data source and description information of preprocessing and feature extraction processing of the data source under the condition that the file is packaged into a node of the model file, generating analysis codes corresponding to data processing codes, packaging the analysis codes and codes of processing logic of the model into the node, wherein the node input is the model file.
47. The system of claim 27, wherein,
the operating unit is adapted to acquire configuration information of the directed acyclic graph; determine, according to the configuration information of the directed acyclic graph, the preceding node and the succeeding node of each node in the directed acyclic graph, thereby determining the connection relationships among the nodes;
and execute the data processing flow corresponding to each node according to the connection relationships among the nodes.
48. The system of claim 47, wherein the configuration information of the directed acyclic graph includes configuration information of each node, the configuration information of each node including input slot information and output slot information;
the operating unit is adapted to determine the preceding node and the succeeding node of each node in the directed acyclic graph according to the input slot information and the output slot information in the configuration information of each node of the directed acyclic graph.
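Illustrative sketch (not part of the claims) of claims 47 and 48: node configurations with input slot and output slot information are matched to derive each node's predecessors and successors, and the flow is then executed in a topological order. The configuration keys and the executor callables are hypothetical.

```python
from collections import defaultdict, deque

def build_edges(node_configs):
    """Match output slots to input slots to find each node's predecessors."""
    producer_of = {slot: node_id
                   for node_id, cfg in node_configs.items()
                   for slot in cfg.get("output_slots", [])}
    edges = defaultdict(list)                      # predecessor -> successors
    indegree = {node_id: 0 for node_id in node_configs}
    for node_id, cfg in node_configs.items():
        for slot in cfg.get("input_slots", []):
            pred = producer_of[slot]
            edges[pred].append(node_id)
            indegree[node_id] += 1
    return edges, indegree

def run_dag(node_configs, executors):
    """Kahn's topological order over the derived connection relationships."""
    edges, indegree = build_edges(node_configs)
    ready = deque(n for n, d in indegree.items() if d == 0)
    while ready:
        node_id = ready.popleft()
        executors[node_id]()                       # execute this node's flow
        for succ in edges[node_id]:
            indegree[succ] -= 1
            if indegree[succ] == 0:
                ready.append(succ)

configs = {
    "read":  {"input_slots": [],     "output_slots": ["s1"]},
    "split": {"input_slots": ["s1"], "output_slots": ["s2", "s3"]},
    "train": {"input_slots": ["s2"], "output_slots": ["s4"]},
}
run_dag(configs, {n: (lambda n=n: print(f"running {n}")) for n in configs})
```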
49. The system of claim 47, wherein the configuration information of the directed acyclic graph includes configuration information of each node;
the operating unit is further adapted to determine, according to the configuration information of each node, whether the node is to be run in stand-alone mode or in distributed mode.
50. The system of claim 49, wherein the information in the configuration information of each node for determining the operation mode of the node comprises: an identifier indicating stand-alone operation or an identifier indicating distributed operation.
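Illustrative sketch (not part of the claims) of claims 49 and 50: execution is dispatched per node on a run-mode identifier in the node's configuration. The key name, the flag values, and the placeholder runners are assumptions.

```python
def run_node(node_config):
    """Choose stand-alone or distributed execution from the node's
    configuration identifier (key name and values are assumptions)."""
    mode = node_config.get("run_mode", "standalone")
    if mode == "standalone":
        return run_standalone(node_config)
    if mode == "distributed":
        return run_distributed(node_config)
    raise ValueError(f"unknown run mode: {mode}")

def run_standalone(cfg):
    return f"{cfg['id']}: executed on a single machine"

def run_distributed(cfg):
    return f"{cfg['id']}: submitted to the cluster with {cfg.get('workers', 2)} workers"

print(run_node({"id": "train_lr", "run_mode": "distributed", "workers": 8}))
```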
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911061020.9A CN110956272B (en) | 2019-11-01 | 2019-11-01 | Method and system for realizing data processing |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110956272A CN110956272A (en) | 2020-04-03 |
CN110956272B true CN110956272B (en) | 2023-08-08 |
Family
ID=69976491
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911061020.9A Active CN110956272B (en) | 2019-11-01 | 2019-11-01 | Method and system for realizing data processing |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110956272B (en) |
Families Citing this family (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111523676B (en) * | 2020-04-17 | 2024-04-12 | 第四范式(北京)技术有限公司 | Method and device for assisting machine learning model to be online |
CN111506779B (en) * | 2020-04-20 | 2021-03-16 | 东云睿连(武汉)计算技术有限公司 | Object version and associated information management method and system facing data processing |
CN113673707A (en) * | 2020-05-15 | 2021-11-19 | 第四范式(北京)技术有限公司 | Method and device for learning by applying machine, electronic equipment and storage medium |
CN111752555B (en) * | 2020-05-18 | 2021-08-20 | 南京认知物联网研究院有限公司 | Business scene driven visual insight support system, client and method |
CN111625692B (en) * | 2020-05-27 | 2023-08-22 | 抖音视界有限公司 | Feature extraction method, device, electronic equipment and computer readable medium |
CN112115129B (en) * | 2020-09-16 | 2024-05-10 | 浪潮软件股份有限公司 | Retail terminal sample sampling method based on machine learning |
CN112508346B (en) * | 2020-11-17 | 2022-06-24 | 四川新网银行股份有限公司 | Method for realizing indexed business data auditing |
CN112380216B (en) * | 2020-11-17 | 2023-07-28 | 北京融七牛信息技术有限公司 | Automatic feature generation method based on intersection |
CN112860655B (en) * | 2020-12-10 | 2024-01-30 | 南京三眼精灵信息技术有限公司 | Visual knowledge model construction method and device |
CN112558938B (en) * | 2020-12-16 | 2021-11-09 | 中国科学院空天信息创新研究院 | Machine learning workflow scheduling method and system based on directed acyclic graph |
GB2602475B (en) * | 2020-12-31 | 2023-09-13 | Seechange Tech Limited | Method and system for processing image data |
CN113392422B (en) * | 2021-08-16 | 2021-10-29 | 华控清交信息科技(北京)有限公司 | Data processing method and device and data processing device |
WO2023115570A1 (en) * | 2021-12-24 | 2023-06-29 | 深圳晶泰科技有限公司 | Management method and apparatus for machine learning model, computer device and storage medium |
CN114492737B (en) | 2021-12-31 | 2022-12-09 | 北京百度网讯科技有限公司 | Data processing method, data processing device, electronic equipment, storage medium and program product |
Citations (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107316082A (en) * | 2017-06-15 | 2017-11-03 | 第四范式(北京)技术有限公司 | For the method and system for the feature importance for determining machine learning sample |
CN107392319A (en) * | 2017-07-20 | 2017-11-24 | 第四范式(北京)技术有限公司 | Generate the method and system of the assemblage characteristic of machine learning sample |
CN107704871A (en) * | 2017-09-08 | 2018-02-16 | 第四范式(北京)技术有限公司 | Generate the method and system of the assemblage characteristic of machine learning sample |
CN107729915A (en) * | 2017-09-08 | 2018-02-23 | 第四范式(北京)技术有限公司 | For the method and system for the key character for determining machine learning sample |
CN107766946A (en) * | 2017-09-28 | 2018-03-06 | 第四范式(北京)技术有限公司 | Generate the method and system of the assemblage characteristic of machine learning sample |
CN107909087A (en) * | 2017-09-08 | 2018-04-13 | 第四范式(北京)技术有限公司 | Generate the method and system of the assemblage characteristic of machine learning sample |
CN108021984A (en) * | 2016-11-01 | 2018-05-11 | 第四范式(北京)技术有限公司 | Determine the method and system of the feature importance of machine learning sample |
WO2018134248A1 (en) * | 2017-01-17 | 2018-07-26 | Catchoom Technologies, S.L. | Classifying data |
CN109242040A (en) * | 2018-09-28 | 2019-01-18 | 第四范式(北京)技术有限公司 | Automatically generate the method and system of assemblage characteristic |
CN109284828A (en) * | 2018-09-06 | 2019-01-29 | 沈文策 | A kind of hyper parameter tuning method, device and equipment |
CN109325808A (en) * | 2018-09-27 | 2019-02-12 | 重庆智万家科技有限公司 | Demand for commodity prediction based on Spark big data platform divides storehouse planing method with logistics |
CN109375912A (en) * | 2018-10-18 | 2019-02-22 | 腾讯科技(北京)有限公司 | Model sequence method, apparatus and storage medium |
CN109389143A (en) * | 2018-06-19 | 2019-02-26 | 北京九章云极科技有限公司 | A kind of Data Analysis Services system and method for automatic modeling |
CN109726216A (en) * | 2018-12-29 | 2019-05-07 | 北京九章云极科技有限公司 | A kind of data processing method and processing system based on directed acyclic graph |
CN109933834A (en) * | 2018-12-26 | 2019-06-25 | 阿里巴巴集团控股有限公司 | A kind of model creation method and device of time series data prediction |
CN110321112A (en) * | 2019-07-02 | 2019-10-11 | 北京百度网讯科技有限公司 | AI ability research/development platform and data processing method |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2014210050A1 (en) * | 2013-06-24 | 2014-12-31 | Cylance Inc. | Automated system for generative multimodel multiclass classification and similarity analysis using machine learning |
Non-Patent Citations (1)
Title |
---|
Research on Cloud Computing Task Scheduling for Remote Sensing Big Data Applications; Yin Xianliang; China Master's Theses Full-text Database: Information Science and Technology (No. 1); I140-821 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110956272B (en) | Method and system for realizing data processing | |
Shang et al. | Democratizing data science through interactive curation of ml pipelines | |
US11468366B2 (en) | Parallel development and deployment for machine learning models | |
US11861462B2 (en) | Preparing structured data sets for machine learning | |
US20150302433A1 (en) | Automatic Generation of Custom Intervals | |
WO2019144066A1 (en) | Systems and methods for preparing data for use by machine learning algorithms | |
CN110309427A (en) | A kind of object recommendation method, apparatus and storage medium | |
US20150058266A1 (en) | Predictive analytics factory | |
CN107704871A (en) | Generate the method and system of the assemblage characteristic of machine learning sample | |
CN105593818A (en) | Apparatus and method for scheduling distributed workflow tasks | |
US10956535B2 (en) | Operating a neural network defined by user code | |
US20210103858A1 (en) | Method and system for model auto-selection using an ensemble of machine learning models | |
US20170169353A1 (en) | Systems and Methods for Multi-Objective Evolutionary Algorithms with Soft Constraints | |
CN108846695A (en) | The prediction technique and device of terminal replacement cycle | |
CN109766470A (en) | Image search method, device and processing equipment | |
JP7245961B2 (en) | interactive machine learning | |
US11521077B1 (en) | Automatic recommendation of predictor variable values for improving predictive outcomes | |
CA3154784A1 (en) | Interactive machine learning | |
Tousi et al. | Comparative analysis of machine learning models for performance prediction of the spec benchmarks | |
CN115147092A (en) | Resource approval method and training method and device of random forest model | |
CN115705501A (en) | Hyper-parametric spatial optimization of machine learning data processing pipeline | |
CN108446378B (en) | Method, system and computer storage medium based on user search | |
Tesser et al. | Selecting efficient VM types to train deep learning models on Amazon SageMaker | |
Shin et al. | Hippo: Taming hyper-parameter optimization of deep learning with stage trees | |
CN118428701B (en) | Method and device for generating order information, computer equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||