Background
Big data processing technology has developed steadily: a data model applied to a business can be built from big data, and the data model can then be used to predict business outcomes. When the data size is small, the computing power of a single computer is sufficient; when the data size is huge, however, a distributed computing platform is required to perform the whole modeling process. In the related art, when a distributed computing platform performs modeling, the functional modules that make up the modeling process may be deployed on different devices for computing. When these functional modules are connected in series, however, the complex dependency relationships between them prevent a smooth connection; for example, the modules must be manually analyzed and connected in series, so the efficiency of data processing is low.
Disclosure of Invention
In view of this, the present application provides a distributed machine learning method and platform to improve the efficiency of data processing.
Specifically, the method is realized through the following technical scheme:
In a first aspect, a distributed machine learning platform is provided, the platform comprising:
a logic framework module for constructing execution logic of a data processing task, wherein the data processing task comprises a plurality of algorithm modules, and each algorithm module comprises an input part, an algorithm part, and an output part; the input part and the output part have the same interface format, so that the plurality of algorithm modules can be connected in series according to the interface format; and the input part comprises dependency information between the algorithm module and other algorithm modules;
and an algorithm execution module for executing each algorithm module according to the execution logic constructed by the logic framework module, and calling an algorithm library in a resource layer to perform operations according to the algorithm part of the algorithm module.
In a second aspect, a distributed machine learning method is provided, including:
executing, according to the constructed execution logic of a data processing task, a plurality of algorithm modules included in the data processing task, wherein each algorithm module comprises an input part, an algorithm part, and an output part, the input part and the output part having the same interface format; calling an algorithm library in a resource layer to perform operations according to the algorithm part of the algorithm module;
and connecting the plurality of algorithm modules in series according to the interface format and the dependency information, included in the input part of each algorithm module, between that algorithm module and other algorithm modules.
With the distributed machine learning method and platform of the present application, the algorithm modules included in the modeling process are deployed on different devices for computation, and the same interface format allows the algorithm modules to be smoothly connected in series, improving the efficiency of data processing; when the distributed machine learning platform is applied to modeling, the modeling efficiency is improved accordingly.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. Where the following description refers to the accompanying drawings, the same numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all embodiments consistent with the present application; rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
The embodiment of the application provides a distributed machine learning platform, and a data miner can use the platform to perform a data processing task, for example, the data processing task can be to establish a prediction model according to acquired data and evaluate the accuracy of the prediction model.
Fig. 1 illustrates the framework of the distributed machine learning platform. As shown in Fig. 1, the platform includes a logic framework module 11, an algorithm execution module 12, and a resource layer 13. When a data processing task is executed, for example during model building, various algorithms are used. The resource layer 13 serves as the underlying support and may integrate a plurality of algorithm libraries, for example single-machine algorithm libraries such as the R and Python libraries illustrated in Fig. 1, or distributed algorithm libraries such as Hadoop, ODPS, and Spark; other algorithm libraries such as MLlib, Mahout, and Xlib, not listed in Fig. 1, may also be included.
The resource layer 13 serves as the underlying support for executing data processing tasks: for example, data processing, feature selection, and model training in the modeling process all use various algorithms and call the algorithm libraries in the resource layer 13 to execute the specific processing. The logic framework module 11 is used to construct the execution logic of a data processing task. For example, the data processing task may include a plurality of algorithm modules; referring to the example of Fig. 1, the distributed machine learning platform may construct DAG (Directed Acyclic Graph) execution logic in the logic framework module 11, and the DAG execution logic may represent the call relations between the algorithm modules of the data processing task.
Fig. 1 illustrates the DAG logic between algorithm modules. For example, algorithm module 1 may perform data processing on the raw acquired data; algorithm module 2 may perform feature analysis on the processed data, such as feature selection or feature dimension reduction; algorithm module 3 may perform model training according to the features obtained by algorithm module 2 to obtain a model; and algorithm module n may perform effect prediction on the trained model. The above example is only illustrative; in practical application, the algorithm modules may be divided according to the characteristics of the data processing task, and the execution process of the task is represented by building a DAG graph.
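As a minimal sketch of the DAG execution logic described above (the module names, dictionary shape, and function are illustrative, not from the application), the dependency graph of a modeling pipeline might be represented and ordered for execution as follows:

```python
from collections import deque

def topological_order(dependencies):
    """Return an execution order for algorithm modules.

    dependencies maps each module id to the ids of the modules it
    depends on (its upstream inputs in the DAG).
    """
    # Count unresolved upstream dependencies for each module.
    indegree = {m: len(deps) for m, deps in dependencies.items()}
    # Invert the map: for each module, which modules consume its output.
    consumers = {m: [] for m in dependencies}
    for m, deps in dependencies.items():
        for d in deps:
            consumers[d].append(m)

    ready = deque(m for m, n in indegree.items() if n == 0)
    order = []
    while ready:
        m = ready.popleft()
        order.append(m)
        for c in consumers[m]:
            indegree[c] -= 1
            if indegree[c] == 0:
                ready.append(c)
    if len(order) != len(dependencies):
        raise ValueError("dependency cycle: not a DAG")
    return order

# Hypothetical modeling pipeline: data processing -> feature analysis
# -> model training -> effect prediction, as in the example above.
pipeline = {
    "data_processing": [],
    "feature_analysis": ["data_processing"],
    "model_training": ["feature_analysis"],
    "effect_prediction": ["model_training"],
}
print(topological_order(pipeline))
```

Any ordering returned by such a routine respects every dependency edge, which is what allows the modules to be dispatched to different devices one after another.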
The algorithm execution module 12 may execute each algorithm module according to the execution logic constructed by the logic framework module 11, calling an algorithm library in the resource layer 13 to perform the operation when executing a module. In this example, the resource layer 13 includes both single-machine and distributed algorithm libraries, so that it covers as many types of algorithm libraries as possible; when executing a given algorithm module, the algorithm execution module 12 may select and invoke a suitable algorithm library in the resource layer 13 according to factors such as the size of the data currently being processed and the accuracy requirements of the algorithm. For example, Fig. 1 illustrates that, for one of the algorithm modules constructed in the logic framework module 11, the algorithm execution module 12 may select one algorithm library from Hadoop, ODPS, and Spark in the resource layer 13 to execute the operation.
From the above description, the general architecture of the distributed machine learning platform of this example can be understood: the logic framework module 11 builds the DAG execution logic of the data processing task, showing each algorithm module included in the task and their interrelations, and the algorithm execution module 12 calls the algorithm libraries in the resource layer 13 to execute each algorithm module according to that execution logic. In this embodiment, each algorithm module is designed with a uniform structural format, so that the modules can be conveniently connected in series and deployed in a distributed manner.
Fig. 2 illustrates the structural design of an algorithm module. As shown in Fig. 2, each algorithm module may include an input part 21, an algorithm part 22, and an output part 23. The input part 21 (input) serves as the input of the algorithm part (algorithm), and the output part 23 (output) serves as its output. The input and the output have the same interface format, and the information type may be at least one of the following three: data (data), model (model), or result (evaluation). For example, the data may be sampled data, split data, etc.; the model may be a model trained from the data; and the result may be a result predicted by the model.
The input part 21 may also include dependency information between the present algorithm module and other algorithm modules; for example, a module identification may be used to indicate which algorithm module is depended on, e.g., that the present module depends on the data, model, or result of a previous module. The input part 21 may depend on at least one other algorithm module. The algorithm part 22 indicates which algorithm is used to process the information received through the input part 21, and the output part 23 states whether the algorithm module produces data, a model, or a result.
The following structural design of the algorithm module is illustrated by an example:
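Since the example descriptor itself is not reproduced in this text, the following is a hypothetical reconstruction based on the description in the next paragraph; the field names and the taskId of the module itself are illustrative assumptions, while the dependency taskId '10002', the true/false output flags, and the logisticRegression identifier follow that description:

```python
# Hypothetical reconstruction of an algorithm module descriptor.
# Field names are illustrative; values follow the surrounding text.
module = {
    "taskId": "10003",            # id of this module (assumed)
    "inputs": [
        {
            "taskId": "10002",    # depends on algorithm module '10002'
            "data": True,         # take its data as input
            "model": True,        # take its model as input
            "evaluation": True,   # take its result as input
        }
    ],
    "algorithm": {
        "name": "logisticRegression",  # logistic regression algorithm
    },
    "outputs": {
        "data": True,        # this module outputs data
        "model": False,      # no model output
        "evaluation": True,  # outputs a result
    },
}
```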
In the above example, the input part inputs, the algorithm part algorithm, and the output part outputs of the algorithm module are each defined in a standard format, and every algorithm module is designed according to this structure. Referring to the example, the taskId of the depended-on algorithm module in the input part is '10002', and the data, model, and result of algorithm module '10002' are all used as inputs of the present algorithm module. Referring to the output part outputs, the algorithm module outputs data and a result (true) but no model (false). In the algorithm part algorithm, the algorithm used is the logistic regression algorithm logisticRegression.
Furthermore, for clarity of the DAG logic, it may be specified that each algorithm module can produce only a single data output, model, or result, while multiple data inputs, models, or results may be taken in. For example, in the above example, the input part inputs depends only on the algorithm module with taskId '10002', whose data, model, and result are used as the inputs of this algorithm module. In other application scenarios, the input part inputs may depend on more algorithm modules.
Illustratively, an example with a plurality of inputs is described as follows: the input part inputs of the algorithm module depends on three algorithm modules, with taskIds '10002', '10003', and '10004'; the data output by '10002', the model output by '10003', and the result (evaluation) output by '10004' are used as the inputs of the algorithm module. Of course, other scenarios are also possible in practice; for example, the input part may include only data and a model but no result, which will not be described in detail.
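A hypothetical reconstruction of such a multi-input inputs section follows; the field names are illustrative assumptions, while the three taskIds and the information type drawn from each follow the description above:

```python
# Hypothetical inputs section with three upstream modules:
# data from '10002', a model from '10003', a result from '10004'.
inputs = [
    {"taskId": "10002", "data": True,  "model": False, "evaluation": False},
    {"taskId": "10003", "data": False, "model": True,  "evaluation": False},
    {"taskId": "10004", "data": False, "model": False, "evaluation": True},
]
```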
In this embodiment, the information types output by the algorithm modules are also defined uniformly. For data, intermediate results may be temporarily stored in a local or distributed system, and data may be transferred between different algorithm modules by using a Schema file. For models, model parameters can be expressed in PMML (Predictive Model Markup Language), a de facto standard language used to describe data mining models, which can be used to share predictive analytics models between different algorithm modules. For results (evaluations), the result data of the model evaluation can be stored as JSON and displayed visually.
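As a minimal sketch of the JSON result format mentioned above (the metric names and values are hypothetical, not taken from the application), an evaluation result could be serialized for storage and later display as follows:

```python
import json

# Hypothetical evaluation result; metric names and values are
# illustrative only.
evaluation = {"taskId": "10004", "metrics": {"auc": 0.91, "accuracy": 0.87}}

# Serialize for storage or visual display, then restore.
serialized = json.dumps(evaluation)
restored = json.loads(serialized)
print(restored == evaluation)
```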
It can be seen that the present application divides the data processing task into a plurality of independent algorithm modules with a uniform interface format, so that the algorithm modules can be deployed in a distributed manner and smoothly connected in series. For example, Fig. 3 illustrates three algorithm modules G1, G2, and G3, where the data and model output by G1 and the data and result output by G2 may all serve as inputs to G3, while the model output by G3 may serve as an input to other modules. G1, G2, and G3 can be concatenated through data, models, or results because the outputs of G1 (or G2) and the input of G3 have the same format definition; with no conflict in the interface standard, the concatenation between modules is easy to realize. Through the same interface format, the modules can thus be assembled into a complete DAG logic for execution.
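The concatenation just described can be sketched as a compatibility check over module descriptors; the descriptor shape and the wire function are illustrative assumptions mirroring the uniform data/model/evaluation interface, and the G1, G2, G3 wiring follows the example above:

```python
def wire(modules):
    """Verify that every declared dependency can be satisfied, i.e. the
    upstream module actually produces each requested information type.
    Descriptor shape is illustrative (data / model / evaluation flags)."""
    by_id = {m["taskId"]: m for m in modules}
    for m in modules:
        for dep in m.get("inputs", []):
            upstream = by_id[dep["taskId"]]
            for kind in ("data", "model", "evaluation"):
                if dep.get(kind) and not upstream["outputs"].get(kind):
                    raise ValueError(
                        f"{m['taskId']} wants {kind} from {dep['taskId']}, "
                        f"which does not produce it")
    return True

# Wiring from the Fig. 3 example: G3 takes G1's data and model plus
# G2's data and result, and itself outputs a model.
g1 = {"taskId": "G1", "inputs": [],
      "outputs": {"data": True, "model": True, "evaluation": False}}
g2 = {"taskId": "G2", "inputs": [],
      "outputs": {"data": True, "model": False, "evaluation": True}}
g3 = {"taskId": "G3",
      "inputs": [{"taskId": "G1", "data": True, "model": True},
                 {"taskId": "G2", "data": True, "evaluation": True}],
      "outputs": {"data": False, "model": True, "evaluation": False}}
print(wire([g1, g2, g3]))
```

Because every input and output is declared in the same format, such a check is a straightforward flag comparison, which is the sense in which the modules concatenate "without conflict in interface standard".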
When the distributed machine learning platform of the example is used for modeling, a plurality of algorithm modules included in the modeling process can be respectively deployed on different devices for calculation processing, and the algorithm modules can be smoothly connected in series through the same interface format, so that the data processing efficiency is improved.
Fig. 4 illustrates a distributed machine learning method performed using the distributed machine learning platform of the present application. As shown in Fig. 4, the method may include:
In step 401, a plurality of algorithm modules included in the data processing task are respectively executed according to the constructed execution logic of the data processing task. Each algorithm module includes an input part, an algorithm part, and an output part, wherein the input part and the output part have the same interface format; and an algorithm library in the resource layer is called to perform operations according to the algorithm part of the algorithm module.
In step 402, the plurality of algorithm modules are connected in series according to the interface format and the dependency information, included in the input part of each algorithm module, between that module and other algorithm modules.
The execution order of steps 401 and 402 is not limited; for example, the distributed machine learning platform may execute the algorithm modules in the DAG logic and concatenate them at the same time. When an algorithm library is called for an algorithm module, the algorithm library in the resource layer is called to perform operations according to the algorithm part algorithm of that module. In addition, the machine learning platform of this embodiment may uniformly encapsulate implementations of the same algorithm distributed across different libraries, and select a suitable algorithm library to execute the operation according to factors such as data volume and operational requirements.
For example, the logistic regression algorithm logisticRegression has single-machine algorithm library implementations in both R and Python, as well as distributed implementations in Mahout and MLlib; since the parameters of the algorithm differ little between the single-machine and distributed libraries, the machine learning platform of this example can encapsulate these libraries uniformly. The algorithm execution module of the platform can then evaluate and select a suitable algorithm library according to factors such as the data size, the stability of the algorithm, and the accuracy requirements. For example, if the amount of data is small, a single-machine algorithm library may be selected; if the amount of data is large, a distributed algorithm library may be selected to increase the processing speed.
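The selection rule just described can be sketched as follows; the threshold, the preference order among libraries, and the function itself are illustrative assumptions, while the small-data/large-data split and the library names come from the text above:

```python
def select_library(data_rows, single_machine_limit=1_000_000):
    """Pick an algorithm library for a logistic-regression module
    based on data size. The threshold is an assumed placeholder; the
    platform described above may also weigh algorithm stability and
    accuracy requirements."""
    single_machine = ["R", "Python"]   # single-machine libraries
    distributed = ["MLlib", "Mahout"]  # distributed libraries
    if data_rows <= single_machine_limit:
        return single_machine[0]       # small data: single machine
    return distributed[0]              # large data: distributed

print(select_library(10_000))
print(select_library(50_000_000))
```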
In addition, various types of algorithms may be used in the data processing task, for example, algorithms for data processing, for feature engineering, and for model training and evaluation. In data processing, sampling, splitting, and missing-value handling may be performed; in feature engineering, feature importance calculation, feature cross calculation, feature discretization, and feature selection may be performed; and in model training and evaluation, model training, PMML assembly of model parameter expressions, model prediction and evaluation, and intelligent search for optimal model parameters may be performed.
The distributed machine learning platform of the embodiments of the present application can share multiple algorithm libraries, making the available algorithm libraries as comprehensive as possible; it can construct clear DAG logic expressing the modeling process and the associations between the algorithm modules; and, by designing a uniform interface format for the algorithm modules, each module can be deployed relatively independently in a distributed manner while smooth serial connection between modules is ensured, thereby improving the data processing efficiency.
The above description is only exemplary of the present application and is not intended to limit the present application; any modification, equivalent replacement, or improvement made within the spirit and principle of the present application shall be included in the scope of protection of the present application.