Summary of the Invention
In view of this, the application provides a distributed machine learning method and platform to improve
the efficiency of data processing.
Specifically, the application is implemented through the following technical solutions:
In a first aspect, a distributed machine learning platform is provided, the platform comprising:
a logical architecture module, configured to build the execution logic of a data processing task,
where the data processing task includes multiple algorithm modules, each algorithm module includes
an input part, an algorithm part, and an output part, and the input part and the output part have
the same interface format, so that the multiple algorithm modules can be concatenated according to
the interface format; the input part includes dependency information between this algorithm module
and other algorithm modules; and
an algorithm execution module, configured to execute each algorithm module according to the
execution logic built by the logical architecture module and, according to the algorithm part of
the algorithm module, to call an algorithm library in a resource layer to perform the computation.
In a second aspect, a distributed machine learning method is provided, comprising:
executing, according to the execution logic built for a data processing task, each of the multiple
algorithm modules included in the data processing task, where each algorithm module includes an
input part, an algorithm part, and an output part, and the input part and the output part have the
same interface format; and calling, according to the algorithm part of the algorithm module, an
algorithm library in a resource layer to perform the computation; and
concatenating the multiple algorithm modules according to the interface format and the dependency
information, included in the input part of each algorithm module, between this algorithm module and
other algorithm modules.
With the distributed machine learning method and platform provided by the application, the multiple
algorithm modules that make up a modeling process are deployed on different devices for
computation, and the algorithm modules can be concatenated smoothly through the same interface
format. This improves the efficiency of data processing; in the example of modeling with the
distributed machine learning platform, it improves modeling efficiency.
Detailed Description of the Embodiments
Exemplary embodiments are described in detail here, and examples of them are shown in the
accompanying drawings. When the following description refers to the drawings, the same numerals in
different drawings denote the same or similar elements unless otherwise indicated. The
implementations described in the following exemplary embodiments do not represent all
implementations consistent with the application. Rather, they are merely examples of apparatuses
and methods consistent with some aspects of the application as detailed in the appended claims.
An embodiment of the application provides a distributed machine learning platform that a data
mining engineer can use to perform a data processing task. For example, the data processing task
may be to build a prediction model from the acquired data and to evaluate the accuracy of the
prediction model.
Fig. 1 illustrates the architecture of the distributed machine learning platform. As shown in
Fig. 1, the platform includes a logical architecture module 11, an algorithm execution module 12,
and a resource layer 13. Performing a data processing task, such as building a model, requires a
variety of algorithms. The resource layer 13 serves as the underlying support and can integrate
multiple algorithm libraries: in the example of Fig. 1, standalone algorithm libraries such as R
and Python, and distributed algorithm libraries such as Hadoop, ODPS, and Spark. It may also
include other algorithm libraries such as MLlib, Mahout, and Xlib, which are not all shown in
Fig. 1.
The resource layer 13 is the underlying support for performing data processing tasks. For example,
data processing, feature selection, model training, and other steps of the modeling process all
use various algorithms, and the algorithm libraries in the resource layer 13 are called to perform
the specific processing. The logical architecture module 11 is used to build the execution logic
of a data processing task. For example, the data processing task may include multiple algorithm
modules. Referring to the example of Fig. 1, the distributed machine learning platform can build
DAG (Directed Acyclic Graph) execution logic in the logical architecture module 11, and the DAG
execution logic represents the call relationships among the algorithm modules of the data
processing task.
Fig. 1 illustrates the DAG logic among the algorithm modules. For example, algorithm module 1 may
be a module that processes the raw collected data; algorithm module 2 may be a module that performs
feature analysis after the raw data has been processed, such as feature selection or feature
dimensionality reduction; algorithm module 3 may be a module that trains a model on the features
obtained by algorithm module 2; and algorithm module n may be a module that evaluates the
prediction performance of the trained model. This example is only illustrative. In practice, a
data processing task can be divided into multiple algorithm modules according to its
characteristics, and the execution process of the task can be represented by building a DAG.
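The DAG execution logic described above can be sketched as a dependency map that is sorted
topologically. This is only a minimal illustration, not the platform's implementation: the module
names are hypothetical labels for algorithm modules 1, 2, 3, and n, and the sketch assumes
Python 3.9 or later for the standard-library graphlib module.

```python
from graphlib import TopologicalSorter

# Hypothetical module names; each key maps to the set of modules it depends on.
dag = {
    "data_processing": set(),
    "feature_analysis": {"data_processing"},
    "model_training": {"feature_analysis"},
    "model_evaluation": {"model_training"},
}


def execution_order(dependencies):
    """Return one valid execution order in which every dependency runs first."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(dag)
print(order)
# ['data_processing', 'feature_analysis', 'model_training', 'model_evaluation']
```

Because the graph must be acyclic, a dependency cycle makes static_order raise
graphlib.CycleError, which is one way a platform could reject an invalid task definition.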
The algorithm execution module 12 can execute each algorithm module according to the execution
logic built by the logical architecture module 11, and performs the computation by calling an
algorithm library in the resource layer 13 when executing an algorithm module. In this example,
the resource layer 13 includes standalone algorithm libraries and distributed algorithm libraries,
so that it covers as comprehensive a range of algorithm library types as possible. When executing
an algorithm module, the algorithm execution module 12 can select and call a suitable algorithm
library in the resource layer 13 according to factors such as the volume of data to be processed
and the accuracy required of the algorithm. For example, Fig. 1 illustrates that, for one of the
algorithm modules built in the logical architecture module 11, the algorithm execution module 12
can select one of Hadoop, ODPS, and Spark in the resource layer 13 to perform the computation.
The description above gives the general architecture of the distributed machine learning platform
of this example: the logical architecture module 11 builds the DAG execution logic of the data
processing task, which expresses the algorithm modules included in the task and their
relationships, and the algorithm execution module 12 calls the algorithm libraries in the resource
layer 13 to execute each algorithm module according to that execution logic. In this embodiment,
every algorithm module is designed with a unified structure to facilitate concatenation and
distributed deployment of the modules.
Fig. 2 illustrates the structure of an algorithm module. As shown in Fig. 2, each algorithm module
may include an input part 21, an algorithm part 22, and an output part 23. The input part 21
(input) serves as the input of the algorithm part (algorithm), and the output part 23 (output)
serves as the output of the algorithm part. The input and output have the same interface format,
and their information type can be at least one of the following three: data (data), model
(model), or result (evaluation). For example, the data may be sampled data or split data, the
model may be a model trained on the data, and the result may be a result obtained from model
prediction.
The input part 21 may also include dependency information between this algorithm module and other
algorithm modules. For example, the module identifier of an algorithm module can be used to
indicate which algorithm module is depended on, such as when this module depends on the data,
model, or result of the previous module. The input part 21 may depend on one or more other
algorithm modules. The algorithm part 22 indicates which algorithm is used to process the
information supplied by the input part 21. The output part 23 indicates whether the algorithm
module outputs data, a model, or a result.
The structure design of an algorithm module is illustrated by the following example:
In the above example, the input part (inputs), the algorithm part (algorithm), and the output part
(outputs) of the algorithm module are each given a standard definition, and every algorithm module
is designed according to this structure. For example, in the inputs part of the above example, the
taskId of the algorithm module depended on is '10002', and the data, model, and result of
algorithm module '10002' are all used as inputs of this algorithm module. In the outputs part of
the example, the output of this algorithm module includes data and a result (true) but no model
(false). In the algorithm part of this algorithm module, the name of the algorithm used is the
logistic regression algorithm logisticRegression.
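A module definition matching the description above might look like the following sketch. Only the
taskId '10002', the three information types, the true/false output flags, and the algorithm name
logisticRegression come from the text; the exact field layout and the helper functions are
assumptions made for illustration.

```python
# Hypothetical reconstruction of the module definition described above.
# The nesting and key names (other than taskId) are assumed, not taken from the source.
module = {
    "inputs": [
        {"taskId": "10002", "data": True, "model": True, "evaluation": True},
    ],
    "algorithm": {"name": "logisticRegression"},
    "outputs": {"data": True, "model": False, "evaluation": True},
}

INFO_TYPES = ("data", "model", "evaluation")


def produced_types(definition):
    """List the information types this module declares as outputs."""
    return [t for t in INFO_TYPES if definition["outputs"].get(t, False)]


def dependencies(definition):
    """List the taskId of every module this module depends on."""
    return [entry["taskId"] for entry in definition["inputs"]]


print(produced_types(module))  # ['data', 'evaluation']
print(dependencies(module))    # ['10002']
```

Keeping the same three keys on both the input and output side is what lets a downstream module
check, without any algorithm-specific knowledge, whether an upstream module provides what it needs.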
In addition, to keep the DAG logic clear, it can be stipulated that each algorithm module outputs
only a single data item, model, or result, but may take in multiple data items, models, or
results. For example, in the example above, the input part inputs depends only on the algorithm
module whose taskId is '10002', and the data, model, and result of that module serve as the input
of this algorithm module. In other application scenarios, the input part inputs may depend on more
algorithm modules.
As an illustration, an example with multiple inputs follows. Referring to the input part inputs of
the example, this algorithm module has three input dependencies, namely the three algorithm
modules whose taskId values are '10002', '10003', and '10004'. The data (data) output by '10002',
the model (models) output by '10003', and the result (evaluations) output by '10004' are used as
the inputs of this algorithm module. Other scenarios are of course possible in practice; for
example, the input part may contain only data (data) and a model (models) without a result
(evaluations). These are not enumerated here.
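The multiple-input case described above could be written as follows. The three taskId values and
the information type taken from each come from the text; the key names and the helper function are
assumptions, since the original listing is not available.

```python
INFO_TYPES = ("data", "model", "evaluation")

# Hypothetical inputs section with three dependencies, one information type from each.
inputs = [
    {"taskId": "10002", "data": True},
    {"taskId": "10003", "model": True},
    {"taskId": "10004", "evaluation": True},
]


def required_inputs(entries):
    """Return (taskId, information type) pairs that the module consumes."""
    pairs = []
    for entry in entries:
        for info_type in INFO_TYPES:
            if entry.get(info_type, False):
                pairs.append((entry["taskId"], info_type))
    return pairs


print(required_inputs(inputs))
# [('10002', 'data'), ('10003', 'model'), ('10004', 'evaluation')]
```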
In this embodiment, the information types output by each algorithm module are also given a unified
definition. For data (data), intermediate results can be stored temporarily on a local or
distributed system, and a Schema file is used to pass the data between different algorithm
modules. For the model (models), the model parameters can be expressed in PMML (Predictive Model
Markup Language). PMML is a de facto standard language for representing data mining models, and it
can be used to share predictive analysis models between different algorithm modules. For the
result (evaluations), the result data of model evaluation can be stored in JSON form, and the
result data can also be visualized.
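As a small illustration of storing a result (evaluations) in JSON form, the sketch below writes
and reads an evaluation record with the Python standard library. The record layout, the metric
names, and the zero values are placeholders chosen for the example; the source does not specify
what the JSON contains.

```python
import json
import tempfile
from pathlib import Path


def save_evaluation(path, evaluation):
    """Write a module's evaluation result to a JSON file."""
    text = json.dumps(evaluation, indent=2, sort_keys=True)
    Path(path).write_text(text, encoding="utf-8")


def load_evaluation(path):
    """Read an evaluation result previously written by save_evaluation."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Placeholder record: field names and values are illustrative only.
example = {"taskId": "10002", "metrics": {"accuracy": 0.0, "auc": 0.0}}

with tempfile.TemporaryDirectory() as workdir:
    target = Path(workdir) / "evaluation.json"
    save_evaluation(target, example)
    loaded = load_evaluation(target)

print(loaded == example)  # True
```

A plain JSON file of this kind can be read by a downstream algorithm module and by a
visualization front end alike, which fits the two uses mentioned above.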
It can be seen that the application divides a data processing task into multiple independent
algorithm modules with a unified interface format. This allows the algorithm modules to be
deployed in a distributed manner, and the unified interface format facilitates smooth
concatenation of the algorithm modules.
For example, Fig. 3 illustrates three algorithm modules G1, G2, and G3. The data and model output
by G1 and the data and result output by G2 can serve as the input of G3, and the model output by
G3 can serve as the input of other modules. When G1, G2, and G3 are concatenated, the output of G1
(or G2) and the input of G3 have the same format definition, namely data, model, or result, so the
modules are easy to concatenate and no conflicts arise in interface standards. Through the same
interface format, the modules can therefore be spliced into a complete DAG logic for execution.
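The concatenation of G1, G2, and G3 can be sketched as a check that every input a module requests
is actually declared as an output by the module it depends on. The output flags for G1, G2, and G3
follow the text; the empty inputs of G1 and G2, the key names, and the function itself are
assumptions for illustration.

```python
INFO_TYPES = ("data", "model", "evaluation")

# G1 and G2 are given no inputs here, which is an assumption; Fig. 3 is not reproduced.
modules = {
    "G1": {"inputs": [], "outputs": {"data": True, "model": True, "evaluation": False}},
    "G2": {"inputs": [], "outputs": {"data": True, "model": False, "evaluation": True}},
    "G3": {
        "inputs": [
            {"taskId": "G1", "data": True, "model": True},
            {"taskId": "G2", "data": True, "evaluation": True},
        ],
        "outputs": {"data": False, "model": True, "evaluation": False},
    },
}


def concatenation_errors(definitions):
    """Return (module, dependency, type) triples for inputs that nothing provides."""
    errors = []
    for name, definition in definitions.items():
        for entry in definition["inputs"]:
            upstream = definitions.get(entry["taskId"])
            for info_type in INFO_TYPES:
                if not entry.get(info_type, False):
                    continue
                if upstream is None or not upstream["outputs"].get(info_type, False):
                    errors.append((name, entry["taskId"], info_type))
    return errors


print(concatenation_errors(modules))  # []
```

Because inputs and outputs share one vocabulary of information types, the check needs no
knowledge of which algorithm each module runs.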
When the distributed machine learning platform of this example is used for modeling, the multiple
algorithm modules that make up the modeling process can be deployed on different devices for
computation, and the algorithm modules can be concatenated smoothly through the same interface
format. This improves the efficiency of data processing, which, in the example of modeling with
the distributed machine learning platform, means improved modeling efficiency.
Fig. 4 illustrates a distributed machine learning method performed with the distributed machine
learning platform of the application. As shown in Fig. 4, the method may include:
In step 401, the multiple algorithm modules included in the data processing task are executed
according to the execution logic built for the data processing task. Each algorithm module
includes an input part, an algorithm part, and an output part, and the input part and the output
part have the same interface format. According to the algorithm part of the algorithm module, an
algorithm library in the resource layer is called to perform the computation.
In step 402, the multiple algorithm modules are concatenated according to the interface format and
the dependency information, included in the input part of each algorithm module, between this
algorithm module and other algorithm modules.
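The two steps above can be sketched as a single loop that executes modules in dependency order and
wires each module's inputs from upstream outputs as it goes. The task definitions, the algorithm
names, and the runner functions that stand in for algorithm library calls are all hypothetical;
the sketch assumes Python 3.9 or later for graphlib.

```python
from graphlib import TopologicalSorter

# Hypothetical task definitions; taskIds, algorithm names, and payloads are illustrative.
modules = {
    "10001": {"algorithm": "sample", "inputs": []},
    "10002": {"algorithm": "train", "inputs": [{"taskId": "10001", "data": True}]},
}

# Stand-ins for calls into an algorithm library in the resource layer.
runners = {
    "sample": lambda inputs: {"data": [1, 2, 3]},
    "train": lambda inputs: {"model": f"fitted on {len(inputs[('10001', 'data')])} rows"},
}


def run_pipeline(definitions, algorithm_runners):
    """Execute modules in dependency order, concatenating them as each one runs."""
    graph = {
        task_id: {entry["taskId"] for entry in spec["inputs"]}
        for task_id, spec in definitions.items()
    }
    results = {}
    for task_id in TopologicalSorter(graph).static_order():
        spec = definitions[task_id]
        gathered = {}
        # Concatenation: pull each requested information type from the upstream result.
        for entry in spec["inputs"]:
            for info_type, wanted in entry.items():
                if info_type != "taskId" and wanted:
                    gathered[(entry["taskId"], info_type)] = results[entry["taskId"]][info_type]
        # Execution: call the stand-in for the algorithm named in the algorithm part.
        results[task_id] = algorithm_runners[spec["algorithm"]](gathered)
    return results


results = run_pipeline(modules, runners)
print(results["10002"]["model"])  # fitted on 3 rows
```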
The execution order of steps 401 and 402 is not limited. For example, the distributed machine
learning platform can execute the algorithm modules in the DAG logic while concatenating them. In
addition, when an algorithm library is called for an algorithm module, the algorithm library in
the resource layer can be called according to the algorithm specified in the algorithm part
(algorithm) of the algorithm module. Furthermore, the machine learning platform of this embodiment
can encapsulate the same algorithm available in different locations, and select a suitable
algorithm library to perform the computation according to factors such as data volume and the
computational requirements of the algorithm.
For example, a training module for the logistic regression algorithm logisticRegression is
provided by the standalone algorithm libraries of R and Python, and distributed algorithm
libraries for it also exist on Mahout and MLlib. Whether the library is standalone or distributed,
the parameters of the algorithm itself do not differ much, so the machine learning platform of
this example can wrap these algorithm libraries in a unified encapsulation. The algorithm
execution module of the platform can then evaluate factors such as data volume, algorithm
stability, and accuracy requirements and select a suitable algorithm library to run. For example,
a standalone algorithm library can be selected when the data volume is small, and a distributed
algorithm library can be selected when the data volume is large, to improve processing speed.
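The unified encapsulation and the volume-based selection described above might be sketched as
follows. The Backend class, the row-count threshold, and the placeholder train functions are
assumptions; a real wrapper would call into the named library, and the source gives no concrete
cut-off between small and large data volumes.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Backend:
    """One algorithm library behind a common train(algorithm, params) interface."""
    name: str
    distributed: bool
    train: Callable[[str, dict], str]


def _make_trainer(library):
    def train(algorithm, params):
        # Placeholder: a real wrapper would invoke the library's own training routine.
        return f"{algorithm} trained with {library} using {sorted(params)}"
    return train


BACKENDS = [
    Backend("Python", distributed=False, train=_make_trainer("Python")),
    Backend("MLlib", distributed=True, train=_make_trainer("MLlib")),
]

ROW_THRESHOLD = 1_000_000  # assumed cut-off, not taken from the source


def select_backend(row_count, backends=BACKENDS, threshold=ROW_THRESHOLD):
    """Pick a standalone library for small data and a distributed one for large data."""
    want_distributed = row_count > threshold
    for backend in backends:
        if backend.distributed == want_distributed:
            return backend
    raise LookupError("no suitable algorithm library registered")


small = select_backend(10_000)
large = select_backend(50_000_000)
print(small.name, large.name)  # Python MLlib
print(small.train("logisticRegression", {"maxIter": 100}))
# logisticRegression trained with Python using ['maxIter']
```

Since both backends accept the same algorithm name and parameter dictionary, the caller does not
change when the selection rule later takes stability or accuracy requirements into account.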
In addition, a data processing task can use many types of algorithms, for example, algorithms for
data processing, for feature engineering, and for model training and evaluation. Data processing
may include data sampling, data splitting, missing-value handling, and so on. Feature engineering
may include feature importance calculation, feature crossing, feature discretization, feature
selection, and so on. Model training and evaluation may include model training, intelligent
assembly of the PMML expression of model parameters, model prediction and evaluation, intelligent
tuning of model parameters, and so on.
The distributed machine learning platform of the embodiments of the application makes it possible
to share multiple algorithm libraries and to include as comprehensive a set of algorithm libraries
as possible. It can also build DAG logic that clearly expresses the modeling process and the
relationships among the algorithm modules. Moreover, by designing a unified algorithm module
interface format, the algorithm modules can be deployed in a distributed, relatively independent
manner while smooth concatenation between modules is ensured, which improves data processing
efficiency.
The foregoing describes only preferred embodiments of the application and is not intended to limit
the application. Any modification, equivalent substitution, or improvement made within the spirit
and principles of the application shall fall within the scope of protection of the application.