CN107092962B - Distributed machine learning method and platform - Google Patents

Distributed machine learning method and platform

Info

Publication number
CN107092962B
CN107092962B (application CN201610090044.7A)
Authority
CN
China
Prior art keywords
algorithm
module
modules
data
platform
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610090044.7A
Other languages
Chinese (zh)
Other versions
CN107092962A (en)
Inventor
毛仁歆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Advanced New Technologies Co Ltd
Original Assignee
Advanced New Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Advanced New Technologies Co Ltd filed Critical Advanced New Technologies Co Ltd
Priority to CN201610090044.7A priority Critical patent/CN107092962B/en
Publication of CN107092962A publication Critical patent/CN107092962A/en
Application granted granted Critical
Publication of CN107092962B publication Critical patent/CN107092962B/en
Legal status: Active

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00 — Computing arrangements using knowledge-based models
    • G06N 5/02 — Knowledge representation; Symbolic representation
    • G06N 5/022 — Knowledge engineering; Knowledge acquisition
    • G06N 5/025 — Extracting rules from data

Abstract

The application provides a distributed machine learning method and platform. The platform comprises: a logic framework module for constructing the execution logic of a data processing task, wherein the task comprises a plurality of algorithm modules and each algorithm module comprises an input part, an algorithm part, and an output part; the input part and the output part have the same interface format, so that the plurality of algorithm modules can be connected in series according to that format, and the input part carries dependency information between the algorithm module and other algorithm modules; and an algorithm execution module for executing each algorithm module according to the execution logic constructed by the logic framework module, calling an algorithm library in a resource layer to perform operations according to the module's algorithm part. Data processing efficiency is thereby improved.

Description

Distributed machine learning method and platform
Technical Field
The present application relates to computer technologies, and in particular, to a distributed machine learning method and platform.
Background
Big data processing technology has matured steadily: big data can be used to build a data model for a business, and the model can then be applied to predict business outcomes. When the data size is small, the computing power of a single machine is sufficient; when the data size is huge, however, a distributed computing platform is required to carry out the whole modeling process. In the related art, when a distributed computing platform performs modeling, the functional modules of the modeling process may be deployed on different devices for computation, but connecting these modules in series is difficult because of the complex dependency relationships between them. For example, the modules must be analyzed and connected manually, so the efficiency of data processing is low.
Disclosure of Invention
In view of this, the present application provides a distributed machine learning method and platform to improve the efficiency of data processing.
Specifically, the method is realized through the following technical scheme:
in a first aspect, a distributed machine learning platform is provided, the platform comprising:
a logic framework module for constructing execution logic of a data processing task, wherein the data processing task comprises a plurality of algorithm modules, and each algorithm module comprises an input part, an algorithm part, and an output part; the input part and the output part have the same interface format, so that the plurality of algorithm modules can be connected in series according to the interface format; and the input part comprises dependency information between the algorithm module and other algorithm modules;
and an algorithm execution module for executing each algorithm module according to the execution logic constructed by the logic framework module, calling an algorithm library in a resource layer to perform operations according to the algorithm part of the algorithm module.
In a second aspect, a distributed machine learning method is provided, including:
executing, according to the constructed execution logic of a data processing task, each of a plurality of algorithm modules included in the task, wherein each algorithm module comprises an input part, an algorithm part, and an output part, the input part and the output part having the same interface format; and calling an algorithm library in a resource layer to perform operations according to the algorithm part of the algorithm module;
and connecting the plurality of algorithm modules in series according to the interface format and the dependency information, included in the input part of each algorithm module, between that algorithm module and other algorithm modules.
With the distributed machine learning method and platform of the application, the algorithm modules of the modeling process can be deployed on different devices for computation, and the shared interface format allows the modules to be connected in series smoothly, which improves data processing efficiency; when the platform is applied to modeling, modeling efficiency improves accordingly.
Drawings
FIG. 1 is a framework of a distributed machine learning platform shown in an exemplary embodiment of the present application;
FIG. 2 is a block diagram illustrating the structural design of an algorithm module according to an exemplary embodiment of the present application;
FIG. 3 is a schematic diagram of a concatenation of algorithm modules according to an exemplary embodiment of the present application;
FIG. 4 is a flowchart illustrating a distributed machine learning method according to an exemplary embodiment of the present application.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present application; rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
The embodiment of the application provides a distributed machine learning platform, and a data miner can use the platform to perform a data processing task, for example, the data processing task can be to establish a prediction model according to acquired data and evaluate the accuracy of the prediction model.
Fig. 1 illustrates the framework of the distributed machine learning platform. As shown in fig. 1, it includes a logic framework module 11, an algorithm execution module 12, and a resource layer 13. Executing a data processing task, for example building a model, uses various algorithms, and the resource layer 13 serves as the underlying support: it may integrate multiple algorithm libraries, such as the single-machine libraries R and Python illustrated in fig. 1, or distributed libraries such as Hadoop, ODPS, and Spark; other libraries such as MLlib, Mahout, and Xlib may also be included but are not listed in fig. 1.
The resource layer 13 is equivalent to the underlying support for executing data processing tasks, for example, data processing, feature selection, model training, etc. in the modeling process all use various algorithms, and call the algorithm library in the resource layer 13 to execute specific processing. The logic framework module 11 is used to construct execution logic of a data processing task, for example, the data processing task may include a plurality of algorithm modules, and referring to the example of fig. 1, the distributed machine learning platform may construct DAG (Directed Acyclic Graph) execution logic at the logic framework module 11, and the DAG execution logic may represent call relations between the algorithm modules of the data processing task.
Fig. 1 illustrates DAG logic between algorithm modules, for example, algorithm module 1 may be a module that performs data processing on raw acquired data, algorithm module 2 may be a module that performs feature analysis after processing raw data, and may perform feature selection or feature dimension reduction, etc., algorithm module 3 may be a module that performs model training according to features obtained by algorithm module 2 to obtain a model, and algorithm module n may be a module that performs effect prediction on the trained model. The above example is only illustrative, and in practical application, a plurality of algorithm modules may be divided according to the characteristics of a data processing task, and the execution process of the task is represented by building a DAG graph.
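The DAG-driven execution described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the patent's implementation; the module names and the use of the standard-library `graphlib` topological sorter are assumptions:

```python
from graphlib import TopologicalSorter

# Illustrative DAG mirroring modules 1..n of fig. 1:
# each module maps to the set of modules it depends on.
dag = {
    "data_processing": set(),
    "feature_analysis": {"data_processing"},
    "model_training": {"feature_analysis"},
    "effect_prediction": {"model_training"},
}

def run_dag(dag, execute):
    """Execute every algorithm module in dependency order."""
    order = list(TopologicalSorter(dag).static_order())
    for module in order:
        execute(module)  # in the platform, this would invoke the algorithm part
    return order

executed = run_dag(dag, execute=lambda m: None)
```

Because the example dependencies form a single chain, the topological order is unique: data processing runs first and effect prediction last.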
The algorithm execution module 12 may execute each algorithm module according to the execution logic constructed by the logic framework module 11, calling an algorithm library in the resource layer 13 to perform the operations. In this example, the resource layer 13 includes both single-machine and distributed algorithm libraries, so that it covers as many types of library as possible; when executing a given algorithm module, the algorithm execution module 12 may select and invoke a suitable library in the resource layer 13 according to factors such as the size of the data currently being processed and the accuracy required of the algorithm. For example, fig. 1 illustrates that, for one of the algorithm modules constructed in the logic framework module 11, the algorithm execution module 12 may select one library from among Hadoop, ODPS, and Spark in the resource layer 13 to execute the operation.
The above describes the general architecture of the distributed machine learning platform of this example: the logic framework module 11 builds the DAG execution logic of the data processing task, showing each algorithm module and the relationships among them, and the algorithm execution module 12 calls the algorithm libraries in the resource layer 13 to execute each algorithm module according to that logic. In this embodiment, each algorithm module is designed with a uniform structural format, which makes serial connection and distributed deployment of the modules convenient.
Fig. 2 illustrates the structural design of an algorithm module. As shown in fig. 2, each algorithm module may include an input part 21, an algorithm part 22, and an output part 23. The input part 21 (input) serves as the input of the algorithm part (algorithm), and the output part 23 (output) serves as its output; the input and output share the same interface format, and the information type may be at least one of three kinds: data (data), a model (model), or a result (evaluation). For example, the data may be sampled or split data, the model may be a model trained from the data, and the result may be a prediction produced by the model.
The input part 21 may also include dependency information between the present algorithm module and other algorithm modules; for example, the module identification of an algorithm module may indicate which module is depended on, e.g., that the present module depends on the data, model, or result of a previous module. The input part 21 may depend on at least one other algorithm module. The algorithm part 22 indicates which algorithm is used to process the information supplied by the input part 21, and the output part 23 states whether the algorithm module yields data, a model, or a result.
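The three-part module structure just described can be modeled as plain data types. The following Python sketch is illustrative only; the class and field names (`InputPart`, `depends_on`, and so on) are assumptions, not identifiers from the patent:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class InputPart:
    # taskIds of the algorithm modules this module depends on (at least one).
    depends_on: List[str]
    use_data: bool = True
    use_model: bool = False
    use_result: bool = False

@dataclass
class OutputPart:
    # Whether the module yields data, a model, or a result.
    data: bool = False
    model: bool = False
    result: bool = False

@dataclass
class AlgorithmModule:
    task_id: str
    inputs: InputPart
    algorithm: str  # name of the algorithm, e.g. "logisticRegression"
    outputs: OutputPart

# A module that consumes the model of module '10002' and yields data and a result.
m = AlgorithmModule(
    task_id="10003",
    inputs=InputPart(depends_on=["10002"], use_model=True),
    algorithm="logisticRegression",
    outputs=OutputPart(data=True, result=True),
)
```

Because input and output use the same small set of information types, any module's output part can be matched against another module's input part when wiring the DAG.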
The following structural design of the algorithm module is illustrated by an example:
(The example module definition appears as an image in the original patent; its fields are described below.)
In the above example, the input part (inputs), algorithm part (algorithm), and output part (outputs) of an algorithm module are each defined by a standard, and every algorithm module is designed according to this structure. For example, the taskId of the depended-on algorithm module in the input part is '10002', and the data, model, and result of module '10002' are all used as inputs of this module. In the outputs part, the module yields data and a result (true) but no model (false). In the algorithm part, the algorithm used is the logistic regression algorithm logisticRegression.
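The original patent shows this definition only as an image, but the surrounding text describes its contents. A hypothetical reconstruction, with the exact key names assumed, might look like:

```python
# Hypothetical reconstruction of the module definition described in the text.
# Key names are assumed; only the taskId '10002', the logisticRegression
# algorithm, and the true/false output flags come from the description.
module_spec = {
    "inputs": {
        "taskId": "10002",   # depends on module '10002'
        "data": True,        # consume its data
        "model": True,       # consume its model
        "evaluation": True,  # consume its result
    },
    "algorithm": {
        "name": "logisticRegression",
    },
    "outputs": {
        "data": True,        # this module yields data
        "model": False,      # but no model
        "evaluation": True,  # and a result
    },
}
```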
Furthermore, to keep the DAG logic clear, it may be specified that each algorithm module produces only a single data, model, or result, while multiple data, models, or results may be taken in. In the above example, the input part inputs depends only on the algorithm module with taskId '10002', whose data, model, and result serve as this module's input. In other application scenarios, the input part may depend on more algorithm modules.
Illustratively, an example with multiple inputs is shown below. In its input part, the module depends on three algorithm modules, with taskIds '10002', '10003', and '10004': the data output by '10002', the model output by '10003', and the result (evaluation) output by '10004' are all used as this module's inputs. Other combinations are of course possible in practice; for example, the input part might carry only data and models but no evaluations, which will not be detailed here.
(The multi-input module definition appears as an image in the original patent.)
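Again, only the prose description of this multi-input definition survives; a hypothetical reconstruction under the same assumed key names (the algorithm name and output flags here are pure placeholders) could be:

```python
# Hypothetical reconstruction: three upstream dependencies, each contributing
# a different information type. Algorithm and outputs are placeholders.
multi_input_spec = {
    "inputs": [
        {"taskId": "10002", "data": True},        # take the data from '10002'
        {"taskId": "10003", "model": True},       # take the model from '10003'
        {"taskId": "10004", "evaluation": True},  # take the result from '10004'
    ],
    "algorithm": {"name": "modelEvaluation"},     # placeholder name
    "outputs": {"data": False, "model": False, "evaluation": True},
}
```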
In this embodiment, the information types output by the algorithm modules are also defined uniformly. For data, intermediate results may be stored temporarily in a local or distributed system, and a Schema file may be used to pass data between algorithm modules. For models, the model parameters can be expressed in PMML (Predictive Model Markup Language), a de facto standard language for representing data mining models, which can be used to share predictive models between algorithm modules. For results (evaluations), the model evaluation data can be stored as JSON and displayed visually.
It can be seen that the present application divides a data processing task into a number of independent algorithm modules with a uniform interface format, so that the modules can be deployed in a distributed manner and smoothly concatenated. For example, fig. 3 illustrates three algorithm modules G1, G2, and G3: the data and model output by G1 and the data and result output by G2 may all serve as inputs to G3, while the model output by G3 may in turn be input to other modules. Because the outputs of G1 (or G2) and the inputs of G3 share the same format definitions for data, model, and result, there is no conflict in interface standards and the connections between modules are easy to realize. Through this shared interface format, the modules can therefore be assembled into a complete DAG logic for execution.
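A small sketch of why the shared format makes concatenation safe: a link is valid exactly when every information type the downstream module expects is actually produced upstream. This check is illustrative, not from the patent:

```python
def can_connect(producer_outputs, consumer_needs):
    """A link is valid when every type the consumer expects
    ('data', 'model', 'evaluation') is produced upstream."""
    return all(producer_outputs.get(t, False) for t in consumer_needs)

# G1 from fig. 3: outputs data and a model, but no result.
g1_outputs = {"data": True, "model": True, "evaluation": False}

ok_link = can_connect(g1_outputs, ["data", "model"])   # G3 consuming G1
bad_link = can_connect(g1_outputs, ["evaluation"])     # would need a result G1 lacks
```

Because every module speaks the same three-type vocabulary, this one check suffices for any pair of modules in the DAG.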
When the distributed machine learning platform of the example is used for modeling, a plurality of algorithm modules included in the modeling process can be respectively deployed on different devices for calculation processing, and the algorithm modules can be smoothly connected in series through the same interface format, so that the data processing efficiency is improved.
Fig. 4 illustrates a distributed machine learning method performed using the distributed machine learning platform of the present application, which, as shown in fig. 4, may include:
In step 401, according to the constructed execution logic of the data processing task, each of the plurality of algorithm modules included in the task is executed. Each algorithm module includes an input part, an algorithm part, and an output part, with the input part and the output part sharing the same interface format, and an algorithm library in the resource layer is called to perform operations according to the algorithm part of the module.
In step 402, the plurality of algorithm modules are connected in series according to the interface format and the dependency information, contained in each module's input part, between that module and other algorithm modules.
The execution order of 401 and 402 is not limited; for example, the distributed machine learning platform may execute the algorithm modules of the DAG logic and connect them in series at the same time. When an algorithm library is called for an algorithm module, the library in the resource layer is invoked according to the module's algorithm part. In addition, the machine learning platform of this embodiment may place a uniform encapsulation over the same algorithm as implemented in different libraries, and select a suitable library to execute it according to factors such as data volume and the operational requirements on the algorithm.
For example, single-machine training implementations of the logistic regression algorithm logisticRegression are provided in the R and Python libraries, while distributed implementations exist in Mahout and MLlib; since the algorithm's parameters differ little between the single-machine and distributed libraries, the machine learning platform of this example can encapsulate these libraries uniformly. The platform's algorithm execution module can then evaluate and select a suitable library according to factors such as data size, the stability of the algorithm, and accuracy requirements. If the amount of data is small, a single-machine library may be selected; if the amount of data is large, a distributed library may be selected to increase processing speed.
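The selection rule above can be sketched minimally. The threshold, library names as plain strings, and data volume being the sole criterion are all simplifying assumptions; the patent also mentions stability and accuracy as factors:

```python
def pick_library(data_rows,
                 single_libs=("R", "Python"),
                 distributed_libs=("Spark", "Hadoop", "ODPS"),
                 threshold=1_000_000):
    """Toy selection rule: small data -> single-machine library,
    large data -> distributed library. The threshold is illustrative."""
    if data_rows < threshold:
        return single_libs[0]
    return distributed_libs[0]

small_choice = pick_library(10_000)       # single-machine library
large_choice = pick_library(50_000_000)   # distributed library
```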
In addition, many types of algorithms may be used in a data processing task, for example algorithms for data processing, for feature engineering, and for model training and evaluation. Data processing may include data sampling, data splitting, and missing-value handling; feature engineering may include feature importance calculation, feature crossing, feature discretization, and feature selection; and model training and evaluation may include model training, PMML assembly of model parameter expressions, model prediction and evaluation, and intelligent search over model parameters.
The distributed machine learning platform of the embodiments of the application can share a wide range of algorithm libraries, making the available libraries as comprehensive as possible; it can construct DAG logic that clearly expresses the modeling process and the relations between the algorithm modules; and, by designing a uniform interface format for the algorithm modules, it allows each module to be deployed relatively independently and in a distributed manner while still guaranteeing smooth serial connection between modules, thereby improving data processing efficiency.
The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the scope of protection of the present application.

Claims (10)

1. A distributed machine learning platform, the platform comprising:
a logic framework module for constructing execution logic of a data processing task, wherein the data processing task comprises a plurality of algorithm modules, and each algorithm module comprises an input part, an algorithm part, and an output part, the input part and the output part having the same interface format so that the plurality of algorithm modules can be connected in series according to the interface format; the input part comprises dependency information between the algorithm module and other algorithm modules; each algorithm module produces a single data, model, or result; and each algorithm module takes in at least one data, model, or result;
an algorithm execution module for executing each algorithm module according to the execution logic constructed by the logic framework module, calling an algorithm library in a resource layer to perform operations according to the algorithm part of the algorithm module, the algorithm libraries being encapsulated uniformly.
2. The platform of claim 1, wherein the interface format of the input portion and the output portion comprises:
the input part as input of the algorithm part and the output part as output of the algorithm part comprise at least one of the following information types: data, model, or result.
3. The platform of claim 1, wherein the number of other algorithm modules on which the input part depends is at least one.
4. The platform of claim 1, wherein the resource layer comprises: a single-machine version algorithm library and a distributed algorithm library.
5. The platform of claim 1, wherein the dependency information comprises: module identification of the dependent algorithm module.
6. A distributed machine learning method, comprising:
executing, according to the constructed execution logic of a data processing task, each of a plurality of algorithm modules included in the task, wherein each algorithm module comprises an input part, an algorithm part, and an output part, the input part and the output part having the same interface format; calling an algorithm library in a resource layer to perform operations according to the algorithm part of the algorithm module, the algorithm libraries being encapsulated uniformly; each algorithm module produces a single data, model, or result; and each algorithm module takes in at least one data, model, or result;
and connecting the plurality of algorithm modules in series according to the interface format and the dependency information, included in the input part of each algorithm module, between that algorithm module and other algorithm modules.
7. The method according to claim 6, characterized in that at least one of the following types of information is used as input or output for the algorithm module: data, model, or result.
8. The method of claim 6, wherein the number of other algorithm modules on which the input part depends is at least one.
9. The method of claim 6, wherein the resource layer comprises: a single-machine version algorithm library and a distributed algorithm library.
10. The method of claim 6, wherein the dependency information comprises: module identification of the dependent algorithm module.
CN201610090044.7A 2016-02-17 2016-02-17 Distributed machine learning method and platform Active CN107092962B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610090044.7A CN107092962B (en) 2016-02-17 2016-02-17 Distributed machine learning method and platform


Publications (2)

Publication Number Publication Date
CN107092962A CN107092962A (en) 2017-08-25
CN107092962B true CN107092962B (en) 2021-01-26

Family

ID=59649265

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610090044.7A Active CN107092962B (en) 2016-02-17 2016-02-17 Distributed machine learning method and platform

Country Status (1)

Country Link
CN (1) CN107092962B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108153815A (en) * 2017-11-29 2018-06-12 北京京航计算通讯研究所 Towards the index classification method of big data
CN110120251A (en) * 2018-02-07 2019-08-13 北京第一视角科技有限公司 The statistical analysis technique and system of multidimensional health data based on Spark
CN110598868B (en) * 2018-05-25 2023-04-18 腾讯科技(深圳)有限公司 Machine learning model building method and device and related equipment
CN108897587B (en) * 2018-06-22 2021-11-12 北京优特捷信息技术有限公司 Pluggable machine learning algorithm operation method and device and readable storage medium
CN108960433B (en) * 2018-06-26 2022-04-05 第四范式(北京)技术有限公司 Method and system for running machine learning modeling process
CN109325756A (en) * 2018-08-03 2019-02-12 上海小渔数据科技有限公司 Data processing method and device, server for data algorithm transaction
CN109343833B (en) * 2018-09-20 2022-12-16 鼎富智能科技有限公司 Data processing platform and data processing method
TWI706378B (en) * 2018-12-29 2020-10-01 鴻海精密工業股份有限公司 Cloud device, terminal device, and image classification method
CN110909761A (en) * 2019-10-12 2020-03-24 平安科技(深圳)有限公司 Image recognition method and device, computer equipment and storage medium
CN110825511A (en) * 2019-11-07 2020-02-21 北京集奥聚合科技有限公司 Operation flow scheduling method based on modeling platform model
CN110880036B (en) * 2019-11-20 2023-10-13 腾讯科技(深圳)有限公司 Neural network compression method, device, computer equipment and storage medium
CN112488365A (en) * 2020-11-17 2021-03-12 深圳供电局有限公司 Load prediction system and method based on load prediction pipeline framework language
CN114489867B (en) * 2022-04-19 2022-09-06 浙江大华技术股份有限公司 Algorithm module scheduling method, algorithm module scheduling device and readable storage medium
CN114997414B (en) * 2022-05-25 2024-03-08 北京百度网讯科技有限公司 Data processing method, device, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030033263A1 (en) * 2001-07-31 2003-02-13 Reel Two Limited Automated learning system
CN101782976A (en) * 2010-01-15 2010-07-21 南京邮电大学 Automatic selection method for machine learning in cloud computing environment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on distributed one-class support vector machine clustering algorithms; 谢通 (Xie Tong); Wanfang dissertation database; 2011-12-28; pages 36 and 43 *


Similar Documents

Publication Publication Date Title
CN107092962B (en) Distributed machine learning method and platform
US11080435B2 (en) System architecture with visual modeling tool for designing and deploying complex models to distributed computing clusters
US11074107B1 (en) Data processing system and method for managing AI solutions development lifecycle
KR101657495B1 (en) Image recognition method using deep learning analysis modular systems
US20210004642A1 (en) Ai capability research and development platform and data processing method
CN111488211A (en) Task processing method, device, equipment and medium based on deep learning framework
US11256484B2 (en) Utilizing natural language understanding and machine learning to generate an application
US8924923B2 (en) Apparatus and method of generating multi-level test case from unified modeling language sequence diagram based on multiple condition control flow graph
CN109816114A (en) A kind of generation method of machine learning model, device
CN109656872A (en) Dynamic partially reconfigurable on-chip system software and hardware partitioning method
Choi et al. Tellurium: A python based modeling and reproducibility platform for systems biology
GB2582782A (en) Graph conversion method
Kang et al. The extended activity cycle diagram and its generality
Shcherbakov et al. Lean data science research life cycle: A concept for data analysis software development
US20240086165A1 (en) Systems and methods for building and deploying machine learning applications
CN117201308A (en) Network resource allocation method, system, storage medium and electronic equipment
Bala et al. Extracting-transforming-loading modeling approach for big data analytics
KR20200103133A (en) Method and apparatus for performing extract-transfrom-load procedures in a hadoop-based big data processing system
Aleeva et al. Software System for Maximal Parallelization of Algorithms on the Base of the Conception of Q-determinant
US20220214864A1 (en) Efficient deployment of machine learning and deep learning model's pipeline for serving service level agreement
Foit Petri nets in modelling and simulation of the hierarchical structure of manufacturing systems
KR101995108B1 (en) Method and system for modeling compositional feature model providing interest in view in software product line
Rousseau et al. AMBER: A New Architecture for Flexible MBSE Workflows
Zander et al. Technical engine for democratization of modeling, simulations, and predictions
Fallah et al. A parallel hybrid genetic algorithm for solving the maximum clique problem

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20200923

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant after: Innovative advanced technology Co.,Ltd.

Address before: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant before: Advanced innovation technology Co.,Ltd.

Effective date of registration: 20200923

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant after: Advanced innovation technology Co.,Ltd.

Address before: Fourth floor, P.O. Box 847, Capital Building, Grand Cayman, Cayman Islands

Applicant before: Alibaba Group Holding Ltd.

GR01 Patent grant
GR01 Patent grant