CN107679625B - Distributed system for performing machine learning on data records, and method thereof - Google Patents

Distributed system for performing machine learning on data records, and method thereof Download PDF

Info

Publication number
CN107679625B
CN107679625B
Authority
CN
China
Prior art keywords
parameter
computing device
machine learning
learning model
server
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710764131.0A
Other languages
Chinese (zh)
Other versions
CN107679625A (en)
Inventor
戴文渊
陈雨强
杨强
焦英翔
涂威威
石光川
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
4Paradigm Beijing Technology Co Ltd
Original Assignee
4Paradigm Beijing Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 4Paradigm Beijing Technology Co Ltd filed Critical 4Paradigm Beijing Technology Co Ltd
Priority to CN201710764131.0A
Publication of CN107679625A
Application granted
Publication of CN107679625B
Legal status: Active (current)
Anticipated expiration

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Abstract

A distributed system for performing machine learning on data records, and a corresponding method, are provided. The system comprises: a plurality of computing devices, wherein each computing device performs the same data-flow computation related to machine learning on its own data records; and a parameter server for maintaining the parameters of a machine learning model. When a data-flow computation for training the machine learning model is performed, each computing device uses parameters obtained from the parameter server to perform the same model-training operations on its own data records, and the parameter server updates the parameters according to the operation results of the computing devices; and/or, when a data-flow computation for prediction with the machine learning model is performed, each computing device uses parameters obtained from the parameter server to perform the same prediction operations on its own data records. Homogeneity among the computing devices is thereby achieved, reducing network transmission overhead.

Description

Distributed system for performing machine learning on data records, and method thereof
Technical field
Exemplary embodiments of the present invention generally relate to the field of artificial intelligence, and more particularly to a distributed system for performing machine learning on data records and to a method of performing machine learning on data records using such a distributed system.
Background
With the rapid growth of data scale, machine learning is widely applied in various fields to mine the value of data. However, the memory of a typical physical machine is far from sufficient for performing machine learning at such scale; in practice it is therefore usually necessary to complete the training of a machine learning model, or the corresponding prediction, on a distributed machine learning platform.
In existing distributed machine learning systems (for example, Google's deep learning framework TensorFlow), there are usually one or more control nodes responsible for scheduling the tasks and computing resources of the other computing nodes in the system, where each computing node executes only a part of the computing task; that is, the computing tasks executed by the individual computing nodes are not identical. Consequently, adjacent steps of the overall computing task may have to be executed on different physical machines, which requires streaming data over the network between computing nodes and/or control nodes. Such an approach incurs a huge read/write overhead, and high-speed network interface cards (for example, 10-gigabit cards) are costly and cannot be widely deployed in distributed machine learning systems.
Furthermore, in the above distributed machine learning systems, if one wishes to run a machine learning algorithm under multiple configurations or to run it multiple times, or to run several machine learning algorithms simultaneously, it is necessary either to modify the algorithm internally or to invoke it repeatedly from external logic; both approaches consume a considerable amount of additional computation.
Summary of the invention
Exemplary embodiments of the present invention aim to overcome the drawbacks of existing distributed machine learning systems, namely their large network overhead and computation cost when performing machine learning.
According to an exemplary embodiment of the present invention, a distributed system for performing machine learning on data records is provided, comprising: a plurality of computing devices, wherein each computing device is configured to perform the same data-flow computation related to machine learning on its own data records; and a parameter server for maintaining the parameters of a machine learning model. When a data-flow computation for training the machine learning model is performed, each computing device uses parameters obtained from the parameter server to perform the same model-training operations on its own data records, and the parameter server updates the parameters according to the operation results of the computing devices; and/or, when a data-flow computation for prediction with the machine learning model is performed, each computing device uses parameters obtained from the parameter server to perform the same prediction operations on its own data records.
Optionally, in the distributed system, the parameter server has a distributed structure in which each partial parameter server is integrated with a corresponding computing device.
Optionally, in the distributed system, when a data-flow computation for training the machine learning model is performed, disaster recovery backup is performed for each iteration over the data records.
Optionally, in the distributed system, the data-flow computation is expressed as at least one directed acyclic graph composed of processing steps.
Optionally, in the distributed system, a computing device performs the data-flow computation by merging identical processing steps in different directed acyclic graphs.
Optionally, in the distributed system, the parameter server stores the parameters of the machine learning model as key-value pairs, and key-value pairs sharing the same key are stored in a form in which a single key corresponds to multiple values.
According to another exemplary embodiment of the present invention, a method of performing machine learning on data records using a distributed system is provided, wherein each of a plurality of computing devices in the distributed system is configured to perform the same data-flow computation related to machine learning on its own data records. The method comprises: obtaining, by each of the plurality of computing devices, its own data records; and obtaining, by the computing devices, the parameters of a machine learning model from a parameter server in the distributed system; wherein, when a data-flow computation for training the machine learning model is performed, each computing device uses the obtained parameters to perform the same model-training operations on its own data records, and the parameter server updates the parameters according to the operation results of the computing devices; and/or, when a data-flow computation for prediction with the machine learning model is performed, each computing device uses the obtained parameters to perform the same prediction operations on its own data records.
Optionally, in the method, the parameter server has a distributed structure in which each partial parameter server is integrated with a corresponding computing device.
Optionally, in the method, when a data-flow computation for training the machine learning model is performed, disaster recovery backup is performed for each iteration over the data records.
Optionally, in the method, the data-flow computation is expressed as at least one directed acyclic graph composed of processing steps.
Optionally, in the method, a computing device performs the data-flow computation by merging identical processing steps in different directed acyclic graphs.
Optionally, in the method, the parameter server stores the parameters of the machine learning model as key-value pairs, and key-value pairs sharing the same key are stored in a form in which a single key corresponds to multiple values.
In the distributed machine learning system and method according to exemplary embodiments of the present invention, each computing device is configured to perform the same data-flow computation, so that the computing devices are homogeneous. This reduces network transmission overhead and makes the system well suited to large-scale machine learning models.
Brief description of the drawings
These and other aspects and advantages of the present invention will become clearer and easier to understand from the following detailed description of embodiments of the invention, taken in conjunction with the accompanying drawings, in which:
Fig. 1 shows a block diagram of a distributed system for performing machine learning on data records according to an exemplary embodiment of the present invention;
Fig. 2 shows a block diagram of a distributed system for performing machine learning on data records according to another exemplary embodiment of the present invention;
Fig. 3 shows a block diagram of a parameter server according to an exemplary embodiment of the present invention;
Fig. 4 shows a block diagram of a computing device according to an exemplary embodiment of the present invention;
Fig. 5 shows a flowchart of a method by which the distributed machine learning system trains a machine learning model, according to an exemplary embodiment of the present invention;
Fig. 6 shows a flowchart of a method by which the distributed machine learning system trains a machine learning model, according to another exemplary embodiment of the present invention;
Fig. 7 shows a flowchart of a method by which the distributed machine learning system performs prediction with a machine learning model, according to an exemplary embodiment of the present invention;
Fig. 8 shows an example of performing a data-flow computation by merging directed acyclic graphs, according to an exemplary embodiment of the present invention; and
Fig. 9 shows an example of storing the parameters of a machine learning model as key-value pairs, according to an exemplary embodiment of the present invention.
Detailed description of embodiments
In order to enable those skilled in the art to better understand the present invention, exemplary embodiments of the invention are described in further detail below with reference to the accompanying drawings and specific implementations.
Machine learning is an inevitable product of the development of artificial intelligence research to a certain stage; it is devoted to improving the performance of a system itself, by computational means, through the use of experience. In a computer system, "experience" usually exists in the form of "data". A "model" can be generated from data by a machine learning algorithm; the model can be represented as a certain algorithmic function under specific parameters. That is, empirical data are supplied to the machine learning algorithm, and a model is generated based on these data (in other words, the parameters of the function are learned from the data); when faced with new data, the model provides a corresponding judgment, i.e., a prediction result. Machine learning may be implemented in the form of "supervised learning", "unsupervised learning" or "semi-supervised learning". It should be noted that exemplary embodiments of the present invention impose no specific restriction on the machine learning algorithm used. It should further be noted that, in the process of training a machine learning model, statistical algorithms, business rules and/or expert knowledge may also be used to further improve the effect of machine learning.
In particular, exemplary embodiments of the present invention relate to a distributed machine learning system. The distributed machine learning system may consist of a parameter server and computing devices, in which a plurality of computing devices execute the same machine learning task in a distributed manner, each on its own data records; accordingly, the parameter server maintains the parameters of the machine learning model by interacting with each computing device. It should be noted that the computing devices and/or the parameter server referred to here are defined by the processing they perform or the functions they realize; they may denote either physical entities or virtual entities. For example, a computing device may denote an actual computing machine or a logical entity deployed on such a machine; likewise, the parameter server may denote an actual computing machine, may be deployed as one or more logical entities on the same computing machine as the computing devices or on different machines, and may even be played directly by certain computing devices.
Fig. 1 shows a block diagram of a distributed system for performing machine learning on data records according to an exemplary embodiment of the present invention. Specifically, the distributed machine learning system shown in Fig. 1 may include a parameter server 2000 and a plurality of computing devices 1000 (for example, 1000-1, 1000-2, ..., 1000-n, where n is an integer greater than 1). The distributed machine learning system can be used to train a machine learning model and/or to perform prediction with a trained machine learning model.
Specifically, each computing device 1000 is configured to perform the same data-flow computation related to machine learning on its own data records. As an example, each computing device 1000 may obtain a portion of the data records to be processed from a data source (for example, a location accessible to all computing devices, such as a cloud disk on a network) and perform the data-flow computation on the obtained portion; alternatively, when the amount of data is small, each computing device 1000 may obtain all of the data records from the data source at once and perform the data-flow computation on a respective portion of them. According to an exemplary embodiment of the present invention, the computing task executed by each computing device 1000 is identical (namely, the data-flow computation related to machine learning); only the targeted data records differ.
Here, the data-flow computation refers to the streaming computing task that every computing device 1000 needs to execute; it may be the set of processing required for training a machine learning model and/or for performing prediction with a machine learning model. According to an exemplary embodiment of the present invention, the data-flow computation can be represented as at least one directed acyclic graph composed of processing steps. That is, the data-flow computation may denote the processing flow represented by a single directed acyclic graph, or the processing flow represented by multiple directed acyclic graphs; accordingly, the distributed machine learning system according to an exemplary embodiment of the present invention can simultaneously run a machine learning algorithm flow under multiple configurations, or simultaneously run several machine learning algorithm flows. The processing steps composing the data-flow computation include not only computation steps but also various other processing steps (for example, obtaining data, outputting operation results, and so on).
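For illustration only (this is not part of the disclosed system), a directed acyclic graph of processing steps of the kind just described could be modeled roughly as follows; the `Step` structure, the `run_graph` helper and the three example steps are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List

@dataclass
class Step:
    """One processing step of the data-flow computation (e.g. read data, compute, output)."""
    name: str
    fn: Callable[..., Any]
    inputs: List["Step"] = field(default_factory=list)

def run_graph(outputs: List[Step]):
    """Evaluate a directed acyclic graph of steps, computing every step at most once."""
    cache = {}
    def evaluate(step: Step):
        if step.name not in cache:
            args = [evaluate(s) for s in step.inputs]
            cache[step.name] = step.fn(*args)
        return cache[step.name]
    return [evaluate(s) for s in outputs]

# Hypothetical three-step graph: read this device's records, square them, sum the squares.
read = Step("read_records", lambda: [1.0, 2.0, 3.0])
square = Step("square", lambda xs: [x * x for x in xs], [read])
total = Step("sum", lambda xs: sum(xs), [square])
print(run_graph([total]))   # [14.0]
```

Under this sketch, every computing device would execute the same graph, differing only in which data records the initial step reads.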
The parameter server 2000 is used to maintain the parameters of the machine learning model. As described above, a machine learning model can be regarded as a function of the features of machine learning samples; by iterating continuously over the whole set of data records, the training can gradually converge to the optimal parameters of that function. According to an exemplary embodiment of the present invention, the parameter server 2000 maintains the parameters of the machine learning model so that each computing device 1000 can obtain the relevant parameters by interacting with the parameter server 2000 while performing the data-flow computation; on the other hand, during the training stage of the machine learning model, the parameter server 2000 can update the parameters based on the operation results of the computing devices 1000. That is, when a data-flow computation for training the machine learning model is performed, each computing device 1000 uses parameters obtained from the parameter server 2000 to perform the same model-training operations on its own data records, and the parameter server 2000 updates the parameters according to the operation results of the computing devices 1000; and/or, when a data-flow computation for prediction with the machine learning model is performed, each computing device 1000 uses parameters obtained from the parameter server 2000 to perform the same prediction operations on its own data records.
As an example, the parameter server 2000 may be deployed on a single computing machine, or it may be deployed on multiple related computing machines at the same time. For instance, the parameter server 2000 may act as a global hash table, with each partial parameter server storing a part of the key-value pairs of the parameters, and with no overlap between the key-value pairs stored on different partial parameter servers. As an example, the parameter server 2000 may store the parameters of the machine learning model as key-value pairs, and key-value pairs sharing the same key may be stored in a form in which a single key corresponds to multiple values.
According to an exemplary embodiment of the present invention, the data records from a data source (for example, a remote network location) may first be split into different sets and stored in the local hard disks or memories of the computing devices 1000; accordingly, each computing device 1000 can independently complete its share of the machine learning computing task by performing the data-flow computation on local data, thereby greatly reducing the overhead of reading and writing data.
As an example, when the machine learning model is small, each computing device 1000 can locally store a complete copy of the parameters by itself, whereas when the machine learning model is large, the parameters can be stored in a distributed fashion across multiple partial parameter servers. According to an exemplary embodiment of the present invention, the parameter server 2000 may have a distributed structure in which each partial parameter server is integrated with a corresponding computing device 1000.
Fig. 2 shows a block diagram of a distributed system for performing machine learning on data records according to another exemplary embodiment of the present invention. In the distributed machine learning system shown in Fig. 2, each computing device 1000 has a corresponding partial parameter server 2001. Specifically, computing device 1000-1 may be integrated with partial parameter server 2001-1 on the same virtual or physical machine, computing device 1000-2 may be integrated with partial parameter server 2001-2 on the same virtual or physical machine, and so on up to computing device 1000-n and partial parameter server 2001-n. In this distributed machine learning system, each computing device 1000 may, in performing the data-flow computation, use the parameters maintained locally by the partial parameter server 2001 integrated with it; in addition, a computing device 1000 may also need parameters maintained by other partial parameter servers 2001, and for this purpose it interacts with those other partial parameter servers 2001 in order to obtain all the parameters required for the data-flow computation.
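Purely as a sketch of how a computing device might combine its local partial parameter server with remote ones, the following example assumes hash partitioning of keys; the class and function names are assumptions, not taken from the patent.

```python
class PartialParameterServer:
    """Illustrative shard of the parameter server; it holds only the keys it owns."""
    def __init__(self, server_id: int, n_servers: int):
        self.server_id = server_id
        self.n_servers = n_servers
        self.store = {}          # key -> parameter value

    def get_local(self, key: str):
        return self.store[key]

def owner(key: str, n_servers: int) -> int:
    """Which partial parameter server owns a key (assumed hash partitioning)."""
    return hash(key) % n_servers

def fetch_parameters(keys, local: PartialParameterServer, all_servers):
    """Gather every parameter a data-flow computation needs: locally owned keys are read
    directly, the rest stand in for requests to the other partial parameter servers."""
    params = {}
    for key in keys:
        shard = owner(key, local.n_servers)
        source = local if shard == local.server_id else all_servers[shard]
        params[key] = source.get_local(key)
    return params

# Usage sketch: three integrated computing device / partial parameter server pairs.
servers = [PartialParameterServer(i, 3) for i in range(3)]
for k, v in [("w1", 0.1), ("w2", 0.2), ("b", 0.0)]:
    servers[owner(k, 3)].store[k] = v
print(fetch_parameters(["w1", "w2", "b"], servers[0], servers))
```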
As can be seen that in distributed machines learning system according to an exemplary embodiment of the present invention, in addition to multiple Except computing device, parameter server can also have distributed structure, that is, there are multiple portions parameter servers.In this feelings Under condition, calculated by being performed by computing device data stream type, it is ensured that the realization of large-scale machines study, so as to pass through spy Better machine learning effect is realized in the increase of the promotion and data volume of levying dimension.
Fig. 3 shows the block diagram of parameter server according to an exemplary embodiment of the present invention.Here, parameter clothes shown in Fig. 3 Device be engaged in either univers parameter server, is also possible to partial parameters server.
Referring to Fig. 3, parameter server may include interface arrangement 2100, processing unit 2200 and parameter memory 2300.
Particularly, interface arrangement 2100 can be used for interacting with computing device 1000, thus transmission and machine learning Relevant instruction and/or data, wherein the data can be the parameter of machine learning model, the operation for undated parameter The various related datas such as a result.Here, interface arrangement 2100 can receive the instruction of request parameter from computing device 1000, Instruction related with undated parameter or operation result can be received from computing device 1000;In addition, interface arrangement 2100 can also be to meter It calculates device 1000 and sends relevant parameter.
Processing unit 2200 can execute processing according to by the received instruction of interface arrangement 2100 and/or data to update And/or provide parameter, wherein the parameter is saved by parameter memory 2300.As an example, processing unit 2200 can divide Corresponding parameter, is then supplied to from parameter memory 2300 and connects by the instruction for analysing the required parameter from computing device 1000 Mouth device 2100, and then the parameter is supplied to computing device 1000 by interface arrangement 2100.Alternatively, processing unit 2200 can The instruction and corresponding operation result data for analyzing the undated parameter from computing device 1000, according to corresponding parameter update side Formula executes the update of parameter, and updated parameter is stored in parameter memory 2300.
The parameter memory 2300 is used to store the parameters of the machine learning model. As an example, the parameter memory 2300 may store the parameters in the form of key-value pairs. According to an exemplary embodiment of the present invention, for multiple configurations of the model, the corresponding parameters may be stored in a form in which one key corresponds to multiple values.
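The three components just described (interface device 2100, processing device 2200 and parameter memory 2300) could be caricatured in a few lines as follows; this is a minimal sketch under the assumption of a plain gradient-descent update rule, and the `pull`/`push` method names are hypothetical.

```python
class ParameterServer:
    """Illustrative only: a parameter memory (dict of key-value pairs) plus the
    processing logic for serving pull requests and applying push updates."""

    def __init__(self, initial_params: dict, learning_rate: float = 0.1):
        self.params = dict(initial_params)   # parameter memory (key -> value)
        self.learning_rate = learning_rate

    def pull(self, keys):
        """Handle a 'request parameters' instruction from a computing device."""
        return {k: self.params[k] for k in keys}

    def push(self, gradients: dict):
        """Handle an 'update parameters' instruction carrying operation results.
        The update rule here is assumed to be plain gradient descent."""
        for k, g in gradients.items():
            self.params[k] -= self.learning_rate * g

# Usage sketch
ps = ParameterServer({"w": 0.0, "b": 0.0})
ps.push({"w": 0.5, "b": -0.2})
print(ps.pull(["w", "b"]))   # {'w': -0.05, 'b': 0.02}
```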
Fig. 4 shows a block diagram of a computing device according to an exemplary embodiment of the present invention. Referring to Fig. 4, the computing device 1000 may include an interface unit 1100 and an arithmetic unit 1200.
Specifically, the interface unit 1100 may be used to interact with a parameter server (for example, the parameter server 2000 or a partial parameter server 2001) to transmit instructions and/or data related to machine learning, where the data may be the parameters of the machine learning model, operation results used for updating the parameters, and other related data. Here, the interface unit 1100 may send instructions requesting parameters to the parameter server and receive the requested parameters from it; in addition, the interface unit 1100 may also provide the parameter server with operation results and related instructions for updating the parameters. As an example, the interface unit 1100 may also be used to obtain the data records to be processed from the data source, or to back up intermediate operation results to another device.
The arithmetic unit 1200 is used to perform the data-flow computation on the data records, using the parameters of the machine learning model. Here, the arithmetic unit 1200 may perform the various specific operations involved in the data-flow computation, relating to training and/or prediction of the machine learning model. As described above, according to an exemplary embodiment of the present invention, the data-flow computation can be represented as one or more directed acyclic graphs composed of processing steps; for example, in the training stage of a machine learning model it is often necessary to train models under multiple configurations for model tuning, in which case, if the training of several sets of models is to be carried out simultaneously, the data-flow computation may be composed of multiple directed acyclic graphs with different configurations. Accordingly, the results obtained by the arithmetic unit 1200 may be passed to the parameter server or to other devices via the interface unit 1100. It should be noted that the composition of the data-flow computation is not limited to the above example and may include any single directed acyclic graph or any combination of different directed acyclic graphs; for example, the data-flow computation may represent a single flow of prediction with the machine learning model. In the prediction stage of the machine learning model, the results obtained by the arithmetic unit 1200 for its data records can serve as the prediction values.
Fig. 5 shows a flowchart of a method by which the distributed machine learning system trains a machine learning model, according to an exemplary embodiment of the present invention. The steps of the method may be executed by the computing devices and/or the parameter server (for example, the parameter server 2000 or the partial parameter servers 2001) in the distributed machine learning system described above, for example according to a preset configuration, wherein each of the plurality of computing devices in the distributed system is configured to perform the same data-flow computation related to machine learning on its own data records.
Referring to Fig. 5, in step S100, each of the plurality of computing devices obtains its own data records. The data records here are the historical data records used for model training and, in the case of supervised learning, carry corresponding labels. For example, each computing device may first read its own share of the data records to be processed from the data source, with no overlap between the data records read by different computing devices; that is, each computing device takes a portion of the whole set of data records and then the devices jointly carry out the same training task. After the data records have been read into the local storage of a computing device, the corresponding data records can subsequently be obtained directly from local storage whenever the related operations need to be performed.
Next, in step S200, the computing devices obtain the parameters of the machine learning model from the parameter server in the distributed system. Here, each computing device may obtain all the required parameters from a single parameter server; alternatively, where the parameter server has a distributed structure, a computing device may obtain some parameters from the partial parameter server integrated with it and may additionally need to obtain other parameters from the other partial parameter servers.
In step S300, each computing device uses the obtained parameters to perform the same model-training operations on its own data records. Here, a computing device can complete the computation steps involved in the data-flow computation based on the previously obtained data records and parameters.
In step S400, the parameter server updates the parameters according to the operation results of the computing devices. Here, depending on factors such as the design of the machine learning algorithm and of the distributed architecture, the parameter server may update the parameters at a certain frequency; for example, after completing the operations for a batch of data records, each computing device may aggregate its operation results to the parameter server, which then performs the parameter update according to a predetermined update rule. Moreover, the frequency of parameter updates is not limited to once per data record; for example, the parameters may be updated based on the operation results of a batch of data or of one full iteration.
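A compressed, purely illustrative rendering of steps S100 to S400 is given below for a toy linear model; the squared-error loss, the learning rate and the one-update-per-iteration cadence are assumptions chosen only to make the flow concrete.

```python
# Toy end-to-end sketch of steps S100-S400 (assumptions: linear model y = w*x + b,
# squared-error loss, one synchronous parameter update per iteration over the data).
records = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]     # (feature, label) pairs
workers = [records[0::2], records[1::2]]                        # S100: each device gets its share
params = {"w": 0.0, "b": 0.0}                                   # maintained by the parameter server
lr = 0.1

for iteration in range(1000):
    grad_sums = {"w": 0.0, "b": 0.0}
    count = 0
    for shard in workers:                 # every computing device runs the same computation
        w, b = params["w"], params["b"]   # S200: pull parameters from the parameter server
        for x, y in shard:                # S300: same training operations on its own records
            err = (w * x + b) - y
            grad_sums["w"] += err * x
            grad_sums["b"] += err
            count += 1
    for k in params:                      # S400: the parameter server applies the update
        params[k] -= lr * grad_sums[k] / count

print(params)   # converges toward w = 2.0, b = 1.0
```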
It should be noted that the steps in Fig. 5 are not limited to a specific execution order; for example, those skilled in the art will understand that, when iterating over a large number of data records, it is usually necessary to obtain data records and/or parameters from an external source or from local storage many times.
When the trained machine learning model is small, a complete copy of the model parameters can be stored on each computing entity by itself; however, when the machine learning model is large, the model parameters need to be stored in blocks across multiple partial parameter servers. Since the computing devices access the data many times while executing the computing task, appropriate disaster recovery measures must be arranged. Different from the prior-art approach of performing disaster recovery backups frequently, according to an exemplary embodiment of the present invention, when a data-flow computation for training the machine learning model is performed, disaster recovery backup is carried out for each iteration over the data records. This particular disaster recovery target allows the operational efficiency to be significantly increased while disaster recovery is still achieved.
Fig. 6 shows a flowchart of a method by which the distributed machine learning system trains a machine learning model, according to another exemplary embodiment of the present invention. In the method shown in Fig. 6, disaster recovery backup is performed per iteration, where steps S100, S200, S300 and S400 are similar to the corresponding steps shown in Fig. 5 and are not described again in detail here.
Referring to Fig. 6, in step S500 it is determined whether one iteration of training has been performed over the whole set of data records. If an iteration has not yet been completed, the method proceeds to step S700. If it is determined that an iteration has been completed, then in step S600 the currently available model parameters are backed up; for example, the currently available model parameters may additionally be stored in an interleaved manner among the multiple partial parameter servers, i.e., each partial parameter server additionally stores, besides its own portion of the parameters, parameters maintained by other parameter servers; alternatively, the parameters may be backed up on devices other than the parameter server. Here, one or more copies of the currently available model parameters can be backed up to ensure that disaster recovery is realized.
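The per-iteration backup of step S600 might, for example, look like the following sketch, in which each partial parameter server also stores a replica of its neighbour's shard; the ring-replication scheme and the file layout are assumptions made purely for illustration.

```python
import json
from pathlib import Path
from typing import Dict, List

def backup_after_iteration(shards: List[Dict], iteration: int, backup_dir: str = "backup"):
    """Illustrative per-iteration disaster recovery backup (assumed ring replication):
    partial parameter server i also stores a copy of shard (i + 1) % n, so the loss of
    any single server can be recovered from its predecessor's backup."""
    n = len(shards)
    out = Path(backup_dir) / f"iteration_{iteration}"
    out.mkdir(parents=True, exist_ok=True)
    for i, shard in enumerate(shards):
        payload = {"own": shard, "replica_of_next": shards[(i + 1) % n]}
        (out / f"server_{i}.json").write_text(json.dumps(payload))

# Usage sketch: three partial parameter servers, backed up after finishing one iteration.
shards = [{"w1": 0.4}, {"w2": -1.2}, {"b": 0.7}]
backup_after_iteration(shards, iteration=3)
```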
In step S700, it is determined whether the training of the machine learning model has been completed; if so, the machine learning model composed of the parameters is obtained. Otherwise, the method returns to step S100 to continue obtaining new data records. Here, according to the preceding processing flow, the new data records may either be further data records obtained while the current iteration is not yet complete, or data records obtained afresh when the current iteration has just been completed; these data records may come from the external data source or from the computing device's local storage.
Fig. 7 shows a flowchart of a method by which the distributed machine learning system performs prediction with a machine learning model, according to an exemplary embodiment of the present invention. The steps of the method may be executed by the computing devices and/or the parameter server (for example, the parameter server 2000 or the partial parameter servers 2001) in the distributed machine learning system described above, for example according to a preset configuration, wherein each of the plurality of computing devices in the distributed system is configured to perform the same data-flow computation related to machine learning on its own data records.
Referring to Fig. 7, in step S110, each of the plurality of computing devices obtains its own data records. The data records here are the data records used for model prediction (or testing). Each computing device may read its own share of the data records to be processed from the data source, with no overlap between the data records read by different computing devices; that is, each computing device takes a portion of the whole set of data records and then the devices jointly carry out the same prediction task.
Next, in step S210, the computing devices obtain the parameters of the machine learning model from the parameter server in the distributed system. Here, each computing device may obtain all the required parameters from a single parameter server; alternatively, where the parameter server has a distributed structure, a computing device may obtain some parameters from the partial parameter server integrated with it and may additionally need to obtain other parameters from the other partial parameter servers.
In step S310, each computing device uses the obtained parameters to perform the same prediction operations on its own data records. Here, a computing device can complete the computation steps involved in the data-flow computation based on the previously obtained data records and parameters.
In the methods of performing machine learning with the distributed machine learning system described above with reference to Fig. 5 to Fig. 7, some specific operations or other processing may be encapsulated as callable functions; for example, processing such as synchronous waiting, data merging and broadcast interaction within the data-flow computation can be encapsulated as callable functions. In this way, programmers can call them directly when needed, which helps them concentrate on the distributed implementation logic and effectively control the algorithm without having to implement complicated low-level logic.
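As a single-process stand-in for such packaged helpers (an actual system would communicate over the network), the following hypothetical `sync_wait`, `merge` and `broadcast` functions illustrate the idea:

```python
import threading

n_devices = 4
_barrier = threading.Barrier(n_devices)
_merged = {}
_lock = threading.Lock()

def sync_wait():
    """All computing devices block here until every one of them has reached this point."""
    _barrier.wait()

def merge(device_id: int, partial_result: float):
    """Collect one partial result per device; return the combined result after syncing."""
    with _lock:
        _merged[device_id] = partial_result
    sync_wait()
    return sum(_merged.values())

def broadcast(value, holder: dict):
    """Make a value produced by one device visible to all devices (shared-memory stand-in)."""
    holder["value"] = value
    sync_wait()
    return holder["value"]

# Usage sketch: four threads stand in for four computing devices.
def worker(device_id: int, shared: dict, results: list):
    total = merge(device_id, partial_result=float(device_id + 1))
    if device_id == 0:
        broadcast(total, shared)
    else:
        sync_wait()
    results[device_id] = shared["value"]

shared, results = {}, [None] * n_devices
threads = [threading.Thread(target=worker, args=(i, shared, results)) for i in range(n_devices)]
for t in threads: t.start()
for t in threads: t.join()
print(results)   # [10.0, 10.0, 10.0, 10.0]
```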
It should also be noted that, although the steps in the method flowcharts of Fig. 5 to Fig. 7 are shown in sequence, the execution order of the steps is not limited to a temporal order; they may also be executed simultaneously or in a different order. For example, in the case where a computing device and its corresponding partial parameter server are integrated on a single computing machine, the computing device may first complete part of the operations using the local parameters, then obtain other parameters from the partial parameter servers on other computing machines through the communication function of the system, and then complete the remaining operations based on those other parameters.
According to an exemplary embodiment of the present invention, when a computing device performs the data-flow computation and the data-flow computation involves multiple directed acyclic graphs, the computing device can perform the data-flow computation by merging identical processing steps in the different directed acyclic graphs. For example, the computing device can reduce the amount of computation by merging identical processing steps starting from the upstream side of the directed acyclic graphs, so that the time needed to run multiple tasks is less than the sum of the times needed to run each task separately.
Fig. 8 shows an example of performing a data-flow computation by merging directed acyclic graphs according to an exemplary embodiment of the present invention. Part (a) of Fig. 8 shows the directed acyclic graphs of the data-flow computation that each computing device needs to execute; that is, each computing device is required to execute the computing task shown in part (a) of Fig. 8, and the computing devices are homogeneous in this respect. Specifically, the data-flow computation shown in part (a) of Fig. 8 includes two independent directed acyclic graphs: a first task, corresponding to the first directed acyclic graph, consists of four processing steps, namely processing 1, processing 2, processing 3 and processing 4; a second task, corresponding to the second directed acyclic graph, consists of four processing steps, namely processing 1, processing 2, processing 5 and processing 6. Here, a processing step may represent obtaining data records, computing, outputting operation results, or other processing. When a particular step is reached, the computing devices can realize synchronization among themselves through the packaged functions.
According to an exemplary embodiment of the present invention, a computing device can search for and merge identical processing steps starting from the upstream side by analyzing the directed acyclic graphs. For example, assuming that the two directed acyclic graphs both need to obtain data records from the same data source and that their two initial steps are identical (namely processing 1 and processing 2), the computing device can first merge these identical processing steps, obtaining the directed acyclic graph shown in part (b) of Fig. 8. In this way, only the merged directed acyclic graph needs to be executed, which reduces the actual amount of computation and of reading and writing, and brings a performance improvement.
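A toy sketch of this upstream merge is shown below; representing each directed acyclic graph as an ordered chain of step names is a simplification made only to illustrate the merging of Fig. 8.

```python
from typing import List, Tuple

def merge_common_prefix(chain_a: List[str], chain_b: List[str]) -> Tuple[List[str], List[str], List[str]]:
    """Merge two chain-shaped directed acyclic graphs that share their upstream steps.
    Returns (shared upstream steps, remainder of graph a, remainder of graph b)."""
    shared = []
    for step_a, step_b in zip(chain_a, chain_b):
        if step_a != step_b:
            break
        shared.append(step_a)
    k = len(shared)
    return shared, chain_a[k:], chain_b[k:]

# The two tasks of Fig. 8(a): both begin with processing 1 and processing 2.
task_1 = ["processing 1", "processing 2", "processing 3", "processing 4"]
task_2 = ["processing 1", "processing 2", "processing 5", "processing 6"]
shared, rest_1, rest_2 = merge_common_prefix(task_1, task_2)
print(shared)   # ['processing 1', 'processing 2'] -> executed only once, as in Fig. 8(b)
print(rest_1)   # ['processing 3', 'processing 4']
print(rest_2)   # ['processing 5', 'processing 6']
```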
Fig. 9 shows an example of storing the parameters of a machine learning model as key-value pairs according to an exemplary embodiment of the present invention. According to an exemplary embodiment of the present invention, the numerous parameters of a machine learning model can be stored in the form of key-value pairs. When there are multiple sets of key-value pairs (for example, multiple configurations of the same machine learning algorithm), the key-value pairs may take the form shown in part (a) of Fig. 9, where under each configuration every key corresponds to its own value; for example, k1, k2, k3, ..., kn correspond respectively to v11, v12, v13, ..., v1n, or k1, k2, k3, ..., kn correspond respectively to v21, v22, v23, ..., v2n, where n is an integer greater than 1. According to an exemplary embodiment of the present invention, the key-value pairs may be stored with the keys merged, taking the form shown in part (b) of Fig. 9, where a single key corresponds to multiple values; for example, k1 corresponds to both v11 and v21, which reduces the storage overhead of the parameter server. On the other hand, when a computing device and the parameter server need to exchange the parameters of two configurations at the same time, the keys can be merged and/or compressed during transmission, thereby reducing the network transmission overhead.
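To illustrate the key merging of Fig. 9 (the function name and dictionary layout below are assumptions), two configurations' key-value pairs sharing the same keys can be collapsed into a single map from each key to a list of values before storage or transmission:

```python
def merge_configurations(*config_params: dict):
    """Merge several configurations' key-value pairs so each key is stored/sent once,
    with one value per configuration (the layout of Fig. 9(b))."""
    merged = {}
    for i, params in enumerate(config_params):
        for key, value in params.items():
            merged.setdefault(key, [None] * len(config_params))[i] = value
    return merged

config_1 = {"k1": 0.11, "k2": 0.12, "k3": 0.13}   # values v11, v12, v13 of configuration 1
config_2 = {"k1": 0.21, "k2": 0.22, "k3": 0.23}   # values v21, v22, v23 of configuration 2

print(merge_configurations(config_1, config_2))
# {'k1': [0.11, 0.21], 'k2': [0.12, 0.22], 'k3': [0.13, 0.23]}
```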
It should be understood that the parameter server, the computing devices, or the devices and units composing them in the distributed machine learning system according to exemplary embodiments of the present invention may each be configured as software, hardware, firmware or any combination thereof that performs a specific function. For example, these components may correspond to dedicated integrated circuits, to pure software code, or to modules combining software and hardware. When they are implemented in software, firmware, middleware or microcode, the program code or code segments for performing the corresponding operations may be stored in a computer-readable medium such as a storage medium, so that a processor can perform the corresponding operations by reading and running the corresponding program code or code segments. Moreover, one or more functions realized by these components may also be performed uniformly by components in a physical entity device (for example, a computing machine).
It should be noted that the distributed machine learning system according to exemplary embodiments of the present invention may rely entirely on the running of computer programs to realize the corresponding functions, i.e., each component corresponds to a step in the functional structure of the computer program, so that the whole system is invoked through a special software package (for example, a lib library) to realize the corresponding functions.
Exemplary embodiments of the present invention have been described above. It should be understood that the foregoing description is merely exemplary and not exhaustive, and that the present invention is not limited to the disclosed exemplary embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the present invention. Therefore, the protection scope of the present invention shall be determined by the scope of the claims.

Claims (11)

1. A distributed system for performing machine learning on data records, comprising:
a plurality of computing devices, wherein each computing device is configured to perform the same data-flow computation related to machine learning on its own data records;
a parameter server for maintaining the parameters of a machine learning model;
wherein, when a data-flow computation for training the machine learning model is performed, each computing device uses parameters obtained from the parameter server to perform the same model-training operations on its own data records, and the parameter server updates the parameters according to the operation results of the computing devices; and/or, when a data-flow computation for prediction with the machine learning model is performed, each computing device uses parameters obtained from the parameter server to perform the same prediction operations on its own data records;
wherein the parameter server has a distributed structure, with a corresponding partial parameter server for each computing device; the data-flow computation is expressed as at least one directed acyclic graph composed of processing steps, wherein different directed acyclic graphs correspond to multiple-configuration runs of one machine learning algorithm flow or to a plurality of machine learning algorithm flows, and a computing device performs the data-flow computation by merging identical processing steps in different directed acyclic graphs; and when a computing device and the parameter server need to exchange the parameters of two or more configurations of the model at the same time, identical keys are merged during transmission.
2. The distributed system of claim 1, wherein, under the distributed structure, each partial parameter server is integrated with the corresponding computing device.
3. The distributed system of claim 1, wherein, when a data-flow computation for training the machine learning model is performed, disaster recovery backup is performed for each iteration over the data records.
4. The distributed system of claim 1, wherein the parameter server stores the parameters of the machine learning model as key-value pairs, and key-value pairs sharing the same key are stored in a form in which a single key corresponds to multiple values.
5. The distributed system of claim 1, wherein the parameter server comprises an interface device, a processing device and a parameter memory;
the interface device is configured to receive instructions requesting parameters from the computing devices and send the corresponding parameters to the computing devices; or to receive instructions requesting parameters from the computing devices, send the relevant parameters to the computing devices, and receive from the computing devices operation results and related instructions for updating the parameters;
the processing device is configured to parse the instructions requesting parameters from the computing devices and supply the corresponding parameters from the parameter memory to the interface device; or to parse the instructions requesting parameters from the computing devices, supply the corresponding parameters from the parameter memory to the interface device, parse the operation results and related instructions for updating the parameters from the computing devices, perform the parameter update according to the corresponding parameter update rule, and store the updated parameters in the parameter memory;
the parameter memory is configured to store the parameters of the machine learning model.
6. The distributed system of claim 1, wherein the computing device comprises an interface unit and an arithmetic unit;
the interface unit is configured to send instructions requesting parameters to the parameter server and receive the requested corresponding parameters from the parameter server; or to send instructions requesting parameters to the parameter server, receive the requested corresponding parameters from the parameter server, and provide the parameter server with operation results and related instructions for updating the parameters;
the arithmetic unit is configured to perform the data-flow computation on the data records using the parameters of the machine learning model.
7. A method of performing machine learning on data records using a distributed system, wherein each of a plurality of computing devices in the distributed system is configured to perform the same data-flow computation related to machine learning on its own data records, the method comprising:
obtaining, by each of the plurality of computing devices, its own data records;
obtaining, by the computing devices, the parameters of a machine learning model from a parameter server in the distributed system;
wherein, when a data-flow computation for training the machine learning model is performed, each computing device uses the obtained parameters to perform the same model-training operations on its own data records, and the parameter server updates the parameters according to the operation results of the computing devices; and/or, when a data-flow computation for prediction with the machine learning model is performed, each computing device uses the obtained parameters to perform the same prediction operations on its own data records;
wherein the parameter server has a distributed structure, with a corresponding partial parameter server for each computing device; the data-flow computation is expressed as at least one directed acyclic graph composed of processing steps, wherein different directed acyclic graphs correspond to multiple-configuration runs of one machine learning algorithm flow or to a plurality of machine learning algorithm flows, and a computing device performs the data-flow computation by merging identical processing steps in different directed acyclic graphs; and when a computing device and the parameter server need to exchange the parameters of two or more configurations of the model at the same time, identical keys are merged during transmission.
8. The method of claim 7, wherein, under the distributed structure, each partial parameter server is integrated with the corresponding computing device.
9. The method of claim 7, wherein, when a data-flow computation for training the machine learning model is performed, disaster recovery backup is performed for each iteration over the data records.
10. The method of claim 7, wherein the parameter server stores the parameters of the machine learning model as key-value pairs, and key-value pairs sharing the same key are stored in a form in which a single key corresponds to multiple values.
11. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method of any one of claims 7 to 10.
CN201710764131.0A 2017-08-30 2017-08-30 Distributed system for performing machine learning on data records, and method thereof Active CN107679625B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710764131.0A CN107679625B (en) 2017-08-30 2017-08-30 Distributed system for performing machine learning on data records, and method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710764131.0A CN107679625B (en) 2017-08-30 2017-08-30 Distributed system for performing machine learning on data records, and method thereof

Publications (2)

Publication Number Publication Date
CN107679625A CN107679625A (en) 2018-02-09
CN107679625B true CN107679625B (en) 2019-09-17

Family

ID=61134942

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710764131.0A Active CN107679625B (en) 2017-08-30 2017-08-30 Distributed system for performing machine learning on data records, and method thereof

Country Status (1)

Country Link
CN (1) CN107679625B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107609652B (en) * 2017-08-30 2019-10-25 第四范式(北京)技术有限公司 Execute the distributed system and its method of machine learning
CN108921879A (en) * 2018-05-16 2018-11-30 中国地质大学(武汉) The motion target tracking method and system of CNN and Kalman filter based on regional choice
CN110766164A (en) * 2018-07-10 2020-02-07 第四范式(北京)技术有限公司 Method and system for performing a machine learning process
CN109445953A (en) * 2018-08-30 2019-03-08 北京大学 A kind of machine learning model training method towards large-scale machines learning system
CN110968887B (en) * 2018-09-28 2022-04-05 第四范式(北京)技术有限公司 Method and system for executing machine learning under data privacy protection
CN111507476A (en) * 2019-01-31 2020-08-07 伊姆西Ip控股有限责任公司 Method, apparatus and computer program product for deploying machine learning model
CN112148202B (en) * 2019-06-26 2023-05-26 杭州海康威视数字技术股份有限公司 Training sample reading method and device
CN111680799B (en) 2020-04-08 2024-02-20 北京字节跳动网络技术有限公司 Method and device for processing model parameters

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105760932A (en) * 2016-02-17 2016-07-13 北京物思创想科技有限公司 Data exchange method, data exchange device and calculating device
CN106022483A (en) * 2016-05-11 2016-10-12 星环信息科技(上海)有限公司 Method and equipment for conversion between machine learning models
CN106156810A (en) * 2015-04-26 2016-11-23 阿里巴巴集团控股有限公司 General-purpose machinery learning algorithm model training method, system and calculating node
CN106294762A (en) * 2016-08-11 2017-01-04 齐鲁工业大学 A kind of entity recognition method based on study

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10289962B2 (en) * 2014-06-06 2019-05-14 Google Llc Training distilled machine learning models
US10262272B2 (en) * 2014-12-07 2019-04-16 Microsoft Technology Licensing, Llc Active machine learning
CN105721211A (en) * 2016-02-24 2016-06-29 北京格灵深瞳信息技术有限公司 Data processing method and device

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106156810A (en) * 2015-04-26 2016-11-23 阿里巴巴集团控股有限公司 General-purpose machinery learning algorithm model training method, system and calculating node
CN105760932A (en) * 2016-02-17 2016-07-13 北京物思创想科技有限公司 Data exchange method, data exchange device and calculating device
CN106022483A (en) * 2016-05-11 2016-10-12 星环信息科技(上海)有限公司 Method and equipment for conversion between machine learning models
CN106294762A (en) * 2016-08-11 2017-01-04 齐鲁工业大学 A kind of entity recognition method based on study

Also Published As

Publication number Publication date
CN107679625A (en) 2018-02-09

Similar Documents

Publication Publication Date Title
CN107609652B (en) Distributed system for performing machine learning and method thereof
CN107679625B (en) Distributed system for performing machine learning on data records, and method thereof
CN109086031B (en) Business decision method and device based on rule engine
US9672065B2 (en) Parallel simulation using multiple co-simulators
CN108351805A (en) Calculate the accelerator processing based on stream of figure
CN109359732B (en) Chip and data processing method based on chip
US11188348B2 (en) Hybrid computing device selection analysis
US10929161B2 (en) Runtime GPU/CPU selection
EP2738675B1 (en) System and method for efficient resource management of a signal flow programmed digital signal processor code
WO2024041400A1 (en) Model training task scheduling method and apparatus, and electronic device
CN110633785B (en) Method and system for calculating convolutional neural network
CN114154641A (en) AI model training method and device, computing equipment and storage medium
CN111352896B (en) Artificial intelligence accelerator, equipment, chip and data processing method
Sundas et al. An introduction of CloudSim simulation tool for modelling and scheduling
CN112783614A (en) Object processing method, device, equipment, storage medium and program product
CN108985459A (en) Method and apparatus for model training
KR20160031360A (en) Apparatus and method for executing an application based on an open computing language
CN115129481B (en) Computing resource allocation method and device and electronic equipment
CN114358253A (en) Time estimation method of neural network model and related product
CN110659125A (en) Analysis task execution method, device and system and electronic equipment
Dziok et al. Adaptive multi-level workflow scheduling with uncertain task estimates
CN112242959B (en) Micro-service current-limiting control method, device, equipment and computer storage medium
CN112817573B (en) Method, apparatus, computer system, and medium for building a streaming computing application
CN116032928B (en) Data collaborative computing method, device, system, electronic device and storage medium
CN111208980B (en) Data analysis processing method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant