CN107679625B - Distributed system and method for performing machine learning on data records - Google Patents
- Publication number
- CN107679625B CN107679625B CN201710764131.0A CN201710764131A CN107679625B CN 107679625 B CN107679625 B CN 107679625B CN 201710764131 A CN201710764131 A CN 201710764131A CN 107679625 B CN107679625 B CN 107679625B
- Authority
- CN
- China
- Prior art keywords
- parameter
- computing device
- machine learning
- learning model
- server
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
Abstract
A distributed system and method for performing machine learning on data records are provided. The system comprises: a plurality of computing devices, wherein each computing device performs the same data-flow computation for machine learning on its own data records; and a parameter server for maintaining the parameters of a machine learning model, wherein, when the data-flow computation for training the machine learning model is performed, the computing devices use parameters obtained from the parameter server to perform the same training operations of the machine learning model on their respective data records, and the parameter server updates the parameters according to the operation results of the computing devices; and/or, when the data-flow computation for prediction with the machine learning model is performed, the computing devices use parameters obtained from the parameter server to perform the same prediction operations of the machine learning model on their respective data records. Homogeneity among the computing devices is thereby achieved, reducing network-transmission overhead.
Description
Technical field
Exemplary embodiments of the present invention relate generally to the field of artificial intelligence, and more particularly to a distributed system for performing machine learning on data records, and to a method of performing machine learning on data records using the distributed system.
Background
With the rapid growth of data scale, machine learning is widely applied in various fields to mine the value of data. However, the memory of an ordinary physical machine is far from sufficient for performing machine learning; in practice, a distributed machine-learning platform is therefore generally required to complete the training of a machine learning model or the corresponding prediction.
In existing distributed machine-learning systems (for example, Google's deep-learning framework TensorFlow), there are usually one or more control nodes responsible for scheduling the tasks and computing resources of the other compute nodes in the system, where each compute node performs one part of the computing task; that is, the computing tasks performed by the compute nodes differ from one another. Accordingly, for the overall computing task, adjacent steps may need to be performed on different physical machines, which requires streaming data over the network between compute nodes and/or control nodes. Such an approach incurs heavy read/write overhead, and high-speed network cards (for example, 10-gigabit NICs) are costly and cannot be widely deployed in distributed machine-learning systems.
Furthermore, in the distributed machine-learning systems described above, if one wishes to run a machine-learning algorithm multiple times or under multiple configurations, or to run several machine-learning algorithms simultaneously, the algorithms must be modified internally or invoked repeatedly by external logic; both approaches consume a large amount of actual computation.
Summary of the invention
Exemplary embodiments of the present invention aim to overcome the drawbacks of heavy network overhead and large computation load in existing distributed machine-learning systems.
According to an exemplary embodiment of the present invention, a distributed system for performing machine learning on data records is provided, comprising: a plurality of computing devices, wherein each computing device is configured to perform the same data-flow computation for machine learning on its own data records; and a parameter server for maintaining the parameters of a machine learning model, wherein, when the data-flow computation for training the machine learning model is performed, the computing devices use parameters obtained from the parameter server to perform the same training operations of the machine learning model on their respective data records, and the parameter server updates the parameters according to the operation results of the computing devices; and/or, when the data-flow computation for prediction with the machine learning model is performed, the computing devices use parameters obtained from the parameter server to perform the same prediction operations of the machine learning model on their respective data records.
Optionally, in the distributed system, the parameter server has a distributed structure in which each partial parameter server is integrated with a corresponding computing device.
Optionally, in the distributed system, when the data-flow computation for training the machine learning model is performed, disaster recovery is carried out for each iteration over the data records.
Optionally, in the distributed system, the data-flow computation is represented as at least one directed acyclic graph composed of processing steps.
Optionally, in the distributed system, the computing devices perform the data-flow computation by merging identical processing steps of different directed acyclic graphs.
Optionally, in the distributed system, the parameter server stores the parameters of the machine learning model as key-value pairs, and key-value pairs sharing the same key are stored in the form of a single key corresponding to multiple values.
According to another exemplary embodiment of the present invention, a method of performing machine learning on data records using a distributed system is provided, wherein each of a plurality of computing devices in the distributed system is configured to perform the same data-flow computation for machine learning on its own data records, the method comprising: obtaining, by each of the plurality of computing devices, its respective data records; obtaining, by the computing devices, the parameters of a machine learning model from a parameter server in the distributed system; wherein, when the data-flow computation for training the machine learning model is performed, the computing devices use the obtained parameters to perform the same training operations of the machine learning model on their respective data records, and the parameter server updates the parameters according to the operation results of the computing devices; and/or, when the data-flow computation for prediction with the machine learning model is performed, the computing devices use the obtained parameters to perform the same prediction operations of the machine learning model on their respective data records.
Optionally, in the method, the parameter server has a distributed structure in which each partial parameter server is integrated with a corresponding computing device.
Optionally, in the method, when the data-flow computation for training the machine learning model is performed, disaster recovery is carried out for each iteration over the data records.
Optionally, in the method, the data-flow computation is represented as at least one directed acyclic graph composed of processing steps.
Optionally, in the method, the computing devices perform the data-flow computation by merging identical processing steps of different directed acyclic graphs.
Optionally, in the method, the parameter server stores the parameters of the machine learning model as key-value pairs, and key-value pairs sharing the same key are stored in the form of a single key corresponding to multiple values.
In the distributed machine-learning system and method according to exemplary embodiments of the present invention, each computing device is configured to perform the same data-flow computation, achieving homogeneity among the computing devices; this reduces network-transmission overhead and makes the system well suited to large-scale machine learning models.
Brief description of the drawings
These and/or other aspects and advantages of the present invention will become clearer and easier to understand from the following detailed description of embodiments of the present invention taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a block diagram of a distributed system for performing machine learning on data records according to an exemplary embodiment of the present invention;
Fig. 2 is a block diagram of a distributed system for performing machine learning on data records according to another exemplary embodiment of the present invention;
Fig. 3 is a block diagram of a parameter server according to an exemplary embodiment of the present invention;
Fig. 4 is a block diagram of a computing device according to an exemplary embodiment of the present invention;
Fig. 5 is a flowchart of a method by which a distributed machine-learning system trains a machine learning model according to an exemplary embodiment of the present invention;
Fig. 6 is a flowchart of a method by which a distributed machine-learning system trains a machine learning model according to another exemplary embodiment of the present invention;
Fig. 7 is a flowchart of a method by which a distributed machine-learning system performs prediction with a machine learning model according to an exemplary embodiment of the present invention;
Fig. 8 shows an example of performing a data-flow computation by merging directed acyclic graphs according to an exemplary embodiment of the present invention; and
Fig. 9 shows an example of storing the parameters of a machine learning model as key-value pairs according to an exemplary embodiment of the present invention.
Detailed description of embodiments
To enable those skilled in the art to better understand the present invention, exemplary embodiments of the present invention are described in further detail below with reference to the accompanying drawings and specific embodiments.
Machine learning is an inevitable product of the development of artificial-intelligence research to a certain stage; it is devoted to improving the performance of a system itself from experience, by means of computation. In a computer system, "experience" usually exists in the form of "data", from which a machine-learning algorithm can produce a "model"; that is, empirical data are supplied to the machine-learning algorithm, which generates a model based on them (in other words, learns the parameters of a function from the data). The model can be represented as an algorithmic function under particular parameters; when faced with a new situation, the model provides a corresponding judgment, i.e., a prediction. Machine learning may be implemented in the form of "supervised learning", "unsupervised learning", or "semi-supervised learning"; it should be noted that exemplary embodiments of the present invention impose no particular restriction on the specific machine-learning algorithm. It should further be noted that, in the course of training a machine learning model, statistical algorithms, business rules, and/or expert knowledge may also be used to further improve the effect of machine learning.
In particular, exemplary embodiments of the present invention relate to a distributed machine-learning system that may be composed of a parameter server and computing devices, in which a plurality of computing devices perform the same machine-learning task in a distributed manner on their respective data records, and the parameter server accordingly maintains the parameters of the machine learning model by interacting with each computing device. It should be noted that the computing devices and/or the parameter server mentioned here are defined by the processing they perform or the functions they realize; they may denote either physical entities or virtual entities. For example, a computing device may denote an actual computing machine or a logical entity deployed on such a machine; likewise, the parameter server may denote an actual computing machine, may be deployed as one or more logical entities on the same computing machine as the computing devices or on a different one, and may even be served directly by certain computing devices.
Fig. 1 is a block diagram of a distributed system for performing machine learning on data records according to an exemplary embodiment of the present invention. In particular, the distributed machine-learning system shown in Fig. 1 may include a parameter server 2000 and a plurality of computing devices 1000 (for example, 1000-1, 1000-2, ..., 1000-n, where n is an integer greater than 1). The distributed machine-learning system may be used to train a machine learning model and/or to perform prediction with a trained machine learning model.
In particular, each computing device 1000 is configured to perform the same data-flow computation for machine learning on its own data records. As an example, each computing device 1000 may obtain from a data source (for example, a cloud disk on a network or another location accessible to all the computing devices) the portion of the data records it is to process, and perform the data-flow computation on that portion; alternatively, when the amount of data is small, each computing device 1000 may obtain all the data records from the data source at once and perform the data-flow computation on a part of them. According to an exemplary embodiment of the present invention, the computing task performed by each computing device 1000 is identical (namely, the data-flow computation for machine learning); only the targeted data records differ.
Here, the data-flow computation refers to the streaming computing task that every computing device 1000 must perform; it may be the set of processing required for training a machine learning model and/or for performing prediction with a machine learning model. According to an exemplary embodiment of the present invention, the data-flow computation may be represented as at least one directed acyclic graph composed of processing steps. That is, the data-flow computation may denote the computation flow represented by a single directed acyclic graph, or the computation flows represented by multiple directed acyclic graphs; accordingly, the distributed machine-learning system according to an exemplary embodiment of the present invention can simultaneously run multiple configurations of one machine-learning algorithm flow, or simultaneously run several machine-learning algorithm flows. The processing steps composing the data-flow computation include not only computation steps but also various other processing steps (for example, obtaining data and outputting operation results).
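The representation of a data-flow computation as directed acyclic graphs of processing steps, with identical steps of different graphs merged (as illustrated in Fig. 8), can be sketched as follows. This is an illustrative Python sketch under assumed names, not code from the patent; the step names and functions are hypothetical.

```python
# Illustrative sketch: a data-flow computation as a DAG of named processing
# steps. When two DAGs share identically named steps (e.g., the same "read"
# and "transform" steps under two training configurations), merging executes
# the shared steps only once.
from collections import defaultdict


class DataflowGraph:
    """A DAG of processing steps; each step names its upstream dependencies."""

    def __init__(self):
        self.steps = {}                 # step name -> callable(*inputs)
        self.deps = defaultdict(list)   # step name -> upstream step names

    def add_step(self, name, fn, deps=()):
        self.steps[name] = fn
        self.deps[name] = list(deps)

    def merge(self, other):
        """Merge another DAG; steps with the same name are shared, not duplicated."""
        for name, fn in other.steps.items():
            if name not in self.steps:
                self.add_step(name, fn, other.deps[name])
        return self

    def run(self):
        """Execute every step once, in dependency order, caching results."""
        done = {}

        def visit(name):
            if name not in done:
                inputs = [visit(d) for d in self.deps[name]]
                done[name] = self.steps[name](*inputs)
            return done[name]

        for name in list(self.steps):
            visit(name)
        return done


# Two training flows sharing the same "read" and "transform" steps:
g1 = DataflowGraph()
g1.add_step("read", lambda: [1.0, 2.0, 3.0])
g1.add_step("transform", lambda xs: [x * 2 for x in xs], deps=["read"])
g1.add_step("train_a", lambda xs: sum(xs), deps=["transform"])

g2 = DataflowGraph()
g2.add_step("read", lambda: [1.0, 2.0, 3.0])
g2.add_step("transform", lambda xs: [x * 2 for x in xs], deps=["read"])
g2.add_step("train_b", lambda xs: max(xs), deps=["transform"])

results = g1.merge(g2).run()
```

After merging, the shared "read" and "transform" steps run once while both configuration-specific training steps still produce their own results.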
The parameter server 2000 is used to maintain the parameters of the machine learning model. As described above, a machine learning model can be regarded as a function of the features of machine-learning samples; through continuous iterative training on the full set of data records, it can gradually converge to the optimal parameter solution of that function. According to an exemplary embodiment of the present invention, the parameter server 2000 maintains the parameters of the machine learning model, so that each computing device 1000 can obtain the corresponding parameters by interacting with the parameter server 2000 when performing the data-flow computation; on the other hand, in the training stage of the machine learning model, the parameter server 2000 may update the parameters based on the operation results of the computing devices 1000. That is, when the data-flow computation for training the machine learning model is performed, the computing devices 1000 use parameters obtained from the parameter server 2000 to perform the same training operations of the machine learning model on their respective data records, and the parameter server 2000 updates the parameters according to the operation results of the computing devices 1000; and/or, when the data-flow computation for prediction with the machine learning model is performed, the computing devices 1000 use parameters obtained from the parameter server 2000 to perform the same prediction operations of the machine learning model on their respective data records.
As an example, the parameter server 2000 may be deployed on a single computing machine, or simultaneously on multiple associated computing machines. For example, the parameter server 2000 may act as a global hash table: each partial parameter server stores a portion of the key-value pairs of the parameters, and the key-value pairs on the various partial parameter servers have no intersection. As an example, the parameter server 2000 may store the parameters of the machine learning model as key-value pairs, and key-value pairs sharing the same key may be stored in the form of a single key corresponding to multiple values.
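The parameter layout just described — key-value storage with a single key holding multiple values (as in Fig. 9), partitioned across partial servers with no key overlap — can be sketched as below. This is a hypothetical illustration; the class and function names are assumptions, and a production system would use a stable hash rather than Python's salted `hash()`.

```python
# Hypothetical sketch of the key-value parameter layout: when the same key
# exists under several model configurations, the single key maps to a list of
# values (one slot per configuration), and keys are partitioned across
# partial parameter servers without intersection.
class ParameterShard:
    def __init__(self):
        self.table = {}  # key -> list of values, one slot per configuration

    def set(self, key, config_id, value):
        slot = self.table.setdefault(key, [])
        while len(slot) <= config_id:
            slot.append(None)
        slot[config_id] = value

    def get(self, key, config_id):
        return self.table[key][config_id]


def shard_for(key, shards):
    """Each key lives on exactly one partial server (no overlap)."""
    return shards[hash(key) % len(shards)]


shards = [ParameterShard() for _ in range(3)]
# A weight "w1" trained under two configurations shares one key, two values:
shard_for("w1", shards).set("w1", 0, 0.25)
shard_for("w1", shards).set("w1", 1, 0.75)
```

Storing one key with multiple values avoids duplicating the key itself when several model configurations are trained at once.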
According to an exemplary embodiment of the present invention, the data records from a data source (for example, a remote network) may first be split into different sets and stored respectively on the local hard disks or in the memories of the computing devices 1000; accordingly, each computing device 1000 can independently complete its machine-learning computing task by performing the data-flow computation on local data, which substantially reduces the overhead of reading and writing data.
As an example, when the machine learning model is small, each computing device 1000 may locally store a complete copy of the parameters; when the machine learning model is large, the parameters may be stored dispersedly on multiple partial parameter servers. According to an exemplary embodiment of the present invention, the parameter server 2000 may have a distributed structure in which each partial parameter server is integrated with a corresponding computing device 1000.
Fig. 2 is a block diagram of a distributed system for performing machine learning on data records according to another exemplary embodiment of the present invention. In the distributed machine-learning system shown in Fig. 2, each computing device 1000 has a corresponding partial parameter server 2001. In particular, computing device 1000-1 may be integrated with partial parameter server 2001-1 on the same virtual or physical machine, computing device 1000-2 may be integrated with partial parameter server 2001-2 on the same virtual or physical machine, and so on, with computing device 1000-n integrated with partial parameter server 2001-n on the same virtual or physical machine. In this distributed machine-learning system, each computing device 1000 may, in performing the data-flow computation, locally use the parameters maintained by the partial parameter server 2001 integrated with it; in addition, a computing device 1000 may also need parameters maintained by other partial parameter servers 2001, in which case it interacts with those other partial parameter servers 2001 to obtain all the parameters required for the data-flow computation.
It can be seen that, in the distributed machine-learning system according to an exemplary embodiment of the present invention, besides the plurality of computing devices, the parameter server may also have a distributed structure, i.e., there may be multiple partial parameter servers. In this case, having the computing devices perform the data-flow computation ensures that large-scale machine learning can be realized, so that a better machine-learning effect can be achieved through increases in feature dimensionality and data volume.
Fig. 3 is a block diagram of a parameter server according to an exemplary embodiment of the present invention. The parameter server shown in Fig. 3 may be either a global parameter server or a partial parameter server.
Referring to Fig. 3, the parameter server may include an interface unit 2100, a processing unit 2200, and a parameter memory 2300.
In particular, the interface unit 2100 may be used to interact with the computing devices 1000 so as to transmit instructions and/or data related to machine learning, where the data may be various related data such as the parameters of the machine learning model or operation results used for updating the parameters. Here, the interface unit 2100 may receive from a computing device 1000 an instruction requesting parameters, or an instruction related to updating parameters together with the corresponding operation results; in addition, the interface unit 2100 may send the relevant parameters to the computing device 1000.
The processing unit 2200 may perform processing according to the instructions and/or data received through the interface unit 2100 so as to update and/or provide the parameters, the parameters being stored by the parameter memory 2300. As an example, the processing unit 2200 may parse an instruction requesting parameters from a computing device 1000, fetch the corresponding parameters from the parameter memory 2300, and supply them to the interface unit 2100, which in turn supplies them to the computing device 1000. Alternatively, the processing unit 2200 may parse an instruction for updating parameters from a computing device 1000 together with the corresponding operation-result data, perform the parameter update according to the corresponding update rule, and store the updated parameters in the parameter memory 2300.
The parameter memory 2300 is used to store the parameters of the machine learning model. As an example, the parameter memory 2300 may store the parameters in the form of key-value pairs. According to an exemplary embodiment of the present invention, for multiple configurations of the model, a single key corresponding to multiple values may be used to store the corresponding parameters.
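The pull/push interaction between the interface unit, processing unit, and parameter memory can be sketched as a minimal server object. This is a hypothetical illustration with assumed names; the patent leaves the concrete update rule configurable, so plain gradient descent is used here only as an example.

```python
# Minimal sketch of the parameter-server roles described above: pull requests
# return current parameters (interface + memory), and push requests fold a
# computing device's operation result into the stored values (processing).
class ParameterServer:
    def __init__(self, init_params, lr=0.1):
        self.params = dict(init_params)  # parameter memory: key -> value
        self.lr = lr                     # example update rule: gradient step

    def pull(self, keys):
        """Serve the parameters a computing device requests."""
        return {k: self.params[k] for k in keys}

    def push(self, grads):
        """Update stored parameters from an operation result."""
        for k, g in grads.items():
            self.params[k] -= self.lr * g


ps = ParameterServer({"w": 1.0, "b": 0.0})
ps.push({"w": 0.5})            # a computing device reports a gradient for "w"
snapshot = ps.pull(["w", "b"])
```

A real deployment would put `pull`/`push` behind the interface unit's network protocol; the in-process calls here only illustrate the division of roles.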
Fig. 4 is a block diagram of a computing device according to an exemplary embodiment of the present invention. Referring to Fig. 4, a computing device 1000 may include an interface unit 1100 and an operation unit 1200.
In particular, the interface unit 1100 may be used to interact with a parameter server (for example, the parameter server 2000 or a partial parameter server 2001) so as to transmit instructions and/or data related to machine learning, where the data may be various related data such as the parameters of the machine learning model or operation results used for updating the parameters. Here, the interface unit 1100 may send to the parameter server an instruction requesting parameters and receive the requested parameters from the parameter server; in addition, the interface unit 1100 may provide the parameter server with operation results and the related instructions for updating the parameters. As an example, the interface unit 1100 may also be used to obtain from a data source the data records to be processed, or to back up intermediate operation results to another device.
The operation unit 1200 is used to perform the data-flow computation on the data records using the parameters of the machine learning model. Here, the operation unit 1200 may perform the various concrete operations involved in the data-flow computation for training and/or prediction of the machine learning model. As described above, according to an exemplary embodiment of the present invention, the data-flow computation may be represented as one or more directed acyclic graphs composed of processing steps. For example, in the training stage of a machine learning model it is often necessary to train the model under multiple configurations for model tuning; in this case, if multiple sets of models are to be trained simultaneously, the data-flow computation may be composed of multiple directed acyclic graphs differing in configuration. Accordingly, the results obtained by the operation unit 1200 may be passed to the parameter server or other devices via the interface unit 1100. It should be noted that the composition of the data-flow computation is not limited to the above example but may include any single directed acyclic graph or combination of different directed acyclic graphs; for example, the data-flow computation may denote a single flow of prediction with the machine learning model. In the prediction stage of the machine learning model, the results obtained by the operation unit 1200 for the respective data records can serve as the predicted values.
Fig. 5 is a flowchart of a method by which a distributed machine-learning system trains a machine learning model according to an exemplary embodiment of the present invention. The steps of the method may be performed by the computing devices and/or the parameter server (for example, the parameter server 2000 or the partial parameter servers 2001) of the distributed machine-learning system described above, for example according to a preset configuration, wherein each of the plurality of computing devices in the distributed system is configured to perform the same data-flow computation for machine learning on its own data records.
Referring to Fig. 5, in step S100, each of the plurality of computing devices obtains its respective data records. Here, the data records denote historical data records used for model training; in the case of supervised learning, they carry corresponding labels. For example, each computing device may first read from the data source the data records it is to process, the data records read by different computing devices having no intersection; that is, each computing device is assigned a part of the overall data records, and the devices then jointly perform the same training task. Once the data records have been read into a computing device's local storage, the corresponding data records can subsequently be obtained directly from local storage whenever the related operations need to be performed.
Next, in step S200, the computing devices obtain the parameters of the machine learning model from the parameter server in the distributed system. Here, each computing device may obtain all the required parameters from a single parameter server; alternatively, when the parameter server has a distributed structure, a computing device may, besides obtaining parameters from the partial parameter server integrated with it, also need to obtain further parameters from other partial parameter servers.
In step S300, the computing devices use the obtained parameters to perform the same training operations of the machine learning model on their respective data records. Here, a computing device can complete the computation steps involved in the data-flow computation based on the previously obtained data records and parameters.
In step S400, the parameter server updates the parameters according to the operation results of the computing devices. Here, depending on factors such as the design of the machine-learning algorithm and the distributed architecture, the parameter server may update the parameters at a certain frequency; for example, each computing device may aggregate its operation results to the parameter server after completing the operations on its data records, and the parameter server then performs the parameter update according to a predetermined update rule. Moreover, the update frequency is not limited to once per data record; for example, the parameters may be updated based on the operation results of a batch of data records or of one full iteration.
It should be noted that the steps in Fig. 5 are not limited to a specific execution order; for example, those skilled in the art should understand that, in the course of iterative operations on massive data records, it is generally necessary to obtain data records and/or parameters from outside or locally multiple times.
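Steps S100 through S400 can be simulated in a few lines for a toy least-squares model y = w * x. This is purely an illustrative sketch: the patent fixes neither the algorithm nor the update rule, so the model, learning rate, and averaging used here are assumptions.

```python
# Illustrative simulation of steps S100-S400: disjoint record shards (S100),
# each device pulls the current parameter (S200), runs the same training
# operation on its own shard (S300), and the server averages the devices'
# gradients to update the parameter (S400).
records = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # (x, y) pairs
n_devices = 2
shards = [records[i::n_devices] for i in range(n_devices)]  # S100: no overlap

w, lr = 0.0, 0.05
for _ in range(200):                      # iterate over the full data set
    grads = []
    for shard in shards:                  # every device, the same operation
        w_local = w                       # S200: pull parameter from server
        g = sum(2 * (w_local * x - y) * x for x, y in shard) / len(shard)
        grads.append(g)                   # S300: local training operation
    w -= lr * sum(grads) / len(grads)     # S400: server updates parameter
```

With the data generated by y = 2x, the parameter converges to w = 2, illustrating how identical per-device operations plus a central update reproduce ordinary gradient descent.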
When the trained machine learning model is small, a complete copy of the model parameters can be stored individually on each computing entity; when the machine learning model is large, however, the model parameters need to be stored in blocks on multiple partial parameter servers. Since a computing device must access the data many times while performing its operation task, appropriate disaster-recovery measures must be provided. Unlike prior-art approaches that perform disaster-recovery processing frequently, according to an exemplary embodiment of the present invention, when the data-flow computation for training the machine learning model is performed, disaster recovery is carried out for each iteration over the data records. By targeting disaster recovery in this specific way, operational efficiency can be significantly increased while disaster recovery is still achieved.
Fig. 6 is a flowchart of a method by which a distributed machine-learning system trains a machine learning model according to another exemplary embodiment of the present invention. In the method shown in Fig. 6, disaster recovery is performed per iteration; steps S100, S200, S300, and S400 are similar to the corresponding steps shown in Fig. 5 and are not described again in detail here.
Referring to Fig. 6, in step S500 it is determined whether one iteration of training over the full set of data records has been performed. If the iteration has not yet been completed, the method proceeds to step S700. If it is determined that the iteration has been completed, then in step S600 the currently obtained model parameters are backed up; for example, the currently obtained model parameters may additionally be cross-stored among the multiple partial parameter servers, i.e., besides keeping its own portion of the parameters, each partial parameter server additionally stores parameters maintained by other parameter servers; alternatively, the parameters may be backed up on devices other than the parameter server. Here, one or more copies of the currently obtained model parameters may be backed up to ensure disaster recovery.
In step S700, it is determined whether the training of the machine learning model is complete. If training is complete, the machine learning model composed of the parameters is obtained. Otherwise, the method returns to step S100 to continue obtaining new data records. According to the processing flow described above, the new data records are either records whose acquisition simply continues while the current round of iteration is still in progress, or records reacquired when the current round of iteration has just been completed. These data records may come from an external data source or from the computing device's local storage.
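The per-round flow of Fig. 6 — train on each record, back up the currently available parameters once per full pass over the data (steps S500/S600), and stop when training is complete — can be sketched as follows. The class name `ParameterServer`, the gradient rule, and the convergence setup are illustrative assumptions, not part of the patent.

```python
import copy

class ParameterServer:
    """Toy parameter server: holds parameters and per-round backups (illustrative)."""
    def __init__(self, params):
        self.params = dict(params)
        self.backups = []              # one snapshot per completed round (step S600)

    def pull(self):
        return dict(self.params)

    def push(self, grads, lr=0.1):     # update parameters from operation results
        for k, g in grads.items():
            self.params[k] -= lr * g

    def backup(self):                  # step S600: back up currently available parameters
        self.backups.append(copy.deepcopy(self.params))

def train(server, records, n_rounds):
    for _ in range(n_rounds):                  # one pass over all data records = one round
        for record in records:                 # steps S100-S300 on each computing device
            params = server.pull()
            grads = {k: 2 * (params[k] - record[k]) for k in params}
            server.push(grads)
        server.backup()                        # S500/S600: round complete -> disaster backup
    return server.params                       # S700: training done, model = parameters

server = ParameterServer({"w": 0.0})
final = train(server, records=[{"w": 1.0}] * 5, n_rounds=3)
print(len(server.backups))  # 3 backups, one per round
```

If a failure occurs mid-round, training can resume from the latest entry in `backups` instead of restarting from scratch, which is the efficiency gain the per-round scheme aims for.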
Fig. 7 is a flowchart of a method by which a distributed machine learning system according to an exemplary embodiment of the present invention performs prediction with a machine learning model. The steps of the method may be performed by the computing devices and/or the parameter server (for example, parameter server 2000 or partial parameter server 2001) of the distributed machine learning system described above, for example according to a preset configuration, wherein each of the multiple computing devices in the distributed system is configured to perform, for its respective data records, the same data-streaming computation about machine learning.
Referring to Fig. 7, in step S110, each of the multiple computing devices obtains its respective data records; here, the data records are records for model prediction (or testing). Each computing device may read the data records it is to process from the data source, with no overlap between the records read by different devices; that is, each computing device is assigned a portion of the overall data records, and the devices then jointly carry out the same prediction task.
Next, in step S210, the computing devices obtain the parameters of the machine learning model from the parameter server in the distributed system. Here, each computing device may obtain all of the required parameters from a single parameter server; alternatively, where the parameter server has a distributed structure, a computing device may obtain parameters from the partial parameter server integrated with it and also request and obtain the remaining parameters from the other partial parameter servers.
In step S310, the computing devices use the obtained parameters to perform, for their respective data records, the same operations of prediction with the machine learning model. Here, a computing device can complete the computation steps involved in the data-streaming computation based on the previously obtained data records and parameters.
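Steps S110–S310 amount to: partition the records without overlap, pull the same parameters everywhere, and apply the same prediction operation on each device. A minimal sketch, in which the linear model and the helper names `partition` and `predict_on_device` are assumptions for illustration:

```python
def partition(records, n_devices):
    """S110: assign each computing device a disjoint slice of the data records."""
    return [records[i::n_devices] for i in range(n_devices)]

def predict_on_device(records, params):
    """S210 + S310: use parameters pulled from the parameter server to run the
    same prediction operation over this device's records (a toy linear model)."""
    w, b = params["w"], params["b"]
    return [w * x + b for x in records]

params = {"w": 2.0, "b": 1.0}                 # S210: pulled from the parameter server
shards = partition([0, 1, 2, 3, 4, 5], n_devices=3)
results = [predict_on_device(s, params) for s in shards]
print(results)
```

Because the devices run identical code on disjoint shards, they stay homogeneous and no cross-device communication is needed during prediction itself.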
The methods by which a distributed machine learning system according to exemplary embodiments of the present invention performs machine learning have been described above with reference to Figs. 5 to 7. Some specific operations or other processing may be encapsulated as callable functions; for example, the synchronization waits, data merging, broadcast interactions, and similar processing in the data-streaming computation may each be wrapped as a callable function. In this way, programmers can invoke these functions directly when needed, which helps them concentrate on the distributed implementation logic and control the algorithm effectively without having to implement complicated lower-level logic.
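One way to read "encapsulated as callable functions" is that a primitive such as the synchronization wait is implemented once and then simply invoked by per-device code. A sketch using one thread per computing device; the function name `sync_wait` and the thread-based setup are assumptions, not the patent's implementation:

```python
import threading

N_DEVICES = 4
_barrier = threading.Barrier(N_DEVICES)
results = [0] * N_DEVICES

def sync_wait():
    """Encapsulated synchronization wait: each device blocks here until all arrive."""
    _barrier.wait()

def device_task(rank):
    results[rank] = rank * rank   # local processing step on this "device"
    sync_wait()                   # the callable primitive hides the low-level logic

threads = [threading.Thread(target=device_task, args=(r,)) for r in range(N_DEVICES)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)
```

The caller never touches the barrier object directly, which is the point of the encapsulation: distributed logic is expressed as plain function calls.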
Moreover, it should be noted that although the steps of each process flow are displayed sequentially in the flowcharts of Figs. 5 to 7, the execution order of the steps is not limited to that temporal order; steps may be performed simultaneously or in a different order. For example, in the case where a computing device and its corresponding partial parameter server are integrated on a single computing machine, the computing device may first complete the corresponding operations using the local parameters, then obtain the other parameters from the partial parameter servers on other computing machines via the system's communication functions, and finally complete the remaining operations based on those other parameters.
According to an exemplary embodiment of the present invention, when the data-streaming computation performed by a computing device involves multiple directed acyclic graphs, the computing device may perform the computation by merging identical processing steps in the different directed acyclic graphs. For example, by merging identical processing steps starting from the upstream ends of the directed acyclic graphs, the computing device can reduce the amount of computation, so that the time to run multiple tasks is less than the sum of the times to run each task separately.
Fig. 8 shows an example of performing data-streaming computation by merging directed acyclic graphs according to an exemplary embodiment of the present invention. Part (a) of Fig. 8 shows the directed acyclic graphs of the data-streaming computation that every computing device needs to perform; that is, every computing device is required to execute the computation task shown in (a) of Fig. 8, the devices being homogeneous. Specifically, the data-streaming computation shown in (a) of Fig. 8 comprises two independent directed acyclic graphs: the first task, corresponding to the first graph, consists of the four processing steps 1, 2, 3, and 4, and the second task, corresponding to the second graph, consists of the four processing steps 1, 2, 5, and 6. A processing step here may represent various processing such as obtaining data records, computing, or outputting computation results. When a particular step is reached, the computing devices can synchronize with one another through the encapsulated functions.
According to an exemplary embodiment of the present invention, a computing device may analyze the directed acyclic graphs from their upstream ends to search for and merge identical processing steps. For example, suppose the two directed acyclic graphs both obtain data records from the same data source and their two initial steps are identical (processing steps 1 and 2); the computing device may then merge these identical steps to obtain the directed acyclic graph shown in (b) of Fig. 8. In this way, only the merged graph needs to be executed, which reduces the actual amount of computation and of reading and writing and brings a performance boost.
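The merge of Fig. 8 — two task pipelines sharing steps 1 and 2 are combined so the common prefix runs only once — can be sketched as follows, representing each graph as a linear list of step names (an assumed simplification of a general directed acyclic graph):

```python
def merge_common_prefix(task_a, task_b):
    """Merge the identical upstream steps of two pipelines (Fig. 8 (a) -> (b)).

    Returns (shared, tail_a, tail_b): the shared steps execute once, then each
    task's remaining steps branch off the merged result.
    """
    i = 0
    while i < min(len(task_a), len(task_b)) and task_a[i] == task_b[i]:
        i += 1
    return task_a[:i], task_a[i:], task_b[i:]

# First task: steps 1, 2, 3, 4; second task: steps 1, 2, 5, 6, as in Fig. 8 (a)
shared, tail_a, tail_b = merge_common_prefix(
    ["step1", "step2", "step3", "step4"],
    ["step1", "step2", "step5", "step6"],
)
print(shared, tail_a, tail_b)
```

Here the two shared upstream steps (including reading from the data source) are executed once instead of twice, which is where the saving in computation and I/O comes from.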
Fig. 9 shows an example of saving the parameters of a machine learning model as key-value pairs according to an exemplary embodiment of the present invention. The numerous parameters of the machine learning model may be saved in the form of key-value pairs, so that when there are multiple sets of key-value pairs (for example, multiple configurations of the same machine learning algorithm), they can take the form shown in (a) of Fig. 9, where under each configuration each key corresponds to its own value: k1, k2, k3, ..., kn correspond to v11, v12, v13, ..., v1n, or alternatively k1, k2, k3, ..., kn correspond to v21, v22, v23, ..., v2n, where n is an integer greater than 1. According to an exemplary embodiment of the present invention, the key-value pairs may be saved with their keys merged, forming the structure shown in (b) of Fig. 9, where a single key corresponds to multiple values (for example, k1 corresponds to both v11 and v21), thereby reducing the storage overhead of the parameter server. Likewise, when a computing device and the parameter server need to exchange the parameters relevant to both configurations at the same time, the keys can be merged and/or compressed during transmission to reduce the network transmission overhead.
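The transformation from Fig. 9 (a) to Fig. 9 (b) — per-configuration key-value maps merged so each key is stored once with a list of values — might look like the following sketch; the function name is an assumption:

```python
def merge_key_value_sets(configs):
    """Merge several {key: value} maps (one per configuration) into a single
    {key: [value_per_config]} map, as in Fig. 9 (b), so each key is stored once."""
    merged = {}
    for idx, kv in enumerate(configs):
        for k, v in kv.items():
            merged.setdefault(k, [None] * len(configs))[idx] = v
    return merged

config1 = {"k1": "v11", "k2": "v12", "k3": "v13"}
config2 = {"k1": "v21", "k2": "v22", "k3": "v23"}
merged = merge_key_value_sets([config1, config2])
print(merged["k1"])  # ['v11', 'v21'] -- a single key now carries both values
```

With n keys shared across m configurations, each key string is stored (and transmitted) once instead of m times, which is the storage and network saving the embodiment describes.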
It should be understood that the parameter server, the computing devices, and the devices or units that compose them in a distributed machine learning system according to exemplary embodiments of the present invention may each be configured as software, hardware, firmware, or any combination thereof that performs a specific function. For example, these components may correspond to dedicated integrated circuits, to pure software code, or to modules combining software with hardware. When they are implemented in software, firmware, middleware, or microcode, the program code or code segments for performing the corresponding operations may be stored in a computer-readable medium such as a storage medium, so that a processor can perform the corresponding operations by reading and running the code. In addition, one or more of the functions realized by these components may instead be performed uniformly by components in a physical entity device (for example, a computing machine).
It should be noted that a distributed machine learning system according to exemplary embodiments of the present invention may rely entirely on the running of a computer program to realize the corresponding functions; that is, each component corresponds to a step in the functional structure of the computer program, so that the whole system is invoked through a special software package (for example, a lib library) to realize the corresponding functions.
The exemplary embodiments of the present invention have been described above. It should be understood that the foregoing description is merely exemplary and not exhaustive, and that the present invention is not limited to the disclosed exemplary embodiments. Many modifications and variations will be obvious to those of ordinary skill in the art without departing from the scope and spirit of the present invention. Therefore, the protection scope of the present invention shall be subject to the scope of the claims.
Claims (11)
1. A distributed system for performing machine learning for data records, comprising:
a plurality of computing devices, wherein each computing device is configured to perform, for its respective data records, the same data-streaming computation about machine learning; and
a parameter server for maintaining parameters of a machine learning model;
wherein, when the data-streaming computation of training the machine learning model is performed, the computing devices use parameters obtained from the parameter server to perform, for their respective data records, the same operations of training the machine learning model, and the parameter server updates the parameters according to the operation results of the computing devices; and/or, when the data-streaming computation of prediction with the machine learning model is performed, the computing devices use parameters obtained from the parameter server to perform, for their respective data records, the same operations of prediction with the machine learning model;
wherein the parameter server has a distributed structure in which there is a partial parameter server corresponding to each computing device; the data-streaming computation is expressed as at least one directed acyclic graph composed of processing steps, wherein different directed acyclic graphs correspond to multi-configuration operation of one machine learning algorithm process or to a plurality of machine learning algorithm processes, and a computing device performs the data-streaming computation by merging identical processing steps in different directed acyclic graphs; and when a computing device and the parameter server need to simultaneously exchange the parameters relevant to two or more configurations of the model, identical keys are merged during transmission.
2. The distributed system of claim 1, wherein, under the distributed structure, each partial parameter server is integrated with its corresponding computing device.
3. The distributed system of claim 1, wherein, when the data-streaming computation of training the machine learning model is performed, disaster backup is carried out for each round of iteration over the data records.
4. The distributed system of claim 1, wherein the parameter server saves the parameters of the machine learning model as key-value pairs, and key-value pairs having the same key are saved in a form in which a single key corresponds to multiple values.
5. The distributed system of claim 1, wherein the parameter server comprises an interface device, a processing device, and a parameter memory;
the interface device is configured to receive instructions requesting parameters from the computing devices and send the corresponding parameters to the computing devices; or to receive instructions requesting parameters from the computing devices, send the relevant parameters to the computing devices, and receive from the computing devices the operation results and related instructions for updating the parameters;
the processing device is configured to parse the instructions requesting parameters from the computing devices and supply the corresponding parameters from the parameter memory to the interface device; or to parse the instructions requesting parameters from the computing devices, supply the corresponding parameters from the parameter memory to the interface device, parse the operation results and related instructions for updating the parameters from the computing devices, perform the parameter update according to the corresponding parameter update manner, and store the updated parameters in the parameter memory; and
the parameter memory is configured to save the parameters of the machine learning model.
6. The distributed system of claim 1, wherein the computing device comprises an interface unit and an arithmetic unit;
the interface unit is configured to send instructions requesting parameters to the parameter server and receive the requested corresponding parameters from the parameter server; or to send instructions requesting parameters to the parameter server, receive the requested corresponding parameters from the parameter server, and provide to the parameter server the operation results and related instructions for updating the parameters; and
the arithmetic unit is configured to use the parameters of the machine learning model to perform the data-streaming computation for the data records.
7. A method of performing machine learning for data records using a distributed system, wherein each of a plurality of computing devices in the distributed system is configured to perform, for its respective data records, the same data-streaming computation about machine learning, the method comprising:
obtaining, by each of the multiple computing devices, its respective data records; and
obtaining, by the computing devices, the parameters of a machine learning model from a parameter server in the distributed system;
wherein, when the data-streaming computation of training the machine learning model is performed, the computing devices use the obtained parameters to perform, for their respective data records, the same operations of training the machine learning model, and the parameter server updates the parameters according to the operation results of the computing devices; and/or, when the data-streaming computation of prediction with the machine learning model is performed, the computing devices use the obtained parameters to perform, for their respective data records, the same operations of prediction with the machine learning model;
wherein the parameter server has a distributed structure in which there is a partial parameter server corresponding to each computing device; the data-streaming computation is expressed as at least one directed acyclic graph composed of processing steps, wherein different directed acyclic graphs correspond to multi-configuration operation of one machine learning algorithm process or to a plurality of machine learning algorithm processes, and a computing device performs the data-streaming computation by merging identical processing steps in different directed acyclic graphs; and when a computing device and the parameter server need to simultaneously exchange the parameters relevant to two or more configurations of the model, identical keys are merged during transmission.
8. The method of claim 7, wherein, under the distributed structure, each partial parameter server is integrated with its corresponding computing device.
9. The method of claim 7, wherein, when the data-streaming computation of training the machine learning model is performed, disaster backup is carried out for each round of iteration over the data records.
10. The method of claim 7, wherein the parameter server saves the parameters of the machine learning model as key-value pairs, and key-value pairs having the same key are saved in a form in which a single key corresponds to multiple values.
11. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method of any one of claims 7 to 10.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710764131.0A CN107679625B (en) | 2017-08-30 | 2017-08-30 | The distributed system and its method of machine learning are executed for data record |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107679625A CN107679625A (en) | 2018-02-09 |
CN107679625B true CN107679625B (en) | 2019-09-17 |
Family
ID=61134942
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710764131.0A Active CN107679625B (en) | 2017-08-30 | 2017-08-30 | The distributed system and its method of machine learning are executed for data record |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107679625B (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107609652B (en) * | 2017-08-30 | 2019-10-25 | 第四范式(北京)技术有限公司 | Execute the distributed system and its method of machine learning |
CN108921879A (en) * | 2018-05-16 | 2018-11-30 | 中国地质大学(武汉) | The motion target tracking method and system of CNN and Kalman filter based on regional choice |
CN110766164A (en) * | 2018-07-10 | 2020-02-07 | 第四范式(北京)技术有限公司 | Method and system for performing a machine learning process |
CN109445953A (en) * | 2018-08-30 | 2019-03-08 | 北京大学 | A kind of machine learning model training method towards large-scale machines learning system |
CN110968887B (en) * | 2018-09-28 | 2022-04-05 | 第四范式(北京)技术有限公司 | Method and system for executing machine learning under data privacy protection |
CN111507476A (en) * | 2019-01-31 | 2020-08-07 | 伊姆西Ip控股有限责任公司 | Method, apparatus and computer program product for deploying machine learning model |
CN112148202B (en) * | 2019-06-26 | 2023-05-26 | 杭州海康威视数字技术股份有限公司 | Training sample reading method and device |
CN111680799B (en) | 2020-04-08 | 2024-02-20 | 北京字节跳动网络技术有限公司 | Method and device for processing model parameters |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105760932A (en) * | 2016-02-17 | 2016-07-13 | 北京物思创想科技有限公司 | Data exchange method, data exchange device and calculating device |
CN106022483A (en) * | 2016-05-11 | 2016-10-12 | 星环信息科技(上海)有限公司 | Method and equipment for conversion between machine learning models |
CN106156810A (en) * | 2015-04-26 | 2016-11-23 | 阿里巴巴集团控股有限公司 | General-purpose machinery learning algorithm model training method, system and calculating node |
CN106294762A (en) * | 2016-08-11 | 2017-01-04 | 齐鲁工业大学 | A kind of entity recognition method based on study |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10289962B2 (en) * | 2014-06-06 | 2019-05-14 | Google Llc | Training distilled machine learning models |
US10262272B2 (en) * | 2014-12-07 | 2019-04-16 | Microsoft Technology Licensing, Llc | Active machine learning |
CN105721211A (en) * | 2016-02-24 | 2016-06-29 | 北京格灵深瞳信息技术有限公司 | Data processing method and device |
2017-08-30 — CN application CN201710764131.0A filed; granted as CN107679625B (status: Active)
Also Published As
Publication number | Publication date |
---|---|
CN107679625A (en) | 2018-02-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107609652B (en) | Execute the distributed system and its method of machine learning | |
CN107679625B (en) | The distributed system and its method of machine learning are executed for data record | |
CN109086031B (en) | Business decision method and device based on rule engine | |
US9672065B2 (en) | Parallel simulation using multiple co-simulators | |
CN108351805A (en) | Calculate the accelerator processing based on stream of figure | |
CN109359732B (en) | Chip and data processing method based on chip | |
US11188348B2 (en) | Hybrid computing device selection analysis | |
US10929161B2 (en) | Runtime GPU/CPU selection | |
EP2738675B1 (en) | System and method for efficient resource management of a signal flow programmed digital signal processor code | |
WO2024041400A1 (en) | Model training task scheduling method and apparatus, and electronic device | |
CN110633785B (en) | Method and system for calculating convolutional neural network | |
CN114154641A (en) | AI model training method and device, computing equipment and storage medium | |
CN111352896B (en) | Artificial intelligence accelerator, equipment, chip and data processing method | |
Sundas et al. | An introduction of CloudSim simulation tool for modelling and scheduling | |
CN112783614A (en) | Object processing method, device, equipment, storage medium and program product | |
CN108985459A (en) | The method and apparatus of training pattern | |
KR20160031360A (en) | Apparatus and method for executing an application based on an open computing language | |
CN115129481B (en) | Computing resource allocation method and device and electronic equipment | |
CN114358253A (en) | Time estimation method of neural network model and related product | |
CN110659125A (en) | Analysis task execution method, device and system and electronic equipment | |
Dziok et al. | Adaptive multi-level workflow scheduling with uncertain task estimates | |
CN112242959B (en) | Micro-service current-limiting control method, device, equipment and computer storage medium | |
CN112817573B (en) | Method, apparatus, computer system, and medium for building a streaming computing application | |
CN116032928B (en) | Data collaborative computing method, device, system, electronic device and storage medium | |
CN111208980B (en) | Data analysis processing method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||