CN109358944A - Deep learning distributed arithmetic method, apparatus, computer equipment and storage medium - Google Patents
- Publication number
- CN109358944A (application CN201811080562.6A)
- Authority
- CN
- China
- Prior art keywords
- task
- code
- deep learning
- docker container
- cryptographic hash
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/44—Arrangements for executing specific programs
- G06F9/455—Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
- G06F9/45504—Abstract machines for programme code execution, e.g. Java virtual machine [JVM], interpreters, emulators
- G06F9/45516—Runtime code conversion or optimisation
- G06F9/45525—Optimisation or modification within the same instruction set architecture, e.g. HP Dynamo
Abstract
The present invention relates to a deep learning distributed computing method, apparatus, computer equipment and storage medium. The method includes: encapsulating a model to form a model object; determining the computation amount; acquiring a request; obtaining a hash value according to the request; distributing tasks according to the hash value and the model object; executing the code corresponding to a task in a Docker container according to the distribution result; judging whether the code corresponding to the task has finished executing; and, if so, releasing the code corresponding to the task in the Docker container. By adding a further functional layer on top of TensorFlow's distributed computing framework, invoking Docker containers and running the task code inside them, the invention unifies the running environment, isolates the graphics processors from one another, allows the compute cluster to be scaled dynamically, and thereby improves the training speed of deep learning models.
Description
Technical field
The present invention relates to distributed computing methods, and more specifically to a deep learning distributed computing method, apparatus, computer equipment and storage medium.
Background art
In recent years, deep learning and distributed computing have been research topics of much interest in the field of machine learning, and they are now widely used in the development of artificial-intelligence applications.
Distributed computing is a computing method, the opposite of centralized computing. With the development of computing technology, some applications require enormous computing power; if computed centrally, they would take a very long time to complete. Distributed computing decomposes such an application into many small parts that are distributed to multiple computers for processing, which saves overall computation time and greatly improves computational efficiency.
The concept of deep learning derives from research on artificial neural networks; a multilayer perceptron with multiple hidden layers is one kind of deep learning structure. Deep learning combines low-level features to form more abstract high-level representations of attribute categories or features, in order to discover distributed feature representations of data.
Current distributed computing can only be carried out on fixed compute clusters and cannot scale a cluster dynamically, which limits its applicability; moreover, none of the existing methods isolates the running environment of distributed deep learning computation from the GPU hardware so as to enable dynamic expansion of the compute cluster, so the training speed of deep learning models remains low.
Therefore, it is necessary to design a new method that isolates the running environment from the graphics processors, enables dynamic expansion of the compute cluster, and improves the training speed of deep learning models.
Summary of the invention
It is an object of the invention to overcome the deficiencies of the prior art by providing a deep learning distributed computing method, apparatus, computer equipment and storage medium.
To achieve the above object, the invention adopts the following technical scheme. A deep learning distributed computing method comprises:
encapsulating a model to form a model object;
determining the computation amount;
acquiring a request;
obtaining a hash value according to the request;
distributing tasks according to the hash value and the model object;
executing the code corresponding to a task in a Docker container according to the distribution result;
judging whether the code corresponding to the task has finished executing;
if so, releasing the code corresponding to the task in the Docker container.
In a further technical solution, determining the computation amount comprises:
obtaining the number of graphics processors participating in the computation;
obtaining the code training load according to the parameter count of the model object, the number of training steps and the number of graphics processors;
determining the computation amount according to the code training load.
In a further technical solution, before obtaining the hash value according to the request, the method further comprises:
uploading the dataset of the successfully answered request.
In a further technical solution, executing the code corresponding to the task in a Docker container according to the distribution result further comprises:
obtaining an idle compute cluster;
sending the task-related hash value, the model object and the number of graphics processors for the task to the compute cluster;
downloading the code corresponding to the task and the dataset of the successfully answered request according to the hash value;
distributing the task code and the dataset to the child nodes in the compute cluster;
starting the Docker containers required for the child nodes in the compute cluster to run;
running the code corresponding to the task in the Docker containers.
In a further technical solution, before releasing the code in the application container, the method further comprises:
obtaining the execution result;
feeding back the execution result.
The present invention also provides a deep learning distributed computing device, comprising:
an encapsulation unit for encapsulating a model to form a model object;
an amount determination unit for determining the computation amount;
a request unit for acquiring a request;
a hash value acquisition unit for obtaining a hash value according to the request;
a dispatching unit for distributing tasks according to the hash value and the model object;
a code execution unit for executing the code corresponding to a task in a Docker container according to the distribution result;
an execution judging unit for judging whether the code corresponding to the task has finished executing;
a releasing unit for, if so, releasing the code corresponding to the task in the application container.
In a further technical solution, the amount determination unit includes:
a number acquisition subunit for obtaining the number of graphics processors participating in the computation;
a training load acquisition subunit for obtaining the code training load according to the parameter count of the model object, the number of training steps and the number of graphics processors;
a determination subunit for determining the computation amount according to the code training load.
In a further technical solution, the device further includes:
a dataset acquisition unit for uploading the dataset of the successfully answered request.
The present invention also provides a computer equipment comprising a memory and a processor, the memory storing a computer program; the processor implements the above method when executing the computer program.
The present invention also provides a storage medium storing a computer program which, when executed by a processor, implements the above method.
Compared with the prior art, the invention has the following advantages: by adding a further functional layer on top of TensorFlow's distributed computing framework, invoking Docker containers and running the task code inside them, it unifies the running environment, isolates the graphics processors from one another, enables dynamic expansion of the compute cluster, and improves the training speed of deep learning models.
The invention is further described below in conjunction with the drawings and specific embodiments.
Brief description of the drawings
In order to explain the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of an application scenario of the deep learning distributed computing method provided by an embodiment of the present invention;
Fig. 2 is a schematic flowchart of the deep learning distributed computing method provided by an embodiment of the present invention;
Fig. 3 is a schematic sub-flowchart of the deep learning distributed computing method provided by an embodiment of the present invention;
Fig. 4 is a schematic sub-flowchart of the deep learning distributed computing method provided by an embodiment of the present invention;
Fig. 5 is a schematic block diagram of the deep learning distributed computing device provided by an embodiment of the present invention;
Fig. 6 is a schematic block diagram of the amount determination unit of the deep learning distributed computing device provided by an embodiment of the present invention;
Fig. 7 is a schematic block diagram of the code execution unit of the deep learning distributed computing device provided by an embodiment of the present invention;
Fig. 8 is a schematic block diagram of the computer equipment provided by an embodiment of the present invention.
Specific embodiment
The technical solutions in the embodiments of the present invention will be described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
It should be understood that when used in this specification and the appended claims, the terms "including" and "comprising" indicate the presence of the described features, integers, steps, operations, elements and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or sets thereof.
It should also be understood that the terminology used in this description of the invention is for the purpose of describing particular embodiments only and is not intended to limit the invention. As used in the description of the invention and the appended claims, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should further be understood that the term "and/or" used in the description of the invention and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
Please refer to Fig. 1 and Fig. 2. Fig. 1 is a schematic diagram of an application scenario of the deep learning distributed computing method provided by an embodiment of the present invention, and Fig. 2 is a schematic flowchart of the method. The method is applied across a management server, a proxy server and processing servers, each of which may be a server in a distributed service platform. The management server exchanges data with a user terminal: the user enters the computation amount through the computing app on the user terminal, whereupon the management server invokes the proxy server; the proxy server, following the management server's instructions, invokes the processing servers to perform the distributed computation; the processing servers feed the computation result back to the proxy server, which returns it to the user terminal via the management server.
It should be noted that only one management server is illustrated in Fig. 2; in actual operation, multiple management servers perform computation simultaneously.
Fig. 2 is a schematic flowchart of the deep learning distributed computing method provided by an embodiment of the present invention. As shown in Fig. 2, the method includes the following steps S110 to S210.
S110: encapsulate a model to form a model object.
In this embodiment, a model object is a model that takes TensorFlow as its kernel, meets the interface requirements and carries out deep learning; the model referred to above is a TensorFlow model.
The user first follows the sample code provided by the management server and uses the management server's interface to encapsulate the model code into a model object. A TensorFlow (second-generation artificial intelligence learning system) model is a model that passes complex data structures into an artificial neural network for analysis and processing. In the encapsulation, the model generally serves as the kernel and its corresponding code is wrapped in the form of an interface, forming a callable model object. This avoids the wasteful repetition of redefining training code at the start of every deep learning project and improves development efficiency.
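The patent does not spell out the encapsulation interface itself. The following is a minimal sketch, assuming TensorFlow 1.x, of what wrapping user model code behind a uniform, callable model object could look like; the ModelObject class and every name in it are illustrative assumptions, not taken from the patent.

```python
# Hypothetical sketch of the model-object encapsulation (TensorFlow 1.x
# assumed); class and function names are illustrative, not from the patent.
import tensorflow as tf


class ModelObject:
    """Wraps user model code behind a uniform, callable interface."""

    def __init__(self, build_fn, param_count, train_steps):
        self.build_fn = build_fn        # user-supplied graph-building code
        self.param_count = param_count  # number of model parameters
        self.train_steps = train_steps  # requested number of training steps

    def build(self):
        # Graph construction is deferred so the same object can be
        # instantiated unchanged on any worker node.
        return self.build_fn()


def simple_model():
    # Example user code: a single dense layer for MNIST-sized input.
    x = tf.placeholder(tf.float32, [None, 784])
    w = tf.Variable(tf.zeros([784, 10]))
    b = tf.Variable(tf.zeros([10]))
    return tf.matmul(x, w) + b


model_object = ModelObject(simple_model, param_count=7850, train_steps=1000)
```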
TensorFlow is currently the most widely used deep learning computation framework and can be used for machine learning across a variety of perception and language-understanding tasks. With the TensorFlow framework it is convenient to study, build, train and deploy all kinds of deep learning models, and since gRPC remote-call functionality is integrated at the TensorFlow core, distributed model training is easily implemented on this basis. A distributed computing system built on TensorFlow therefore has the advantages of good operational stability, a wide audience, broad applicability and convenient development.
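As an illustration of the built-in gRPC distribution that the patent builds on, here is a minimal TensorFlow 1.x cluster definition; the host addresses and task index are placeholder assumptions.

```python
# Sketch of TensorFlow's native gRPC-based distribution (TF 1.x API);
# the addresses and task index below are placeholders.
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps":     ["ps0.example.com:2222"],        # parameter server
    "worker": ["worker0.example.com:2222",
               "worker1.example.com:2222"],    # compute nodes
})

# Each node starts a gRPC server for its own job name and index.
server = tf.train.Server(cluster, job_name="worker", task_index=0)

with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    # Variables land on the parameter server; ops run on the local worker.
    global_step = tf.train.get_or_create_global_step()
```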
S120: determine the computation amount.
In this embodiment, the computation amount is the amount required to carry out the distributed computation, such as a payment amount or a task-completion quota.
In one embodiment, as shown in Fig. 3, the above step S120 may include steps S121 to S123.
S121: obtain the number of graphics processors participating in the computation.
The user selects the number of graphics processors to participate in the computation through the computing app on the user terminal. If the number of currently available graphics processors is smaller than the number the user selected, the management server keeps waiting for working graphics processors to finish until the selected number is available; only then can the computation amount be determined.
S122: obtain the code training load according to the parameter count of the model object, the number of training steps and the number of graphics processors.
In this embodiment, the parameter count refers to the number of neural network parameters, i.e. the parameters of the deep learning model; the number of training steps refers to the step count of the deep learning run; and the code training load refers to the amount of distributed deep learning computation that must be performed.
The parameter count of the model object, the number of training steps and the number of graphics processors carry different weights. Each of the three factors is therefore first assigned a weight, and the code training load is obtained as their weighted sum.
S123: determine the computation amount according to the code training load.
In this embodiment, there is a fixed conversion relationship between the code training load and the computation amount, such as a linear relationship or a one-to-one mapping, so the code training load can be converted into the computation amount according to that relationship.
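The patent specifies neither the weights nor the conversion rate. The sketch below shows one way the weighted sum of S122 and the linear conversion of S123 could be computed; every numeric value in it is an assumed example.

```python
# Illustrative weighted-sum training-load estimate and linear conversion
# to a computation amount; all weights and rates here are assumptions.
def estimate_amount(param_count, train_steps, gpu_count,
                    w_params=1e-6, w_steps=1e-3, w_gpus=0.5,
                    rate_per_unit=0.01):
    # Weighted sum of the three factors gives the code training load.
    training_load = (w_params * param_count
                     + w_steps * train_steps
                     + w_gpus * gpu_count)
    # Any fixed conversion works; a linear rate is shown here.
    return training_load * rate_per_unit


amount = estimate_amount(param_count=7_850_000, train_steps=100_000,
                         gpu_count=4)
```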
S130: acquire a request.
After the computation amount has been determined, the user settles the amount, or claims the task, through the computing app on the user terminal; this constitutes the request.
S140: upload the dataset of the successfully answered request.
S150: obtain a hash value according to the request.
After the request is successfully answered, for example after the payment succeeds or the task is successfully claimed, the EOS (blockchain) node in the management server performs the response operation for the request, such as a payment transfer operation. After the confirmation node returns the payment-completion information, the dataset of the successfully answered request is uploaded to IPFS (InterPlanetary File System); once the upload completes, a storage hash value is returned.
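For illustration, uploading a dataset to a local IPFS daemon and reading back the storage hash might look like the following, using the ipfshttpclient Python library; the daemon address and file name are assumptions.

```python
# Sketch of the dataset upload to IPFS; daemon address and file name
# are assumed, and ipfshttpclient is one of several client libraries.
import ipfshttpclient

client = ipfshttpclient.connect("/ip4/127.0.0.1/tcp/5001/http")
result = client.add("dataset.tar.gz")  # upload the response dataset
dataset_hash = result["Hash"]          # content-addressed storage hash
print("stored at", dataset_hash)
```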
S160: distribute tasks according to the hash value and the model object.
In this embodiment, the management server feeds the hash value and the model object back to the proxy server, and the proxy server distributes tasks according to them. Specifically, different pairings of hash value and model object correspond to different computation tasks, so each computation task, identified by its one-to-one pairing of hash value and model object, must be issued to a designated processing server. In addition, the management server sends the proxy server the order to start the computation task.
S170: execute the code corresponding to the task in a Docker container according to the distribution result.
In this embodiment, the distribution result specifies which task is assigned to the compute cluster of which processing server for handling.
In one embodiment, as shown in Fig. 4, the above step S170 may include steps S171 to S176.
S171: obtain an idle compute cluster.
The compute clusters are integrated in the processing servers. Specifically, after receiving the order from the management server to start the computation task, the proxy server performs task scheduling: idle processing servers queue up to form a waiting list, and an idle processing server is chosen according to that list.
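A minimal sketch of such a waiting list follows, assuming a first-in-first-out discipline; the patent says only that idle servers queue up, so the queue order is an assumption.

```python
# FIFO waiting list of idle compute clusters; the queue discipline is
# an assumption consistent with the scheduling described above.
from collections import deque

idle_clusters = deque()  # clusters enqueue themselves when they go idle


def on_cluster_idle(cluster_id):
    idle_clusters.append(cluster_id)


def dispatch(task):
    if not idle_clusters:
        return None  # nothing free; the task keeps waiting
    return idle_clusters.popleft(), task  # longest-waiting cluster first
```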
S172: send the task-related hash value, the model object and the number of graphics processors for the task to the compute cluster.
The management server sends the task-related hash value, the code of the model object and the number of graphics processor nodes chosen by the user to the processing server, notifying it to start the computation task.
S173: download the code corresponding to the task and the dataset of the successfully answered request according to the hash value.
After the processing server receives the notice of the computation task, the master node of the compute cluster uses the hash value to download the related task code and the dataset of the successfully answered request from the IPFS (InterPlanetary File System) server.
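The download counterpart on the master node could be as simple as the following, again with ipfshttpclient; the hash value is a placeholder that in practice arrives with the task notice.

```python
# Sketch of the master node fetching task code and dataset by hash.
import ipfshttpclient

client = ipfshttpclient.connect("/ip4/127.0.0.1/tcp/5001/http")
task_hash = "Qm..."    # placeholder; the real hash comes with the task
client.get(task_hash)  # writes the content into the current directory
```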
S174: distribute the code corresponding to the task and the dataset to the child nodes in the compute cluster.
In this embodiment, the compute cluster's master node distributes them to each child node of the cluster.
S175: start the Docker containers required for the child nodes in the compute cluster to run.
S176: run the code corresponding to the task in the Docker containers.
Docker is an open-source application container engine. Developers can package their applications and dependencies into a portable container and publish it to any popular Linux machine, and virtualization can also be achieved. Containers use a full sandbox mechanism with no interfaces between one another. Here, the Docker container is used as the base environment for running the TensorFlow distributed code. Calling Docker containers to execute the code ensures, through Docker's virtualization technology, that the environment in which the code runs is identical on every distributed node, while also achieving hardware isolation between the GPUs within the same node machine and preventing GPU computations from interfering with each other; Docker technology also makes dynamic expansion of the system's compute cluster easier.
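As a sketch of this step with the docker Python SDK, each child-node container can be pinned to a single GPU through the 2018-era NVIDIA container runtime; the image name, command and paths below are assumptions, not from the patent.

```python
# Hypothetical launch of one container per GPU with the docker SDK;
# image, command and volume paths are assumed for illustration.
import docker

client = docker.from_env()


def run_task_container(gpu_id, task_code_dir):
    return client.containers.run(
        image="tensorflow/tensorflow:1.10.0-gpu",  # assumed base image
        command="python /task/train.py",
        detach=True,
        runtime="nvidia",  # 2018-era GPU runtime
        environment={"NVIDIA_VISIBLE_DEVICES": str(gpu_id)},  # GPU isolation
        volumes={task_code_dir: {"bind": "/task", "mode": "ro"}},
    )


containers = [run_task_container(i, "/data/task_abc") for i in range(4)]
```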
S180: judge whether the code corresponding to the task has finished executing.
S190: if so, obtain the execution result.
S200: feed back the execution result.
S210: release the code corresponding to the task in the Docker container.
If not, return to executing the code corresponding to the task in the Docker container according to the distribution result.
Once the processing server has finished executing the task code, i.e. once the computation is complete, it terminates the running of each Docker container and releases the computing resources in each container, that is, the code corresponding to the task. The execution result of the completed computation is then sent to the user via the proxy server, IPFS and the management server.
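A minimal cleanup sketch follows, assuming `containers` is the list returned by the launch sketch above.

```python
# Wait for each task container to finish, collect its output as the
# execution result, and remove it to release the computing resources.
def release_containers(containers):
    results = []
    for container in containers:
        container.wait()                  # block until the task code exits
        results.append(container.logs())  # result, fed back via the proxy
        container.remove()                # frees the container's resources
    return results
```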
To give an example: during payment, the user first follows the provided sample code and uses the interface to encapsulate the model code into a model object, and selects the number of GPUs to participate in the computation. The code training load is estimated from the size of the model object's parameter count, the number of training steps and the number of GPUs to determine the fee, and the user is asked for final confirmation; if the confirmation times out, a reminder is sent to the user terminal. After the user confirms the task definition, the system issues a payment request to the user; if the payment times out, a reminder is likewise sent to the user terminal. After the user confirms the payment, the EOS node performs the payment transfer; once the system's confirmation node returns the payment-completion information, the upload of the dataset to IPFS begins. After the upload, a storage HASH value is returned, and the management server sends the dataset, the HASH value and the model object code for task distribution. On receiving the task-start order, the proxy server performs task scheduling, chooses an idle compute cluster according to the waiting list, and sends it the task-related HASH value, the model object code and the number of GPU (Graphics Processing Unit) nodes chosen by the user, notifying it to start the computation task. After receiving the assignment instruction, the compute cluster's master node uses the HASH value to download the related task code and dataset from the IPFS server and distributes them to each child node; the Docker containers required for the computation are then started, the acquired task code runs inside them, and the distributed training of the model object begins. When the training computation is complete, the running of each Docker container is terminated and the computing resources of each container are released, after which the computation result of the completed training is sent to the user through IPFS, the proxy server and the management server.
The above example combines EOS blockchain technology to carry out the payment and settlement of fees and to implement a user incentive strategy, and uses the IPFS distributed storage system for the storage and forwarding of files.
The above deep learning distributed computing method, by adding a further functional layer on top of TensorFlow's distributed computing framework, invoking Docker containers and running the task code inside them, unifies the running environment, isolates the graphics processors from one another, enables dynamic expansion of the compute cluster, and improves the training speed of deep learning models.
Fig. 5 is a schematic block diagram of a deep learning distributed computing device 300 provided by an embodiment of the present invention. As shown in Fig. 5, corresponding to the above deep learning distributed computing method, the present invention also provides a deep learning distributed computing device 300. The device comprises units for executing the above method and can be configured in a server.
Specifically, referring to Fig. 5, the deep learning distributed computing device 300 includes:
an encapsulation unit 301 for encapsulating a model to form a model object;
an amount determination unit 302 for determining the computation amount;
a request unit 303 for acquiring a request;
a hash value acquisition unit 305 for obtaining a hash value according to the request;
a dispatching unit 306 for distributing tasks according to the hash value and the model object;
a code execution unit 307 for executing the code corresponding to a task in a Docker container according to the distribution result;
an execution judging unit 308 for judging whether the code corresponding to the task has finished executing;
a releasing unit 311 for, if so, releasing the code corresponding to the task in the application container.
In one embodiment, as shown in Fig. 6, the amount determination unit 302 includes:
a number acquisition subunit 3021 for obtaining the number of graphics processors participating in the computation;
a training load acquisition subunit 3022 for obtaining the code training load according to the parameter count of the model object, the number of training steps and the number of graphics processors;
a determination subunit 3023 for determining the computation amount according to the code training load.
In one embodiment, the above device further includes:
a dataset acquisition unit 304 for uploading the dataset of the successfully answered request.
In one embodiment, as shown in Fig. 7, the above code execution unit 307 includes:
a cluster acquisition subunit 3071 for obtaining an idle compute cluster;
a transmission subunit 3072 for sending the task-related hash value, the model object and the number of graphics processors for the task to the compute cluster;
a download subunit 3073 for downloading the code corresponding to the task and the dataset of the successfully answered request according to the hash value;
a distribution subunit 3074 for distributing the task code and the dataset to the child nodes in the compute cluster;
a starting subunit 3075 for starting the Docker containers required for the child nodes in the compute cluster to run;
an execution subunit 3076 for running the code corresponding to the task in the Docker containers.
In one embodiment, the above device further includes:
a result acquisition unit 309 for obtaining the execution result;
a feedback unit 310 for feeding back the execution result.
It should be noted that, as is clear to those skilled in the art, for the specific implementation of the above deep learning distributed computing device 300 and its units, reference may be made to the corresponding descriptions in the foregoing method embodiment; for convenience and brevity of description, details are not repeated here.
The above deep learning distributed computing device 300 may be implemented in the form of a computer program, which can be run on a computer equipment as shown in Fig. 8.
Referring to Fig. 8, Fig. 8 is a schematic block diagram of a computer equipment provided by an embodiment of the present application. The computer equipment 500 is a server; the server may be an independent server or a server cluster composed of multiple servers.
Referring to Fig. 8, the computer equipment 500 includes a processor 502, a memory and a network interface 505 connected through a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.
The non-volatile storage medium 503 can store an operating system 5031 and a computer program 5032. The computer program 5032 includes program instructions which, when executed, can cause the processor 502 to perform a deep learning distributed computing method.
The processor 502 provides computing and control capabilities to support the operation of the entire computer equipment 500.
The internal memory 504 provides an environment for running the computer program 5032 in the non-volatile storage medium 503; when the computer program 5032 is executed by the processor 502, the processor 502 can be caused to perform a deep learning distributed computing method.
The network interface 505 is used for network communication with other equipment. Those skilled in the art will understand that the structure shown in Fig. 8 is only a block diagram of the part of the structure relevant to the solution of the present application and does not limit the computer equipment 500 to which the solution is applied; a specific computer equipment 500 may include more or fewer components than shown, may combine certain components, or may have a different arrangement of components.
The processor 502 is configured to run the computer program 5032 stored in the memory to implement the following steps:
encapsulating a model to form a model object;
determining the computation amount;
acquiring a request;
obtaining a hash value according to the request;
distributing tasks according to the hash value and the model object;
executing the code corresponding to a task in a Docker container according to the distribution result;
judging whether the code corresponding to the task has finished executing;
if so, releasing the code corresponding to the task in the Docker container.
In one embodiment, when implementing the step of determining the computation amount, the processor 502 specifically implements the following steps:
obtaining the number of graphics processors participating in the computation;
obtaining the code training load according to the parameter count of the model object, the number of training steps and the number of graphics processors;
determining the computation amount according to the code training load.
In one embodiment, before implementing the step of obtaining the hash value according to the request, the processor 502 also implements the following step:
uploading the dataset of the successfully answered request.
In one embodiment, when implementing the step of executing the code corresponding to the task in a Docker container according to the distribution result, the processor 502 specifically implements the following steps:
obtaining an idle compute cluster;
sending the task-related hash value, the model object and the number of graphics processors for the task to the compute cluster;
downloading the code corresponding to the task and the dataset of the successfully answered request according to the hash value;
distributing the task code and the dataset to the child nodes in the compute cluster;
starting the Docker containers required for the child nodes in the compute cluster to run;
running the code corresponding to the task in the Docker containers.
In one embodiment, before implementing the step of releasing the code in the application container, the processor 502 also implements the following steps:
obtaining the execution result;
feeding back the execution result.
In one embodiment, after implementing the step of judging whether the code corresponding to the task has finished executing, the processor 502 also implements the following step:
if not, returning to executing the code corresponding to the task in the Docker container according to the distribution result.
It should be understood that in the embodiments of the present application, the processor 502 may be a central processing unit (CPU), or it may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor, etc.
Those of ordinary skill in the art will appreciate that all or part of the processes in the methods of the above embodiments may be completed by instructing the relevant hardware through a computer program. The computer program includes program instructions and may be stored in a storage medium, which is a computer-readable storage medium. The program instructions are executed by at least one processor in the computer system to implement the process steps of the above method embodiments.
Therefore, the present invention also provides a storage medium. The storage medium may be a computer-readable storage medium. The storage medium stores a computer program which, when executed by a processor, causes the processor to perform the following steps:
encapsulating a model to form a model object;
determining the computation amount;
acquiring a request;
obtaining a hash value according to the request;
distributing tasks according to the hash value and the model object;
executing the code corresponding to a task in a Docker container according to the distribution result;
judging whether the code corresponding to the task has finished executing;
if so, releasing the code corresponding to the task in the Docker container.
In one embodiment, when executing the computer program to implement the step of determining the computation amount, the processor specifically implements the following steps:
obtaining the number of graphics processors participating in the computation;
obtaining the code training load according to the parameter count of the model object, the number of training steps and the number of graphics processors;
determining the computation amount according to the code training load.
In one embodiment, before executing the computer program to implement the step of obtaining the hash value according to the request, the processor also implements the following step:
uploading the dataset of the successfully answered request.
In one embodiment, when executing the computer program to implement the step of executing the code corresponding to the task in a Docker container according to the distribution result, the processor specifically implements the following steps:
obtaining an idle compute cluster;
sending the task-related hash value, the model object and the number of graphics processors for the task to the compute cluster;
downloading the code corresponding to the task and the dataset of the successfully answered request according to the hash value;
distributing the task code and the dataset to the child nodes in the compute cluster;
starting the Docker containers required for the child nodes in the compute cluster to run;
running the code corresponding to the task in the Docker containers.
In one embodiment, before executing the computer program to implement the step of releasing the code in the application container, the processor also implements the following steps:
obtaining the execution result;
feeding back the execution result.
In one embodiment, after executing the computer program to implement the step of judging whether the code corresponding to the task has finished executing, the processor also implements the following step:
if not, returning to executing the code corresponding to the task in the Docker container according to the distribution result.
The storage medium may be a USB flash disk, a mobile hard disk, a read-only memory (ROM), a magnetic disk, an optical disk, or any other computer-readable storage medium that can store program code.
Those of ordinary skill in the art may realize that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are implemented in hardware or in software depends on the specific application and the design constraints of the technical solution. Skilled persons may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of the present invention.
In the several embodiments provided by the present invention, it should be understood that the disclosed device and method may be implemented in other ways. For example, the device embodiments described above are merely illustrative: the division into units is only a division by logical function, and there may be other ways of dividing in actual implementation, such as combining multiple units or components or integrating them into another system, or ignoring or not executing some features.
The steps in the embodiments of the present invention may be adjusted in order, merged and deleted according to actual needs, and the units in the devices of the embodiments of the present invention may be combined, divided and deleted according to actual needs. In addition, the functional units in each embodiment of the present invention may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer equipment (which may be a personal computer, a terminal, a network device, etc.) to perform all or part of the steps of the methods described in the various embodiments of the present invention.
The above is only a specific embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any person familiar with the art can readily conceive of various equivalent modifications or replacements within the technical scope disclosed by the present invention, and these modifications or replacements shall all be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (10)
1. A deep learning distributed computing method, characterized by comprising:
encapsulating a model to form a model object;
determining the computation amount;
acquiring a request;
obtaining a hash value according to the request;
distributing tasks according to the hash value and the model object;
executing the code corresponding to a task in a Docker container according to the distribution result;
judging whether the task code has finished executing;
if so, releasing the code in the Docker container.
2. The deep learning distributed computing method according to claim 1, characterized in that determining the computation amount comprises:
obtaining the number of graphics processors participating in the computation;
obtaining the code training load according to the parameter count of the model object, the number of training steps and the number of graphics processors;
determining the computation amount according to the code training load.
3. The deep learning distributed computing method according to claim 2, characterized in that, before obtaining the hash value according to the request, the method further comprises:
uploading the dataset of the successfully answered request.
4. The deep learning distributed computing method according to claim 3, characterized in that executing the code corresponding to the task in a Docker container according to the distribution result further comprises:
obtaining an idle compute cluster;
sending the task-related hash value, the model object and the number of graphics processors for the task to the compute cluster;
downloading the code corresponding to the task and the dataset of the successfully answered request according to the hash value;
distributing the task code and the dataset to the child nodes in the compute cluster;
starting the Docker containers required for the child nodes in the compute cluster to run;
running the code corresponding to the task in the Docker containers.
5. The deep learning distributed computing method according to any one of claims 1 to 4, characterized in that, before releasing the code in the application container, the method further comprises:
obtaining the execution result;
feeding back the execution result.
6. A deep learning distributed computing device, characterized by comprising:
an encapsulation unit for encapsulating a model to form a model object;
an amount determination unit for determining the computation amount;
a request unit for acquiring a request;
a hash value acquisition unit for obtaining a hash value according to the request;
a dispatching unit for distributing tasks according to the hash value and the model object;
a code execution unit for executing the code corresponding to a task in a Docker container according to the distribution result;
an execution judging unit for judging whether the task code has finished executing;
a releasing unit for, if so, releasing the code in the application container.
7. The deep learning distributed computing device according to claim 6, characterized in that the amount determination unit includes:
a number acquisition subunit for obtaining the number of graphics processors participating in the computation;
a training load acquisition subunit for obtaining the code training load according to the parameter count of the model object, the number of training steps and the number of graphics processors;
a determination subunit for determining the computation amount according to the code training load.
8. The deep learning distributed computing device according to claim 6, characterized in that the device further includes:
a dataset acquisition unit for uploading the dataset of the successfully answered request.
9. A computer equipment, characterized in that the computer equipment comprises a memory and a processor, the memory storing a computer program; the processor implements the method according to any one of claims 1 to 5 when executing the computer program.
10. A storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, can implement the method according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN201811080562.6A | 2018-09-17 | 2018-09-17 | Deep learning distributed arithmetic method, apparatus, computer equipment and storage medium
Publications (1)
Publication Number | Publication Date |
---|---|
CN109358944A true CN109358944A (en) | 2019-02-19 |
Family
ID=65350884
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811080562.6A Pending CN109358944A (en) | 2018-09-17 | 2018-09-17 | Deep learning distributed arithmetic method, apparatus, computer equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109358944A (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102629219A (en) * | 2012-02-27 | 2012-08-08 | 北京大学 | Self-adaptive load balancing method for Reduce ends in parallel computing framework |
CN103473121A (en) * | 2013-08-20 | 2013-12-25 | 西安电子科技大学 | Mass image parallel processing method based on cloud computing platform |
WO2016010830A1 (en) * | 2014-07-12 | 2016-01-21 | Microsoft Technology Licensing, Llc | Composing and executing workflows made up of functional pluggable building blocks |
CN107343000A (en) * | 2017-07-04 | 2017-11-10 | 北京百度网讯科技有限公司 | Method and apparatus for handling task |
CN107450961A (en) * | 2017-09-22 | 2017-12-08 | 济南浚达信息技术有限公司 | A kind of distributed deep learning system and its building method, method of work based on Docker containers |
CN108280207A (en) * | 2018-01-30 | 2018-07-13 | 深圳市茁壮网络股份有限公司 | A method of the perfect Hash of construction |
Cited By (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112114931B (en) * | 2019-06-21 | 2023-12-26 | 富联精密电子(天津)有限公司 | Deep learning program configuration method and device, electronic equipment and storage medium |
CN112114931A (en) * | 2019-06-21 | 2020-12-22 | 鸿富锦精密电子(天津)有限公司 | Deep learning program configuration method and device, electronic equipment and storage medium |
CN110414687A (en) * | 2019-07-12 | 2019-11-05 | 苏州浪潮智能科技有限公司 | A kind of method and apparatus for the training of deep learning frame distribution |
CN110516090A (en) * | 2019-08-09 | 2019-11-29 | 广东浪潮大数据研究有限公司 | A kind of object detecting method, device, equipment and computer readable storage medium |
CN112651411A (en) * | 2019-10-10 | 2021-04-13 | 中国人民解放军国防科技大学 | Gradient quantization method and system for distributed deep learning |
CN112651411B (en) * | 2019-10-10 | 2022-06-07 | 中国人民解放军国防科技大学 | Gradient quantization method and system for distributed deep learning |
CN110795141A (en) * | 2019-10-12 | 2020-02-14 | 广东浪潮大数据研究有限公司 | Training task submitting method, device, equipment and medium |
CN110795141B (en) * | 2019-10-12 | 2023-10-10 | 广东浪潮大数据研究有限公司 | Training task submitting method, device, equipment and medium |
CN110688230A (en) * | 2019-10-17 | 2020-01-14 | 广州文远知行科技有限公司 | Synchronous training method and device, computer equipment and storage medium |
CN110688230B (en) * | 2019-10-17 | 2022-06-24 | 广州文远知行科技有限公司 | Synchronous training method and device, computer equipment and storage medium |
CN110839023A (en) * | 2019-11-05 | 2020-02-25 | 北京中电普华信息技术有限公司 | Electric power marketing multi-channel customer service system |
CN110839023B (en) * | 2019-11-05 | 2022-03-25 | 北京中电普华信息技术有限公司 | Electric power marketing multi-channel customer service system |
CN110866167A (en) * | 2019-11-14 | 2020-03-06 | 北京知道创宇信息技术股份有限公司 | Task allocation method, device, server and storage medium |
CN110866167B (en) * | 2019-11-14 | 2022-09-20 | 北京知道创宇信息技术股份有限公司 | Task allocation method, device, server and storage medium |
CN111200606A (en) * | 2019-12-31 | 2020-05-26 | 深圳市优必选科技股份有限公司 | Deep learning model task processing method, system, server and storage medium |
CN112097368A (en) * | 2020-08-21 | 2020-12-18 | 深圳市建滔科技有限公司 | Air conditioner power consumption adjusting method and device based on cloud deep learning |
CN112035220A (en) * | 2020-09-30 | 2020-12-04 | 北京百度网讯科技有限公司 | Processing method, device and equipment for operation task of development machine and storage medium |
CN112596863A (en) * | 2020-12-28 | 2021-04-02 | 南方电网深圳数字电网研究院有限公司 | Method, system and computer storage medium for monitoring training tasks |
CN113645282A (en) * | 2021-07-29 | 2021-11-12 | 上海熠知电子科技有限公司 | Deep learning method based on server cluster |
CN113672215A (en) * | 2021-07-30 | 2021-11-19 | 阿里巴巴新加坡控股有限公司 | Deep learning distributed training adaptation method and device |
CN113672215B (en) * | 2021-07-30 | 2023-10-24 | 阿里巴巴新加坡控股有限公司 | Deep learning distributed training adaptation method and device |
WO2023123828A1 (en) * | 2021-12-31 | 2023-07-06 | 上海商汤智能科技有限公司 | Model processing method and apparatus, electronic device, computer storage medium, and program |
CN114756464A (en) * | 2022-04-18 | 2022-07-15 | 中国电信股份有限公司 | Code checking configuration method, device and storage medium |
CN114756464B (en) * | 2022-04-18 | 2024-04-26 | 中国电信股份有限公司 | Code checking configuration method, device and storage medium |
CN116483482A (en) * | 2023-05-19 | 2023-07-25 | 北京百度网讯科技有限公司 | Deep learning task processing method, system, device, equipment and medium |
CN116483482B (en) * | 2023-05-19 | 2024-03-01 | 北京百度网讯科技有限公司 | Deep learning task processing method, system, device, equipment and medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109358944A (en) | Deep learning distributed arithmetic method, apparatus, computer equipment and storage medium | |
EP3736692A1 (en) | Using computational cost and instantaneous load analysis for intelligent deployment of neural networks on multiple hardware executors | |
CN108369534A (en) | Code executes request routing | |
CN110780914B (en) | Service publishing method and device | |
CN105242956B (en) | Virtual functions service chaining deployment system and its dispositions method | |
Bibani et al. | A demo of IoT healthcare application provisioning in hybrid cloud/fog environment | |
CN109104336A (en) | Service request processing method, device, computer equipment and storage medium | |
CN109471710A (en) | Processing method, device, processor, terminal and the server of task requests | |
CN111310936A (en) | Machine learning training construction method, platform, device, equipment and storage medium | |
US11791050B2 (en) | 3D environment risks identification utilizing reinforced learning | |
CN112015536B (en) | Kubernetes cluster container group scheduling method, device and medium | |
CN110033091B (en) | Model-based prediction method and device | |
CN109067890A (en) | A kind of CDN node edge calculations system based on docker container | |
US20140380196A1 (en) | Creation and Prioritization of Multiple Virtual Universe Teleports in Response to an Event | |
CN108833161A (en) | A method of establishing the intelligent contract micro services model calculated based on mist | |
CN109074283A (en) | The M2M service layer based on pond is established by NFV | |
CN113051053A (en) | Heterogeneous resource scheduling method, device, equipment and computer readable storage medium | |
CN105144109A (en) | Distributed data center technology | |
Sharma et al. | Ant colony based optimization model for QoS-Based task scheduling in cloud computing environment | |
EP4127925A1 (en) | Orchestration of virtualization technology and application implementation | |
CN109189400A (en) | Program dissemination method and device, storage medium, processor | |
KR20200125890A (en) | Cloud-based transaction system and method capable of providing neural network training model in supervised state | |
CN109248440A (en) | A kind of method and system for realizing the real-time dynamically load configuration of game | |
Kemp | Programming frameworks for distributed smartphone computing | |
CN108985459A (en) | The method and apparatus of training pattern |
Legal Events
Date | Code | Title | Description
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20190219 |