CN109358944A - Deep learning distributed computing method, apparatus, computer equipment and storage medium - Google Patents

Deep learning distributed computing method, apparatus, computer equipment and storage medium

Info

Publication number
CN109358944A
CN109358944A · Application CN201811080562.6A
Authority
CN
China
Prior art keywords
task
code
deep learning
docker container
cryptographic hash
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811080562.6A
Other languages
Chinese (zh)
Inventor
蒋健
兰毅
尹恒
邱杰
张宜浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shensuan Technology Chongqing Co ltd
Original Assignee
Shensuan Technology Chongqing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shensuan Technology Chongqing Co ltd
Priority to CN201811080562.6A
Publication of CN109358944A
Legal status: Pending (current)


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45504Abstract machines for programme code execution, e.g. Java virtual machine [JVM], interpreters, emulators
    • G06F9/45516Runtime code conversion or optimisation
    • G06F9/45525Optimisation or modification within the same instruction set architecture, e.g. HP Dynamo

Abstract

The present invention relates to a deep learning distributed computing method, apparatus, computer equipment and storage medium. The method includes: encapsulating a model to form a model object; determining a computation amount; acquiring a request; obtaining a hash value according to the request; distributing tasks according to the hash value and the model object; executing the code corresponding to a task in a Docker container according to the distribution result; judging whether the code corresponding to the task has finished executing; and if so, releasing the code corresponding to the task in the Docker container. By performing further functional encapsulation on top of the TensorFlow distributed computing framework, invoking Docker containers, and running the code corresponding to each task inside a Docker container, the present invention unifies the running environment while isolating the graphics processors; invoking Docker containers also allows dynamic expansion of the computing cluster, which improves the training speed of the deep learning model.

Description

Deep learning distributed computing method, apparatus, computer equipment and storage medium
Technical field
The present invention relates to distributed computing methods, and more specifically to a deep learning distributed computing method, apparatus, computer equipment and storage medium.
Background technique
In recent years, deep learning and distributed computing have been among the most actively studied topics in machine learning, and they are now widely used in the research and development of artificial-intelligence applications.
Distributed computing is a calculation method and is the counterpart of centralized computing. With the development of computing technology, some applications require enormous computing power to complete; if they were computed centrally, they would take a very long time. Distributed computing decomposes such an application into many small parts and distributes them to multiple computers for processing, which saves overall computation time and greatly improves computational efficiency.
The concept of deep learning originates from research on artificial neural networks; a multilayer perceptron with several hidden layers is one kind of deep learning structure. Deep learning combines low-level features to form more abstract high-level representations of attribute categories or features, in order to discover distributed feature representations of data.
Current distributed computing can only be carried out on fixed computing clusters and cannot dynamically expand the cluster, which limits the applicability of distributed computing; moreover, none of the existing methods isolates the running environment of distributed deep learning computation from the GPU hardware so as to achieve dynamic expansion of the computing cluster, so the training speed of deep learning models remains low.
It is therefore necessary to design a new method that isolates the running environment from the graphics processors, enables dynamic expansion of the computing cluster, and improves the training speed of deep learning models.
Summary of the invention
The object of the present invention is to overcome the deficiencies of the prior art and to provide a deep learning distributed computing method, apparatus, computer equipment and storage medium.
To achieve the above object, the present invention adopts the following technical scheme: a deep learning distributed computing method, comprising:
Encapsulating a model to form a model object;
Determining a computation amount;
Acquiring a request;
Obtaining a hash value according to the request;
Distributing tasks according to the hash value and the model object;
Executing, in a Docker container, the code corresponding to a task according to the distribution result;
Judging whether the code corresponding to the task has finished executing;
If so, releasing the code corresponding to the task in the Docker container.
In a further technical solution, the determining a computation amount comprises:
Acquiring the number of graphics processors participating in the computation;
Obtaining a code training load according to the parameter amount of the model object, the number of training steps and the number of graphics processors;
Determining the computation amount according to the code training load.
In a further technical solution, before the obtaining a hash value according to the request, the method further comprises:
Uploading the data set of the successfully responded request.
In a further technical solution, the executing, in a Docker container, the code corresponding to a task according to the distribution result further comprises:
Acquiring an idle computing cluster;
Sending the task-related hash value, the model object and the number of graphics processors corresponding to the computing cluster's task to the computing cluster;
Downloading, according to the hash value, the code corresponding to the task and the data set of the successfully responded request;
Distributing the task code and the data set to the child nodes in the computing cluster;
Starting the Docker containers needed by the child nodes in the computing cluster for the computation;
Running the code corresponding to the task in the Docker containers.
In a further technical solution, before the releasing the code in the application container, the method further comprises:
Acquiring an execution result;
Feeding back the execution result.
The present invention also provides a deep learning distributed computing apparatus, comprising:
An encapsulation unit, configured to encapsulate a model to form a model object;
An amount determination unit, configured to determine a computation amount;
A request unit, configured to acquire a request;
A hash value acquiring unit, configured to obtain a hash value according to the request;
A dispatching unit, configured to distribute tasks according to the hash value and the model object;
A code execution unit, configured to execute, in a Docker container, the code corresponding to a task according to the distribution result;
An execution judging unit, configured to judge whether the code corresponding to the task has finished executing;
A releasing unit, configured to, if so, release the code corresponding to the task in the application container.
In a further technical solution, the amount determination unit comprises:
A number acquiring subunit, configured to acquire the number of graphics processors participating in the computation;
A training load acquiring subunit, configured to obtain a code training load according to the parameter amount of the model object, the number of training steps and the number of graphics processors;
A determination subunit, configured to determine the computation amount according to the code training load.
In a further technical solution, the apparatus further comprises:
A data set acquiring unit, configured to upload the data set of the successfully responded request.
The present invention also provides a computer equipment, which includes a memory and a processor, wherein a computer program is stored on the memory and the processor implements the above method when executing the computer program.
The present invention also provides a storage medium storing a computer program which, when executed by a processor, implements the above method.
Compared with the prior art, the present invention has the following advantages: by performing further functional encapsulation on top of the TensorFlow distributed computing framework, invoking Docker containers, and running the code corresponding to each task inside a Docker container, the running environment is unified and isolated from the graphics processors; invoking Docker containers also allows dynamic expansion of the computing cluster, which improves the training speed of the deep learning model.
The invention will be further described in the following with reference to the drawings and specific embodiments.
Description of the drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a schematic diagram of an application scenario of the deep learning distributed computing method provided by an embodiment of the present invention;
Fig. 2 is a schematic flowchart of the deep learning distributed computing method provided by an embodiment of the present invention;
Fig. 3 is a schematic sub-flowchart of the deep learning distributed computing method provided by an embodiment of the present invention;
Fig. 4 is a schematic sub-flowchart of the deep learning distributed computing method provided by an embodiment of the present invention;
Fig. 5 is a schematic block diagram of the deep learning distributed computing apparatus provided by an embodiment of the present invention;
Fig. 6 is a schematic block diagram of the amount determination unit of the deep learning distributed computing apparatus provided by an embodiment of the present invention;
Fig. 7 is a schematic block diagram of the code execution unit of the deep learning distributed computing apparatus provided by an embodiment of the present invention;
Fig. 8 is a schematic block diagram of the computer equipment provided by an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
It should be understood that, when used in this specification and the appended claims, the terms "include" and "comprise" indicate the presence of the described features, wholes, steps, operations, elements and/or components, but do not exclude the presence or addition of one or more other features, wholes, steps, operations, elements, components and/or sets thereof.
It should also be understood that the terminology used in this description of the invention is for the purpose of describing particular embodiments only and is not intended to limit the present invention. As used in the description of the invention and the appended claims, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" used in the description of the invention and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
Please refer to Fig. 1 and Fig. 2. Fig. 1 is a schematic diagram of an application scenario of the deep learning distributed computing method provided by an embodiment of the present invention, and Fig. 2 is a schematic flowchart of the method. The deep learning distributed computing method is applied to a management server, a proxy server and a processing server, each of which may be a server in a distributed service platform. The management server exchanges data with a user terminal: the user enters the computation amount through a computing APP on the user terminal, whereupon the management server invokes the proxy server, the proxy server invokes the processing server to perform the distributed computation according to the instructions of the management server, the processing server feeds the computation result back to the proxy server, and the result is returned to the user terminal via the management server.
It should be noted that only one management server is illustrated in Fig. 2; in actual operation, multiple management servers may perform computation simultaneously.
Fig. 2 is a schematic flowchart of the deep learning distributed computing method provided by an embodiment of the present invention. As shown in Fig. 2, the method includes the following steps S110 to S210.
S110, encapsulate a model to form a model object.
In this embodiment, a model object refers to a model that meets the interface requirements, uses TensorFlow as its kernel, and carries out deep learning; the model referred to above is a TensorFlow model.
The user first encapsulates the model code as a model object through the interface of the management server, following the simple example code provided by the management server. A TensorFlow (a second-generation artificial intelligence learning system) model is a model that passes complex data structures into an artificial neural network for analysis and processing. For the encapsulation process, the model is generally taken as the kernel and its corresponding code is packaged in the form of an interface, forming a model object that can be called; this avoids the wasteful repetition of redefining the training code at the start of every deep learning development effort and improves efficiency in commercial use.
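The patent itself does not reproduce the wrapper code; the following is a minimal Python sketch of what such interface-level encapsulation could look like. The class name ModelObject, its fields and the helper wrap_model are illustrative assumptions, not the applicant's actual interface:

    # Hypothetical sketch only: wrap user model code behind a uniform, callable interface.
    # Names and fields are illustrative assumptions, not the patented implementation.
    class ModelObject:
        def __init__(self, build_fn, param_count, train_steps):
            self.build_fn = build_fn        # callable that builds the TensorFlow model/graph
            self.param_count = param_count  # number of trainable parameters
            self.train_steps = train_steps  # number of training steps requested

        def build(self):
            """Construct the underlying model when a cluster node needs it."""
            return self.build_fn()

    def wrap_model(build_fn, param_count, train_steps):
        """Package user model code into a reusable model object."""
        return ModelObject(build_fn, param_count, train_steps)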
TensorFlow is currently the most widely used deep learning computation framework and can be used for machine learning across a variety of perception and language-understanding tasks. With the TensorFlow framework it is convenient to study, build, train and deploy all kinds of deep-learning-related models, and because the gRPC remote-call capability is integrated at the TensorFlow core, distributed training of models can easily be implemented on this basis. A distributed computing system built on TensorFlow therefore has the advantages of good operational stability, a wide user base, wide applicability and convenient development.
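As a rough illustration of the gRPC-based distribution that TensorFlow provides out of the box, a TensorFlow 1.x-style cluster can be declared as below; the host addresses are placeholders, and this is a sketch of the underlying mechanism rather than the patented wrapper:

    # TensorFlow 1.x-style sketch of the built-in gRPC distributed mechanism.
    # Addresses are placeholders; every node runs this with its own job name and index.
    import tensorflow as tf

    cluster = tf.train.ClusterSpec({
        "ps":     ["ps0.example.com:2222"],
        "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
    })
    server = tf.train.Server(cluster, job_name="worker", task_index=0)

    # A parameter-server process would block serving variable requests:
    #   server.join()
    # A worker process would build its graph, pinning variables onto the ps, e.g.:
    #   with tf.device("/job:ps/task:0"):
    #       weights = tf.Variable(...)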
S120, determine the computation amount.
In this embodiment, the computation amount refers to the amount required to carry out the distributed computation, for example a payment amount or the amount of tasks to be completed.
In one embodiment, as shown in Fig. 3, the above step S120 may include steps S121 to S123.
S121, acquire the number of graphics processors participating in the computation.
The user selects the number of graphics processors participating in the computation through the computing APP on the user terminal. If the number of graphics processors currently available is smaller than the number selected by the user, the management server remains in a waiting state until enough graphics processors have finished their work to make up the selected number; only then can the computation amount be determined.
S122, obtain the code training load according to the parameter amount of the model object, the number of training steps and the number of graphics processors.
In this embodiment, the parameter amount refers to the number of neural-network parameters, that is, the parameters of the deep learning model; the number of training steps refers to the number of steps of deep learning; and the code training load refers to the amount of distributed deep learning computation that needs to be carried out.
The parameter amount of the model object, the number of training steps and the number of graphics processors carry different weights. Therefore a weight is first assigned to each of them, and when the code training load is computed, a weighted sum is taken and used as the code training load.
S123, determine the computation amount according to the code training load.
In this embodiment, there is a certain conversion relationship between the code training load and the computation amount, for example a linear relationship or a one-to-one mapping, and the code training load can be converted into the computation amount according to this relationship.
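A small numerical sketch of the weighted-sum estimate and the linear conversion described above is given below; the weight values and the conversion rate are made-up assumptions, since the patent does not specify them:

    # Illustrative only: the weights and the conversion rate are assumed values.
    def estimate_training_load(param_count, train_steps, gpu_count,
                               w_params=0.5, w_steps=0.3, w_gpus=0.2):
        """Weighted sum of parameter amount, training steps and GPU count."""
        return w_params * param_count + w_steps * train_steps + w_gpus * gpu_count

    def load_to_amount(training_load, rate=1e-6):
        """Convert the code training load into a computation amount via a linear mapping."""
        return training_load * rate

    load = estimate_training_load(param_count=5_000_000, train_steps=100_000, gpu_count=4)
    amount = load_to_amount(load)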
S130, acquire a request.
After the computation amount has been determined, the user settles or claims the computation amount through the computing APP on the user terminal, and this serves as the request.
S140, upload the data set of the successfully responded request;
S150, obtain a hash value according to the request.
After the request has been responded to successfully, for example after payment succeeds or the task is claimed successfully, an EOS (Embedded Operation System) node in the management server performs the response operation for the request, such as a payment-transfer operation. After the confirmation node returns the payment-completion information, the data set of the successfully responded request starts to be uploaded to IPFS (InterPlanetary File System); once the upload finishes, a storage hash value is returned.
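For illustration, uploading the data set to IPFS and receiving its storage hash might look like the sketch below, assuming a locally running IPFS daemon and the third-party ipfshttpclient package; the file name is a placeholder:

    # Sketch only: assumes a local IPFS daemon on its default API port and ipfshttpclient installed.
    import ipfshttpclient

    client = ipfshttpclient.connect()        # connect to the default local daemon
    result = client.add("dataset.tar.gz")    # e.g. {'Name': ..., 'Hash': ..., 'Size': ...}
    dataset_hash = result["Hash"]            # the storage hash passed along with the task
    print("data set stored under", dataset_hash)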
S160, distribute tasks according to the hash value and the model object.
In this embodiment, the management server feeds the hash value and the model object back to the proxy server, and the proxy server distributes tasks according to the hash value and the model object. Specifically, different pairings of hash value and model object correspond to different computing tasks, so the computing task formed by a one-to-one pairing of hash value and model object needs to be issued to a designated processing server; in addition, the management server also sends the proxy server an order to start the computing task.
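Pairing a hash value with a model object into one task and handing it to the designated processing server could be sketched as follows; the field names and the HTTP endpoint are hypothetical and not part of the patent:

    # Hypothetical sketch: field names and the endpoint are illustrative assumptions.
    import json
    import urllib.request

    def dispatch_task(processing_server_url, dataset_hash, model_object_code, gpu_count):
        task = {
            "dataset_hash": dataset_hash,       # IPFS storage hash of the data set
            "model_code": model_object_code,    # serialized model-object code
            "gpu_count": gpu_count,             # GPU nodes chosen by the user
            "action": "start",                  # order to start the computing task
        }
        request = urllib.request.Request(
            processing_server_url,
            data=json.dumps(task).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        return urllib.request.urlopen(request)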
S170, execute the code corresponding to the task in a Docker container according to the distribution result.
In this embodiment, the distribution result indicates which task is assigned to the computing cluster of which processing server for processing.
In one embodiment, as shown in Fig. 4, the above step S170 may include steps S171 to S176.
S171, acquire an idle computing cluster.
The computing cluster is integrated in the processing server. Specifically, after receiving the order to start the computing task sent by the management server, the proxy server performs task scheduling: idle processing servers queue up to form a waiting queue, and an idle processing server is chosen according to the waiting queue.
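The waiting-queue scheduling described above can be pictured with a short sketch; the data structure and function names are illustrative assumptions:

    # Illustrative sketch of the waiting queue of idle processing servers.
    from collections import deque

    idle_clusters = deque()   # processing servers enqueue themselves when they become idle

    def register_idle(cluster_id):
        idle_clusters.append(cluster_id)

    def pick_idle_cluster():
        """Return the next idle computing cluster, or None if all are busy."""
        return idle_clusters.popleft() if idle_clusters else None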
S172, send the task-related hash value, the model object and the number of graphics processors corresponding to the computing cluster's task to the computing cluster.
The management server sends the task-related hash value, the code of the model object and the number of graphics-processor nodes chosen by the user to the processing server, and notifies the processing server to start the computing task.
S173, download the code corresponding to the task and the data set of the successfully responded request according to the hash value.
After the processing server receives the notice of the computing task, the master node of the computing cluster uses the hash value to download the related task code and the data set of the successfully responded request from the IPFS (InterPlanetary File System) server.
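The master node's download step could look like the following sketch, again assuming a reachable IPFS daemon and the ipfshttpclient package; the hash strings are placeholders:

    # Sketch only: hashes are placeholders; get() saves into the current working directory.
    import ipfshttpclient

    client = ipfshttpclient.connect()
    client.get("QmTaskCodeHashPlaceholder")   # task code archive uploaded earlier
    client.get("QmDatasetHashPlaceholder")    # data set of the successfully responded request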
S174, distribute the code corresponding to the task and the data set to the child nodes in the computing cluster.
In this embodiment, the master node of the computing cluster distributes them to each child node of the computing cluster.
S175, start the Docker containers needed by the child nodes in the computing cluster for the computation.
S176, run the code corresponding to the task in the Docker containers.
Docker is an open-source application container engine. Developers can package their applications and dependencies into a portable container and then publish it to any popular Linux machine; virtualization can also be achieved. Containers use a complete sandbox mechanism and have no interfaces to one another. Here the Docker container is used as the basic environment for running the TensorFlow distributed code. Calling Docker containers to execute the code and using Docker's virtualization technology ensures that the environment in which the code of every distributed node runs is consistent; at the same time, the GPUs within the same node machine can be isolated from one another at the hardware level so that GPU computations do not interfere with each other, and dynamic expansion of the system's computing cluster can be carried out more conveniently with Docker technology.
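As one way to picture steps S175 and S176, the sketch below starts a container for the task code with a single GPU attached, using the Docker SDK for Python; the image tag, mount paths and GPU id are placeholder assumptions, and an NVIDIA-enabled container runtime on the node is assumed:

    # Sketch only: image tag, paths and GPU id are assumptions; requires the Docker SDK
    # for Python and an NVIDIA-enabled Docker runtime on the node machine.
    import docker

    client = docker.from_env()
    container = client.containers.run(
        "tensorflow/tensorflow:1.15.5-gpu",            # placeholder base image
        "python /workspace/train.py",                  # the downloaded task code
        detach=True,
        volumes={"/srv/task": {"bind": "/workspace", "mode": "rw"}},
        device_requests=[docker.types.DeviceRequest(device_ids=["0"],
                                                    capabilities=[["gpu"]])],
    )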
S180, judge whether the code corresponding to the task has finished executing;
S190, if so, acquire the execution result;
S200, feed back the execution result;
S210, release the code corresponding to the task in the Docker container;
If not, return to the step of executing the code corresponding to the task in a Docker container according to the distribution result.
When the processing server has finished executing the task code, that is, after the computation is completed, it terminates the operation of each Docker container and releases the computing resources occupied in each Docker container, namely the code corresponding to the task; the execution result obtained after the computation is then sent to the user via the proxy server, IPFS and the management server.
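Continuing the Docker SDK sketch above, terminating the containers and releasing their resources once the task code finishes could be as simple as the following; the result-forwarding step is left as a comment standing in for the IPFS/proxy/management-server chain described in the text:

    # Sketch only, continuing the Docker SDK example above.
    def release_containers(containers):
        for c in containers:
            c.stop()      # end the container's execution
            c.remove()    # free the computing resources it held

    # The execution result would then be fed back to the user through IPFS,
    # the proxy server and the management server, as described above.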
As an example, in a payment scenario the user first uses the interface to encapsulate the model code into a model object according to the simple example code provided. The user selects the number of GPUs participating in the computation, the code training load is estimated from the parameter amount of the model object, the number of training steps and the number of GPUs so as to determine the fee, and the user is asked to give final confirmation; if confirmation times out, a reminder is sent to the user terminal. After the user confirms the task content, the system initiates a payment request to the user; if payment times out, a reminder is likewise sent to the user terminal. After the user confirms payment, the EOS node performs the payment-transfer operation, and after the system's confirmation node returns the payment-completion information, uploading of the data set to IPFS begins. When the upload completes, a storage HASH value is returned, and the management server sends the data set, the HASH value and the model object code for task distribution. After receiving the task-start order, the proxy server performs task scheduling, chooses an idle computing cluster according to the waiting queue, sends it the task-related HASH value, the model object code and the number of GPU (Graphics Processing Unit) nodes chosen by the user, and notifies it to start the computing task. After receiving the task instruction, the master node of the computing cluster downloads the related task code and data set from the IPFS server using the HASH value and distributes them to each child node; the Docker containers needed for the computation are then started, the acquired task code runs inside them, and the distributed training of the model object begins. When the computation and training are complete, the operation of each Docker container is terminated and the computing resources of each Docker container are released, and the computation result obtained from training is sent to the user via IPFS, the proxy server and the management server.
The above example combines EOS blockchain technology to carry out the payment and settlement of fees and to implement a user incentive strategy, and uses the IPFS distributed storage system for the storage and forwarding of files.
With the above deep learning distributed computing method, further functional encapsulation is performed on top of the TensorFlow distributed computing framework, Docker containers are invoked, and the code corresponding to each task is run inside a Docker container; the running environment is thereby unified and isolated from the graphics processors, invoking Docker containers allows dynamic expansion of the computing cluster, and the training speed of the deep learning model is improved.
Fig. 5 is a schematic block diagram of the deep learning distributed computing apparatus 300 provided by an embodiment of the present invention. As shown in Fig. 5, corresponding to the above deep learning distributed computing method, the present invention also provides a deep learning distributed computing apparatus 300. The deep learning distributed computing apparatus 300 includes units for executing the above deep learning distributed computing method, and the apparatus can be configured in a server.
Specifically, referring to Fig. 5, the deep learning distributed computing apparatus 300 includes:
An encapsulation unit 301, configured to encapsulate a model to form a model object;
An amount determination unit 302, configured to determine a computation amount;
A request unit 303, configured to acquire a request;
A hash value acquiring unit 305, configured to obtain a hash value according to the request;
A dispatching unit 306, configured to distribute tasks according to the hash value and the model object;
A code execution unit 307, configured to execute, in a Docker container, the code corresponding to a task according to the distribution result;
An execution judging unit 308, configured to judge whether the code corresponding to the task has finished executing;
A releasing unit 311, configured to, if so, release the code corresponding to the task in the application container.
In one embodiment, as shown in Fig. 6, the amount determination unit 302 includes:
A number acquiring subunit 3021, configured to acquire the number of graphics processors participating in the computation;
A training load acquiring subunit 3022, configured to obtain a code training load according to the parameter amount of the model object, the number of training steps and the number of graphics processors;
A determination subunit 3023, configured to determine the computation amount according to the code training load.
In one embodiment, the above apparatus further includes:
A data set acquiring unit 304, configured to upload the data set of the successfully responded request.
In one embodiment, as shown in Fig. 7, the above code execution unit 307 includes:
A cluster acquiring subunit 3071, configured to acquire an idle computing cluster;
A sending subunit 3072, configured to send the task-related hash value, the model object and the number of graphics processors corresponding to the computing cluster's task to the computing cluster;
A downloading subunit 3073, configured to download, according to the hash value, the code corresponding to the task and the data set of the successfully responded request;
A distributing subunit 3074, configured to distribute the task code and the data set to the child nodes in the computing cluster;
A starting subunit 3075, configured to start the Docker containers needed by the child nodes in the computing cluster for the computation;
An executing subunit 3076, configured to run the code corresponding to the task in the Docker containers.
In one embodiment, the above apparatus further includes:
A result acquiring unit 309, configured to acquire the execution result;
A feedback unit 310, configured to feed back the execution result.
It should be noted that, as is apparent to those skilled in the art, for the specific implementation process of the above deep learning distributed computing apparatus 300 and of each unit, reference can be made to the corresponding description in the foregoing method embodiment; for convenience and brevity of description, details are not repeated here.
The above deep learning distributed computing apparatus 300 can be implemented in the form of a computer program, and the computer program can be run on a computer equipment as shown in Fig. 8.
Referring to Fig. 8, Fig. 8 is a schematic block diagram of a computer equipment provided by an embodiment of the present application. The computer equipment 500 is a server, and the server may be an independent server or a server cluster composed of multiple servers.
Referring to Fig. 8, the computer equipment 500 includes a processor 502, a memory and a network interface 505 connected through a system bus 501, wherein the memory may include a non-volatile storage medium 503 and an internal memory 504.
The non-volatile storage medium 503 can store an operating system 5031 and a computer program 5032. The computer program 5032 includes program instructions which, when executed, can cause the processor 502 to perform a deep learning distributed computing method.
The processor 502 provides computing and control capability to support the operation of the entire computer equipment 500.
The internal memory 504 provides an environment for running the computer program 5032 in the non-volatile storage medium 503; when the computer program 5032 is executed by the processor 502, the processor 502 can be caused to perform a deep learning distributed computing method.
The network interface 505 is used for network communication with other devices. Those skilled in the art will understand that the structure shown in Fig. 8 is only a block diagram of the part of the structure relevant to the present solution and does not constitute a limitation on the computer equipment 500 to which the present solution is applied; a specific computer equipment 500 may include more or fewer components than shown, may combine certain components, or may have a different arrangement of components.
The processor 502 is configured to run the computer program 5032 stored in the memory to implement the following steps:
Encapsulating a model to form a model object;
Determining a computation amount;
Acquiring a request;
Obtaining a hash value according to the request;
Distributing tasks according to the hash value and the model object;
Executing, in a Docker container, the code corresponding to a task according to the distribution result;
Judging whether the code corresponding to the task has finished executing;
If so, releasing the code corresponding to the task in the Docker container.
In one embodiment, when implementing the step of determining a computation amount, the processor 502 specifically implements the following steps:
Acquiring the number of graphics processors participating in the computation;
Obtaining a code training load according to the parameter amount of the model object, the number of training steps and the number of graphics processors;
Determining the computation amount according to the code training load.
In one embodiment, before implementing the step of obtaining a hash value according to the request, the processor 502 further implements the following step:
Uploading the data set of the successfully responded request.
In one embodiment, when implementing the step of executing, in a Docker container, the code corresponding to a task according to the distribution result, the processor 502 specifically implements the following steps:
Acquiring an idle computing cluster;
Sending the task-related hash value, the model object and the number of graphics processors corresponding to the computing cluster's task to the computing cluster;
Downloading, according to the hash value, the code corresponding to the task and the data set of the successfully responded request;
Distributing the task code and the data set to the child nodes in the computing cluster;
Starting the Docker containers needed by the child nodes in the computing cluster for the computation;
Running the code corresponding to the task in the Docker containers.
In one embodiment, before implementing the step of releasing the code in the application container, the processor 502 further implements the following steps:
Acquiring an execution result;
Feeding back the execution result.
In one embodiment, after implementing the step of judging whether the code corresponding to the task has finished executing, the processor 502 further implements the following step:
If not, returning to the executing, in a Docker container, the code corresponding to a task according to the distribution result.
It should be understood that, in the embodiments of the present application, the processor 502 may be a central processing unit (CPU), and the processor may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
Those of ordinary skill in the art will understand that all or part of the processes of the method in the above embodiment can be implemented by instructing the relevant hardware through a computer program. The computer program includes program instructions and can be stored in a storage medium, which is a computer-readable storage medium. The program instructions are executed by at least one processor in the computer system to implement the process steps of the embodiment of the above method.
Accordingly, the present invention also provides a storage medium. The storage medium may be a computer-readable storage medium. The storage medium stores a computer program which, when executed by a processor, causes the processor to perform the following steps:
Encapsulating a model to form a model object;
Determining a computation amount;
Acquiring a request;
Obtaining a hash value according to the request;
Distributing tasks according to the hash value and the model object;
Executing, in a Docker container, the code corresponding to a task according to the distribution result;
Judging whether the code corresponding to the task has finished executing;
If so, releasing the code corresponding to the task in the Docker container.
In one embodiment, when executing the computer program to implement the step of determining a computation amount, the processor specifically implements the following steps:
Acquiring the number of graphics processors participating in the computation;
Obtaining a code training load according to the parameter amount of the model object, the number of training steps and the number of graphics processors;
Determining the computation amount according to the code training load.
In one embodiment, before executing the computer program to implement the step of obtaining a hash value according to the request, the processor further implements the following step:
Uploading the data set of the successfully responded request.
In one embodiment, when executing the computer program to implement the step of executing, in a Docker container, the code corresponding to a task according to the distribution result, the processor specifically implements the following steps:
Acquiring an idle computing cluster;
Sending the task-related hash value, the model object and the number of graphics processors corresponding to the computing cluster's task to the computing cluster;
Downloading, according to the hash value, the code corresponding to the task and the data set of the successfully responded request;
Distributing the task code and the data set to the child nodes in the computing cluster;
Starting the Docker containers needed by the child nodes in the computing cluster for the computation;
Running the code corresponding to the task in the Docker containers.
In one embodiment, before executing the computer program to implement the step of releasing the code in the application container, the processor further implements the following steps:
Acquiring an execution result;
Feeding back the execution result.
In one embodiment, after executing the computer program to implement the step of judging whether the code corresponding to the task has finished executing, the processor further implements the following step:
If not, returning to the executing, in a Docker container, the code corresponding to a task according to the distribution result.
The storage medium may be a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, an optical disc, or any other computer-readable storage medium that can store program code.
Those of ordinary skill in the art will recognize that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are implemented in hardware or in software depends on the specific application and the design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the present invention.
In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; the division into units is only a division by logical function, and there may be other ways of division in actual implementation, for example multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
The steps in the embodiments of the present invention may be reordered, combined and deleted according to actual needs, and the units in the apparatus of the embodiments of the present invention may be combined, divided and deleted according to actual needs. In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a storage medium. Based on this understanding, the essence of the technical solution of the present invention, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a terminal, a network device, or the like) to perform all or part of the steps of the method described in the various embodiments of the present invention.
The above description is only a specific embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any person familiar with the technical field can readily conceive of various equivalent modifications or replacements within the technical scope disclosed by the present invention, and such modifications or replacements shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A deep learning distributed computing method, characterized by comprising:
Encapsulating a model to form a model object;
Determining a computation amount;
Acquiring a request;
Obtaining a hash value according to the request;
Distributing tasks according to the hash value and the model object;
Executing, in a Docker container, the code corresponding to a task according to the distribution result;
Judging whether the task code has finished executing;
If so, releasing the code in the Docker container.
2. The deep learning distributed computing method according to claim 1, characterized in that the determining a computation amount comprises:
Acquiring the number of graphics processors participating in the computation;
Obtaining a code training load according to the parameter amount of the model object, the number of training steps and the number of graphics processors;
Determining the computation amount according to the code training load.
3. The deep learning distributed computing method according to claim 2, characterized in that, before the obtaining a hash value according to the request, the method further comprises:
Uploading the data set of the successfully responded request.
4. The deep learning distributed computing method according to claim 3, characterized in that the executing, in a Docker container, the code corresponding to a task according to the distribution result further comprises:
Acquiring an idle computing cluster;
Sending the task-related hash value, the model object and the number of graphics processors corresponding to the computing cluster's task to the computing cluster;
Downloading, according to the hash value, the code corresponding to the task and the data set of the successfully responded request;
Distributing the task code and the data set to the child nodes in the computing cluster;
Starting the Docker containers needed by the child nodes in the computing cluster for the computation;
Running the code corresponding to the task in the Docker containers.
5. The deep learning distributed computing method according to any one of claims 1 to 4, characterized in that, before the releasing the code in the application container, the method further comprises:
Acquiring an execution result;
Feeding back the execution result.
6. A deep learning distributed computing apparatus, characterized by comprising:
An encapsulation unit, configured to encapsulate a model to form a model object;
An amount determination unit, configured to determine a computation amount;
A request unit, configured to acquire a request;
A hash value acquiring unit, configured to obtain a hash value according to the request;
A dispatching unit, configured to distribute tasks according to the hash value and the model object;
A code execution unit, configured to execute, in a Docker container, the code corresponding to a task according to the distribution result;
An execution judging unit, configured to judge whether the task code has finished executing;
A releasing unit, configured to, if so, release the code in the application container.
7. The deep learning distributed computing apparatus according to claim 6, characterized in that the amount determination unit comprises:
A number acquiring subunit, configured to acquire the number of graphics processors participating in the computation;
A training load acquiring subunit, configured to obtain a code training load according to the parameter amount of the model object, the number of training steps and the number of graphics processors;
A determination subunit, configured to determine the computation amount according to the code training load.
8. The deep learning distributed computing apparatus according to claim 6, characterized in that the apparatus further comprises:
A data set acquiring unit, configured to upload the data set of the successfully responded request.
9. A computer equipment, characterized in that the computer equipment comprises a memory and a processor, a computer program is stored on the memory, and the processor implements the method according to any one of claims 1 to 5 when executing the computer program.
10. A storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 5.
CN201811080562.6A 2018-09-17 2018-09-17 Deep learning distributed arithmetic method, apparatus, computer equipment and storage medium Pending CN109358944A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811080562.6A CN109358944A (en) 2018-09-17 2018-09-17 Deep learning distributed arithmetic method, apparatus, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811080562.6A CN109358944A (en) 2018-09-17 2018-09-17 Deep learning distributed arithmetic method, apparatus, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN109358944A true CN109358944A (en) 2019-02-19

Family

ID=65350884

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811080562.6A Pending CN109358944A (en) 2018-09-17 2018-09-17 Deep learning distributed arithmetic method, apparatus, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN109358944A (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102629219A (en) * 2012-02-27 2012-08-08 北京大学 Self-adaptive load balancing method for Reduce ends in parallel computing framework
CN103473121A (en) * 2013-08-20 2013-12-25 西安电子科技大学 Mass image parallel processing method based on cloud computing platform
WO2016010830A1 (en) * 2014-07-12 2016-01-21 Microsoft Technology Licensing, Llc Composing and executing workflows made up of functional pluggable building blocks
CN107343000A (en) * 2017-07-04 2017-11-10 北京百度网讯科技有限公司 Method and apparatus for handling task
CN107450961A (en) * 2017-09-22 2017-12-08 济南浚达信息技术有限公司 A kind of distributed deep learning system and its building method, method of work based on Docker containers
CN108280207A (en) * 2018-01-30 2018-07-13 深圳市茁壮网络股份有限公司 A method of the perfect Hash of construction

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112114931B (en) * 2019-06-21 2023-12-26 富联精密电子(天津)有限公司 Deep learning program configuration method and device, electronic equipment and storage medium
CN112114931A (en) * 2019-06-21 2020-12-22 鸿富锦精密电子(天津)有限公司 Deep learning program configuration method and device, electronic equipment and storage medium
CN110414687A (en) * 2019-07-12 2019-11-05 苏州浪潮智能科技有限公司 A kind of method and apparatus for the training of deep learning frame distribution
CN110516090A (en) * 2019-08-09 2019-11-29 广东浪潮大数据研究有限公司 A kind of object detecting method, device, equipment and computer readable storage medium
CN112651411A (en) * 2019-10-10 2021-04-13 中国人民解放军国防科技大学 Gradient quantization method and system for distributed deep learning
CN112651411B (en) * 2019-10-10 2022-06-07 中国人民解放军国防科技大学 Gradient quantization method and system for distributed deep learning
CN110795141A (en) * 2019-10-12 2020-02-14 广东浪潮大数据研究有限公司 Training task submitting method, device, equipment and medium
CN110795141B (en) * 2019-10-12 2023-10-10 广东浪潮大数据研究有限公司 Training task submitting method, device, equipment and medium
CN110688230A (en) * 2019-10-17 2020-01-14 广州文远知行科技有限公司 Synchronous training method and device, computer equipment and storage medium
CN110688230B (en) * 2019-10-17 2022-06-24 广州文远知行科技有限公司 Synchronous training method and device, computer equipment and storage medium
CN110839023A (en) * 2019-11-05 2020-02-25 北京中电普华信息技术有限公司 Electric power marketing multi-channel customer service system
CN110839023B (en) * 2019-11-05 2022-03-25 北京中电普华信息技术有限公司 Electric power marketing multi-channel customer service system
CN110866167A (en) * 2019-11-14 2020-03-06 北京知道创宇信息技术股份有限公司 Task allocation method, device, server and storage medium
CN110866167B (en) * 2019-11-14 2022-09-20 北京知道创宇信息技术股份有限公司 Task allocation method, device, server and storage medium
CN111200606A (en) * 2019-12-31 2020-05-26 深圳市优必选科技股份有限公司 Deep learning model task processing method, system, server and storage medium
CN112097368A (en) * 2020-08-21 2020-12-18 深圳市建滔科技有限公司 Air conditioner power consumption adjusting method and device based on cloud deep learning
CN112035220A (en) * 2020-09-30 2020-12-04 北京百度网讯科技有限公司 Processing method, device and equipment for operation task of development machine and storage medium
CN112596863A (en) * 2020-12-28 2021-04-02 南方电网深圳数字电网研究院有限公司 Method, system and computer storage medium for monitoring training tasks
CN113645282A (en) * 2021-07-29 2021-11-12 上海熠知电子科技有限公司 Deep learning method based on server cluster
CN113672215A (en) * 2021-07-30 2021-11-19 阿里巴巴新加坡控股有限公司 Deep learning distributed training adaptation method and device
CN113672215B (en) * 2021-07-30 2023-10-24 阿里巴巴新加坡控股有限公司 Deep learning distributed training adaptation method and device
WO2023123828A1 (en) * 2021-12-31 2023-07-06 上海商汤智能科技有限公司 Model processing method and apparatus, electronic device, computer storage medium, and program
CN114756464A (en) * 2022-04-18 2022-07-15 中国电信股份有限公司 Code checking configuration method, device and storage medium
CN114756464B (en) * 2022-04-18 2024-04-26 中国电信股份有限公司 Code checking configuration method, device and storage medium
CN116483482A (en) * 2023-05-19 2023-07-25 北京百度网讯科技有限公司 Deep learning task processing method, system, device, equipment and medium
CN116483482B (en) * 2023-05-19 2024-03-01 北京百度网讯科技有限公司 Deep learning task processing method, system, device, equipment and medium

Similar Documents

Publication Publication Date Title
CN109358944A (en) Deep learning distributed arithmetic method, apparatus, computer equipment and storage medium
EP3736692A1 (en) Using computational cost and instantaneous load analysis for intelligent deployment of neural networks on multiple hardware executors
CN108369534A (en) Code executes request routing
CN110780914B (en) Service publishing method and device
CN105242956B (en) Virtual functions service chaining deployment system and its dispositions method
Bibani et al. A demo of IoT healthcare application provisioning in hybrid cloud/fog environment
CN109104336A (en) Service request processing method, device, computer equipment and storage medium
CN109471710A (en) Processing method, device, processor, terminal and the server of task requests
CN111310936A (en) Machine learning training construction method, platform, device, equipment and storage medium
US11791050B2 (en) 3D environment risks identification utilizing reinforced learning
CN112015536B (en) Kubernetes cluster container group scheduling method, device and medium
CN110033091B (en) Model-based prediction method and device
CN109067890A (en) A kind of CDN node edge calculations system based on docker container
US20140380196A1 (en) Creation and Prioritization of Multiple Virtual Universe Teleports in Response to an Event
CN108833161A (en) A method of establishing the intelligent contract micro services model calculated based on mist
CN109074283A (en) The M2M service layer based on pond is established by NFV
CN113051053A (en) Heterogeneous resource scheduling method, device, equipment and computer readable storage medium
CN105144109A (en) Distributed data center technology
Sharma et al. Ant colony based optimization model for QoS-Based task scheduling in cloud computing environment
EP4127925A1 (en) Orchestration of virtualization technology and application implementation
CN109189400A (en) Program dissemination method and device, storage medium, processor
KR20200125890A (en) Cloud-based transaction system and method capable of providing neural network training model in supervised state
CN109248440A (en) A kind of method and system for realizing the real-time dynamically load configuration of game
Kemp Programming frameworks for distributed smartphone computing
CN108985459A (en) The method and apparatus of training pattern

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20190219