CN114816669A - Distributed training method and data processing method of model - Google Patents

Publication number
CN114816669A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210476218.9A
Other languages
Chinese (zh)
Inventor
王晖
刘洋
王亚男
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533 Hypervisors; Virtual machine monitors
    • G06F9/45558 Hypervisor-specific management and integration aspects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533 Hypervisors; Virtual machine monitors
    • G06F9/45558 Hypervisor-specific management and integration aspects
    • G06F2009/4557 Distribution of virtual machine instances; Migration and load balancing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure provides a distributed training method and a data processing method for a model, and relates to the technical field of artificial intelligence, in particular to the technical fields of image processing, computer vision and the like. The implementation scheme is as follows: the method comprises the steps of obtaining a plurality of containers on a plurality of nodes, wherein each container in the plurality of containers is provided with a training module, and the training module can respond to a training request corresponding to each training task in a plurality of training tasks to execute the training task; in response to receiving a first request corresponding to a first training task, determining a plurality of training containers corresponding to the first training task from a plurality of containers; and respectively sending first training requests corresponding to the first training task to the plurality of training containers so that the training module in each of the plurality of training containers executes the first training task, thereby training the model corresponding to the first training task.

Description

Distributed training method and data processing method of model
Technical Field
The present disclosure relates to the technical field of artificial intelligence, in particular to the technical fields of image processing, computer vision, and the like, and more specifically to a distributed training method for a model, a data processing method, an apparatus, an electronic device, a computer-readable storage medium, and a computer program product.
Background
Artificial intelligence is the discipline that studies how to make computers simulate certain human thought processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), and it involves both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, and the like. Artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, knowledge graph technology, and the like.
Artificial intelligence technology widely uses trained algorithm models to implement the related software technologies. Because training such a model typically involves a large amount of training data and a large computation scale, a distributed training method is often used.
The approaches described in this section are not necessarily approaches that have been previously conceived or pursued. Unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, unless otherwise indicated, the problems mentioned in this section should not be considered as having been acknowledged in any prior art.
Disclosure of Invention
The present disclosure provides a distributed training method of a model, a data processing method, an apparatus, an electronic device, a computer-readable storage medium, and a computer program product.
According to an aspect of the present disclosure, there is provided a distributed training method of a model, including: acquiring a plurality of containers on a plurality of nodes, wherein each container in the plurality of containers is deployed with a training module capable of executing each training task in a plurality of training tasks in response to a training request corresponding to that training task; in response to receiving a first request corresponding to a first training task, determining a plurality of training containers corresponding to the first training task from the plurality of containers; and respectively sending first training requests corresponding to the first training task to the plurality of training containers, so that the training module in each of the plurality of training containers executes the first training task, thereby training the model corresponding to the first training task.
According to another aspect of the present disclosure, there is provided a data processing method including: acquiring data to be processed; and inputting the data to be processed into a processing model, wherein the processing model is obtained by training by adopting a distributed training method of the model according to the disclosure.
According to another aspect of the present disclosure, there is provided a distributed training apparatus of a model, including: a container acquisition unit configured to acquire a plurality of containers on a plurality of nodes, each of the plurality of containers being deployed with a training module capable of executing each of a plurality of training tasks in response to a training request corresponding to that training task; a determining unit configured to determine, in response to receiving a first request corresponding to a first training task, a plurality of training containers corresponding to the first training task from the plurality of containers; and a training request unit configured to send first training requests corresponding to the first training task to the plurality of training containers, respectively, so that the training module in each of the plurality of training containers executes the first training task, thereby training the model corresponding to the first training task.
According to another aspect of the present disclosure, there is provided a data processing apparatus including: a data acquisition unit configured to acquire data to be processed; and a data input unit configured to input the data to be processed to a processing model obtained by training using a distributed training method of a model according to the present disclosure.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to implement a method according to the above.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to implement the method according to the above.
According to another aspect of the present disclosure, a computer program product is provided comprising a computer program, wherein the computer program realizes the method according to the above when executed by a processor.
According to one or more embodiments of the disclosure, a training module capable of executing a plurality of training tasks in response to a training request is deployed in a container on a plurality of nodes, the training request is sent to each training container, the training module of each training container is enabled to execute the training tasks in response to the training request, and distributed training of a model is achieved; the training module is deployed in the container of the node, so that the deployment efficiency of the distributed training platform of the model is high, the deployed platform is easy to expand, and the use and maintenance cost is reduced.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments and, together with the description, serve to explain exemplary implementations of the embodiments. The illustrated embodiments are for purposes of illustration only and do not limit the scope of the claims. Throughout the drawings, identical reference numbers designate similar, but not necessarily identical, elements.
FIG. 1 shows a flow diagram of a method of distributed training of a model according to an embodiment of the present disclosure;
FIG. 2 shows a flow diagram of a process of acquiring multiple containers on multiple nodes in a distributed training method of a model according to an embodiment of the present disclosure;
FIG. 3 shows a schematic diagram of a training platform in a distributed training method of a model according to an embodiment of the present disclosure;
FIG. 4 shows a flow diagram of a distributed training method of a model according to an embodiment of the present disclosure;
FIG. 5 shows a block diagram of a distributed training apparatus of a model according to an embodiment of the present disclosure;
FIG. 6 shows a block diagram of a data processing apparatus according to an embodiment of the present disclosure; and
FIG. 7 illustrates a block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the present disclosure, unless otherwise specified, the use of the terms "first", "second", etc. to describe various elements is not intended to limit the positional relationship, the timing relationship, or the importance relationship of the elements, and such terms are used only to distinguish one element from another. In some examples, a first element and a second element may refer to the same instance of the element, and in some cases, based on the context, they may also refer to different instances.
The terminology used in the description of the various described examples in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, if the number of elements is not specifically limited, the element may be one or a plurality of. Furthermore, the term "and/or" as used in this disclosure is intended to encompass any and all possible combinations of the listed items.
In the related art, distributed model training is implemented using a cluster-based distributed architecture, for example Spark, Hadoop, or MPI. These architectures often involve cluster-based application deployment and management, so cluster control must be handled in a way that avoids loss of efficiency. However, deploying and managing a cluster involves complex technologies and applications (e.g., Kubernetes, abbreviated K8s), so such a distributed model training architecture is costly to use and maintain, is inconvenient to deploy, and cannot meet the needs of small users (e.g., users whose training platform has only a small number of computing devices).
Therefore, the present disclosure provides a distributed model training method and apparatus to solve the above problems.
Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.
Referring to FIG. 1, a method 100 of distributed training of a model according to some embodiments of the present disclosure includes:
step S110: acquiring a plurality of containers on a plurality of nodes, wherein each container in the plurality of containers is deployed with a training module capable of responding to a training request corresponding to each training task in a plurality of training tasks to execute the training task;
step S120: in response to receiving a first request corresponding to a first training task, determining a plurality of training containers corresponding to the first training task from a plurality of containers; and
step S130: respectively sending first training requests corresponding to the first training task to the plurality of training containers, so that the training module in each of the plurality of training containers executes the first training task, thereby training the model corresponding to the first training task.
Training modules capable of executing a plurality of training tasks in response to training requests are deployed in containers on a plurality of nodes. A training request is sent to each training container, and the training module of each training container executes the training task in response to the request, which achieves distributed training of the model. Because the training modules are deployed in containers on the nodes, the distributed training platform can be deployed efficiently, the deployed platform is easy to expand, and the use and maintenance cost is reduced.
In embodiments according to the present disclosure, distributed training of a model is implemented by deploying, on nodes formed by computing devices, a training module capable of executing a plurality of training tasks in response to training requests. No cluster-based distributed architecture is required, so cluster application deployment and management are not involved and the use and maintenance cost is reduced. Container-based deployment also makes deployment convenient and suits the needs of small users (for example, users whose training platform has only a small number of computing devices).
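The flow of steps S110 to S130 can be illustrated with a minimal Python sketch. It is not the implementation of the disclosure: the `TrainingContainer` class, the `dispatch` function, the request fields `task` and `nodes`, and the addresses are all hypothetical, and the containers are in-process stubs rather than real containers reached over a network.

```python
from dataclasses import dataclass, field


@dataclass
class TrainingContainer:
    """Hypothetical stand-in for a container with a deployed training module."""
    address: str
    received: list = field(default_factory=list)

    def handle_training_request(self, request: dict) -> str:
        # A real training module would start training; this stub only records the request.
        self.received.append(request)
        return f"{self.address}: started {request['task']}"


def dispatch(containers: dict, first_request: dict) -> list:
    """Steps S120 and S130: pick the training containers, then send each a training request."""
    # S120: determine the training containers named in the first request.
    selected = [containers[addr] for addr in first_request["nodes"]]
    # S130: send a training request for the task to every selected container.
    training_request = {"task": first_request["task"]}
    return [c.handle_training_request(training_request) for c in selected]


# S110: the containers acquired on the nodes (addresses are made up for illustration).
containers = {addr: TrainingContainer(addr) for addr in ("10.0.0.1", "10.0.0.2", "10.0.0.3")}
first_request = {"task": "image_classification", "nodes": ["10.0.0.1", "10.0.0.3"]}
results = dispatch(containers, first_request)
print(results)
```

In this sketch only the two containers named in the first request receive a training request; the third container stays idle and remains available for other tasks.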
In some embodiments, the nodes are a plurality of computing devices, e.g., computers.
In some embodiments, the containers on the plurality of nodes may be created by deploying the training module, and each container is started after it is created.
In some embodiments, the training module may be a service deployed on the container, wherein the service includes an algorithm service and a response service corresponding to a plurality of training tasks, and the response service may respond to the training request and initiate the algorithm service corresponding to the training request, thereby performing the corresponding training task.
In some embodiments, the plurality of training tasks may be face detection, image classification, and the like, without limitation.
In some embodiments, the model corresponding to the training task may be a face detection model, an image classification model, or the like.
In some embodiments, the algorithm service may be a service that encapsulates an algorithm corresponding to the training task. For example, the algorithm may be a face detection algorithm, an image classification algorithm, etc., and is not limited herein.
In some embodiments, the training module may also include other services, such that the response service initiates the service to perform a corresponding task based on a response to the training request. The other service may be, for example, a data acquisition service for acquiring corresponding training data.
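One way to picture the relationship between the response service and the algorithm services is a small dispatch table, sketched below in Python. The class name `TrainingModule`, the two training functions, and the `task` and `params` request fields are assumptions made for illustration; the disclosure does not specify how the services are wired together.

```python
from typing import Callable, Dict


# Hypothetical algorithm services, one per supported training task.
def train_face_detection(params: dict) -> str:
    return f"face_detection trained for {params.get('epochs', 1)} epoch(s)"


def train_image_classification(params: dict) -> str:
    return f"image_classification trained for {params.get('epochs', 1)} epoch(s)"


class TrainingModule:
    """Response service that routes a training request to the matching algorithm service."""

    def __init__(self) -> None:
        self.algorithm_services: Dict[str, Callable[[dict], str]] = {
            "face_detection": train_face_detection,
            "image_classification": train_image_classification,
        }

    def respond(self, request: dict) -> str:
        task = request["task"]
        if task not in self.algorithm_services:
            raise ValueError(f"unsupported training task: {task}")
        # Start the algorithm service that corresponds to the requested task.
        return self.algorithm_services[task](request.get("params", {}))


module = TrainingModule()
outcome = module.respond({"task": "face_detection", "params": {"epochs": 3}})
print(outcome)
```

Because every container carries the same module, any container can serve any of the supported tasks, and adding a task amounts to registering another algorithm service.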
In some embodiments, as shown in FIG. 2, obtaining a plurality of containers on a plurality of nodes comprises:
step S210: acquiring an image of the training module;
step S220: mounting the image onto a first node of the plurality of nodes; and
step S230: in response to the first node running the image, acquiring the container on the first node.
By acquiring the image of the training module and mounting it, the node obtains the image of the training module, and the container is started when the image is run, so the training module can be deployed efficiently and the platform is easy to expand.
For example, when a user adds a node, the training module can be deployed simply by mounting its image on the added node and running the image there, after which the node can take part in distributed training of the model. This greatly improves the deployment efficiency of the training module and makes the platform easy to expand.
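As a rough illustration of deploying the training module's container image on a newly added node, the following Python sketch assembles the two commands such a node might run. The disclosure does not name a container runtime; Docker is assumed here, "mounting" is interpreted as loading an image archive, and the archive name, image tag, container name, and port are hypothetical. The sketch only builds and prints the commands and does not execute them.

```python
import shlex


def deployment_commands(image_archive: str, image_name: str, container_name: str, port: int) -> list:
    """Return the commands an added node might run; nothing is executed here."""
    # Make the training-module image available on the node (assumed: a saved image archive).
    load = ["docker", "load", "-i", image_archive]
    # Run the image so that a container with the training module is started.
    run = ["docker", "run", "-d", "--name", container_name, "-p", f"{port}:{port}", image_name]
    return [load, run]


commands = deployment_commands("training-module.tar", "training-module:latest", "trainer", 8000)
for command in commands:
    print(shlex.join(command))
```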
Referring to fig. 3, a schematic diagram of a distributed training platform of a model is shown, according to some embodiments of the present disclosure.
The distributed training platform 300 of the model comprises a plurality of nodes (nodes 310-3n0), each of which is deployed with a training module (training modules 311-3n1) through a container, n being a positive integer.
The process of the distributed training method of the model according to the present disclosure is implemented by the node 310. In response to obtaining a first request corresponding to a first training task, the node 310 determines a plurality of training nodes from the plurality of nodes (nodes 310-3n0), and thereby obtains a plurality of training containers. Based on the first request, the node 310 generates the corresponding training requests and sends them to the plurality of training containers.
In some embodiments, the first request may be a request received from a user via an input-output device. The first request includes information indicating a training node corresponding to the first training task, data information, and the like, which is not limited herein.
In one example, the information of the training node may be an IP address of the training node; the data information may be information indicating the training data set required for the first training task and the label of each piece of training data in the training data set.
In some embodiments, the first request includes training parameters related to the training task, the method further comprising:
generating the training request, the training request indicating the training parameters.
The training parameters determined by the user are sent to the training containers in the form of training requests, so that the training module in each training container executes the training task based on the corresponding training request. This achieves distributed training of the model and makes it easy to extend the platform with new training tasks.
In some embodiments, the training parameters are hyper-parameters of the model, that is, parameters that are not adjusted based on the loss during training. A hyper-parameter may be, for example, the regularization coefficient λ or the depth of the tree in a decision tree model. By setting the hyper-parameters, different algorithm combinations can be configured and different training tasks can be realized.
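A training request that carries the user's hyper-parameters might look like the following Python sketch. The JSON encoding and the field names `task`, `nodes`, `params`, `regularization_lambda`, and `max_tree_depth` are assumptions for illustration; the disclosure states only that the training request indicates the training parameters.

```python
import json


def build_training_request(first_request: dict) -> str:
    """Generate a training request that carries the user's training parameters."""
    request = {
        "task": first_request["task"],
        # Copy the hyper-parameters unchanged; they are fixed by the user, not learned.
        "params": dict(first_request.get("params", {})),
    }
    return json.dumps(request, sort_keys=True)


first_request = {
    "task": "image_classification",
    "nodes": ["10.0.0.1", "10.0.0.2"],
    "params": {"regularization_lambda": 0.01, "max_tree_depth": 6},
}
payload = build_training_request(first_request)
print(payload)
```

The node list is deliberately left out of the payload in this sketch, since it tells the coordinating node where to send the request rather than how to train.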
In some embodiments, as shown in fig. 4, a method according to the present disclosure further comprises:
step S410: obtaining training data information corresponding to the first training task, the training data information indicating a training data set of the first training task and a label of each of the training data in the training data set; and
step S420: respectively sending the training data information to the plurality of training containers, so that the training modules in the plurality of training containers respectively acquire the training data.
By obtaining the training data information and sending the training data information to the training container, the training data and the training module are separately deployed, and the deployment of the distributed training platform is further simplified.
In some embodiments, the training data information is obtained based on the data information indicated in the first request.
For example, the first request includes an address of training data, and training data information is obtained based on the address of the training data.
In some embodiments, training data information is obtained by obtaining training data uploaded by a user.
In some embodiments, the training data information includes each training data in the set of training data and a label for the training data.
The training data is sent to each training container, so that the training containers perform training based on the received training data and the label of the training data, and the deployment of the distributed training platform is further simplified.
In some embodiments, the training data information includes an acquisition address of each piece of training data in the training data set, and the training data can be obtained by accessing the acquisition address.
When the data volume of the training data set is large, the training data set is mounted in a storage module, and the address of each piece of training data in the training data set is sent to the training containers, so that the training module in each training container accesses the address to obtain the training data. This reduces the storage space needed for the training data set and avoids the communication link congestion that would arise if the training data set were transmitted to a plurality of training modules simultaneously.
For example, the training data set is stored on network attached storage (NAS). The acquisition address of each piece of training data on the network attached storage is sent to the training containers, and each training container accesses the network attached storage based on those acquisition addresses to obtain the training data set.
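The address-based variant can be sketched in Python as follows. A temporary directory stands in for the shared storage (such as a NAS mount), and the function names, file names, and the `address` and `label` fields are hypothetical; the point is only that the training data information carries addresses and labels while the sample contents stay in shared storage.

```python
import tempfile
from pathlib import Path


def build_training_data_info(root: Path, labels: dict) -> list:
    """Describe each sample by its acquisition address and label instead of its contents."""
    return [{"address": str(root / name), "label": label} for name, label in sorted(labels.items())]


def load_samples(training_data_info: list) -> list:
    """What a training module might do: read each sample from its acquisition address."""
    return [(Path(item["address"]).read_bytes(), item["label"]) for item in training_data_info]


with tempfile.TemporaryDirectory() as shared_storage:  # stand-in for a shared storage mount
    root = Path(shared_storage)
    (root / "cat.bin").write_bytes(b"cat-pixels")
    (root / "dog.bin").write_bytes(b"dog-pixels")
    info = build_training_data_info(root, {"cat.bin": "cat", "dog.bin": "dog"})
    samples = load_samples(info)

print([label for _, label in samples])
```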
In some embodiments, the method according to the present disclosure further comprises:
obtaining a data order corresponding to each of the plurality of training containers, the data order indicating the order in which the training module of that training container inputs the plurality of training data in the training data set when performing the first training task; and wherein the sending the training data information to the plurality of training containers, respectively, comprises:
sending the training data information to each training container in the plurality of training containers based on the data order corresponding to that training container.
By obtaining the data order corresponding to each training container and sending the training data information to that container based on its data order, the training module inputs the plurality of training data in the training data set in that order when executing the first training task. In other words, the training modules of all training containers use the same training data set for the first training task but input the training data in different orders. The trained model therefore takes the input order of the training data into account and is more robust.
In some embodiments, the training data information includes each training data in the training data set and a label of the training data, and when the training data information is transmitted based on the corresponding data sequence of each training container, after the training data in the training data set is arranged based on the data sequence, the training data in the arranged training data set is sequentially transmitted.
In some embodiments, the training data information includes a label and an acquisition address of each training data in the training data set, and when the training data information is transmitted based on the corresponding data sequence of each training container, after the acquisition addresses of the training data in the training data set are arranged based on the data sequence, the acquisition addresses of the training data in the arranged training data set are sequentially transmitted.
In some embodiments, a service for obtaining the data order may instead be deployed in the training module, so that the training module of each training container inputs the plurality of training data in the training data set in its own data order when performing the first training task.
For example, a function for obtaining the input order of the training data (e.g., a random function) is packaged in the training module; it orders the plurality of training data in the obtained training data set to produce the data order of that training module.
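A per-container data order can be sketched in Python with a seeded shuffle. Deriving the seed from a container index is an assumption made so that the example is reproducible; the disclosure mentions only that a random function may be used. The names `data_order_for` and `arrange` are hypothetical.

```python
import random


def data_order_for(container_index: int, dataset_size: int, base_seed: int = 0) -> list:
    """Return a permutation of sample indices for one training container."""
    rng = random.Random(base_seed + container_index)  # separate generator per container
    order = list(range(dataset_size))
    rng.shuffle(order)
    return order


def arrange(training_data_info: list, order: list) -> list:
    """Arrange the training data (or their acquisition addresses) in the given data order."""
    return [training_data_info[i] for i in order]


dataset = [f"sample_{i}" for i in range(6)]
orders = [data_order_for(i, len(dataset)) for i in range(3)]
arranged = [arrange(dataset, order) for order in orders]
for container_index, sequence in enumerate(arranged):
    print(container_index, sequence)
```

Every container still sees the whole data set; only the sequence in which the samples are input changes from container to container.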
In some embodiments, the containers communicate with each other during training to synchronize gradient and parameter information. For example, this communication is implemented with NCCL (NVIDIA Collective Communications Library).
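What gradient synchronization accomplishes can be shown with a pure-Python stand-in. The sketch below does not use NCCL and is not its API; it simply averages the gradients computed by each container, which is the result an all-reduce (mean) operation would leave on every participant, and then applies the same update everywhere. The function names, the learning rate, and the gradient values are illustrative assumptions.

```python
def all_reduce_mean(per_container_grads: list) -> list:
    """Average gradients element-wise across containers (stand-in for an all-reduce)."""
    count = len(per_container_grads)
    return [sum(values) / count for values in zip(*per_container_grads)]


def sgd_step(params: list, grads: list, lr: float) -> list:
    """Apply one plain gradient-descent update."""
    return [p - lr * g for p, g in zip(params, grads)]


# Gradients computed independently by three containers for a two-parameter model.
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
mean_grad = all_reduce_mean(grads)

# Every container starts from the same parameters and applies the same averaged gradient,
# so the replicas remain identical after the step.
replicas = [[0.0, 0.0] for _ in grads]
replicas = [sgd_step(p, mean_grad, lr=0.5) for p in replicas]
print(mean_grad, replicas[0])
```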
In some embodiments, the method according to the present disclosure further comprises:
sending an acquisition request to the plurality of training containers to acquire the loss obtained by the training module of each of the plurality of training containers after performing the first training task.
By sending an acquisition request to the training containers, the loss during training is obtained and can be displayed to the user, which enables real-time monitoring of the distributed training process of the model.
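Collecting the loss from each training container can be sketched in Python as below. The `TrainingContainerStub` class and its methods are hypothetical in-process stand-ins; in a deployed platform the acquisition request would travel over the network to the training module's response service.

```python
class TrainingContainerStub:
    """Hypothetical container whose training module records a loss per training step."""

    def __init__(self, address: str) -> None:
        self.address = address
        self.loss_history = []

    def record_loss(self, loss: float) -> None:
        self.loss_history.append(loss)

    def handle_acquisition_request(self):
        # Reply with the most recent loss, or None if training has not produced one yet.
        return self.loss_history[-1] if self.loss_history else None


def collect_losses(containers: list) -> dict:
    """Send an acquisition request to every training container and gather the replies."""
    return {c.address: c.handle_acquisition_request() for c in containers}


workers = [TrainingContainerStub("10.0.0.1"), TrainingContainerStub("10.0.0.2")]
workers[0].record_loss(0.9)
workers[0].record_loss(0.7)
latest = collect_losses(workers)
print(latest)
```

Calling `collect_losses` periodically would give the user a view of how the loss evolves on each container while training is in progress.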
According to another aspect of the present disclosure, there is also provided a data processing method, including:
acquiring data to be processed; and
inputting the data to be processed into a processing model, wherein the processing model is obtained by training by adopting a distributed training method of the model according to the disclosure.
In some embodiments, the processing model may be, but is not limited to, a target detection model, a text recognition model, and the like.
In some embodiments, the data to be processed may be image data, audio data, and the like, without limitation.
According to another aspect of the present disclosure, there is also provided a distributed training apparatus for a model, as shown in fig. 5, the apparatus 500 includes: a container acquisition unit 510 configured to acquire a plurality of containers on a plurality of nodes, each of the plurality of containers having a training module deployed therein, the training module being capable of executing a plurality of training tasks in response to a training request corresponding to each of the training tasks; a determining unit 520 configured to determine, in response to receiving a first request corresponding to a first training task, a plurality of training containers corresponding to the first training task from a plurality of containers; and a training request unit 530 configured to send first training requests corresponding to the first training task to the plurality of training containers, respectively, so that the training module in each of the plurality of training containers executes the first training task, thereby training the model corresponding to the first training task.
In some embodiments, the container acquisition unit 510 includes: an image acquisition unit configured to acquire an image of the training module; a mounting unit configured to mount the image onto a first node of the plurality of nodes; and an obtaining subunit configured to obtain, in response to the first node running the image, a container on the first node.
In some embodiments, the first request includes training parameters related to the training task, the apparatus further comprising: a training request generation unit configured to generate the training request, the training request indicating the training parameters.
In some embodiments, the apparatus further includes: a training data obtaining unit configured to obtain training data information corresponding to the first training task, the training data information indicating a training data set of the first training task and a label of each piece of training data in the training data set; and a sending unit configured to send the training data information to the plurality of training containers, respectively, so that the training modules in the plurality of training containers respectively acquire the training data.
In some embodiments, the training data information includes each piece of training data in the training data set and a label for that training data.
In some embodiments, the training data information includes a label and an acquisition address for each piece of training data in the training data set, and the training data can be obtained by accessing the acquisition address.
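The two forms of training data information described above (data carried inline, or a label plus an acquisition address) might be represented as in the following sketch. The `TrainingDataEntry` type, the `store://` address scheme, and the dictionary used as a stand-in for remote storage are illustrative assumptions only.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TrainingDataEntry:
    label: str
    data: Optional[bytes] = None               # inline form: the data itself
    acquisition_address: Optional[str] = None  # by-reference form: where to fetch it


def resolve(entry: TrainingDataEntry, fetch: Callable[[str], bytes]) -> bytes:
    """Return the training data, fetching it only when it is not carried inline."""
    if entry.data is not None:
        return entry.data
    if entry.acquisition_address is None:
        raise ValueError("entry carries neither data nor an acquisition address")
    return fetch(entry.acquisition_address)


# A dictionary stands in for whatever storage the acquisition address points at.
fake_store = {"store://images/0001": b"\x00\x01"}

entries = [
    TrainingDataEntry(label="cat", data=b"\xff"),
    TrainingDataEntry(label="dog", acquisition_address="store://images/0001"),
]
samples = [(resolve(e, fake_store.__getitem__), e.label) for e in entries]
print(samples)
```

Sending addresses instead of data keeps the message to each training container small, at the cost of one fetch per sample inside the container.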
In some embodiments, the apparatus further includes: an order acquisition unit configured to acquire a data order corresponding to each of the plurality of training containers; and the sending unit further includes: a sending subunit configured to send, for each of the plurality of training containers, the training data information to that training container based on its corresponding data order, so that the training module of that training container inputs the training data in the training data set according to that data order when executing the first training task.
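One possible way to derive and apply a per-container data order is sketched below. Seeding a shuffle with the container's position in the list is an assumption of this sketch; the present disclosure does not state how the data order is produced.

```python
import random


def data_orders(num_samples, container_ids, seed=0):
    """Build one reproducible permutation of sample indices per container."""
    orders = {}
    for rank, cid in enumerate(container_ids):
        order = list(range(num_samples))
        random.Random(seed + rank).shuffle(order)  # assumed policy: seeded shuffle
        orders[cid] = order
    return orders


def arrange(training_data_info, order):
    """Lay out the training data information in the order a container will input it."""
    return [training_data_info[i] for i in order]


# Hypothetical training data information: (sample, label) pairs.
info = [("sample-a", 0), ("sample-b", 1), ("sample-c", 0), ("sample-d", 1)]
orders = data_orders(len(info), ["c0", "c1"])
payloads = {cid: arrange(info, order) for cid, order in orders.items()}

for cid, payload in payloads.items():
    print(cid, [sample for sample, _ in payload])
```

Every container still receives the full training data set; only the input order differs from container to container.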
In some embodiments, the apparatus further includes: a loss requesting unit configured to send an acquisition request to the plurality of training containers to acquire a loss obtained by the training module of each of the plurality of training containers after performing the first training task.
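A loss requesting unit of this kind could be sketched as follows. The stub class, the reply format, and the averaging of the collected losses are assumptions for illustration; the present disclosure only states that the losses are acquired.

```python
from statistics import fmean


class TrainingContainerStub:
    """Hypothetical container that reports the loss of its last training run."""

    def __init__(self, container_id, loss):
        self.container_id = container_id
        self._loss = loss

    def handle_acquisition_request(self, task_id):
        return {"task_id": task_id, "container_id": self.container_id,
                "loss": self._loss}


def collect_losses(containers, task_id):
    """Send an acquisition request to every container and gather the replies."""
    replies = [c.handle_acquisition_request(task_id) for c in containers]
    return {r["container_id"]: r["loss"] for r in replies}


stubs = [TrainingContainerStub("c0", 0.5), TrainingContainerStub("c1", 0.25)]
losses = collect_losses(stubs, "task-1")
mean_loss = fmean(losses.values())  # assumed aggregation: simple average
print(losses, mean_loss)
```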
According to another aspect of the present disclosure, there is also provided a data processing apparatus, as shown in fig. 6, the apparatus 600 including: a data acquisition unit 610 configured to acquire data to be processed; and a data input unit 620 configured to input the data to be processed to a processing model obtained by training using a distributed training method of the model according to the present disclosure.
According to another aspect of the present disclosure, there is also provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method according to the present disclosure.
According to another aspect of the present disclosure, there is also provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method according to the present disclosure.
According to another aspect of the present disclosure, there is also provided a computer program product comprising a computer program, wherein the computer program realizes the method according to the present disclosure when executed by a processor.
Referring to fig. 7, a block diagram of an electronic device 700 will now be described. The electronic device 700 may be a server or a client of the present disclosure and is an example of a hardware device that may be applied to aspects of the present disclosure. The electronic device is intended to represent various forms of digital electronic computer devices, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing devices, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the electronic device 700 includes a computing unit 701, which may perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM)702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data necessary for the operation of the electronic device 700 can be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
A number of components in the electronic device 700 are connected to the I/O interface 705, including: an input unit 706, an output unit 707, a storage unit 708, and a communication unit 709. The input unit 706 may be any type of device capable of inputting information to the electronic device 700. The input unit 706 may receive input numeric or character information and generate key signal inputs related to user settings and/or function controls of the electronic device, and may include, but is not limited to, a mouse, a keyboard, a touch screen, a track pad, a track ball, a joystick, a microphone, and/or a remote controller. The output unit 707 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, a video/audio output terminal, a vibrator, and/or a printer. The storage unit 708 may include, but is not limited to, magnetic or optical disks. The communication unit 709 allows the electronic device 700 to exchange information/data with other devices via a computer network, such as the Internet, and/or various telecommunications networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers and/or chipsets, such as Bluetooth™ devices, 802.11 devices, WiFi devices, WiMax devices, cellular communication devices, and/or the like.
Computing unit 701 may be a variety of general purpose and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 701 performs the various methods and processes described above, such as the method 200. For example, in some embodiments, the method 200 may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into RAM 703 and executed by the computing unit 701, one or more steps of the method 200 described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the method 200 by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SoCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be performed in parallel, sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
Although embodiments or examples of the present disclosure have been described with reference to the accompanying drawings, it is to be understood that the above-described methods, systems and apparatus are merely exemplary embodiments or examples, and that the scope of the present invention is not limited by these embodiments or examples, but only by the claims as issued and their equivalents. Various elements in the embodiments or examples may be omitted or may be replaced with equivalents thereof. Further, the steps may be performed in an order different from that described in the present disclosure. Further, the various elements in the embodiments or examples may be combined in various ways. Importantly, as technology evolves, many of the elements described herein may be replaced with equivalent elements that appear after the present disclosure.

Claims (21)

1. A method of distributed training of a model, comprising:
acquiring a plurality of containers on a plurality of nodes, wherein each container in the plurality of containers is deployed with a training module capable of responding to a training request corresponding to each training task in a plurality of training tasks to execute the training task;
in response to receiving a first request corresponding to a first training task, determining a plurality of training containers corresponding to the first training task from a plurality of containers; and
sending first training requests corresponding to the first training task to the plurality of training containers, respectively, so that the training module in each of the plurality of training containers executes the first training task, thereby training the model corresponding to the first training task.
2. The method of claim 1, wherein the acquiring a plurality of containers on a plurality of nodes comprises:
acquiring an image of the training module;
mounting the image onto a first node of the plurality of nodes; and
in response to the first node running the image, acquiring a container on the first node.
3. The method of claim 1 or 2, wherein the first request includes training parameters related to the first training task, the method further comprising:
generating the training request, the training request indicating the training parameters.
4. The method of claim 1 or 2, further comprising:
obtaining training data information corresponding to the first training task, the training data information indicating a training data set of the first training task and a label of each piece of training data in the training data set; and
sending the training data information to the plurality of training containers, respectively, so that the training modules in the plurality of training containers acquire the training data, respectively.
5. The method of claim 4, wherein the training data information includes each piece of training data in the training data set and a label for that training data.
6. The method of claim 4, wherein the training data information includes a label and an acquisition address for each piece of training data in the training data set, the training data being obtainable by accessing the acquisition address.
7. The method of claim 4, further comprising:
obtaining a data order corresponding to each of the plurality of training containers, the data order indicating an order in which the training module of that training container inputs the training data in the training data set when performing the first training task; and wherein the sending the training data information to the plurality of training containers, respectively, comprises:
sending the training data information to each training container in the plurality of training containers based on the data order corresponding to that training container.
8. The method of any of claims 1-7, further comprising:
sending an acquisition request to the plurality of training containers to acquire a loss obtained by the training module of each of the plurality of training containers after performing the first training task.
9. A method of data processing, comprising:
acquiring data to be processed; and
inputting the data to be processed into a processing model obtained by training using the method according to any one of claims 1-8.
10. A distributed training apparatus for a model, comprising:
a container acquisition unit configured to acquire a plurality of containers on a plurality of nodes, each of the plurality of containers being deployed with a training module capable of executing a plurality of training tasks in response to a training request corresponding to each of the training tasks;
a determining unit configured to determine, in response to receiving a first request corresponding to a first training task, a plurality of training containers corresponding to the first training task from a plurality of containers; and
a training request unit configured to send first training requests corresponding to the first training task to the plurality of training containers, respectively, so that the training module in each of the plurality of training containers executes the first training task, thereby training the model corresponding to the first training task.
11. The apparatus of claim 10, wherein the container acquisition unit comprises:
an image acquisition unit configured to acquire an image of the training module;
a mounting unit configured to mount the image onto a first node of the plurality of nodes; and
an obtaining subunit configured to obtain a container on the first node in response to the first node running the image.
12. The apparatus of claim 10 or 11, wherein the first request includes training parameters related to the first training task, the apparatus further comprising:
a training request generation unit configured to generate the training request, the training request indicating the training parameters.
13. The apparatus of claim 10 or 11, further comprising:
a training data obtaining unit configured to obtain training data information corresponding to the first training task, the training data information indicating a training data set of the first training task and a label of each piece of training data in the training data set; and
a sending unit configured to send the training data information to the plurality of training containers, respectively, so that the training modules in the plurality of training containers acquire the training data, respectively.
14. The apparatus of claim 13, wherein the training data information includes each piece of training data in the training data set and a label for that training data.
15. The apparatus of claim 13, wherein the training data information includes a label and an acquisition address for each piece of training data in the training data set, the training data being obtainable by accessing the acquisition address.
16. The apparatus of claim 13, further comprising:
an order acquisition unit configured to acquire a data order corresponding to each of the plurality of training containers, the data order indicating an order in which the training module of that training container inputs the training data in the training data set when executing the first training task; and wherein the sending unit further comprises:
a sending subunit configured to send the training data information to each training container in the plurality of training containers based on the data order corresponding to that training container.
17. The apparatus of any of claims 10-16, further comprising:
a loss requesting unit configured to send an acquisition request to the plurality of training containers to acquire a loss obtained by the training module of each of the plurality of training containers after performing the first training task.
18. A data processing apparatus comprising:
a data acquisition unit configured to acquire data to be processed; and
a data input unit configured to input the data to be processed to a processing model obtained by training using the method according to any one of claims 1 to 8.
19. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-9.
20. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-9.
21. A computer program product comprising a computer program, wherein the computer program realizes the method of any one of claims 1-9 when executed by a processor.
CN202210476218.9A 2022-04-29 2022-04-29 Distributed training method and data processing method of model Pending CN114816669A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210476218.9A CN114816669A (en) 2022-04-29 2022-04-29 Distributed training method and data processing method of model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210476218.9A CN114816669A (en) 2022-04-29 2022-04-29 Distributed training method and data processing method of model

Publications (1)

Publication Number Publication Date
CN114816669A true CN114816669A (en) 2022-07-29

Family

ID=82511239

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210476218.9A Pending CN114816669A (en) 2022-04-29 2022-04-29 Distributed training method and data processing method of model

Country Status (1)

Country Link
CN (1) CN114816669A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180018590A1 (en) * 2016-07-18 2018-01-18 NantOmics, Inc. Distributed Machine Learning Systems, Apparatus, and Methods
CN107733977A (en) * 2017-08-31 2018-02-23 北京百度网讯科技有限公司 A kind of cluster management method and device based on Docker
CN112000473A (en) * 2020-08-12 2020-11-27 中国银联股份有限公司 Distributed training method and device for deep learning model
CN112000450A (en) * 2020-08-18 2020-11-27 中国银联股份有限公司 Neural network architecture searching method and device
CN112364897A (en) * 2020-10-27 2021-02-12 曙光信息产业(北京)有限公司 Distributed training method and device, storage medium and electronic equipment
CN113569987A (en) * 2021-08-19 2021-10-29 北京沃东天骏信息技术有限公司 Model training method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XIAN, LT ET AL: "H-PS: A Heterogeneous-Aware Parameter Server With Distributed Neural Network Training", 《IEEE ACCESS》, 31 December 2021 (2021-12-31), pages 44049 - 44058 *
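LI, Junjiang: "Design and Implementation of a Kubernetes-Based Machine Learning Cloud Platform", 《China Master's Theses Full-text Database (Electronic Journal)》, vol. 2022, no. 03, 15 March 2022 (2022-03-15) *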

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination