CN114139605A - Distributed model training method, system, device and storage medium - Google Patents

Distributed model training method, system, device and storage medium

Info

Publication number
CN114139605A
Authority
CN
China
Prior art keywords
model
training
trained
sample data
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111301761.7A
Other languages
Chinese (zh)
Inventor
胡建猛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
LeTV Sports Culture Develop Beijing Co Ltd
Original Assignee
LeTV Sports Culture Develop Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by LeTV Sports Culture Develop Beijing Co Ltd
Priority to CN202111301761.7A
Publication of CN114139605A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Abstract

Embodiments of the present disclosure provide distributed model training methods, systems, devices, and storage media. The method is applied to a cluster server comprising a plurality of node servers and includes: receiving a model to be trained and a training sample input by a user; training, by each node server in the cluster server, the model to be trained according to the training sample, wherein the training sample comprises sample data and an identifier corresponding to the sample data; and when the difference between the identifier obtained by computing the sample data in the training sample with the trained model and the identifier corresponding to the sample data is smaller than a preset threshold, taking the trained model as a target model. In this way, the efficiency of model training may be improved.

Description

Distributed model training method, system, device and storage medium
Technical Field
The present disclosure relates to the field of model training, and more particularly, to the field of distributed model training.
Background
At present, under the same system, the models held by the nodes differ, so the training results of the nodes also differ, the overall data of the system become inconsistent, computing resources are wasted, and the actual training efficiency is low.
Disclosure of Invention
The present disclosure provides a distributed model training method, system, device and storage medium.
According to a first aspect of the present disclosure, a distributed model training method is provided, which is applied to a cluster server including a plurality of node servers, and the method includes:
receiving a model to be trained and a training sample input by a user;
each node server in the cluster server trains a model to be trained according to a training sample, wherein the training sample comprises sample data and an identifier corresponding to the sample data;
and when the difference between the identifier obtained by computing the sample data in the training sample with the trained model and the identifier corresponding to the sample data is smaller than a preset threshold, taking the trained model as a target model.
In some implementations of the first aspect, training, by each node server in the cluster servers, a model to be trained according to a training sample includes:
each node server in the cluster server runs in parallel, and based on a model to be trained, sample data in a training sample is calculated to obtain an identifier;
and adjusting parameters in the model to be trained according to the difference between the identifier and the identifier corresponding to the sample data.
In some implementation manners of the first aspect, when a difference between an identifier obtained by calculating sample data in a training sample by using a trained model and an identifier corresponding to the sample data is smaller than a preset threshold, taking the trained model as a target model includes:
when the difference between the identifier obtained by computing the sample data in the training sample with the trained model and the identifier corresponding to the sample data is smaller than a preset threshold, generating a training stop command, so that each node server in the cluster server stops training based on the training stop command;
and taking the trained model as a target model.
In some implementations of the first aspect, after the training results in the target model, the method further includes:
and training the node server to obtain the target model, and sending the target model to other node servers in the cluster server for the other node servers to perform corresponding calculation based on the target model.
In some implementations of the first aspect, the method further comprises:
and selecting the target model with the minimum difference as the optimal model according to the difference corresponding to the target model in each node server.
In some implementations of the first aspect, prior to receiving the user-input model to be trained, the method further comprises:
when other models except the model to be trained exist in the cluster server, the other models are deleted.
In some implementations of the first aspect, the method further comprises:
receiving a new training sample input by a user;
and adjusting parameters in the target model based on the new training sample according to a preset period.
According to a second aspect of the present disclosure, there is provided a distributed model training system comprising a plurality of node servers, wherein:
the node server is used for receiving a model to be trained and a training sample input by a user;
each node server is used for training a model to be trained according to a training sample, wherein the training sample comprises sample data and an identifier corresponding to the sample data; and when the difference between the identifier obtained by computing the sample data in the training sample with the trained model and the identifier corresponding to the sample data is smaller than a preset threshold, the trained model is taken as a target model.
According to a third aspect of the present disclosure, an electronic device is provided. The electronic device includes: a memory having a computer program stored thereon and a processor that, when executing the program, implements the method of the first aspect and any possible implementation of the first aspect.
According to a fourth aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method as in the first aspect, and any possible implementation manner of the first aspect.
The distributed model training method, system, device and storage medium provided by the present disclosure are applied to a cluster server comprising a plurality of node servers. The cluster server first receives a model to be trained and a training sample input by a user; each node server in the cluster server then trains the model to be trained according to the training sample, wherein the training sample comprises sample data and an identifier corresponding to the sample data; and when the difference between the identifier obtained by computing the sample data in the training sample with the trained model and the identifier corresponding to the sample data is smaller than a preset threshold, the trained model is taken as a target model. During training, each node server in the cluster server can perform model training based on the model to be trained and the training sample, so that training is not limited to a single device and the efficiency of model training can be improved.
It should be understood that the content described in this section is not intended to identify key or critical features of the embodiments of the present disclosure, nor is it intended to limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. The accompanying drawings are included to provide a further understanding of the present disclosure, and are not intended to limit the disclosure thereto, and the same or similar reference numerals will be used to indicate the same or similar elements, where:
FIG. 1 illustrates a flow diagram of a distributed model training method of an embodiment of the present disclosure;
FIG. 2 illustrates a block diagram of a distributed model training system in accordance with an embodiment of the present disclosure;
FIG. 3 illustrates a block diagram of an exemplary electronic device capable of implementing embodiments of the present disclosure.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some, but not all embodiments of the present disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.
In addition, the term "and/or" herein merely describes an association relationship between associated objects, indicating that three relationships may exist; for example, A and/or B may mean that A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship.
Typically, a single model file is shared among a plurality of node servers via Network Attached Storage (NAS), so that each node does not need to hold its own copy of the model.
However, training a model under such a network structure suffers from low efficiency.
In order to solve the problem of low training efficiency, the present disclosure provides a distributed model training method, system, device and storage medium, which are applied to a cluster server including a plurality of node servers. First, a model to be trained and a training sample input by a user are received; each node server in the cluster server then trains the model to be trained according to the training sample, wherein the training sample comprises sample data and an identifier corresponding to the sample data; and finally, when the difference between the identifier obtained by computing the sample data in the training sample with the trained model and the identifier corresponding to the sample data is smaller than a preset threshold, the trained model is taken as a target model. Because each node server in the cluster server performs model training based on the model to be trained and the training sample, the training efficiency can be improved.
The technical solutions provided by the embodiments of the present disclosure are described below with reference to the accompanying drawings.
Fig. 1 is a schematic flowchart of a distributed model training method according to an embodiment of the present disclosure, which is applied to a cluster server including a plurality of node servers.
As shown in fig. 1, the distributed model training method may specifically include:
s101: and receiving a model to be trained and a training sample input by a user.
S102: and each node server in the cluster server trains the model to be trained according to a training sample, wherein the training sample comprises sample data and an identifier corresponding to the sample data.
S103: and when the difference value of the identifier obtained by calculating the sample data in the training sample by the trained model and the identifier corresponding to the sample data is smaller than a preset threshold value, taking the trained model as a target model.
In the distributed model training method provided by the present disclosure, each node server in the cluster server performs model training based on the model to be trained and the training sample, so that training is not limited to a single device and the efficiency of model training can be improved.
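As a concrete illustration of the data involved in steps S101 to S103, the sketch below represents a training sample as sample data plus its corresponding identifier, and expresses the acceptance criterion of S103 as a threshold check on the difference between the identifier computed by the trained model and the identifier attached to the sample data. The class and function names are illustrative assumptions, and whether the difference is evaluated per sample or on average is not specified by the disclosure; the sketch averages it.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class TrainingSample:
    sample_data: Sequence[float]  # the raw sample data
    identifier: float             # the identifier corresponding to the sample data


def meets_target(model: Callable[[Sequence[float]], float],
                 samples: Sequence[TrainingSample],
                 threshold: float) -> bool:
    """S103: the trained model becomes the target model when the difference between
    the identifier it computes and the identifier attached to the sample data is
    smaller than the preset threshold (averaged over the samples in this sketch)."""
    differences = [abs(model(s.sample_data) - s.identifier) for s in samples]
    return sum(differences) / len(differences) < threshold
```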
In one embodiment, the training process in S102 may be performed in parallel to further increase the training speed. The specific training process may include: each node server in the cluster server runs in parallel and, based on the model to be trained, computes the sample data in the training sample to obtain an identifier; the parameters in the model to be trained are then adjusted according to the difference between the obtained identifier and the identifier corresponding to the sample data.
Because the training processes described above can occur in parallel, the training speed can be further increased.
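A minimal per-node training loop consistent with this parallel scheme might look as follows; the linear model, learning rate, and gradient-style update are assumptions of this sketch, since the disclosure does not fix a model family or an update rule. Each node server would run this function in its own process on its copy of the model to be trained.

```python
import numpy as np


def train_on_node(weights: np.ndarray,
                  sample_data: np.ndarray,   # shape (n_samples, n_features)
                  identifiers: np.ndarray,   # shape (n_samples,)
                  lr: float = 0.01,
                  epochs: int = 100) -> np.ndarray:
    """Per-node training: compute an identifier for the sample data with the model,
    then adjust the model parameters according to the difference between the computed
    identifier and the identifier corresponding to the sample data."""
    w = weights.copy()
    for _ in range(epochs):
        computed = sample_data @ w                         # identifier computed by the model
        diff = computed - identifiers                      # difference from the given identifier
        w -= lr * sample_data.T @ diff / len(identifiers)  # parameter adjustment
    return w
```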
In order to save computing resources, in an embodiment of S103, when the difference between the identifier obtained by computing the sample data in the training sample with the trained model and the identifier corresponding to the sample data is smaller than a preset threshold, a training stop command may be generated, so that each node server in the cluster server stops training based on the training stop command; the trained model is then taken as the target model.
In the process, after a model meeting the requirements is obtained through training, other training processes which exist at the same time can be stopped, so that limited computing resources can be saved, and the utilization rate of the resources is improved.
The cluster server may check whether a training stop command exists at a preset frequency, for example once every 30 seconds, and the frequency may be adjusted according to the actual situation.
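One way to realize the stop command and the periodic check is a shared flag that the node reaching the threshold sets and that every node polls at the preset frequency (30 seconds in the text); the multiprocessing event used below is only an assumed coordination mechanism, and `train_one_slice` is a hypothetical callable standing in for a node's actual training step.

```python
import multiprocessing as mp
import time


def node_control_loop(stop_event, train_one_slice, poll_seconds: float = 30.0) -> None:
    """Sketch of a node's control loop: train in short slices and, at the preset
    frequency, check whether a training stop command has been issued; the node that
    first reaches the preset threshold issues the command itself."""
    last_check = time.monotonic()
    while True:
        reached_threshold = train_one_slice()  # returns True once the difference < threshold
        if reached_threshold:
            stop_event.set()                   # generate the training stop command
            break
        if time.monotonic() - last_check >= poll_seconds:
            last_check = time.monotonic()
            if stop_event.is_set():            # another node already produced the target model
                break


# Usage (assumed): stop_event = mp.Event(); one mp.Process per node server runs
# node_control_loop(stop_event, its_training_slice_function).
```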
In order to enable the other node servers to obtain the trained target model and perform subsequent calculations, in an embodiment, after the target model is obtained through training, the method may further include: the node server that obtains the target model through training sends the target model to the other node servers in the cluster server, so that the other node servers perform corresponding calculations based on the target model. Sending the target model to the other node servers in this way realizes the deployment of the trained model, that is, the deployment of the target model.
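As a sketch of this deployment step, the serialized target model could simply be copied to a location that each of the other nodes reads from; the pickle-and-shared-directory transfer below is an assumed mechanism, as the disclosure does not prescribe one.

```python
import pickle
from pathlib import Path


def deploy_target_model(target_model, other_node_dirs: list[Path]) -> None:
    """The node server that trained the target model sends it to the other node
    servers so that they can perform corresponding calculations based on it."""
    payload = pickle.dumps(target_model)
    for node_dir in other_node_dirs:
        node_dir.mkdir(parents=True, exist_ok=True)
        (node_dir / "target_model.pkl").write_bytes(payload)
```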
In order to make maximum use of the computing resources of the plurality of node servers and obtain the model with the best effect, in one embodiment, the target model with the smallest difference is selected as the optimal model according to the difference corresponding to the target model obtained through training in each node server. In this way, an optimal model is selected from a plurality of target models that satisfy the preset condition, yielding a high-precision model.
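Selecting the optimal model then amounts to comparing the per-node differences and keeping the smallest; the sketch below assumes each node reports its target model together with the difference it achieved.

```python
def select_optimal_model(node_results: dict) -> object:
    """node_results maps a node id to a (target_model, difference) pair; the target
    model with the smallest difference is selected as the optimal model."""
    best_node = min(node_results, key=lambda node: node_results[node][1])
    optimal_model, _ = node_results[best_node]
    return optimal_model
```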
In order to provide the cluster server with sufficient storage space and computing resources to train the model to be trained, in an embodiment, when models other than the model to be trained exist in the cluster server, those other models may be deleted. The cluster server then has sufficient storage space and computing resources to train the model to be trained, which further improves the efficiency of model training.
In one embodiment, during training, parameters can be adjusted automatically through the interface of an existing tool: the candidate parameters are passed to the source code, which is run with different sample data, so that the optimal parameters can be selected from multiple sets of parameters.
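The automatic adjustment described above amounts to trying multiple parameter sets and keeping the best one; the plain grid search below illustrates the idea without assuming any particular tuning tool, and `run_training` is a hypothetical callable that trains with the given parameters and returns the resulting difference.

```python
from itertools import product


def tune_parameters(run_training, param_grid: dict, samples) -> dict:
    """Try every combination of candidate parameter values, run training with each,
    and keep the set of parameters that yields the smallest difference."""
    names = list(param_grid)
    best_params, best_diff = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        diff = run_training(samples, **params)  # assumed to return the resulting difference
        if diff < best_diff:
            best_params, best_diff = params, diff
    return best_params
```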
Furthermore, in order for the model to continuously learn from the latest sample data and adapt to the changing actual situation, in one embodiment, a new training sample input by the user may be received, and the parameters in the target model may be adjusted based on the new training sample according to a preset period. In this way, the model can continuously learn from the latest sample data and adapt to the continuously changing actual situation.
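A periodic update of the target model with newly received samples could be scheduled as in the sketch below; the sleep-based loop and the `receive_new_samples` and `fine_tune` callables are assumptions standing in for the preset period and the adjustment step.

```python
import time


def periodic_update(target_model, receive_new_samples, fine_tune, period_seconds: float) -> None:
    """At every preset period, collect the new training samples input by the user
    and adjust the parameters in the target model based on them."""
    while True:
        time.sleep(period_seconds)
        new_samples = receive_new_samples()
        if new_samples:
            fine_tune(target_model, new_samples)
```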
In the distributed model training method provided by the present disclosure, each node server in the cluster server can perform model training in parallel based on the model to be trained and the training sample, so that training is not limited to a single device and the efficiency of model training can be improved.
It is noted that while for simplicity of explanation, the foregoing method embodiments have been described as a series of acts or combination of acts, it will be appreciated by those skilled in the art that the present disclosure is not limited by the order of acts, as some steps may, in accordance with the present disclosure, occur in other orders and concurrently. Further, those skilled in the art should also appreciate that the embodiments described in the specification are exemplary embodiments and that acts and modules referred to are not necessarily required by the disclosure.
The above is a description of method embodiments, and the following is a further description of the embodiments of the present disclosure by way of system embodiments.
FIG. 2 illustrates a block diagram of a distributed model training system 200 according to an embodiment of the present disclosure. As shown in FIG. 2, the distributed model training system 200 may include a plurality of node servers, wherein:
the node server can be used for receiving a model to be trained and a training sample input by a user;
each node server can be used for training a model to be trained according to a training sample, wherein the training sample comprises sample data and an identifier corresponding to the sample data;
and when the difference between the identifier obtained by computing the sample data in the training sample with the trained model and the identifier corresponding to the sample data is smaller than a preset threshold, the trained model is taken as a target model.
In one embodiment, each node server runs in parallel and, based on the model to be trained, computes the sample data in the training sample to obtain an identifier; the parameters in the model to be trained are then adjusted according to the difference between the obtained identifier and the identifier corresponding to the sample data.
In one embodiment, when the difference between the identifier obtained by computing the sample data in the training sample with the trained model and the identifier corresponding to the sample data is smaller than a preset threshold, a training stop command is generated, so that each node server in the cluster server stops training based on the training stop command, and the trained model is taken as a target model.
In one embodiment, the node server that obtains the target model through training sends the target model to the other node servers in the cluster server, so that the other node servers perform corresponding calculations based on the target model.
In an embodiment, at least one node server in the model training system may select, as the optimal model, the target model with the smallest difference according to the difference corresponding to the target model in each node server.
In one embodiment, before at least one node server receives a model to be trained and training samples input by a user, when other models except the model to be trained exist in a model training system, the other models are deleted.
In one embodiment, the at least one node server may be further configured to receive a new training sample input by a user; and adjusting parameters in the target model based on the new training sample according to a preset period.
In the distributed model training system provided by the present disclosure, each node server in the model training system can perform model training in parallel based on the model to be trained and the training sample, so that training is not limited to a single device and the efficiency of model training can be improved.
It can be understood that the node servers in the distributed model training system shown in fig. 2 have functions of implementing each step of model training in fig. 1, and can achieve corresponding technical effects, and for brevity, no further description is given here.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the described module may refer to the corresponding process in the foregoing method embodiment, and is not described herein again.
In the technical scheme of the disclosure, the acquisition, storage, application and the like of the personal information of the related user all accord with the regulations of related laws and regulations, and do not violate the good customs of the public order.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in FIG. 3, the device 300 comprises a computing unit 301, which may perform various suitable actions and processes in accordance with a computer program stored in a Read Only Memory (ROM) 302 or a computer program loaded from a storage unit 308 into a Random Access Memory (RAM) 303. In the RAM 303, various programs and data required for the operation of the device 300 can also be stored. The computing unit 301, the ROM 302, and the RAM 303 are connected to each other via a bus 304. An input/output (I/O) interface 305 is also connected to the bus 304.
Various components in device 300 are connected to I/O interface 305, including: an input unit 306 such as a keyboard, a mouse, or the like; an output unit 307 such as various types of displays, speakers, and the like; a storage unit 308 such as a magnetic disk, optical disk, or the like; and a communication unit 309 such as a network card, modem, wireless communication transceiver, etc. The communication unit 309 allows the device 300 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 301 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 301 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 301 performs the various methods and processes described above, such as a distributed model training method. For example, in some embodiments, the distributed model training method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 308. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 300 via ROM 302 and/or communication unit 309. When loaded into RAM 303 and executed by computing unit 301, may perform one or more steps of the distributed model training method described above. Alternatively, in other embodiments, the computing unit 301 may be configured to perform the distributed model training method in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (10)

1. A distributed model training method is applied to a cluster server comprising a plurality of node servers, and comprises the following steps:
receiving a model to be trained and a training sample input by a user;
each node server in the cluster server trains the model to be trained according to the training sample, wherein the training sample comprises sample data and an identifier corresponding to the sample data;
and when the difference value between the identifier obtained by calculating the sample data in the training sample by the trained model and the identifier corresponding to the sample data is smaller than a preset threshold value, taking the trained model as a target model.
2. The method of claim 1, wherein each node server in the cluster servers trains the model to be trained according to the training samples, and comprises:
each node server in the cluster server runs in parallel, and based on the model to be trained, sample data in the training sample is calculated to obtain an identifier;
and adjusting parameters in the model to be trained according to the difference between the identifier and the identifier corresponding to the sample data.
3. The method according to claim 1 or 2, wherein when the difference between the identifier obtained by calculating the sample data in the training sample by the trained model and the identifier corresponding to the sample data is smaller than a preset threshold, taking the trained model as a target model comprises:
when the difference value between the identifier obtained by calculating the sample data in the training sample by the trained model and the identifier corresponding to the sample data is smaller than a preset threshold value, generating a training stop command for each node server in the cluster servers to stop training based on the training stop command;
and taking the trained model as a target model.
4. The method of claim 3, wherein after training to obtain the target model, the method further comprises:
and training the node server to obtain the target model, and sending the target model to other node servers in the cluster server for the other node servers to perform corresponding calculation based on the target model.
5. The method of claim 1, further comprising:
and selecting the target model with the minimum difference as the optimal model according to the difference corresponding to the target model in each node server.
6. The method of claim 1, wherein prior to receiving a user-input model to be trained, the method further comprises:
and when other models except the model to be trained exist in the cluster server, deleting the other models.
7. The method of claim 1, further comprising:
receiving a new training sample input by a user;
and adjusting parameters in the target model based on the new training sample according to a preset period.
8. A distributed model training system, characterized in that the model training system comprises a plurality of node servers, wherein:
the node server is used for receiving a model to be trained and a training sample input by a user;
each node server is used for training the model to be trained according to the training sample, wherein the training sample comprises sample data and an identifier corresponding to the sample data; and when the difference value between the identifier obtained by calculating the sample data in the training sample by the trained model and the identifier corresponding to the sample data is smaller than a preset threshold value, taking the trained model as a target model.
9. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
10. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-7.
CN202111301761.7A 2021-11-04 2021-11-04 Distributed model training method, system, device and storage medium Pending CN114139605A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111301761.7A CN114139605A (en) 2021-11-04 2021-11-04 Distributed model training method, system, device and storage medium


Publications (1)

Publication Number Publication Date
CN114139605A (en) 2022-03-04

Family

ID=80392789

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111301761.7A Pending CN114139605A (en) 2021-11-04 2021-11-04 Distributed model training method, system, device and storage medium

Country Status (1)

Country Link
CN (1) CN114139605A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114723045A (en) * 2022-04-06 2022-07-08 北京百度网讯科技有限公司 Model training method, device, system, apparatus, medium, and program product



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Room 1-103, Commercial Room 1, Floor 1, Building 3, No. 105, Yaojiayuan Road, Chaoyang District, Beijing 100025

Applicant after: LETV new generation (Beijing) Cultural Media Co.,Ltd.

Address before: 100025 1502, 12 / F, building 3, 105 yaojiayuan Road, Chaoyang District, Beijing

Applicant before: LETV new generation (Beijing) Cultural Media Co.,Ltd.