CN111445027A - Training method and device of machine learning model - Google Patents

Training method and device of machine learning model

Info

Publication number
CN111445027A
Authority
CN
China
Prior art keywords
training
parameter server
machine learning
parameter
learning model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910041301.1A
Other languages
Chinese (zh)
Other versions
CN111445027B (en)
Inventor
张强
谈政荣
王栎汉
姚小龙
蔡适择
陈敏
任亚坤
陈军
龚杰文
韩兆鸣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SF Technology Co Ltd
Original Assignee
SF Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SF Technology Co Ltd filed Critical SF Technology Co Ltd
Priority to CN201910041301.1A priority Critical patent/CN111445027B/en
Publication of CN111445027A publication Critical patent/CN111445027A/en
Application granted granted Critical
Publication of CN111445027B publication Critical patent/CN111445027B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Abstract

The application discloses a training method and a device for a machine learning model. The method comprises the following steps: issuing the model parameters and the training data set, divided into m parts, to (1+a)×m parameter server groups for training and receiving the training results of the parameter server groups, where m is a positive integer, 0 < a < 1, and (1+a)×m is rounded up to the nearest integer; the training results are then processed according to the number of training iterations. Because the model parameters and the training data set divided into m parts are issued to (1+a)×m parameter server groups, the number of parameter server groups is increased; when a parameter server group fails and cannot train its data, a standby parameter server group continues to process the data, so the training process proceeds normally and the training efficiency of the machine learning model is improved.

Description

Training method and device of machine learning model
Technical Field
The invention relates to the technical field of information, in particular to a training method and a training device for a machine learning model.
Background
In modern society, people increasingly rely on couriers to send and receive items. In particular, with the rapid development of electronic commerce, online shopping has become widely adopted. At the same time, with the advent of the big-data era, the express delivery business continuously generates massive amounts of data.
Nowadays, with the help of machine learning and artificial intelligence techniques, models are trained on large amounts of collected sample data, so that when new data are generated the trained model can be used to process them conveniently. In the training process of a model, parameter training consumes a long time, so shortening the parameter training time and improving the operating efficiency of the algorithm is a technical problem that urgently needs to be solved.
In the related art, training data are randomly divided into a certain number of parts and directly issued to the same number of training machines for parameter training, and a training result is obtained. However, during parameter training a training machine is prone to failure because of the huge amount of data it must process, which slows down training and in turn reduces the training efficiency of the machine learning model.
Disclosure of Invention
In view of the above defects or shortcomings in the prior art, it is desirable to provide a training method and apparatus for a machine learning model in which, by increasing the number of configured nodes, a standby node continues to process the data when a node fails and cannot train it, so that the training process proceeds normally and the training efficiency of the machine learning model is improved.
In a first aspect, the present application provides a training method for a machine learning model, including:
issuing the model parameters and the training data set divided into m parts to (1+a)×m parameter server groups for training; wherein m is a positive integer, 0 < a < 1, and (1+a)×m is rounded up to the nearest integer;
receiving a training result of the parameter server group;
and processing the training result according to the number of training iterations.
In a second aspect, the present application provides a training apparatus for a machine learning model, comprising:
the issuing module is used for issuing the model parameters and the training data set divided into m parts to (1+a)×m parameter server groups for training; wherein m is a positive integer, 0 < a < 1, and (1+a)×m is rounded up to the nearest integer;
the receiving module is used for receiving the training result of the parameter server group;
and the processing module is used for processing the training result according to the number of training iterations.
In summary, according to the training method and apparatus for a machine learning model provided in the embodiments of the present application, a training data set is divided into m parts and then sent, together with the model parameters, to (1+a)×m parameter server groups for training, where m is a positive integer, 0 < a < 1, and (1+a)×m is rounded up to the nearest integer. Because the number of parameter server groups is increased, when a parameter server group fails and cannot train its data, a standby parameter server group continues to process the data. The training results of the parameter server groups are then received and processed according to the number of training iterations. On this basis, the embodiments of the present application improve the training efficiency of the machine learning model while ensuring that the training process proceeds normally.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
fig. 1 is a schematic basic flow chart of a training method of a machine learning model according to an embodiment of the present disclosure;
FIG. 2 is an example of a training method for a machine learning model according to an embodiment of the present disclosure;
FIG. 3 is a training apparatus for machine learning models according to an embodiment of the present disclosure;
FIG. 4 is a block diagram of another apparatus for training a machine learning model according to an embodiment of the present disclosure;
FIG. 5 is a block diagram of a training apparatus for a machine learning model according to an embodiment of the present disclosure;
fig. 6 is a computer system according to an embodiment of the present disclosure.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the present invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
The embodiment of the application provides a training method of a machine learning model, and the method is applied to a terminal. It should be noted that the terminal referred to in the embodiments of the present application may include, but is not limited to, a Personal Computer (PC), a Personal Digital Assistant (PDA), a Tablet Computer (Tablet Computer), a wireless handheld device, a mobile phone, and the like.
For convenience of understanding and explanation, the following describes in detail a training method and apparatus of a machine learning model provided in an embodiment of the present application with reference to fig. 1 to 5.
Please refer to fig. 1, which is a basic flowchart of a training method of a machine learning model according to an embodiment of the present application, the method includes the following steps:
and S101, issuing the model parameters and the training data sets divided into m parts to (1+ a) × m parameter server groups for training.
Wherein m is a positive integer, 0< a <1, (1+ a) × m further rounded.
For example, after the training data set is randomly divided into m parts, it is sent together with the model parameters to (1+a)×m parameter server groups for training, and the extra a×m parameter server groups serve as standby parameter server groups, where m is an integer greater than 0 and the results of (1+a)×m and a×m are rounded up to the nearest integer. For example, when m is 1 and a is 10%, (1+a)×m equals 1.1, which is rounded up to 2, that is, 2 parameter server groups; when m is 6 and a is 20%, (1+a)×m equals 7.2, which is rounded up to 8, that is, 8 parameter server groups. Because the number of parameter server groups is increased, when a parameter server group fails and cannot train its data, a standby parameter server group can continue processing the data according to the model parameters and training results recorded in the failed group's log at the moment it went down, which effectively improves the training efficiency of the machine learning model.
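For ease of understanding, the following is a minimal Python sketch of this group-level division, assuming that the data distribution gateway holds the whole training data set in memory; the names split_for_groups and num_groups are illustrative only and are not taken from the patent.

```python
import math
import random

def split_for_groups(data, m, a):
    """Randomly divide `data` into m shards and compute how many parameter
    server groups to use: (1+a)*m rounded up, the extra groups being spares."""
    data = list(data)
    random.shuffle(data)                      # random division, as in S101
    shards = [data[i::m] for i in range(m)]   # m roughly equal shards
    num_groups = math.ceil((1 + a) * m)       # e.g. m=6, a=0.2 -> 7.2 -> 8
    return shards, num_groups

# Example matching the description: 1000 records, m=5, a=10% -> 6 groups
shards, num_groups = split_for_groups(range(1000), m=5, a=0.10)
print(len(shards), len(shards[0]), num_groups)   # 5 200 6
```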
It should be noted that each parameter server group in the embodiments of the present application further includes a main parameter server and training parameter servers. The main parameter server distributes the training parameters and training data and aggregates the training results fed back by the training parameter servers, while the training parameter servers perform training according to the parameters distributed by the main parameter server.
For example, the main parameter server distributes the model parameters and the training data set issued to the group to (1+b)×n training parameter servers in the parameter server group for training, where n is a positive integer, 0 < b < 1, and (1+b)×n is rounded up to the nearest integer. For example, when n is 1 and b is 10%, (1+b)×n equals 1.1, which is rounded up to 2, that is, 2 training parameter servers; when n is 5 and b is 30%, (1+b)×n equals 6.5, which is rounded up to 7, that is, 7 training parameter servers. Because the number of training parameter servers is increased, the main parameter server can spread the same total amount of training data across more training parameter servers; the amount of data assigned to each training parameter server is reduced, which improves the training efficiency of the machine learning model.
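The intra-group distribution can be sketched in the same spirit; distribute_within_group below is an assumed, illustrative name, and the example numbers follow the worked example given later in the description (200 records, n = 3, b = 20%).

```python
import math

def distribute_within_group(model_params, group_shard, n, b):
    """Sketch of the intra-group scatter: every training parameter server
    receives the same model parameters plus a roughly equal slice of the
    shard that was issued to this parameter server group."""
    num_servers = math.ceil((1 + b) * n)      # e.g. n=5, b=0.3 -> 6.5 -> 7
    slices = [group_shard[i::num_servers] for i in range(num_servers)]
    return [(model_params, s) for s in slices]

# Example from the description: 200 records, n=3, b=20% -> 4 servers, 50 each
tasks = distribute_within_group({"w": 0.0}, list(range(200)), n=3, b=0.20)
print(len(tasks), len(tasks[0][1]))           # 4 50
```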
In each parameter server group, one parameter server is designated as the main parameter server of the group, i.e., the master node (master), and the other parameter servers serve as training parameter servers, i.e., slave nodes (slave). When the master node cannot communicate normally with the slave nodes because of a poor network connection, a power outage, a crash, or the like, that is, when the master node can no longer receive and respond to the heartbeat messages of the slave nodes, one of the (1+b)×n training parameter servers is designated or randomly selected to serve as the new main parameter server, or an additionally configured standby parameter server is used as the main parameter server.
Regarding designating a training parameter server as the main parameter server: suppose, for example, that there are 10 training parameter servers, numbered training parameter server 1, training parameter server 2, ..., training parameter server 10, which take over in sequential order. When the main parameter server fails, training parameter server 1 takes over as the main parameter server; the data of training parameter server 1 can either continue to be trained by it in its new role, or be passed to training parameter server 2 for training. No data needs to be transferred from the original main parameter server when it fails, and once its failure is resolved the original main parameter server can serve as a training parameter server and take on the task of training data. When training parameter server 1 fails, training parameter server 2 takes over as the main parameter server, and the data of training parameter server 1 is passed to training parameter server 2 for training; at this point the data of training parameter server 2 comprises the remaining data that training parameter server 1 had not finished training plus the training data originally assigned to training parameter server 2. Alternatively, training parameter server 2 may pass both the remaining untrained data of training parameter server 1 and its own remaining untrained data to training parameter server 3 for training. Because the model parameters on every training parameter server are identical and only the specific training data differ, a training parameter server can hand the remaining untrained data over to a fault-free training parameter server, which keeps the training process running normally. By analogy, when training parameter server 9 fails, training parameter server 10 takes over as the main parameter server and the data of training parameter server 9 is passed to training parameter server 10 for training; at this point the data of training parameter server 10 comprises the remaining data that training parameter server 9 had not finished training plus the training data originally assigned to training parameter server 10. Alternatively, training parameter server 10 may pass the remaining untrained data of training parameter server 9, together with its own remaining untrained data, to training parameter server 1 for training.
For example, when there are 10 training parameter servers, numbered training parameter server 1, training parameter server 2, ..., training parameter server 10, and the main parameter server fails, training parameter server 2 may be randomly selected from the 10 training parameter servers as the new main parameter server; the data of training parameter server 2 can either continue to be trained by it, or be passed to training parameter server 3 for training. By analogy, the data of training parameter server 10 can continue to be trained by it, or be passed to training parameter server 1 for training.
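The take-over order described above can be sketched as follows; this is a simplified illustration under the assumption that each server keeps an in-memory list of its untrained data, and the names next_alive and hand_over are hypothetical rather than taken from the patent.

```python
def next_alive(servers, failed_idx):
    """Find the successor of a failed node in cyclic numbering order,
    skipping any other failed nodes (server 1 -> 2 -> ... -> 10 -> 1)."""
    k = len(servers)
    for step in range(1, k):
        candidate = (failed_idx + step) % k
        if servers[candidate]["alive"]:
            return candidate
    raise RuntimeError("no training parameter server left alive")

def hand_over(servers, failed_idx):
    """The successor takes over the untrained remainder of the failed
    server's shard; model parameters are identical everywhere, so only
    the remaining training data needs to move."""
    servers[failed_idx]["alive"] = False
    successor = next_alive(servers, failed_idx)
    servers[successor]["pending"].extend(servers[failed_idx]["pending"])
    servers[failed_idx]["pending"] = []
    return successor

# Example: server 1 (index 0) fails, server 2 (index 1) takes over its data
servers = [{"alive": True, "pending": list(range(10 * i, 10 * i + 10))}
           for i in range(10)]
print(hand_over(servers, 0), len(servers[1]["pending"]))   # 1 20
```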
For example, fig. 2 shows an example of the training method of a machine learning model provided in an embodiment of the present application. First, the data distribution gateway randomly divides the training data set into m parts and sends the model parameters and the m parts of the training data set to (1+10%)×m parameter server groups, with a = 10%, such as parameter server group 1, parameter server group 2, ..., parameter server group 1.1m-1, parameter server group 1.1m. Then, the main parameter server corresponding to the master node of each parameter server group distributes the model parameters and the training data set issued to that group to the (1+10%)×n training parameter servers corresponding to the slave nodes of the group for training, with b = 10%, such as slave node 1, slave node 2, ..., slave node 1.1n-1, slave node 1.1n. Finally, the (1+10%)×n training parameter servers corresponding to the slave nodes feed their training results back to the main parameter server corresponding to the master node.
After the main parameter server corresponding to the master node receives the feedback data of the training parameter servers corresponding to the first 90% of the slave nodes, it discards abnormal values in the training results according to the Euclidean distance, calculates the average of the remaining training results, and feeds this average back to the data distribution gateway as the training result of the model. The number of training parameter servers corresponding to 90% of the slave nodes is rounded down to the nearest integer: for example, when there are 3 training parameter servers corresponding to slave nodes, the result is 2.7, which is rounded down to 2; when there are 7, the result is 6.3, which is rounded down to 6. It should be noted that, to balance the calculation accuracy of the training result against training efficiency, the main parameter server in the embodiments of the present application receives the feedback data of the first 90% of the training parameter servers; of course, the main parameter server can also receive feedback data from other proportions of the training parameter servers, such as 80% or 95%, which is not limited by the embodiments of the present application.
A specific implementation is illustrated for ease of understanding. First, 1000 pieces of training data are divided into 5 parts, so each training data set contains 200 pieces of training data and m = 5. Second, the 5 training data sets are issued to 6 parameter server groups: 5 parameter server groups each obtain one part of the data for training, and 1 parameter server group serves as a standby parameter server group, with a = 10%. Third, the main parameter server in each parameter server group distributes its 200 pieces of training data to 4 training parameter servers, so each training parameter server trains on 50 pieces of training data, with b = 20% and n = 3. The main parameter server corresponding to the master node then collects the feedback data of the training parameter servers corresponding to the first 3 slave nodes. Here, the Euclidean distance is a commonly used distance measure that represents the true distance between two points in a multidimensional space.
The method adopted by the main parameter server to discard abnormal values in the training results according to the Euclidean distance may include, but is not limited to, the K-means algorithm and the K-medoids (K-center-point) algorithm. For ease of understanding, the K-means algorithm is used as an example in the embodiments of the present application; since the K-medoids algorithm is a known technique, it is not described in detail here.
The K-means algorithm is a typical distance-based clustering algorithm that uses distance as the measure of similarity: the closer two objects are, the higher their similarity, and similar data are assigned to the same cluster. Specifically, the K-means algorithm randomly selects k objects from the c data objects as initial cluster centers and assigns each of the remaining data objects to the most similar cluster (represented by its cluster center) according to its distance from the cluster centers. Then, after the cluster center of each new cluster (i.e., the mean of all the data objects in that cluster) is computed, the k clusters are re-partitioned according to the distances between the data objects and the newly computed cluster centers. This process is repeated until the clustering criterion function converges; the algorithm uses the sum of squared errors of the data as the clustering criterion function.
The K-means algorithm and the K-medoids algorithm are clustering algorithms, i.e., they divide all the data into k clusters. In the embodiment of the present application, only 1 cluster needs to be formed, that is, k = 1. Specifically, the main parameter server corresponding to the master node calculates the average of the training results fed back by the training parameter servers corresponding to the slave nodes and uses this average as the cluster center; it then calculates the Euclidean distance between each training result and the cluster center; if a Euclidean distance is greater than a preset threshold, for example 5%, the corresponding training result is discarded as an abnormal value.
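The outlier rejection can be sketched as follows. The patent does not spell out what the 5% threshold is measured against, so the sketch below simply treats the threshold as an absolute distance bound; filter_and_average is an illustrative name.

```python
import numpy as np

def filter_and_average(results, threshold):
    """Single-cluster (k = 1) outlier rejection: the cluster centre is the
    mean of all feedback results, any result whose Euclidean distance to
    the centre exceeds the threshold is discarded, and the mean of the
    remaining results is returned."""
    results = np.asarray(results, dtype=float)
    centre = results.mean(axis=0)                       # cluster centre
    distances = np.linalg.norm(results - centre, axis=1)
    kept = results[distances <= threshold]
    return kept.mean(axis=0)

# Example: the fourth feedback vector lies far from the others and is dropped
print(filter_and_average([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [9.0, 9.0]],
                         threshold=5.0))   # approximately [1.0, 1.0]
```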
It should be noted that, in the embodiments of the present application, model parallelism is implemented between the parameter server groups and data parallelism is implemented within each parameter server group, trading resources for time. By increasing the number of configured nodes, a standby node continues to process the data when a node fails and cannot train it, which ensures that the training process proceeds normally and further improves the training efficiency of the machine learning model.
Meanwhile, each parameter server is deployed in a Docker container, and Kubernetes is used to manage the Docker-based parameter servers and monitor the running state of the containers in real time. Docker is an open-source application container engine that lets developers package an application and its dependencies into a portable container and then distribute it to any popular Linux machine, and it also supports virtualization; the containers rely entirely on a sandbox mechanism, and there are no interfaces between containers.
S102, receiving a training result of the parameter server group.
For example, the data distribution gateway receives the training result of each parameter server group. This training result is the average computed by the main parameter server corresponding to the master node of the group: after that main parameter server receives the feedback data of the training parameter servers corresponding to the slave nodes, it discards abnormal values in the feedback data according to the Euclidean distance and averages the remaining feedback data.
It should be noted that the data distribution gateway in the embodiments of the present application is responsible for issuing the training data set, collecting the training results, and performing the corresponding processing. Of course, a terminal having the same functions as the data distribution gateway may also be used, which is not limited by the embodiments of the present application.
S103, processing the training result according to the number of training iterations.
Specifically, when the number of training iterations is equal to the preset number, the average of the training results is calculated and output; when the number of training iterations is less than the preset number, the model parameters and the training data set divided into m parts are issued again to the (1+a)×m parameter server groups for training. By controlling the number of training iterations so that it reaches the preset number, the embodiments of the present application ensure that the data are sufficiently trained while the calculation accuracy is guaranteed.
For example, after receiving the feedback data of the first 90% of the parameter server groups, the data distribution gateway calculates the average of that feedback data as the training result of the model. The number corresponding to 90% of the parameter server groups is rounded down to the nearest integer: for example, when there are 5 parameter server groups, the result is 4.5, so the data distribution gateway collects the feedback data of the first 4 parameter server groups; when there are 10 parameter server groups, the data distribution gateway collects the feedback data of the first 9 parameter server groups. It should be noted that, to balance the calculation accuracy of the training result against training efficiency, the data distribution gateway in the embodiments of the present application receives the feedback data of the first 90% of the parameter server groups; of course, it can also receive feedback data from other proportions of the parameter server groups, such as 80% or 95%, which is not limited by the embodiments of the present application.
Under normal conditions, the training parameter servers in each parameter server group feed data back to the main parameter server of the group only after the number of training iterations reaches the preset number. In special cases, however, for example when a training parameter server in the group goes down, the result obtained so far is fed back directly to the main parameter server even though the number of training iterations has not reached the preset number. The main parameter server has no ability to check the number of training iterations, whereas the data distribution gateway can monitor the parameter servers and check the number of iterations; therefore, as a safeguard, the data distribution gateway determines whether the number of training iterations of a parameter server group meets the training requirement, which ensures that the data are sufficiently trained. If the number of training iterations has reached the preset number, the data distribution gateway stops the training of the parameter server groups and outputs the training result; if it has not, then in order to keep the whole training process proceeding in an orderly manner, the data distribution gateway issues the model parameters and the divided training data sets to each parameter server group again for training.
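The gateway-side decision can be sketched as follows; gateway_step is an illustrative name, the list of group results is assumed to be ordered by arrival, and each group's feedback is assumed to be a single scalar, whereas in practice it would be a parameter vector.

```python
import math

def gateway_step(group_results, num_groups, train_count, preset_count):
    """Average the feedback of the first 90% of the parameter server groups
    (rounded down) when the preset number of training iterations has been
    reached; otherwise signal that another round of training is needed."""
    needed = math.floor(0.9 * num_groups)     # e.g. 5 groups -> 4, 10 -> 9
    collected = group_results[:needed]        # first results to arrive
    if train_count >= preset_count:
        return sum(collected) / len(collected), False   # final result, stop
    return None, True                         # re-issue parameters and data

# Example: 5 groups -> use the first 4 results; preset count already reached
result, reissue = gateway_step([0.52, 0.48, 0.50, 0.51, 0.49], 5, 10, 10)
print(result, reissue)   # 0.5025 False
```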
It should be noted that, in the embodiments of the present application, when the model parameters and the training data set are issued to each parameter server group again for training, the training parameter servers in the parameter server groups can also resume training according to the model parameters and training results recorded in the log at the time of the crash.
During the training of the machine learning model, each parameter server is deployed in a Docker container, and because container performance can be unstable, crashes are prone to occur. Therefore, the embodiments of the present application adopt a logging scheme: from the start of model training, the training parameter servers in each parameter server group write their training results and training iteration counts into a log, and after a restart the container can locate, from the log, the model parameters, training results, and iteration counts recorded at the time of the crash, so that the training state at the time of the crash can be quickly restored and waiting time is effectively reduced.
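A minimal sketch of such log-based recovery is given below, assuming one JSON line per checkpoint; the path and function names are illustrative and not taken from the patent.

```python
import json

LOG_PATH = "/tmp/param_server_train.log"   # illustrative path only

def write_checkpoint(model_params, train_result, train_count, path=LOG_PATH):
    """Append the current training state to the log as one JSON line."""
    with open(path, "a") as f:
        f.write(json.dumps({"params": model_params,
                            "result": train_result,
                            "count": train_count}) + "\n")

def recover_checkpoint(path=LOG_PATH):
    """Return the last state recorded before the crash, or None if no log exists."""
    try:
        with open(path) as f:
            lines = f.read().splitlines()
        return json.loads(lines[-1]) if lines else None
    except FileNotFoundError:
        return None

# Example: record one step, then restore it after a simulated restart
write_checkpoint({"w": 0.37}, 0.52, 3)
print(recover_checkpoint())   # {'params': {'w': 0.37}, 'result': 0.52, 'count': 3}
```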
In the training method of the machine learning model provided in the embodiments of the present application, a training data set is divided into m parts and then sent, together with the model parameters, to (1+a)×m parameter server groups for training, where m is a positive integer, 0 < a < 1, and (1+a)×m is rounded up to the nearest integer. Because the number of parameter server groups is increased, when a parameter server group fails and cannot train its data, a standby parameter server group continues to process the data. The training results of the parameter server groups are then received and processed according to the number of training iterations. On this basis, the embodiments of the present application improve the training efficiency of the machine learning model while ensuring that the training process proceeds normally.
Based on the foregoing embodiments, the present application provides a training apparatus for a machine learning model, and the apparatus may be applied to the training method for a machine learning model provided in the embodiments corresponding to fig. 1 to 2. Referring to fig. 3, the machine learning model training apparatus 3 includes:
the issuing module 31 is configured to issue the model parameters and the training data set divided into m parts to (1+a)×m parameter server groups for training; wherein m is a positive integer, 0 < a < 1, and (1+a)×m is rounded up to the nearest integer;
a receiving module 32, configured to receive a training result of the parameter server group;
and the processing module 33 is configured to process the training result according to the training times.
In other embodiments of the present application, as shown in fig. 4, the issuing module 31 further includes:
a distribution module 311, configured to distribute the model parameters and the m training data sets to (1+b)×n training parameter servers in the parameter server group for training; wherein n is a positive integer, 0 < b < 1, and (1+b)×n is rounded up to the nearest integer.
In other embodiments of the present application, as shown in fig. 5, the issuing module 31 further includes:
a discarding module 312, configured to discard abnormal values in the training results according to the training results fed back by the training parameter servers and the Euclidean distance;
a calculation module 313 for calculating an average of the remaining training results;
wherein the training result is a set of the abnormal value and the residual training result.
In other embodiments of the present application, the discarding module 312 is specifically configured to calculate an average value of the training results, where the average value is used as a cluster center;
respectively calculating the Euclidean distance between each training result and the clustering center;
and if the Euclidean distance is larger than a preset threshold value, discarding the training result corresponding to the Euclidean distance as an abnormal value.
In other embodiments of the present application, the processing module 33 is specifically configured to calculate and output the average of the training results when the number of training iterations is equal to the preset number.
In other embodiments of the present application, the processing module 33 is further configured to, when the number of training iterations is less than the preset number, issue the model parameters and the training data set divided into m parts to the (1+a)×m parameter server groups again for training.
In other embodiments of the present application, the processing module 33 is further configured to perform training in the parameter server group again according to the log recorded by the parameter server group; the log comprises the model parameters and training results recorded by the parameter server group when it went down.
It should be noted that, for the descriptions of the same steps and the same contents in this embodiment as those in other embodiments, reference may be made to the descriptions in other embodiments, which are not described herein again.
In the training device of the machine learning model provided in the embodiments of the present application, a training data set is divided into m parts and then sent, together with the model parameters, to (1+a)×m parameter server groups for training, where m is a positive integer, 0 < a < 1, and (1+a)×m is rounded up to the nearest integer. Because the number of parameter server groups is increased, when a parameter server group fails and cannot train its data, a standby parameter server group continues to process the data. The training results of the parameter server groups are then received and processed according to the number of training iterations. On this basis, the embodiments of the present application improve the training efficiency of the machine learning model while ensuring that the training process proceeds normally.
Based on the foregoing embodiments, the present application provides a computer system. Referring to fig. 6, the computer system 600 includes a Central Processing Unit (CPU)601 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)602 or a program loaded from a storage section into a Random Access Memory (RAM) 603. In the RAM603, various programs and data necessary for system operation are also stored. The CPU 601, ROM 602, and RAM603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
The following components are connected to the I/O interface 605: an input section 606 including a keyboard, a mouse, and the like; an output section 607 including a cathode ray tube (CRT), a liquid crystal display (LCD), and the like, as well as a speaker and the like; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card, a modem, and the like. The communication section 609 performs communication processing via a network such as the Internet. A drive 610 is also connected to the I/O interface 605 as necessary. A removable medium 611, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 610 as necessary, so that a computer program read out therefrom is installed into the storage section 608 as necessary.
In particular, according to embodiments of the present application, the process described above with reference to the flowchart fig. 1 may be implemented as a computer software program. For example, embodiment 1 of the present application includes a computer program product including a computer program carried on a computer-readable medium, the computer program being executed by the CPU 601 to implement the steps of:
issuing the model parameters and the training data set divided into m parts to (1+a)×m parameter server groups for training; wherein m is a positive integer, 0 < a < 1, and (1+a)×m is rounded up to the nearest integer;
receiving a training result of the parameter server group;
and processing the training result according to the number of training iterations.
In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 609, and/or installed from the removable medium 611.
It should be noted that the computer readable medium shown in the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products for machine learning model training according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software or by hardware, and the described units or modules may also be provided in a processor, which may, for example, be described as: a processor comprising a sending module, a receiving module, and a processing module. The names of these units or modules do not, in some cases, constitute a limitation on the units or modules themselves.
As another aspect, the present application also provides a computer-readable medium, which may be contained in the terminal described in the above embodiments; or may exist separately and not be assembled into the terminal. The computer readable medium carries one or more programs which, when executed by the terminal, cause the terminal to implement the method for training a machine learning model as in the above embodiments.
For example, the terminal may implement the following steps as shown in fig. 1: S101, issuing the model parameters and the training data set divided into m parts to (1+a)×m parameter server groups for training, wherein m is a positive integer, 0 < a < 1, and (1+a)×m is rounded up to the nearest integer; S102, receiving the training results of the parameter server groups; and S103, processing the training results according to the number of training iterations.
It should be noted that although in the above detailed description several modules or units of the terminal for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit, according to embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into embodiments by a plurality of modules or units.
Moreover, although the steps of the methods of the present disclosure are depicted in the drawings in a particular order, this does not require or imply that the steps must be performed in this particular order, or that all of the depicted steps must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions, etc.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware.
The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by a person skilled in the art that the scope of the invention as referred to in the present application is not limited to the embodiments with a specific combination of the above-mentioned features, but also covers other embodiments with any combination of the above-mentioned features or their equivalents without departing from the inventive concept. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims (10)

1. A method of training a machine learning model, the method comprising:
issuing the model parameters and the training data set divided into m parts to (1+a)×m parameter server groups for training; wherein m is a positive integer, 0 < a < 1, and (1+a)×m is rounded up to the nearest integer;
receiving a training result of the parameter server group;
and processing the training result according to the number of training iterations.
2. The method of claim 1, wherein the step of sending the model parameters and the training data set divided into m parts to (1+a)×m parameter server groups for training further comprises:
distributing the model parameters and the training data set issued to the parameter server group to (1+b)×n training parameter servers in the parameter server group for training; wherein n is a positive integer, 0 < b < 1, and (1+b)×n is rounded up to the nearest integer.
3. A method of training a machine learning model according to claim 2, the method further comprising:
discarding abnormal values in the training result according to the training result fed back by the training parameter server and the Euclidean distance;
calculating the average value of the residual training results; wherein the training result is a set of the outliers and the remaining training results.
4. The method for training a machine learning model according to claim 3, wherein discarding outliers in the training results according to the training results fed back by the training parameter server and the Euclidean distance comprises:
calculating an average value of the training results, wherein the average value is used as a clustering center;
respectively calculating the Euclidean distance between each training result and the clustering center;
and if the Euclidean distance is larger than a preset threshold value, discarding the training result corresponding to the Euclidean distance as an abnormal value.
5. The method of training a machine learning model according to any one of claims 1-4, wherein the processing the training result according to the number of training iterations comprises:
when the number of training iterations is equal to the preset number, calculating the average of the training results and outputting the average.
6. A method of training a machine learning model according to claim 5, the method further comprising:
and when the training times are less than the preset times, issuing the model parameters and the training data sets divided into m parts to the (1+ a) m parameter server groups again for training.
7. A method of training a machine learning model according to claim 6, the method further comprising:
according to the log recorded by the parameter server group, training in the parameter server group again; and the log comprises the model parameters and the training results recorded by the parameter server group when the parameter server group is down.
8. An apparatus for training a machine learning model, the apparatus comprising:
the issuing module is used for issuing the model parameters and the training data set divided into m parts to (1+a)×m parameter server groups for training; wherein m is a positive integer, 0 < a < 1, and (1+a)×m is rounded up to the nearest integer;
the receiving module is used for receiving the training result of the parameter server group;
and the processing module is used for processing the training result according to the number of training iterations.
9. The apparatus for training a machine learning model according to claim 8, wherein the issuing module further comprises:
a distribution module, configured to distribute the model parameters and the m training data sets to (1+b)×n training parameter servers in the parameter server group for training; wherein n is a positive integer, 0 < b < 1, and (1+b)×n is rounded up to the nearest integer.
10. The apparatus for training a machine learning model according to claim 9, wherein the issuing module further comprises:
the discarding module is used for discarding abnormal values in the training result according to the training result fed back by the training parameter server and the Euclidean distance;
the calculation module is used for calculating the average value of the residual training results; wherein the training result is a set of the outliers and the remaining training results.
CN201910041301.1A 2019-01-16 2019-01-16 Training method and device for machine learning model Active CN111445027B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910041301.1A CN111445027B (en) 2019-01-16 2019-01-16 Training method and device for machine learning model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910041301.1A CN111445027B (en) 2019-01-16 2019-01-16 Training method and device for machine learning model

Publications (2)

Publication Number Publication Date
CN111445027A true CN111445027A (en) 2020-07-24
CN111445027B CN111445027B (en) 2024-01-16

Family

ID=71626888

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910041301.1A Active CN111445027B (en) 2019-01-16 2019-01-16 Training method and device for machine learning model

Country Status (1)

Country Link
CN (1) CN111445027B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111641716A (en) * 2020-06-01 2020-09-08 第四范式(北京)技术有限公司 Self-healing method of parameter server, parameter server and parameter service system
CN114936117A (en) * 2021-09-02 2022-08-23 华为技术有限公司 Model training method, server, chip and system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106156810A (en) * 2015-04-26 2016-11-23 阿里巴巴集团控股有限公司 General-purpose machinery learning algorithm model training method, system and calculating node
CN107025205A (en) * 2016-01-30 2017-08-08 华为技术有限公司 A kind of method and apparatus of training pattern in distributed system
US20180039914A1 (en) * 2016-08-04 2018-02-08 Loom Systems LTD. Machine learning techniques for providing enriched root causes based on machine-generated data
CN107819605A (en) * 2016-09-14 2018-03-20 北京百度网讯科技有限公司 Method and apparatus for the switching server in server cluster
CN108665072A (en) * 2018-05-23 2018-10-16 中国电力科学研究院有限公司 A kind of machine learning algorithm overall process training method and system based on cloud framework

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106156810A (en) * 2015-04-26 2016-11-23 阿里巴巴集团控股有限公司 General-purpose machinery learning algorithm model training method, system and calculating node
CN107025205A (en) * 2016-01-30 2017-08-08 华为技术有限公司 A kind of method and apparatus of training pattern in distributed system
US20180039914A1 (en) * 2016-08-04 2018-02-08 Loom Systems LTD. Machine learning techniques for providing enriched root causes based on machine-generated data
CN107819605A (en) * 2016-09-14 2018-03-20 北京百度网讯科技有限公司 Method and apparatus for the switching server in server cluster
CN108665072A (en) * 2018-05-23 2018-10-16 中国电力科学研究院有限公司 A kind of machine learning algorithm overall process training method and system based on cloud framework

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111641716A (en) * 2020-06-01 2020-09-08 第四范式(北京)技术有限公司 Self-healing method of parameter server, parameter server and parameter service system
CN111641716B (en) * 2020-06-01 2023-05-02 第四范式(北京)技术有限公司 Self-healing method of parameter server, parameter server and parameter service system
CN114936117A (en) * 2021-09-02 2022-08-23 华为技术有限公司 Model training method, server, chip and system

Also Published As

Publication number Publication date
CN111445027B (en) 2024-01-16

Similar Documents

Publication Publication Date Title
CN110520853B (en) Queue management for direct memory access
US10140572B2 (en) Memory bandwidth management for deep learning applications
US20190378016A1 (en) Distributed computing architecture for large model deep learning
CN107037978B (en) Data Migration bearing calibration and system
US10878335B1 (en) Scalable text analysis using probabilistic data structures
US9754216B2 (en) Labeling of data for machine learning
WO2021068513A1 (en) Abnormal object recognition method and apparatus, medium, and electronic device
US20220327012A1 (en) Software validation framework
US11886969B2 (en) Dynamic network bandwidth in distributed deep learning training
CN109308170A (en) A kind of data processing method and device
CN111445027B (en) Training method and device for machine learning model
CN108829802B (en) Associated log playback method and device
CN112035879A (en) Information processing method and system for improving confidentiality of automatic logistics of cell
US11803374B2 (en) Monolithic computer application refactoring
WO2023020355A1 (en) Distributed training method for ai model and related device
CN107016115A (en) Data export method, device, computer-readable recording medium and electronic equipment
CN111062521B (en) Online prediction method, system and server
CN115129679A (en) Service request remediation through machine-learning based identification of critical areas of log files
US10915704B2 (en) Intelligent reporting platform
EP3499378A1 (en) Method and system of sharing product data in a collaborative environment
CN111951112A (en) Intelligent contract execution method based on block chain, terminal equipment and storage medium
CN112825525A (en) Method and apparatus for processing transactions
CN109558222A (en) Batch service process monitoring method, device, computer and readable storage medium storing program for executing
CN107807608A (en) Data processing method, data handling system and storage medium
US20210004658A1 (en) System and method for provisioning of artificial intelligence accelerator (aia) resources

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant