CN112740237A - Method and apparatus for training artificial neural network

Info

Publication number
CN112740237A
CN112740237A (application number CN201880097943.7A)
Authority
CN
China
Prior art keywords: training samples, training, groups, targets, samples
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201880097943.7A
Other languages
Chinese (zh)
Inventor
马涛
范礼
苏箐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Publication of CN112740237A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A method of training an artificial neural network, comprising: obtaining M training samples, wherein each of the M training samples contains at least one target, and M is an integer greater than or equal to 2 (S510); and dividing the M training samples into K groups of training samples according to the targets contained in the M training samples, wherein the K groups of training samples correspond one-to-one to K computing units, the K computing units are used to process the targets in the K groups of training samples and to train the artificial neural network based on the processing results of those targets, and K is an integer greater than or equal to 2 (S520). The method allows the number of training samples contained in each group to be equal or approximately equal, which reduces the difference in the time different computing units need to process their targets, so that the next step, which can only be performed once the processing results of all targets are available, can start earlier, improving the training efficiency of the artificial neural network.

Description

Method and apparatus for training artificial neural network Technical Field
The present application relates to the field of artificial neural networks, and in particular, to a method and an apparatus for training an artificial neural network.
Background
Artificial neural networks (ANNs) are the basis of artificial intelligence. An ANN must be trained on a large number of samples before it can perform certain functions, such as recognizing the content of pictures or translating text.
To speed up training, one approach is to train the ANN on a distributed computing system, for example using the synchronous stochastic gradient descent (SSGD) method. In the SSGD method, the training samples are sliced into a plurality of shards, each shard is sent to a computing unit of the distributed computing system for forward propagation computation, back propagation computation is performed after the loss function of the forward propagation is obtained so as to compute the gradient of the loss function, the gradient values obtained by the computing units are then aggregated, and the ANN is updated based on the result of the gradient aggregation. These steps are repeated until training is complete.
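For orientation only, the following is a minimal sketch of one SSGD iteration on a toy linear least-squares model; the sharding, per-worker gradient computation, averaging and update mirror the steps described above, but the function and variable names (worker_gradient, ssgd_step and so on) are illustrative assumptions rather than anything taken from the application.

```python
import numpy as np

def worker_gradient(w, x_shard, y_shard):
    """Forward + back propagation of one computing unit on its data shard."""
    predictions = x_shard @ w                    # forward propagation
    residual = predictions - y_shard
    return x_shard.T @ residual / len(x_shard)   # gradient of the squared loss

def ssgd_step(w, x, y, num_workers=4, learning_rate=0.1):
    x_shards = np.array_split(x, num_workers)    # slice the training samples into shards
    y_shards = np.array_split(y, num_workers)
    grads = [worker_gradient(w, xs, ys) for xs, ys in zip(x_shards, y_shards)]
    g = np.mean(grads, axis=0)                   # synchronous gradient aggregation
    return w - learning_rate * g                 # update the model and redistribute it
```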
The key to training an ANN with the SSGD method is to keep the processing rates of the computing units synchronized, so that training can proceed at a faster rate. Some ANNs process samples in multiple stages, for example the faster region-based convolutional neural network (faster RCNN) or the mask RCNN. In such networks, the objects extracted from a training sample in the first stage depend on the sample content, so the loads of the computing units of the distributed computing system are unbalanced when they perform the second-stage computation. A computing unit with a larger load needs more time to compute its gradient values, the gradient aggregation process is blocked by that computing unit, and the training efficiency of the ANN is ultimately reduced.
Disclosure of Invention
The application provides a method and an apparatus for training an ANN (artificial neural network), which can improve the efficiency with which a distributed training system trains the ANN by grouping training samples based on the targets they contain.
In a first aspect, a method for training an ANN is provided, including: obtaining M training samples, wherein each of the M training samples contains at least one target, and M is an integer greater than or equal to 2; and dividing the M training samples into K groups of training samples according to the targets contained in the M training samples, wherein the K groups of training samples correspond one-to-one to K computing units, the K computing units are used to process the targets in the K groups of training samples and to train the ANN based on the processing results of those targets, and K is an integer greater than or equal to 2.
The method for training the ANN groups the training samples based on the targets they contain. In an ANN training process in which the targets in the training samples must be processed synchronously, the number of training samples contained in each group can be determined according to the actual situation of the distributed computing system, for example so that each group contains an equal or approximately equal number of training samples. This reduces the difference in the time different computing units need to process their targets, so that the next processing step, which can only be performed once the processing results of all targets are available, can start earlier, improving the training efficiency of the ANN.
Optionally, dividing the M training samples into K groups of training samples according to the targets contained in the M training samples includes: dividing the M training samples into the K groups of training samples according to the targets contained in the M training samples and the computing power of the K computing units, wherein the number of targets contained in a first group of training samples matches the computing power of a first computing unit, the first group of training samples is any one of the K groups of training samples, and the first computing unit is the computing unit corresponding to the first group of training samples among the K computing units.
Grouping the M training samples according to the computing power of each computing unit matches the number of targets processed by each computing unit to its processing power. This reduces the difference in the time different computing units need to process their targets, so that the next processing step, which can only be performed based on the processing results of all targets, can start earlier, improving the training efficiency of the ANN.
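As an illustration of how a target budget might be matched to computing power, the following sketch allocates targets in proportion to per-unit capability; the proportional rule and the name targets_per_unit are assumptions made for illustration, not a rule stated in the application.

```python
def targets_per_unit(total_targets, capabilities):
    """Split a total target budget across K computing units in proportion to their
    relative computing power, e.g. capabilities = [1.0, 3.0, 1.0, 3.0]."""
    total_capability = sum(capabilities)
    counts = [int(total_targets * c / total_capability) for c in capabilities]
    counts[-1] += total_targets - sum(counts)  # hand any rounding remainder to the last unit
    return counts

# Example: targets_per_unit(40, [1.0, 3.0, 1.0, 3.0]) == [5, 15, 5, 15]
```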
Optionally, dividing the M training samples into K groups of training samples according to the targets contained in the M training samples includes: arranging the M training samples into a sample queue, wherein for any two adjacent training samples in the sample queue, the number of targets contained in the preceding training sample is less than or equal to the number of targets contained in the following training sample, or the number of targets contained in the preceding training sample is greater than or equal to the number of targets contained in the following training sample; dividing the sample queue into M/(n·K) groups of training samples in order, wherein n is the number of training samples that any one of the K computing units can process at a time, and each of the K computing units can process the same number of training samples at a time; and performing M/n extractions on the M/(n·K) groups of training samples according to an extraction rule to obtain the K groups of training samples, wherein any one of the K groups of training samples contains n training samples from each group of the M/(n·K) groups of training samples, and the extraction rule is: n training samples are extracted at a time from each group of the M/(n·K) groups of training samples.
When the processing capabilities of the computing units in the distributed training system are the same, the M training samples can be grouped according to this scheme so that each group contains the same or approximately the same number of targets. This reduces the time difference between the fastest and the slowest computing unit in producing gradient values, so that synchronous gradient aggregation can start earlier and the training efficiency of the ANN is improved.
Optionally, extracting n training samples at a time from each group of the M/(n·K) groups of training samples includes: randomly extracting n training samples at a time from each group of the M/(n·K) groups of training samples.
This scheme can further reduce the difference in the number of targets contained in the K groups of training samples.
Optionally, the method further comprises: shuffling each of the K groups of training samples; and sending the shuffled K groups of training samples to the K computing units respectively.
This scheme disturbs the order of the training samples within each group, enhances the randomness of the number of targets each computing unit processes in a single batch, and avoids large differences in the number of targets processed by different computing units in a single batch.
Optionally, dividing the M training samples into K groups of training samples is a first grouping mode, where the first grouping mode is one of at least two preset grouping modes, and a second grouping mode of the at least two grouping modes is: dividing the M training samples into K groups containing the same or approximately the same number of training samples. Dividing the M training samples into K groups of training samples according to the targets contained in the M training samples then includes: acquiring indication information, wherein the indication information indicates that the first grouping mode is selected from the at least two grouping modes; and dividing the M training samples into the K groups of training samples according to the indication information and the targets contained in the M training samples.
This scheme gives the user more options, so that the user can flexibly select the grouping mode suitable for the current ANN training scenario according to the actual situation, improving the training efficiency of the ANN.
In a second aspect, the present application provides an apparatus for training an ANN, where the apparatus may implement functions corresponding to the steps in the method according to the first aspect, where the functions may be implemented by hardware, or may be implemented by hardware executing corresponding software. The hardware or software includes one or more units or modules corresponding to the above functions.
In one possible design, the apparatus includes a processor configured to support the apparatus to perform the corresponding functions in the method according to the first aspect. The apparatus may also include a memory, coupled to the processor, that retains program instructions and data necessary for the apparatus. Optionally, the apparatus further comprises a communication interface for supporting communication between the apparatus and other devices.
In a third aspect, the present application provides a computer program product comprising: computer program code which, when executed by a processor of an apparatus (e.g. a server) for training an ANN, causes the apparatus for training an ANN to perform the method of the first aspect.
In a fourth aspect, the present application provides a computer storage medium for storing computer software instructions for an apparatus for training an ANN as described above, comprising a program designed to perform the method of the first aspect.
In a fifth aspect, the present application provides an ANN training system comprising the apparatus of the second aspect, the computer program product of the third aspect, and the computer storage medium of the fourth aspect.
Drawings
FIG. 1 is a schematic diagram of a distributed computing system architecture suitable for use in the subject technology;
FIG. 2 is a schematic diagram of the architecture of a mask RCNN suitable for use in the present application;
FIG. 3 is a schematic flow chart of training a mask RCNN provided herein;
FIG. 4 is a schematic illustration of an ANN training interface provided herein;
FIG. 5 is a schematic diagram of a method for partitioning training samples during ANN training provided herein;
FIG. 6 is a schematic diagram of a method of training an ANN provided herein;
FIG. 7 is a schematic diagram of an apparatus for training an ANN provided herein;
FIG. 8 is a schematic diagram of another apparatus for training an ANN provided herein;
fig. 9 is a schematic diagram of a system for training an ANN according to the present application.
Detailed Description
In order to facilitate understanding of the technical solutions of the present application, first, concepts related to the present application are briefly introduced.
The work of each layer in an ANN can be described mathematically by the expression y = a(W·x + b). From a physical perspective, the work of each layer can be understood as transforming an input space (a set of input vectors) into an output space (i.e., from the row space to the column space of a matrix) through five operations on the input space: 1. raising/lowering the dimension; 2. zooming in/out; 3. rotation; 4. translation; 5. "bending". Operations 1, 2 and 3 are performed by W·x, operation 4 by +b, and operation 5 by a(·). The word "space" is used here because the object being classified is not a single thing but a class of things, and the space is the collection of all individuals of that class. W is a weight vector, each value of which represents the weight of one neuron in that layer of the neural network. The vector W determines the spatial transformation from the input space to the output space described above, i.e., the weight W of each layer controls how the space is transformed. The purpose of training a deep neural network is to finally obtain the weight matrices of all layers of the trained network (the weight matrices formed by the vectors W of many layers). Thus, the training process of an ANN is essentially learning how to control the spatial transformation, and more specifically learning the weight matrices.
Because it is desirable that the output of the ANN be as close as possible to the value that is actually wanted, the weight vector of each layer of the neural network can be updated by comparing the predicted value of the current network with the desired value and adjusting the weight vector according to the difference between them (of course, there is usually an initialization process before the first update, in which parameters are preconfigured for each layer of the deep neural network). It is therefore necessary to define in advance how the difference between the predicted value and the target value is measured; this is the role of the loss function (loss function) or objective function (objective function), which are important equations for measuring that difference. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a larger difference, so training the deep neural network becomes a process of reducing this output value as much as possible.
The loss function is usually a function of many variables. The gradient reflects the rate at which the output value of the loss function changes as a variable changes; the larger the absolute value of the gradient, the faster the output value changes. By computing the gradient of the loss function with respect to the parameters and repeatedly updating the parameters along the direction in which the output value of the loss function decreases fastest (the negative gradient direction), the output value of the loss function can be reduced as quickly as possible.
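Written out, this corresponds to the familiar gradient-descent update, where W denotes the parameters, L the loss function and η the learning rate (a generic textbook formulation, not a formula quoted from the application):

```latex
W \leftarrow W - \eta \, \nabla_{W} L(W)
```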
The technical solutions provided in the present application will be described in detail below with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of an architecture of a distributed computing system suitable for use in the solution of the present application.
The system 100 includes 4 computing units (workers) and a parameter server (PS). The 4 computing units (worker0, worker1, worker2 and worker3) perform forward propagation and back propagation computations on the input training samples (such as images or sentences) to obtain gradient values (g0, g1, g2 and g3) of the loss function. The 4 computing units then send their gradient values to the PS, which performs synchronous aggregation on them to obtain the aggregation result g, the average of the gradient values of all computing units participating in the distributed training; g is used to update the model parameter W' of the ANN to obtain the updated model parameter W. Finally, the PS distributes W to each computing unit and updates its model parameters, completing one iteration of training.
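Using the notation of this example (g0..g3, W', W), the role of the PS can be sketched as follows; the plain SGD update rule and the function name are illustrative assumptions, since the text above only states that g is used to update W'.

```python
import numpy as np

def parameter_server_step(w_prime, worker_gradients, learning_rate):
    """Synchronous aggregation at the PS: average g0..g3 into g and use g to
    update the model parameter W' to W, which is then sent back to every worker."""
    g = np.mean(worker_gradients, axis=0)   # g = (g0 + g1 + g2 + g3) / 4
    w = w_prime - learning_rate * g         # assumed update rule for illustration
    return w
```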
The computing unit may be a Graphics Processing Unit (GPU), a Tensor Processing Unit (TPU), or another type of computing unit, such as a Central Processing Unit (CPU).
The PS may be located in the same device as the 4 calculation units or may be located in a different device from the 4 calculation units.
It should be understood that the system 100 shown in fig. 1 is only one example of a system to which the technical solution of the present application is applicable; the system 100 may comprise a different number of computing units and PSs, and may further comprise other units, for example an input unit.
The system 100 may be used to train a variety of ANNs. The method for training an ANN provided by the present application is described below taking the mask RCNN as an example.
Fig. 2 shows a schematic architecture of a mask RCNN suitable for use in the present application.
The mask RCNN is a convolutional neural network improved on the basis of the faster RCNN: a branch network (mask branch) is added to the faster RCNN, so that target pixels can be segmented while target detection is performed. The mask RCNN can be used in combination with various RCNNs and has strong generalization capability.
The flow of the mask RCNN processing picture is as follows:
(a) The pictures are input into the backbone network (i.e., the base network of the faster RCNN) and feature extraction is performed to obtain feature maps. The backbone network (backbone) may be, for example, ResNet50, a 50-layer deep residual network (ResNet).
(b) The feature maps are processed by a region proposal network (RPN), i.e., RPN forward propagation is computed, and one or more proposal windows (proposals) are generated for each feature map. For example, the RPN performs anchor decoding (anchor decoding) and non-maximum suppression (NMS) processing to generate one or more proposal windows (i.e., regions of interest).
(c) Different numbers of proposal windows are selected and sent to a region of interest (ROI) align layer, which performs size normalization on proposal windows of different sizes. Meanwhile, the RPN computes the RPN loss function (RPN loss) and performs RPN back propagation to obtain the gradient values of the RPN loss function, which is used to evaluate whether a region is a potential proposal window. The RPN then aggregates the gradient values of the RPN loss function and updates the model parameters of the RPN, i.e., performs gradient aggregation and model parameter updating of the RPN branch.
(d) The proposal windows are processed by the faster RCNN branch to obtain a classification loss function (classification loss) and a regression loss function (regression loss). The classification loss function is used to evaluate the type of the object in the ROI, and the regression loss function is used to evaluate the deviation between the coordinates of a candidate box mapped onto the original image and the coordinates of the ground truth label.
(e) The gradient values of the computing units are synchronized, gradient aggregation and model parameter updating of the faster RCNN branch are performed, and gradient aggregation and model parameter updating of the backbone network are performed.
(f) When the training results of the backbone, the RPN and the faster RCNN branch have all converged, the model parameters are frozen and the mask branch is trained (the mask branch may also be trained simultaneously with the faster RCNN branch). The mask loss function shown in fig. 2 is used to evaluate the deviation between the predicted instance segmentation result and the ground truth label.
Fig. 3 shows a flow of the system 100 for training the mask RCNN.
As shown in fig. 3, the training sample of the mask RCNN is a data set including a plurality of pictures, each computing unit has a file name list of the data set, and can independently and concurrently extract a plurality of pictures from the data set according to the file name list and add the pictures to a queue (input pipeline).
Taking computing unit 0 as an example, computing unit 0 first preprocesses the pictures in queue 0, i.e., packs the pictures in the queue according to the number of pictures processed at one time to obtain a mini-batch data packet (mini-batch). The mini-batch is then sent to the backbone network, and the steps shown in fig. 2 are performed.
As can be seen from the way the mask RCNN processes pictures, after step (c) the faster RCNN branch processes the proposal windows, and the number of proposal windows selected in each picture differs (for example, the numbers of proposal windows 0 and proposal windows 1 shown in fig. 3 are different). The number of proposal windows processed by the faster RCNN branch therefore differs between computing units, so the time at which the faster RCNN branch of each computing unit produces its gradient values during back propagation also differs, and every computing unit must wait for the computing unit whose faster RCNN branch is the last to finish back propagation before synchronous gradient aggregation can be performed, which reduces the training efficiency of the mask RCNN.
The application provides a method for training an ANN, which is applied to the distributed computing system shown in FIG. 1. The system 100 may include a user interface with options for the input data set, hyperparameters (hyper parameters) and shuffling (shuffle), as shown in fig. 4.
The data set option provides the user with the ability to select different training samples, and the user can select a suitable training sample set through this option. For example, when a user needs to train the mask RCNN, data set 1 may be selected, and data set 1 may be an ImageNet data set or a COCO data set; when a user needs to train a natural language processing network, data set 2 may be selected, and data set 2 may be a Penn Treebank (PTB) data set.
The hyperparameter option provides the user with the ability to select different training models; through this option the user can set the learning rate, the number of iterations, the number of ANN layers and the number of neurons per layer.
The shuffling option provides the user with the ability to select different grouping modes for the training samples, and the user can select different grouping modes according to the actual situation.
For example, if the computing units do not need to process the targets in the training samples during ANN training, or if the computing units can process the targets in the training samples asynchronously, the user may select "general global shuffle"; after each computing unit in the system 100 receives the "general global shuffle" command, the computing units randomly extract equal numbers of training samples from the data set and input them into their respective queues.
For another example, if the computing units need to process the targets in the training samples synchronously during ANN training, the user may select "shuffle based on targets"; after each computing unit in the system 100 receives the "shuffle based on targets" command, the computing units randomly extract training samples containing equal numbers of targets from the data set and add them to their respective queues.
The above is only an example; the function of the shuffling option provided by the present application is not limited thereto, and the shuffling option may also provide more shuffling functions for the user, for example shuffling based on target type, so that the types of targets in the training samples input to the computing units are the same.
After the data set selection and the "shuffle based on targets" indication are received, the method 500 shown in FIG. 5 may be performed by each computing unit in the system 100. The method 500 may also be performed by an overall controller (e.g., a CPU or a dedicated processor) of the system 100.
S510, obtaining M training samples, wherein each training sample in the M training samples comprises at least one target, and M is an integer greater than or equal to 2.
The training samples can be pictures or sentences. When the training sample is a picture, the target can be an object in the picture; when the training sample is a sentence, the target may be a word or letter in the sentence.
S520, dividing the M training samples into K groups of training samples according to targets contained in the M training samples, wherein the K groups of training samples correspond to the K computing units in a one-to-one mode, the K computing units are used for processing the targets in the K groups of training samples, the K computing units are used for training ANN based on processing results of the targets in the K groups of training samples, and K is an integer greater than or equal to 2.
When the method 500 is applied to the system 100, K equals 4. The value of M may be 20, that is, there are 20 training samples in the current data set, where each training sample contains at least 1 target, and the system 100 may determine the number of targets contained in the training sample according to the label (label) of the training sample, for example, the number of targets contained in the 20 training samples is 40.
The system 100 may divide the 20 training samples into 4 groups according to the labels of the training data, each group containing 10 targets (the number of training samples in each group may differ). The 4 groups correspond respectively to the 4 computing units of the system 100, and the 4 computing units are configured to perform synchronous gradient aggregation on the targets in the 4 groups of training samples and to update the model parameters of the ANN according to the result of the synchronous gradient aggregation.
The above number of targets per group is only an example; the 20 training samples may also be divided into 4 groups as follows.
Mode 1: group 1, 8 targets; group 2, 12 targets; group 3, 8 targets; group 4, 12 targets.
Mode 2: group 1, 11 targets; group 2, 9 targets; group 3, 12 targets; group 4, 8 targets.
Mode 3: group 1, 1 target; group 2, 9 targets; group 3, 1 target; group 4, 9 targets.
In summary, the method 500 for training an ANN provided by the present application groups training samples based on the targets they contain. In an ANN training process that requires synchronous processing of the targets in the training samples, the number of training samples in each group can be determined according to the actual situation of the distributed computing system, thereby improving the training efficiency of the ANN.
As an alternative example, when performing S520 the system 100 may divide the M training samples according to the computing power of each computing unit, so that the number of targets processed by each computing unit matches its processing power, thereby reducing the difference in the time the computing units need to process their targets.
For example, suppose the processing power of computing unit 0 is relatively weak, the processing power of computing unit 1 is relatively strong, computing units 0 and 2 have the same processing power, and computing units 1 and 3 have the same processing power. The system 100 may then divide the 20 training samples into 4 groups according to mode 1 or mode 3 above, where computing unit 0 processes the group-1 training samples, computing unit 1 processes the group-2 training samples, computing unit 2 processes the group-3 training samples, and computing unit 3 processes the group-4 training samples.
As another example, suppose the processing powers of the 4 computing units of system 100 are ordered as: computing unit 2 > computing unit 0 > computing unit 1 > computing unit 3. The system 100 may then divide the 20 training samples into 4 groups according to mode 2 above, where computing unit 0 processes the group-1 training samples, computing unit 1 processes the group-2 training samples, computing unit 2 processes the group-3 training samples, and computing unit 3 processes the group-4 training samples.
For another example, if the processing capabilities of 4 computing units of the system 100 are the same, the system 100 may divide the 20 training samples into 4 training sample groups each including 10 targets.
When the processing capabilities of the computing units of the system 100 are the same, the system 100 may group the M training samples in S520 as described below.
The M training samples are arranged into a sample queue, wherein for any two adjacent training samples in the sample queue, the number of targets contained in the preceding training sample is less than or equal to the number of targets contained in the following training sample (i.e., the samples are arranged in ascending order of target number), or the number of targets contained in the preceding training sample is greater than or equal to the number of targets contained in the following training sample (i.e., the samples are arranged in descending order of target number);
the sample queue is divided into M/(n·K) groups of training samples in order, wherein n is the number of training samples that any one of the K computing units can process at a time, i.e., the number of samples contained in a mini-batch data packet (mini-batch);
M/n extractions are performed on the M/(n·K) groups of training samples according to an extraction rule to obtain the K groups of training samples, wherein any one of the K groups of training samples contains n training samples from each group of the M/(n·K) groups of training samples, and the extraction rule is: n training samples are extracted at a time from each group of the M/(n·K) groups of training samples.
In the above method, M is divisible by n·K, and the processor may obtain the M training samples according to the values of n and K when performing S510. For example, if n equals 1, K equals 4 and there are 99 samples in the current data set, 96 of the 99 samples can be selected as the M training samples.
Still taking M equal to 20 and K equal to 4 as an example, n may be equal to 1, where n is a value set by the user. The system 100 may rank the 20 training samples according to the number of targets included in each training sample, and may rank the training samples in order from small to large, or rank the training samples in order from large to small.
The system 100 then divides the sample queue into 5 groups of training samples in order, for example by taking every 4 consecutive training samples from the head of the sample queue as one group; each group of training samples can be regarded as a data block (chunk), and 5 is the value of M/(n·K). Subsequently, the system 100 performs 20 extractions (20 is the value of M/n) from the 5 data blocks, each time extracting 1 training sample (1 is the value of n) and inputting it into the queue of a computing unit, where the queue of each computing unit must contain at least 1 training sample from each of the 5 data blocks, i.e., 1 training sample is extracted from each of the 5 data blocks and input into the queue of one computing unit.
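A minimal sketch of this sort, chunk and extract procedure is given below, assuming M is divisible by n·K; the helper names (group_by_targets, num_targets) are illustrative and not taken from the application.

```python
import random

def group_by_targets(samples, num_targets, k, n, rng=random):
    """Sort the samples by target count, cut the sorted queue into data blocks of
    n*k samples, then randomly hand n samples of every block to each of k groups."""
    queue = sorted(samples, key=num_targets)             # ascending order of target count
    block_size = n * k
    blocks = [queue[i:i + block_size] for i in range(0, len(queue), block_size)]

    groups = [[] for _ in range(k)]
    for block in blocks:                                  # each block is one chunk
        rng.shuffle(block)                                # random extraction within the block
        for j in range(k):
            groups[j].extend(block[j * n:(j + 1) * n])    # n samples per group per block
    return groups
```

With M = 20, K = 4 and n = 1, this sketch yields 4 groups of 5 training samples each, one sample from each of the 5 data blocks, matching the example above.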
The 20 extractions may be performed in parallel or in series. For example, when the system 100 can simultaneously extract 2n training samples from a data block, where the 2n training samples include n training samples for computing unit 0 and n training samples for computing unit 1, this counts as 2 extractions. As another example, the system 100 may extract n training samples for computing unit 0 from a data block and then extract n training samples for computing unit 1 from the same data block, which also counts as 2 extractions.
Because the training samples in the sample queue are arranged according to the number of targets and the data blocks are obtained by dividing the queue in order, the numbers of targets contained in the different training samples within each data block are equal or approximately equal. Since the training samples in each data block are distributed to different computing units, the numbers of targets contained in the training samples processed by the computing units are also equal or approximately equal, which reduces the difference in the time the computing units need to process their training samples. For example, as shown in fig. 6, in an ANN training system applying the method provided by the present application, the time difference between the fastest and the slowest computing unit in producing gradient values is reduced, so that synchronous gradient aggregation can start earlier and the training efficiency of the ANN is improved.
The step of extracting the training samples may be completed by input loader (input loader) threads of the system 100, each computing unit corresponds to one input loader thread, and the input loader threads corresponding to the computing units may extract the training samples in parallel or in series.
In addition, the order in which the input loader threads extract from the data blocks is not limited: an input loader thread may randomly select data blocks from which to extract training samples, or may extract training samples from the data blocks in order. Optionally, the system 100 may also shuffle the data blocks before the extraction step, i.e., disturb the order of the data blocks, which enhances the randomness of the training samples processed by each computing unit and helps avoid over-fitting or under-fitting of the system 100.
As an alternative example, the system 100 may randomly draw training samples when drawing training samples from multiple data blocks.
For example, data block 1 includes 4 training samples a, b, c and d, where a and b each contain 1 target and c and d each contain 2 targets; arranged by the number of targets they contain, the 4 training samples are ordered a, b, c, d. Data block 2 includes 4 training samples e, f, g and h, where e contains 2 targets, f and g each contain 3 targets, and h contains 4 targets; arranged by the number of targets they contain, the 4 training samples are ordered e, f, g, h.
If the extraction is performed in order, a and e are input into the queue of one computing unit, which then processes 3 targets, while d and h are input into the queue of another computing unit, which processes 6 targets, so the difference in the number of targets processed by different computing units is large.
If the extraction is random, a and h may be input into the queue of one computing unit, which processes 5 targets, and d and f may be input into the queue of another computing unit, which also processes 5 targets. The number of targets processed by each computing unit can thus be equal or approximately equal, so the time each computing unit needs to process its targets is equal or approximately equal, which improves the training efficiency of the ANN.
As another alternative example, the system 100 may read n training samples from each of the queues of the computing units via a feed (feed) thread, shuffle the n training samples, and then feed the shuffled training samples to the computing units.
For example, suppose the 5 training samples in queue 0 contain 1, 2, 3, 4 and 5 targets respectively and are sent to computing unit 0 in that order, while the 5 training samples in queue 1 contain 5, 4, 3, 2 and 1 targets respectively and are sent to computing unit 1 in that order. If queue 0 and queue 1 are not shuffled, the time computing unit 1 needs to process its first training sample will far exceed the time computing unit 0 needs to process its first training sample. If the training samples in queue 0 and queue 1 are shuffled according to the above method, the numbers of targets processed by computing unit 0 and computing unit 1 at each step may be equal or approximately equal, so the time they need to process their targets is equal or approximately equal, which improves the training efficiency of the ANN.
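A possible per-queue shuffle in the feed thread could look like the sketch below; the queue representation (a plain list) and the function name feed are assumptions made for illustration, not part of the application.

```python
import random

def feed(queue, n, rng=random):
    """Read n training samples from one computing unit's queue, shuffle them,
    and return them in the order in which they will be sent to that unit."""
    batch = [queue.pop(0) for _ in range(n)]
    rng.shuffle(batch)
    return batch

# e.g. feed([s1, s2, s3, s4, s5], 5) returns the same samples in a random order
```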
It should be noted that the above-mentioned schemes are only examples, and any method capable of dividing M training samples into K sets of training sample sets with equal or approximately equal total target numbers falls within the scope of the present application.
Examples of the ANN training methods provided herein are described in detail above. It is understood that the ANN training means, in order to implement the above-described functions, includes corresponding hardware structures and/or software modules for performing the respective functions. Those of skill in the art would readily appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as hardware or combinations of hardware and computer software. Whether a function is performed as hardware or computer software drives hardware depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The present application may perform the division of the functional units for the ANN training apparatus according to the above method examples, for example, each function may be divided into each functional unit, or two or more functions may be integrated into one processing unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit. It should be noted that the division of the units in the present application is schematic, and is only one division of logic functions, and there may be another division manner in actual implementation.
In the case of an integrated unit, fig. 7 shows a schematic diagram of a possible structure of the ANN training device provided by the present application. The apparatus 700 comprises: a processing unit 701 and an input unit 702. The processing unit 701 is adapted to control the apparatus 700 to perform the steps of the method shown in fig. 5. Processing unit 701 may also be used to perform other processes for the techniques described herein. The apparatus 700 may further comprise a storage unit 703 for storing program codes and data of the apparatus 700.
For example, the processing unit 701 is configured to control the input unit 702 to perform:
obtaining M training samples, wherein each training sample in the M training samples comprises at least one target, and M is an integer greater than or equal to 2.
The processing unit 701 is further configured to perform:
dividing the M training samples into K groups of training samples according to targets contained in the M training samples, wherein the K groups of training samples correspond to the K calculation units in a one-to-one mode, the K calculation units are used for processing the targets in the K groups of training samples, the K calculation units are used for training ANN based on the processing results of the targets in the K groups of training samples, and K is an integer greater than or equal to 2.
The processing unit 701 may be a processor or a controller, such as a CPU, a general purpose processor, a Digital Signal Processor (DSP), an application-specific integrated circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. Which may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the disclosure herein. The processor may also be a combination of computing functions, e.g., comprising one or more microprocessors, DSPs, and microprocessors, among others. The input unit 702 is, for example, a transceiver, and the storage unit 703 may be a memory.
When the processing unit 701 is a processor, the input unit 702 is a transceiver, and the storage unit 703 is a memory, the ANN training apparatus according to the present application may be the apparatus shown in fig. 8.
Referring to fig. 8, the apparatus 800 includes: a processor 801, a communication interface 802, and a memory 803 (optional). The processor 801, the communication interface 802, and the memory 803 may communicate with each other via internal connection paths to transfer control and/or data signals.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the apparatuses and units described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
Therefore, the ANN training apparatus provided by the present application groups training samples based on the targets they contain. In an ANN training process in which the targets in the training samples must be processed synchronously, the number of training samples contained in each group can be determined according to the actual situation of the distributed computing system, for example so that each group contains an equal or approximately equal number of training samples. This reduces the difference in the time different computing units need to process their targets, so that the next processing step, which can only be performed based on the processing results of all targets, can start earlier, improving the training efficiency of the ANN.
Referring to fig. 9, the present application also provides a system architecture 200 for training an ANN.
The server 210 is configured with an input/output (I/O) interface 212 to interact with external devices (e.g., the client device 230) and a "user" can input data to the I/O interface 212 via the client device 230, such as: a specified data set (i.e., a set of training samples), hyper-parameters, and shuffle types, etc.
Server 210 may call data, code, etc. from the data storage system 240 and may store data, instructions, etc. in the data storage system 240.
The processor 211 may process the data (e.g., training samples) using the method 500, and the specific processing may be described in relation to the method 500.
Training apparatus 220 is used to train the ANN based on commands from processor 211, training apparatus 220 being, for example, the various computational units shown in FIG. 1.
Finally, the I/O interface 212 returns the processing result (e.g., the trained ANN) to the client device 230 for presentation to the user.
In the case shown in FIG. 9, the user may manually specify the data to be input to the server 210, for example by operating in an interface provided by the I/O interface 212 (as shown in FIG. 4). Alternatively, the client device 230 may automatically send data to the I/O interface 212 and obtain the results; if the client device 230 needs the user's authorization to send data automatically, the user may set the corresponding permissions in the client device 230. The user may view the results output by the server 210 at the client device 230, which may be presented in a specific form, for example by displaying the output results on a screen. The client device 230 may also act as a data collection point and store the collected data (e.g., training samples) in the data storage system 240.
It should be noted that fig. 9 is only a schematic diagram of a system architecture provided by an embodiment of the present invention, and the positional relationship between the devices and modules shown in the figure does not limit the technical solution of the present application in any way. For example, in fig. 9 the data storage system 240 is an external memory with respect to the server 210, but optionally the data storage system 240 may also be disposed in the server 210; similarly, the training apparatus 220 may also be located in the server 210.
In the embodiments of the present application, the sequence numbers of the processes do not mean the execution sequence, and the execution sequence of the processes should be determined by the functions and the inherent logic of the processes, and should not limit the implementation processes of the present application.
In addition, the term "and/or" herein is only one kind of association relationship describing an associated object, and means that there may be three kinds of relationships, for example, a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" herein generally indicates that the former and latter related objects are in an "or" relationship.
The steps of a method or algorithm described in connection with the disclosure herein may be embodied in hardware or in software instructions executed by a processor. The software instructions may be composed of corresponding software modules, which may be stored in random access memory (RAM), flash memory, read only memory (ROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), registers, a hard disk, a removable disk, a compact disc read only memory (CD-ROM), or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an ASIC.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, it may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions described in accordance with the present application are generated, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable device. The computer instructions may be stored in or transmitted over a computer-readable storage medium. The computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wirelessly (e.g., infrared, radio, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that incorporates one or more available media. The usable medium may be a magnetic medium (e.g., a floppy disk, hard disk, or magnetic tape), an optical medium (e.g., a digital versatile disc (DVD)), or a semiconductor medium (e.g., a solid state disk (SSD)), etc.
The above-mentioned embodiments, objects, technical solutions and advantages of the present application are further described in detail, it should be understood that the above-mentioned embodiments are only examples of the present application, and are not intended to limit the scope of the present application, and any modifications, equivalent substitutions, improvements and the like made on the basis of the technical solutions of the present application should be included in the scope of the present application.

Claims (13)

  1. A method of training an artificial neural network, comprising:
    the method comprises the steps that a processor obtains M training samples, each training sample in the M training samples comprises at least one target, and M is an integer larger than or equal to 2;
    the processor divides the M training samples into K groups of training samples according to targets contained in the M training samples, the K groups of training samples correspond to K calculation units in a one-to-one mode, the K calculation units are used for processing the targets in the K groups of training samples, the K calculation units are used for training an artificial neural network based on the processing results of the targets in the K groups of training samples, and K is an integer greater than or equal to 2.
  2. The method of claim 1, wherein the processor divides the M training samples into K sets of training samples according to the targets included in the M training samples, including:
    the processor divides the M training samples into K groups of training samples according to the targets contained in the M training samples and the computing power of the K computing units, wherein the number of targets contained in a first group of training samples matches the computing power of a first computing unit, the first group of training samples is any one of the K groups of training samples, and the first computing unit is the computing unit corresponding to the first group of training samples among the K computing units.
  3. The method according to claim 1 or 2, wherein the processor divides the M training samples into K sets of training samples according to the targets included in the M training samples, including:
    the processor arranges the M training samples into a sample queue, wherein in any two adjacent training samples in the sample queue, the number of targets contained in the previous training sample is less than or equal to the number of targets contained in the next training sample, or the number of targets contained in the previous training sample is greater than or equal to the number of targets contained in the next training sample;
    the processor divides the sample queue into M/(n·K) groups of training samples in sequence, wherein n is the number of training samples which can be processed by any one computing unit in the K computing units at one time, and the number of training samples which can be processed by each computing unit in the K computing units at one time is the same;
    the processor performs M/n extractions on the M/(n·K) groups of training samples to obtain the K groups of training samples, wherein any one group of the K groups of training samples comprises n training samples from each group of the M/(n·K) groups of training samples.
  4. The method of claim 3, wherein n training samples are extracted at a time from each group of the M/(n·K) groups of training samples, comprising:
    randomly extracting n training samples at a time from each group of the M/(n·K) groups of training samples.
  5. The method according to any one of claims 1 to 4, further comprising:
    the processor shuffling each of the K sets of training samples;
    the processor sends the shuffled K groups of training samples to the K computing units respectively.
  6. The method according to any one of claims 1 to 5, wherein the dividing the M training samples into K groups of training samples is a first grouping mode, the first grouping mode is one of at least two preset grouping modes, and a second grouping mode of the at least two grouping modes is: dividing the M training samples into K training sample groups with the same number of training samples,
    the processor divides the M training samples into K groups of training samples according to targets contained in the M training samples, including:
    the processor acquires indication information, wherein the indication information is used for indicating that the first grouping mode is selected from the at least two grouping modes;
    and the processor divides the M training samples into K groups of training samples according to the indication information and the targets contained in the M training samples.
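    Claim 6 selects between at least two preset grouping modes based on indication information. The sketch below encodes that choice as a string flag, which is an assumed representation; it takes the first grouping mode as a callable (for example, the `queue_based_grouping` sketch shown after claim 4), and the equal-count splitter used for the second mode is likewise only illustrative.

```python
from typing import Callable, List

def equal_count_grouping(samples: List[dict], k: int) -> List[List[dict]]:
    """Second grouping mode: K groups with the same number of training samples,
    ignoring how many targets each sample contains (assumes M % K == 0)."""
    size = len(samples) // k
    return [samples[i * size:(i + 1) * size] for i in range(k)]

def group_samples(samples: List[dict], k: int, n: int, mode: str,
                  by_targets: Callable[[List[dict], int, int], List[List[dict]]]) -> List[List[dict]]:
    """Divide the samples using the grouping mode named by `mode`, an assumed
    string encoding of the indication information."""
    if mode == "by_targets":    # first grouping mode: balance target counts
        return by_targets(samples, k, n)
    if mode == "equal_count":   # second grouping mode: equal sample counts only
        return equal_count_grouping(samples, k)
    raise ValueError(f"unknown grouping mode: {mode}")
```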
  7. An artificial neural network training system, comprising a processor, K computing units and a memory, wherein the processor is configured to perform the method of any one of claims 1 to 6 based on instructions stored in the memory, so that the M training samples are divided into the K groups of training samples;
    the K computing units are configured to: process the K groups of training samples to obtain the targets of the K groups of training samples; process the targets of the K groups of training samples; and train the artificial neural network based on the processing results of the targets of the K groups of training samples.
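    Purely to illustrate the division of labour in the system of claim 7, the sketch below runs K worker threads, one per group, and completes the step only after every unit has returned; the per-target "processing" is a placeholder that just counts targets, not the patented training computation.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

def process_group(unit_id: int, group: List[dict]) -> Tuple[int, int]:
    """Stand-in for one computing unit: 'process' every target in its group and
    report a per-group result (here simply the number of targets handled)."""
    handled = sum(len(sample["targets"]) for sample in group)
    return unit_id, handled

def train_step(groups: List[List[dict]]) -> Dict[int, int]:
    """Processor-side view of one step: the K units work on their own groups in
    parallel, and the step finishes only after every unit has returned."""
    with ThreadPoolExecutor(max_workers=len(groups)) as pool:
        return dict(pool.map(process_group, range(len(groups)), groups))

groups = [[{"targets": [0, 1]}] * 3, [{"targets": [0]}] * 3]
print(train_step(groups))  # {0: 6, 1: 3}
```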
  8. An apparatus for training an artificial neural network, comprising an input unit and a processing unit,
    the input unit is used for: obtaining M training samples, wherein each training sample in the M training samples comprises at least one target, and M is an integer greater than or equal to 2;
    the processing unit is configured to: divide the M training samples into K groups of training samples according to the targets contained in the M training samples, wherein the K groups of training samples correspond to the K computing units in a one-to-one manner, the K computing units are used for processing the targets in the K groups of training samples and for training an artificial neural network based on the processing results of the targets in the K groups of training samples, and K is an integer greater than or equal to 2.
  9. The apparatus according to claim 8, wherein the processing unit is specifically configured to:
    divide the M training samples into the K groups of training samples according to the targets contained in the M training samples and the computing capabilities of the K computing units, wherein the number of the targets contained in a first group of training samples matches the computing capability of a first computing unit, the first group of training samples is any one of the K groups of training samples, and the first computing unit is the computing unit corresponding to the first group of training samples among the K computing units.
  10. The apparatus according to claim 8 or 9, wherein the processing unit is specifically configured to:
    arrange the M training samples into a sample queue, wherein, for any two adjacent training samples in the sample queue, the number of targets contained in the previous training sample is less than or equal to the number of targets contained in the next training sample, or the number of targets contained in the previous training sample is greater than or equal to the number of targets contained in the next training sample;
    divide the sample queue into M/(n·K) groups of training samples in sequence, wherein n is the number of training samples that any one of the K computing units can process at a time, and each of the K computing units can process the same number of training samples at a time;
    and extract n training samples at a time from the M/(n·K) groups of training samples, M/n times in total, to obtain the K groups of training samples, wherein any one group of the K groups of training samples comprises n training samples from any one group of the M/(n·K) groups of training samples.
  11. The apparatus according to claim 10, wherein the processing unit is specifically configured to:
    randomly extract n training samples at a time from each group of the M/(n·K) groups of training samples.
  12. The apparatus according to any one of claims 8 to 11, wherein the processing unit is further configured to:
    shuffle each of the K groups of training samples;
    and send the K shuffled groups of training samples to the K computing units, respectively.
  13. The apparatus according to any one of claims 8 to 12, wherein the dividing of the M training samples into the K groups of training samples is a first grouping mode, the first grouping mode is one of at least two preset grouping modes, and a second grouping mode of the at least two grouping modes is: dividing the M training samples into K groups of training samples that each contain the same number of training samples,
    the input unit is further configured to: acquire indication information, wherein the indication information is used for indicating that the first grouping mode is selected from the at least two grouping modes;
    and the processing unit is further configured to: divide the M training samples into the K groups of training samples according to the indication information and the targets contained in the M training samples.
CN201880097943.7A 2018-09-30 2018-09-30 Method and apparatus for training artificial neural network Pending CN112740237A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/109114 WO2020062250A1 (en) 2018-09-30 2018-09-30 Method and apparatus for training artificial neural network

Publications (1)

Publication Number Publication Date
CN112740237A (en) 2021-04-30

Family

ID=69952727

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201880097943.7A Pending CN112740237A (en) 2018-09-30 2018-09-30 Method and apparatus for training artificial neural network

Country Status (2)

Country Link
CN (1) CN112740237A (en)
WO (1) WO2020062250A1 (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130325436A1 (en) * 2012-05-29 2013-12-05 Wright State University Large Scale Distributed Syntactic, Semantic and Lexical Language Models
CN103810999B (en) * 2014-02-27 2016-10-19 清华大学 Language model training method based on Distributed Artificial Neural Network and system thereof
CN106879050A (en) * 2015-12-11 2017-06-20 中南大学 A kind of wireless sensor network data fusion method based on Distributed Artificial Neural Network
CN106022392B (en) * 2016-06-02 2019-09-13 华南理工大学 A kind of training method that deep neural network sample is accepted or rejected automatically

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116644803A (en) * 2023-07-27 2023-08-25 浪潮电子信息产业股份有限公司 Distributed cooperative training control method, system, device, equipment and storage medium
CN116644803B (en) * 2023-07-27 2023-11-03 浪潮电子信息产业股份有限公司 Distributed cooperative training control method, system, device, equipment and storage medium

Also Published As

Publication number Publication date
WO2020062250A1 (en) 2020-04-02

Similar Documents

Publication Publication Date Title
CN107358293B (en) Neural network training method and device
US11755911B2 (en) Method and apparatus for training neural network and computer server
CN109726045B (en) System and method for block sparse recurrent neural network
US11295208B2 (en) Robust gradient weight compression schemes for deep learning applications
US10949746B2 (en) Efficient parallel training of a network model on multiple graphics processing units
EP3882820A1 (en) Node classification method, model training method, device, apparatus, and storage medium
US11010514B2 (en) Grouping of Pauli strings using entangled measurements
CN109671020A (en) Image processing method, device, electronic equipment and computer storage medium
EP3602424A1 (en) Sensor data processor with update ability
CN111160191B (en) Video key frame extraction method, device and storage medium
CN112446888B (en) Image segmentation model processing method and processing device
US11030750B2 (en) Multi-level convolutional LSTM model for the segmentation of MR images
WO2020185198A1 (en) Noise tolerant ensemble rcnn for semi-supervised object detection
WO2022228425A1 (en) Model training method and apparatus
CN113449859A (en) Data processing method and device
US11295236B2 (en) Machine learning in heterogeneous processing systems
CN112381079A (en) Image processing method and information processing apparatus
CN114663848A (en) Knowledge distillation-based target detection method and device
CN112949433B (en) Method, device and equipment for generating video classification model and storage medium
CN114972877B (en) Image classification model training method and device and electronic equipment
US12106555B2 (en) Method and device for retrieving image
KR20220011208A (en) Neural network training method, video recognition method and apparatus
CN113961765A (en) Searching method, device, equipment and medium based on neural network model
KR20180036074A (en) Device and method to filter text
CN115147680A (en) Pre-training method, device and equipment of target detection model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination