WO2022228060A1 - Data processing method, device and system - Google Patents
Data processing method, device and system
- Publication number: WO2022228060A1 (application PCT/CN2022/085353)
- Authority: WO (WIPO (PCT))
- Prior art keywords: processor, data, message, embedded, gradient
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/16—Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
- G06F15/163—Interprocessor communication
- G06F15/173—Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
- G06F15/17356—Indirect interconnection networks
- G06F15/17368—Indirect interconnection networks non hierarchical topologies
- G06F15/17375—One dimensional, e.g. linear array, ring
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/54—Interprogram communication
- G06F9/546—Message passing systems or structures, e.g. queues
Definitions
- the present application relates to the technical field of artificial intelligence, and in particular, to a large-scale data processing method, device and system.
- Artificial intelligence (AI) is a theory, method, technology and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
- Artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that responds in a way similar to human intelligence.
- Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines have the functions of perception, reasoning and decision-making.
- Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision-making and reasoning, human-computer interaction, recommendation and search, and basic AI theory.
- large-scale data model training is a core technology widely used in Internet search, advertising, recommendation business and other scenarios.
- Typical application scenarios include click-through rate (CTR) models and the like.
- In such training, sample data is first input, and most of this sample data cannot be used directly in numerical calculation, so the sample data must be converted into numerical values through an embedding method. The entry operators of a large-scale data training model are therefore all embedding operators.
- After forward calculation, a loss function (loss) is obtained and then back-propagated, which completes one round (step) of the training process. Such training is typically performed by high-speed processors such as graphics processing units (GPUs) or neural-network processing units (NPUs).
- the present application discloses a data processing method, device and system, which can improve the training efficiency and performance of a data training model.
- In a first aspect, the present application provides a data processing method, the method comprising:
- the first processor sends a first lookup message to the second processor; the first lookup message includes first data, and the first lookup message is used to look up the embedded parameters of the first data; the second processor is the next-hop processor of the first processor in the ring communication architecture where the first processor is located;
- the first processor receives a second lookup message from a third processor; the second lookup message includes second data, and the second lookup message is used to look up the embedded parameters of the second data; the third processor is the previous-hop processor of the first processor in the ring communication architecture;
- the first processor, the second processor and the third processor are processors among the N processors included in the data training system, where N is an integer greater than or equal to 3;
- the ring communication architecture implements communication in which each of the N processors only receives messages from its previous-hop processor and only sends messages to its next-hop processor.
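As a concrete illustration of the next-hop/previous-hop relationship (under an assumed consecutive numbering of the processors, which the claims do not mandate), a minimal sketch:

```python
# Each processor in the ring receives only from (rank - 1) mod N and sends
# only to (rank + 1) mod N.
N = 4
for rank in range(N):
    prev_hop, next_hop = (rank - 1) % N, (rank + 1) % N
    print(f"processor {rank}: receives from {prev_hop}, sends to {next_hop}")
```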
- The data training system includes N processors, and in order to train large-scale sample data, the data training system composed of the N processors implements data training in a data-parallel plus model-parallel manner. Based on this training method, each of the N processors randomly obtains a part of the sample data for training. After the data to be trained is input into the training model, it must be mapped by the embedding layer into a dense vector (also called an embedding parameter) that can be used for subsequent calculation. However, since the training data on a processor is randomly obtained, the embedding parameters of these data do not necessarily exist on that processor and may need to be obtained from other processors among the N processors, which requires message communication with the other processors.
- To this end, the ring communication architecture is used among the N processors to realize ring communication of the messages. This application can thereby make full use of the bandwidth resources between processors, avoid single-point communication bottlenecks, reduce communication delay, and improve communication efficiency, thereby improving the training efficiency and performance of the entire data training system.
- the above-mentioned method further includes: in the case that the embedded parameters of part or all of the data in the second data are found based on the second lookup message, the first processor adds the embedded parameters of the part or all of the data to the second lookup message to obtain a third lookup message, and sends the third lookup message to the second processor; or, in the case that the embedded parameters of the second data are not found based on the second lookup message, the first processor sends the second lookup message to the second processor.
- In this solution, after receiving a lookup message for the embedded parameters of data, the processor continues to forward the lookup message to the next-hop processor based on the above ring communication architecture, regardless of whether the embedded parameters of the corresponding data are found locally. Through cyclic forwarding and searching, the embedded parameters of all the data required by each processor can finally be found.
- the first processor adds the embedded parameters of the part or all of the data to the second lookup message to obtain a third lookup message, and sends the third lookup message to the second processor, including:
- the first processor looks up, in a first embedded table, the embedded parameters to which the part or all of the data is mapped;
- the first embedded table is an embedded table maintained by the first processor for storing data and embedded parameters, and there is a one-to-one mapping relationship between the data in the first embedded table and the embedded parameters;
- the first processor adds the embedded parameters to which the part or all of the data is mapped to the value fields corresponding to the part or all of the data in the second lookup message, to obtain the third lookup message;
- the first processor sends the third lookup message to the second processor, where the third lookup message is used to look up the embedded parameters of the data in the second data for which no embedded parameter has been found.
- Each of the above N processors maintains an embedded table, and the embedded table is used to store data and the corresponding embedded parameters. Therefore, after a processor receives a lookup message for embedded parameters, the data in the lookup message can be indexed in the processor's embedded table; if data in the lookup message exists in the embedded table, the corresponding embedded parameters can be found.
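A minimal sketch of this lookup-and-forward step is given below; the message layout (a dict from data id to embedded parameter or None) and all names are illustrative assumptions, not the message format of this application:

```python
# Hedged sketch: fill in the embedded parameters this processor holds in its
# own embedded table, then return the (possibly augmented) lookup message so
# it can be forwarded to the next-hop processor.
def handle_lookup(message, local_table):
    for data_id, embedding in message.items():
        if embedding is None and data_id in local_table:
            message[data_id] = local_table[data_id]  # add to the value field
    return message

local_table = {7: [0.1, 0.2], 42: [0.3, 0.4]}  # first embedded table (toy)
msg = {7: None, 13: None}                      # second lookup message
print(handle_lookup(msg, local_table))         # {7: [0.1, 0.2], 13: None}
```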
- the first processor adds the embedded parameters of the part or all of the data to the second lookup message to obtain a third lookup message, and sends the third lookup message to the second processor, including:
- the first processor determines that the part or all of the data belongs to the first embedded table, and that the first embedded table does not yet include the part or all of the data; the first embedded table is an embedded table maintained by the first processor for storing data and embedded parameters, and there is a one-to-one mapping relationship between data and embedded parameters in the first embedded table;
- the first processor generates the respective embedded parameters corresponding to the part or all of the data;
- the first processor adds the respective embedded parameters corresponding to the part or all of the data to the value fields corresponding to the part or all of the data in the second lookup message, to obtain the third lookup message;
- the first processor sends the third lookup message to the second processor, where the third lookup message is used to look up the embedded parameter of the data for which the embedded parameter is not found in the second data.
- Each of the above N processors maintains an embedded table, and the embedded table is used to store data and the corresponding embedded parameters. Therefore, after a processor receives a lookup message for embedded parameters, if the processor determines that data in the message belongs to the processor's embedded table but is not yet in that table, the processor may randomly generate the corresponding embedded parameters for the data belonging to the table. Optionally, data belongs to a given embedded table when the remainder of the data modulo N equals the number of the training process run by the processor that maintains that table.
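A hedged sketch of this case follows; the modulo-N ownership test mirrors the optional rule above, while the random initialization range and all names are illustrative assumptions:

```python
import random

N = 4    # number of processors / training processes
DIM = 2  # embedding dimension (assumed)

def ensure_embedding(data_id, process_number, local_table):
    """If data_id belongs to this process's embedded table (id % N equals the
    training-process number) but has no entry yet, generate its parameters."""
    if data_id % N == process_number and data_id not in local_table:
        local_table[data_id] = [random.uniform(-0.1, 0.1) for _ in range(DIM)]
    return local_table.get(data_id)  # None if owned by another processor

table = {}
print(ensure_embedding(8, 0, table))  # 8 % 4 == 0: initialized here
print(ensure_embedding(9, 0, table))  # owned by process 1: None
```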
- the first processor sends the second search message to the second processor, including:
- in the case that the embedded parameters of the second data are not found in the first embedded table, the first processor sends the second lookup message to the second processor; the first embedded table is an embedded table maintained by the first processor for storing data and embedded parameters, and there is a one-to-one mapping relationship between data and embedded parameters in the first embedded table.
- In this solution, the processor directly sends the received lookup message to its next-hop processor based on the above ring communication architecture.
- the method further includes:
- the first processor receives a fourth lookup message from the third processor; the fourth lookup message includes third data and the embedded parameters to which the first part of the data in the third data is mapped, and the fourth lookup message is used to look up the embedded parameters of the data in the third data other than the first part of the data;
- in the case that the embedded parameters of the second part of the data in the third data are found based on the fourth lookup message, the first processor adds the embedded parameters of the second part of the data to the fourth lookup message to obtain a fifth lookup message, and sends the fifth lookup message to the second processor; or,
- in the case that the embedded parameters of the third data are not found based on the fourth lookup message, the first processor sends the fourth lookup message to the second processor.
- the above-mentioned ring communication architecture is used to realize the search of the embedded parameters required by each of the above N processors, and the ring communication of the search message can be implemented multiple times based on this architecture to search for the embedded parameters of the data.
- For example, at least N rounds of message communication and embedded-parameter lookup may be repeated among the N processors, so as to ensure that each processor can obtain the embedded parameters of all the data it requires.
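The N-round behavior can be simulated without any real interconnect; the sketch below (message layout and the ownership assignment are assumptions) shows each lookup message making one full circle and returning fully resolved:

```python
# Simulated ring: processor i owns the embedding of id i; each processor
# asks for two ids it does not own. After N hops every message is back at
# its sender with all value fields filled.
N = 4
tables = [{i: f"emb{i}"} for i in range(N)]
messages = [{(r + 1) % N: None, (r + 2) % N: None} for r in range(N)]

for _ in range(N):                       # N rounds of receive/fill/forward
    forwarded = [None] * N
    for rank in range(N):
        msg = messages[rank]
        for data_id in msg:
            if msg[data_id] is None and data_id in tables[rank]:
                msg[data_id] = tables[rank][data_id]
        forwarded[(rank + 1) % N] = msg  # send to the next hop only
    messages = forwarded

print(messages[0])                       # {1: 'emb1', 2: 'emb2'}
```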
- the method further includes: the first processor receives a sixth lookup message from the third processor, where the sixth lookup message includes the first data and an embedded parameter of the first data.
- Through the message communication among the above N processors, the embedded parameters required by each processor are found; after a full circle of forwarding, each processor receives back its own lookup message carrying all the embedded parameters.
- In a second aspect, the present application provides a data processing method, the method comprising:
- the first processor sends a first notification message to the second processor;
- the first notification message includes first data and a first gradient, and is used to propagate the first gradient to the first target processor;
- the first gradient is the gradient corresponding to the embedding parameter of the first data;
- the second processor is the next-hop processor of the first processor in the ring communication architecture where the first processor is located;
- the first processor receives a second notification message from the third processor;
- the second notification message includes second data and a second gradient, and is used to propagate the second gradient to the second target processor;
- the second gradient is the gradient corresponding to the embedding parameter of the second data;
- the third processor is the previous hop processor of the first processor in the ring communication architecture;
- the first processor, the second processor and the third processor are processors among the N processors included in the data training system, where N is an integer greater than or equal to 3;
- the ring communication architecture implements communication in which each of the N processors only receives messages from its previous-hop processor and only sends messages to its next-hop processor.
- During forward propagation, some embedded parameters of a processor's training data are obtained from other processors, that is, those embedded parameters are stored on other processors. During back propagation of training, however, the embedded parameters of the data need to be optimized based on the calculated gradients, so the processor needs to send the gradients corresponding to the embedded parameters of the data to the corresponding processors, enabling each corresponding processor to optimize its embedded parameters.
- Ring communication of these messages is implemented among the N processors through the ring communication architecture.
- In this way, the present application can make full use of the bandwidth resources between processors, avoid single-point communication bottlenecks, reduce communication delay and improve communication efficiency, thereby improving the training efficiency and performance of the entire data training system.
- the method further includes: when the second notification message includes the first target gradient, the first processor acquires the first target gradient from the second notification message and sends the second notification message to the second processor;
- the first target gradient is the gradient of an embedding parameter in the first embedding table maintained by the first processor, and there is a one-to-one mapping relationship between the data in the first embedding table and the embedding parameters;
- or, when the second notification message does not include the first target gradient, the first processor sends the second notification message to the second processor.
- In this solution, after receiving a gradient notification message, the processor continues to forward the notification message to the next-hop processor based on the above ring communication architecture, regardless of whether a gradient it requires is found in the notification message. Through cyclic forwarding, each processor can finally obtain the gradients it requires.
- the first processor obtains the first target gradient in the second notification message, including:
- the first processor determines that part or all of the data in the second data is data in the first embedded table
- the first processor obtains the first target gradient in the second notification message based on the part or all of the data.
- Each of the above N processors maintains an embedding table, and the embedding table is used to store data and the corresponding embedding parameters. Therefore, after a processor receives a gradient notification message, if data in the message exists in the processor's embedding table, the processor can obtain the corresponding gradient from the message to optimize the embedding parameter of that data.
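A hedged sketch of this step follows; the plain SGD update and the dict message layout are assumptions for illustration, since the application does not fix the optimization algorithm:

```python
# On receiving a notification message {id: gradient}, apply the gradients of
# ids in this processor's embedding table, then forward the message onward.
def handle_notify(message, local_table, lr=0.01):
    for data_id, grad in message.items():
        if data_id in local_table:
            emb = local_table[data_id]
            local_table[data_id] = [w - lr * g for w, g in zip(emb, grad)]
    return message  # forwarded to the next hop in either case

table = {7: [0.1, 0.2]}
handle_notify({7: [1.0, -1.0], 13: [0.5, 0.5]}, table)
print(table)  # {7: [approx. 0.09, approx. 0.21]}
```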
- the method further includes:
- the first processor receives a third notification message from the third processor; the third notification message includes third data and a third gradient, and is used to propagate the third gradient to the third target processor; the third gradient is the gradient corresponding to the embedding parameter of the third data;
- when the third notification message includes the second target gradient, the first processor acquires the second target gradient from the third notification message and sends the third notification message to the second processor; the second target gradient is the gradient of an embedding parameter in the first embedding table maintained by the first processor, where the first embedding table includes the mapping relationship between data and the embedding parameters of the data;
- or, when the third notification message does not include the second target gradient, the first processor sends the third notification message to the second processor.
- The above ring communication architecture is adopted to enable each of the above N processors to obtain the gradients it requires, and ring communication of notification messages can be implemented multiple times based on this architecture. For example, at least N-1 rounds of message communication are circulated among the N processors, so as to ensure that each processor can obtain all the gradients it requires.
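That N-1 hops suffice can be checked with a tiny simulation: after N-1 forwards, every notification message has passed through all of the other N-1 processors (names and setup are illustrative assumptions):

```python
N = 4
holder = list(range(N))              # message i currently held by holder[i]
visited = [set() for _ in range(N)]  # processors each message has reached

for _ in range(N - 1):               # N-1 rounds of forwarding
    for i in range(N):
        holder[i] = (holder[i] + 1) % N
        visited[i].add(holder[i])

# Every message has visited all N-1 other processors exactly once.
print(all(len(v) == N - 1 and i not in v for i, v in enumerate(visited)))
```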
- Any one of the above first aspect and its possible implementations can be implemented in combination with any one of the second aspect and its possible implementations: any one of the first aspect and its possible implementations applies to the forward-propagation process of the embedding layer in data training, and any one of the second aspect and its possible implementations applies to the back-propagation process of the embedding layer in data training.
- In a third aspect, the present application provides a data processing device, the device comprising:
- a sending unit configured to send a first search message to a second processor;
- the first search message includes first data, and the first search message is used to search for embedded parameters of the first data;
- the second processor is the next-hop processor of the first processor in the ring communication architecture where the first processor is located;
- a receiving unit for receiving a second search message from a third processor;
- the second search message includes second data, and the second search message is used to search for embedded parameters of the second data;
- the third processor is the previous-hop processor of the first processor in the ring communication architecture;
- the first processor, the second processor and the third processor are processors among the N processors included in the data training system, where N is an integer greater than or equal to 3;
- the ring communication architecture implements communication in which each of the N processors only receives messages from its previous-hop processor and only sends messages to its next-hop processor.
- the device further includes an adding unit
- the above-mentioned adding unit is configured to add the embedded parameters of the part or all of the data to the second search message when the embedded parameters of part or all of the data in the second data are found based on the second search message, to obtain the third lookup message;
- the above-mentioned sending unit is further configured to send the third search message to the second processor
- the sending unit is further configured to send the second search message to the second processor when the embedded parameter of the second data is not found based on the second search message.
- the device further includes a search unit
- the lookup unit is configured to look up, in the first embedded table, the embedded parameters to which the part or all of the data is mapped;
- the first embedded table is an embedded table maintained by the first processor for storing data and embedded parameters, and there is a one-to-one mapping relationship between the data in the first embedded table and the embedded parameters;
- the above-mentioned adding unit is specifically configured to add the embedded parameters to which the part or all of the data is mapped to the value fields corresponding to the part or all of the data in the second search message, to obtain the third search message;
- the above-mentioned sending unit is specifically configured to send the third search message to the second processor, where the third search message is used to search for the embedded parameter of the data for which the embedded parameter is not found in the second data.
- the above-mentioned apparatus further includes a determining unit and a generating unit;
- the determining unit is configured to determine that the part or all of the above data belongs to the first embedded table, and that the first embedded table does not yet include the part or all of the data; the first embedded table is an embedded table maintained by the first processor for storing data and embedded parameters, and there is a one-to-one mapping relationship between data and embedded parameters in the first embedded table;
- the generating unit is configured to generate the respective embedded parameters corresponding to the part or all of the data;
- the above-mentioned adding unit is specifically configured to add the respective embedded parameters corresponding to the part or all of the data to the value fields corresponding to the part or all of the data in the second search message, to obtain the third search message;
- the above-mentioned sending unit is specifically configured to send the third search message to the second processor, where the third search message is used to search for the embedded parameter of the data for which the embedded parameter is not found in the second data.
- the above-mentioned sending unit is specifically configured to: send the second search message to the second processor in the case that the embedded parameters of the second data are not found in the first embedded table; the first embedded table is an embedded table maintained by the first processor for storing data and embedded parameters, and there is a one-to-one mapping relationship between the data in the first embedded table and the embedded parameters.
- the above receiving unit is further configured to receive a fourth search message from the third processor;
- the fourth search message includes the third data and the embedded parameters of the first part of the data in the third data.
- the fourth lookup message is used to look up the embedded parameters of the data mapping in the third data except the first part of data;
- the apparatus further includes an adding unit, configured to add the embedded parameters of the second part of the data to the fourth search message to obtain the fifth search message, when the embedded parameters of the second part of the data in the third data are found based on the fourth search message;
- the above-mentioned sending unit is further configured to send the fifth search message to the second processor
- the sending unit is further configured to send the fourth search message to the second processor when the embedded parameter of the third data is not found based on the fourth search message.
- the receiving unit is further configured to: receive a sixth search message from the third processor, where the sixth search message includes the first data and an embedded parameter of the first data.
- In a fourth aspect, the present application provides a data processing device, the device comprising:
- a sending unit configured to send a first notification message to the second processor;
- the first notification message includes first data and a first gradient, and is used to propagate the first gradient to the first target processor;
- the first gradient is the gradient corresponding to the embedding parameter of the first data;
- the second processor is the next-hop processor of the first processor in the ring communication architecture where the first processor is located;
- a receiving unit configured to receive a second notification message from a third processor;
- the second notification message includes second data and a second gradient, and is used to propagate the second gradient to the second target processor;
- the second gradient is the gradient corresponding to the embedding parameter of the second data;
- the third processor is the previous hop processor of the first processor in the ring communication architecture;
- the first processor, the second processor and the third processor are processors among the N processors included in the data training system, where N is an integer greater than or equal to 3;
- the ring communication architecture implements communication in which each of the N processors only receives messages from its previous-hop processor and only sends messages to its next-hop processor.
- the device further includes an acquisition unit;
- the obtaining unit is configured to obtain the first target gradient from the second notification message when the second notification message includes the first target gradient;
- the sending unit is further configured to send the second notification message to the second processor;
- the first target gradient is the gradient of an embedding parameter in the first embedding table maintained by the first processor, and there is a one-to-one mapping relationship between the data in the first embedding table and the embedding parameters;
- the sending unit is further configured to send the second notification message to the second processor when the first target gradient is not included in the second notification message.
- the obtaining unit is specifically configured to: determine that part or all of the data in the second data is data in the first embedding table, and obtain the first target gradient from the second notification message based on the part or all of the data.
- the receiving unit is further configured to receive a third notification message from the third processor;
- the third notification message includes third data and a third gradient, and is used for propagating the third gradient into the third target processor;
- the third gradient is the gradient corresponding to the embedding parameter of the third data;
- the apparatus further includes an acquiring unit configured to acquire the second target gradient in the third notification message when the third notification message includes the second target gradient,
- the sending unit is further configured to send the third notification message to the second processor;
- the second target gradient is the gradient of an embedding parameter in the first embedding table maintained by the first processor, and the first embedding table includes the mapping relationship between the data and the embedding parameters of the data;
- the sending unit is further configured to send the third notification message to the second processor under the condition that the second target gradient is not included in the third notification message.
- In a fifth aspect, the present application provides an apparatus, which may include a processor and a memory, for implementing the data processing method described in the first aspect above.
- the memory is coupled to the processor, and when the processor executes the computer program stored in the memory, the method described in the first aspect or any possible implementation manner of the first aspect can be implemented.
- the apparatus may also include a communication interface for the apparatus to communicate with other apparatuses, for example, the communication interface may be a transceiver, circuit, bus, module or other type of communication interface.
- the communication interface includes a receiving interface and a sending interface, the receiving interface is used for receiving messages, and the sending interface is used for sending messages.
- the apparatus may include:
- a processor configured to send a first lookup message to a second processor through a sending interface; the first lookup message includes first data, and the first lookup message is used to look up the embedded parameters of the first data; the second processor is the next-hop processor of the first processor in the ring communication architecture where the first processor is located;
- the processor is further configured to receive a second lookup message from a third processor through a receiving interface; the second lookup message includes second data, and the second lookup message is used to look up the embedded parameters of the second data; the third processor is the previous-hop processor of the first processor in the ring communication architecture;
- the first processor, the second processor and the third processor are processors among the N processors included in the data training system, where N is an integer greater than or equal to 3;
- the ring communication architecture implements communication in which each of the N processors only receives messages from its previous-hop processor and only sends messages to its next-hop processor.
- the computer program in the memory in this application can be pre-stored or stored after being downloaded from the Internet when the device is used.
- This application does not specifically limit the source of the computer program in the memory.
- the coupling in the embodiments of the present application is an indirect coupling or connection between devices, units or modules, which may be in electrical, mechanical or other forms, and is used for information exchange between devices, units or modules.
- In a sixth aspect, the present application provides an apparatus, which may include a processor and a memory, for implementing the data processing method described in the second aspect above.
- the memory is coupled to the processor, and when the processor executes the computer program stored in the memory, the method described in the second aspect or any possible implementation manner of the second aspect can be implemented.
- the apparatus may also include a communication interface for the apparatus to communicate with other apparatuses, for example, the communication interface may be a transceiver, circuit, bus, module or other type of communication interface.
- the communication interface includes a receiving interface and a sending interface, the receiving interface is used for receiving messages, and the sending interface is used for sending messages.
- the apparatus may include:
- a processor configured to send a first notification message to the second processor through a sending interface;
- the first notification message includes first data and a first gradient, and is used to propagate the first gradient to the first target processor;
- the first gradient is the gradient corresponding to the embedding parameter of the first data;
- the second processor is the next-hop processor of the first processor in the ring communication architecture where the first processor is located;
- a second notification message from the third processor is received through the receiving interface;
- the second notification message includes second data and a second gradient, and is used for propagating the second gradient to the second target processor;
- the second gradient is the gradient corresponding to the embedding parameter of the second data;
- the third processor is the previous hop processor of the first processor in the ring communication architecture;
- the first processor, the second processor and the third processor are processors among the N processors included in the data training system, where N is an integer greater than or equal to 3;
- the ring communication architecture implements communication in which each of the N processors only receives messages from its previous-hop processor and only sends messages to its next-hop processor.
- the computer program in the memory in this application can be pre-stored or stored after being downloaded from the Internet when the device is used.
- This application does not specifically limit the source of the computer program in the memory.
- the coupling in the embodiments of the present application is an indirect coupling or connection between devices, units or modules, which may be in electrical, mechanical or other forms, and is used for information exchange between devices, units or modules.
- In a seventh aspect, the present application provides a data training system. The system includes N processors, where N is an integer greater than or equal to 3. Communication among the N processors is implemented through a ring communication architecture, in which each of the N processors only receives messages from its previous-hop processor and only sends messages to its next-hop processor. Each of the N processors may be the device described in any one of the third aspect and its possible implementations; or, each of the N processors may be the device described in any one of the fourth aspect and its possible implementations; or, each of the N processors may be the apparatus described in any one of the fifth aspect and its possible implementations; or, each of the N processors may be the apparatus described in any one of the sixth aspect and its possible implementations.
- In an eighth aspect, the present application provides a computer-readable storage medium in which a computer program is stored; when the computer program is executed by a processor, the method described in any one of the above first aspect and its possible implementations is implemented; or, when the computer program is executed by the processor, the method described in any one of the second aspect and its possible implementations is implemented.
- In a ninth aspect, the present application provides a computer program product; when the computer program product is executed by a processor, the method described in any one of the above first aspect and its possible implementations is performed; or, the method described in any one of the second aspect and its possible implementations is performed.
- FIG. 1 is a schematic diagram of an artificial intelligence main body framework provided by an embodiment of the present application.
- FIG. 2 is a schematic diagram of an application environment provided by an embodiment of the present application.
- FIG. 3 is a schematic structural diagram of a neural network processor according to an embodiment of the present application.
- FIG. 4 shows a schematic diagram of a data training system provided by an embodiment of the present application
- Figure 5 is a schematic diagram of a data training model
- Figure 7 is a schematic diagram of the communication part and the calculation part of the data training process
- FIG. 8 is a schematic diagram of performing ring communication in the data training model of the present application.
- FIG. 9 shows a schematic flowchart of a data processing method provided by the present application.
- FIGS. 10A to 10E are schematic flowcharts of ring communication provided by the present application.
- FIG. 11 is a schematic flowchart of another data processing method provided by the present application.
- FIGS. 12A to 12D are schematic flowcharts of ring communication provided by the present application.
- FIG. 13 is a schematic diagram showing the comparison of the communication throughput of the present solution and the communication throughput of the existing technical solution;
- FIG. 14 and FIG. 15 are schematic diagrams of the logical structure of the apparatus provided by the present application.
- FIG. 16 is a schematic diagram showing the physical structure of the apparatus provided by the present application.
- Figure 1 shows a schematic diagram of an artificial intelligence main frame, which describes the overall workflow of an artificial intelligence system and is suitable for general artificial intelligence field requirements.
- the "intelligent information chain” reflects a series of processes from data acquisition to processing. For example, it can be the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, intelligent execution and output. In this process, data has gone through the process of "data-information-knowledge-wisdom".
- the "IT value chain” reflects the value brought by artificial intelligence to the information technology industry from the underlying infrastructure of human intelligence, information (providing and processing technology implementation) to the industrial ecological process of the system.
- The infrastructure provides computing power support for the artificial intelligence system, realizes communication with the outside world, and is supported by the basic platform. It communicates with the outside world through sensors; computing power is provided by smart chips (hardware acceleration chips such as CPUs, NPUs, GPUs, ASICs and FPGAs); the basic platform includes distributed computing frameworks, networks, and related platform guarantees and support, which can include cloud storage and computing, interconnection networks, and so on. For example, sensors communicate with the outside world to obtain data, and these data are provided to the smart chips in the distributed computing system provided by the basic platform for calculation.
- the data on the upper layer of the infrastructure is used to represent the data sources in the field of artificial intelligence.
- the data involves graphics, images, voice, and text, as well as IoT data from traditional devices, including business data from existing systems and sensory data such as force, displacement, liquid level, temperature, and humidity.
- Data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making, etc.
- machine learning and deep learning can perform symbolic and formalized intelligent information modeling, extraction, preprocessing, training, etc. on data.
- Reasoning refers to the process of simulating human's intelligent reasoning method in a computer or intelligent system, using formalized information to carry out machine thinking and solving problems according to the reasoning control strategy, and the typical function is search and matching.
- Decision-making refers to the process of making decisions after intelligent information is reasoned, usually providing functions such as classification, sorting, and prediction.
- some general capabilities can be formed based on the results of data processing, such as algorithms or a general system, such as translation, text analysis, computer vision processing, speech recognition, image identification, etc.
- Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields; they are the encapsulation of the overall artificial intelligence solution, productizing intelligent information decision-making and implementing applications. The application fields mainly include intelligent manufacturing, intelligent transportation, smart home, smart medical care, smart security, autonomous driving, safe city, smart terminals, and so on.
- an embodiment of the present application provides a system architecture 200 .
- the data collection device 260 is used to collect training sample data and store it in the database 230 , and the training device 220 generates the target model/rule 201 based on the sample data maintained in the database 230 .
- the following will describe in more detail how the training device 220 obtains the target model/rule 201 based on the sample data, and the target model/rule 201 can implement functions such as click-through rate estimation, information recommendation, or search.
- The work of each layer in a deep neural network can be described mathematically. At the physical level, the work of each layer in the deep neural network can be understood as completing the transformation from the input space to the output space (that is, from the row space to the column space of the matrix) through five operations on the input space (the set of input vectors). These five operations include: 1. dimension raising/lowering; 2. enlarging/reducing; 3. rotation; 4. translation; 5. "bending". Operations 1, 2 and 3 are completed by W·x, the operation of 4 is completed by +b, and the operation of 5 is realized by a().
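Putting these five operations together, the per-layer transformation can be written compactly as:

$$\mathbf{y} = a(W\mathbf{x} + \mathbf{b})$$

where $W\mathbf{x}$ performs operations 1-3, $+\mathbf{b}$ performs the translation, and $a(\cdot)$ realizes the nonlinear "bending".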
- W is the weight vector, and each value in the vector represents the weight value of a neuron in the neural network of this layer.
- the vector W determines the space transformation from the input space to the output space described above, that is, the weight W of each layer controls how the space is transformed.
- the purpose of training the deep neural network is to finally obtain the weight matrix of all layers of the trained neural network (the weight matrix formed by the vectors W of many layers). Therefore, the training process of the neural network is essentially learning the way to control the spatial transformation, and more specifically, learning the weight matrix.
- Training adjusts the weight vectors of the network based on the difference between the predicted value and the desired target value (of course, there is usually an initialization process before the first update, that is, parameters are pre-configured for each layer in the deep neural network). For example, if the predicted value of the network is too high, the weight vectors are adjusted to make it predict lower, and the adjustment continues until the neural network can predict the actually desired target value.
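A standard gradient-descent update consistent with this description (the application leaves the concrete optimization algorithm open) is:

$$W \leftarrow W - \eta\,\frac{\partial \mathcal{L}}{\partial W}$$

where $\mathcal{L}$ is the loss function and $\eta$ is the learning rate.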
- the target models/rules obtained by training the device 220 can be applied in different systems or devices.
- the execution device 210 is configured with an I/O interface 212 for data interaction with external devices, and a "user" can input data to the I/O interface 212 through the client device 240.
- the execution device 210 can call data, codes, etc. in the data storage system 250 , and can also store data, instructions, etc. in the data storage system 250 .
- the calculation module 211 uses the target model/rule 201 to process the input data. For example, for a click rate estimation scenario, the calculation module 211 uses the target model/rule 201 to predict information that the user may click.
- the I/O interface 212 returns the processing result to the client device 240, which is provided to the user.
- the training device 220 can generate corresponding target models/rules 201 based on different data for different targets, so as to provide users with better results.
- the user can manually specify the data to be input into the execution device 210, e.g., by operating in the interface provided by the I/O interface 212.
- the client device 240 can automatically input data to the I/O interface 212 and obtain the result. If the client device 240 automatically inputs data and needs to obtain the user's authorization, the user can set the corresponding permission in the client device 240 .
- the user can view the result output by the execution device 210 on the client device 240, and the specific presentation form can be a specific manner such as display, sound, and action.
- the client device 240 can also act as a data collection terminal to store the collected sample data in the database 230 .
- FIG. 2 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship among the devices, devices, modules, etc. shown in the figure does not constitute any limitation.
- the data storage system 250 is an external memory relative to the execution device 210 , and in other cases, the data storage system 250 may also be placed in the execution device 210 .
- FIG. 3 is a structural diagram of a chip hardware provided by an embodiment of the present application.
- the neural network processor NPU 30 is mounted on the main CPU (Host CPU) as a co-processor, and tasks are assigned by the Host CPU.
- the core part of the NPU is the arithmetic circuit 303, which is controlled by the controller 304 to extract the matrix data in the memory and perform multiplication operations.
- the arithmetic circuit 303 includes multiple processing units (Process Engine, PE). In some implementations, the arithmetic circuit 303 is a two-dimensional systolic array. The arithmetic circuit 303 may also be a one-dimensional systolic array or other electronic circuitry capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 303 is a general-purpose matrix processor.
- the operation circuit fetches the data corresponding to the matrix B from the weight memory 302 and buffers it on each PE in the operation circuit.
- the operation circuit fetches the data of matrix A and matrix B from the input memory 301 to perform matrix operation, and the obtained partial result or final result of the matrix is stored in the accumulator 308 .
- Unified memory 306 is used to store input data and output data.
- the weight data is directly transferred to the weight memory 302 through a storage unit access controller (Direct Memory Access Controller, DMAC) 305 .
- Input data is also moved to unified memory 306 via the DMAC.
- The BIU, that is, the bus interface unit 310, is used for the interaction among the AXI bus, the DMAC 305 and the instruction fetch buffer 309.
- The bus interface unit 310 (Bus Interface Unit, BIU for short) is used for the instruction fetch buffer 309 to obtain instructions from the external memory, and also for the storage unit access controller 305 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
- the DMAC is mainly used to transfer the input data in the external memory DDR to the unified memory 306 , the weight data to the weight memory 302 , or the input data to the input memory 301 .
- the vector calculation unit 307 includes a plurality of operation processing units, and further processes the output of the operation circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, etc., if necessary. It is mainly used for non-convolutional layer network calculations in neural networks, such as pooling (Pooling), batch normalization (Batch Normalization), and local response normalization (Local Response Normalization).
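As one concrete illustration of these non-convolutional operations, a plain batch-normalization step can be sketched as follows (the epsilon value and the pure-Python form are assumptions for illustration; the hardware performs the equivalent vector operations):

```python
import math

# Illustrative batch normalization over a vector of activations:
# subtract the batch mean and divide by the batch standard deviation.
def batch_norm(xs, eps=1e-5):
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [(x - mean) / math.sqrt(var + eps) for x in xs]

print(batch_norm([1.0, 2.0, 3.0, 4.0]))  # zero-mean, unit-variance output
```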
- the vector computation unit 307 stores the processed output vectors to the unified memory 306.
- the vector calculation unit 307 may apply a nonlinear function to the output of the arithmetic circuit 303, such as a vector of accumulated values, to generate activation values.
- vector computation unit 307 generates normalized values, merged values, or both.
- the vector of processed outputs can be used as activation input to the arithmetic circuit 303, e.g., for use in subsequent layers in a neural network.
- the instruction fetch memory (instruction fetch buffer) 309 connected to the controller 304 is used to store the instructions used by the controller 304;
- the unified memory 306, the input memory 301, the weight memory 302 and the instruction fetch memory 309 are all On-Chip memories. External memory is private to the NPU hardware architecture.
- FIG. 4 is a schematic diagram of the data training system provided by the present application.
- the system includes N processors, where N is an integer greater than one.
- the N processors may be the training device 220 in FIG. 2 described above.
- a ring communication architecture can be used to implement message communication.
- the ring communication architecture is a logical architecture for realizing ring communication among the N processors.
- each of the N processors only receives messages from the previous hop processor of each processor, and only sends messages to the next hop processor of each processor.
- processor 0 only sends messages to processor 1 and only receives messages from processor 3;
- processor 1 only sends messages to processor 2, and only receives messages from processor 0;
- processor 2 only sends messages to processor 3, and only receives messages from processor 1;
- processor 3 only sends messages to processor 0, and only receives messages from processor 2.
- the above-mentioned ring communication mode based on the ring communication architecture may be implemented by adopting a ring communication mode in a message passing interface (message passing interface, MPI).
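For reference, a minimal mpi4py version of such a ring step might look as follows (a hedged sketch, not the implementation of this application; it assumes the program is launched with one MPI rank per training process, e.g. via mpirun):

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
next_hop, prev_hop = (rank + 1) % size, (rank - 1) % size

msg = {"origin": rank, "payload": {}}  # e.g., a lookup message
for _ in range(size):                  # one full trip around the ring
    # sendrecv pairs the send-to-next with the receive-from-previous,
    # so the ring cannot deadlock.
    msg = comm.sendrecv(msg, dest=next_hop, source=prev_hop)
    # ...inspect/augment msg here, as in the embedding lookup above...
```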
- In the whole training process, the message communication may adopt the above ring communication architecture to realize communication;
- such messages include, for example, messages used to look up the embedding parameters of data during forward propagation of the embedding layer, and/or messages used during back propagation of the embedding layer to obtain the gradients for optimizing the embedding parameters. The rest of the message communication may adopt other communication methods, which is not limited in this application.
- The above embedding parameter is in the form of a vector, and the embedding parameter may also be referred to as an embedding vector.
- the above gradient can also be in the form of a vector, and the gradient can also be called a gradient vector.
- The N processors may all be graphics processing units (GPUs); or, the N processors may all be neural-network processing units (NPUs); or, the N processors may be partly GPUs and partly NPUs.
- the NPU may be the neural network processor described in FIG. 3 above. It should be noted that the N processors are not limited to being GPUs or NPUs, but may also be other high-speed processors.
- the above data training system can be applied to scenarios where the amount of training embedded parameters reaches tens of billions, or even hundreds of billions.
- the data training system can be applied in practical application scenarios such as information search, information recommendation, and advertising scenarios such as click-through rate (click-through rate, CTR) prediction.
- the data trained by the data training system may be sparse data, or may be dense data, which is not limited in this application.
- the above-mentioned data to be trained may be identification (identity document, id) data
- the id data may be a number or a character string or the like.
- the id data may be the identification code of the product or the address of the merchant's store, or the like.
- the data processing method provided by this application is mainly described below by taking id data as the data to be trained as an example; however, the method can also process other types of data and is not limited to id data.
- the data trained by each of the N processors may be exemplified as shown in FIG. 5.
- each of the N pieces of data may be referred to as a batch of data (batch); each batch contains q sample data, and each sample data contains multiple pieces of data.
- q is an integer greater than 0 and is called the batch size; the number of data contained in each sample data may be different.
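- purely as an illustration of this structure, a batch with batch size q = 3 could look as follows (the ids and per-sample counts are invented):

```python
# One batch: q = 3 sample data, each holding a variable number of id data.
batch = [
    [10, 21, 14],     # sample 0: three ids
    [19, 31],         # sample 1: two ids
    [23, 3, 8, 12],   # sample 2: four ids
]
```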
- each processor runs a training process to train the corresponding data, and each training process has its own serial number, which is used by the processor to distinguish different processes.
- the message communication between the processors described later can also be said to be the message communication between the training processes.
- the training of the data is performed by a deep-learning neural network. Therefore, the model with which each processor trains data includes, but is not limited to, sub-models such as an input layer, an embedding layer, a hidden layer, a loss function operator, a gradient calculation operator, and a parameter update operator; the model of the training data shown in FIG. 5 only exemplarily draws part of these sub-models.
- the whole training process includes forward-propagation (FP) process and back-propagation (back-propagation, BP) process.
- the embedding layer in forward propagation and the embedding layer in back propagation shown in FIG. 5 are the same embedding layer; similarly, the hidden layer in forward propagation and the hidden layer in back propagation are the same hidden layer. They are drawn separately only to better distinguish and reflect the processes of forward propagation and back propagation.
- the process of forward propagation includes: inputting the data into the embedding layer, which maps the data into dense embedding parameters for calculation; during the calculation of the embedding layer, the N processors need to communicate messages to find the embedding parameters of their respective training data (why and how they communicate is introduced in detail later and not repeated here); the output of the embedding layer is the embedding parameters of the data, and these embedding parameters are input to the hidden layer for calculation to output a predicted value.
- the output predicted value can be combined with the label to establish a loss function (loss), and the gradient can be calculated by automatic differentiation.
- the process of back propagation includes: based on the above-mentioned loss function and gradient, the processor derives the gradients of all training parameters of the hidden layer and the embedding layer through reverse chain differentiation, and then optimizes the parameters through an optimization algorithm. Specifically, when these gradients are back-propagated to the embedding layer, the processor calculates, based on them, the gradient corresponding to the embedding parameter of each data; then, the N processors obtain the gradients of the embedding parameters each of them requires through message communication, and each processor optimizes the corresponding embedding parameters based on the obtained gradients (why and how they communicate is introduced in detail later and not repeated here).
- the function of the embedding layer is mainly to map the data into dense vectors, and these dense vectors are the above-mentioned embedding parameters. Since the amount of data to be trained is huge and the model is trained in parallel, in order to facilitate calculation and save the computing resources of preprocessing, the data to be trained can be randomly allocated to the N processors (N training processes) for training.
- each processor or the training process of each processor maintains an embedding table (embedding table), which is used to store data and embedding parameters.
- the embedded parameters of the data randomly assigned to a processor are not necessarily in that processor's embedding table and may need to be obtained from the embedding tables of other processors; therefore, the processors need to query their respective embedded parameters from one another through message communication.
- the present application may divide different embedding tables by modulo (mod) calculation. Specifically, the remainder after the modulo-N calculation is the same for all data in the same embedding table. Optionally, the remainder after the data in the embedding table of processor i among the N processors is calculated modulo N is i, and the process number of the training process in processor i is i.
- the present application may divide different embedding tables by "division" calculation. Specifically, the integer quotient of the data in the same embedding table divided by N is the same. For example, assuming that N is 3, data 4 and 5 belong to the same embedding table, since 4 divided by 3 equals 1 and 5 divided by 3 also equals 1.
- the present application may divide different embedding tables by random allocation. Specifically, the data in the embedding table of each of the N processors is random. In this case, the data itself can be used directly as an index to find the corresponding embedded parameters in the embedding table.
- the embodiments of the present application do not limit the manner of how to segment the embedding table.
- the following describes the specific implementation process mainly by taking the embedding table segmented by modulo calculation as an example. This does not constitute a limitation to the embodiments of the present application.
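- a minimal sketch of the first two segmentation rules (N = 3 here; the helper names are illustrative):

```python
N = 3  # number of processors / training processes

def table_by_mod(data_id: int) -> int:
    # Modulo rule: ids with the same remainder modulo N share one embedding
    # table, and that remainder equals the owning process's serial number.
    return data_id % N

def table_by_div(data_id: int) -> int:
    # "Division" rule: ids with the same integer quotient share one table.
    return data_id // N

assert table_by_mod(10) == table_by_mod(19) == 1  # 10 and 19 share a table
assert table_by_div(4) == table_by_div(5) == 1    # matches the 4-and-5 example
```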
- the data and embedding parameters in the embedding table of each of the above N processors can be initialized by loading the data and embedding parameters of the embedding table of a previously trained model as the initialization data of the embedding table in the current training. Alternatively, for data queried for the first time when looking up the embedding table, random numbers can be used directly to initialize the embedding parameters of the data, and the data and the randomly generated embedding parameters are inserted into the embedding table to complete the initialization. This application does not limit the data initialization method of the embedding table.
- Table 1 exemplarily shows the content and structure of the embedding table; the id in Table 1 is the data, and in the embedding table each data maps to one embedding parameter.
- m in Table 1 can be any integer greater than 1.
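- as a sketch (the embedding width and initialization scale are illustrative assumptions), such a table can be viewed as a mapping from id to embedding parameter, with the first-lookup random initialization described above:

```python
import random

EMBED_DIM = 8  # illustrative embedding width; not fixed by the application

# One processor's embedding table: id -> embedding parameter (a dense vector),
# mirroring the id/value layout of Table 1.
embedding_table: dict[int, list[float]] = {}

def lookup_or_init(data_id: int) -> list[float]:
    # An id queried for the first time gets randomly initialized parameters,
    # which are inserted into the table (one of the options described above).
    if data_id not in embedding_table:
        embedding_table[data_id] = [random.gauss(0.0, 0.01) for _ in range(EMBED_DIM)]
    return embedding_table[data_id]
```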
- after a processor calculates the gradients of the embedded parameters corresponding to its training data, it needs to distribute the gradient corresponding to each data's embedded parameter to the processor holding the embedding table where that data is located, so that that processor can optimize the embedded parameters in its own embedding table. For ease of understanding, see the example in Table 2.
- Table 2:
  | Processor number | Training data | Remainder of the data modulo 3 |
  | Processor 0 (training process 0) | 10, 21, 14 and 19 | 1, 0, 2 and 1 |
  | Processor 1 (training process 1) | 31, 23, 3 and 8 | 1, 2, 0 and 2 |
  | Processor 2 (training process 2) | 12, 5, 19 and 33 | 0, 2, 1 and 0 |
- in Table 2, it is assumed that the above N is 3, that is, the data is trained by 3 processors.
- Table 2 exemplifies the data randomly assigned to each processor for training, and gives the remainder of the modulo operation of these data with 3.
- the data to be trained randomly obtained by processor 0 are 10, 21, 14 and 19, and the remainders of these data modulo 3 are 1, 0, 2 and 1, respectively.
- assume that the remainder after the data in the embedding table of processor i is calculated modulo N is i, that is: the remainder after modulo-3 calculation of the data in the embedding table of processor 0 is 0, that of processor 1 is 1, and that of processor 2 is 2.
- in this case, the embedding table in processor 0 only has embedded parameters mapped to data whose remainder modulo 3 is 0, and has no embedded parameters for data whose remainder modulo 3 is 1 or 2. Therefore, in the forward propagation process, processor 0 needs to communicate with processor 1 and processor 2 to obtain the embedded parameters of data 10, 14 and 19.
- in the back propagation process, processor 0 calculates the gradients corresponding to data 10, 21, 14 and 19, which are used to correct and update the embedded parameters of data 10, 21, 14 and 19 in the embedding tables.
- the embedded parameters of data 10 and 19 are in processor 1, and the embedded parameter of data 14 is in processor 2. Therefore, processor 0 needs to send the calculated gradients of data 10 and 19 to processor 1, and send the gradient of data 14 to processor 2.
- the communication of this gradient can be achieved by message communication.
- if the N processors send messages to each other in a many-to-many manner, that is, each of the N processors sends messages to multiple other processors, a large amount of sending bandwidth resources and receiving bandwidth resources is consumed, and the processors need to queue to send and to receive messages, which easily leads to communication bottlenecks and increases communication delays.
- in order to facilitate understanding that the above-mentioned message communication process cannot be overlapped and optimized with the calculation process, refer to FIG. 7. It can be seen in FIG. 7 that the communication part and the calculation part cannot be overlapped: the processor needs to wait for the completion of the communication process before performing the next calculation process. Therefore, if the delay of the communication process is large, the efficiency of the entire training process is seriously affected, thereby reducing the performance of the training system.
- in view of this, the present application provides a data processing method, which can improve the utilization of communication bandwidth between processors, reduce the communication delay in the forward propagation and back propagation of the embedding layer, and improve training efficiency.
- the data processing method provided by the present application mainly deploys a ring communication architecture among the above N processors, so that during the forward propagation of the embedding layer each processor communicates with the other processors through the ring communication architecture to find the corresponding embedded parameters, and during the back propagation of the embedding layer each processor communicates with the other processors through the ring communication architecture to obtain the gradients corresponding to the embedded parameters of the data it requires.
- as shown in FIG. 8, in the forward propagation process, each processor can search for the required embedded parameters through the ring communication architecture; and in the back propagation process, the N processors can carry out message communication through the ring communication architecture to obtain the required gradients.
- specifically, in the forward propagation process, each of the above N processors can generate a search message, and then each of the N processors sends its generated search message to its own next-hop processor in the communication mode of the ring communication architecture.
- after each processor receives a search message, it can identify whether the data in the search message belongs to the data in the embedding table maintained by itself. If some data belongs to it, the processor finds the embedded parameters corresponding to that data and adds the found embedded parameters to the received message. Then, the processor sends the message with the added embedded parameters to its next-hop processor, again in the communication mode of the ring communication architecture.
- if no data in the message belongs to its embedding table, the processor directly sends the received message to the next-hop processor in the communication mode of the ring communication architecture. After the search and sending operations are repeated at least N times, each processor can obtain the embedded parameters of all the data it searches for.
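- to make this circulation concrete, the following pure-Python simulation rotates the messages around a 4-processor ring (list indices stand in for the ring links; the batches match the FIG. 10A example below, and the modulo-4 tables and string parameters are illustrative assumptions):

```python
# Simulation of the forward-pass ring lookup: each processor's message visits
# every other processor, each of which fills in the value fields it owns.
N = 4
tables = [{i: f"param({i})" for i in range(40) if i % N == p} for p in range(N)]
wanted = [[21, 5, 14, 25], [19, 2, 10, 32], [13, 8, 16, 29], [6, 33, 18, 4]]

# Each processor generates a search message: id -> value field (empty at first).
messages = [{i: None for i in ids} for ids in wanted]

for _ in range(N):  # after N steps every message has returned home, filled in
    # each processor sends its current message to its next hop ...
    messages = [messages[(p - 1) % N] for p in range(N)]
    # ... then fills in the value fields it can serve from its own table
    for p in range(N):
        for i, v in messages[p].items():
            if v is None and i % N == p:
                messages[p][i] = tables[p][i]

for p in range(N):  # every processor got all of its own embedded parameters
    assert set(messages[p]) == set(wanted[p])
    assert None not in messages[p].values()
```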
- in the back propagation process, each of the above N processors can generate a message including data and the corresponding gradients, and then each of the N processors sends its generated message to its own next-hop processor in the communication mode of the ring communication architecture. After each processor receives a message, it can identify whether the data in the message belongs to the data in the embedding table it maintains. If some data belongs to it, the processor obtains the gradients corresponding to that data from the message to optimize and update the corresponding embedded parameters in its embedding table. Then, it sends the received message to the next-hop processor in the communication mode of the ring communication architecture.
- if no data in the message belongs to its embedding table, the processor likewise sends the received message to the next-hop processor in the communication mode of the ring communication architecture. After the sending and obtaining operations are repeated for at least N-1 cycles, each processor can obtain the gradients corresponding to the embedded parameters of all the data in its own embedding table, so that the optimization and update of the embedded parameters of all the data can be completed.
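- a companion sketch for this back-propagation direction (same plain-Python ring; the gradients, learning rate, and SGD-style update are illustrative assumptions, since the application does not fix an update rule):

```python
# Each processor circulates a message of id -> gradient; the owner of an id
# extracts the gradient and updates its own embedding parameter.
N = 4
LR = 0.1
params = [{i: 1.0 for i in range(40) if i % N == p} for p in range(N)]
grads = [{21: 0.5, 14: 0.2}, {19: 0.1, 32: 0.4}, {8: 0.3}, {6: 0.7}]

messages = list(grads)
for _ in range(N - 1):  # N-1 hops suffice to visit every other processor
    messages = [messages[(p - 1) % N] for p in range(N)]
    for p in range(N):
        for i, g in messages[p].items():
            if i % N == p:              # this id lives in processor p's table
                params[p][i] -= LR * g  # illustrative SGD-style update

assert params[1][21] == 1.0 - 0.1 * 0.5  # processor 1 applied the gradient of 21
```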
- if the data to be trained is sparse data, then before finding the embedded parameters of the sparse data during forward propagation in the embedding layer, the processor may first convert the sparse data into dense data, and then use the converted dense data as the index to find the corresponding embedded parameters.
- in this case, the data carried in the messages sent and received based on the above ring communication architecture is in dense form, and the data in the embedding tables maintained by the processors is also in dense form.
- the process of ring message communication among the above N processors through the ring communication architecture can be encapsulated into communication interfaces: the operation of each processor sending a message to its next-hop processor can be encapsulated into a sending interface, and the operation of each processor receiving a message from its previous-hop processor can be encapsulated into a receiving interface. Thus, when a processor needs to send a message based on the above ring communication architecture, it calls the encapsulated sending interface, and when it needs to receive a message based on the above ring communication architecture, it calls the encapsulated receiving interface.
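- a hypothetical shape for such encapsulated interfaces (the function names are invented; a production version would pair or overlap sends and receives, e.g. with MPI sendrecv, to avoid deadlock when all ranks send at once):

```python
from mpi4py import MPI

_comm = MPI.COMM_WORLD

def ring_send(message: dict) -> None:
    """Encapsulated sending interface: send to the next-hop processor."""
    _comm.send(message, dest=(_comm.Get_rank() + 1) % _comm.Get_size())

def ring_recv() -> dict:
    """Encapsulated receiving interface: receive from the previous hop."""
    return _comm.recv(source=(_comm.Get_rank() - 1) % _comm.Get_size())
```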
- furthermore, the search process of the embedded parameters in the forward propagation of the embedding layer and the gradient acquisition process in the back propagation of the embedding layer can each be encapsulated into a callable interface and exposed to the artificial intelligence (artificial intelligence, AI) framework.
- in this way, when embedded parameters need to be searched in forward propagation, the processor can directly call the encapsulated interface to search for the embedded parameters and return the search result.
- the search result may be the embedded parameters of the found data; or, if no corresponding embedded parameter is found, the returned result may be a null value.
- when gradients need to be obtained in back propagation, the processor can directly call the encapsulated interface to search for the gradients and return the operation result. Since the processor searches the message only for the gradients corresponding to the embedded parameters of the data in its own embedding table, the returned operation result may be a null value whether or not anything is found.
- the manner in which each operation of the above data processing method provided by the present application is encapsulated into an interface is not limited to the implementation shown in the above example, and each interface is provided for use by the AI framework, which is not limited in this application.
- the data processing method may include but is not limited to the following steps:
- the first processor sends a first search message to the second processor; the first search message includes the first data and is used to look up the embedded parameters of the first data; the second processor is the next-hop processor of the first processor in the ring communication architecture where the first processor is located.
- the above-mentioned N processors need to obtain the embedding parameters of the respective required data through message communication.
- the message communication among the N processors may be implemented by adopting a ring communication architecture.
- the communication among the first processor, the next-hop processor of the first processor (the above-mentioned second processor), and the previous-hop processor of the first processor (the third processor in the following step 902) is first taken as an example to introduce the search process of the embedded parameters using the ring communication architecture.
- the above-mentioned first data may include one or more data.
- the first data may be sparse data or dense data.
- for the content included in the first search message, see the example in Table 3.
- Table 3 exemplarily shows part of the content included in the first search message.
- the id in Table 3 is the data, and the value field is used to fill in the embedded parameter corresponding to the data.
- k1 in Table 3 can be any integer greater than 0.
- before the embedded parameters are found, the value field corresponding to the data in the message may be a null value or a default original value (for example, the original value may be 0, etc.).
- the above-mentioned first search message includes the first data, so that a processor receiving the message can search for the embedded parameters corresponding to the first data based on the first data, and fill them into the value fields corresponding to the first data in the message.
- optionally, the above-mentioned first search message may also include the remainder of each data modulo N. Since this remainder is the same as the serial number of the processor (i.e., the serial number of the training process) where the embedding table containing the data is located, it can also be said that the first search message may also include the serial number of the process where the embedding table containing the data is located. See the example in Table 4.
- the process number is included in the first search message so that, after receiving the message, a processor can quickly determine through the process number which data belongs to the embedding table it maintains, so as to quickly find the corresponding embedded parameters and fill them into the value fields of the message, which can improve the search efficiency.
- the format of the content included in the first search message may be the format shown in Table 4.
- the first processor receives a second search message from the third processor; the second search message includes the second data and is used to look up the embedded parameters of the second data; the third processor is the previous-hop processor of the first processor in the ring communication architecture.
- the above-mentioned second data may include one or more data, and the second data is generally different from the above-mentioned first data.
- the second data may be sparse data or dense data.
- part of the data in the second data may be the same as part of the data in the first data.
- the format of the second search message is similar to that of the first search message; for the content format of the second search message, reference may be made to the description corresponding to Table 3 or Table 4 above, which is not repeated here.
- after the first processor sends the above-mentioned first search message to its next-hop processor (the second processor) and receives the above-mentioned second search message from its previous-hop processor (the third processor), the first processor performs the embedded-parameter search operation in response to the second search message, which is described below in two cases:
- in the first case, if part or all of the second data belongs to the data in the embedding table maintained by the first processor, the first processor adds the embedded parameters of that part or all of the data to the second search message to obtain a third search message, and sends the third search message to the second processor. The third search message is used to search for the embedded parameters of the data in the second data for which no embedded parameters have been found.
- specifically, after the first processor receives the second search message from the third processor, the first processor parses the message to obtain its content, and compares the second data with the data in the embedding table it maintains. If part or all of the second data exists in the embedding table, the first processor obtains the embedded parameters mapped to that part or all of the data from the embedding table, adds them to the value fields corresponding to that data in the second search message to obtain the third search message, and then sends the third search message to the next-hop processor, that is, to the second processor.
- the embedded parameter may be added to the value field of the message by operations such as accumulation.
- exemplarily, assume that the remainder after modulo-N calculation of the data in the embedding table of processor i among the N processors is i, and that the content carried by the second search message is in the format shown in Table 3 above, that is, the process number is not carried.
- the first processor parses the message to obtain the second data in the message.
- the first processor performs modulo calculation on each data of the second data and N, and obtains the remainder after the modulo of each data.
- if one or more of the remainders are the same as the serial number of the training process run by the first processor, the data corresponding to those one or more remainders exists in the embedding table maintained by the first processor.
- the first processor uses the data corresponding to the one or more remainders as an index, and finds the embedding parameters of the data corresponding to the one or more remainders in the embedding table.
- the found embedded parameters are correspondingly added to the value fields corresponding to the data corresponding to the one or more remainders in the second search message to obtain the third search message.
- the first processor sends the third search message to the next-hop processor, that is, to the second processor.
- optionally, when the first processor uses the data corresponding to the one or more remainders as an index to find the embedded parameters of that data in its embedding table, if some of that data is not in the embedding table, the processor can randomly generate corresponding embedded parameters for the data not in the embedding table, and then add both the embedded parameters found in the embedding table and the randomly generated embedded parameters to the corresponding value fields in the second search message to obtain the third search message.
- the first processor sends the third search message to the next-hop processor, that is, to the second processor.
- in addition, the processor adds the data not in the embedding table and the randomly generated embedded parameters into the embedding table in one-to-one correspondence.
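- pulling this first case together, a sketch of the response to a received search message under the modulo scheme (the names, dimension, and random initialization scale are illustrative):

```python
import random

def handle_search_message(msg: dict, my_rank: int, n: int,
                          table: dict, dim: int = 8) -> dict:
    """Fill the value field of every id this processor owns (id % n == my_rank).

    Ids owned here but absent from the table get randomly initialized embedded
    parameters, which are also inserted into the table, as described above.
    The returned message plays the role of the third search message, to be
    forwarded to the next-hop processor.
    """
    for data_id, value in msg.items():
        if value is None and data_id % n == my_rank:
            if data_id not in table:
                table[data_id] = [random.gauss(0.0, 0.01) for _ in range(dim)]
            msg[data_id] = table[data_id]
    return msg
```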
- exemplarily, assume that the remainder after modulo-N calculation of the data in the embedding table of processor i among the N processors is i, and that the content carried by the second search message is in the format shown in Table 4 above, that is, the process number is carried.
- after receiving the second search message, the first processor parses the message to obtain the second data in the message and the corresponding process numbers.
- if the process numbers in the second search message include one or more serial numbers of the training process run by the first processor, the data corresponding to those one or more serial numbers exists in the embedding table maintained by the first processor.
- the first processor uses the data corresponding to the one or more serial numbers as an index, and searches the embedding table to find the embedding parameters of the data corresponding to the one or more serial numbers.
- the found embedded parameters are correspondingly added to the value fields corresponding to the data corresponding to the one or more serial numbers in the second search message to obtain the third search message. Then, the first processor sends the third search message to the next-hop processor, that is, to the second processor.
- optionally, when the first processor uses the data corresponding to the one or more serial numbers as an index to find the embedded parameters of that data in the embedding table it maintains, if some of that data is not in the embedding table, the processor can randomly generate corresponding embedded parameters for the data not in the embedding table, and then add both the embedded parameters found in the embedding table and the randomly generated embedded parameters to the corresponding value fields in the second search message to obtain the third search message.
- the first processor sends the third search message to the next-hop processor, that is, to the second processor.
- in addition, the processor adds the data not in the embedding table and the randomly generated embedded parameters into the embedding table in one-to-one correspondence.
- for example, assume that the second search message includes the content shown in Table 5, in which the value fields are empty by default. The first processor determines that data 9 and 3 in Table 5 belong to the data in the embedding table it maintains, finds that the embedded parameters of 9 and 3 in the embedding table are parameter a and parameter b respectively, and then directly adds parameter a and parameter b to the value fields corresponding to 9 and 3; see Table 6 for the result.
- the third search message obtained above includes the content shown in Table 6.
- in the second case, the first processor sends the second search message to the second processor when the embedded parameters of the second data are not found based on the second search message.
- specifically, if the first processor determines that no data in the second data included in the second search message belongs to the data in the embedding table maintained by the first processor itself, that is, the first processor cannot find the embedded parameters of the second data in its own embedding table, then the first processor sends the second search message to its next-hop processor, that is, to the above-mentioned second processor.
- exemplarily, assume that the remainder after modulo-N calculation of the data in the embedding table of processor i among the above N processors is i. Then data whose remainder after the modulo-N operation is the same as the serial number of the training process run by the first processor belongs to the data in the embedding table maintained by the first processor, and the other data does not.
- after the first processor receives the second search message and completes the response operation to it, the first processor further receives a fourth search message from the third processor; the fourth search message includes the third data and the embedded parameters mapped to the first part of the data in the third data, and is used to look up the embedded parameters mapped to the data in the third data other than the first part of the data.
- that is, the embedded parameters of the first part of the third data carried in the fourth search message have already been found in other processors; therefore, the fourth search message carries the embedded parameters of the first part of the data.
- the first part of data is one or more data in the third data.
- the third data may be sparse data or dense data.
- the first processor performs the embedded-parameter search operation in response to the fourth search message, which is likewise described in two cases:
- in the first case, if the second part of the data in the third data belongs to the data in the embedding table maintained by the first processor, the first processor adds the embedded parameters of the second part of the data to the fourth search message to obtain a fifth search message, and sends the fifth search message to the above-mentioned second processor. The second part of the data is one or more pieces of data in the third data, and the second part of the data and the first part of the data are different data.
- specifically, after the first processor receives the fourth search message from the third processor, the first processor parses the message to obtain its content, and compares the third data with the data in the embedding table it maintains. If the second part of the data in the third data exists in the embedding table, the first processor obtains the embedded parameters mapped to the second part of the data from the embedding table, adds them to the value fields corresponding to the second part of the data in the fourth search message to obtain the fifth search message, and then sends the fifth search message to the next-hop processor, that is, to the second processor.
- the fifth search message is used to search for the embedded parameters of the data in the third data for which no embedded parameters have been found.
- exemplarily, assume that the remainder after modulo-N calculation of the data in the embedding table of processor i among the N processors is i, and that the content carried by the fourth search message is in the format shown in Table 3 above, that is, the process number is not carried.
- the first processor parses the message to obtain the third data in the message.
- the first processor performs a modulo calculation on each data of the third data with N to obtain a remainder after the modulo of each data.
- if one or more of the remainders are the same as the serial number of the training process run by the first processor, the data corresponding to those one or more remainders exists in the embedding table maintained by the first processor.
- the data corresponding to the one or more remainders is the above-mentioned second part of the data.
- the first processor uses the data corresponding to the one or more remainders as an index, and finds the embedding parameters of the data corresponding to the one or more remainders in the embedding table.
- the found embedded parameters are correspondingly added to the value fields corresponding to the data corresponding to the one or more remainders in the fourth search message to obtain the fifth search message.
- the first processor sends the fifth search message to the next-hop processor, that is, to the second processor.
- optionally, when the first processor uses the data corresponding to the one or more remainders as an index to find the embedded parameters of that data in the embedding table, if some of that data is not in the embedding table, the processor can randomly generate corresponding embedded parameters for the data not in the embedding table, and then add both the embedded parameters found in the embedding table and the randomly generated embedded parameters to the corresponding value fields in the fourth search message to obtain the fifth search message.
- the first processor sends the fifth search message to the next-hop processor, that is, to the second processor.
- in addition, the processor adds the data not in the embedding table and the randomly generated embedded parameters into the embedding table in one-to-one correspondence.
- exemplarily, assume that the remainder after modulo-N calculation of the data in the embedding table of processor i among the N processors is i, and that the content carried by the fourth search message is in the format shown in Table 4 above, that is, the process number is carried.
- after receiving the fourth search message, the first processor parses the message to obtain the third data in the message and the corresponding process numbers.
- if the process numbers in the fourth search message include one or more serial numbers of the training process run by the first processor, the data corresponding to those one or more serial numbers exists in the embedding table maintained by the first processor.
- the data corresponding to the one or more serial numbers is the above-mentioned second part of the data.
- the first processor uses the data corresponding to the one or more serial numbers as an index, and searches the embedding table to find the embedding parameters of the data corresponding to the one or more serial numbers.
- the found embedded parameters are correspondingly added to the value fields corresponding to the data corresponding to the one or more sequence numbers in the fourth search message to obtain a fifth search message.
- the first processor sends the fifth search message to the next-hop processor, that is, to the second processor.
- optionally, when the first processor uses the data corresponding to the one or more serial numbers as an index to find the embedded parameters of that data in the embedding table it maintains, if some of that data is not in the embedding table, the processor can randomly generate corresponding embedded parameters for the data not in the embedding table, and then add both the embedded parameters found in the embedding table and the randomly generated embedded parameters to the corresponding value fields in the fourth search message to obtain the fifth search message.
- the first processor sends the fifth search message to the next-hop processor, that is, to the second processor.
- in addition, the processor adds the data not in the embedding table and the randomly generated embedded parameters into the embedding table in one-to-one correspondence.
- for example, assume that the fourth search message includes the content shown in Table 7. The first processor determines that data 15 in Table 7 belongs to the data in the embedding table it maintains, finds that the embedded parameter of 15 in the embedding table is parameter e, and then directly adds parameter e to the value field corresponding to 15; see Table 8 for the result.
- the fifth search message obtained above includes the content shown in Table 8.
- in the second case, the first processor sends the fourth search message to the second processor when the embedded parameters are not found based on the fourth search message.
- specifically, if the first processor determines that no data in the third data included in the fourth search message belongs to the data in the embedding table maintained by the first processor itself, that is, the first processor cannot find the embedded parameters of the third data in its own embedding table, the first processor sends the fourth search message to its next-hop processor, that is, to the above-mentioned second processor.
- in a special case, the fourth search message received by the first processor from the third processor includes not the embedded parameters of part of the data in the third data, but the embedded parameters mapped to all the data in the third data. In this case, the first processor determines that no data in the third data included in the fourth search message belongs to the data in the embedding table it maintains, and it likewise sends the fourth search message to the next-hop processor, that is, to the above-mentioned second processor.
- the above operations of message sending and embedded-parameter search are repeated; after N-1 loops, in the N-th loop the first processor can receive a sixth search message from the third processor. The sixth search message includes the first data and the embedded parameters of the first data. That is, the above-mentioned first search message is a message generated by the first processor, and the first data carried in it is the data that the first processor needs to train. After N cycles, the message carrying the first data has passed through the N processors, and the embedded parameters of the first data have been found in one or more of the N processors. The found embedded parameters are forwarded along with the message, and finally the sixth search message is sent to the first processor, so that the first processor obtains all the embedded parameters of its training data. For an example, see Table 9.
- Table 9 exemplarily shows the first data included in the sixth search message and the embedded parameters of the first data. It can be seen that the embedded parameters of the first data have all been found and filled into the value fields corresponding to each data.
- after the first processor obtains all the embedded parameters of its training data through the sixth search message, if the training data is sparse data, the first processor needs to perform a reduction operation on the obtained embedded parameters of the training data, and then forward-propagate the reduced embedded parameters to the hidden layer.
- the reduction operation may be, for example, an operation such as weighting and summing the embedded parameters of training data of the same type or with relatively large correlation.
- for the specific reduction operation, reference may be made to operations in existing solutions, which is not limited in this application.
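- one simple possibility, offered only as an illustration, is a weighted sum over the embedded parameters belonging to one sample:

```python
def reduce_embeddings(vectors: list[list[float]],
                      weights: list[float] | None = None) -> list[float]:
    # Weighted sum of a sample's embedding vectors; unit weights by default.
    if weights is None:
        weights = [1.0] * len(vectors)
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]

sample_vectors = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # one sample's embeddings
print(reduce_embeddings(sample_vectors))  # [0.9, 1.2] (up to float rounding)
```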
- in FIGS. 10A to 10E, it is assumed that the above-mentioned N processors are 4 processors, namely processor 0, processor 1, processor 2 and processor 3.
- the four processors implement message communication through the above-mentioned ring communication architecture.
- this example assumes that the remainder after the data in the embedding table of processor i among the N processors is calculated modulo N is i.
- each processor needs to first find the embedding parameters of the data trained by itself.
- as shown in FIG. 10A, it is assumed that the data for which processor 0 needs to find embedded parameters is the first batch of data: 21, 5, 14 and 25, and the remainders of the first batch of data modulo 4 are 1, 1, 2 and 1, respectively; that is, the embedded parameters of data 21, 5 and 25 need to be looked up in processor 1, and the embedded parameter of data 14 needs to be looked up in processor 2.
- the data for which processor 1 needs to find embedded parameters is the second batch of data: 19, 2, 10 and 32, and the remainders of the second batch of data modulo 4 are 3, 2, 2 and 0, respectively; that is, the embedded parameters of data 2 and 10 need to be looked up in processor 2, the embedded parameter of data 19 needs to be looked up in processor 3, and the embedded parameter of data 32 needs to be looked up in processor 0.
- the data for which processor 2 needs to find embedded parameters is the third batch of data: 13, 8, 16 and 29, and the remainders of the third batch of data modulo 4 are 1, 0, 0 and 1, respectively; that is, the embedded parameters of data 8 and 16 need to be looked up in processor 0, and the embedded parameters of data 13 and 29 need to be looked up in processor 1.
- the data for which processor 3 needs to find embedded parameters is the fourth batch of data: 6, 33, 18 and 4, and the remainders of the fourth batch of data modulo 4 are 2, 1, 2 and 0, respectively; that is, the embedded parameters of data 6 and 18 need to be looked up in processor 2, the embedded parameter of data 33 needs to be looked up in processor 1, and the embedded parameter of data 4 needs to be looked up in processor 0.
- the remainder of each data modulo 4 is the serial number of the process where the embedded parameter of that data is located.
- each processor first generates a message, which includes the data whose embedded parameters are to be searched for, the corresponding process numbers, and the value-field space for filling in the embedded parameters. After each processor generates its own message, each processor, according to the communication mode of the ring communication architecture, sends its generated message to the next-hop processor and receives the message sent from the previous-hop processor. After receiving a message, a corresponding table-lookup operation is performed (for the specific table-lookup operation, refer to the foregoing description, which is not repeated here). Then, the found embedded parameters are filled into the respective received messages; see FIG. 10B for details.
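- the initial message of FIG. 10A for processor 0 could be built as follows (the field names are illustrative; compare Tables 3 and 4):

```python
N = 4

def make_search_message(ids: list[int]) -> list[dict]:
    # One entry per id: the id, its owning process number (id % N under the
    # modulo scheme), and an empty value field to be filled along the ring.
    return [{"id": i, "process": i % N, "value": None} for i in ids]

print(make_search_message([21, 5, 14, 25]))
# [{'id': 21, 'process': 1, 'value': None}, {'id': 5, 'process': 1, ...}, ...]
```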
- specifically, processor 0 sends the message including the first batch of data to processor 1, and receives the message including the fourth batch of data from processor 3; it then finds the embedded parameter of data 4 in its own embedding table and adds it to the value field corresponding to data 4 in the received message to obtain a new message.
- processor 1 sends the message including the second batch of data to processor 2, and receives the message including the first batch of data from processor 0; it then finds the embedded parameters of data 21, 5 and 25 in its own embedding table and adds them to the value fields corresponding to data 21, 5 and 25 in the received message to obtain a new message.
- processor 2 sends the message including the third batch of data to processor 3, and receives the message including the second batch of data from processor 1; it then finds the embedded parameters of data 2 and 10 in its own embedding table and adds them to the value fields corresponding to data 2 and 10 in the received message to obtain a new message.
- processor 3 sends the message including the fourth batch of data to processor 0, and receives the message including the third batch of data from processor 2. Since no data in the third batch belongs to the data in the embedding table in processor 3, no embedded parameter of any data in the third batch is found in processor 3.
- then, the processors that have obtained new messages send those new messages to their next-hop processors, and the processor that has not obtained a new message (processor 3) sends the received message to its next-hop processor. After sending, the processors receive new messages from their respective previous-hop processors and continue to search for embedded parameters in response to the new messages. The found embedded parameters are then filled into the respective received messages; see FIG. 10C for details.
- specifically, processor 0 sends the message including the fourth batch of data to processor 1, and receives the message including the third batch of data from processor 3; it then finds the embedded parameters of data 8 and 16 in its own embedding table and adds them to the value fields corresponding to data 8 and 16 in the received message to obtain a new message.
- processor 1 sends the message including the first batch of data to processor 2, and receives the message including the fourth batch of data from processor 0; it then finds the embedded parameter of data 33 in its own embedding table and adds it to the value field corresponding to data 33 in the received message to obtain a new message.
- processor 2 sends the message including the second batch of data to processor 3, and receives the message including the first batch of data from processor 1; it then finds the embedded parameter of data 14 in its own embedding table and adds it to the value field corresponding to data 14 in the received message to obtain a new message.
- processor 3 sends the message including the third batch of data to processor 0, and receives the message including the second batch of data from processor 2; it then finds the embedded parameter of data 19 in its own embedding table and adds it to the value field corresponding to data 19 in the received message to obtain a new message.
- then, each processor sends the new message it has obtained to its next-hop processor, receives a new message from its previous-hop processor after sending, and continues the embedded-parameter search in response to the new message. The found embedded parameters are then filled into the respective received messages; see FIG. 10D for details.
- specifically, processor 0 sends the message including the third batch of data to processor 1, and receives the message including the second batch of data from processor 3; it then finds the embedded parameter of data 32 in its own embedding table and adds it to the value field corresponding to data 32 in the received message to obtain a new message.
- processor 1 sends the message including the fourth batch of data to processor 2, and receives the message including the third batch of data from processor 0; it then finds the embedded parameters of data 13 and 29 in its own embedding table and adds them to the value fields corresponding to data 13 and 29 in the received message to obtain a new message.
- processor 2 sends the message including the first batch of data to processor 3, and receives the message including the fourth batch of data from processor 1; it then finds the embedded parameters of data 6 and 18 in its own embedding table and adds them to the value fields corresponding to data 6 and 18 in the received message to obtain a new message.
- processor 3 sends the message including the second batch of data to processor 0, and receives the message including the first batch of data from processor 2. Since no data in the first batch belongs to the data in the embedding table in processor 3, no embedded parameter of any data in the first batch is found in processor 3.
- then, each processor sends the message it holds to its next-hop processor and receives a new message from its previous-hop processor. At this point, the message received by each processor includes its own training data and the required embedded parameters, thus completing the embedded-parameter search of the entire embedding layer; see FIG. 10E for details.
- specifically, processor 0 sends the message including the second batch of data to processor 1, and receives the message including the first batch of data from processor 3; the received message includes the embedded parameters of the first batch of data required by processor 0.
- Processor 1 sends a message including the third batch of data to processor 2, and receives a message including the second batch of data from processor 0, where the message includes embedded parameters of the second batch of data required by processor 1.
- the processor 2 sends a message including the fourth batch of data to the processor 3, and receives a message including the third batch of data from the processor 1, where the message includes the embedded parameters of the third batch of data required by the processor 2.
- the processor 3 sends a message including the first batch of data to the processor 0, and receives a message including the fourth batch of data from the processor 2, where the message includes the embedded parameters of the fourth batch of data required by the processor 3.
- FIG. 10A to FIG. 10E and the related descriptions are only an example, and do not constitute a limitation to the present application, and modifications made based on the above-mentioned ideas are all within the protection scope of the present application.
- in the above example, the 4 processors find their respective embedded parameters through 4 cycles. Since the communication between the processors is realized through the ring communication architecture, compared with the many-to-many communication in the existing technical solution, the present application avoids the single-point communication bottleneck, reduces the communication delay, and improves the communication efficiency, thereby improving the training performance of the entire data training system.
- the “first processor (or data)”, “second processor (or data)”, “third processor (or data)” and so on in the following may be the same objects as those in FIG. 9 and its possible implementations, or may be different objects with the same names.
- during the back propagation process of the embedding layer, the data processing method provided by the present application may include, but is not limited to, the following steps:
- the first processor sends a first notification message to the second processor; the first notification message includes the first data and the first gradient, and is used to propagate the first gradient to the first target processor; the first gradient is the gradient corresponding to the embedded parameters of the first data, and the first data and the first gradient are mapped one-to-one; the second processor is the next-hop processor of the first processor in the ring communication architecture where the first processor is located.
- in the back propagation process of the embedding layer, each of the above N processors obtains the gradients of the embedded parameters of the data it trains; however, because the embedded parameters of the data trained by a processor may be stored in the embedding tables of other processors, the gradients need to be sent to the corresponding processors through message communication for optimizing the corresponding embedded parameters.
- the N processors implement message communication through a ring communication architecture.
- the communication among the first processor, the next-hop processor of the first processor (the above-mentioned second processor), and the previous-hop processor of the first processor (the third processor in step 1102 below) is first taken as an example to introduce the process of obtaining the gradients required by each processor using the ring communication architecture.
- the above-mentioned first target processor includes one or more processors among the above-mentioned N processors.
- the specific processor of the first target processor is determined by the first data in the first notification message. Exemplarily, assuming that the first data includes part or all of the data in the embedding table in processor i, then the first target processor includes processor i.
- the above-mentioned first data may include one or more data.
- for the content included in the above-mentioned first notification message, reference may be made to the example in Table 10.
- Table 10 exemplarily shows part of the content included in the first notification message.
- the id in Table 10 is the data, and the value field carries the gradient corresponding to the embedded parameter of the data.
- k2 in Table 10 can be any integer greater than 0.
- optionally, the above-mentioned first notification message may also include the remainder of each data modulo N. Since this remainder is the same as the serial number of the processor (i.e., the serial number of the training process) where the embedding table containing the data is located, it can also be said that the first notification message may also include the serial number of the process where the embedding table containing the data is located. See the example in Table 11.
- the process number is included in the first notification message so that, after receiving the message, a processor can quickly determine through the process number which data belongs to the embedding table it maintains, so as to quickly obtain the corresponding gradients.
- the format of the content included in the first notification message may be the format shown in Table 11.
- the first processor receives a second notification message from the third processor; the second notification message includes second data and a second gradient, and is used to propagate the second gradient to the second target processor; the The second gradient is the gradient corresponding to the embedding parameter of the second data, and the second data is mapped to the second gradient one-to-one; the third processor is the previous hop processor of the first processor in the ring communication architecture .
- the above-mentioned second target processor includes one or more processors among the above-mentioned N processors.
- the specific processor of the second target processor is determined by the second data in the second notification message. Exemplarily, it is assumed that the second data includes part or all of the data in the embedding table in the processor i, then the second target processor includes the processor i.
- the above-mentioned second data may include one or more data, and the second data is generally different from the above-mentioned first data.
- part of the data in the second data may be the same as part of the data in the first data.
- the format of the second notification message is similar to the format of the first notification message. For the content format included in the second notification message, reference may be made to the description corresponding to Table 10 or Table 11, which will not be repeated here.
- after the first processor sends the above-mentioned first notification message to its next-hop processor (the second processor) and receives the above-mentioned second notification message from its previous-hop processor (the third processor), the first processor performs the gradient acquisition operation in response to the second notification message, which is described below in two cases:
- in the first case, the first processor acquires the first target gradient in the second notification message, and sends the second notification message to the second processor, where the second notification message is used to continue to notify the other processors in the second target processor to obtain the required gradients; the first target gradient is the gradient of the embedded parameters in the first embedding table maintained by the first processor.
- specifically, after the first processor receives the second notification message from the third processor, the first processor parses the message to obtain the information in it, and compares the second data with the data in the embedding table it maintains. If part or all of the second data exists in the embedding table, the first processor extracts the gradients corresponding to that part or all of the data from the value fields of the parsed second notification message, so as to optimize the embedded parameters of that part or all of the data in the first embedding table maintained by the first processor. After extracting the gradients, the first processor repackages the second notification message and sends it to the next-hop processor, the second processor.
- exemplarily, assume that the remainder after modulo-N calculation of the data in the embedding table of processor i among the above N processors is i, and that the content carried by the second notification message is in the format shown in Table 10 above, that is, the process number is not carried.
- the first processor parses the message to obtain the second data in the message.
- the first processor performs modulo calculation on each data of the second data and N, and obtains the remainder after the modulo of each data.
- the data corresponding to the one or more remainders is stored in the embedding table maintained by the first processor.
- the first processor extracts the corresponding gradient of the data corresponding to the one or more remainders from the value range of the parsed second notification message, so as to optimize the one or more in the first embedding table maintained by the first processor. Embedding parameter for data corresponding to multiple remainders. After extracting the gradient, the first processor repackages the second notification message and sends it to the next-hop processor, the second processor.
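- A minimal sketch of this Table 10 variant under the stated modulo-N ownership convention; `my_rank` (this processor's index i) is an assumed parameter name.

```python
def owned_positions_by_modulo(msg, my_rank: int, n: int):
    """Return the positions of ids in the message whose remainder modulo N
    equals this processor's number, i.e. ids stored in its embedding table."""
    return [k for k, data_id in enumerate(msg.ids) if data_id % n == my_rank]
```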
- exemplarily, assume again that the data in the embedding table of processor i among the above-mentioned N processors leaves a remainder of i after modulo-N calculation, and that the content carried by the second notification message is in the format shown in Table 11 above, that is, the program number is carried.
- the first processor parses the message to obtain the second data in the message and the corresponding program numbers. If the process numbers in the second notification message include one or more numbers equal to the number of the training process run by the first processor, the data corresponding to the one or more numbers exists in the embedding table maintained by the first processor.
- the first processor extracts the gradients corresponding to the data of the one or more numbers from the value fields of the parsed second notification message, so as to optimize the embedding parameters of that data in the first embedding table maintained by the first processor. After extracting the gradients, the first processor repackages the second notification message and sends it to the next-hop processor, that is, the second processor.
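- The Table 11 variant can skip the modulo step because the message already carries the process numbers; a correspondingly small sketch:

```python
def owned_positions_by_rank(msg, my_rank: int):
    """Return the positions whose carried process (Rank) number equals this
    processor's training-process number (Table 11 format only)."""
    assert msg.ranks is not None        # the Table 11 format carries the rank row
    return [k for k, rank in enumerate(msg.ranks) if rank == my_rank]
```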
- Case 2: the first processor sends the second notification message to the second processor.
- specifically, if the first processor determines that none of the second data included in the second notification message belongs to the data in the embedding table maintained by the first processor itself, the first processor does not need to extract any gradient from the second notification message, and sends the second notification message to the next-hop processor, that is, the above-mentioned second processor.
- optionally, after the first processor receives the second notification message and completes the response operation to the second notification message, the first processor further receives a third notification message from the third processor.
- the third notification message includes third data and a third gradient, and is used to propagate the third gradient to the third target processor; the third gradient is the gradient corresponding to the embedding parameters of the third data, and the third data is mapped one-to-one to the third gradient.
- the above-mentioned third target processor includes one or more processors among the above-mentioned N processors.
- the specific processor of the third target processor is determined by the third data in the third notification message. Exemplarily, assuming that the third data includes part or all of the data in the embedding table in processor i, then the third target processor includes processor i.
- the above-mentioned third data may include one or more pieces of data, and the third data is generally different from the above-mentioned first data and second data.
- however, part of the data in the third data may be the same as part of the data in the first data or part of the data in the second data.
- the format of the foregoing third notification message is similar to that of the foregoing first notification message.
- for the content format included in the third notification message, reference may be made to the description corresponding to the foregoing Table 10 or Table 11, which will not be repeated here.
- when the third notification message includes the second target gradient, the first processor acquires the second target gradient from the third notification message, and sends the third notification message to the second processor;
- the third notification message is sent onward to continue to notify the other processors in the third target processor to obtain the required gradients;
- the second target gradient is the gradient of the embedding parameters in the first embedding table maintained by the first processor.
- when the third notification message does not include the second target gradient, the first processor sends the third notification message to the second processor, so as to continue to notify the other processors in the third target processor to obtain the required gradients.
- the above-mentioned N processors communicate messages through the ring communication architecture and repeat the above-mentioned message sending and gradient acquisition operations. After at least N-1 cycles, each of the above-mentioned N processors has obtained the gradients of the embedding parameters in its own embedding table, so that it can correspondingly optimize the embedding parameters in its own embedding table based on the obtained gradients.
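- The following compact simulation is a sketch of those N-1 forwarding cycles under the modulo-N ownership convention of the example below; the batch contents, gradients, and all names in it are invented placeholders, not part of the patent text.

```python
import random

def ring_backprop(n, batches, tables, lr=0.01):
    """Simulate the ring: each processor starts with its own batch message,
    extracts gradients for ids it owns (id % n == rank), and after each of
    the N-1 cycles forwards the message it holds to its next-hop processor."""
    messages = list(batches)                    # messages[i]: message held by processor i
    for step in range(n):
        for rank in range(n):
            ids, grads = messages[rank]
            for data_id, grad in zip(ids, grads):
                if data_id % n == rank:         # id stored in this processor's table
                    row = tables[rank][data_id]
                    tables[rank][data_id] = [p - lr * g for p, g in zip(row, grad)]
        if step < n - 1:                        # N-1 forwarding cycles in total
            messages = [messages[(rank - 1) % n] for rank in range(n)]
    return tables

# Hypothetical demo with n = 4 processors, as in FIGS. 12A to 12D (values invented)
n, dim = 4, 8
tables = [{i + n * k: [0.0] * dim for k in range(10)} for i in range(n)]
all_ids = [data_id for t in tables for data_id in t]
batches = [(random.sample(all_ids, 4),
            [[0.1] * dim for _ in range(4)]) for _ in range(n)]
ring_backprop(n, batches, tables)
```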
- as shown in FIGS. 12A to 12D, it is assumed that the above-mentioned N processors are 4 processors, namely processor 0, processor 1, processor 2, and processor 3.
- the four processors implement message communication through a ring communication architecture.
- this example assumes that the data in the embedding table of processor i among the N processors leaves a remainder of i after modulo-N calculation.
- each processor needs to first obtain the gradients of the embedding parameters of the data trained by itself, so as to optimize the embedding parameters in its embedding table according to those gradients.
- as shown in FIG. 12A, for the introduction of the first batch of data, the second batch of data, the third batch of data and the fourth batch of data, reference may be made to the above description of FIG. 10A, which will not be repeated here.
- each processor first generates a message, which includes data, a corresponding process number, and a gradient corresponding to the data.
- after each processor generates its own message, according to the communication mode of the ring communication architecture, each processor sends the generated message to its next-hop processor and receives the message sent by its previous-hop processor; see FIG. 12B for details. After receiving a message, a processor may perform the corresponding gradient acquisition operation; for the specific acquisition operation, refer to the description in the foregoing step 1102, which will not be repeated here.
- processor 0 sends a message including the first batch of data to processor 1, and receives a message including the fourth batch of data from processor 3, and then obtains the gradient corresponding to data 4 in the received message.
- processor 1 sends a message including the second batch of data to processor 2, and receives a message including the first batch of data from processor 0, and then obtains gradient 1, gradient 2 and gradient 4 corresponding to data 21, 5 and 25, respectively, in the received message.
- processor 2 sends a message including the third batch of data to processor 3, and receives a message including the second batch of data from processor 1, and then obtains gradient 6 and gradient 7 corresponding to data 2 and 10, respectively, in the received message.
- processor 3 sends a message including the fourth batch of data to processor 0, and receives a message including the third batch of data from processor 2. Since no data in the third batch of data belongs to the data in the embedding table in processor 3, processor 3 does not acquire any gradient from the received message.
- after each processor performs the gradient acquisition operation in response to the received message, it sends the received message to its next-hop processor; see FIG. 12C for details.
- processor 0 sends a message including the fourth batch of data to processor 1, and receives a message including the third batch of data from processor 3, and then obtains gradient 10 and gradient 11 corresponding to data 8 and data 16, respectively, in the received message.
- the processor 1 sends the message including the first batch of data to the processor 2, and receives the message including the fourth batch of data from the processor 0, and then obtains the gradient 14 corresponding to the data 33 in the received message.
- the processor 2 sends the message including the second batch of data to the processor 3, receives the message including the first batch of data from the processor 1, and then obtains the gradient 3 corresponding to the data 14 in the received message.
- the processor 3 sends the message including the third batch of data to the processor 0, and receives the message including the second batch of data from the processor 2, and then obtains the gradient 5 corresponding to the data 19 in the received message.
- after each processor has performed the gradient acquisition operation in response to the received message, it sends the received message to its next-hop processor; see FIG. 12D for details.
- processor 0 sends a message including the third batch of data to processor 1, and receives a message including the second batch of data from processor 3, and then obtains gradient 8 corresponding to data 32 in the received message.
- processor 1 sends a message including the fourth batch of data to processor 2, and receives a message including the third batch of data from processor 0, and then obtains gradient 9 and gradient 12 corresponding to data 13 and 29, respectively, in the received message.
- processor 2 sends a message including the first batch of data to processor 3, and receives a message including the fourth batch of data from processor 1, and then obtains gradient 13 and gradient 15 corresponding to data 6 and 18, respectively, in the received message.
- processor 3 sends a message including the second batch of data to processor 0, and receives a message including the first batch of data from processor 2. Since no data in the first batch of data belongs to the data in the embedding table in processor 3, processor 3 does not acquire any gradient from the received message.
- the above FIG. 12A to FIG. 12D and the related descriptions are only examples and do not constitute a limitation on the present application; modifications made based on the above-mentioned ideas all fall within the protection scope of the present application.
- at this point, the above-mentioned four processors have obtained their respective required gradients. Because the communication between the processors is realized through the ring communication architecture, compared with the many-to-many message communication manner in the existing technical solution, the present application avoids the bottleneck of single-point communication, reduces the communication delay, and improves the communication efficiency, which can improve the training performance of the entire data training system.
- the data processing method shown in FIG. 9 and the data processing method shown in FIG. 11, together with any of their possible implementations, may be used in combination; that is, in the forward propagation process of the embedding layer of the data training, the lookup of the embedding parameters is realized based on the ring communication architecture introduced above, and in the back propagation process of the embedding layer of the data training, the above-mentioned ring communication architecture is used to obtain the gradients.
- FIG. 13 is a schematic diagram showing the comparison of the communication throughput between the prior art solution shown in FIG. 3 and the solution provided by the present application.
- the throughput refers to the amount of data successfully sent per unit time.
- the horizontal axis represents the number of processors used by the data training system, and the number of processors increases in the direction of the arrow; the vertical axis represents the throughput, and the throughput increases in the direction of the arrow.
- the solution of the prior art adopts a many-to-many message communication manner, and as the number of processors increases, the throughput does not change much, or even decreases.
- the present application adopts a ring communication architecture to realize the communication of messages.
- the throughput can increase with the increase of the number of processors, and it increases with excellent linearity.
- the ring communication architecture can make full use of the network bandwidth and is less prone to the blocking and jitter that occur in the many-to-many message communication manner.
- by using the ring communication architecture to communicate messages in the forward propagation and back propagation of the embedding layer, the present application can reduce the communication delay to 10-30% of the original, which greatly improves the communication efficiency and thereby improves the performance of the data training system.
- each device includes corresponding hardware structures and/or software modules for performing each function.
- the present application can be implemented in the form of hardware or a combination of hardware and computer software. Whether a function is performed by hardware or computer software driving hardware depends on the specific application and design constraints of the technical solution. Skilled artisans may implement the described functionality using different methods for each particular application, but such implementations should not be considered beyond the scope of this application.
- the device may be divided into functional modules according to the foregoing method examples.
- each functional module may be divided corresponding to each function, or two or more functions may be integrated into one module.
- the above-mentioned integrated modules can be implemented in the form of hardware, and can also be implemented in the form of software function modules. It should be noted that, the division of modules in the embodiments of the present application is schematic, and is only a logical function division, and there may be other division manners in actual implementation.
- FIG. 14 shows a possible schematic diagram of the logical structure of the apparatus; the apparatus may be the first processor in the method described in FIG. 9 and its possible implementations, or may be a chip in the first processor, or may be a processing system in the first processor, or the like.
- the apparatus 1400 includes a sending unit 1401 and a receiving unit 1402, where:
- the sending unit 1401 is configured to send a first lookup message to a second processor; the first lookup message includes first data, and the first lookup message is used to look up the embedded parameters of the first data; the second processor is the next-hop processor of the first processor in the ring communication architecture where the first processor is located; the sending unit 1401 can be implemented by a sending interface or a transmitter, and can perform the operations described in step 901 shown in FIG. 9.
- the receiving unit 1402 is configured to receive a second search message from a third processor; the second search message includes second data, and the second search message is used to search for the embedded parameters of the second data; the third processor is the previous-hop processor of the first processor in the ring communication architecture; the receiving unit 1402 may be implemented by a receiving interface or a receiver, and may perform the operations described in step 902 shown in FIG. 9.
- the first processor, the second processor and the third processor are processors among the N processors included in the data training system, where N is an integer greater than or equal to 3;
- the N processors communicate through the ring communication architecture, in which each of the N processors only receives messages from its previous-hop processor and only sends messages to its next-hop processor.
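- The ring-neighbor rule just stated reduces to simple modular arithmetic; a minimal sketch (function name assumed):

```python
def ring_neighbors(rank: int, n: int):
    """Return (previous_hop, next_hop) for processor `rank` in an N-processor ring."""
    return (rank - 1) % n, (rank + 1) % n
```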
- the device further includes an adding unit
- the above-mentioned adding unit is configured to add the embedded parameters of the part or all of the data to the second search message when the embedded parameters of part or all of the data in the second data are found based on the second search message, to obtain the third lookup message;
- the above-mentioned sending unit 1401 is further configured to send the third search message to the second processor
- the sending unit 1401 is further configured to send the second search message to the second processor when the embedded parameter of the second data is not found based on the second search message.
- the device further includes a search unit
- the lookup unit is configured to look up the embedded parameters of the part or all of the data mapping in the first embedded table;
- the first embedded table is an embedded table maintained by the first processor for storing data and embedded parameters, and there is a one-to-one mapping relationship between the data and the embedded parameters in the first embedded table;
- the above-mentioned adding unit is specifically used to add the embedded parameter of the part or all of the data mapping to the value range corresponding to the part or all of the data in the second search message to obtain the third search message;
- the above-mentioned sending unit 1401 is specifically configured to send the third search message to the second processor, where the third search message is used to search for the embedded parameter of the data for which the embedded parameter is not found in the second data.
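- A hedged sketch of this lookup-unit behavior on the forward pass: fill in the value fields of ids found in the local embedded table, then forward. The message layout (an id list plus a parallel value list, with None meaning "not yet found") is an assumption mirroring the id and value rows of the lookup-message tables.

```python
def handle_lookup(ids, values, table, send_to_next_hop):
    """Fill embedded parameters for locally stored ids, then forward the message."""
    out = list(values)
    for k, data_id in enumerate(ids):
        if out[k] is None and data_id in table:   # found in the first embedded table
            out[k] = table[data_id]               # one-to-one id -> parameter mapping
    send_to_next_hop((ids, out))                  # becomes the third lookup message if filled
```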
- the above-mentioned apparatus further includes a determining unit and a generating unit;
- the determining unit is configured to determine that part or all of the above data belongs to the first embedded table, and the first embedded table does not yet include the part or all of the data; the first embedded table is maintained by the first processor for storage an embedded table of data and embedded parameters, and there is a one-to-one mapping relationship between data and embedded parameters in the first embedded table;
- the generating unit is used to generate the respective embedded parameters corresponding to the part or all of the data
- the above-mentioned adding unit is specifically used for adding the embedded parameters corresponding to the part or all of the data to the value range corresponding to the part or all of the data in the second search message to obtain the third search message;
- the above-mentioned sending unit 1401 is specifically configured to send the third search message to the second processor, where the third search message is used to search for the embedded parameter of the data for which the embedded parameter is not found in the second data.
- the above-mentioned sending unit 1401 is specifically configured to: send the second search message to the second processor when none of the second data belongs to the data in the first embedded table;
- the first embedded table is an embedded table maintained by the first processor for storing data and embedded parameters, and there is a one-to-one mapping relationship between the data and the embedded parameters in the first embedded table.
- the above receiving unit 1402 is further configured to receive a fourth search message from the third processor; the fourth search message includes third data and the embedded parameters mapped to a first part of the data in the third data, where the fourth search message is used to look up the embedded parameters mapped to the data in the third data other than the first part of the data;
- the apparatus further includes an adding unit, configured to add the embedded parameters of the second part of the data to the fourth search message to obtain a fifth search message, when the embedded parameters of a second part of the data in the third data are found based on the fourth search message;
- the above-mentioned sending unit 1401 is further configured to send the fifth search message to the second processor
- the sending unit 1401 is further configured to send the fourth search message to the second processor when the embedded parameter of the third data is not found based on the fourth search message.
- the receiving unit 1402 is further configured to: receive a sixth search message from the third processor, where the sixth search message includes the first data and an embedded parameter of the first data.
- FIG. 15 shows a schematic diagram of a possible logical structure of the apparatus; the apparatus may be the first processor in the method described in FIG. 11 and its possible implementations, or may be a chip in the first processor, or may be a processing system in the first processor, or the like.
- the apparatus 1500 includes a sending unit 1501 and a receiving unit 1502, where:
- the sending unit 1501 is configured to send a first notification message to the second processor; the first notification message includes first data and a first gradient, and is used to propagate the first gradient to the first target processor;
- the first gradient is the gradient corresponding to the embedding parameters of the first data; the second processor is the next-hop processor of the first processor in the ring communication architecture where the first processor is located; the sending unit 1501 can be implemented by a sending interface or a transmitter, and can perform the operations described in step 1101 shown in FIG. 11.
- the receiving unit 1502 is configured to receive a second notification message from a third processor; the second notification message includes second data and a second gradient, and is used to propagate the second gradient to the second target processor; the second gradient is the gradient corresponding to the embedding parameters of the second data; the third processor is the previous-hop processor of the first processor in the ring communication architecture; the receiving unit 1502 can be implemented by a receiving interface or a receiver, and can perform the operations described in step 1102 shown in FIG. 11.
- the first processor, the second processor and the third processor are processors among the N processors included in the data training system, where N is an integer greater than or equal to 3;
- the N processors communicate through the ring communication architecture, in which each of the N processors only receives messages from its previous-hop processor and only sends messages to its next-hop processor.
- the device further includes an acquisition unit;
- the obtaining unit is configured to obtain the first target gradient from the second notification message when the second notification message includes the first target gradient;
- the sending unit 1501 is further configured to send the second notification message to the second processor;
- the first target gradient is the gradient of the embedding parameters in the first embedding table maintained by the first processor, and there is a one-to-one mapping relationship between the data and the embedding parameters in the first embedding table;
- the sending unit 1501 is further configured to send the second notification message to the second processor under the condition that the first target gradient is not included in the second notification message.
- the obtaining unit is specifically configured to: determine that part or all of the second data is data in the first embedding table;
- obtain the first target gradient from the second notification message based on the part or all of the data.
- the receiving unit 1502 is further configured to receive a third notification message from the third processor; the third notification message includes third data and a third gradient, and is used to propagate the third gradient to the third target processor; the third gradient is the gradient corresponding to the embedding parameters of the third data;
- the apparatus further includes an acquiring unit configured to acquire the second target gradient in the third notification message when the third notification message includes the second target gradient,
- the sending unit 1501 is further configured to send the third notification message to the second processor;
- the second target gradient is the gradient of the embedding parameters in the first embedding table maintained by the first processor, and the first embedding table includes the mapping relationship between data and the embedding parameters of the data;
- the sending unit 1501 is further configured to send the third notification message to the second processor under the condition that the second target gradient is not included in the third notification message.
- FIG. 16 is a schematic diagram showing a possible hardware structure of the apparatus provided by the present application, and the apparatus may be the first processor in the method described in the foregoing embodiment.
- the apparatus 1600 includes: a processor 1601 , a memory 1602 and a communication interface 1603 .
- the processor 1601, the communication interface 1603, and the memory 1602 may be connected to each other, or may be connected to each other through a bus 1604.
- the memory 1602 is used to store computer programs and data of the apparatus 1600, and the memory 1602 may include, but is not limited to, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), compact disc read-only memory (CD-ROM), or the like.
- the software or program codes required to perform the functions of all or part of the units in FIG. 14 are stored in the memory 1602 .
- the processor 1601 can not only call the program codes in the memory 1602 to realize some functions, but can also cooperate with other components (for example, the communication interface 1603) to collectively perform other functions (for example, functions of receiving or sending messages) described in the embodiment of FIG. 14.
- the software or program codes required to perform the functions of all or part of the units in FIG. 15 are stored in the memory 1602 .
- the processor 1601 can not only call the program codes in the memory 1602 to realize some functions, but can also cooperate with other components (for example, the communication interface 1603) to collectively perform other functions (for example, functions of receiving or sending messages) described in the embodiment of FIG. 15.
- the communication interface 1603 includes a sending interface and a receiving interface.
- the number of the communication interfaces 1603 may be multiple, and is used to support the apparatus 1600 to communicate, such as receiving or sending data or messages.
- the processor 1601 may be a central processing unit, a general purpose processor, a digital signal processor, an application specific integrated circuit, a field programmable gate array or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof.
- a processor may also be a combination that performs computing functions, such as a combination comprising one or more microprocessors, a combination of a digital signal processor and a microprocessor, and the like.
- the processor 1601 can be used to read the program stored in the above-mentioned memory 1602 and execute any one of the data processing methods described in the above-mentioned FIG. 9 and its possible embodiments; or, the processor 1601 can be used to read the program stored in the above-mentioned memory 1602 and execute any one of the data processing methods described in the above-mentioned FIG. 11 and its possible embodiments; or, the processor 1601 can be used to read the program stored in the above-mentioned memory 1602 and execute any one of the data processing methods described in the above-mentioned FIG. 9 and its possible embodiments and/or any one of the data processing methods described in the above-mentioned FIG. 11 and its possible embodiments.
- the processor 1601 may be configured to read the program stored in the above-mentioned memory 1602, and perform the following operations:
- a first lookup message is sent to the second processor through the sending interface; the first lookup message includes first data, and the first lookup message is used to look up the embedded parameters of the first data;
- the second processor is the next-hop processor of the first processor in the ring communication architecture where the first processor is located;
- a second lookup message from the third processor is received through the receiving interface; the second lookup message includes second data, and is used to look up the embedded parameters of the second data; the third processor is the previous-hop processor of the first processor in the ring communication architecture;
- the first processor, the second processor and the third processor are processors among the N processors included in the data training system, where N is an integer greater than or equal to 3;
- the N processors communicate through the ring communication architecture, in which each of the N processors only receives messages from its previous-hop processor and only sends messages to its next-hop processor.
- the processor 1601 may be configured to read the program stored in the above-mentioned memory 1602, and perform the following operations:
- a first notification message is sent to the second processor through the sending interface; the first notification message includes first data and a first gradient, and is used to propagate the first gradient to the first target processor;
- the first gradient is the gradient corresponding to the embedding parameters of the first data;
- the second processor is the next-hop processor of the first processor in the ring communication architecture where the first processor is located;
- a second notification message from the third processor is received through the receiving interface;
- the second notification message includes second data and a second gradient, and is used for propagating the second gradient to the second target processor;
- the second gradient is the gradient corresponding to the embedding parameters of the second data;
- the third processor is the previous hop processor of the first processor in the ring communication architecture;
- the first processor, the second processor and the third processor are processors among the N processors included in the data training system, where N is an integer greater than or equal to 3;
- the N processors communicate through the ring communication architecture, in which each of the N processors only receives messages from its previous-hop processor and only sends messages to its next-hop processor.
- Embodiments of the present application further provide a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the method of any embodiment described in the above-mentioned FIG. 9 and/or FIG. 11 and their possible method embodiments is implemented.
- An embodiment of the present application further provides a computer program product, when the computer program product is read and executed by a computer, the method described in any of the above-mentioned FIG. 9 and/or FIG. 11 and the possible method embodiments thereof will be executed.
- in conclusion, the message communication of the above N processors in the forward propagation and back propagation processes of the embedding layer can be realized through the ring communication architecture. Compared with the existing technical solutions, using the ring communication manner to realize the interaction of messages can make full use of the bandwidth resources between processors, avoid single-point communication bottlenecks, reduce communication delays, and improve communication efficiency, thereby further improving the training efficiency and performance of the entire data training system.
Description
Embedding table structure (one-to-one mapping between data and embedding parameters):

id | Embedding parameter
---|---
Data 1 | Parameter 1
Data 2 | Parameter 2
…… | ……
Data m | Parameter m

Example of the data trained by each processor (training process) and the remainder of each piece of data modulo 3:

Processor (training process) number | Trained data | Remainder of the data modulo 3
---|---|---
Processor 0 (training process 0) | 10, 21, 14 and 19 | 1, 0, 2 and 1
Processor 1 (training process 1) | 31, 23, 3 and 8 | 1, 2, 0 and 2
Processor 2 (training process 2) | 12, 5, 19 and 33 | 0, 2, 1 and 0

Lookup message format without the process number:

id | Data 1 | Data 2 | …… | Data k1
---|---|---|---|---
Value | - | - | - | -

Lookup message format with the process (Rank) number:

id | Data 1 | Data 2 | …… | Data k1
---|---|---|---|---
Process (Rank) number | - | - | …… | -
Value | - | - | …… | -

Example lookup messages as they traverse the ring, with the value fields progressively filled:

id | 9 | 8 | 13 | 3
---|---|---|---|---
Value | - | - | - | -

id | 9 | 8 | 13 | 3
---|---|---|---|---
Value | Parameter a | - | - | Parameter b

id | 11 | 10 | 5 | 15
---|---|---|---|---
Value | Parameter c | - | Parameter d | -

id | 11 | 10 | 5 | 15
---|---|---|---|---
Value | Parameter c | - | Parameter d | Parameter e

id | 16 | 20 | 19 | 27
---|---|---|---|---
Value | Parameter f | Parameter g | Parameter h | Parameter r

Notification message format without the process number (Table 10):

id | Data 1 | Data 2 | …… | Data k2
---|---|---|---|---
Value | Gradient 1 | Gradient 2 | …… | Gradient k2

Notification message format with the process (Rank) number (Table 11):

id | Data 1 | Data 2 | …… | Data k2
---|---|---|---|---
Process (Rank) number | Number 1 | Number 2 | …… | Number 3
Value | Gradient 1 | Gradient 2 | …… | Gradient k2
Claims (24)
- A data processing method, wherein the method comprises: a first processor sends a first lookup message to a second processor; the first lookup message includes first data, and the first lookup message is used to look up embedding parameters of the first data; the second processor is the next-hop processor of the first processor in a ring communication architecture in which the first processor is located; the first processor receives a second lookup message from a third processor; the second lookup message includes second data, and the second lookup message is used to look up embedding parameters of the second data; the third processor is the previous-hop processor of the first processor in the ring communication architecture; the first processor, the second processor and the third processor are processors among N processors included in a data training system, where N is an integer greater than or equal to 3; the N processors communicate with each other through the ring communication architecture, and in the ring communication architecture, each of the N processors receives messages only from its previous-hop processor and sends messages only to its next-hop processor.
- The method according to claim 1, wherein the method further comprises: when embedding parameters of part or all of the second data are found based on the second lookup message, the first processor adds the embedding parameters of the part or all of the data to the second lookup message to obtain a third lookup message, and sends the third lookup message to the second processor; or, when no embedding parameter of the second data is found based on the second lookup message, the first processor sends the second lookup message to the second processor.
- The method according to claim 2, wherein, when the embedding parameters of part or all of the second data are found based on the second lookup message, the first processor adding the embedding parameters of the part or all of the data to the second lookup message to obtain the third lookup message and sending the third lookup message to the second processor comprises: the first processor looks up, in a first embedding table, the embedding parameters mapped to the part or all of the data; the first embedding table is an embedding table maintained by the first processor for storing data and embedding parameters, and there is a one-to-one mapping between the data and the embedding parameters in the first embedding table; the first processor adds the embedding parameters mapped to the part or all of the data to the value fields corresponding to the part or all of the data in the second lookup message to obtain the third lookup message; the first processor sends the third lookup message to the second processor, where the third lookup message is used to look up the embedding parameters of the data in the second data for which no embedding parameter has been found.
- The method according to claim 2, wherein, when no embedding parameter of the second data is found based on the second lookup message, the first processor sending the second lookup message to the second processor comprises: when none of the second data belongs to the data in a first embedding table, the first processor sends the second lookup message to the second processor; the first embedding table is an embedding table maintained by the first processor for storing data and embedding parameters, and there is a one-to-one mapping between the data and the embedding parameters in the first embedding table.
- The method according to any one of claims 1 to 4, wherein the method further comprises: the first processor receives a fourth lookup message from the third processor; the fourth lookup message includes third data and the embedding parameters mapped to a first part of the third data, and the fourth lookup message is used to look up the embedding parameters mapped to the data in the third data other than the first part of the data; when embedding parameters of a second part of the third data are found based on the fourth lookup message, the first processor adds the embedding parameters of the second part of the data to the fourth lookup message to obtain a fifth lookup message, and sends the fifth lookup message to the second processor; or, when no embedding parameter of the third data is found based on the fourth lookup message, the first processor sends the fourth lookup message to the second processor.
- The method according to any one of claims 1 to 5, wherein the method further comprises: the first processor receives a sixth lookup message from the third processor, where the sixth lookup message includes the first data and the embedding parameters of the first data.
- A data processing method, wherein the method comprises: a first processor sends a first notification message to a second processor; the first notification message includes first data and a first gradient, and is used to propagate the first gradient to a first target processor; the first gradient is the gradient corresponding to the embedding parameters of the first data; the second processor is the next-hop processor of the first processor in a ring communication architecture in which the first processor is located; the first processor receives a second notification message from a third processor; the second notification message includes second data and a second gradient, and is used to propagate the second gradient to a second target processor; the second gradient is the gradient corresponding to the embedding parameters of the second data; the third processor is the previous-hop processor of the first processor in the ring communication architecture; the first processor, the second processor and the third processor are processors among N processors included in a data training system, where N is an integer greater than or equal to 3; the N processors communicate with each other through the ring communication architecture, and in the ring communication architecture, each of the N processors receives messages only from its previous-hop processor and sends messages only to its next-hop processor.
- The method according to claim 7, wherein the method further comprises: when the second notification message includes a first target gradient, the first processor acquires the first target gradient from the second notification message and sends the second notification message to the second processor; the first target gradient is a gradient of an embedding parameter in a first embedding table maintained by the first processor, and there is a one-to-one mapping between the data and the embedding parameters in the first embedding table; or, when the second notification message does not include the first target gradient, the first processor sends the second notification message to the second processor.
- The method according to claim 8, wherein, when the second notification message includes the first target gradient, the first processor acquiring the first target gradient from the second notification message comprises: the first processor determines that part or all of the second data is data in the first embedding table; the first processor acquires the first target gradient from the second notification message based on the part or all of the data.
- The method according to any one of claims 7 to 9, wherein the method further comprises: the first processor receives a third notification message from the third processor; the third notification message includes third data and a third gradient, and is used to propagate the third gradient to a third target processor; the third gradient is the gradient corresponding to the embedding parameters of the third data; when the third notification message includes a second target gradient, the first processor acquires the second target gradient from the third notification message and sends the third notification message to the second processor; the second target gradient is a gradient of an embedding parameter in a first embedding table maintained by the first processor, and the first embedding table includes a mapping relationship between data and the embedding parameters of the data; or, when the third notification message does not include the second target gradient, the first processor sends the third notification message to the second processor.
- A data processing apparatus, wherein the apparatus comprises: a sending unit configured to send a first lookup message to a second processor; the first lookup message includes first data, and the first lookup message is used to look up embedding parameters of the first data; the second processor is the next-hop processor of the first processor in a ring communication architecture in which the first processor is located; a receiving unit configured to receive a second lookup message from a third processor; the second lookup message includes second data, and the second lookup message is used to look up embedding parameters of the second data; the third processor is the previous-hop processor of the first processor in the ring communication architecture; the first processor, the second processor and the third processor are processors among N processors included in a data training system, where N is an integer greater than or equal to 3; the N processors communicate with each other through the ring communication architecture, and in the ring communication architecture, each of the N processors receives messages only from its previous-hop processor and sends messages only to its next-hop processor.
- The apparatus according to claim 11, wherein the apparatus further comprises an adding unit; the adding unit is configured to, when embedding parameters of part or all of the second data are found based on the second lookup message, add the embedding parameters of the part or all of the data to the second lookup message to obtain a third lookup message; the sending unit is further configured to send the third lookup message to the second processor; or, the sending unit is further configured to send the second lookup message to the second processor when no embedding parameter of the second data is found based on the second lookup message.
- The apparatus according to claim 12, wherein the apparatus further comprises a lookup unit; the lookup unit is configured to look up, in a first embedding table, the embedding parameters mapped to the part or all of the data; the first embedding table is an embedding table maintained by the first processor for storing data and embedding parameters, and there is a one-to-one mapping between the data and the embedding parameters in the first embedding table; the adding unit is specifically configured to add the embedding parameters mapped to the part or all of the data to the value fields corresponding to the part or all of the data in the second lookup message to obtain the third lookup message; the sending unit is specifically configured to send the third lookup message to the second processor, where the third lookup message is used to look up the embedding parameters of the data in the second data for which no embedding parameter has been found.
- The apparatus according to claim 12, wherein the sending unit is specifically configured to: send the second lookup message to the second processor when none of the second data belongs to the data in a first embedding table; the first embedding table is an embedding table maintained by the first processor for storing data and embedding parameters, and there is a one-to-one mapping between the data and the embedding parameters in the first embedding table.
- The apparatus according to any one of claims 11 to 14, wherein the receiving unit is further configured to receive a fourth lookup message from the third processor; the fourth lookup message includes third data and the embedding parameters mapped to a first part of the third data, and the fourth lookup message is used to look up the embedding parameters mapped to the data in the third data other than the first part of the data; the apparatus further comprises an adding unit configured to, when embedding parameters of a second part of the third data are found based on the fourth lookup message, add the embedding parameters of the second part of the data to the fourth lookup message to obtain a fifth lookup message; the sending unit is further configured to send the fifth lookup message to the second processor; or, the sending unit is further configured to send the fourth lookup message to the second processor when no embedding parameter of the third data is found based on the fourth lookup message.
- The apparatus according to any one of claims 11 to 15, wherein the receiving unit is further configured to: receive a sixth lookup message from the third processor, where the sixth lookup message includes the first data and the embedding parameters of the first data.
- A data processing apparatus, wherein the apparatus comprises: a sending unit configured to send a first notification message to a second processor; the first notification message includes first data and a first gradient, and is used to propagate the first gradient to a first target processor; the first gradient is the gradient corresponding to the embedding parameters of the first data; the second processor is the next-hop processor of the first processor in a ring communication architecture in which the first processor is located; a receiving unit configured to receive a second notification message from a third processor; the second notification message includes second data and a second gradient, and is used to propagate the second gradient to a second target processor; the second gradient is the gradient corresponding to the embedding parameters of the second data; the third processor is the previous-hop processor of the first processor in the ring communication architecture; the first processor, the second processor and the third processor are processors among N processors included in a data training system, where N is an integer greater than or equal to 3; the N processors communicate with each other through the ring communication architecture, and in the ring communication architecture, each of the N processors receives messages only from its previous-hop processor and sends messages only to its next-hop processor.
- The apparatus according to claim 17, wherein the apparatus further comprises an obtaining unit; the obtaining unit is configured to obtain a first target gradient from the second notification message when the second notification message includes the first target gradient; the sending unit is further configured to send the second notification message to the second processor; the first target gradient is a gradient of an embedding parameter in a first embedding table maintained by the first processor, and there is a one-to-one mapping between the data and the embedding parameters in the first embedding table; or, the sending unit is further configured to send the second notification message to the second processor when the second notification message does not include the first target gradient.
- The apparatus according to claim 18, wherein the obtaining unit is specifically configured to: determine that part or all of the second data is data in the first embedding table; and obtain the first target gradient from the second notification message based on the part or all of the data.
- The apparatus according to any one of claims 17 to 19, wherein the receiving unit is further configured to receive a third notification message from the third processor; the third notification message includes third data and a third gradient, and is used to propagate the third gradient to a third target processor; the third gradient is the gradient corresponding to the embedding parameters of the third data; the apparatus further comprises an obtaining unit configured to obtain a second target gradient from the third notification message when the third notification message includes the second target gradient, and the sending unit is further configured to send the third notification message to the second processor; the second target gradient is a gradient of an embedding parameter in a first embedding table maintained by the first processor, and the first embedding table includes a mapping relationship between data and the embedding parameters of the data; or, the sending unit is further configured to send the third notification message to the second processor when the third notification message does not include the second target gradient.
- An apparatus, wherein the apparatus comprises a processor and a memory, where the memory is configured to store a computer program and the processor is configured to execute the computer program stored in the memory, so that the apparatus performs the method according to any one of claims 1 to 6, or so that the apparatus performs the method according to any one of claims 7 to 10.
- A data training system, wherein the system includes N processors, where N is an integer greater than or equal to 3; the N processors communicate with each other through a ring communication architecture, and in the ring communication architecture, each of the N processors receives messages only from its previous-hop processor and sends messages only to its next-hop processor; each of the N processors may be the apparatus according to any one of claims 11 to 16, or each of the N processors may be the apparatus according to any one of claims 17 to 20, or each of the N processors may be the apparatus according to claim 21.
- A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and the computer program is executed by a processor to implement the method according to any one of claims 1 to 6; or, the computer program is executed by a processor to implement the method according to any one of claims 7 to 10.
- A computer program product, wherein, when the computer program product is executed by a processor, the method according to any one of claims 1 to 6 will be performed; or, the method according to any one of claims 7 to 10 will be performed.
Priority Applications (2)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
EP22794531.8A (EP4283523A1) | 2021-04-29 | 2022-04-06 | Data processing method, apparatus, and system
US18/491,844 (US20240054031A1) | 2021-04-29 | 2023-10-23 | Data processing method and apparatus, and system

Applications Claiming Priority (2)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202110477608.3A (CN115271025A) | 2021-04-29 | 2021-04-29 | Data processing method, apparatus and system
CN202110477608.3 | 2021-04-29 | |

Related Child Applications (1)

Application Number | Title | Priority Date | Filing Date
---|---|---|---
US18/491,844 (Continuation, US20240054031A1) | | 2021-04-29 | 2023-10-23
Publications (1)

Publication Number | Publication Date
---|---
WO2022228060A1 | 2022-11-03

Family ID: 83744678

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
---|---|---|---
PCT/CN2022/085353 (WO2022228060A1) | Data processing method, apparatus and system | 2021-04-29 | 2022-04-06

Country Status (4)

Country | Link
---|---
US (1) | US20240054031A1
EP (1) | EP4283523A1
CN (1) | CN115271025A
WO (1) | WO2022228060A1
Also Published As

Publication Number | Publication Date
---|---
EP4283523A1 | 2023-11-29
US20240054031A1 | 2024-02-15
CN115271025A | 2022-11-01