WO2022228060A1 - Data processing method, apparatus and system - Google Patents

Data processing method, apparatus and system

Info

Publication number
WO2022228060A1
Authority
WO
WIPO (PCT)
Prior art keywords
processor
data
message
embedded
gradient
Prior art date
Application number
PCT/CN2022/085353
Other languages
English (en)
French (fr)
Inventor
Zheng Kun (郑坤)
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to EP22794531.8A (published as EP4283523A1)
Publication of WO2022228060A1
Priority to US18/491,844 (published as US20240054031A1)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/16 Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163 Interprocessor communication
    • G06F15/173 Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
    • G06F15/17356 Indirect interconnection networks
    • G06F15/17368 Indirect interconnection networks, non-hierarchical topologies
    • G06F15/17375 One dimensional, e.g. linear array, ring
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/54 Interprogram communication
    • G06F9/546 Message passing systems or structures, e.g. queues
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Definitions

  • the present application relates to the technical field of artificial intelligence, and in particular, to a large-scale data processing method, device and system.
  • Artificial intelligence (AI) is a theory, method, technology, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
  • In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that responds in a way similar to human intelligence.
  • Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
  • Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision-making and reasoning, human-computer interaction, recommendation and search, and basic AI theory.
  • Large-scale data model training is a core technology widely used in Internet search, advertising, recommendation, and other scenarios; a typical application is the click-through rate (CTR) model.
  • During training, sample data is input first; most of this sample data cannot be computed on numerically as-is (for example, id data), so it must be converted into numerical values through the embedding method. Therefore, the entry operators of large-scale data training models are all embedding operators.
  • After the forward computation, the loss function (loss) is obtained, and then the loss is back-propagated; this completes one round (step) of the training process.
  • With the growth of data scale, such training is typically performed in parallel by multiple graphics processing units (GPUs) or neural-network processing units (NPUs).
  • the present application discloses a data processing method, device and system, which can improve the training efficiency and performance of a data training model.
  • In a first aspect, the present application provides a data processing method, the method comprising:
  • the first processor sends a first lookup message to the second processor; the first lookup message includes first data, and the first lookup message is used to look up the embedded parameters of the first data; the second processor is the next-hop processor of the first processor in the ring communication architecture where the first processor is located;
  • the first processor receives a second lookup message from a third processor; the second lookup message includes second data, and the second lookup message is used to look up the embedded parameters of the second data; the third processor is the previous-hop processor of the first processor in the ring communication architecture;
  • the first processor, the second processor and the third processor are processors among the N processors included in the data training system, where N is an integer greater than or equal to 3;
  • in the ring communication architecture, each of the N processors only receives messages from its previous-hop processor, and only sends messages to its next-hop processor.
  • The data training system includes N processors. In order to train large-scale sample data, the data training system composed of the N processors implements training in a data-parallel plus model-parallel manner: the N processors each randomly obtain a part of the sample data for training. After the data to be trained is input into the training model, it needs to be mapped by the embedding layer to a dense vector (also called the embedding parameter) that can be used for subsequent calculations. However, since the training data on a processor is randomly obtained, the embedding parameters of these data do not necessarily exist on that processor and may need to be obtained from the other processors among the N processors, which requires message communication with the other processors.
  • A ring communication architecture is used among the N processors to realize ring communication of these messages. In this way, the application can make full use of the bandwidth resources between processors, avoid single-point communication bottlenecks, reduce communication delay, and improve communication efficiency, thereby improving the training efficiency and performance of the entire data training system.
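  • As an illustration of this ring topology, the following minimal Python sketch (not part of the patent; all names are illustrative) shows how each processor's next hop and previous hop would be determined, and why every link of the ring can carry traffic simultaneously:

```python
# Minimal sketch of the ring communication architecture described above:
# each of the N processors sends only to its next hop and receives only
# from its previous hop. `next_hop`/`prev_hop` are illustrative names.
N = 4  # number of processors; the patent assumes N >= 3

def next_hop(rank: int) -> int:
    """Processor that `rank` sends messages to."""
    return (rank + 1) % N

def prev_hop(rank: int) -> int:
    """Processor that `rank` receives messages from."""
    return (rank - 1) % N

# In one communication step every link of the ring is used at once,
# so no single processor becomes a communication bottleneck.
for i in range(N):
    print(f"processor {i}: sends to {next_hop(i)}, receives from {prev_hop(i)}")
```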
  • The above method further includes: in the case that the embedded parameters of part or all of the second data are found based on the second lookup message, the first processor adds the embedded parameters of that part or all of the data to the second lookup message to obtain a third lookup message, and sends the third lookup message to the second processor; or, in the case that no embedded parameter of the second data is found based on the second lookup message, the first processor sends the second lookup message to the second processor.
  • That is, after receiving a lookup message for the embedded parameters of data, the processor continues to forward the lookup message to the next-hop processor based on the above ring communication architecture, regardless of whether the embedded parameters of the corresponding data are found locally; through such cyclic forwarding and searching, the embedded parameters of all the data required by each processor can finally be found.
  • Specifically, that the first processor adds the embedded parameters of the part or all of the data to the second lookup message to obtain a third lookup message, and sends the third lookup message to the second processor, includes:
  • the first processor looks up the embedded parameters mapped to the part or all of the data in a first embedded table; the first embedded table is an embedded table maintained by the first processor for storing data and embedded parameters, and there is a one-to-one mapping relationship between the data and the embedded parameters in the first embedded table;
  • the first processor adds the embedded parameters mapped to the part or all of the data to the value field corresponding to the part or all of the data in the second lookup message, to obtain the third lookup message;
  • the first processor sends the third lookup message to the second processor, where the third lookup message is used to look up the embedded parameters of the data in the second data for which embedded parameters have not yet been found.
  • That is, each of the above N processors maintains an embedded table, and the embedded table is used to store data and the corresponding embedded parameters. Therefore, after a processor receives a lookup message for embedded parameters, the data in the lookup message can be indexed in that processor's embedded table; if data in the lookup message exists in the embedded table, the corresponding embedded parameters can be found.
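  • For illustration only, a hedged sketch (assuming a message layout that maps each data id to its embedded parameter, or None when not yet found) of how a processor might handle an incoming lookup message, filling in what it finds locally and forwarding the message regardless:

```python
def handle_lookup_message(message: dict, embedded_table: dict) -> dict:
    """message: data id -> embedded parameter (or None if not yet found);
    embedded_table: this processor's local table of id -> parameter."""
    for data_id, param in message.items():
        if param is None and data_id in embedded_table:
            # Found locally: add the embedded parameter to the value
            # field of the message (second lookup -> third lookup message).
            message[data_id] = embedded_table[data_id]
    # Whether or not anything was found, the caller forwards the
    # returned message to the next-hop processor.
    return message
```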
  • Alternatively, that the first processor adds the embedded parameters of the part or all of the data to the second lookup message to obtain a third lookup message, and sends the third lookup message to the second processor, includes:
  • the first processor determines that the part or all of the data belongs to the first embedded table, but the first embedded table does not yet include the part or all of the data; the first embedded table is an embedded table maintained by the first processor for storing data and embedded parameters, and there is a one-to-one mapping relationship between the data and the embedded parameters in the first embedded table;
  • the first processor generates the embedded parameters corresponding to the part or all of the data;
  • the first processor adds the generated embedded parameters to the value field corresponding to the part or all of the data in the second lookup message, to obtain the third lookup message;
  • the first processor sends the third lookup message to the second processor, where the third lookup message is used to look up the embedded parameters of the data in the second data for which embedded parameters have not yet been found.
  • That is, each of the above N processors maintains an embedded table, and the embedded table is used to store data and the corresponding embedded parameters. Therefore, after a processor receives a lookup message for embedded parameters, if the processor determines that some data in the message belongs to its embedded table but is not yet present in that table, the processor may randomly generate the corresponding embedded parameters for that data. Optionally, the remainder obtained when data belonging to the embedded table is taken modulo N equals the number of the training process run by the processor.
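  • A sketch of this "generate if it belongs to my table but is missing" rule under the modulo partitioning mentioned above; the embedding dimension and the random initialization are assumptions, not specified by the patent:

```python
import numpy as np

EMBEDDING_DIM = 8  # illustrative dimension

def lookup_or_create(data_id: int, rank: int, N: int, embedded_table: dict):
    if data_id % N != rank:
        return None  # the id belongs to another processor's embedded table
    if data_id not in embedded_table:
        # First time this id is seen: randomly generate its embedded
        # parameter and insert it into the table.
        embedded_table[data_id] = np.random.randn(EMBEDDING_DIM).astype(np.float32)
    return embedded_table[data_id]
```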
  • That the first processor sends the second lookup message to the second processor includes:
  • in the case that the embedded parameters of the second data are not found in the first embedded table, the first processor sends the second lookup message to the second processor; the first embedded table is an embedded table maintained by the first processor for storing data and embedded parameters, and there is a one-to-one mapping relationship between the data and the embedded parameters in the first embedded table.
  • That is, if no embedded parameter of the data in the lookup message is found locally, the processor directly forwards the received lookup message to its next-hop processor based on the above ring communication architecture.
  • the method further includes:
  • the first processor receives a fourth lookup message from the third processor; the fourth lookup message includes third data and the embedded parameters mapped to a first part of the data in the third data, and the fourth lookup message is used to look up the embedded parameters mapped to the data in the third data other than the first part of the data;
  • in the case that the embedded parameters of a second part of the data in the third data are found based on the fourth lookup message, the first processor adds the embedded parameters of the second part of the data to the fourth lookup message to obtain a fifth lookup message, and sends the fifth lookup message to the second processor; or,
  • in the case that no embedded parameter of the third data is found based on the fourth lookup message, the first processor sends the fourth lookup message to the second processor.
  • That is, the above ring communication architecture is used to realize the lookup of the embedded parameters required by each of the above N processors, and ring communication of lookup messages can be performed multiple times based on this architecture to look up the embedded parameters of the data.
  • Optionally, at least N rounds of message communication and embedded-parameter lookup can be repeated among the N processors, so as to ensure that each processor can obtain the embedded parameters of all the data it requires.
  • The method further includes: the first processor receives a sixth lookup message from the third processor, where the sixth lookup message includes the first data and the embedded parameters of the first data.
  • That is, after the message communication among the above N processors to find the embedded parameters required by each processor, the lookup message originally sent by a processor finally returns to it carrying the embedded parameters of all its data.
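  • Putting the pieces together, a toy end-to-end simulation (reusing the hypothetical handle_lookup_message above) in which each processor's lookup message travels once around the ring, is filled in hop by hop, and finally returns to its originator with all embedded parameters:

```python
def ring_lookup(tables: list, requests: list) -> list:
    """tables[i]: embedded table of processor i (id -> parameter);
    requests[i]: ids whose embedded parameters processor i needs."""
    N = len(tables)
    messages = [{d: None for d in req} for req in requests]
    for _ in range(N):
        # One ring step: every message moves to its holder's next hop...
        messages = [messages[(i - 1) % N] for i in range(N)]
        # ...and each processor fills in what it can before the next step.
        messages = [handle_lookup_message(m, tables[i])
                    for i, m in enumerate(messages)]
    # After N steps each message is back at its originator, populated.
    return messages
```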
  • In a second aspect, the present application provides a data processing method, the method comprising:
  • the first processor sends a first notification message to the second processor; the first notification message includes first data and a first gradient, and is used to propagate the first gradient to a first target processor; the first gradient is the gradient corresponding to the embedded parameters of the first data; the second processor is the next-hop processor of the first processor in the ring communication architecture where the first processor is located;
  • the first processor receives a second notification message from a third processor; the second notification message includes second data and a second gradient, and is used to propagate the second gradient to a second target processor; the second gradient is the gradient corresponding to the embedded parameters of the second data; the third processor is the previous-hop processor of the first processor in the ring communication architecture;
  • the first processor, the second processor and the third processor are processors among the N processors included in the data training system, where N is an integer greater than or equal to 3;
  • in the ring communication architecture, each of the N processors only receives messages from its previous-hop processor, and only sends messages to its next-hop processor.
  • Based on the data-parallel plus model-parallel training method, the embedded parameters of a processor's training data in the forward propagation process are obtained from other processors, that is, the embedded parameters of these data are stored on other processors. In the back propagation process of training, however, the embedded parameters of the data need to be optimized based on the calculated gradients. The processor therefore needs to send the calculated gradients corresponding to the embedded parameters of the data to the corresponding processors, so that those processors can optimize the embedded parameters of the data.
  • Ring communication of these messages is implemented between the N processors through the ring communication architecture.
  • In this way, the present application can make full use of the bandwidth resources between processors, avoid single-point communication bottlenecks, reduce communication delay, and improve communication efficiency, thereby improving the training efficiency and performance of the entire data training system.
  • The method further includes: in the case that the second notification message includes a first target gradient, the first processor acquires the first target gradient from the second notification message and sends the second notification message to the second processor; the first target gradient is a gradient of the embedded parameters in the first embedded table maintained by the first processor, and there is a one-to-one mapping relationship between the data and the embedded parameters in the first embedded table; or,
  • in the case that the second notification message does not include the first target gradient, the first processor sends the second notification message to the second processor.
  • That is, after receiving a gradient notification message, the processor continues to forward the notification message to the next-hop processor based on the above ring communication architecture, regardless of whether the gradients it needs are found in the notification message; through such cyclic forwarding, each processor can finally obtain the gradients it requires.
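  • A hedged sketch of the receiving side of this back-propagation exchange: the processor consumes the gradients for data in its own embedded table (here applying a plain SGD update, purely as an assumption) and forwards the notification message unchanged:

```python
def handle_gradient_message(message: dict, embedded_table: dict, lr: float = 0.01) -> dict:
    """message: data id -> gradient of that id's embedded parameter."""
    for data_id, grad in message.items():
        if data_id in embedded_table:
            # This processor maintains the embedded parameter: use the
            # gradient to optimize it (illustrative SGD step).
            embedded_table[data_id] = embedded_table[data_id] - lr * grad
    # The message is forwarded to the next hop whether or not any
    # gradient in it was consumed here.
    return message
```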
  • Specifically, that the first processor acquires the first target gradient from the second notification message includes:
  • the first processor determines that part or all of the second data is data in the first embedded table;
  • the first processor acquires the first target gradient from the second notification message based on the part or all of the data.
  • That is, each of the above N processors maintains an embedded table, and the embedded table is used to store data and the corresponding embedded parameters. Therefore, after a processor receives a gradient notification message, if data belonging to its embedded table exists in the message, the processor can obtain the corresponding gradients from the message in order to optimize the embedded parameters of that data.
  • the method further includes:
  • the first processor receives a third notification message from the third processor; the third notification message includes third data and a third gradient, and is used to propagate the third gradient to a third target processor; the third gradient is the gradient corresponding to the embedded parameters of the third data;
  • in the case that the third notification message includes a second target gradient, the first processor acquires the second target gradient from the third notification message and sends the third notification message to the second processor; the second target gradient is a gradient of the embedded parameters in the first embedded table maintained by the first processor, where the first embedded table includes the mapping relationship between data and the embedded parameters of the data; or,
  • in the case that the third notification message does not include the second target gradient, the first processor sends the third notification message to the second processor.
  • That is, the above ring communication architecture is adopted so that each of the above N processors obtains the gradients it requires, and ring communication of notification messages can be performed multiple times based on this architecture. Optionally, at least N-1 rounds of message communication are performed among the N processors, so as to ensure that each processor can obtain all the gradients it requires.
  • It should be noted that any one of the above first aspect and its possible implementations can be implemented in combination with any one of the second aspect and its possible implementations. Any one of the first aspect and its possible implementations is applied to the forward propagation process of the embedding layer of data training, and any one of the second aspect and its possible implementations is applied to the back propagation process of the embedding layer of data training.
  • In a third aspect, the present application provides a data processing device, the device comprising:
  • a sending unit, configured to send a first lookup message to a second processor; the first lookup message includes first data, and the first lookup message is used to look up the embedded parameters of the first data; the second processor is the next-hop processor of the first processor in the ring communication architecture where the first processor is located;
  • a receiving unit, configured to receive a second lookup message from a third processor; the second lookup message includes second data, and the second lookup message is used to look up the embedded parameters of the second data; the third processor is the previous-hop processor of the first processor in the ring communication architecture;
  • the first processor, the second processor and the third processor are processors among the N processors included in the data training system, where N is an integer greater than or equal to 3;
  • in the ring communication architecture, each of the N processors only receives messages from its previous-hop processor, and only sends messages to its next-hop processor.
  • The device further includes an adding unit;
  • the adding unit is configured to, in the case that the embedded parameters of part or all of the second data are found based on the second lookup message, add the embedded parameters of the part or all of the data to the second lookup message to obtain a third lookup message;
  • the sending unit is further configured to send the third lookup message to the second processor;
  • the sending unit is further configured to, in the case that no embedded parameter of the second data is found based on the second lookup message, send the second lookup message to the second processor.
  • The device further includes a lookup unit;
  • the lookup unit is configured to look up the embedded parameters mapped to the part or all of the data in a first embedded table; the first embedded table is an embedded table maintained by the first processor for storing data and embedded parameters, and there is a one-to-one mapping relationship between the data and the embedded parameters in the first embedded table;
  • the adding unit is specifically configured to add the embedded parameters mapped to the part or all of the data to the value field corresponding to the part or all of the data in the second lookup message, to obtain the third lookup message;
  • the sending unit is specifically configured to send the third lookup message to the second processor, where the third lookup message is used to look up the embedded parameters of the data in the second data for which embedded parameters have not yet been found.
  • The device further includes a determining unit and a generating unit;
  • the determining unit is configured to determine that the part or all of the data belongs to the first embedded table, but the first embedded table does not yet include the part or all of the data; the first embedded table is an embedded table maintained by the first processor for storing data and embedded parameters, and there is a one-to-one mapping relationship between the data and the embedded parameters in the first embedded table;
  • the generating unit is configured to generate the embedded parameters corresponding to the part or all of the data;
  • the adding unit is specifically configured to add the generated embedded parameters to the value field corresponding to the part or all of the data in the second lookup message, to obtain the third lookup message;
  • the sending unit is specifically configured to send the third lookup message to the second processor, where the third lookup message is used to look up the embedded parameters of the data in the second data for which embedded parameters have not yet been found.
  • The sending unit is specifically configured to send the second lookup message to the second processor in the case that the embedded parameters of the second data are not found in the first embedded table; the first embedded table is an embedded table maintained by the first processor for storing data and embedded parameters, and there is a one-to-one mapping relationship between the data and the embedded parameters in the first embedded table.
  • The receiving unit is further configured to receive a fourth lookup message from the third processor; the fourth lookup message includes third data and the embedded parameters of a first part of the data in the third data, and the fourth lookup message is used to look up the embedded parameters mapped to the data in the third data other than the first part of the data;
  • the device further includes an adding unit, configured to, in the case that the embedded parameters of a second part of the data in the third data are found based on the fourth lookup message, add the embedded parameters of the second part of the data to the fourth lookup message to obtain a fifth lookup message;
  • the sending unit is further configured to send the fifth lookup message to the second processor;
  • the sending unit is further configured to, in the case that no embedded parameter of the third data is found based on the fourth lookup message, send the fourth lookup message to the second processor.
  • The receiving unit is further configured to receive a sixth lookup message from the third processor, where the sixth lookup message includes the first data and the embedded parameters of the first data.
  • In a fourth aspect, the present application provides a data processing device, the device comprising:
  • a sending unit, configured to send a first notification message to a second processor; the first notification message includes first data and a first gradient, and is used to propagate the first gradient to a first target processor; the first gradient is the gradient corresponding to the embedded parameters of the first data; the second processor is the next-hop processor of the first processor in the ring communication architecture where the first processor is located;
  • a receiving unit, configured to receive a second notification message from a third processor; the second notification message includes second data and a second gradient, and is used to propagate the second gradient to a second target processor; the second gradient is the gradient corresponding to the embedded parameters of the second data; the third processor is the previous-hop processor of the first processor in the ring communication architecture;
  • the first processor, the second processor and the third processor are processors among the N processors included in the data training system, where N is an integer greater than or equal to 3;
  • in the ring communication architecture, each of the N processors only receives messages from its previous-hop processor, and only sends messages to its next-hop processor.
  • The device further includes an acquiring unit;
  • the acquiring unit is configured to, in the case that the second notification message includes a first target gradient, acquire the first target gradient from the second notification message;
  • the sending unit is further configured to send the second notification message to the second processor; the first target gradient is a gradient of the embedded parameters in the first embedded table maintained by the first processor, and there is a one-to-one mapping relationship between the data and the embedded parameters in the first embedded table;
  • the sending unit is further configured to, in the case that the second notification message does not include the first target gradient, send the second notification message to the second processor.
  • The acquiring unit is specifically configured to: determine that part or all of the second data is data in the first embedded table; and acquire the first target gradient from the second notification message based on the part or all of the data.
  • The receiving unit is further configured to receive a third notification message from the third processor; the third notification message includes third data and a third gradient, and is used to propagate the third gradient to a third target processor; the third gradient is the gradient corresponding to the embedded parameters of the third data;
  • the device further includes an acquiring unit, configured to, in the case that the third notification message includes a second target gradient, acquire the second target gradient from the third notification message;
  • the sending unit is further configured to send the third notification message to the second processor; the second target gradient is a gradient of the embedded parameters in the first embedded table maintained by the first processor, and the first embedded table includes the mapping relationship between data and the embedded parameters of the data;
  • the sending unit is further configured to, in the case that the third notification message does not include the second target gradient, send the third notification message to the second processor.
  • In a fifth aspect, the present application provides an apparatus, which may include a processor and a memory, to implement the data processing method described in the first aspect above.
  • the memory is coupled to the processor, and when the processor executes the computer program stored in the memory, the method described in the first aspect or any possible implementation manner of the first aspect can be implemented.
  • the apparatus may also include a communication interface for the apparatus to communicate with other apparatuses, for example, the communication interface may be a transceiver, circuit, bus, module or other type of communication interface.
  • the communication interface includes a receiving interface and a sending interface, the receiving interface is used for receiving messages, and the sending interface is used for sending messages.
  • The apparatus may include:
  • a processor, configured to: send a first lookup message to a second processor through the sending interface, where the first lookup message includes first data and is used to look up the embedded parameters of the first data, and the second processor is the next-hop processor of the first processor in the ring communication architecture where the first processor is located; and receive a second lookup message from a third processor through the receiving interface, where the second lookup message includes second data and is used to look up the embedded parameters of the second data, and the third processor is the previous-hop processor of the first processor in the ring communication architecture;
  • the first processor, the second processor and the third processor are processors among the N processors included in the data training system, where N is an integer greater than or equal to 3;
  • in the ring communication architecture, each of the N processors only receives messages from its previous-hop processor, and only sends messages to its next-hop processor.
  • The computer program in the memory in this application may be pre-stored, or may be downloaded from the Internet and then stored when the apparatus is used; this application does not specifically limit the source of the computer program in the memory.
  • the coupling in the embodiments of the present application is an indirect coupling or connection between devices, units or modules, which may be in electrical, mechanical or other forms, and is used for information exchange between devices, units or modules.
  • In a sixth aspect, the present application provides an apparatus, which may include a processor and a memory, to implement the data processing method described in the second aspect above.
  • the memory is coupled to the processor, and when the processor executes the computer program stored in the memory, the method described in the second aspect or any possible implementation manner of the second aspect can be implemented.
  • the apparatus may also include a communication interface for the apparatus to communicate with other apparatuses, for example, the communication interface may be a transceiver, circuit, bus, module or other type of communication interface.
  • the communication interface includes a receiving interface and a sending interface, the receiving interface is used for receiving messages, and the sending interface is used for sending messages.
  • The apparatus may include:
  • a processor, configured to: send a first notification message to a second processor through the sending interface, where the first notification message includes first data and a first gradient and is used to propagate the first gradient to a first target processor, the first gradient is the gradient corresponding to the embedded parameters of the first data, and the second processor is the next-hop processor of the first processor in the ring communication architecture where the first processor is located; and receive a second notification message from a third processor through the receiving interface, where the second notification message includes second data and a second gradient and is used to propagate the second gradient to a second target processor, the second gradient is the gradient corresponding to the embedded parameters of the second data, and the third processor is the previous-hop processor of the first processor in the ring communication architecture;
  • the first processor, the second processor and the third processor are processors among the N processors included in the data training system, where N is an integer greater than or equal to 3;
  • in the ring communication architecture, each of the N processors only receives messages from its previous-hop processor, and only sends messages to its next-hop processor.
  • The computer program in the memory in this application may be pre-stored, or may be downloaded from the Internet and then stored when the apparatus is used; this application does not specifically limit the source of the computer program in the memory.
  • the coupling in the embodiments of the present application is an indirect coupling or connection between devices, units or modules, which may be in electrical, mechanical or other forms, and is used for information exchange between devices, units or modules.
  • In another aspect, the present application provides a data training system. The system includes N processors, where N is an integer greater than or equal to 3. Communication among the N processors is realized through a ring communication architecture, in which each of the N processors only receives messages from its previous-hop processor and only sends messages to its next-hop processor. Each of the N processors may be the device described in any one of the third aspect and its possible implementations; or, each of the N processors may be the device described in any one of the fourth aspect and its possible implementations; or, each of the N processors may be the apparatus described in any one of the fifth aspect and its possible implementations; or, each of the N processors may be the apparatus described in any one of the sixth aspect and its possible implementations.
  • In another aspect, the present application provides a computer-readable storage medium storing a computer program; when the computer program is executed by a processor, the method described in any one of the above first aspect and its possible implementations is implemented; or, when the computer program is executed by a processor, the method described in any one of the second aspect and its possible implementations is implemented.
  • In another aspect, the present application provides a computer program product; when the computer program product is executed by a processor, the method described in any one of the above first aspect and its possible implementations is performed; or, the method described in any one of the above second aspect and its possible implementations is performed.
  • FIG. 1 is a schematic diagram of an artificial intelligence main body framework provided by an embodiment of the present application.
  • FIG. 2 is a schematic diagram of an application environment provided by an embodiment of the present application.
  • FIG. 3 is a schematic structural diagram of a neural network processor according to an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a data training system provided by an embodiment of the present application;
  • FIG. 5 is a schematic diagram of a data training model;
  • FIG. 7 is a schematic diagram of the communication part and the calculation part of the data training process;
  • FIG. 8 is a schematic diagram of performing ring communication in the data training model of the present application;
  • FIG. 9 shows a schematic flowchart of a data processing method provided by the present application.
  • FIGS. 10A to 10E are schematic flowcharts of a ring communication provided by the present application;
  • FIG. 11 is a schematic flowchart of another data processing method provided by the present application.
  • FIGS. 12A to 12D are schematic flowcharts of a ring communication provided by the present application;
  • FIG. 13 is a schematic diagram showing the comparison of the communication throughput of the present solution and the communication throughput of the existing technical solution;
  • FIG. 14 and FIG. 15 are schematic diagrams of the logical structure of the apparatus provided by the present application.
  • FIG. 16 is a schematic diagram showing the physical structure of the apparatus provided by the present application.
  • Figure 1 shows a schematic diagram of an artificial intelligence main frame, which describes the overall workflow of an artificial intelligence system and is suitable for general artificial intelligence field requirements.
  • the "intelligent information chain” reflects a series of processes from data acquisition to processing. For example, it can be the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, intelligent execution and output. In this process, data has gone through the process of "data-information-knowledge-wisdom".
  • the "IT value chain” reflects the value brought by artificial intelligence to the information technology industry from the underlying infrastructure of human intelligence, information (providing and processing technology implementation) to the industrial ecological process of the system.
  • The infrastructure provides computing power support for the artificial intelligence system, realizes communication with the outside world, and provides support through the basic platform. The infrastructure communicates with the outside world through sensors; computing power is provided by smart chips (hardware acceleration chips such as CPUs, NPUs, GPUs, ASICs, and FPGAs); the basic platform includes distributed computing frameworks and network-related platform guarantees and support, which can include cloud storage and computing, interconnection networks, and so on. For example, sensors communicate with the outside world to obtain data, and the data is provided to the smart chips in the distributed computing system provided by the basic platform for calculation.
  • the data on the upper layer of the infrastructure is used to represent the data sources in the field of artificial intelligence.
  • the data involves graphics, images, voice, and text, as well as IoT data from traditional devices, including business data from existing systems and sensory data such as force, displacement, liquid level, temperature, and humidity.
  • Data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making, etc.
  • machine learning and deep learning can perform symbolic and formalized intelligent information modeling, extraction, preprocessing, training, etc. on data.
  • Reasoning refers to the process of simulating human's intelligent reasoning method in a computer or intelligent system, using formalized information to carry out machine thinking and solving problems according to the reasoning control strategy, and the typical function is search and matching.
  • Decision-making refers to the process of making decisions after intelligent information is reasoned, usually providing functions such as classification, sorting, and prediction.
  • some general capabilities can be formed based on the results of data processing, such as algorithms or a general system, such as translation, text analysis, computer vision processing, speech recognition, image identification, etc.
  • Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields; they are the encapsulation of the overall artificial intelligence solution and the productization of intelligent information decision-making and implementation of applications. The application areas mainly include intelligent manufacturing, intelligent transportation, smart home, smart medical care, smart security, autonomous driving, safe city, smart terminals, and so on.
  • Referring to FIG. 2, an embodiment of the present application provides a system architecture 200.
  • the data collection device 260 is used to collect training sample data and store it in the database 230 , and the training device 220 generates the target model/rule 201 based on the sample data maintained in the database 230 .
  • the following will describe in more detail how the training device 220 obtains the target model/rule 201 based on the sample data, and the target model/rule 201 can implement functions such as click-through rate estimation, information recommendation, or search.
  • The work of each layer in a deep neural network can be described mathematically. From the physical level, the work of each layer in the deep neural network can be understood as completing the transformation from the input space (the set of input vectors) to the output space (that is, from the row space of the matrix to the column space) through five operations on the input space: 1. dimension raising/lowering; 2. enlarging/reducing; 3. rotation; 4. translation; 5. "bending". The operations of 1, 2 and 3 are completed by W·x, the operation of 4 is completed by +b, and the operation of 5 is realized by a(); in other words, each layer computes y = a(W·x + b).
  • W is the weight vector, and each value in the vector represents the weight value of a neuron in the neural network of this layer.
  • the vector W determines the space transformation from the input space to the output space described above, that is, the weight W of each layer controls how the space is transformed.
  • the purpose of training the deep neural network is to finally obtain the weight matrix of all layers of the trained neural network (the weight matrix formed by the vectors W of many layers). Therefore, the training process of the neural network is essentially learning the way to control the spatial transformation, and more specifically, learning the weight matrix.
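  • As a tiny numeric illustration of this per-layer transformation y = a(W·x + b) (shapes and values are arbitrary examples, not from the application):

```python
import numpy as np

W = np.array([[0.5, -1.0],
              [2.0,  0.0],
              [1.0,  1.0]])        # operations 1-3: scale/rotate, 2-dim -> 3-dim
b = np.array([0.1, -0.2, 0.0])     # operation 4: translation
x = np.array([1.0, 2.0])           # input vector

y = np.tanh(W @ x + b)             # operation 5: the nonlinearity "bends"
print(y)
```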
  • During training, the predicted value of the current network can be compared with the truly desired target value, and the weight vectors of each layer of the network are then updated according to the difference between the two (of course, there is usually an initialization process before the first update, that is, parameters are pre-configured for each layer in the deep neural network); for example, if the predicted value of the network is high, the weight vectors are adjusted to make the prediction lower, and the adjustment continues until the neural network can predict the actually desired target value.
  • the target models/rules obtained by training the device 220 can be applied in different systems or devices.
  • the execution device 210 is configured with an I/O interface 212 for data interaction with external devices, and a “user” can input data to the I/O interface 212 through the client device 240 .
  • the execution device 210 can call data, codes, etc. in the data storage system 250 , and can also store data, instructions, etc. in the data storage system 250 .
  • the calculation module 211 uses the target model/rule 201 to process the input data. For example, for a click rate estimation scenario, the calculation module 211 uses the target model/rule 201 to predict information that the user may click.
  • the I/O interface 212 returns the processing result to the client device 240, which is provided to the user.
  • the training device 220 can generate corresponding target models/rules 201 based on different data for different targets, so as to provide users with better results.
  • The user can manually specify the data input to the execution device 210, e.g., by operating in the interface provided by the I/O interface 212.
  • the client device 240 can automatically input data to the I/O interface 212 and obtain the result. If the client device 240 automatically inputs data and needs to obtain the user's authorization, the user can set the corresponding permission in the client device 240 .
  • the user can view the result output by the execution device 210 on the client device 240, and the specific presentation form can be a specific manner such as display, sound, and action.
  • the client device 240 can also act as a data collection terminal to store the collected sample data in the database 230 .
  • It should be noted that FIG. 2 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship among the devices, components, and modules shown in the figure does not constitute any limitation.
  • the data storage system 250 is an external memory relative to the execution device 210 , and in other cases, the data storage system 250 may also be placed in the execution device 210 .
  • FIG. 3 is a structural diagram of chip hardware provided by an embodiment of the present application.
  • the neural network processor NPU 30 is mounted on the main CPU (Host CPU) as a co-processor, and tasks are assigned by the Host CPU.
  • the core part of the NPU is the arithmetic circuit 305, which is controlled by the controller 304 to extract the matrix data in the memory and perform multiplication operations.
  • the arithmetic circuit 305 includes multiple processing units (Process Engine, PE). In some implementations, the arithmetic circuit 305 is a two-dimensional systolic array. The arithmetic circuit 305 may also be a one-dimensional systolic array or other electronic circuitry capable of performing mathematical operations such as multiplication and addition. In some implementations, arithmetic circuit 305 is a general-purpose matrix processor.
  • The arithmetic circuit fetches the data corresponding to matrix B from the weight memory 302 and buffers it on each PE in the arithmetic circuit.
  • The arithmetic circuit fetches the data of matrix A from the input memory 301 and performs a matrix operation with matrix B, and the obtained partial or final result of the matrix is stored in the accumulator 308.
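  • The following sketch only illustrates the partial-result accumulation described above (C = A·B built up tile by tile, cf. accumulator 308); real systolic-array scheduling is considerably more involved, and the tiling here is purely an assumption for illustration:

```python
import numpy as np

def tiled_matmul(A: np.ndarray, B: np.ndarray, tile: int = 2) -> np.ndarray:
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N))  # accumulator for partial results
    for k0 in range(0, K, tile):
        # Each pass consumes one tile of A and B and accumulates
        # a partial result into C.
        C += A[:, k0:k0 + tile] @ B[k0:k0 + tile, :]
    return C

A = np.random.randn(4, 6)
B = np.random.randn(6, 3)
assert np.allclose(tiled_matmul(A, B), A @ B)
```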
  • Unified memory 306 is used to store input data and output data.
  • the weight data is directly transferred to the weight memory 302 through a storage unit access controller (Direct Memory Access Controller, DMAC) 305 .
  • Input data is also moved to unified memory 306 via the DMAC.
  • The BIU is the Bus Interface Unit, that is, bus interface unit 310, which is used for the interaction among the AXI bus, the DMAC, and the instruction fetch buffer 309.
  • The bus interface unit 310 is used for the instruction fetch buffer 309 to obtain instructions from the external memory, and also for the storage unit access controller 305 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
  • the DMAC is mainly used to transfer the input data in the external memory DDR to the unified memory 306 , the weight data to the weight memory 302 , or the input data to the input memory 301 .
  • the vector calculation unit 307 includes a plurality of operation processing units, and further processes the output of the operation circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, etc., if necessary. It is mainly used for non-convolutional layer network calculations in neural networks, such as pooling (Pooling), batch normalization (Batch Normalization), and local response normalization (Local Response Normalization).
  • The vector calculation unit 307 stores the processed output vectors to the unified memory 306.
  • the vector calculation unit 307 may apply a nonlinear function to the output of the arithmetic circuit 305, such as a vector of accumulated values, to generate activation values.
  • vector computation unit 307 generates normalized values, merged values, or both.
  • the vector of processed outputs can be used as activation input to the arithmetic circuit 305, eg, for use in subsequent layers in a neural network.
  • the instruction fetch memory (instruction fetch buffer) 309 connected to the controller 304 is used to store the instructions used by the controller 304;
  • The unified memory 306, the input memory 301, the weight memory 302 and the instruction fetch memory 309 are all on-chip memories; the external memory is private to the NPU hardware architecture.
  • FIG. 4 is a schematic diagram of the data training system provided by the present application.
  • the system includes N processors, where N is an integer greater than one.
  • the N processors may be the training device 220 in FIG. 2 described above.
  • Among the N processors, a ring communication architecture can be used to implement message communication.
  • the ring communication architecture is a logical architecture for realizing ring communication among the N processors.
  • each of the N processors only receives messages from the previous hop processor of each processor, and only sends messages to the next hop processor of each processor.
  • For example, with four processors as shown in FIG. 4: processor 0 only sends messages to processor 1, and only receives messages from processor 3; processor 1 only sends messages to processor 2, and only receives messages from processor 0; processor 2 only sends messages to processor 3, and only receives messages from processor 1; processor 3 only sends messages to processor 0, and only receives messages from processor 2.
  • The above ring communication based on the ring communication architecture may be implemented by adopting the ring communication mode of a message passing interface (MPI).
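  • As one possible realization (an assumption; the patent only states that MPI's ring communication mode may be used), a minimal mpi4py sketch of a single ring step, in which every rank sends to its next hop and receives from its previous hop in one combined call:

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()  # corresponds to the N processors / training processes

message = {"from": rank}  # e.g., a lookup or gradient notification message
received = comm.sendrecv(message,
                         dest=(rank + 1) % size,    # next hop
                         source=(rank - 1) % size)  # previous hop
print(f"rank {rank} received a message from rank {received['from']}")
# run with, e.g.: mpirun -n 4 python ring_step.py
```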
  • Optionally, all message communication in the whole training process may adopt the above ring communication architecture; or, only part of the message communication may adopt it. Such messages include, for example, the lookup messages used to find the embedding parameters of data during the forward propagation of the embedding layer, and/or the notification messages used during the back propagation of the embedding layer to obtain the gradients for optimizing the embedding parameters; the rest of the message communication may adopt other communication methods, which is not limited in this application.
  • The above embedding parameter is in the form of a vector, and the embedding parameter may also be referred to as an embedding vector.
  • the above gradient can also be in the form of a vector, and the gradient can also be called a gradient vector.
  • The N processors may all be graphics processing units (GPUs); or, the N processors may all be neural-network processing units (NPUs); or, some of the N processors may be GPUs and the rest NPUs.
  • The NPU may be the neural network processor described in FIG. 3 above. It should be noted that the N processors are not limited to GPUs or NPUs, and may also be other high-speed processors.
  • The above data training system can be applied to scenarios where the number of embedding parameters to be trained reaches tens of billions, or even hundreds of billions.
  • The data training system can be applied in practical application scenarios such as information search, information recommendation, and advertising, for example, click-through rate (CTR) estimation.
  • the data trained by the data training system may be sparse data, or may be dense data, which is not limited in this application.
  • the above-mentioned data to be trained may be identification (identity document, id) data
  • the id data may be a number or a character string or the like.
  • the id data may be the identification code of the product or the address of the merchant's store, or the like.
  • The data processing method provided by this application is mainly described below by taking id data as the data to be trained as an example; however, the method can also process other types of data, and is not limited to id data.
  • For example, as shown in FIG. 5, the data to be trained is divided into N parts; each of the N parts may be referred to as a batch, each batch contains q pieces of sample data, and each piece of sample data contains multiple data.
  • Here q is an integer greater than 0 and is called the batch size; the number of data contained in each piece of sample data can differ.
  • each processor runs a training process to train the corresponding data, and each training process has its own serial number, which is used by the processor to distinguish different processes.
  • the message communication between the processors described later can also be said to be the message communication between the training processes.
  • The training of data is performed by a deep-learning neural network. Therefore, the model with which each processor trains data includes, but is not limited to, sub-models such as the input layer, embedding layer, hidden layer, loss function operator, gradient calculation operator, and parameter update operator; the model of the training data shown in FIG. 5 only exemplarily draws some of the sub-models.
  • the whole training process includes forward-propagation (FP) process and back-propagation (back-propagation, BP) process.
  • The embedding layer in forward propagation and the embedding layer in back propagation shown in FIG. 5 are the same embedding layer; similarly, the hidden layer in forward propagation and the hidden layer in back propagation are the same hidden layer. They are drawn separately to better distinguish and reflect the forward propagation and back propagation processes.
  • The process of forward propagation includes: the data is input into the embedding layer, which maps the data into dense embedding parameters for calculation. In the calculation of the embedding layer, messages need to be communicated among the N processors to find the embedding parameters of their respective training data (why and how they communicate is introduced in detail later and not repeated here). The output of the embedding layer is the embedding parameters of the data, and these embedding parameters are input to the hidden layer for calculation, which outputs a predicted value.
  • The output predicted value can be combined with the label to establish a loss function (loss), and the gradient can be calculated by automatic differentiation.
  • The process of back propagation includes: based on the above loss function and gradient, the processor derives the gradients of all training parameters of the hidden layer and the embedding layer through reverse chain differentiation, and then optimizes the parameters through an optimization algorithm. Specifically, when these gradients are back-propagated to the embedding layer, the processor calculates the gradient corresponding to the embedding parameters of each data based on these gradients; then, the N processors obtain the gradients corresponding to the embedding parameters each of them requires through message communication, and each processor optimizes the corresponding embedding parameters based on the obtained gradients (why and how they communicate is introduced in detail later and not repeated here).
• The function of the embedding layer is mainly to map data into a dense vector, and the dense vector is the above-mentioned embedding parameter. Since the amount of data to be trained is huge and the model is trained in parallel, to facilitate calculation and save the computing resources of preprocessing, the data to be trained can be randomly allocated to the N processors (N training processes) for training.
• Each processor, or the training process of each processor, maintains an embedding table, which is used to store data and embedding parameters.
• The embedding parameters of the data randomly assigned to a processor are not necessarily in that processor's embedding table; they may need to be obtained from the embedding tables of other processors. The processors therefore need to query their respective embedding parameters from each other through message communication.
• Optionally, the present application may divide different embedding tables by modulo (mod) calculation. Specifically, the data in the same embedding table has the same remainder after the modulo-N calculation. Optionally, the remainder of the data in the embedding table of processor i among the N processors after the modulo-N calculation is i, and the process number of the training process in processor i is i.
• Optionally, the present application may divide different embedding tables by "division" calculation. Specifically, the data in the same embedding table yields the same quotient when integer-divided by N. For example, assuming N is 3, then 4 and 5 belong to the same embedding table, since 4 divided by 3 equals 1 and 5 divided by 3 also equals 1 (taking the integer quotient).
• Optionally, the present application may divide different embedding tables by random allocation. Specifically, the data in the embedding table of each of the N processors is random. In this case, the data itself can be used directly as an index to find the corresponding embedding parameter in the embedding table.
• The embodiments of the present application do not limit the manner of segmenting the embedding table.
• The following describes the specific implementation process mainly by taking the embedding table segmented by modulo calculation as an example; this does not constitute a limitation on the embodiments of the present application.
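• As a minimal illustration of the three segmentation schemes above, the following Python sketch shows how the processor owning a given piece of id data could be determined. The function names and the dictionary used for random allocation are illustrative assumptions, not part of this application.

```python
import random

N = 3  # number of processors / training processes (example value)

def owner_by_mod(data_id: int) -> int:
    # Modulo segmentation: ids with the same remainder modulo N share one
    # embedding table, and the remainder equals the owning processor's number.
    return data_id % N

def table_key_by_div(data_id: int) -> int:
    # "Division" segmentation: ids with the same integer quotient share one
    # embedding table, e.g. with N = 3, 4 // 3 == 5 // 3 == 1, so ids 4 and 5
    # fall into the same table.
    return data_id // N

_random_owner = {}

def owner_by_random(data_id: int) -> int:
    # Random segmentation: each id is assigned to a processor once at random;
    # afterwards the id itself is used directly as the lookup index.
    if data_id not in _random_owner:
        _random_owner[data_id] = random.randrange(N)
    return _random_owner[data_id]
```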
• The data and embedding parameters in the embedding table of each of the above N processors may be initialized by loading the data of the embedding table of the current training model. Alternatively, for data queried for the first time when looking up the embedding table, random numbers can be used directly to initialize the embedding parameters of the data, and the data and the randomly generated embedding parameters can be inserted into the embedding table to complete the initialization. This application does not limit the data initialization method of the embedding table.
• Table 1 exemplarily shows the content and structure of the embedding table; the id in Table 1 is the data, and in the embedding table each piece of data maps to an embedding parameter.
• m in Table 1 can be any integer greater than 1.
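• A minimal sketch of such an embedding table, including the first-lookup random initialization described above; the dictionary layout and the embedding dimension are assumptions made for illustration.

```python
import random

EMB_DIM = 8  # assumed embedding dimension; the application does not fix one

class EmbeddingTable:
    """Maps id data to its embedding parameter (a dense vector). An id queried
    for the first time gets a randomly initialized parameter, which is inserted
    into the table, as described above."""

    def __init__(self):
        self.table = {}  # id -> list[float], the embedding parameter

    def lookup(self, data_id):
        if data_id not in self.table:
            # First query of this id: initialize with random numbers and insert.
            self.table[data_id] = [random.gauss(0.0, 0.01) for _ in range(EMB_DIM)]
        return self.table[data_id]
```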
• After a processor calculates the gradients of the embedding parameters corresponding to its training data, it needs to distribute the gradient corresponding to each piece of data's embedding parameter to the processor holding the embedding table where that data is located, so that that processor can optimize the embedding parameters in its own embedding table. For ease of understanding, see Table 2 for an example.
• Table 2:
  Processor number                 | Training data     | Remainder of the data modulo 3
  Processor 0 (training process 0) | 10, 21, 14 and 19 | 1, 0, 2 and 1
  Processor 1 (training process 1) | 31, 23, 3 and 8   | 1, 2, 0 and 2
  Processor 2 (training process 2) | 12, 5, 19 and 33  | 0, 2, 1 and 0
• In Table 2, it is assumed that the above N is 3, that is, the data is trained by 3 processors.
• Table 2 exemplifies the data randomly assigned to each processor for training and gives the remainder of each piece of data modulo 3.
• For example, the data to be trained randomly obtained by processor 0 is 10, 21, 14 and 19, and the remainders of these data modulo 3 are 1, 0, 2 and 1, respectively.
• Assuming that the remainder of the data in the embedding table of processor i after the modulo-N calculation is i, then: the remainder of the data in the embedding table of processor 0 modulo 3 is 0; the remainder of the data in the embedding table of processor 1 modulo 3 is 1; and the remainder of the data in the embedding table of processor 2 modulo 3 is 2.
• That is, the embedding table in processor 0 only has embedding parameters for data whose remainder modulo 3 is 0, but no embedding parameters for data whose remainder modulo 3 is 1 or 2. Therefore, in the forward-propagation process, processor 0 needs to communicate with processor 1 and processor 2 to obtain the embedding parameters of data 10, 14 and 19.
• In back propagation, processor 0 calculates the gradients corresponding to data 10, 21, 14 and 19, which are used to correct and update the embedding parameters of data 10, 21, 14 and 19 in the embedding tables.
• The embedding parameters of data 10 and 19 are in processor 1, and the embedding parameter of data 14 is in processor 2. Therefore, processor 0 needs to send the calculated gradients of data 10 and 19 to processor 1, and the gradient of data 14 to processor 2.
• The communication of these gradients can be achieved by message communication.
• When the N processors send messages to each other in this many-to-many manner, each of the N processors sends messages to multiple other processors. This consumes a large amount of both sending bandwidth and receiving bandwidth, and processors need to queue to send and queue to receive messages, which easily leads to communication bottlenecks and increases communication delays.
• To facilitate understanding that the above message communication process cannot be overlapped with the calculation process, refer to FIG. 7. It can be seen in FIG. 7 that the communication part and the calculation part cannot be overlapped: the processor must wait for the communication process to complete before performing the next calculation process. Therefore, if the delay of the communication process is large, the efficiency of the entire training process is seriously affected, thereby reducing the performance of the training system.
• In view of this, the present application provides a data processing method, which can improve the utilization of communication bandwidth between processors, reduce the communication delay in the forward propagation and back propagation of the embedding layer, and improve training efficiency.
• The data processing method provided by the present application mainly deploys a ring communication architecture among the above N processors, so that during the forward-propagation process of the embedding layer each processor communicates with other processors through the ring communication architecture to find the corresponding embedding parameters, and during the back-propagation process of the embedding layer each processor communicates with other processors through the ring communication architecture to obtain the gradients corresponding to the embedding parameters of the data it requires.
• As shown in FIG. 8, during forward propagation each processor can search for the required embedding parameters through the ring communication architecture; and during back propagation the N processors can carry out message communication through the ring communication architecture to obtain the required gradients.
• In the forward-propagation process, each of the above N processors can generate a search message, and the N processors each send the generated search message to their own next-hop processor in the communication mode of the ring communication architecture. After each processor receives a search message, it can identify whether any data in the search message belongs to the data in the embedding table it maintains. If some data belongs to it, the processor finds the embedding parameters corresponding to that data and adds the found embedding parameters to the received message; it then sends the message with the added embedding parameters to its next-hop processor, again in the communication mode of the ring communication architecture. If no data belongs to it, the processor sends the received message directly to its next-hop processor in the communication mode of the ring communication architecture. After the search and sending operations are repeated at least N times, each processor can obtain the embedding parameters of all the data it searched for; a sketch of one such round is given below.
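• A minimal sketch of one such round under the modulo segmentation. Here send and recv stand for the encapsulated sending and receiving interfaces described below, the message is a plain dictionary, and my_table is the EmbeddingTable sketched earlier; all of these are assumptions for illustration.

```python
def ring_search_round(my_rank, N, msg, my_table, send, recv):
    # Fill in the embedding parameters of the ids this processor owns
    # (id % N == my_rank), then forward the message to the next hop and
    # receive a new message from the previous hop.
    for i, data_id in enumerate(msg["ids"]):
        if data_id % N == my_rank and msg["values"][i] is None:
            msg["values"][i] = my_table.lookup(data_id)
    send((my_rank + 1) % N, msg)   # to the next-hop processor
    return recv((my_rank - 1) % N) # from the previous-hop processor
```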
• In the back-propagation process, each of the above N processors can generate a message including data and the corresponding gradients, and the N processors each send the generated message to their own next-hop processor in the communication mode of the ring communication architecture. After each processor receives a message, it can identify whether any data in the message belongs to the data in the embedding table it maintains. If some data belongs to it, the processor takes the gradients corresponding to that data from the message to optimize and update the corresponding embedding parameters in its embedding table, and then sends the received message to the next-hop processor in the communication mode of the ring communication architecture. If no data belongs to it, the processor likewise sends the received message to the next-hop processor. After the sending and obtaining operations are repeated for at least N-1 cycles, each processor can obtain the gradients corresponding to the embedding parameters of all the data in its own embedding table, so that the optimization and update of the embedding parameters of all the data can be completed.
• If the data trained by a processor is sparse data, then in the forward propagation of the embedding layer, before finding the embedding parameters of the sparse data, the processor first converts the sparse data into dense data, and then uses the converted dense data as indexes to find the corresponding embedding parameters.
• In this case, the data carried in the messages sent and received based on the above ring communication architecture is in the form of dense data, and the data in the embedding table maintained by each processor is also in the form of dense data.
• The process of implementing ring message communication among the above N processors through the ring communication architecture can be encapsulated into communication interfaces: the operation of each processor sending a message to the next-hop processor can be encapsulated into a sending interface, and the operation of each processor receiving a message from the previous-hop processor can be encapsulated into a receiving interface. Thus, when a processor needs to send a message based on the above ring communication architecture, it calls the encapsulated sending interface, and when it needs to receive a message, it calls the encapsulated receiving interface.
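• A minimal single-process stand-in for these two interfaces, using one inbound queue per processor in place of a real interconnect; the queue-based transport is purely an assumption for illustration.

```python
import queue

N = 4  # example number of processors
_inboxes = [queue.Queue() for _ in range(N)]

def ring_send(dst_rank, msg):
    # Encapsulated sending interface: deliver msg to the next-hop processor.
    _inboxes[dst_rank].put(msg)

def ring_recv(my_rank):
    # Encapsulated receiving interface: block until the message from the
    # previous-hop processor arrives.
    return _inboxes[my_rank].get()
```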
• Furthermore, the search process of the embedding parameters in the forward propagation of the embedding layer and the gradient acquisition process in the back propagation of the embedding layer can each be encapsulated into a callable interface and exposed to the artificial intelligence (AI) framework. In the forward propagation, the processor can directly call the encapsulated interface to search for embedding parameters and then return the search result.
• The search result may be the embedding parameters of the found data; if no corresponding embedding parameter is found, the returned result may be a null value.
• In the back propagation, the processor can directly call the encapsulated interface to obtain the gradients and return an operation result. Since the processor searches the message only for the gradients corresponding to the embedding parameters of the data in its own embedding table, the returned operation result can be a null value whether or not any gradient is found.
• The manner in which each operation of the data processing method provided by the present application is encapsulated into an interface is not limited to the implementation shown in the above example; each interface is provided for use by the AI framework, which is not limited in this application.
  • the data processing method may include but is not limited to the following steps:
• Step 901: The first processor sends a first search message to the second processor; the first search message includes first data and is used to search for the embedding parameters of the first data; the second processor is the next-hop processor of the first processor in the ring communication architecture where the first processor is located.
  • the above-mentioned N processors need to obtain the embedding parameters of the respective required data through message communication.
  • the message communication among the N processors may be implemented by adopting a ring communication architecture.
• The communication among the first processor, the next-hop processor of the first processor (the above second processor), and the previous-hop processor of the first processor (the third processor in step 902 below) is first taken as an example to introduce the search process of embedding parameters using the ring communication architecture.
  • the above-mentioned first data may include one or more data.
  • the first data may be sparse data or dense data.
• For the content included in the first search message, see Table 3 for an example.
• Table 3 exemplarily shows part of the content included in the first search message: the id in Table 3 is the data, and the value field is used to fill in the embedding parameter corresponding to the data.
  • k1 in Table 3 can be any integer greater than 0.
• Before an embedding parameter is found, the value field corresponding to a piece of data in the message may be a null value or a default original value (for example, the original value may be 0, etc.).
• The above first search message includes the first data, so that a processor receiving the message can search for the embedding parameters corresponding to the first data based on the first data, and fill them into the value fields corresponding to the first data in the message.
• Optionally, the above first search message may also include the remainder of each piece of data modulo N. Since the remainder is the same as the serial number of the processor where the embedding table containing the data is located (i.e., the serial number of the training process), it can also be said that the first search message may include the process number of the training process where the embedding table containing the data is located; see Table 4 for an example.
• The process number is included in the first search message so that after a processor receives the message, it can quickly determine through the process number which data belongs to the embedding table it maintains, so as to quickly find the corresponding embedding parameters and fill them into the value fields of the message, which can improve search efficiency.
  • the format of the content included in the first search message may be the format shown in Table 4.
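• The two message layouts could be represented as follows; this is a sketch, and the dictionary field names are illustrative assumptions.

```python
# Layout of Table 3: ids plus value fields, without process numbers.
search_msg_v1 = {
    "ids":    [10, 14, 19],        # data whose embedding parameters are sought
    "values": [None, None, None],  # value fields, filled as parameters are found
}

# Layout of Table 4: additionally carries the process number of the owning
# embedding table (id % N), so a receiver can filter by number directly.
search_msg_v2 = {
    "ids":    [10, 14, 19],
    "procs":  [1, 2, 1],           # process numbers, here assuming N = 3
    "values": [None, None, None],
}
```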
• Step 902: The first processor receives a second search message from the third processor; the second search message includes second data and is used to search for the embedding parameters of the second data; the third processor is the previous-hop processor of the first processor in the ring communication architecture.
  • the above-mentioned second data may include one or more data, and the second data is generally different from the above-mentioned first data.
  • the second data may be sparse data or dense data.
  • part of the data in the second data may be the same as part of the data in the first data.
• The format of the foregoing second search message is similar to that of the foregoing first search message; for the content format of the second search message, reference may be made to the description corresponding to Table 3 or Table 4, which is not repeated here.
• In this embodiment of the application, after the first processor sends the above first search message to its next-hop processor (the second processor) and receives the above second search message from its previous-hop processor (the third processor), the first processor performs the embedding-parameter search operation in response to the second search message, which is described in the following two cases:
• Case 1: When the embedding parameters of part or all of the second data are found based on the second search message, the first processor adds the embedding parameters of the part or all of the data to the second search message to obtain a third search message, and sends the third search message to the second processor. The third search message is used to search for the embedding parameters of the data in the second data whose embedding parameters have not been found.
• Specifically, after receiving the second search message, the first processor parses the message to obtain its content, and compares the second data with the data in the embedding table maintained by the first processor itself. If part or all of the second data exists in the embedding table, the first processor obtains from the embedding table the embedding parameters to which that part or all of the data maps. Then, the first processor adds those embedding parameters to the value fields corresponding to the part or all of the data in the second search message to obtain the third search message, and sends the third search message to the next-hop processor, that is, to the second processor.
• Adding an embedding parameter to the value field of the message may be done by operations such as accumulating the embedding parameter into the value field.
• For example, suppose the remainder of the data in the embedding table of processor i among the N processors after the modulo-N calculation is i, and the content carried by the second search message is in the format shown in Table 3 above, that is, the process number is not carried.
• After receiving the second search message, the first processor parses the message to obtain the second data in it, and performs a modulo-N calculation on each piece of the second data to obtain the remainder of each piece. If one or more of the remainders equal the process number of the training process run by the first processor, the data corresponding to those remainders exists in the embedding table maintained by the first processor.
• The first processor then uses the data corresponding to the one or more remainders as indexes and finds in the embedding table the embedding parameters of that data; the found embedding parameters are correspondingly added to the value fields of that data in the second search message to obtain the third search message. Then, the first processor sends the third search message to the next-hop processor, that is, to the second processor.
• Optionally, when the first processor uses the data corresponding to the one or more remainders as indexes to find the embedding parameters in its embedding table, if some of that data is not in the embedding table, the processor can randomly generate corresponding embedding parameters for the data not in the table, and then add both the embedding parameters found in the table and the randomly generated embedding parameters to the corresponding value fields in the second search message to obtain the third search message. Then, the first processor sends the third search message to the next-hop processor, that is, to the second processor. In addition, the processor adds the data not in the embedding table and the randomly generated embedding parameters into the embedding table in one-to-one correspondence; a sketch of this handling is given below.
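• A minimal sketch of this response under the modulo segmentation, reusing the EmbeddingTable sketch above, whose lookup() already performs the random initialization and insertion of missing ids.

```python
def handle_search_message(my_rank, N, msg, my_table):
    # my_table is assumed to be the EmbeddingTable sketched earlier.
    # For every id owned by this processor (id % N == my_rank), fill its
    # value field; ids absent from the table receive a freshly generated
    # random parameter, which lookup() also inserts into the table.
    for i, data_id in enumerate(msg["ids"]):
        if data_id % N == my_rank:
            msg["values"][i] = my_table.lookup(data_id)
    return msg  # the third search message, to be sent to the next hop
```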
• For another example, suppose the remainder of the data in the embedding table of processor i among the N processors after the modulo-N calculation is i, and the content carried by the second search message is in the format shown in Table 4 above, that is, the process number is carried.
• After receiving the second search message, the first processor parses the message to obtain the second data and the corresponding process numbers. If the process numbers in the second search message include one or more serial numbers of the training process run by the first processor, the data corresponding to those serial numbers exists in the embedding table maintained by the first processor.
• The first processor uses the data corresponding to the one or more serial numbers as indexes and finds their embedding parameters in the embedding table; the found embedding parameters are correspondingly added to the value fields of that data in the second search message to obtain the third search message. Then, the first processor sends the third search message to the next-hop processor, that is, to the second processor.
• Optionally, when the first processor uses the data corresponding to the one or more serial numbers as indexes to find the embedding parameters in the embedding table it maintains, if some of that data is not in the embedding table, the processor can randomly generate corresponding embedding parameters for the data not in the table, and then add both the embedding parameters found in the table and the randomly generated embedding parameters to the corresponding value fields in the second search message to obtain the third search message. Then, the first processor sends the third search message to the next-hop processor, that is, to the second processor. In addition, the processor adds the data not in the embedding table and the randomly generated embedding parameters into the embedding table in one-to-one correspondence.
• For example, assume the second search message includes the content shown in Table 5, where the value fields are empty by default.
• The first processor determines that data 9 and 3 in Table 5 belong to the data in the embedding table it maintains, and finds that the embedding parameters of 9 and 3 in the embedding table are parameter a and parameter b respectively; the first processor then adds parameter a and parameter b directly to the value fields corresponding to 9 and 3, respectively. For the result after the addition, see Table 6. That is, the third search message obtained above includes the content shown in Table 6.
• Case 2: When no embedding parameter of the second data is found based on the second search message, the first processor sends the second search message to the second processor.
• Specifically, if the first processor determines that no data in the second data included in the second search message belongs to the data in the embedding table it maintains, that is, the first processor cannot find the embedding parameters of the second data in its own embedding table, the first processor sends the second search message to the next-hop processor, that is, to the above second processor.
• For example, suppose the remainder of the data in the embedding table of processor i among the above N processors after the modulo-N calculation is i. Then data whose remainder after the modulo-N operation is the same as the process number of the training process run by the first processor belongs to the data in the embedding table maintained by the first processor, and the other data does not.
• In this embodiment of the application, after the first processor receives the second search message and completes the response operation to it, the first processor further receives a fourth search message from the third processor. The fourth search message includes third data and the embedding parameters to which a first part of the third data maps, and the fourth search message is used to search for the embedding parameters of the data in the third data other than the first part.
• In other words, the embedding parameters of the first part of the third data carried in the fourth search message have already been found in other processors; therefore, the fourth search message carries the embedding parameters of the first part of the data.
  • the first part of data is one or more data in the third data.
  • the third data may be sparse data or dense data.
• The first processor performs the embedding-parameter search operation in response to the fourth search message, which is likewise described in two cases:
• Case 1: When the embedding parameters of a second part of the third data are found based on the fourth search message, the first processor adds the embedding parameters of the second part of the data to the fourth search message to obtain a fifth search message, and sends the fifth search message to the above second processor. The second part of the data is one or more pieces of the third data, and the second part and the first part are different data.
• Specifically, after receiving the fourth search message, the first processor parses the message to obtain its content, and compares the third data with the data in the embedding table it maintains. If the second part of the third data exists in the embedding table, the first processor obtains from the embedding table the embedding parameters to which the second part maps, adds them to the value fields corresponding to the second part in the fourth search message to obtain the fifth search message, and sends the fifth search message to the next-hop processor, that is, to the second processor. The fifth search message is used to search for the embedding parameters of the data in the third data whose embedding parameters have not been found.
• For example, suppose the remainder of the data in the embedding table of processor i among the N processors after the modulo-N calculation is i, and the content carried by the fourth search message is in the format shown in Table 3 above, that is, the process number is not carried.
• After receiving the fourth search message, the first processor parses the message to obtain the third data in it, and performs a modulo-N calculation on each piece of the third data to obtain the remainder of each piece. If one or more of the remainders equal the process number of the training process run by the first processor, the data corresponding to those remainders exists in the embedding table maintained by the first processor; that data is the above second part of the data.
• The first processor uses the data corresponding to the one or more remainders as indexes and finds their embedding parameters in the embedding table; the found embedding parameters are correspondingly added to the value fields of that data in the fourth search message to obtain the fifth search message. Then, the first processor sends the fifth search message to the next-hop processor, that is, to the second processor.
• Optionally, when the first processor uses the data corresponding to the one or more remainders as indexes to find the embedding parameters in its embedding table, if some of that data is not in the embedding table, the processor can randomly generate corresponding embedding parameters for the data not in the table, and then add both the embedding parameters found in the table and the randomly generated embedding parameters to the corresponding value fields in the fourth search message to obtain the fifth search message. Then, the first processor sends the fifth search message to the next-hop processor, that is, to the second processor. In addition, the processor adds the data not in the embedding table and the randomly generated embedding parameters into the embedding table in one-to-one correspondence.
• For another example, suppose the remainder of the data in the embedding table of processor i among the N processors after the modulo-N calculation is i, and the content carried by the fourth search message is in the format shown in Table 4 above, that is, the process number is carried.
• After receiving the fourth search message, the first processor parses the message to obtain the third data and the corresponding process numbers. If the process numbers in the fourth search message include one or more serial numbers of the training process run by the first processor, the data corresponding to those serial numbers exists in the embedding table maintained by the first processor; that data is the above second part of the data.
• The first processor uses the data corresponding to the one or more serial numbers as indexes and finds their embedding parameters in the embedding table; the found embedding parameters are correspondingly added to the value fields of that data in the fourth search message to obtain the fifth search message. Then, the first processor sends the fifth search message to the next-hop processor, that is, to the second processor.
• Optionally, when the first processor uses the data corresponding to the one or more serial numbers as indexes to find the embedding parameters in the embedding table it maintains, if some of that data is not in the embedding table, the processor can randomly generate corresponding embedding parameters for the data not in the table, and then add both the embedding parameters found in the table and the randomly generated embedding parameters to the corresponding value fields in the fourth search message to obtain the fifth search message. Then, the first processor sends the fifth search message to the next-hop processor, that is, to the second processor. In addition, the processor adds the data not in the embedding table and the randomly generated embedding parameters into the embedding table in one-to-one correspondence.
• For example, assume the fourth search message includes the content shown in Table 7. The first processor determines that data 15 in Table 7 belongs to the data in the embedding table it maintains, and finds that the embedding parameter of 15 in the embedding table is parameter e; the first processor then adds parameter e directly to the value field corresponding to 15. For the result after the addition, see Table 8. That is, the fifth search message obtained above includes the content shown in Table 8.
• Case 2: When the first processor finds no embedding parameter based on the fourth search message, it sends the fourth search message to the second processor.
• Specifically, if the first processor determines that no data in the third data included in the fourth search message belongs to the data in the embedding table it maintains, that is, the first processor cannot find the embedding parameters of the third data in its own embedding table, the first processor sends the fourth search message to the next-hop processor, that is, to the above second processor.
• It should be noted that, in some cases, the fourth search message received by the first processor from the third processor includes not the embedding parameters of part of the third data, but the embedding parameters of all of the third data. In this case, the first processor determines that no data in the third data belongs to the data in the embedding table it maintains, and the first processor sends the fourth search message to the next-hop processor, that is, to the above second processor.
• In this embodiment of the application, the above operations of message sending and embedding-parameter search are repeated. After N-1 cycles, in the Nth cycle the first processor can receive a sixth search message from the third processor. The sixth search message includes the first data and the embedding parameters of the first data. That is, the above first search message is a message generated by the first processor, and the first data it carries is the data the first processor needs to train. After N cycles, the message carrying the first data has passed through the N processors, and the embedding parameters of the first data have been found in one or more of the N processors. The found embedding parameters are forwarded along with the message, and finally the sixth search message reaches the first processor, so that the first processor obtains all the embedding parameters of its training data. For an example, see Table 9.
• Table 9 exemplarily shows the first data included in the sixth search message and the embedding parameters of the first data. It can be seen that the embedding parameters of the first data have all been found and filled into the value fields corresponding to each piece of data.
• After the first processor obtains all the embedding parameters of its training data through the sixth search message, if the training data is sparse data, the first processor needs to perform a reduction operation on the obtained embedding parameters of the training data, and then forward-propagate the reduced embedding parameters to the hidden layer. The reduction operation may be, for example, an operation such as weighting and summing the embedding parameters of training data of the same type or with relatively large correlation; for the specific reduction operation, reference may be made to existing solutions, which is not limited here. A sketch of such a reduction is given below.
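• A minimal sketch of such a reduction as a weighted sum; the uniform default weights are an assumption, since the application leaves the concrete reduction open.

```python
def reduce_embeddings(embeddings, weights=None):
    # Weighted sum of the looked-up embedding parameters of one sample,
    # producing a single dense vector to forward-propagate to the hidden layer.
    if weights is None:
        weights = [1.0] * len(embeddings)
    dim = len(embeddings[0])
    out = [0.0] * dim
    for w, vec in zip(weights, embeddings):
        for j in range(dim):
            out[j] += w * vec[j]
    return out
```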
• As an example, in FIGS. 10A to 10E it is assumed that the above N processors are 4 processors, namely processor 0, processor 1, processor 2 and processor 3, and that the four processors implement message communication through the above ring communication architecture. This example assumes that the remainder of the data in the embedding table of processor i among the N processors after the modulo-N calculation is i.
• In the forward propagation, each processor first needs to find the embedding parameters of the data it trains.
• As shown in FIG. 10A, it is assumed that the data for which processor 0 needs to find embedding parameters is the first batch of data: 21, 5, 14 and 25. The remainders of the first batch of data modulo 4 are 1, 1, 2 and 1 respectively, that is, the embedding parameters of data 21, 5 and 25 need to be looked up in processor 1, and the embedding parameter of data 14 needs to be looked up in processor 2.
• The data for which processor 1 needs to find embedding parameters is the second batch of data: 19, 2, 10 and 32. The remainders of the second batch of data modulo 4 are 3, 2, 2 and 0 respectively, that is, the embedding parameters of data 2 and 10 need to be looked up in processor 2, the embedding parameter of data 19 needs to be looked up in processor 3, and the embedding parameter of data 32 needs to be looked up in processor 0.
• The data for which processor 2 needs to find embedding parameters is the third batch of data: 13, 8, 16 and 29. The remainders of the third batch of data modulo 4 are 1, 0, 0 and 1 respectively, that is, the embedding parameters of data 8 and 16 need to be looked up in processor 0, and the embedding parameters of data 13 and 29 need to be looked up in processor 1.
• The data for which processor 3 needs to find embedding parameters is the fourth batch of data: 6, 33, 18 and 4. The remainders of the fourth batch of data modulo 4 are 2, 1, 2 and 0 respectively, that is, the embedding parameters of data 6 and 18 need to be looked up in processor 2, the embedding parameter of data 33 needs to be looked up in processor 1, and the embedding parameter of data 4 needs to be looked up in processor 0.
• The remainder of each piece of the above data modulo 4 is the serial number of the process where the embedding parameter of that data is located.
• Each processor first generates a message, which includes the data whose embedding parameters are to be searched for, the corresponding process numbers, and the value-field space for filling in the embedding parameters. After each processor generates its own message, according to the communication mode of the ring communication architecture, each sends its generated message to the next-hop processor and receives the message sent by its previous-hop processor. After receiving a message, each processor performs the corresponding table-lookup operation; for the specific table-lookup operation, refer to the foregoing description, which is not repeated here. Then the found embedding parameters are filled into the respectively received messages; see FIG. 10B for details.
• Specifically, processor 0 sends the message including the first batch of data to processor 1, receives the message including the fourth batch of data from processor 3, then finds the embedding parameter of data 4 in its own embedding table and adds it to the value field corresponding to data 4 in the received message to obtain a new message.
• Processor 1 sends the message including the second batch of data to processor 2, receives the message including the first batch of data from processor 0, then finds the embedding parameters of data 21, 5 and 25 in its own embedding table and adds them to the value fields corresponding to data 21, 5 and 25 in the received message to obtain a new message.
• Processor 2 sends the message including the third batch of data to processor 3, receives the message including the second batch of data from processor 1, then finds the embedding parameters of data 2 and 10 in its own embedding table and adds them to the value fields corresponding to data 2 and 10 in the received message to obtain a new message.
• Processor 3 sends the message including the fourth batch of data to processor 0 and receives the message including the third batch of data from processor 2. Since no data in the third batch belongs to the data in the embedding table in processor 3, the embedding parameter of no data in the third batch is found in processor 3.
• Then, the processors that obtained new messages send their new messages to the next-hop processor, and the processor that did not obtain a new message (processor 3) sends the received message to the next-hop processor. After sending, the processors receive new messages from their respective previous-hop processors and continue to search for embedding parameters in response to the new messages. Then the found embedding parameters are filled into the respectively received messages; see FIG. 10C for details.
• Specifically, processor 0 sends the message including the fourth batch of data to processor 1, receives the message including the third batch of data from processor 3, then finds the embedding parameters of data 8 and 16 in its own embedding table and adds them to the value fields corresponding to data 8 and 16 in the received message to obtain a new message.
• Processor 1 sends the message including the first batch of data to processor 2, receives the message including the fourth batch of data from processor 0, then finds the embedding parameter of data 33 in its own embedding table and adds it to the value field corresponding to data 33 in the received message to obtain a new message.
• Processor 2 sends the message including the second batch of data to processor 3, receives the message including the first batch of data from processor 1, then finds the embedding parameter of data 14 in its own embedding table and adds it to the value field corresponding to data 14 in the received message to obtain a new message.
• Processor 3 sends the message including the third batch of data to processor 0, receives the message including the second batch of data from processor 2, then finds the embedding parameter of data 19 in its own embedding table and adds it to the value field corresponding to data 19 in the received message to obtain a new message.
• Then, each processor sends its newly obtained message to the next-hop processor, receives a new message from its previous-hop processor after sending, and continues the embedding-parameter search in response to the new message. Then the found embedding parameters are filled into the respectively received messages; see FIG. 10D for details.
• Specifically, processor 0 sends the message including the third batch of data to processor 1, receives the message including the second batch of data from processor 3, then finds the embedding parameter of data 32 in its own embedding table and adds it to the value field corresponding to data 32 in the received message to obtain a new message.
• Processor 1 sends the message including the fourth batch of data to processor 2, receives the message including the third batch of data from processor 0, then finds the embedding parameters of data 13 and 29 in its own embedding table and adds them to the value fields corresponding to data 13 and 29 in the received message to obtain a new message.
• Processor 2 sends the message including the first batch of data to processor 3, receives the message including the fourth batch of data from processor 1, then finds the embedding parameters of data 6 and 18 in its own embedding table and adds them to the value fields corresponding to data 6 and 18 in the received message to obtain a new message.
• Processor 3 sends the message including the second batch of data to processor 0 and receives the message including the first batch of data from processor 2. Since no data in the first batch belongs to the data in the embedding table in processor 3, the embedding parameter of no data in the first batch is found in processor 3.
• Finally, each processor sends its message to the next-hop processor and receives a new message from its previous-hop processor. At this point, the message received by each processor includes its own training data and the required embedding parameters, thus completing the embedding-parameter search of the entire embedding layer; see FIG. 10E for details.
• Specifically, processor 0 sends the message including the second batch of data to processor 1 and receives the message including the first batch of data from processor 3; the received message includes the embedding parameters of the first batch of data required by processor 0.
• Processor 1 sends the message including the third batch of data to processor 2 and receives the message including the second batch of data from processor 0; the received message includes the embedding parameters of the second batch of data required by processor 1.
• Processor 2 sends the message including the fourth batch of data to processor 3 and receives the message including the third batch of data from processor 1; the received message includes the embedding parameters of the third batch of data required by processor 2.
• Processor 3 sends the message including the first batch of data to processor 0 and receives the message including the fourth batch of data from processor 2; the received message includes the embedding parameters of the fourth batch of data required by processor 3.
• It should be noted that FIG. 10A to FIG. 10E and the related descriptions are only an example and do not constitute a limitation of the present application; modifications made based on the above ideas are all within the protection scope of the present application.
• As can be seen from the above example, the 4 processors find their respective embedding parameters through 4 cycles. Since the communication between processors is realized through the ring communication architecture, compared with the many-to-many communication in the existing technical solution, the present application avoids the single-point communication bottleneck, reduces the communication delay, and improves the communication efficiency, thereby improving the training performance of the entire data training system. The complete example can be condensed into the runnable sketch below.
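• The following self-contained Python sketch replays the example of FIG. 10A to FIG. 10E in a single process. The dictionary message layout and the random parameter initialization are assumptions for illustration; a real deployment would run one training process per processor.

```python
import random

N, EMB_DIM = 4, 4

# The four batches of FIG. 10A, keyed by the processor that trains them.
batches = {
    0: [21, 5, 14, 25],
    1: [19, 2, 10, 32],
    2: [13, 8, 16, 29],
    3: [6, 33, 18, 4],
}

tables = {p: {} for p in range(N)}  # per-processor embedding tables (modulo split)

def lookup(p, data_id):
    # Random-initialize on first query, as described earlier.
    if data_id not in tables[p]:
        tables[p][data_id] = [random.gauss(0.0, 0.01) for _ in range(EMB_DIM)]
    return tables[p][data_id]

# Each processor generates its search message.
msgs = {p: {"ids": list(b), "values": [None] * len(b)} for p, b in batches.items()}

# N cycles: every message hops to the next processor, which fills in the
# value fields of the ids it owns (id % N == its own number).
for _ in range(N):
    msgs = {(p + 1) % N: m for p, m in msgs.items()}  # one ring hop
    for p, m in msgs.items():
        for i, data_id in enumerate(m["ids"]):
            if data_id % N == p:
                m["values"][i] = lookup(p, data_id)

# After N hops each message is back at its origin with every value filled.
for p in range(N):
    assert msgs[p]["ids"] == batches[p]
    assert all(v is not None for v in msgs[p]["values"])
print("all embedding parameters found after", N, "cycles")
```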
• It should be noted that, in the back-propagation process of the embedding layer described below, the "first processor (or data)", "second processor (or data)", "third processor (or data)" and so on may be the same objects as those in FIG. 9 and its possible implementations, or may be different objects that share those names.
• In the back-propagation process of the embedding layer, the data processing method provided by the present application may include but is not limited to the following steps:
• Step 1101: The first processor sends a first notification message to the second processor; the first notification message includes the first data and a first gradient and is used to propagate the first gradient to a first target processor. The first gradient is the gradient corresponding to the embedding parameters of the first data, and the first data and the first gradient are mapped one-to-one. The second processor is the next-hop processor of the first processor in the ring communication architecture where the first processor is located.
• In the back propagation of the embedding layer, each of the above N processors obtains the gradients of the embedding parameters of the data it trains; however, because the embedding parameters of the data trained by each processor may be stored in the embedding tables of other processors, the gradients need to be sent to the corresponding processors through message communication for optimizing the corresponding embedding parameters.
• The N processors implement the message communication through a ring communication architecture.
• The communication among the first processor, the next-hop processor of the first processor (the above second processor), and the previous-hop processor of the first processor (the third processor in step 1102 below) is first taken as an example to introduce the process of obtaining the gradients each processor requires using the ring communication architecture.
• The above first target processor includes one or more of the above N processors; which processors it includes is determined by the first data in the first notification message. Exemplarily, assuming the first data includes part or all of the data in the embedding table in processor i, the first target processor includes processor i.
  • the above-mentioned first data may include one or more data.
• For the content included in the above first notification message, see Table 10 for an example. Table 10 exemplarily shows part of the content of the first notification message: the id in Table 10 is the data, and the value field holds the gradient corresponding to the embedding parameter of the data. k2 in Table 10 can be any integer greater than 0.
• Optionally, the above first notification message may also include the remainder of each piece of data modulo N. Since the remainder is the same as the serial number of the processor where the embedding table containing the data is located (i.e., the serial number of the training process), it can also be said that the first notification message may include the process number of the training process where the embedding table containing the data is located; see Table 11 for an example.
• The process number is included in the first notification message so that after a processor receives the message, it can quickly determine through the process number which data belongs to the embedding table it maintains, so as to quickly obtain the corresponding gradients.
• That is, the format of the content included in the first notification message may be the format shown in Table 11.
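• Analogously to the search messages, the notification message layouts of Table 10 and Table 11 could be sketched as follows; the field names and the toy gradient values are illustrative assumptions, and each gradient has the same dimension as its embedding parameter.

```python
# Layout of Table 10: ids plus the gradients of their embedding parameters.
notify_msg_v1 = {
    "ids":   [10, 14, 19],
    "grads": [[0.1, -0.2], [0.0, 0.3], [-0.4, 0.2]],  # toy 2-dimensional gradients
}

# Layout of Table 11: additionally carries the owning process numbers (id % N).
notify_msg_v2 = {
    "ids":   [10, 14, 19],
    "procs": [1, 2, 1],  # assuming N = 3
    "grads": [[0.1, -0.2], [0.0, 0.3], [-0.4, 0.2]],
}
```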
• Step 1102: The first processor receives a second notification message from the third processor; the second notification message includes second data and a second gradient and is used to propagate the second gradient to a second target processor. The second gradient is the gradient corresponding to the embedding parameters of the second data, and the second data and the second gradient are mapped one-to-one. The third processor is the previous-hop processor of the first processor in the ring communication architecture.
• The above second target processor includes one or more of the above N processors; which processors it includes is determined by the second data in the second notification message. Exemplarily, assuming the second data includes part or all of the data in the embedding table in processor i, the second target processor includes processor i.
  • the above-mentioned second data may include one or more data, and the second data is generally different from the above-mentioned first data.
  • part of the data in the second data may be the same as part of the data in the first data.
  • the format of the second notification message is similar to the format of the first notification message. For the content format included in the second notification message, reference may be made to the description corresponding to Table 10 or Table 11, which will not be repeated here.
• In this embodiment of the application, after the first processor sends the above first notification message to its next-hop processor (the second processor) and receives the above second notification message from its previous-hop processor (the third processor), the first processor performs the gradient acquisition operation in response to the second notification message, which is described in the following two cases:
• Case 1: When a first target gradient exists in the second gradient, the first processor acquires the first target gradient from the second notification message and sends the second notification message to the second processor, so as to continue to notify the other processors in the second target processor to obtain the gradients they require. The first target gradient is the gradient of embedding parameters in the first embedding table maintained by the first processor.
• Specifically, after receiving the second notification message, the first processor parses the message to obtain its content, and compares the second data with the data in the embedding table it maintains. If part or all of the second data exists in the embedding table, the first processor extracts the gradients corresponding to that part or all of the data from the value fields of the parsed second notification message, so as to optimize the embedding parameters of that data in the first embedding table it maintains. After extracting the gradients, the first processor repackages the second notification message and sends it to the next-hop processor, the second processor. A sketch of this extraction and update is given below.
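• A minimal sketch of this extraction under the modulo segmentation, applying a plain SGD step as the optimization algorithm; SGD and the learning rate are assumptions, since the application does not prescribe a particular optimizer.

```python
LEARNING_RATE = 0.01  # assumed optimizer setting

def handle_notify_message(my_rank, N, msg, my_table):
    # my_table is assumed to be a plain dict mapping id -> embedding
    # parameter (list of floats). Take the gradient of every id owned by
    # this processor out of the message and apply one SGD step to the
    # embedding parameter in the local table; the message is then
    # forwarded unchanged to the next hop.
    for data_id, grad in zip(msg["ids"], msg["grads"]):
        if data_id % N == my_rank and data_id in my_table:
            param = my_table[data_id]
            for j in range(len(param)):
                param[j] -= LEARNING_RATE * grad[j]
    return msg
```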
• For example, suppose the remainder of the data in the embedding table of processor i among the above N processors after the modulo-N calculation is i, and the content carried by the second notification message is in the format shown in Table 10 above, that is, the process number is not carried.
• After receiving the second notification message, the first processor parses the message to obtain the second data in it, and performs a modulo-N calculation on each piece of the second data to obtain the remainder of each piece. If one or more of the remainders equal the process number of the training process run by the first processor, the data corresponding to those remainders exists in the embedding table maintained by the first processor. The first processor extracts the gradients corresponding to that data from the value fields of the parsed second notification message, so as to optimize the embedding parameters of that data in the first embedding table it maintains. After extracting the gradients, the first processor repackages the second notification message and sends it to the next-hop processor, the second processor.
• For another example, suppose the remainder of the data in the embedding table of processor i among the above N processors after the modulo-N calculation is i, and the content carried by the second notification message is in the format shown in Table 11 above, that is, the process number is carried. After receiving the second notification message, the first processor parses the message to obtain the second data and the corresponding process numbers. If the process numbers in the second notification message include one or more serial numbers of the training process run by the first processor, the data corresponding to those serial numbers exists in the embedding table maintained by the first processor. The first processor extracts the gradients corresponding to that data from the value fields of the parsed second notification message, so as to optimize the embedding parameters of that data in the first embedding table it maintains. After extracting the gradients, the first processor repackages the second notification message and sends it to the next-hop processor, the second processor.
• Case 2: When no first target gradient exists in the second gradient, the first processor sends the second notification message to the second processor. Specifically, if the first processor determines that no data in the second data included in the second notification message belongs to the data in the embedding table it maintains, the first processor does not need to extract any gradient from the second notification message and sends the second notification message to the next-hop processor, that is, to the above second processor.
• In this embodiment of the application, after the first processor receives the second notification message and completes the response operation to it, the first processor further receives a third notification message from the third processor. The third notification message includes third data and a third gradient and is used to propagate the third gradient to a third target processor; the third gradient is the gradient corresponding to the embedding parameters of the third data, and the third data and the third gradient are mapped one-to-one.
• The above third target processor includes one or more of the above N processors; which processors it includes is determined by the third data in the third notification message. Exemplarily, assuming the third data includes part or all of the data in the embedding table in processor i, the third target processor includes processor i.
  • the above-mentioned third data may include one or more data, and the third data is generally different from the above-mentioned first data and second data.
  • the partial data in the third data may be the same as the partial data of the first data or the same as the partial data of the second data.
  • the format of the foregoing third notification message is similar to that of the foregoing first notification message.
  • For the content format included in the third notification message reference may be made to the description corresponding to the foregoing Table 10 or Table 11, which will not be repeated here.
  • If the third notification message includes the second target gradient, the first processor acquires the second target gradient from the third notification message and sends the third notification message to the second processor, so as to continue notifying the other processors in the third target processor to obtain the gradients they need; the second target gradient is the gradient of an embedding parameter in the first embedding table maintained by the first processor.
  • Otherwise, if the third notification message does not include the second target gradient, the first processor simply sends the third notification message to the second processor, so that the other processors in the third target processor can obtain the gradients they need.
  • The N processors communicate messages through the ring communication architecture, repeating the message-sending and gradient-acquisition operations above. After at least N-1 cycles, each of the N processors has obtained the gradients of the embedding parameters in its own embedding table, so that it can optimize those embedding parameters based on the obtained gradients; a sketch of this exchange follows.
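  • For illustration only, one possible shape of this N-1-cycle exchange under the modulo-N partitioning is sketched below in Python; the helper ring_sendrecv (send to the next hop, receive from the previous hop), the message layout, and the plain SGD update with learning rate lr are assumptions of this sketch, not part of the claimed method:

```python
# Minimal sketch of the N-1-cycle gradient exchange on the ring, assuming
# the modulo-N partitioning (processor `rank` owns ids with id % N == rank).
# `ring_sendrecv` is a hypothetical helper: it sends `msg` to the next-hop
# processor and returns the message received from the previous-hop processor.

def backprop_ring_exchange(rank, N, local_msg, embedding_table, ring_sendrecv, lr=0.01):
    msg = local_msg  # {"ids": [...], "grads": [...]} for the data this rank trained
    for _ in range(N - 1):  # after N-1 cycles every gradient has reached its owner
        msg = ring_sendrecv(msg)
        for data_id, grad in zip(msg["ids"], msg["grads"]):
            if data_id % N == rank:  # this id belongs to the local embedding table
                # optimize the embedding parameter; plain SGD is used here
                embedding_table[data_id] = [
                    w - lr * g for w, g in zip(embedding_table[data_id], grad)
                ]
    return embedding_table
```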
  • In FIGS. 12A to 12D, it is assumed that the above N processors are 4 processors: processor 0, processor 1, processor 2, and processor 3.
  • The four processors implement message communication through the ring communication architecture.
  • This example assumes that the remainder of the data in the embedding table of processor i among the N processors after modulo-N calculation is i.
  • As shown in FIG. 12A, in the backpropagation of the embedding layer, each processor first needs to obtain the gradients of the embedding parameters of the data it trained, so as to optimize the embedding parameters in its embedding table according to those gradients.
  • In FIG. 12A, for the introduction of the first, second, third, and fourth batches of data, reference may be made to the description of FIG. 10A above, which is not repeated here.
  • In FIG. 12A, each processor first generates a message that includes data, the corresponding process numbers, and the gradients corresponding to the data.
  • After each processor generates its own message, following the communication mode of the ring communication architecture, each sends its generated message to its next-hop processor and receives the message sent by its previous-hop processor; see FIG. 12B for details. After receiving the message, a processor may perform the corresponding gradient acquisition operation; for the specific operation, refer to the description in step 1102 above, which is not repeated here.
  • Specifically, processor 0 sends the message including the first batch of data to processor 1, receives the message including the fourth batch of data from processor 3, and then obtains gradient 16 corresponding to data 4 in the received message.
  • Processor 1 sends the message including the second batch of data to processor 2, receives the message including the first batch of data from processor 0, and then obtains gradients 1, 2, and 4 corresponding to data 21, 5, and 25 in the received message.
  • Processor 2 sends the message including the third batch of data to processor 3, receives the message including the second batch of data from processor 1, and then obtains gradients 6 and 7 corresponding to data 2 and 10 in the received message.
  • Processor 3 sends the message including the fourth batch of data to processor 0 and receives the message including the third batch of data from processor 2. Since no data in the third batch belongs to the embedding table in processor 3, processor 3 does not acquire any gradient from the received message.
  • After each processor performs the gradient acquisition operation in response to the received message, it sends that message on to its next-hop processor; see FIG. 12C for details.
  • Specifically, processor 0 sends the message including the fourth batch of data to processor 1, receives the message including the third batch of data from processor 3, and then obtains gradients 10 and 11 corresponding to data 8 and data 16 in the received message.
  • Processor 1 sends the message including the first batch of data to processor 2, receives the message including the fourth batch of data from processor 0, and then obtains gradient 14 corresponding to data 33 in the received message.
  • Processor 2 sends the message including the second batch of data to processor 3, receives the message including the first batch of data from processor 1, and then obtains gradient 3 corresponding to data 14 in the received message.
  • Processor 3 sends the message including the third batch of data to processor 0, receives the message including the second batch of data from processor 2, and then obtains gradient 5 corresponding to data 19 in the received message.
  • After each processor has performed the gradient acquisition operation in response to the received message, it sends that message on to its next-hop processor; see FIG. 12D for details.
  • Specifically, processor 0 sends the message including the third batch of data to processor 1, receives the message including the second batch of data from processor 3, and then obtains gradient 8 corresponding to data 32 in the received message.
  • Processor 1 sends the message including the fourth batch of data to processor 2, receives the message including the third batch of data from processor 0, and then obtains gradients 9 and 12 corresponding to data 13 and 29 in the received message.
  • Processor 2 sends the message including the first batch of data to processor 3, receives the message including the fourth batch of data from processor 1, and then obtains gradients 13 and 15 corresponding to data 6 and 18 in the received message.
  • Processor 3 sends the message including the second batch of data to processor 0 and receives the message including the first batch of data from processor 2. Since no data in the first batch belongs to the embedding table in processor 3, processor 3 does not acquire any gradient from the received message.
  • It should be noted that FIG. 12A to FIG. 12D and the related descriptions are only an example and do not limit the present application; modifications based on the above ideas all fall within the protection scope of the present application.
  • After the above three cycles, the four processors have obtained the gradients they each need. Because the communication between the processors is realized through the ring communication architecture, compared with the many-to-many message communication in the existing technical solution, the present application avoids single-point communication bottlenecks, reduces communication delay, and improves communication efficiency, which in turn improves the training performance of the entire data training system.
  • It should be noted that the data processing method shown in FIG. 9 and any of its possible implementations can be used together with the data processing method shown in FIG. 11 and any of its possible implementations: in the forward propagation of the embedding layer during data training, the lookup of embedding parameters is realized based on the ring communication architecture introduced above, and in the backpropagation of the embedding layer, the same ring communication architecture is used to obtain the gradients.
  • FIG. 13 is a schematic comparison of the communication throughput of the prior-art solution shown in FIG. 6 and the solution provided by the present application.
  • Throughput here refers to the amount of data successfully sent per unit time.
  • In FIG. 13, the horizontal axis represents the number of processors used by the data training system, increasing in the direction of the arrow; the vertical axis represents throughput, increasing in the direction of the arrow.
  • The prior-art solution adopts a many-to-many message communication manner; as the number of processors increases, its throughput changes little and may even decrease.
  • The present application adopts the ring communication architecture to communicate messages, so throughput grows with the number of processors, and grows with excellent linearity.
  • This is because the ring communication architecture can make full use of the network bandwidth and is less prone to the blocking and jitter that occur in the many-to-many message communication method.
  • By using the ring communication architecture to communicate messages in the forward propagation and backpropagation of the embedding layer, the present application can reduce the communication delay to 10-30% of the original, which greatly improves communication efficiency and thereby improves the performance of the data training system.
  • It can be understood that, to implement the above functions, each device includes corresponding hardware structures and/or software modules for performing each function.
  • Those skilled in the art should readily appreciate that the present application can be implemented in the form of hardware or a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the specific application and design constraints of the technical solution. Skilled artisans may implement the described functionality using different methods for each particular application, but such implementations should not be considered beyond the scope of this application.
  • The device may be divided into functional modules according to the foregoing method examples.
  • For example, each functional module may correspond to one function, or two or more functions may be integrated into one module.
  • The integrated modules can be implemented in the form of hardware or in the form of software functional modules. It should be noted that the division of modules in the embodiments of the present application is schematic and is only a logical function division; there may be other division manners in actual implementation.
  • FIG. 14 shows a possible schematic diagram of the logical structure of the apparatus. The apparatus may be the first processor in the method described in FIG. 9 and its possible implementations, or may be a chip in the first processor, or a processing system in the first processor, or the like.
  • The apparatus 1400 includes a sending unit 1401 and a receiving unit 1402, where:
  • The sending unit 1401 is configured to send a first lookup message to a second processor; the first lookup message includes first data and is used to look up the embedding parameters of the first data; the second processor is the next-hop processor of the first processor in the ring communication architecture where the first processor is located. The sending unit 1401 can be implemented by a sending interface or a transmitter and can perform the operations described in step 901 shown in FIG. 9.
  • The receiving unit 1402 is configured to receive a second lookup message from a third processor; the second lookup message includes second data and is used to look up the embedding parameters of the second data; the third processor is the previous-hop processor of the first processor in the ring communication architecture. The receiving unit 1402 can be implemented by a receiving interface or a receiver and can perform the operations described in step 902 shown in FIG. 9.
  • The first processor, the second processor, and the third processor are among the N processors included in the data training system, where N is an integer greater than or equal to 3; the N processors communicate through the ring communication architecture, in which each of the N processors only receives messages from its previous-hop processor and only sends messages to its next-hop processor.
  • In a possible implementation, the apparatus further includes an adding unit.
  • The adding unit is configured to, when the embedding parameters of part or all of the second data are found based on the second lookup message, add the embedding parameters of that part or all of the data to the second lookup message to obtain a third lookup message.
  • The sending unit 1401 is further configured to send the third lookup message to the second processor.
  • Alternatively, the sending unit 1401 is further configured to send the second lookup message to the second processor when no embedding parameter of the second data is found based on the second lookup message.
  • In a possible implementation, the apparatus further includes a lookup unit.
  • The lookup unit is configured to look up, in a first embedding table, the embedding parameters mapped by the part or all of the data.
  • The first embedding table is an embedding table maintained by the first processor for storing data and embedding parameters, with a one-to-one mapping between the data and the embedding parameters in the first embedding table.
  • The adding unit is specifically configured to add the embedding parameters mapped by the part or all of the data to the value fields corresponding to that data in the second lookup message, to obtain the third lookup message.
  • The sending unit 1401 is specifically configured to send the third lookup message to the second processor, where the third lookup message is used to look up the embedding parameters of the data in the second data whose embedding parameters have not been found.
  • In a possible implementation, the apparatus further includes a determining unit and a generating unit.
  • The determining unit is configured to determine that the part or all of the data belongs to the first embedding table and that the first embedding table does not yet include the part or all of the data; the first embedding table is an embedding table maintained by the first processor for storing data and embedding parameters, with a one-to-one mapping between data and embedding parameters in the first embedding table.
  • The generating unit is configured to generate the embedding parameters corresponding to the part or all of the data.
  • The adding unit is specifically configured to add the embedding parameters corresponding to the part or all of the data to the value fields corresponding to that data in the second lookup message, to obtain the third lookup message.
  • The sending unit 1401 is specifically configured to send the third lookup message to the second processor, where the third lookup message is used to look up the embedding parameters of the data in the second data whose embedding parameters have not been found.
  • In a possible implementation, the sending unit 1401 is specifically configured to:
  • send the second lookup message to the second processor when none of the second data belongs to the data in the first embedding table; the first embedding table is an embedding table maintained by the first processor for storing data and embedding parameters, with a one-to-one mapping between the data and the embedding parameters in the first embedding table.
  • In a possible implementation, the receiving unit 1402 is further configured to receive a fourth lookup message from the third processor; the fourth lookup message includes third data and the embedding parameters mapped by a first part of the third data, and is used to look up the embedding parameters mapped by the data in the third data other than the first part.
  • The apparatus further includes an adding unit, configured to, when the embedding parameters of a second part of the third data are found based on the fourth lookup message, add the embedding parameters of the second part of the data to the fourth lookup message to obtain a fifth lookup message.
  • The sending unit 1401 is further configured to send the fifth lookup message to the second processor.
  • Alternatively, the sending unit 1401 is further configured to send the fourth lookup message to the second processor when no embedding parameter of the third data is found based on the fourth lookup message.
  • In a possible implementation, the receiving unit 1402 is further configured to: receive a sixth lookup message from the third processor, where the sixth lookup message includes the first data and the embedding parameters of the first data.
  • FIG. 15 shows a schematic diagram of a possible logical structure of the apparatus. The apparatus may be the first processor in the method described in FIG. 11 and its possible implementations, or may be a chip in the first processor, or a processing system in the first processor, or the like.
  • The apparatus 1500 includes a sending unit 1501 and a receiving unit 1502, where:
  • The sending unit 1501 is configured to send a first notification message to the second processor; the first notification message includes first data and a first gradient and is used to propagate the first gradient to a first target processor; the first gradient is the gradient corresponding to the embedding parameters of the first data; the second processor is the next-hop processor of the first processor in the ring communication architecture where the first processor is located. The sending unit 1501 can be implemented by a sending interface or a transmitter and can perform the operations described in step 1101 shown in FIG. 11.
  • The receiving unit 1502 is configured to receive a second notification message from a third processor; the second notification message includes second data and a second gradient and is used to propagate the second gradient to a second target processor; the second gradient is the gradient corresponding to the embedding parameters of the second data; the third processor is the previous-hop processor of the first processor in the ring communication architecture. The receiving unit 1502 can be implemented by a receiving interface or a receiver and can perform the operations described in step 1102 shown in FIG. 11.
  • The first processor, the second processor, and the third processor are among the N processors included in the data training system, where N is an integer greater than or equal to 3; the N processors communicate through the ring communication architecture, in which each of the N processors only receives messages from its previous-hop processor and only sends messages to its next-hop processor.
  • In a possible implementation, the apparatus further includes an acquiring unit.
  • The acquiring unit is configured to acquire a first target gradient from the second notification message when the second notification message includes the first target gradient.
  • The sending unit 1501 is further configured to send the second notification message to the second processor.
  • The first target gradient is the gradient of an embedding parameter in the first embedding table maintained by the first processor, with a one-to-one mapping between data and embedding parameters in the first embedding table.
  • Alternatively, the sending unit 1501 is further configured to send the second notification message to the second processor when the second notification message does not include the first target gradient.
  • In a possible implementation, the acquiring unit is specifically configured to:
  • determine that part or all of the second data is data in the first embedding table, and acquire the first target gradient from the second notification message based on that part or all of the data.
  • In a possible implementation, the receiving unit 1502 is further configured to receive a third notification message from the third processor; the third notification message includes third data and a third gradient and is used to propagate the third gradient to a third target processor; the third gradient is the gradient corresponding to the embedding parameters of the third data.
  • The apparatus further includes an acquiring unit, configured to acquire a second target gradient from the third notification message when the third notification message includes the second target gradient.
  • The sending unit 1501 is further configured to send the third notification message to the second processor.
  • The second target gradient is the gradient of an embedding parameter in the first embedding table maintained by the first processor, and the first embedding table includes the mapping between data and the embedding parameters of the data.
  • Alternatively, the sending unit 1501 is further configured to send the third notification message to the second processor when the third notification message does not include the second target gradient.
  • FIG. 16 is a schematic diagram of a possible hardware structure of the apparatus provided by the present application; the apparatus may be the first processor in the methods described in the foregoing embodiments.
  • The apparatus 1600 includes: a processor 1601, a memory 1602, and a communication interface 1603.
  • The processor 1601, the communication interface 1603, and the memory 1602 may be connected to each other through a bus 1604.
  • The memory 1602 is used to store the computer programs and data of the apparatus 1600, and may include, but is not limited to, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), compact disc read-only memory (CD-ROM), and the like.
  • In one implementation, the software or program code required to perform the functions of all or part of the units in FIG. 14 is stored in the memory 1602.
  • The processor 1601 can not only call the program code in the memory 1602 to realize some functions, but can also cooperate with other components (e.g., the communication interface 1603) to jointly perform the other functions described in the embodiment of FIG. 14 (e.g., the functions of receiving or sending messages).
  • In another implementation, the software or program code required to perform the functions of all or part of the units in FIG. 15 is stored in the memory 1602.
  • The processor 1601 can not only call the program code in the memory 1602 to realize some functions, but can also cooperate with other components (e.g., the communication interface 1603) to jointly perform the other functions described in the embodiment of FIG. 15 (e.g., the functions of receiving or sending messages).
  • The communication interface 1603 includes a sending interface and a receiving interface.
  • There may be multiple communication interfaces 1603, used to support the apparatus 1600 in communicating, for example, receiving or sending data or messages.
  • The processor 1601 may be a central processing unit, a general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field-programmable gate array or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof.
  • The processor may also be a combination that performs computing functions, for example, a combination including one or more microprocessors, or a combination of a digital signal processor and a microprocessor.
  • The processor 1601 can be used to read the program stored in the memory 1602 and execute any one of the data processing methods described in FIG. 9 and its possible embodiments; or to execute any one of the data processing methods described in FIG. 11 and its possible embodiments; or to execute any of the data processing methods described in FIG. 9 and its possible embodiments and/or any of the data processing methods described in FIG. 11 and its possible embodiments.
  • In one implementation, the processor 1601 may be configured to read the program stored in the memory 1602 and perform the following operations:
  • send a first lookup message to the second processor through the sending interface; the first lookup message includes first data and is used to look up the embedding parameters of the first data; the second processor is the next-hop processor of the first processor in the ring communication architecture where the first processor is located;
  • receive a second lookup message from the third processor through the receiving interface; the second lookup message includes second data and is used to look up the embedding parameters of the second data; the third processor is the previous-hop processor of the first processor in the ring communication architecture;
  • the first processor, the second processor, and the third processor are among the N processors included in the data training system, where N is an integer greater than or equal to 3;
  • the N processors communicate through the ring communication architecture, in which each of the N processors only receives messages from its previous-hop processor and only sends messages to its next-hop processor.
  • In another implementation, the processor 1601 may be configured to read the program stored in the memory 1602 and perform the following operations:
  • send a first notification message to the second processor through the sending interface; the first notification message includes first data and a first gradient and is used to propagate the first gradient to the first target processor; the first gradient is the gradient corresponding to the embedding parameters of the first data; the second processor is the next-hop processor of the first processor in the ring communication architecture where the first processor is located;
  • receive a second notification message from the third processor through the receiving interface; the second notification message includes second data and a second gradient and is used to propagate the second gradient to the second target processor; the second gradient is the gradient corresponding to the embedding parameters of the second data; the third processor is the previous-hop processor of the first processor in the ring communication architecture;
  • the first processor, the second processor, and the third processor are among the N processors included in the data training system, where N is an integer greater than or equal to 3;
  • the N processors communicate through the ring communication architecture, in which each of the N processors only receives messages from its previous-hop processor and only sends messages to its next-hop processor.
  • Embodiments of the present application further provide a computer-readable storage medium storing a computer program; the computer program is executed by a processor to implement the method of any embodiment of FIG. 9 and/or FIG. 11 and their possible method embodiments.
  • Embodiments of the present application further provide a computer program product; when the computer program product is read and executed by a computer, the method described in any of FIG. 9 and/or FIG. 11 and their possible method embodiments will be executed.
  • In summary, the message communication of the N processors in the forward propagation and backpropagation of the embedding layer can be realized through the ring communication architecture. Compared with the existing technical solutions, using ring communication to exchange messages allows the present application to make full use of the bandwidth resources between processors, avoid single-point communication bottlenecks, reduce communication delay, and improve communication efficiency, thereby improving the training efficiency and performance of the entire data training system.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Devices For Executing Special Programs (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The present application provides a data processing method in the field of artificial intelligence. The method includes: a first processor sends a first lookup message to a second processor and receives a second lookup message from a third processor; the first lookup message includes first data and is used to look up the embedding parameters of the first data; the second lookup message includes second data and is used to look up the embedding parameters of the second data; the second processor and the third processor are, respectively, the next-hop processor and the previous-hop processor of the first processor in the ring communication architecture where the first processor is located; the three processors are among the N processors included in a data training system, where N is an integer greater than or equal to 3; the N processors communicate through the ring communication architecture, in which each processor only receives messages from its previous-hop processor and only sends messages to its next-hop processor. The present application can improve the performance of a data training system.

Description

Data processing method, apparatus and system
This application claims priority to Chinese Patent Application No. 202110477608.3, filed with the China National Intellectual Property Administration on April 29, 2021 and entitled "Data processing method, apparatus and system", which is incorporated herein by reference in its entirety.
Technical Field
The present application relates to the technical field of artificial intelligence, and in particular to a large-scale data processing method, apparatus and system.
Background
Artificial intelligence (AI) is a theory, method, technology, and application system that uses digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can respond in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making. Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision-making and reasoning, human-computer interaction, recommendation and search, basic AI theory, and so on.
In the AI field, large-scale data model training is a core technology widely applied in scenarios such as Internet search, advertising, and recommendation services; typical applications include, for example, click-through rate (CTR) prediction models. Specifically, in large-scale data training, sample data is first input, and most of this sample data cannot itself be used in numerical computation; it must be converted into numerical values by an embedding method, so the entry operator of a large-scale data training model is an embedding operator. After the embedding layer, the data passes through several fully connected layers and activation functions to obtain a loss function (loss), and backpropagation is then performed from the loss function, which completes one training step. At present, to increase the training speed of large-scale data, graphics processing units (GPU) or neural-network processing units (NPU) can be used to accelerate model training.
When training with GPUs or NPUs, because the scale of the embedding parameters is enormous (for example, up to 10 TB), a single device cannot train such a model (for example, the memory of a typical GPU is only 16-32 GB). Training with data parallelism plus model parallelism is therefore the mainstream industry solution to this problem. However, model-parallel training requires data communication between the processes corresponding to the parallel model parts so that they can cooperate, and this communication increases training time and reduces training efficiency. In summary, how to further improve training efficiency when training a large-scale data training model by means of model parallelism is a technical problem that those skilled in the art need to solve.
Summary
The present application discloses a data processing method, apparatus, and system, which can improve the training efficiency and performance of a data training model.
In a first aspect, the present application provides a data processing method, including:
a first processor sends a first lookup message to a second processor; the first lookup message includes first data and is used to look up the embedding parameters of the first data; the second processor is the next-hop processor of the first processor in the ring communication architecture where the first processor is located;
the first processor receives a second lookup message from a third processor; the second lookup message includes second data and is used to look up the embedding parameters of the second data; the third processor is the previous-hop processor of the first processor in the ring communication architecture;
the first processor, the second processor, and the third processor are among the N processors included in a data training system, where N is an integer greater than or equal to 3; the N processors communicate through the ring communication architecture, in which each of the N processors only receives messages from its previous-hop processor and only sends messages to its next-hop processor.
In the embodiments of the present application, the data training system includes N processors. To train large-scale sample data, the data training system composed of the N processors trains the data in a data-parallel plus model-parallel manner. Under this training mode, each of the N processors is randomly assigned a portion of the sample data to train. After the data to be trained is input into the training model, it must first pass through an embedding layer that maps the data to dense vectors (also called embedding parameters) before it can be used in subsequent computation. However, because the data trained on a processor is obtained randomly, the embedding parameters of this data do not necessarily reside on that processor, and the corresponding embedding parameters need to be obtained from the other processors among the N processors, which requires message communication with the other processors. In the embodiments of the present application, during the message communication at the embedding layer to look up the embedding parameters of the data, the N processors implement ring message communication through the ring communication architecture. Compared with the many-to-many message communication in the existing technical solutions, the present application can make full use of the bandwidth resources between processors, avoid single-point communication bottlenecks, reduce communication delay, and improve communication efficiency, thereby improving the training efficiency and performance of the entire data training system.
In a possible implementation, the method further includes: when the embedding parameters of part or all of the second data are found based on the second lookup message, the first processor adds the embedding parameters of that part or all of the data to the second lookup message to obtain a third lookup message, and sends the third lookup message to the second processor; or, when no embedding parameter of the second data is found based on the second lookup message, the first processor sends the second lookup message to the second processor.
In the embodiments of the present application, after receiving a lookup message for the embedding parameters of data, a processor continues to forward the lookup message to the next-hop processor based on the ring communication architecture, regardless of whether the embedding parameters of the corresponding data are found locally. Through cyclic forwarding and lookup, the embedding parameters of all the data a processor needs can eventually be found.
In a possible implementation, when the embedding parameters of part or all of the second data are found based on the second lookup message, the first processor adding the embedding parameters of that part or all of the data to the second lookup message to obtain the third lookup message and sending the third lookup message to the second processor includes:
the first processor looks up, in a first embedding table, the embedding parameters mapped by the part or all of the data; the first embedding table is an embedding table maintained by the first processor for storing data and embedding parameters, with a one-to-one mapping between the data and the embedding parameters in the first embedding table;
the first processor adds the embedding parameters mapped by the part or all of the data to the value fields corresponding to that data in the second lookup message, to obtain the third lookup message;
the first processor sends the third lookup message to the second processor, where the third lookup message is used to look up the embedding parameters of the data in the second data whose embedding parameters have not been found.
In the embodiments of the present application, each of the N processors maintains an embedding table that stores data and the corresponding embedding parameters. Therefore, after receiving a lookup message for embedding parameters, a processor can use the data in the lookup message as an index to search its embedding table. If the data in the lookup message exists in the embedding table, the corresponding embedding parameters can be found.
In a possible implementation, when the embedding parameters of part or all of the second data are found based on the second lookup message, the first processor adding the embedding parameters of that part or all of the data to the second lookup message to obtain the third lookup message and sending the third lookup message to the second processor includes:
the first processor determines that the part or all of the data belongs to the first embedding table and that the first embedding table does not yet include the part or all of the data; the first embedding table is an embedding table maintained by the first processor for storing data and embedding parameters, with a one-to-one mapping between data and embedding parameters in the first embedding table;
the first processor generates the embedding parameters corresponding to the part or all of the data;
the first processor adds the embedding parameters corresponding to the part or all of the data to the value fields corresponding to that data in the second lookup message, to obtain the third lookup message;
the first processor sends the third lookup message to the second processor, where the third lookup message is used to look up the embedding parameters of the data in the second data whose embedding parameters have not been found.
In the embodiments of the present application, each of the N processors maintains an embedding table for storing data and the corresponding embedding parameters. Therefore, after receiving a lookup message for embedding parameters, if a processor determines that some data in the message belongs to its embedding table but is not yet in that table, the processor can randomly generate the corresponding embedding parameters for that data. Optionally, the remainder of that data after modulo-N calculation is the same as the sequence number of the training process run by the processor.
In a possible implementation, when no embedding parameter of the second data is found based on the second lookup message, the first processor sending the second lookup message to the second processor includes:
when none of the second data belongs to the data in the first embedding table, the first processor sends the second lookup message to the second processor; the first embedding table is an embedding table maintained by the first processor for storing data and embedding parameters, with a one-to-one mapping between data and embedding parameters in the first embedding table.
In the embodiments of the present application, if the lookup message for embedding parameters received by a processor contains no data belonging to that processor's embedding table, the processor forwards the received lookup message directly to the next-hop processor based on the ring communication architecture.
In a possible implementation, the method further includes:
the first processor receives a fourth lookup message from the third processor; the fourth lookup message includes third data and the embedding parameters mapped by a first part of the third data, and is used to look up the embedding parameters mapped by the data in the third data other than the first part;
when the embedding parameters of a second part of the third data are found based on the fourth lookup message, the first processor adds the embedding parameters of the second part of the data to the fourth lookup message to obtain a fifth lookup message, and sends the fifth lookup message to the second processor;
or, when no embedding parameter of the third data is found based on the fourth lookup message, the first processor sends the fourth lookup message to the second processor.
In the embodiments of the present application, the ring communication architecture is used to look up the embedding parameters needed by each of the N processors, and the ring communication of lookup messages can be performed multiple times based on this architecture. For example, message communication and embedding-parameter lookup can be cycled among the N processors at least N times, ensuring that every processor can obtain the embedding parameters of all the data it needs.
In a possible implementation, the method further includes: the first processor receives a sixth lookup message from the third processor, where the sixth lookup message includes the first data and the embedding parameters of the first data.
In the embodiments of the present application, based on the ring communication architecture for message communication among the N processors to look up the embedding parameters each processor needs, after multiple cycles a processor can receive from its previous-hop processor a message that includes all the embedding parameters it needs.
In a second aspect, the present application provides a data processing method, including:
a first processor sends a first notification message to a second processor; the first notification message includes first data and a first gradient and is used to propagate the first gradient to a first target processor; the first gradient is the gradient corresponding to the embedding parameters of the first data; the second processor is the next-hop processor of the first processor in the ring communication architecture where the first processor is located;
the first processor receives a second notification message from a third processor; the second notification message includes second data and a second gradient and is used to propagate the second gradient to a second target processor; the second gradient is the gradient corresponding to the embedding parameters of the second data; the third processor is the previous-hop processor of the first processor in the ring communication architecture;
the first processor, the second processor, and the third processor are among the N processors included in a data training system, where N is an integer greater than or equal to 3; the N processors communicate through the ring communication architecture, in which each of the N processors only receives messages from its previous-hop processor and only sends messages to its next-hop processor.
In the embodiments of the present application, during the forward propagation of data training on a processor, the embedding parameters of some data are obtained from other processors, that is, those embedding parameters reside on other processors. During backpropagation, the embedding parameters of that data need to be optimized based on the computed gradients; the processor therefore needs to send the computed gradients corresponding to the embedding parameters of that data to the corresponding processors, so that those processors can optimize the embedding parameters of the data. In the embodiments of the present application, during the message communication in the backpropagation of the embedding layer to obtain the gradients of the embedding parameters each processor needs, the N processors implement ring message communication through the ring communication architecture. Compared with the many-to-many message communication in the existing technical solutions, the present application can make full use of the bandwidth resources between processors, avoid single-point communication bottlenecks, reduce communication delay, and improve communication efficiency, thereby improving the training efficiency and performance of the entire data training system.
In a possible implementation, the method further includes: when the second notification message includes a first target gradient, the first processor acquires the first target gradient from the second notification message and sends the second notification message to the second processor; the first target gradient is the gradient of an embedding parameter in the first embedding table maintained by the first processor, with a one-to-one mapping between data and embedding parameters in the first embedding table;
or, when the second notification message does not include the first target gradient, the first processor sends the second notification message to the second processor.
In the embodiments of the present application, after receiving a gradient notification message, a processor continues to forward the notification message to the next-hop processor based on the ring communication architecture, regardless of whether it finds the gradients it needs in the message. Through cyclic forwarding, every processor can eventually obtain the gradients it needs.
In a possible implementation, when the second notification message includes the first target gradient, the first processor acquiring the first target gradient from the second notification message includes:
the first processor determines that part or all of the second data is data in the first embedding table;
the first processor acquires the first target gradient from the second notification message based on the part or all of the data.
In the embodiments of the present application, each of the N processors maintains an embedding table storing data and the corresponding embedding parameters. Therefore, after receiving a gradient notification message, if some data in the message exists in the embedding table, the processor can obtain the corresponding gradients from the message to optimize the embedding parameters of that data.
In a possible implementation, the method further includes:
the first processor receives a third notification message from the third processor; the third notification message includes third data and a third gradient and is used to propagate the third gradient to a third target processor; the third gradient is the gradient corresponding to the embedding parameters of the third data;
when the third notification message includes a second target gradient, the first processor acquires the second target gradient from the third notification message and sends the third notification message to the second processor; the second target gradient is the gradient of an embedding parameter in the first embedding table maintained by the first processor, and the first embedding table includes the mapping between data and the embedding parameters of the data;
or, when the third notification message does not include the second target gradient, the first processor sends the third notification message to the second processor.
In the embodiments of the present application, the ring communication architecture enables each of the N processors to obtain the gradients it needs, and the ring communication of notification messages can be performed multiple times based on this architecture. For example, message communication can be cycled among the N processors at least N-1 times, ensuring that every processor can obtain all the gradients it needs.
It should be noted that any implementation of the first aspect and its possible implementations can be combined with any implementation of the second aspect and its possible implementations: any implementation of the first aspect applies to the forward propagation of the embedding layer in data training, and any implementation of the second aspect applies to the backpropagation of the embedding layer in data training.
In a third aspect, the present application provides a data processing apparatus, including:
a sending unit, configured to send a first lookup message to a second processor; the first lookup message includes first data and is used to look up the embedding parameters of the first data; the second processor is the next-hop processor of the first processor in the ring communication architecture where the first processor is located;
a receiving unit, configured to receive a second lookup message from a third processor; the second lookup message includes second data and is used to look up the embedding parameters of the second data; the third processor is the previous-hop processor of the first processor in the ring communication architecture;
the first processor, the second processor, and the third processor are among the N processors included in a data training system, where N is an integer greater than or equal to 3; the N processors communicate through the ring communication architecture, in which each of the N processors only receives messages from its previous-hop processor and only sends messages to its next-hop processor.
In a possible implementation, the apparatus further includes an adding unit;
the adding unit is configured to, when the embedding parameters of part or all of the second data are found based on the second lookup message, add the embedding parameters of that part or all of the data to the second lookup message to obtain a third lookup message;
the sending unit is further configured to send the third lookup message to the second processor;
or, the sending unit is further configured to send the second lookup message to the second processor when no embedding parameter of the second data is found based on the second lookup message.
In a possible implementation, the apparatus further includes a lookup unit;
the lookup unit is configured to look up, in a first embedding table, the embedding parameters mapped by the part or all of the data; the first embedding table is an embedding table maintained by the first processor for storing data and embedding parameters, with a one-to-one mapping between data and embedding parameters in the first embedding table;
the adding unit is specifically configured to add the embedding parameters mapped by the part or all of the data to the value fields corresponding to that data in the second lookup message, to obtain the third lookup message;
the sending unit is specifically configured to send the third lookup message to the second processor, where the third lookup message is used to look up the embedding parameters of the data in the second data whose embedding parameters have not been found.
In a possible implementation, the apparatus further includes a determining unit and a generating unit;
the determining unit is configured to determine that the part or all of the data belongs to the first embedding table and that the first embedding table does not yet include the part or all of the data; the first embedding table is an embedding table maintained by the first processor for storing data and embedding parameters, with a one-to-one mapping between data and embedding parameters in the first embedding table;
the generating unit is configured to generate the embedding parameters corresponding to the part or all of the data;
the adding unit is specifically configured to add the embedding parameters corresponding to the part or all of the data to the value fields corresponding to that data in the second lookup message, to obtain the third lookup message;
the sending unit is specifically configured to send the third lookup message to the second processor, where the third lookup message is used to look up the embedding parameters of the data in the second data whose embedding parameters have not been found.
In a possible implementation, the sending unit is specifically configured to:
send the second lookup message to the second processor when none of the second data belongs to the data in the first embedding table; the first embedding table is an embedding table maintained by the first processor for storing data and embedding parameters, with a one-to-one mapping between data and embedding parameters in the first embedding table.
In a possible implementation, the receiving unit is further configured to receive a fourth lookup message from the third processor; the fourth lookup message includes third data and the embedding parameters mapped by a first part of the third data, and is used to look up the embedding parameters mapped by the data in the third data other than the first part;
the apparatus further includes an adding unit, configured to, when the embedding parameters of a second part of the third data are found based on the fourth lookup message, add the embedding parameters of the second part of the data to the fourth lookup message to obtain a fifth lookup message;
the sending unit is further configured to send the fifth lookup message to the second processor;
or, the sending unit is further configured to send the fourth lookup message to the second processor when no embedding parameter of the third data is found based on the fourth lookup message.
In a possible implementation, the receiving unit is further configured to: receive a sixth lookup message from the third processor, where the sixth lookup message includes the first data and the embedding parameters of the first data.
In a fourth aspect, the present application provides a data processing apparatus, including:
a sending unit, configured to send a first notification message to a second processor; the first notification message includes first data and a first gradient and is used to propagate the first gradient to a first target processor; the first gradient is the gradient corresponding to the embedding parameters of the first data; the second processor is the next-hop processor of the first processor in the ring communication architecture where the first processor is located;
a receiving unit, configured to receive a second notification message from a third processor; the second notification message includes second data and a second gradient and is used to propagate the second gradient to a second target processor; the second gradient is the gradient corresponding to the embedding parameters of the second data; the third processor is the previous-hop processor of the first processor in the ring communication architecture;
the first processor, the second processor, and the third processor are among the N processors included in a data training system, where N is an integer greater than or equal to 3; the N processors communicate through the ring communication architecture, in which each of the N processors only receives messages from its previous-hop processor and only sends messages to its next-hop processor.
In a possible implementation, the apparatus further includes an acquiring unit;
the acquiring unit is configured to acquire a first target gradient from the second notification message when the second notification message includes the first target gradient;
the sending unit is further configured to send the second notification message to the second processor; the first target gradient is the gradient of an embedding parameter in the first embedding table maintained by the first processor, with a one-to-one mapping between data and embedding parameters in the first embedding table;
or, the sending unit is further configured to send the second notification message to the second processor when the second notification message does not include the first target gradient.
In a possible implementation, the acquiring unit is specifically configured to:
determine that part or all of the second data is data in the first embedding table;
acquire the first target gradient from the second notification message based on the part or all of the data.
In a possible implementation, the receiving unit is further configured to receive a third notification message from the third processor; the third notification message includes third data and a third gradient and is used to propagate the third gradient to a third target processor; the third gradient is the gradient corresponding to the embedding parameters of the third data;
the apparatus further includes an acquiring unit, configured to acquire a second target gradient from the third notification message when the third notification message includes the second target gradient;
the sending unit is further configured to send the third notification message to the second processor; the second target gradient is the gradient of an embedding parameter in the first embedding table maintained by the first processor, and the first embedding table includes the mapping between data and the embedding parameters of the data;
or, the sending unit is further configured to send the third notification message to the second processor when the third notification message does not include the second target gradient.
In a fifth aspect, the present application provides an apparatus, which may include a processor and a memory, for implementing the data processing method described in the first aspect. The memory is coupled to the processor, and when the processor executes the computer program stored in the memory, the method described in the first aspect or any possible implementation of the first aspect can be realized. The apparatus may further include a communication interface, which is used for the apparatus to communicate with other apparatuses; for example, the communication interface may be a transceiver, a circuit, a bus, a module, or another type of communication interface. The communication interface includes a receiving interface for receiving messages and a sending interface for sending messages.
In a possible implementation, the apparatus may include:
a memory, configured to store a computer program;
a processor, configured to send a first lookup message to a second processor through the sending interface, where the first lookup message includes first data and is used to look up the embedding parameters of the first data, and the second processor is the next-hop processor of the first processor in the ring communication architecture where the first processor is located;
and to receive a second lookup message from a third processor through the receiving interface, where the second lookup message includes second data and is used to look up the embedding parameters of the second data, and the third processor is the previous-hop processor of the first processor in the ring communication architecture;
the first processor, the second processor, and the third processor are among the N processors included in a data training system, where N is an integer greater than or equal to 3; the N processors communicate through the ring communication architecture, in which each of the N processors only receives messages from its previous-hop processor and only sends messages to its next-hop processor.
It should be noted that the computer program in the memory in the present application may be stored in advance, or may be downloaded from the Internet and then stored when the apparatus is used; the source of the computer program in the memory is not specifically limited in the present application. The coupling in the embodiments of the present application is an indirect coupling or connection between apparatuses, units, or modules, which may be electrical, mechanical, or in other forms and is used for information exchange between the apparatuses, units, or modules.
In a sixth aspect, the present application provides an apparatus, which may include a processor and a memory, for implementing the data processing method described in the second aspect. The memory is coupled to the processor, and when the processor executes the computer program stored in the memory, the method described in the second aspect or any possible implementation of the second aspect can be realized. The apparatus may further include a communication interface, which is used for the apparatus to communicate with other apparatuses; for example, the communication interface may be a transceiver, a circuit, a bus, a module, or another type of communication interface. The communication interface includes a receiving interface for receiving messages and a sending interface for sending messages.
In a possible implementation, the apparatus may include:
a memory, configured to store a computer program;
a processor, configured to send a first notification message to a second processor through the sending interface, where the first notification message includes first data and a first gradient and is used to propagate the first gradient to a first target processor; the first gradient is the gradient corresponding to the embedding parameters of the first data; the second processor is the next-hop processor of the first processor in the ring communication architecture where the first processor is located;
and to receive a second notification message from a third processor through the receiving interface, where the second notification message includes second data and a second gradient and is used to propagate the second gradient to a second target processor; the second gradient is the gradient corresponding to the embedding parameters of the second data; the third processor is the previous-hop processor of the first processor in the ring communication architecture;
the first processor, the second processor, and the third processor are among the N processors included in a data training system, where N is an integer greater than or equal to 3; the N processors communicate through the ring communication architecture, in which each of the N processors only receives messages from its previous-hop processor and only sends messages to its next-hop processor.
It should be noted that the computer program in the memory in the present application may be stored in advance, or may be downloaded from the Internet and then stored when the apparatus is used; the source of the computer program in the memory is not specifically limited in the present application. The coupling in the embodiments of the present application is an indirect coupling or connection between apparatuses, units, or modules, which may be electrical, mechanical, or in other forms and is used for information exchange between the apparatuses, units, or modules.
In a seventh aspect, the present application provides a data training system. The system includes N processors, where N is an integer greater than or equal to 3; the N processors communicate through a ring communication architecture, in which each of the N processors only receives messages from its previous-hop processor and only sends messages to its next-hop processor. Each of the N processors may be the apparatus described in any one of the third aspect and its possible implementations; or each of the N processors may be the apparatus described in any one of the fourth aspect and its possible implementations; or each of the N processors may be the apparatus described in any one of the fifth aspect and its possible implementations; or each of the N processors may be the apparatus described in any one of the sixth aspect and its possible implementations.
In an eighth aspect, the present application provides a computer-readable storage medium storing a computer program; the computer program is executed by a processor to implement the method described in any one of the first aspect and its possible implementations, or the computer program is executed by a processor to implement the method described in any one of the second aspect and its possible implementations.
In a ninth aspect, the present application provides a computer program product; when the computer program product is executed by a processor, the method described in any one of the first aspect and its possible implementations will be executed, or the method described in any one of the second aspect and its possible implementations will be executed.
The solutions provided in the third to ninth aspects above are used to implement or cooperate in implementing the methods correspondingly provided in the first and second aspects, and can therefore achieve the same or corresponding beneficial effects as the corresponding methods in the first and second aspects, which are not repeated here.
Brief Description of the Drawings
The following introduces the drawings required in the embodiments of the present application.
FIG. 1 is a schematic diagram of an artificial intelligence main framework provided by an embodiment of the present application;
FIG. 2 is a schematic diagram of an application environment provided by an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a neural network processor provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of a data training system provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of a data training model;
FIG. 6 is a schematic diagram of message communication in an existing technical solution;
FIG. 7 is a schematic diagram of the communication part and the computation part of a data training process;
FIG. 8 is a schematic diagram of ring communication in the data training model of the present application;
FIG. 9 is a schematic flowchart of a data processing method provided by the present application;
FIG. 10A to FIG. 10E are schematic flowcharts of ring communication provided by the present application;
FIG. 11 is a schematic flowchart of another data processing method provided by the present application;
FIG. 12A to FIG. 12D are schematic flowcharts of ring communication provided by the present application;
FIG. 13 is a schematic comparison of the communication throughput of the present solution and that of the existing technical solution;
FIG. 14 and FIG. 15 are schematic diagrams of the logical structure of the apparatus provided by the present application;
FIG. 16 is a schematic diagram of the physical structure of the apparatus provided by the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.
FIG. 1 shows a schematic diagram of an artificial intelligence main framework. The main framework describes the overall workflow of an artificial intelligence system and is applicable to general requirements of the artificial intelligence field.
The artificial intelligence main framework is described below from two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis).
The "intelligent information chain" reflects the series of processes from data acquisition to processing. For example, it can be the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output. In this process, the data undergoes a refinement process of "data - information - knowledge - wisdom".
The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of artificial intelligence and information (providing and processing technical implementations) to the industrial ecology of the system.
(1) Infrastructure:
The infrastructure provides computing-power support for the artificial intelligence system, realizes communication with the outside world, and is supported by the base platform. Communication with the outside is carried out through sensors; computing power is provided by intelligent chips (hardware acceleration chips such as CPU, NPU, GPU, ASIC, and FPGA); the base platform includes platform guarantees and support related to distributed computing frameworks, networks, and the like, and may include cloud storage and computing, interconnection networks, and so on. For example, sensors communicate with the outside to obtain data, and the data is provided to the intelligent chips in the distributed computing system provided by the base platform for computation.
(2) Data
The data at the layer above the infrastructure indicates the data sources of the artificial intelligence field. The data involves graphics, images, speech, and text, as well as Internet-of-Things data of traditional devices, including business data of existing systems and sensed data such as force, displacement, liquid level, temperature, and humidity.
(3) Data processing
Data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making, and the like.
Machine learning and deep learning can perform symbolic and formalized intelligent information modeling, extraction, preprocessing, training, and so on of data.
Reasoning refers to the process of simulating human intelligent reasoning in a computer or intelligent system, using formalized information to carry out machine thinking and solve problems according to a reasoning control strategy; typical functions are search and matching.
Decision-making refers to the process of making decisions after intelligent information is reasoned about, and usually provides functions such as classification, sorting, and prediction.
(4) General capabilities
After the data has undergone the data processing mentioned above, some general capabilities can further be formed based on the results of the data processing, for example, algorithms or a general system, such as translation, text analysis, computer-vision processing, speech recognition, image recognition, and so on.
(5) Intelligent products and industry applications
Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields; they are the encapsulation of the overall artificial intelligence solution, productizing intelligent information decision-making and realizing practical applications. The application fields mainly include: intelligent manufacturing, intelligent transportation, smart home, intelligent healthcare, intelligent security, autonomous driving, safe city, intelligent terminals, and so on.
Referring to FIG. 2, an embodiment of the present application provides a system architecture 200. A data collection device 260 is used to collect training sample data and store it in a database 230, and a training device 220 generates a target model/rule 201 based on the sample data maintained in the database 230. The following describes in more detail how the training device 220 obtains the target model/rule 201 based on the sample data; the target model/rule 201 can realize functions such as click-through rate prediction, information recommendation, or search.
The work of each layer in a deep neural network can be described by the mathematical expression $\vec{y} = a(W \cdot \vec{x} + b)$. At the physical level, the work of each layer in a deep neural network can be understood as completing a transformation from the input space to the output space (that is, from the row space to the column space of a matrix) through five operations on the input space (the set of input vectors): 1. dimension raising/reduction; 2. scaling up/down; 3. rotation; 4. translation; 5. "bending". Operations 1, 2, and 3 are completed by $W \cdot \vec{x}$, operation 4 is completed by $+b$, and operation 5 is realized by $a()$. The word "space" is used here because the objects being classified are not single things but a class of things, and the space refers to the set of all individuals of this class of things. W is a weight vector, and each value in the vector represents the weight value of one neuron in this layer of the neural network. The vector W determines the space transformation from the input space to the output space described above, that is, the weight W of each layer controls how the space is transformed. The purpose of training a deep neural network is to finally obtain the weight matrices of all layers of the trained neural network (the weight matrices formed by the vectors W of many layers). Therefore, the training process of a neural network is essentially learning how to control the space transformation, and more specifically, learning the weight matrices.
Because it is desired that the output of the deep neural network be as close as possible to the value one actually wants to predict, the weight vector of each layer of the neural network can be updated by comparing the predicted value of the current network with the actually wanted target value and then adjusting according to the difference between the two (of course, there is usually an initialization process before the first update, that is, parameters are pre-configured for each layer of the deep neural network). For example, if the predicted value of the network is too high, the weight vectors are adjusted to make it predict lower, and the adjustment continues until the neural network can predict the actually wanted target value. Therefore, it is necessary to define in advance "how to compare the difference between the predicted value and the target value"; this is the loss function or objective function, an important equation for measuring the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a larger difference, so training the deep neural network becomes the process of reducing this loss as much as possible.
The target model/rule obtained by the training device 220 can be applied in different systems or devices. In FIG. 2, an execution device 210 is configured with an I/O interface 212 for data interaction with external devices, and a "user" can input data to the I/O interface 212 through a client device 240.
The execution device 210 can call data, code, and the like in a data storage system 250, and can also store data, instructions, and the like into the data storage system 250.
A computation module 211 processes the input data using the target model/rule 201. For example, in a click-through rate prediction scenario, the computation module 211 uses the target model/rule 201 to predict information that the user may click, and so on.
Finally, the I/O interface 212 returns the processing result to the client device 240 and provides it to the user.
Further, the training device 220 can generate corresponding target models/rules 201 for different targets based on different data, so as to provide better results to the user.
In the case shown in FIG. 2, the user can manually specify the data to be input into the execution device 210, for example, by operating in the interface provided by the I/O interface 212. In another case, the client device 240 can automatically input data to the I/O interface 212 and obtain the result; if the client device 240 needs the user's authorization to input data automatically, the user can set corresponding permissions in the client device 240. The user can view the result output by the execution device 210 on the client device 240, and the specific presentation form can be display, sound, action, or other specific manners. The client device 240 can also serve as a data collection end to store the collected sample data into the database 230.
It is worth noting that FIG. 2 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationships among the devices, components, modules, and the like shown in the figure do not constitute any limitation. For example, in FIG. 2 the data storage system 250 is external memory relative to the execution device 210; in other cases, the data storage system 250 can also be placed in the execution device 210.
FIG. 3 is a diagram of a chip hardware structure provided by an embodiment of the present application.
A neural network processor NPU 30 is mounted on a host CPU as a coprocessor, and the host CPU assigns tasks. The core part of the NPU is an arithmetic circuit 305; a controller 304 controls the arithmetic circuit 305 to extract matrix data from memory and perform multiplication.
In some implementations, the arithmetic circuit 305 internally includes multiple processing engines (PE). In some implementations, the arithmetic circuit 305 is a two-dimensional systolic array. The arithmetic circuit 305 can also be a one-dimensional systolic array or other electronic circuits capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 305 is a general-purpose matrix processor.
For example, suppose there are an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the data corresponding to matrix B from a weight memory 302 and caches it on each PE in the arithmetic circuit. The arithmetic circuit fetches the matrix A data from an input memory 301 and performs matrix operations with matrix B; the partial or final results of the obtained matrix are stored in an accumulator 308.
A unified memory 306 is used to store input data and output data. Weight data is moved into the weight memory 302 directly through a direct memory access controller (DMAC) 305. Input data is also moved into the unified memory 306 through the DMAC.
The BIU is the bus interface unit, that is, the bus interface unit 310, used for the interaction between the AXI bus and the DMAC and the instruction fetch buffer 309.
The bus interface unit (BIU) 310 is used for the instruction fetch buffer 309 to obtain instructions from external memory, and is also used for the memory access controller 305 to obtain the original data of the input matrix A or the weight matrix B from external memory.
The DMAC is mainly used to move the input data in the external memory DDR to the unified memory 306, or to move the weight data into the weight memory 302, or to move the input data into the input memory 301.
A vector computation unit 307 includes multiple arithmetic processing units, and, where needed, further processes the output of the arithmetic circuit, such as vector multiplication, vector addition, exponential operations, logarithmic operations, and magnitude comparison. It is mainly used for non-convolutional-layer network computation in neural networks, such as pooling, batch normalization, and local response normalization.
In some implementations, the vector computation unit 307 stores the processed output vectors in the unified memory 306. For example, the vector computation unit 307 can apply a nonlinear function to the output of the arithmetic circuit 305, for example a vector of accumulated values, to generate activation values. In some implementations, the vector computation unit 307 generates normalized values, merged values, or both. In some implementations, the processed output vectors can be used as activation inputs to the arithmetic circuit 305, for example for use in subsequent layers of the neural network.
The instruction fetch buffer 309 connected to the controller 304 is used to store instructions used by the controller 304.
The unified memory 306, the input memory 301, the weight memory 302, and the instruction fetch buffer 309 are all on-chip memories. The external memory is private to this NPU hardware architecture.
Referring to FIG. 4, FIG. 4 is a schematic diagram of the data training system provided by the present application. The system includes N processors, where N is an integer greater than 1. For example, the N processors may be the training device 220 in FIG. 2 above.
During data training, the N processors can implement message communication using a ring communication architecture. The ring communication architecture is a logical architecture for realizing ring communication among the N processors. In the ring communication architecture, each of the N processors only receives messages from its previous-hop processor and only sends messages to its next-hop processor. To facilitate understanding of the communication mode in the ring communication architecture, suppose N is 4; then processor 0 only sends messages to processor 1 and only receives messages from processor 3; processor 1 only sends messages to processor 2 and only receives messages from processor 0; processor 2 only sends messages to processor 3 and only receives messages from processor 1; and processor 3 only sends messages to processor 0 and only receives messages from processor 2.
In the present application, when the N processors communicate messages using the ring communication architecture, processor i among the N processors (except i = N-1) is the previous-hop processor of processor i+1, and processor N-1 is the previous-hop processor of processor 0. Processor i among the N processors (except i = 0) is the next-hop processor of processor i-1, and processor 0 is the next-hop processor of processor N-1, where i is an integer from 0 to N-1.
For example, the ring communication mode based on the ring communication architecture above can be implemented using the ring communication mode in the message passing interface (MPI).
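For illustration only, the ring step just described might be sketched as follows in Python with mpi4py (one common Python binding for MPI; the helper name ring_sendrecv and the use of MPI itself are assumptions of this sketch, not requirements of the present application):

```python
# Minimal sketch of one ring communication step with mpi4py.
# Each of the N processes only sends to (rank + 1) % N and only
# receives from (rank - 1 + N) % N, matching the ring architecture.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
N = comm.Get_size()

next_hop = (rank + 1) % N      # the only destination this process sends to
prev_hop = (rank - 1 + N) % N  # the only source this process receives from

def ring_sendrecv(msg):
    # A combined send/receive avoids deadlock on the ring.
    return comm.sendrecv(msg, dest=next_hop, source=prev_hop)
```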
It should be noted that, during data training by the N processors, the message communication of the entire process may be implemented through the ring communication architecture; alternatively, only part of the message communication may be implemented through the ring communication architecture, that part of the messages including, for example, the messages used to look up the embedding parameters (embedding variables) of data in the forward propagation of the embedding layer, and/or the messages used to obtain the gradients for optimizing the embedding parameters in the backpropagation of the embedding layer; the remaining message communication may adopt other communication modes, which is not limited in the present application.
In a possible implementation, the embedding parameters are in the form of vectors, and an embedding parameter may be called an embedding vector. The gradients may also be in the form of vectors, and a gradient may also be called a gradient vector.
For example, the N processors may all be graphics processing units (GPU); or the N processors may all be neural-network processing units (NPU); or some of the N processors may be GPUs and some may be NPUs. The NPU may be the neural network processor described in FIG. 3 above. It should be noted that the N processors are not limited to GPUs or NPUs and may also be other high-speed processors.
In the present application, the data training system can be applied to scenarios where the number of embedding parameters reaches the tens of billions or even hundreds of billions. For example, the data training system can be applied in practical application scenarios such as information search, information recommendation, and advertising, for example click-through rate (CTR) prediction. It should be noted that the data trained by the data training system may be sparse data or dense data, which is not limited in the present application.
For example, the data to be trained may be identity document (id) data, and the id data may be numbers, character strings, or the like. For example, in a product recommendation application scenario, the id data may be the identification code of a product, the address of a merchant's store, and so on. The following mainly introduces the data processing method provided by the present application by taking id data as the data to be trained as an example, but the data processing method provided by the present application can also process other types of data and is not limited to id data.
Because the amount of data to be trained during training is enormous, training is carried out in a data-parallel plus model-parallel manner.
To facilitate understanding of the data-parallel plus model-parallel training mode, see FIG. 5 for an example.
As can be seen in FIG. 5, because the amount of data to be trained is enormous, the data is split into N shares, and each of the N processors trains one share. Each of the N shares of data may be called a batch; each batch contains q sample data, and each sample data contains multiple data items. Here q is an integer greater than 0 and is called the batch size; the number of data items contained in each sample data may differ.
During training, each processor runs one training process to train the corresponding data, and each training process has its own sequence number so that the processors can distinguish different processes. The message communication between processors described later can also be said to be message communication between training processes.
The data is trained with a deep learning neural network. Therefore, the model with which each processor trains data includes, but is not limited to, sub-models such as an input layer, an embedding layer, hidden layers, a loss function operator, a gradient computation operator, and a parameter update operator; the training model shown in FIG. 5 only illustrates some of the sub-models.
The whole training process includes a forward-propagation (FP) process and a back-propagation (BP) process. It should be noted that the embedding layer in the forward propagation and the embedding layer in the backpropagation shown in FIG. 5 are the same embedding layer; likewise, the hidden layer in the forward propagation and the hidden layer in the backpropagation are the same hidden layer. They are drawn separately to better distinguish and show the forward propagation and backpropagation processes.
The forward propagation process includes: the data is input into the embedding layer, so that it is mapped to dense embedding parameters for computation; in the computation of the embedding layer, the N processors need to communicate messages to look up the embedding parameters of the data each of them trains (why and how this communication is performed will be described in detail later and is not repeated here); the output of the embedding layer is the embedding parameters of the data, and these embedding parameters are input into the hidden layers for computation, outputting a predicted value. A loss function (loss) can be established between the output predicted value and the label, and gradients are computed by automatic differentiation.
The backpropagation process includes: based on the loss function and the gradients, the processor derives the gradients of all training parameters of the hidden layers and the embedding layer through the reverse chain differentiation process, and then optimizes the parameters through an optimization algorithm. Specifically, when these gradients are backpropagated to the embedding layer, the processor computes, based on these gradients, the gradients corresponding to the embedding parameters of each data item; then, the N processors obtain through message communication the gradients corresponding to the embedding parameters each processor needs, and each processor optimizes the corresponding embedding parameters based on the obtained gradients (why and how this communication is performed will be described in detail later and is not repeated here).
The following explains why, in the computation of the embedding layer, the N processors need to communicate messages to look up the embedding parameters of the data each of them trains, and why the N processors obtain gradients through message communication.
In specific embodiments, the role of the embedding layer is mainly to map data to dense vectors, and these dense vectors are the embedding parameters mentioned above. Because the amount of data to be trained is enormous and model-parallel training is adopted, to facilitate computation and save preprocessing computing resources, the data to be trained can be randomly distributed to the N processors (N training processes) for training. Each of the N processors (or, one may say, the training process of each processor) maintains an embedding table, which is used to store data and embedding parameters, with a one-to-one mapping between the data and the embedding parameters in the embedding table. The embedding parameters of the data randomly distributed to a processor are not necessarily in that processor's embedding table; the embedding parameters of the corresponding data therefore need to be obtained from the embedding tables of other processors, and the processors thus need to query each other's embedding parameters through message communication.
In a possible implementation, the present application can partition the different embedding tables by modulo (mod) computation; specifically, the remainders of the data in the same embedding table after modulo-N computation are all the same. Optionally, the remainder of the data in the embedding table of processor i among the N processors after modulo-N computation is i, and the sequence number of the process training data in processor i is i.
In another possible implementation, the present application can partition the different embedding tables by "division" computation; specifically, the results of dividing the data in the same embedding table by N and rounding down are all the same. For example, suppose N is 3 and that 4 and 5 belong to the data in the same embedding table; then 4 divided by 3 rounded down equals 1, and 5 divided by 3 rounded down also equals 1.
In another possible implementation, the present application can partition the different embedding tables by random distribution; specifically, the data in the embedding table of each of the N processors is random, and during lookup the data itself can be used directly as the index to look up the corresponding embedding parameter in the embedding table.
It should be noted that the embodiments of the present application do not limit how the embedding tables are partitioned. The specific implementation processes below are mainly introduced by taking embedding tables partitioned by modulo computation as an example, but this does not constitute a limitation on the embodiments of the present application.
In addition, in the specific implementation process, the data and embedding parameters in the embedding table of each of the N processors can be initialized by loading the data in the embedding table of an already trained model as the initialization data of the embedding table in the current training. Alternatively, when the embedding table is looked up, for data queried for the first time, the embedding parameter of the data can be directly initialized with random numbers, and the data and the randomly generated embedding parameter can be inserted into the embedding table to complete initialization. The present application does not limit the way the data of the embedding table is initialized.
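For illustration only, the modulo partitioning and the lazy random initialization just described might be sketched as follows (the names owner_rank and lookup_or_init, the dimension dim, and the initialization range are assumptions of this sketch):

```python
import random

def owner_rank(data_id, N):
    # Modulo partitioning: ids in processor i's embedding table satisfy id % N == i.
    # A "division" partitioning would instead use data_id // N (e.g., with N = 3,
    # both 4 // 3 and 5 // 3 equal 1, so ids 4 and 5 share one table).
    return data_id % N

def lookup_or_init(embedding_table, data_id, dim=8):
    # First lookup of an id: initialize its embedding parameter with random
    # numbers and insert the (id, parameter) pair into the embedding table.
    if data_id not in embedding_table:
        embedding_table[data_id] = [random.uniform(-0.1, 0.1) for _ in range(dim)]
    return embedding_table[data_id]
```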
To facilitate understanding of the embedding table, see Table 1.

Table 1

  id       | embedding parameter
  data 1   | parameter 1
  data 2   | parameter 2
  ......   | ......
  data m   | parameter m

Table 1 exemplarily shows the content and structure of an embedding table. The id in Table 1 is the data; in the embedding table, each data item maps to one embedding parameter. The m in Table 1 can be any integer greater than 1.
Similarly, in the backpropagation process, after a processor computes the gradients of the embedding parameters corresponding to the data it trains, it needs to distribute these gradients to the processors corresponding to the embedding tables where the data resides, so that those processors can optimize the embedding parameters in their own embedding tables. For ease of understanding, see the example in Table 2.

Table 2

  Processor (training process) number | Data trained       | Remainder of data mod 3
  Processor 0 (training process 0)    | 10, 21, 14 and 19  | 1, 0, 2 and 1
  Processor 1 (training process 1)    | 31, 23, 3 and 8    | 1, 2, 0 and 2
  Processor 2 (training process 2)    | 12, 5, 19 and 33   | 0, 2, 1 and 0

Table 2 assumes that N above is 3, that is, three processors train the data. Table 2 exemplarily gives the data randomly assigned to each processor for training, together with the remainders of this data after modulo-3 computation. Taking processor 0 as an example, the data randomly obtained by processor 0 for training is 10, 21, 14, and 19, and the corresponding remainders modulo 3 are 1, 0, 2, and 1.
Suppose the remainder of the data in the embedding table of processor i among the N processors after modulo-N computation is i; that is, the remainder of the data in the embedding table of processor 0 after modulo-3 computation is 0, the remainder for processor 1 is 1, and the remainder for processor 2 is 2. Then, taking processor 0 as an example, the embedding table in processor 0 only contains the embedding parameters mapped by data whose remainder modulo 3 is 0, and contains no embedding parameters mapped by data whose remainder modulo 3 is 1 or 2. Therefore, in the forward propagation process, processor 0 needs to communicate messages with processor 1 and processor 2 to obtain the embedding parameters of data 10, 14, and 19.
Similarly, in the backpropagation process, processor 0 computes the gradients corresponding to data 10, 21, 14, and 19; these gradients are used to correct and update the embedding parameters of data 10, 21, 14, and 19 in the embedding tables. The embedding parameters of data 10 and 19 are in processor 1, and the embedding parameter of data 14 is in processor 2; therefore, processor 0 needs to send the computed gradients of data 10 and 19 to processor 1 and the gradient of data 14 to processor 2. This gradient communication can be realized through message communication.
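Continuing the example of Table 2, for illustration only, the routing of gradients to the owning processors might be sketched as follows (group_grads_by_owner is a hypothetical helper and the gradient values are placeholders):

```python
def group_grads_by_owner(ids, grads, N):
    # Route each gradient to the processor whose embedding table owns the id,
    # assuming the modulo-N partitioning (the owner of an id is id % N).
    by_owner = {i: [] for i in range(N)}
    for data_id, grad in zip(ids, grads):
        by_owner[data_id % N].append((data_id, grad))
    return by_owner

# For processor 0 in Table 2: ids 10 and 19 go to processor 1, id 14 goes to
# processor 2, and id 21 stays local, since 21 % 3 == 0.
print(group_grads_by_owner([10, 21, 14, 19], ["g10", "g21", "g14", "g19"], 3))
```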
Based on the above introduction, in the embedding process of data, both forward propagation and backpropagation require message communication among the N processors that train the data with data parallelism plus model parallelism, in order to look up the embedding parameters and gradients of the data. However, in the existing technical solutions, the N processors communicate with each other in a many-to-many message communication manner, which causes communication bottlenecks, increases communication delay, and reduces communication efficiency; moreover, this message communication process cannot be overlapped and optimized with the computation process, which affects the training efficiency of the entire training system and reduces training performance. To facilitate understanding of the many-to-many message communication manner, see FIG. 6.
As can be seen in FIG. 6, the N processors send messages to each other, and each of the N processors sends messages to multiple other processors. In this case, both the sending bandwidth resources and the receiving bandwidth resources are heavily consumed, and the processors need to queue to send and to receive messages, which easily causes communication bottlenecks and increases communication delay.
To facilitate understanding of why the above message communication process cannot be overlapped and optimized with the computation process, see FIG. 7. As can be seen in FIG. 7, the communication part and the computation part cannot overlap, and a processor must wait for the communication process to complete before it can proceed to the next computation step. Therefore, if the communication delay is large, the efficiency of the entire training process is seriously affected, which in turn reduces the performance of the training system.
To solve the above problems, the present application provides a data processing method that can improve the utilization of the communication bandwidth between processors and reduce the communication delay in the forward propagation and backpropagation of the embedding layer, so as to improve training efficiency.
The data processing method provided by the present application mainly deploys a ring communication architecture among the N processors, so that in the forward propagation of the embedding layer during data training, a processor communicates with the other processors through the ring communication architecture to look up the embedding parameters of the corresponding data, and in the backpropagation of the embedding layer during data training, a processor communicates with the other processors through the ring communication architecture to obtain the gradients corresponding to the embedding parameters of the data each processor needs. See FIG. 8, which exemplarily shows that, when the N processors train data, after the data reaches the embedding layer the processors can look up the embedding parameters they each need through the ring communication architecture; and that, in the backpropagation process, after the gradients are backpropagated to the embedding layer and the gradients corresponding to the data embedding parameters are computed, the processors can obtain the gradients they each need through message communication over the ring communication architecture.
In specific embodiments, in the lookup process of embedding parameters during the forward propagation of the embedding layer, each of the N processors can generate a lookup message; the N processors then each send their generated lookup message to their own next-hop processor in the communication mode of the ring communication architecture. After receiving a lookup message, each processor can identify whether the data in the lookup message belongs to the data in the embedding table it maintains; if some data belongs to it, the processor looks up the embedding parameters corresponding to that data and adds the found embedding parameters to the received message. The processor then sends the message with the added embedding parameters to its next-hop processor, again in the communication mode of the ring communication architecture. If no data in the received message belongs to the embedding table it maintains, the processor directly sends the received message to its next-hop processor in the communication mode of the ring communication architecture. This lookup-and-send operation is repeated; after at least N cycles, every processor can obtain the embedding parameters of all the data it is looking up.
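For illustration only, this forward lookup cycle might be sketched as follows, assuming the modulo-N partitioning and reusing the hypothetical ring_sendrecv and lookup_or_init helpers sketched above:

```python
# Minimal sketch of the forward-propagation lookup cycle on the ring.
# After at least N send/receive cycles, this processor's own message has
# visited every other processor and returned with all value fields filled.

def forward_ring_lookup(rank, N, my_ids, embedding_table, ring_sendrecv):
    msg = {"ids": list(my_ids), "values": [None] * len(my_ids)}  # Table 3 style
    for _ in range(N):
        msg = ring_sendrecv(msg)  # send to next hop, receive from previous hop
        for k, data_id in enumerate(msg["ids"]):
            if data_id % N == rank and msg["values"][k] is None:
                # The id belongs to the local embedding table: fill its
                # embedding parameter into the corresponding value field.
                msg["values"][k] = lookup_or_init(embedding_table, data_id)
    return dict(zip(msg["ids"], msg["values"]))
```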
Similarly, in the gradient acquisition process during the backpropagation of the embedding layer, each of the N processors can generate a message including data and the corresponding gradients; the N processors then each send the generated message to their own next-hop processor in the communication mode of the ring communication architecture. After receiving a message, each processor can identify whether the data in the message belongs to the data in the embedding table it maintains. If some data belongs to it, the processor obtains the gradients corresponding to that data from the message, to be used to optimize and update the corresponding embedding parameters in the embedding table. The processor then sends the received message to its next-hop processor in the communication mode of the ring communication architecture. If no data in the received message belongs to the embedding table it maintains, the processor likewise sends the received message to its next-hop processor in the communication mode of the ring communication architecture. This send-and-acquire operation is repeated; after at least N-1 cycles, every processor can obtain the gradients corresponding to the embedding parameters of all the data in its own embedding table, and can thus complete the optimization and update of the embedding parameters of all that data.
In a possible implementation, since the data trained by the data training system may be sparse data or dense data, if the data a processor receives for training is sparse data, the processor can first convert the sparse data into dense data before looking up the embedding parameters of this sparse data in the forward propagation of the embedding layer, and then use the converted dense data as the index to look up the corresponding embedding parameters. In this case, regardless of whether the data a processor needs to train is sparse or dense, in the process of looking up the embedding parameters in the forward propagation of the embedding layer, the data included in the messages sent and received based on the ring communication architecture is all in the form of dense data. Moreover, in this case, the data in the embedding table maintained by a processor is also in the form of dense data.
Optionally, the process by which the N processors realize ring message communication through the ring communication architecture can be encapsulated into communication interfaces: the operation of each processor sending a message to its next-hop processor can be encapsulated into a sending interface, and the operation of each processor receiving a message from its previous-hop processor can be encapsulated into a receiving interface. In this way, when a processor needs to send a message based on the ring communication architecture, it only needs to call the encapsulated sending interface, and when it needs to receive a message based on the ring communication architecture, it only needs to call the encapsulated receiving interface.
Optionally, in the present application, the lookup process of embedding parameters in the forward propagation of the embedding layer and the gradient acquisition process in the backpropagation of the embedding layer can be designed to be encapsulated into one callable interface and exposed for use by artificial intelligence (AI) frameworks. That is, after receiving a lookup message for embedding parameters, a processor can directly call the encapsulated interface to realize the lookup of the embedding parameters and then return the lookup result. The lookup result may be the embedding parameters of the found data, or, if no corresponding embedding parameter is found, the returned result may be a null value.
Similarly, after receiving a gradient message, a processor can directly call the encapsulated interface to realize the gradient lookup and return the operation result. Since the processor looks in the message for the gradients corresponding to the embedding parameters of the data belonging to its own embedding table, the returned operation result can be a null value regardless of whether anything is found.
It should be noted that the way the operations of the data processing method provided by the present application are encapsulated into interfaces is not limited to the implementations exemplarily shown above; in specific embodiments, the multiple operations above can be arbitrarily split and encapsulated into multiple interfaces provided for AI frameworks to use, which is not limited in the present application.
The above has comprehensively described the overall process of the data processing method provided by the present application; the implementation of the specific steps is described below with reference to the figures and tables.
First, the data processing method provided by the present application in the forward propagation of the embedding layer is introduced. Referring to FIG. 9, the data processing method may include, but is not limited to, the following steps:
901. A first processor sends a first lookup message to a second processor; the first lookup message includes first data and is used to look up the embedding parameters of the first data; the second processor is the next-hop processor of the first processor in the ring communication architecture where the first processor is located.
In specific embodiments, the first processor may be any one of the N processors in the system shown in FIG. 4 above. Suppose the first processor is processor i among the N processors (except i = N-1); then the second processor is processor i+1 among the N processors; if the first processor is processor N-1, then the second processor is processor 0.
Specifically, based on the foregoing description, in the forward propagation of the embedding layer, the N processors need to obtain the embedding parameters of the data each of them needs through message communication. In the present application, the message communication among the N processors can be implemented using the ring communication architecture. This embodiment first takes the communication among the first processor, the next-hop processor of the first processor (the second processor above), and the previous-hop processor of the first processor (the third processor in step 902 below) as an example to introduce the lookup process of embedding parameters using the ring communication architecture.
The first data may include one or more data items. The first data may be sparse data or dense data. For the content included in the first lookup message, see Table 3 for an example.

Table 3

  id          | data 1 | data 2 | ...... | data k1
  value field | -      | -      | ...... | -

Table 3 exemplarily shows part of the content included in the first lookup message. The id in Table 3 is the data, and the value fields are used to fill in the embedding parameters corresponding to the data. The k1 in Table 3 can be any integer greater than 0. Assuming the first lookup message is the original message generated by the first processor, the value fields corresponding to the data in the message may be null values or default original values (for example, the original value may be 0, etc.), which is not limited in the present application. The first lookup message includes the first data so that a processor receiving the message can, based on the first data, look up the embedding parameters corresponding to the first data and fill them into the value fields corresponding to the first data in the message.
在一种可能的实施方式中,假设该N个处理器中处理器i的embedding表内的数据对N取模计算后的余数为i,那么上述第一查找消息中还可以包括数据对N取模后的余数,由于该余数和数据所在的embedding表所在的处理器的序号(训练进程的序号)相同,因此,也可以说该第一查找消息中还可以包括数据所在的embedding表所在的进程的序号。示例性地参见表4。
表4
id 数据1 数据2 …… 数据k1
进程(Rank)序号 - - …… -
值域 - - …… -
该第一查找消息中包括进程序号是为了处理器接收到该消息后,可以通过进程序号快速确定哪些数据是属于自身维护的embedding表中的数据,从而快速查找到对应的嵌入参数填入到消息的值域中,可以提高查找效率。
或者,另一种可能的实施方式中,不管上述N个处理器中处理器i的embedding表内的数据对N取模计算后的余数是否为i,只要第一处理器可以确定数据的嵌入参数所在的训练进程的序号,那么,第一查找消息包括的内容的格式就可以是表4所示的格式。
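For illustration only, a lookup message in the format of Table 3 or Table 4 might be represented by a structure such as the following (the field names are assumptions of this sketch):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class LookupMessage:
    ids: List[int]                     # the "id" row: the data whose embedding
                                       # parameters are to be looked up
    ranks: Optional[List[int]] = None  # the optional "process (Rank) number" row
                                       # of Table 4: owner process of each id
    values: List[Optional[List[float]]] = field(default_factory=list)
                                       # value fields, empty until the embedding
                                       # parameters are filled in along the ring

msg = LookupMessage(ids=[21, 5, 14, 25], ranks=[1, 1, 2, 1])
msg.values = [None] * len(msg.ids)  # value fields start empty (Table 3 style)
```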
902. The first processor receives a second lookup message from a third processor; the second lookup message includes second data and is used to look up the embedding parameters of the second data; the third processor is the previous-hop processor of the first processor in the ring communication architecture.
In specific embodiments, suppose the first processor is processor i among the N processors (except i = 0); then the third processor is processor i-1 among the N processors; if the first processor is processor 0, then the third processor is processor N-1.
The second data may include one or more data items, and the second data is generally different from the first data. The second data may be sparse data or dense data. In a possible implementation, part of the second data may be the same as part of the first data. The format of the second lookup message is similar to that of the first lookup message; for the content format included in the second lookup message, reference may be made to the description corresponding to Table 3 or Table 4 above, which is not repeated here.
Specifically, with the N processors communicating messages using the ring communication architecture, after the first processor sends the first lookup message to its next-hop processor, the second processor, it receives the second lookup message from its previous-hop processor, the third processor. The first processor performs the embedding-parameter lookup operation in response to the second lookup message, which is described below in two cases:
In the first case, when the embedding parameters of part or all of the second data are found based on the second lookup message, the first processor adds the embedding parameters of that part or all of the data to the second lookup message to obtain a third lookup message, and sends the third lookup message to the second processor. The third lookup message is used to look up the embedding parameters of the data in the second data whose embedding parameters have not been found.
In specific embodiments, suppose the format of the content carried by the second lookup message is as shown in Table 3 above, that is, it does not carry process numbers. Then, after receiving the second lookup message, the first processor parses the message to obtain the second data in it and compares the second data with the data in the embedding table maintained by the first processor itself. If part or all of the second data exists in the embedding table, the first processor obtains the embedding parameters mapped by that part or all of the data from the embedding table. The first processor then adds these embedding parameters to the value fields corresponding to that data in the second lookup message, obtaining the third lookup message, and sends the third lookup message to its next-hop processor, that is, to the second processor.
Optionally, adding an embedding parameter to a value field of a message may be done by accumulation or similar operations.
In a possible implementation, suppose the remainder of the data in the embedding table of processor i among the N processors after modulo-N computation is i, and the format of the content carried by the second lookup message is as shown in Table 3 above, that is, it does not carry process numbers. Then, after receiving the second lookup message, the first processor parses the message to obtain the second data in it. The first processor then performs a modulo computation of each item of the second data with N, obtaining the remainder of each item.
If one or more of the computed remainders are the same as the sequence number of the training process run by the first processor, the data corresponding to those remainders exists in the embedding table maintained by the first processor. Using the data corresponding to those remainders as the index, the first processor finds the embedding parameters of that data in the embedding table, adds the found embedding parameters to the value fields corresponding to that data in the second lookup message, and obtains the third lookup message. The first processor then sends the third lookup message to its next-hop processor, that is, to the second processor.
Alternatively, if one or more of the computed remainders are the same as the sequence number of the training process run by the first processor, the first processor uses the data corresponding to those remainders as the index to look up the embedding parameters of that data in the embedding table it maintains. If some of the data corresponding to those remainders is not in the embedding table, the processor can randomly generate corresponding embedding parameters for the data not in the embedding table, then add the embedding parameters found in the embedding table and the randomly generated embedding parameters to the corresponding value fields in the second lookup message, obtaining the third lookup message. The first processor then sends the third lookup message to its next-hop processor, that is, to the second processor. In addition, the processor adds the data not in the embedding table and the randomly generated embedding parameters to the embedding table in one-to-one correspondence.
In a possible implementation, suppose the remainder of the data in the embedding table of processor i among the N processors after modulo-N computation is i, and the format of the content carried by the second lookup message is as shown in Table 4 above, that is, it carries process numbers. Then, after receiving the second lookup message, the first processor parses the message to obtain the second data and the corresponding process numbers in it.
If one or more of the process numbers in the second lookup message are the sequence number of the training process run by the first processor, the data corresponding to those sequence numbers exists in the embedding table maintained by the first processor. Using the data corresponding to those sequence numbers as the index, the first processor finds the embedding parameters of that data in the embedding table, adds the found embedding parameters to the value fields corresponding to that data in the second lookup message, and obtains the third lookup message. The first processor then sends the third lookup message to its next-hop processor, that is, to the second processor.
Alternatively, if one or more of the process numbers in the second lookup message are the sequence number of the training process run by the first processor, the first processor uses the data corresponding to those sequence numbers as the index to look up the embedding parameters of that data in the embedding table it maintains. If some of the data corresponding to those sequence numbers is not in the embedding table, the processor can randomly generate corresponding embedding parameters for the data not in the embedding table, then add the embedding parameters found in the embedding table and the randomly generated embedding parameters to the corresponding value fields in the second lookup message, obtaining the third lookup message. The first processor then sends the third lookup message to its next-hop processor, that is, to the second processor. In addition, the processor adds the data not in the embedding table and the randomly generated embedding parameters to the embedding table in one-to-one correspondence.
To facilitate understanding of how a processor adds embedding parameters to the value fields of the second lookup message, an example follows. See Table 5.

Table 5

  id          | 9 | 8 | 13 | 3
  value field | - | - | -  | -

Suppose the data shown in Table 5 is the data carried in the second lookup message, with the value fields empty by default. The first processor determines that data 9 and 3 in Table 5 belong to the data in the embedding table it maintains, finds in the embedding table that the embedding parameters of 9 and 3 are parameter a and parameter b respectively, and then directly adds parameter a and parameter b to the value fields corresponding to 9 and 3. After the addition, see Table 6.

Table 6

  id          | 9           | 8 | 13 | 3
  value field | parameter a | - | -  | parameter b

The third lookup message obtained above thus includes the content shown in Table 6.
In the second case, when no embedding parameter of the second data is found based on the second lookup message, the first processor sends the second lookup message to the second processor.
In specific embodiments, regardless of which of the content formats introduced above the second lookup message uses, if the first processor determines that no data in the second data included in the second lookup message belongs to the data in the embedding table maintained by the first processor itself, that is, the first processor cannot find the embedding parameters of the second data in the embedding table it maintains, the first processor sends the second lookup message to its next-hop processor, that is, to the second processor.
It should be noted that, if the remainder of the data in the embedding table of processor i among the N processors after modulo-N computation is i, the data whose remainder after modulo-N computation is the same as the sequence number of the training process run by the first processor belongs to the data in the embedding table maintained by the first processor itself, and all other data does not belong to the data in the embedding table maintained by the first processor itself.
In a possible implementation, after the first processor receives the second lookup message and completes its response to it, the first processor further receives a fourth lookup message from the third processor; the fourth lookup message includes third data and the embedding parameters mapped by a first part of the third data, and is used to look up the embedding parameters mapped by the data in the third data other than the first part.
In specific embodiments, before the fourth lookup message was sent to the first processor, the embedding parameters of the first part of the third data carried in the fourth lookup message had already been found in other processors; the fourth lookup message therefore carries the embedding parameters of the first part of the data. The first part of the data is one or more items of the third data. The third data may be sparse data or dense data.
Similar to the operation of the first processor upon receiving the second lookup message, the first processor performs the embedding-parameter lookup operation in response to the fourth lookup message, likewise described in two cases:
In the first case, when the embedding parameters of a second part of the third data are found based on the fourth lookup message, the first processor adds the embedding parameters of the second part of the data to the fourth lookup message to obtain a fifth lookup message, and sends the fifth lookup message to the second processor. The second part of the data is one or more items of the third data, and the second part of the data is different from the first part of the data.
In specific embodiments, suppose the format of the content carried by the fourth lookup message is as shown in Table 3 above, that is, it does not carry process numbers. Then, after receiving the fourth lookup message, the first processor parses the message to obtain the third data in it and compares the third data with the data in the embedding table maintained by the first processor itself. If the second part of the third data exists in the embedding table, the first processor obtains the embedding parameters mapped by the second part of the data from the embedding table. The first processor then adds these embedding parameters to the value fields corresponding to the second part of the data in the fourth lookup message, obtaining the fifth lookup message, and sends the fifth lookup message to its next-hop processor, that is, to the second processor. The fifth lookup message is used to look up the embedding parameters of the data in the third data whose embedding parameters have not been found.
In a possible implementation, suppose the remainder of the data in the embedding table of processor i among the N processors after modulo-N computation is i, and the format of the content carried by the fourth lookup message is as shown in Table 3 above, that is, it does not carry process numbers. Then, after receiving the fourth lookup message, the first processor parses the message to obtain the third data in it. The first processor then performs a modulo computation of each item of the third data with N, obtaining the remainder of each item.
If one or more of the computed remainders are the same as the sequence number of the training process run by the first processor, the data corresponding to those remainders exists in the embedding table maintained by the first processor; that data is the second part of the data. Using that data as the index, the first processor finds its embedding parameters in the embedding table, adds the found embedding parameters to the value fields corresponding to that data in the fourth lookup message, and obtains the fifth lookup message. The first processor then sends the fifth lookup message to its next-hop processor, that is, to the second processor.
Alternatively, if one or more of the computed remainders are the same as the sequence number of the training process run by the first processor, the first processor uses the data corresponding to those remainders as the index to look up its embedding parameters in the embedding table it maintains. If some of that data is not in the embedding table, the processor can randomly generate corresponding embedding parameters for the data not in the embedding table, then add the embedding parameters found in the embedding table and the randomly generated embedding parameters to the corresponding value fields in the fourth lookup message, obtaining the fifth lookup message. The first processor then sends the fifth lookup message to its next-hop processor, that is, to the second processor. In addition, the processor adds the data not in the embedding table and the randomly generated embedding parameters to the embedding table in one-to-one correspondence.
In a possible implementation, suppose the remainder of the data in the embedding table of processor i among the N processors after modulo-N computation is i, and the format of the content carried by the fourth lookup message is as shown in Table 4 above, that is, it carries process numbers. Then, after receiving the fourth lookup message, the first processor parses the message to obtain the third data and the corresponding process numbers in it.
If one or more of the process numbers in the fourth lookup message are the sequence number of the training process run by the first processor, the data corresponding to those sequence numbers exists in the embedding table maintained by the first processor; that data is the second part of the data. Using that data as the index, the first processor finds its embedding parameters in the embedding table, adds the found embedding parameters to the value fields corresponding to that data in the fourth lookup message, and obtains the fifth lookup message. The first processor then sends the fifth lookup message to its next-hop processor, that is, to the second processor.
Alternatively, if one or more of the process numbers in the fourth lookup message are the sequence number of the training process run by the first processor, the first processor uses the data corresponding to those sequence numbers as the index to look up its embedding parameters in the embedding table it maintains. If some of that data is not in the embedding table, the processor can randomly generate corresponding embedding parameters for the data not in the embedding table, then add the embedding parameters found in the embedding table and the randomly generated embedding parameters to the corresponding value fields in the fourth lookup message, obtaining the fifth lookup message. The first processor then sends the fifth lookup message to its next-hop processor, that is, to the second processor. In addition, the processor adds the data not in the embedding table and the randomly generated embedding parameters to the embedding table in one-to-one correspondence.
To help understand how a processor adds embedding parameters to the value fields of the fourth lookup message, an example follows. See Table 7.
Table 7
id           11           10   5            15
Value field  parameter c  -    parameter d  -
Assume that the data shown in Table 7 is the data carried in the fourth lookup message. It can be seen that the embedding parameters of data 11 and 5 have already been found in other processors, and the value fields of the data whose embedding parameters have not been found are empty by default. The first processor determines that data 15 in Table 7 belongs to the embedding table it maintains, finds in that embedding table that the embedding parameter of 15 is parameter e, and then directly adds parameter e to the value field corresponding to 15. The result is shown in Table 8.
Table 8
id           11           10   5            15
Value field  parameter c  -    parameter d  parameter e
The fifth lookup message obtained above then includes the content shown in Table 8.
Case 2: when no embedding parameter of the third data is found based on the fourth lookup message, the first processor sends the fourth lookup message to the second processor.
In a specific embodiment, regardless of which of the content formats described above the fourth lookup message uses, if the first processor determines that none of the third data included in the fourth lookup message belongs to the data in the embedding table maintained by the first processor itself, that is, the first processor cannot find the embedding parameters of the third data in its own embedding table, the first processor sends the fourth lookup message to the next-hop processor, that is, to the second processor.
In a possible implementation, the fourth lookup message received by the first processor from the third processor includes not the embedding parameters mapped to part of the third data, but the embedding parameters mapped to all of the third data. In this case, the first processor can determine that none of the third data included in the fourth lookup message belongs to the data in the embedding table maintained by the first processor itself, and the first processor sends the fourth lookup message to the next-hop processor, that is, to the second processor.
In a possible implementation, the message sending and embedding parameter lookup operations above are repeated. After N-1 cycles, in the N-th cycle the first processor can receive a sixth lookup message from the third processor, where the sixth lookup message includes the first data and the embedding parameters of the first data. That is, the first lookup message is a message generated by the first processor, and the first data carried in it is the data that the first processor needs to train on. After N cycles, the message carrying the first data has passed through the N processors, and the embedding parameters of the first data have been found in one or more of the N processors. As the message is forwarded, the found embedding parameters are finally delivered to the first processor through the sixth lookup message, so that the first processor obtains all the embedding parameters of the data it trains on. For an example, see Table 9.
Table 9
id           16           20           19           27
Value field  parameter f  parameter g  parameter h  parameter r
Table 9 shows by way of example the first data included in the sixth lookup message and the embedding parameters of the first data. It can be seen that all the embedding parameters of the first data have been found and filled into the value fields corresponding to the respective pieces of data.
After the first processor obtains all the embedding parameters of its training data through the sixth lookup message, if the training data is sparse data, the first processor needs to perform a reduce operation on the obtained embedding parameters of the training data, and then continue to forward-propagate the reduced embedding parameters to the hidden layer. For example, the reduce operation may be a weighted sum of the embedding parameters of training data of the same category or with strong correlation; for the specific reduce operation, refer to existing solutions, which this application does not limit.
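As one illustration of such a reduce step, the sketch below computes a weighted sum over the embedding vectors of groups of related sparse features before they are forward-propagated to the hidden layer. The grouping and the weights (groups, weights) are hypothetical inputs chosen for the example; as noted above, the patent leaves the concrete reduce operation open.

```python
def reduce_embeddings(embeddings, groups, weights):
    """Weighted-sum reduce: embeddings maps id -> vector, groups is a list of
    lists of related ids, weights maps id -> weight (default 1.0)."""
    reduced = []
    for group in groups:
        dim = len(embeddings[group[0]])
        acc = [0.0] * dim
        for data_id in group:
            w = weights.get(data_id, 1.0)
            acc = [a + w * v for a, v in zip(acc, embeddings[data_id])]
        reduced.append(acc)   # one reduced vector per group
    return reduced
```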
To help understand the data processing method provided in this application during forward propagation at the embedding layer as described above, an example follows; see Fig. 10A to Fig. 10E. In Fig. 10A to Fig. 10E, assume that the N processors are four processors: processor 0, processor 1, processor 2 and processor 3. The four processors communicate messages through the ring communication architecture described above. This example assumes that the data in the embedding table of processor i among the N processors leaves remainder i when taken modulo N.
First, referring to Fig. 10A, in the forward propagation at the embedding layer, each processor first needs to look up the embedding parameters of the data it trains on. In Fig. 10A, assume that the data for which processor 0 needs to look up embedding parameters is the first batch of data: 21, 5, 14 and 25, whose remainders modulo 4 are 1, 1, 2 and 1 respectively; that is, the embedding parameters of data 21, 5 and 25 need to be looked up in processor 1, and the embedding parameter of data 14 needs to be looked up in processor 2. Assume that the data for which processor 1 needs to look up embedding parameters is the second batch of data: 19, 2, 10 and 32, whose remainders modulo 4 are 3, 2, 2 and 0 respectively; that is, the embedding parameters of data 2 and 10 need to be looked up in processor 2, the embedding parameter of data 19 needs to be looked up in processor 3, and the embedding parameter of data 32 needs to be looked up in processor 0. Assume that the data for which processor 2 needs to look up embedding parameters is the third batch of data: 13, 8, 16 and 29, whose remainders modulo 4 are 1, 0, 0 and 1 respectively; that is, the embedding parameters of data 8 and 16 need to be looked up in processor 0, and the embedding parameters of data 13 and 29 need to be looked up in processor 1. Assume that the data for which processor 3 needs to look up embedding parameters is the fourth batch of data: 6, 33, 18 and 4, whose remainders modulo 4 are 2, 1, 2 and 0 respectively; that is, the embedding parameters of data 6 and 18 need to be looked up in processor 2, the embedding parameter of data 33 needs to be looked up in processor 1, and the embedding parameter of data 4 needs to be looked up in processor 0. The remainder of each piece of data modulo 4 is the rank of the process where the embedding parameter of that data resides.
In Fig. 10A, each processor first generates a message that includes the data whose embedding parameters it needs to look up, the corresponding process ranks, and the value-field space to be filled with embedding parameters (a hypothetical constructor for such a message is sketched below). After generating its message, each processor sends its message to its next-hop processor in the manner of the ring communication architecture, and receives the message sent by its previous-hop processor. After receiving a message, each processor performs the corresponding table lookup operations; for the specific lookup operations, see the foregoing descriptions, which are not repeated here. Each processor then fills the found embedding parameters into the message it received, as shown in Fig. 10B.
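For illustration only, a message of the shape just described could be assembled as follows; the dictionary layout and the name build_lookup_message are assumptions of this sketch, not a wire format defined by the patent.

```python
def build_lookup_message(ids, n):
    """Assemble a lookup message: the ids to resolve, the rank owning each id
    (here id % n), and empty value fields to be filled along the ring."""
    return {
        "id": list(ids),
        "rank": [i % n for i in ids],
        "value": [None] * len(ids),   # embedding parameters filled in later
    }

# e.g. processor 0's first batch from Fig. 10A:
msg = build_lookup_message([21, 5, 14, 25], 4)
# -> {'id': [21, 5, 14, 25], 'rank': [1, 1, 2, 1], 'value': [None, None, None, None]}
```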
As can be seen in Fig. 10B, processor 0 sends the message including the first batch of data to processor 1 and receives from processor 3 the message including the fourth batch of data; it then finds the embedding parameter of data 4 in its own embedding table and adds it to the value field corresponding to data 4 in the received message, obtaining a new message. Processor 1 sends the message including the second batch of data to processor 2 and receives from processor 0 the message including the first batch of data; it then finds the embedding parameters of data 21, 5 and 25 in its own embedding table and adds them to the value fields corresponding to data 21, 5 and 25 in the received message, obtaining a new message. Processor 2 sends the message including the third batch of data to processor 3 and receives from processor 1 the message including the second batch of data; it then finds the embedding parameters of data 2 and 10 in its own embedding table and adds them to the value fields corresponding to data 2 and 10 in the received message, obtaining a new message. Processor 3 sends the message including the fourth batch of data to processor 0 and receives from processor 2 the message including the third batch of data; because none of the third batch of data belongs to the embedding table in processor 3, processor 3 does not find the embedding parameter of any data in the third batch.
In Fig. 10B, the processors that obtained new messages send their new messages to their next-hop processors, and the processor that did not obtain a new message (processor 3) sends the message it received to its next-hop processor. After sending its message, each processor receives a new message from its previous-hop processor, continues to look up embedding parameters in response to the new message, and then fills the found embedding parameters into the message it received, as shown in Fig. 10C.
In Fig. 10C, processor 0 sends the message including the fourth batch of data to processor 1 and receives from processor 3 the message including the third batch of data; it then finds the embedding parameters of data 8 and 16 in its own embedding table and adds them to the value fields corresponding to data 8 and 16 in the received message, obtaining a new message. Processor 1 sends the message including the first batch of data to processor 2 and receives from processor 0 the message including the fourth batch of data; it then finds the embedding parameter of data 33 in its own embedding table and adds it to the value field corresponding to data 33 in the received message, obtaining a new message. Processor 2 sends the message including the second batch of data to processor 3 and receives from processor 1 the message including the first batch of data; it then finds the embedding parameter of data 14 in its own embedding table and adds it to the value field corresponding to data 14 in the received message, obtaining a new message. Processor 3 sends the message including the third batch of data to processor 0 and receives from processor 2 the message including the second batch of data; it then finds the embedding parameter of data 19 in its own embedding table and adds it to the value field corresponding to data 19 in the received message, obtaining a new message.
In Fig. 10C, each processor sends the new message it obtained to its next-hop processor. After sending its message, each processor receives a new message from its previous-hop processor, continues to look up embedding parameters in response to the new message, and then fills the found embedding parameters into the message it received, as shown in Fig. 10D.
In Fig. 10D, processor 0 sends the message including the third batch of data to processor 1 and receives from processor 3 the message including the second batch of data; it then finds the embedding parameter of data 32 in its own embedding table and adds it to the value field corresponding to data 32 in the received message, obtaining a new message. Processor 1 sends the message including the fourth batch of data to processor 2 and receives from processor 0 the message including the third batch of data; it then finds the embedding parameters of data 13 and 29 in its own embedding table and adds them to the value fields corresponding to data 13 and 29 in the received message, obtaining a new message. Processor 2 sends the message including the first batch of data to processor 3 and receives from processor 1 the message including the fourth batch of data; it then finds the embedding parameters of data 6 and 18 in its own embedding table and adds them to the value fields corresponding to data 6 and 18 in the received message, obtaining a new message. Processor 3 sends the message including the second batch of data to processor 0 and receives from processor 2 the message including the first batch of data; because none of the first batch of data belongs to the embedding table in processor 3, processor 3 does not find the embedding parameter of any data in the first batch.
In Fig. 10D, the processors that obtained new messages send their new messages to their next-hop processors, and the processor that did not obtain a new message (processor 3) sends the message it received to its next-hop processor. After sending its message, each processor receives a new message from its previous-hop processor. In this cycle, the message received by each processor includes the data it trains on and the embedding parameters it needs, which completes the lookup of embedding parameters at the embedding layer. See Fig. 10E.
As can be seen in Fig. 10E, processor 0 sends the message including the second batch of data to processor 1 and receives from processor 3 the message including the first batch of data, which includes the embedding parameters of the first batch of data that processor 0 needs. Processor 1 sends the message including the third batch of data to processor 2 and receives from processor 0 the message including the second batch of data, which includes the embedding parameters of the second batch of data that processor 1 needs. Processor 2 sends the message including the fourth batch of data to processor 3 and receives from processor 1 the message including the third batch of data, which includes the embedding parameters of the third batch of data that processor 2 needs. Processor 3 sends the message including the first batch of data to processor 0 and receives from processor 2 the message including the fourth batch of data, which includes the embedding parameters of the fourth batch of data that processor 3 needs.
It should be noted that Fig. 10A to Fig. 10E and the related descriptions are merely an example and do not constitute a limitation on this application; variations made based on the above idea all fall within the protection scope of this application.
In summary, in this example the four processors find the embedding parameters they each need in four cycles. Because the processors communicate through the ring communication architecture, compared with the many-to-many message communication in existing technical solutions, this application avoids single-point communication bottlenecks, reduces communication latency and improves communication efficiency, thereby improving the training performance of the entire data training system.
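To tie the walkthrough together, here is a minimal, self-contained simulation of the four-processor forward pass of Fig. 10A to Fig. 10E under the same id mod 4 ownership assumption. The placeholder parameter strings and all names are inventions of this sketch.

```python
N = 4
batches = {0: [21, 5, 14, 25],   # data each processor trains on (Fig. 10A)
           1: [19, 2, 10, 32],
           2: [13, 8, 16, 29],
           3: [6, 33, 18, 4]}

# each processor owns the ids with id % N == its rank; placeholder parameters
tables = {r: {i: f"param({i})" for b in batches.values() for i in b if i % N == r}
          for r in range(N)}

# each message starts at its origin processor: (origin, ids, value fields)
msgs = {r: (r, batches[r], [None] * len(batches[r])) for r in range(N)}

for _ in range(N):                                    # N cycles around the ring
    msgs = {(r + 1) % N: msgs[r] for r in range(N)}   # everyone sends to next hop
    for r, (origin, ids, vals) in msgs.items():
        for k, i in enumerate(ids):
            if i % N == r:                            # fill owned embedding params
                vals[k] = tables[r][i]

for r, (origin, ids, vals) in msgs.items():           # back at origin, all filled
    assert origin == r and all(v is not None for v in vals)
    print(f"processor {r}: {dict(zip(ids, vals))}")
```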
The following describes the data processing method provided in this application during back propagation at the embedding layer. It should be noted that, in the embodiments of this method described below, the terms "first processor (or data)", "second processor (or data)", "third processor (or data)" and so on, used to distinguish different objects, may refer to the same objects as the correspondingly named ones in Fig. 9 and its possible implementations, or may refer to different objects.
Referring to Fig. 11, during back propagation at the embedding layer, the data processing method provided in this application may include but is not limited to the following steps:
1101. The first processor sends a first notification message to the second processor. The first notification message includes first data and a first gradient and is used to propagate the first gradient to a first target processor. The first gradient is the gradient corresponding to the embedding parameter of the first data, and the first data and the first gradient are mapped one to one. The second processor is the next-hop processor of the first processor in the ring communication architecture in which the first processor is located.
In a specific embodiment, the first processor may be any one of the N processors in the system shown in Fig. 4. Assume that the first processor is processor i among the N processors (except i = N-1); then the second processor is processor i+1 among the N processors. If the first processor is processor N-1, the second processor is processor 0.
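Stated as code, the next and previous hops follow directly from the rank (a restatement of the rule above, not code from the patent):

```python
def next_hop(i, n):
    return (i + 1) % n   # processor N-1 wraps around to processor 0

def prev_hop(i, n):
    return (i - 1) % n   # processor 0 wraps around to processor N-1
```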
Specifically, during back propagation at the embedding layer, each of the N processors has obtained the gradients of the embedding parameters of the data it trains on. However, because the embedding parameters of the data each processor trains on are stored in the embedding tables of other processors, the gradients need to be sent to the corresponding processors through message communication, so as to optimize the corresponding embedding parameters. Likewise, in this application the N processors communicate messages through the ring communication architecture. This embodiment first takes the communication among the first processor, its next-hop processor (the second processor above) and its previous-hop processor (the third processor in step 1102 below) as an example to describe how the processors obtain the gradients they need by using the ring communication architecture.
The first target processor includes one or more of the N processors. Which processor the first target processor specifically is is determined by the first data in the first notification message. For example, if the first data includes some or all of the data in the embedding table of processor i, the first target processor includes processor i.
The first data may include one or more pieces of data. For an example of the content included in the first notification message, see Table 10.
Table 10
id           Data 1      Data 2      ……  Data k2
Value field  Gradient 1  Gradient 2  ……  Gradient k2
Table 10 shows by way of example part of the content included in the first notification message. The id in Table 10 is the data, and the value fields hold the gradients corresponding to the embedding parameters of the data. k2 in Table 10 may be any integer greater than 0.
In a possible implementation, assume that the data in the embedding table of processor i among the N processors leaves remainder i when taken modulo N; the first notification message may then further include the remainder of each piece of data modulo N. Because that remainder is the same as the rank of the processor (the rank of the training process) holding the embedding table in which the data resides, it can equally be said that the first notification message may further include the rank of the process holding the embedding table in which the data resides. For an example, see Table 11.
Table 11
id              Data 1      Data 2      ……  Data k2
Process (Rank)  Rank 1      Rank 2      ……  Rank k2
Value field     Gradient 1  Gradient 2  ……  Gradient k2
The process ranks are included in the first notification message so that, after receiving the message, a processor can use the ranks to quickly determine which data belongs to the embedding table it maintains, and thus quickly obtain the corresponding gradients.
Alternatively, in another possible implementation, regardless of whether the data in the embedding table of processor i among the N processors leaves remainder i when taken modulo N, as long as the first processor can determine the rank of the training process where the embedding parameter of each piece of data resides, the content of the first notification message can be in the format shown in Table 11.
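A hypothetical constructor for a notification message in the Table-11 format might look as follows; the dictionary layout, the name build_notification_message and the grads mapping are assumptions of this sketch.

```python
def build_notification_message(ids, grads, n):
    """Assemble a gradient notification message: the ids, the rank owning each
    id (here id % n), and the gradient of each id's embedding parameter."""
    return {
        "id": list(ids),
        "rank": [i % n for i in ids],
        "value": [grads[i] for i in ids],   # one gradient per id
    }
```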
1102. The first processor receives a second notification message from a third processor. The second notification message includes second data and a second gradient and is used to propagate the second gradient to a second target processor. The second gradient is the gradient corresponding to the embedding parameter of the second data, and the second data and the second gradient are mapped one to one. The third processor is the previous-hop processor of the first processor in the ring communication architecture.
In a specific embodiment, assume that the first processor is processor i among the N processors (except i = 0); then the third processor is processor i-1 among the N processors. If the first processor is processor 0, the third processor is processor N-1.
The second target processor includes one or more of the N processors. Which processor the second target processor specifically is is determined by the second data in the second notification message. For example, if the second data includes some or all of the data in the embedding table of processor i, the second target processor includes processor i.
The second data may include one or more pieces of data, and the second data is generally different from the first data. In a possible implementation, a part of the second data may be the same as a part of the first data. The format of the second notification message is similar to that of the first notification message; for the format of the content included in the second notification message, refer to the descriptions corresponding to Table 10 or Table 11 above, and details are not repeated here.
Specifically, because the N processors communicate by using the ring communication architecture, after sending the first notification message to its next-hop processor (the second processor), the first processor receives the second notification message from its previous-hop processor (the third processor). The first processor performs a gradient obtaining operation in response to the second notification message. Two cases are described below:
Case 1: when the second notification message includes a first target gradient, the first processor obtains the first target gradient from the second notification message and sends the second notification message to the second processor, so as to continue to notify the other processors in the second target processor to obtain the gradients they need. The first target gradient is a gradient of an embedding parameter in a first embedding table maintained by the first processor, and the data and the embedding parameters in the first embedding table are in one-to-one mapping.
In a specific embodiment, assume that the content carried in the second notification message is in the format shown in Table 10 above, that is, no process rank is carried. After receiving the second notification message, the first processor parses the message to obtain the second data and compares the second data with the data in the embedding table maintained by the first processor itself. If some or all of the second data exists in that embedding table, the first processor extracts from the value fields of the parsed second notification message the gradients corresponding to that data, to be used for optimizing the embedding parameters of that data in the first embedding table maintained by the first processor. After extracting the gradients, the first processor re-encapsulates the second notification message and sends it to the next-hop processor, the second processor.
In a possible implementation, assume that the data in the embedding table of processor i among the N processors leaves remainder i when taken modulo N, and that the content carried in the second notification message is in the format shown in Table 10 above, that is, no process rank is carried. After receiving the second notification message, the first processor parses the message to obtain the second data, and then computes each piece of the second data modulo N to obtain the remainder of each piece of data. If one or more of the computed remainders equal the rank of the training process run by the first processor, the data corresponding to those remainders exists in the embedding table maintained by the first processor. The first processor extracts from the value fields of the parsed second notification message the gradients corresponding to that data, to be used for optimizing the embedding parameters of that data in the first embedding table maintained by the first processor. After extracting the gradients, the first processor re-encapsulates the second notification message and sends it to the next-hop processor, the second processor.
In a possible implementation, assume that the data in the embedding table of processor i among the N processors leaves remainder i when taken modulo N, and that the content carried in the second notification message is in the format shown in Table 11 above, that is, process ranks are carried. After receiving the second notification message, the first processor parses the message to obtain the second data and the corresponding process ranks. If one or more of the process ranks in the second notification message equal the rank of the training process run by the first processor, the data corresponding to those ranks exists in the embedding table maintained by the first processor. The first processor extracts from the value fields of the parsed second notification message the gradients corresponding to that data, to be used for optimizing the embedding parameters of that data in the first embedding table maintained by the first processor. After extracting the gradients, the first processor re-encapsulates the second notification message and sends it to the next-hop processor, the second processor.
Case 2: when the second notification message does not include the first target gradient, the first processor sends the second notification message to the second processor.
In a specific embodiment, regardless of whether the content of the second notification message is in the format shown in Table 10 or Table 11, if the first processor determines that none of the second data included in the second notification message belongs to the data in the embedding table maintained by the first processor itself, the first processor does not need to extract any gradient from the second notification message, and sends the second notification message to the next-hop processor, that is, to the second processor.
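Mirroring the forward-pass sketch earlier, the backward-pass handling of a notification message might look as follows (Table-10 format, id mod N ownership; handle_notification and grad_buffer are hypothetical names):

```python
N = 4  # ring size assumed for this sketch

def handle_notification(msg_ids, msg_grads, rank, grad_buffer):
    """Extract the gradients for the ids this processor owns (id % N == rank)
    and buffer them for the later embedding-table update; the message itself
    is then forwarded unchanged to the next-hop processor."""
    for data_id, grad in zip(msg_ids, msg_grads):
        if data_id % N == rank:
            grad_buffer[data_id] = grad
    return msg_ids, msg_grads   # forwarded as-is
```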
In a possible implementation, after receiving the second notification message and completing the response operations for it, the first processor further receives a third notification message from the third processor.
The third notification message includes third data and a third gradient and is used to propagate the third gradient to a third target processor. The third gradient is the gradient corresponding to the embedding parameter of the third data, and the third data and the third gradient are mapped one to one.
The third target processor includes one or more of the N processors. Which processor the third target processor specifically is is determined by the third data in the third notification message. For example, if the third data includes some or all of the data in the embedding table of processor i, the third target processor includes processor i.
The third data may include one or more pieces of data, and the third data is generally different from the first data and the second data. In a possible implementation, a part of the third data may be the same as a part of the first data or a part of the second data. The format of the third notification message is similar to that of the first notification message; for the format of the content included in the third notification message, refer to the descriptions corresponding to Table 10 or Table 11 above, and details are not repeated here.
Specifically, when the third notification message includes a second target gradient, the first processor obtains the second target gradient from the third notification message and sends the third notification message to the second processor, so as to continue to notify the other processors in the third target processor to obtain the gradients they need; the second target gradient is a gradient of an embedding parameter in the first embedding table maintained by the first processor. Alternatively, when the third notification message does not include the second target gradient, the first processor sends the third notification message to the second processor, so as to continue to notify the other processors in the third target processor to obtain the gradients they need. For the specific implementation steps, refer to the descriptions of Case 1 and Case 2 in step 1102 above, which are not repeated here.
In a possible implementation, the N processors communicate messages through the ring communication architecture, repeating the message sending and gradient obtaining operations above. After at least N-1 cycles, each of the N processors has obtained all the gradients of the embedding parameters in its own embedding table, and can then optimize the embedding parameters in its own embedding table based on the obtained gradients.
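As an illustration of that final update step, the sketch below applies the buffered gradients with plain SGD. The patent does not prescribe an optimizer, so the learning rate and the update rule here are assumptions of the example.

```python
def apply_gradients(embedding_table, grad_buffer, lr=0.01):
    """Optimize the owned embedding parameters with the gradients gathered
    during the ring cycles (one SGD step per embedding vector)."""
    for data_id, grad in grad_buffer.items():
        vec = embedding_table[data_id]
        embedding_table[data_id] = [v - lr * g for v, g in zip(vec, grad)]
    grad_buffer.clear()   # ready for the next training iteration
```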
To help understand the data processing method provided in this application during back propagation at the embedding layer as described above, an example follows; see Fig. 12A to Fig. 12D. In Fig. 12A to Fig. 12D, assume that the N processors are four processors: processor 0, processor 1, processor 2 and processor 3. The four processors communicate messages through the ring communication architecture. This example assumes that the data in the embedding table of processor i among the N processors leaves remainder i when taken modulo N.
First, referring to Fig. 12A, during back propagation at the embedding layer, each processor first needs to obtain the gradients of the embedding parameters of the data it trains on, so as to optimize the embedding parameters in its embedding table according to the gradient of each embedding parameter. For the description of the first, second, third and fourth batches of data in Fig. 12A, see the description of Fig. 10A above, which is not repeated here. In Fig. 12A, each processor first generates a message that includes the data, the corresponding process ranks, and the gradients corresponding to the data. After generating its message, each processor sends its message to its next-hop processor in the manner of the ring communication architecture and receives the message sent by its previous-hop processor, as shown in Fig. 12B. After receiving a message, a processor can perform the corresponding gradient obtaining operations; for the specific obtaining operations, see the description in step 1102 above, which is not repeated here.
As can be seen in Fig. 12B, processor 0 sends the message including the first batch of data to processor 1 and receives from processor 3 the message including the fourth batch of data, and then obtains gradient 16, corresponding to data 4, from the received message. Processor 1 sends the message including the second batch of data to processor 2 and receives from processor 0 the message including the first batch of data, and then obtains gradient 1, gradient 2 and gradient 4, corresponding to data 21, 5 and 25 respectively, from the received message. Processor 2 sends the message including the third batch of data to processor 3 and receives from processor 1 the message including the second batch of data, and then obtains gradient 6 and gradient 7, corresponding to data 2 and 10 respectively, from the received message. Processor 3 sends the message including the fourth batch of data to processor 0 and receives from processor 2 the message including the third batch of data; because none of the third batch of data belongs to the embedding table in processor 3, processor 3 does not obtain any gradient from the received message.
In Fig. 12B, after performing the gradient obtaining operations in response to the received messages, each processor sends the message it received to its next-hop processor, as shown in Fig. 12C.
As can be seen in Fig. 12C, processor 0 sends the message including the fourth batch of data to processor 1 and receives from processor 3 the message including the third batch of data, and then obtains gradient 10 and gradient 11, corresponding to data 8 and data 16 respectively, from the received message. Processor 1 sends the message including the first batch of data to processor 2 and receives from processor 0 the message including the fourth batch of data, and then obtains gradient 14, corresponding to data 33, from the received message. Processor 2 sends the message including the second batch of data to processor 3 and receives from processor 1 the message including the first batch of data, and then obtains gradient 3, corresponding to data 14, from the received message. Processor 3 sends the message including the third batch of data to processor 0 and receives from processor 2 the message including the second batch of data, and then obtains gradient 5, corresponding to data 19, from the received message.
In Fig. 12C, after performing the gradient obtaining operations in response to the received messages, each processor sends the message it received to its next-hop processor, as shown in Fig. 12D.
As can be seen in Fig. 12D, processor 0 sends the message including the third batch of data to processor 1 and receives from processor 3 the message including the second batch of data, and then obtains gradient 8, corresponding to data 32, from the received message. Processor 1 sends the message including the fourth batch of data to processor 2 and receives from processor 0 the message including the third batch of data, and then obtains gradient 9 and gradient 12, corresponding to data 13 and 29 respectively, from the received message. Processor 2 sends the message including the first batch of data to processor 3 and receives from processor 1 the message including the fourth batch of data, and then obtains gradient 13 and gradient 15, corresponding to data 6 and 18 respectively, from the received message. Processor 3 sends the message including the second batch of data to processor 0 and receives from processor 2 the message including the first batch of data; because none of the first batch of data belongs to the embedding table in processor 3, processor 3 does not obtain any gradient from the received message.
It should be noted that Fig. 12A to Fig. 12D and the related descriptions are merely an example and do not constitute a limitation on this application; variations made based on the above idea all fall within the protection scope of this application.
In summary, in this example the four processors obtain the gradients they each need through the three cycles of Fig. 12B to Fig. 12D. Because the processors communicate through the ring communication architecture, compared with the many-to-many message communication in existing technical solutions, this application avoids single-point communication bottlenecks, reduces communication latency and improves communication efficiency, thereby improving the training performance of the entire data training system.
In a possible implementation, in a specific data training process, any implementation of the data processing method shown in Fig. 9 and its possible implementations may be used together with any implementation of the data processing method shown in Fig. 11 and its possible implementations; that is, embedding parameter lookup is performed based on the ring communication architecture described above during forward propagation at the embedding layer of data training, and gradient obtaining is then performed using that ring communication architecture during back propagation at the embedding layer of the same data training.
Referring to Fig. 13, Fig. 13 is a schematic comparison of the communication throughput of the existing technical solution shown in Fig. 3 above and the solution provided in this application. Throughput here refers to the amount of data successfully sent per unit time. In Fig. 13, the horizontal axis represents the number of processors used by the data training system, increasing along the arrow; the vertical axis represents throughput, increasing along the arrow. It can be seen that with the many-to-many message communication of the existing solution, as the number of processors grows, the throughput changes little and may even drop. With the ring communication architecture used in this application for message communication, the throughput can grow as the number of processors grows, and it grows with excellent linearity. This is because the ring communication architecture can make full use of the network bandwidth and is less prone to the blocking and jitter of many-to-many message communication. In a possible implementation, by using this ring communication architecture for message communication in the forward and back propagation of the embedding layer, this application can reduce the communication latency to 10%-30% of the original, greatly improving communication efficiency and thereby the performance of the data training system.
The foregoing mainly describes the data processing method provided in the embodiments of this application. It can be understood that, to implement the corresponding functions, each device includes corresponding hardware structures and/or software modules for performing each function. In combination with the units and steps of the examples described in the embodiments disclosed herein, this application can be implemented in the form of hardware or a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the specific application and design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.
In the embodiments of this application, the device may be divided into functional modules according to the foregoing method examples. For example, each functional module may be divided corresponding to each function, or two or more functions may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. It should be noted that the division of modules in the embodiments of this application is schematic and is merely a logical functional division; there may be other division manners in actual implementation.
When each functional module is divided corresponding to each function, Fig. 14 shows a possible schematic logical structure of an apparatus. The apparatus may be the first processor in the method of Fig. 9 and its possible implementations, or may be a chip in the first processor, or may be a processing system in the first processor, or the like. The apparatus 1400 includes a sending unit 1401 and a receiving unit 1402, where:
the sending unit 1401 is configured to send a first lookup message to a second processor, where the first lookup message includes first data and the first lookup message is used to look up an embedding parameter of the first data; the second processor is the next-hop processor of the first processor in the ring communication architecture in which the first processor is located; the sending unit 1401 may be implemented by a sending interface or a transmitter and may perform the operations described in step 901 shown in Fig. 9; and
the receiving unit 1402 is configured to receive a second lookup message from a third processor, where the second lookup message includes second data and the second lookup message is used to look up an embedding parameter of the second data; the third processor is the previous-hop processor of the first processor in the ring communication architecture; the receiving unit 1402 may be implemented by a receiving interface or a receiver and may perform the operations described in step 902 shown in Fig. 9.
The first processor, the second processor and the third processor are processors among the N processors included in a data training system, where N is an integer greater than or equal to 3. The N processors communicate with each other through the ring communication architecture, in which each of the N processors receives messages only from its previous-hop processor and sends messages only to its next-hop processor.
In a possible implementation, the apparatus further includes an adding unit;
the adding unit is configured to: when embedding parameters of some or all of the second data are found based on the second lookup message, add the embedding parameters of the some or all data to the second lookup message to obtain a third lookup message;
the sending unit 1401 is further configured to send the third lookup message to the second processor;
or, the sending unit 1401 is further configured to: when no embedding parameter of the second data is found based on the second lookup message, send the second lookup message to the second processor.
In a possible implementation, the apparatus further includes a lookup unit;
the lookup unit is configured to look up, in a first embedding table, the embedding parameters to which the some or all data is mapped; the first embedding table is an embedding table maintained by the first processor for storing data and embedding parameters, and the data and the embedding parameters in the first embedding table are in one-to-one mapping;
the adding unit is specifically configured to add the embedding parameters to which the some or all data is mapped to the value fields corresponding to the some or all data in the second lookup message, to obtain the third lookup message;
the sending unit 1401 is specifically configured to send the third lookup message to the second processor, where the third lookup message is used to look up the embedding parameters of the data in the second data whose embedding parameters have not been found.
In a possible implementation, the apparatus further includes a determining unit and a generating unit;
the determining unit is configured to determine that the some or all data belongs to a first embedding table and that the first embedding table does not yet include the some or all data; the first embedding table is an embedding table maintained by the first processor for storing data and embedding parameters, and the data and the embedding parameters in the first embedding table are in one-to-one mapping;
the generating unit is configured to generate the embedding parameters corresponding to the some or all data;
the adding unit is specifically configured to add the embedding parameters corresponding to the some or all data to the value fields corresponding to the some or all data in the second lookup message, to obtain the third lookup message;
the sending unit 1401 is specifically configured to send the third lookup message to the second processor, where the third lookup message is used to look up the embedding parameters of the data in the second data whose embedding parameters have not been found.
In a possible implementation, the sending unit 1401 is specifically configured to:
when none of the second data belongs to the data in a first embedding table, send the second lookup message to the second processor; the first embedding table is an embedding table maintained by the first processor for storing data and embedding parameters, and the data and the embedding parameters in the first embedding table are in one-to-one mapping.
In a possible implementation, the receiving unit 1402 is further configured to receive a fourth lookup message from the third processor; the fourth lookup message includes third data and the embedding parameters to which a first part of the third data is mapped, and the fourth lookup message is used to look up the embedding parameters to which the data in the third data other than the first part is mapped;
the apparatus further includes an adding unit configured to: when embedding parameters of a second part of the third data are found based on the fourth lookup message, add the embedding parameters of the second part of the data to the fourth lookup message to obtain a fifth lookup message;
the sending unit 1401 is further configured to send the fifth lookup message to the second processor;
or, the sending unit 1401 is further configured to: when no embedding parameter of the third data is found based on the fourth lookup message, send the fourth lookup message to the second processor.
In a possible implementation, the receiving unit 1402 is further configured to receive a sixth lookup message from the third processor, where the sixth lookup message includes the first data and the embedding parameters of the first data.
For the specific operations and beneficial effects of the units in the apparatus 1400 shown in Fig. 14, refer to the corresponding descriptions in Fig. 9 and its possible method embodiments, which are not repeated here.
When each functional module is divided corresponding to each function, Fig. 15 shows a possible schematic logical structure of an apparatus. The apparatus may be the first processor in the method of Fig. 11 and its possible implementations, or may be a chip in the first processor, or may be a processing system in the first processor, or the like. The apparatus 1500 includes a sending unit 1501 and a receiving unit 1502, where:
the sending unit 1501 is configured to send a first notification message to a second processor, where the first notification message includes first data and a first gradient and is used to propagate the first gradient to a first target processor; the first gradient is the gradient corresponding to the embedding parameter of the first data; the second processor is the next-hop processor of the first processor in the ring communication architecture in which the first processor is located; the sending unit 1501 may be implemented by a sending interface or a transmitter and may perform the operations described in step 1101 shown in Fig. 11; and
the receiving unit 1502 is configured to receive a second notification message from a third processor, where the second notification message includes second data and a second gradient and is used to propagate the second gradient to a second target processor; the second gradient is the gradient corresponding to the embedding parameter of the second data; the third processor is the previous-hop processor of the first processor in the ring communication architecture; the receiving unit 1502 may be implemented by a receiving interface or a receiver and may perform the operations described in step 1102 shown in Fig. 11.
The first processor, the second processor and the third processor are processors among the N processors included in a data training system, where N is an integer greater than or equal to 3. The N processors communicate with each other through the ring communication architecture, in which each of the N processors receives messages only from its previous-hop processor and sends messages only to its next-hop processor.
In a possible implementation, the apparatus further includes an obtaining unit;
the obtaining unit is configured to: when the second notification message includes a first target gradient, obtain the first target gradient from the second notification message;
the sending unit 1501 is further configured to send the second notification message to the second processor; the first target gradient is a gradient of an embedding parameter in a first embedding table maintained by the first processor, and the data and the embedding parameters in the first embedding table are in one-to-one mapping;
or, the sending unit 1501 is further configured to: when the second notification message does not include the first target gradient, send the second notification message to the second processor.
In a possible implementation, the obtaining unit is specifically configured to:
determine that some or all of the second data is data in the first embedding table; and
obtain the first target gradient from the second notification message based on the some or all data.
In a possible implementation, the receiving unit 1502 is further configured to receive a third notification message from the third processor; the third notification message includes third data and a third gradient and is used to propagate the third gradient to a third target processor; the third gradient is the gradient corresponding to the embedding parameter of the third data;
the apparatus further includes an obtaining unit configured to: when the third notification message includes a second target gradient, obtain the second target gradient from the third notification message;
the sending unit 1501 is further configured to send the third notification message to the second processor; the second target gradient is a gradient of an embedding parameter in a first embedding table maintained by the first processor, and the first embedding table includes the mapping between data and the embedding parameters of the data;
or, the sending unit 1501 is further configured to: when the third notification message does not include the second target gradient, send the third notification message to the second processor.
For the specific operations and beneficial effects of the units in the apparatus 1500 shown in Fig. 15, refer to the corresponding descriptions in Fig. 11 and its possible method embodiments, which are not repeated here.
Fig. 16 is a schematic diagram of a possible hardware structure of the apparatus provided in this application. The apparatus may be the first processor in the methods of the foregoing embodiments. The apparatus 1600 includes a processor 1601, a memory 1602 and a communication interface 1603. The processor 1601, the communication interface 1603 and the memory 1602 may be connected to each other, or connected to each other through a bus 1604.
For example, the memory 1602 is configured to store the computer program and data of the apparatus 1600, and the memory 1602 may include but is not limited to a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a compact disc read-only memory (CD-ROM), and the like.
When the embodiment shown in Fig. 14 is implemented, the software or program code required to perform the functions of all or some of the units in Fig. 14 is stored in the memory 1602.
When the embodiment of Fig. 14 is implemented, if the software or program code required for the functions of only some of the units is stored in the memory 1602, the processor 1601 not only invokes the program code in the memory 1602 to implement some of the functions, but also cooperates with other components (such as the communication interface 1603) to jointly complete the other functions described in the embodiment of Fig. 14 (such as the function of receiving or sending messages).
When the embodiment shown in Fig. 15 is implemented, the software or program code required to perform the functions of all or some of the units in Fig. 15 is stored in the memory 1602.
When the embodiment of Fig. 15 is implemented, if the software or program code required for the functions of only some of the units is stored in the memory 1602, the processor 1601 not only invokes the program code in the memory 1602 to implement some of the functions, but also cooperates with other components (such as the communication interface 1603) to jointly complete the other functions described in the embodiment of Fig. 15 (such as the function of receiving or sending messages).
The communication interface 1603 includes a sending interface and a receiving interface. There may be multiple communication interfaces 1603, configured to support the apparatus 1600 in communicating, for example, receiving or sending data or messages.
For example, the processor 1601 may be a central processing unit, a general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field-programmable gate array or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof. The processor may also be a combination that implements computing functions, for example, a combination including one or more microprocessors, or a combination of a digital signal processor and a microprocessor. The processor 1601 may be configured to read the program stored in the memory 1602 and perform any data processing method described in Fig. 9 and its possible embodiments; or the processor 1601 may be configured to read the program stored in the memory 1602 and perform any data processing method described in Fig. 11 and its possible embodiments; or the processor 1601 may be configured to read the program stored in the memory 1602 and perform any data processing method described in Fig. 9 and its possible embodiments and/or any data processing method described in Fig. 11 and its possible embodiments.
In a possible implementation, the processor 1601 may be configured to read the program stored in the memory 1602 and perform the following operations:
sending a first lookup message to a second processor through the sending interface, where the first lookup message includes first data and the first lookup message is used to look up an embedding parameter of the first data; the second processor is the next-hop processor of the first processor in the ring communication architecture in which the first processor is located; and
receiving a second lookup message from a third processor through the receiving interface, where the second lookup message includes second data and the second lookup message is used to look up an embedding parameter of the second data; the third processor is the previous-hop processor of the first processor in the ring communication architecture;
where the first processor, the second processor and the third processor are processors among the N processors included in a data training system, N being an integer greater than or equal to 3; the N processors communicate with each other through the ring communication architecture, in which each of the N processors receives messages only from its previous-hop processor and sends messages only to its next-hop processor.
In another possible implementation, the processor 1601 may be configured to read the program stored in the memory 1602 and perform the following operations:
sending a first notification message to a second processor through the sending interface, where the first notification message includes first data and a first gradient and is used to propagate the first gradient to a first target processor; the first gradient is the gradient corresponding to the embedding parameter of the first data; the second processor is the next-hop processor of the first processor in the ring communication architecture in which the first processor is located; and
receiving a second notification message from a third processor through the receiving interface, where the second notification message includes second data and a second gradient and is used to propagate the second gradient to a second target processor; the second gradient is the gradient corresponding to the embedding parameter of the second data; the third processor is the previous-hop processor of the first processor in the ring communication architecture;
where the first processor, the second processor and the third processor are processors among the N processors included in a data training system, N being an integer greater than or equal to 3; the N processors communicate with each other through the ring communication architecture, in which each of the N processors receives messages only from its previous-hop processor and sends messages only to its next-hop processor.
For the specific operations and beneficial effects of the units in the apparatus 1600 shown in Fig. 16, refer to the corresponding descriptions in Fig. 9 and/or Fig. 11 and their possible method embodiments, which are not repeated here.
An embodiment of this application further provides a computer-readable storage medium storing a computer program, where the computer program is executed by a processor to implement the method described in any one of the embodiments of Fig. 9 and/or Fig. 11 and their possible method embodiments.
An embodiment of this application further provides a computer program product; when the computer program product is read and executed by a computer, the method described in any one of the embodiments of Fig. 9 and/or Fig. 11 and their possible method embodiments is performed.
In summary, the message communication among the N processors during forward and back propagation at the embedding layer can be implemented through the ring communication architecture. By using ring communication to exchange messages, compared with the many-to-many message communication in existing technical solutions, this application can make full use of the bandwidth resources between processors, avoids single-point communication bottlenecks, reduces communication latency and improves communication efficiency, thereby improving the training efficiency and performance of the entire data training system.
Finally, it should be noted that the foregoing embodiments are merely intended to describe the technical solutions of this application rather than to limit them. Although this application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that modifications may still be made to the technical solutions described in the foregoing embodiments, or equivalent replacements may be made to some or all of the technical features therein, and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of this application.

Claims (24)

  1. A data processing method, characterized in that the method comprises:
    sending, by a first processor, a first lookup message to a second processor, wherein the first lookup message comprises first data, the first lookup message is used to look up an embedding parameter of the first data, and the second processor is a next-hop processor of the first processor in a ring communication architecture in which the first processor is located; and
    receiving, by the first processor, a second lookup message from a third processor, wherein the second lookup message comprises second data, the second lookup message is used to look up an embedding parameter of the second data, and the third processor is a previous-hop processor of the first processor in the ring communication architecture;
    wherein the first processor, the second processor and the third processor are processors among N processors comprised in a data training system, N being an integer greater than or equal to 3; the N processors communicate with each other through the ring communication architecture, in which each of the N processors receives messages only from its previous-hop processor and sends messages only to its next-hop processor.
  2. The method according to claim 1, characterized in that the method further comprises:
    when embedding parameters of some or all of the second data are found based on the second lookup message, adding, by the first processor, the embedding parameters of the some or all data to the second lookup message to obtain a third lookup message, and sending the third lookup message to the second processor;
    or, when no embedding parameter of the second data is found based on the second lookup message, sending, by the first processor, the second lookup message to the second processor.
  3. The method according to claim 2, characterized in that the when embedding parameters of some or all of the second data are found based on the second lookup message, adding, by the first processor, the embedding parameters of the some or all data to the second lookup message to obtain a third lookup message, and sending the third lookup message to the second processor comprises:
    looking up, by the first processor in a first embedding table, the embedding parameters to which the some or all data is mapped, wherein the first embedding table is an embedding table maintained by the first processor for storing data and embedding parameters, and the data and the embedding parameters in the first embedding table are in one-to-one mapping;
    adding, by the first processor, the embedding parameters to which the some or all data is mapped to value fields corresponding to the some or all data in the second lookup message, to obtain the third lookup message; and
    sending, by the first processor, the third lookup message to the second processor, wherein the third lookup message is used to look up embedding parameters of data in the second data whose embedding parameters have not been found.
  4. The method according to claim 2, characterized in that the when no embedding parameter of the second data is found based on the second lookup message, sending, by the first processor, the second lookup message to the second processor comprises:
    when none of the second data belongs to data in a first embedding table, sending, by the first processor, the second lookup message to the second processor, wherein the first embedding table is an embedding table maintained by the first processor for storing data and embedding parameters, and the data and the embedding parameters in the first embedding table are in one-to-one mapping.
  5. The method according to any one of claims 1 to 4, characterized in that the method further comprises:
    receiving, by the first processor, a fourth lookup message from the third processor, wherein the fourth lookup message comprises third data and embedding parameters to which a first part of the third data is mapped, and the fourth lookup message is used to look up embedding parameters to which data in the third data other than the first part is mapped; and
    when embedding parameters of a second part of the third data are found based on the fourth lookup message, adding, by the first processor, the embedding parameters of the second part of the data to the fourth lookup message to obtain a fifth lookup message, and sending the fifth lookup message to the second processor;
    or, when no embedding parameter of the third data is found based on the fourth lookup message, sending, by the first processor, the fourth lookup message to the second processor.
  6. The method according to any one of claims 1 to 5, characterized in that the method further comprises:
    receiving, by the first processor, a sixth lookup message from the third processor, wherein the sixth lookup message comprises the first data and the embedding parameters of the first data.
  7. A data processing method, characterized in that the method comprises:
    sending, by a first processor, a first notification message to a second processor, wherein the first notification message comprises first data and a first gradient and is used to propagate the first gradient to a first target processor; the first gradient is a gradient corresponding to an embedding parameter of the first data; and the second processor is a next-hop processor of the first processor in a ring communication architecture in which the first processor is located; and
    receiving, by the first processor, a second notification message from a third processor, wherein the second notification message comprises second data and a second gradient and is used to propagate the second gradient to a second target processor; the second gradient is a gradient corresponding to an embedding parameter of the second data; and the third processor is a previous-hop processor of the first processor in the ring communication architecture;
    wherein the first processor, the second processor and the third processor are processors among N processors comprised in a data training system, N being an integer greater than or equal to 3; the N processors communicate with each other through the ring communication architecture, in which each of the N processors receives messages only from its previous-hop processor and sends messages only to its next-hop processor.
  8. The method according to claim 7, characterized in that the method further comprises:
    when the second notification message comprises a first target gradient, obtaining, by the first processor, the first target gradient from the second notification message, and sending the second notification message to the second processor, wherein the first target gradient is a gradient of an embedding parameter in a first embedding table maintained by the first processor, and the data and the embedding parameters in the first embedding table are in one-to-one mapping;
    or, when the second notification message does not comprise the first target gradient, sending, by the first processor, the second notification message to the second processor.
  9. The method according to claim 8, characterized in that the when the second notification message comprises a first target gradient, obtaining, by the first processor, the first target gradient from the second notification message comprises:
    determining, by the first processor, that some or all of the second data is data in the first embedding table; and
    obtaining, by the first processor, the first target gradient from the second notification message based on the some or all data.
  10. The method according to any one of claims 7 to 9, characterized in that the method further comprises:
    receiving, by the first processor, a third notification message from the third processor, wherein the third notification message comprises third data and a third gradient and is used to propagate the third gradient to a third target processor, the third gradient being a gradient corresponding to an embedding parameter of the third data; and
    when the third notification message comprises a second target gradient, obtaining, by the first processor, the second target gradient from the third notification message, and sending the third notification message to the second processor, wherein the second target gradient is a gradient of an embedding parameter in a first embedding table maintained by the first processor, and the first embedding table comprises a mapping between data and embedding parameters of the data;
    or, when the third notification message does not comprise the second target gradient, sending, by the first processor, the third notification message to the second processor.
  11. A data processing apparatus, characterized in that the apparatus comprises:
    a sending unit, configured to send a first lookup message to a second processor, wherein the first lookup message comprises first data, the first lookup message is used to look up an embedding parameter of the first data, and the second processor is a next-hop processor of the first processor in a ring communication architecture in which the first processor is located; and
    a receiving unit, configured to receive a second lookup message from a third processor, wherein the second lookup message comprises second data, the second lookup message is used to look up an embedding parameter of the second data, and the third processor is a previous-hop processor of the first processor in the ring communication architecture;
    wherein the first processor, the second processor and the third processor are processors among N processors comprised in a data training system, N being an integer greater than or equal to 3; the N processors communicate with each other through the ring communication architecture, in which each of the N processors receives messages only from its previous-hop processor and sends messages only to its next-hop processor.
  12. The apparatus according to claim 11, characterized in that the apparatus further comprises an adding unit;
    the adding unit is configured to: when embedding parameters of some or all of the second data are found based on the second lookup message, add the embedding parameters of the some or all data to the second lookup message to obtain a third lookup message;
    the sending unit is further configured to send the third lookup message to the second processor;
    or, the sending unit is further configured to: when no embedding parameter of the second data is found based on the second lookup message, send the second lookup message to the second processor.
  13. The apparatus according to claim 12, characterized in that the apparatus further comprises a lookup unit;
    the lookup unit is configured to look up, in a first embedding table, the embedding parameters to which the some or all data is mapped, wherein the first embedding table is an embedding table maintained by the first processor for storing data and embedding parameters, and the data and the embedding parameters in the first embedding table are in one-to-one mapping;
    the adding unit is specifically configured to add the embedding parameters to which the some or all data is mapped to value fields corresponding to the some or all data in the second lookup message, to obtain the third lookup message;
    the sending unit is specifically configured to send the third lookup message to the second processor, wherein the third lookup message is used to look up embedding parameters of data in the second data whose embedding parameters have not been found.
  14. The apparatus according to claim 12, characterized in that the sending unit is specifically configured to:
    when none of the second data belongs to data in a first embedding table, send the second lookup message to the second processor, wherein the first embedding table is an embedding table maintained by the first processor for storing data and embedding parameters, and the data and the embedding parameters in the first embedding table are in one-to-one mapping.
  15. The apparatus according to any one of claims 11 to 14, characterized in that:
    the receiving unit is further configured to receive a fourth lookup message from the third processor, wherein the fourth lookup message comprises third data and embedding parameters to which a first part of the third data is mapped, and the fourth lookup message is used to look up embedding parameters to which data in the third data other than the first part is mapped;
    the apparatus further comprises an adding unit, configured to: when embedding parameters of a second part of the third data are found based on the fourth lookup message, add the embedding parameters of the second part of the data to the fourth lookup message to obtain a fifth lookup message;
    the sending unit is further configured to send the fifth lookup message to the second processor;
    or, the sending unit is further configured to: when no embedding parameter of the third data is found based on the fourth lookup message, send the fourth lookup message to the second processor.
  16. The apparatus according to any one of claims 11 to 15, characterized in that the receiving unit is further configured to:
    receive a sixth lookup message from the third processor, wherein the sixth lookup message comprises the first data and the embedding parameters of the first data.
  17. A data processing apparatus, characterized in that the apparatus comprises:
    a sending unit, configured to send a first notification message to a second processor, wherein the first notification message comprises first data and a first gradient and is used to propagate the first gradient to a first target processor; the first gradient is a gradient corresponding to an embedding parameter of the first data; and the second processor is a next-hop processor of the first processor in a ring communication architecture in which the first processor is located; and
    a receiving unit, configured to receive a second notification message from a third processor, wherein the second notification message comprises second data and a second gradient and is used to propagate the second gradient to a second target processor; the second gradient is a gradient corresponding to an embedding parameter of the second data; and the third processor is a previous-hop processor of the first processor in the ring communication architecture;
    wherein the first processor, the second processor and the third processor are processors among N processors comprised in a data training system, N being an integer greater than or equal to 3; the N processors communicate with each other through the ring communication architecture, in which each of the N processors receives messages only from its previous-hop processor and sends messages only to its next-hop processor.
  18. The apparatus according to claim 17, characterized in that the apparatus further comprises an obtaining unit;
    the obtaining unit is configured to: when the second notification message comprises a first target gradient, obtain the first target gradient from the second notification message;
    the sending unit is further configured to send the second notification message to the second processor, wherein the first target gradient is a gradient of an embedding parameter in a first embedding table maintained by the first processor, and the data and the embedding parameters in the first embedding table are in one-to-one mapping;
    or, the sending unit is further configured to: when the second notification message does not comprise the first target gradient, send the second notification message to the second processor.
  19. The apparatus according to claim 18, characterized in that the obtaining unit is specifically configured to:
    determine that some or all of the second data is data in the first embedding table; and
    obtain the first target gradient from the second notification message based on the some or all data.
  20. The apparatus according to any one of claims 17 to 19, characterized in that:
    the receiving unit is further configured to receive a third notification message from the third processor, wherein the third notification message comprises third data and a third gradient and is used to propagate the third gradient to a third target processor, the third gradient being a gradient corresponding to an embedding parameter of the third data;
    the apparatus further comprises an obtaining unit, configured to: when the third notification message comprises a second target gradient, obtain the second target gradient from the third notification message;
    the sending unit is further configured to send the third notification message to the second processor, wherein the second target gradient is a gradient of an embedding parameter in a first embedding table maintained by the first processor, and the first embedding table comprises a mapping between data and embedding parameters of the data;
    or, the sending unit is further configured to: when the third notification message does not comprise the second target gradient, send the third notification message to the second processor.
  21. An apparatus, characterized in that the apparatus comprises a processor and a memory, wherein the memory is configured to store a computer program, and the processor is configured to execute the computer program stored in the memory, so that the apparatus performs the method according to any one of claims 1 to 6, or so that the apparatus performs the method according to any one of claims 7 to 10.
  22. A data training system, characterized in that the system comprises N processors, N being an integer greater than or equal to 3; the N processors communicate with each other through a ring communication architecture, in which each of the N processors receives messages only from its previous-hop processor and sends messages only to its next-hop processor; and each of the N processors may be the apparatus according to any one of claims 11 to 16, or each of the N processors may be the apparatus according to any one of claims 17 to 20, or each of the N processors may be the apparatus according to claim 21.
  23. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, and the computer program is executed by a processor to implement the method according to any one of claims 1 to 6, or the computer program is executed by a processor to implement the method according to any one of claims 7 to 10.
  24. A computer program product, characterized in that, when the computer program product is executed by a processor, the method according to any one of claims 1 to 6 is performed, or the method according to any one of claims 7 to 10 is performed.