CN113254206A - Data processing system and method thereof - Google Patents


Info

Publication number
CN113254206A
CN113254206A
Authority
CN
China
Prior art keywords
data
data processing
executor
group
backward
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110568222.3A
Other languages
Chinese (zh)
Other versions
CN113254206B (en)
Inventor
李新奇
成诚
柳俊丞
李一鹏
袁进辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Oneflow Technology Co Ltd
Original Assignee
Beijing Oneflow Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Oneflow Technology Co Ltd filed Critical Beijing Oneflow Technology Co Ltd
Priority to CN202110568222.3A
Publication of CN113254206A
Application granted
Publication of CN113254206B
Active legal status
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 - Multiprogramming arrangements
    • G06F 9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5061 - Partitioning or combining of resources
    • G06F 9/5066 - Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/44 - Arrangements for executing specific programs
    • G06F 9/448 - Execution paradigms, e.g. implementations of programming paradigms
    • G06F 9/4488 - Object-oriented
    • G06F 9/4492 - Inheritance
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/44 - Arrangements for executing specific programs
    • G06F 9/448 - Execution paradigms, e.g. implementations of programming paradigms
    • G06F 9/4498 - Finite state machines
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 - Multiprogramming arrangements
    • G06F 9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5083 - Techniques for rebalancing the load in a distributed system

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure provides a data processing system and a method thereof. The system comprises a plurality of data executor groups, the number of which equals the predetermined number of processing stages of the data to be processed, wherein each data executor group comprises at least one data processing executor. After startup, the data processing executors in the first data executor group of the plurality of data executor groups continuously execute forward data processing a predetermined number of times and then alternately execute forward data processing and backward data processing. The data processing executors in the data executor groups after the first data executor group execute, in the order in which data is received, either forward data processing on the forward result data generated by the data processing executors in the preceding data executor group executing forward data processing, or backward data processing on the model update parameters generated by the data processing executors in the following data executor group executing backward data processing.

Description

Data processing system and method thereof
Technical Field
The present disclosure relates to a data processing technology, and more particularly, to a data processing system and a method thereof.
Background
With the development of machine learning and the deepening of research on artificial neural networks, the concept of deep learning has received wide attention and application. Deep learning is a special kind of machine learning that represents the objects being learned with a layered, networked structure: complex abstract concepts are built up by combining simpler concepts, and abstract representations are computed from simpler ones. Deep learning has made great advances in image recognition, speech recognition, and natural language processing. Because deep-learning models involve many parameters, the amount of computation is huge and the scale of training data is large, so more computing resources are required.
Currently, both general-purpose processors such as GPUs and special-purpose chips such as TPUs are many times more powerful than CPUs, but the appetite of real-world applications for computing power is endless: practitioners need to process larger data with larger models at a faster rate, and this cannot be satisfied by a single hardware device. The development of hardware is limited by the manufacturing process (chip area, power consumption, clock-signal propagation range), so the processing capability of a single chip cannot be increased without limit. Therefore, many high-throughput devices are connected together by high-speed interconnect technology to cooperatively perform large-scale tasks.
For this purpose, intra-layer parallelism and inter-layer parallelism have been proposed by those skilled in the art. In intra-layer parallelism, multiple devices jointly participate in each iteration of training the neural network model. Its advantage is that the work of one training iteration is shared across the devices; its disadvantage is that data transmission takes more time. Inter-layer parallelism divides the whole flow of the neural network model into different stages; the devices participating in different stages differ from each other, and the stages have a definite order. The input data for inter-layer parallelism is divided into small batches and trained batch by batch, each group of data being regarded as a mini-batch. The advantage of inter-layer parallelism is that transmission is required only at the handover between stages, but the disadvantage is the tension between iteration speed and convergence.
Inter-layer parallelism can be further divided into bulk synchronous parallelism (BSP), asynchronous parallelism (ASP), and stale synchronous parallelism (SSP). Bulk synchronous parallelism must wait for all machines to finish computing in every iteration, so it is the slowest but converges best. Asynchronous parallel machines do not wait for each other at all; they are fast but do not necessarily converge. Stale synchronous parallelism allows a degree of computational inconsistency, obtaining some speed and some convergence.
Therefore, it is desirable to obtain a data processing system that reduces the time spent on transmission by using inter-layer parallelism while effectively resolving the tension between the iteration speed and the convergence of inter-layer parallelism.
Disclosure of Invention
The object of the present invention is to solve the above technical problems. According to the present disclosure, there is provided a data processing system comprising a plurality of data executor groups, the number of which equals the predetermined number of processing stages of the data to be processed, wherein each data executor group comprises at least one data processing executor. After startup, the data processing executors in the first data executor group of the plurality of data executor groups continuously execute forward data processing a predetermined number of times and then alternately execute forward data processing and backward data processing. The data processing executors in the data executor groups after the first data executor group execute, in the order in which data is received, either forward data processing on the forward result data that they obtain, generated by the data processing executors in the preceding data executor group executing forward data processing, or backward data processing on the model update parameters that they obtain, generated by the data processing executors in the following data executor group executing backward data processing.
In the data processing system according to the present disclosure, when each data executor group performs one forward data processing or one backward data processing, if there are one or more pieces of forward result data generated by the data processing executors in its preceding data executor group that still require forward data processing and, at the same time, one or more model update parameters generated by the data processing executors in its following data executor group that still require backward data processing, the backward data processing is performed first on the received model update parameters.
A data processing system according to the present disclosure, wherein the predetermined number of times is equal to a number of groups of the data executor groups.
The data processing system according to the present disclosure further comprises: a data executor grouping component that groups a plurality of data executors into the plurality of data executor groups based on the amount of computation of each of the processing stages of the data to be processed, so as to balance the amount of computation that each data executor performs for each batch of data.
The data processing system according to the present disclosure further comprises: a processing stage dividing component that divides the data processing process evenly into a plurality of data processing scenes corresponding to the processing stages such that the data processing time of each scene is uniform.
The data processing system according to the present disclosure, wherein the model parameter memory in each data executor group stores at most a number of versions of the model update parameters equal to the predetermined number of times, so that the forward data processing being executed uses the latest version of the model update parameters stored in the model parameter memory.
According to the data processing system of the present disclosure, when the first data executor group has not obtained a first-version model update parameter after continuously executing forward data processing the predetermined number of times after startup, the first data executor group waits for the first-version model update parameter of the second data executor group, generated by the backward data processing executed by the second data executor group, and then executes backward data processing based on that parameter so as to output the first-version model update parameter of the first data executor group.
According to another aspect of the present disclosure, there is provided a data processing method comprising: dividing the data processing process evenly into a plurality of data processing scenes corresponding to the processing stages so that the data processing time of each scene is consistent; grouping a plurality of data executors into a plurality of data executor groups based on the amount of computation of each of the processing stages of the data to be processed, wherein each data executor group comprises at least one data processing executor; causing the data processing executors in the first data executor group of the plurality of data executor groups, after startup, to continuously execute forward data processing a predetermined number of times and then alternately execute forward data processing and backward data processing; and causing the data processing executors in the data executor groups after the first data executor group to execute, in the order in which data is received, either forward data processing on the forward result data generated by the data processing executors in the preceding data executor group executing forward data processing, or backward data processing on the model update parameters generated by the data processing executors in the following data executor group executing backward data processing.
According to the data processing method of the present disclosure, when each data executor group performs one forward data processing or one backward data processing, if there are one or more pieces of forward result data generated by the data processing executors in its preceding data executor group that still require forward data processing and, at the same time, one or more model update parameters generated by the data processing executors in its following data executor group that still require backward data processing, the backward data processing is performed first on the received model update parameters.
The data processing method according to the present disclosure, wherein the predetermined number of times is equal to the number of groups of the data executor group.
The data processing method according to the present disclosure further includes: storing, in the model parameter memory in each data executor group, at most a number of versions of the model update parameters equal to the predetermined number of times, so that the forward data processing being executed uses the latest version of the model update parameters stored in the model parameter memory.
According to the data processing method of the present disclosure, when the first data executor group has not obtained a first-version model update parameter after continuously executing forward data processing the predetermined number of times after startup, the first data executor group waits for the first-version model update parameter of the second data executor group, generated by the backward data processing executed by the second data executor group, and then executes backward data processing based on that parameter so as to output the first-version model update parameter of the first data executor group.
By adopting the distributed data processing system and the distributed pipeline task processing method described above, the model used for data processing is optimally divided into a plurality of stages and deployed on a plurality of computing devices that perform data processing in parallel, so that inter-layer synchronization is realized and communication consumption is effectively reduced; the distributed pipeline task processing method eliminates iteration waiting time while ensuring the consistency of forward and backward computation versions, thereby resolving the tension between inter-layer synchronous iteration speed and convergence.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention.
Drawings
The disclosure may be better understood by describing exemplary embodiments thereof in conjunction with the following drawings, in which:
FIG. 1 is a schematic block diagram of a data processing system according to the present disclosure.
FIG. 2 is a schematic diagram illustrating a first startup phase of a data processing process performed by data processing system 10 according to the present disclosure.
FIG. 3 is a schematic diagram illustrating a second startup phase of a data processing process performed by data processing system 10 according to the present disclosure.
FIG. 4 is a schematic diagram of the operation of data processing system 10 in performing a stabilization phase of a data processing process according to the present disclosure.
FIG. 5 is a schematic diagram illustrating a first embodiment of a data processing system 10 that performs a data processing process stabilization phase according to the present disclosure.
FIG. 6 is a schematic diagram illustrating a second embodiment of a data processing system 10 according to the present disclosure performing a stabilization phase of a data processing process.
Detailed Description
In the following description of the embodiments of the present disclosure, it is noted that, in the interest of brevity and conciseness, not all features of an actual implementation may be described in this specification. It should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another. Moreover, it should be appreciated that such a development effort might be complex and time-consuming, but would nevertheless be a routine undertaking of design, fabrication, and manufacture for those of ordinary skill having the benefit of this disclosure.
Unless otherwise defined, technical or scientific terms used in the claims and the specification should have the ordinary meaning as understood by those of ordinary skill in the art to which this disclosure belongs. The use of "first," "second," and similar terms in the description and claims of the present disclosure are not intended to indicate any order, quantity, or importance, but rather are used to distinguish one element from another. The terms "a" or "an," and the like, do not denote a limitation of quantity, but rather denote the presence of at least one. The word "comprise" or "comprises", and the like, means that the element or item listed before "comprises" or "comprising" covers the element or item listed after "comprising" or "comprises" and its equivalent, and does not exclude other elements or items. The terms "connected" or "coupled" and the like are not restricted to physical or mechanical connections, nor are they restricted to direct or indirect connections.
The present invention will be described in further detail with reference to the following examples and the accompanying drawings so that those skilled in the art can practice the invention with reference to the description.
FIG. 1 is a schematic block diagram of a data processing system according to the present disclosure. As shown in FIG. 1, the data processing system 10 includes a plurality of data executor groups, such as a first data executor group 10-1, a second data executor group 10-2, a third data executor group 10-3, and a fourth data executor group 10-4. Although four are shown here, at least two or more may be deployed according to actual needs; for convenience of explanation, only four data executor groups are described herein. Each streaming data processing layer is provided with a corresponding data executor group, and each data executor group executes one stage of the data processing for one batch of data. In this way, for multiple batches of data that are input in a streaming manner, different data executor groups can process different batches in parallel, thereby realizing pipelined processing. Since each data executor group generates data transfer only when result data is passed between groups, the amount of data transfer is greatly reduced compared with intra-layer parallelism. To this end, the data processing system 10 comprises a number of data executor groups equal to the predetermined number of processing stages of the data to be processed, and each data executor group comprises at least one data processing executor (which will be described in detail below).
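As an illustrative aid only, the following Python sketch models the structure just described: one executor group per pipeline stage, each holding at least one executor. All class and field names (DataProcessingExecutor, DataExecutorGroup, DataProcessingSystem, num_stages) are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataProcessingExecutor:
    """One executor; in practice this would wrap a device or kernel context."""
    executor_id: int


@dataclass
class DataExecutorGroup:
    """One group per processing stage; holds at least one executor."""
    stage_index: int
    executors: List[DataProcessingExecutor] = field(default_factory=list)


@dataclass
class DataProcessingSystem:
    """The number of groups equals the predetermined number of processing stages."""
    groups: List[DataExecutorGroup]

    @classmethod
    def build(cls, num_stages: int, executors_per_group: int = 1) -> "DataProcessingSystem":
        groups = [
            DataExecutorGroup(
                stage_index=s,
                executors=[DataProcessingExecutor(executor_id=s * executors_per_group + i)
                           for i in range(executors_per_group)],
            )
            for s in range(num_stages)
        ]
        return cls(groups=groups)


# Example: a four-stage pipeline such as groups 10-1 through 10-4 in FIG. 1.
system = DataProcessingSystem.build(num_stages=4)
assert len(system.groups) == 4
```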
In order for each data executor group to determine whether the data processing it executes is forward data processing or backward data processing, the present disclosure causes each data executor group, under generally stable conditions, to alternate between forward data processing and backward data processing. Since each forward data processing needs to use the model update parameters obtained by the latest backward data processing, there may be cases in which the model update parameters are not obtained in time, leaving a backward data processing waiting for a long time. To prevent long waits between a backward data processing and the adjacent forward data processing, and to prevent the version of the model parameters used by forward data processing from becoming too old in the subsequent stable state, the data processing executors in the first data executor group 10-1 of the plurality of data executor groups continuously execute forward data processing a predetermined number of times after startup and then alternately execute forward data processing and backward data processing, while the data processing executors in the data executor groups after the first data executor group 10-1 execute, in the order in which data is received, either forward data processing on the forward result data generated by the data processing executors in the preceding data executor group executing forward data processing, or backward data processing on the model update parameters generated by the data processing executors in the following data executor group executing backward data processing. To prevent a data executor group stalled on forward data processing from propagating the stall to other data executor groups, the other data executor groups may give priority to backward data processing: when a data executor group is to perform one forward or backward data processing, if one or more pieces of forward result data from its preceding group still require forward data processing and, at the same time, one or more model update parameters from its following group still require backward data processing, the backward data processing is performed first on the received model update parameters. By prioritizing backward data processing, the backward data processing can be prevented from waiting for a long time, so that each data executor group can use updated model parameters when performing forward data processing. Because backward data processing is executed first, the data executor group able to execute backward data processing can notify the corresponding data executor group executing forward data processing as early as possible to release the storage space of the data needed by the backward data processing, so that the group executing forward data processing can reuse that storage space. Conversely, if backward data processing were not given priority, the data required for it would remain in memory until the backward data processing is finally performed and the data can be released. When storage space is tight, a data executor group whose storage space cannot be released will be unable to perform forward data processing on the next batch of data, which may cause the whole streaming data processing to stall at that group and thereby reduce the speed of the whole data processing system.
FIG. 2 is a schematic diagram illustrating a first startup phase of a data processing process performed by data processing system 10 according to the present disclosure, used when memory space is insufficient. In FIG. 2, a blank block indicates forward data processing and an upward-slanted block indicates backward data processing. After startup, the data processing executors in the first data executor group 10-1 continuously execute forward data processing a predetermined number of times, for example four times; that is, four forward data processing operations are executed in succession on four batches of data (data 1, data 2, data 3, and data 4) based on the initial model parameters, and the resulting data are transmitted to the second data executor group 10-2 corresponding to the next data processing stage, then in turn to the third data executor group 10-3 and the fourth data executor group 10-4. Thus, although the first, second, third, and fourth data executor groups 10-1, 10-2, 10-3, and 10-4 all show the same forward or backward data processing numbers in FIG. 2, the data objects they actually process are not the same; for example, forward data processing number "2" of the second data executor group 10-2 processes the result data generated by forward data processing number "2" of the first data executor group 10-1, and so on. As shown in FIG. 2, the first data executor group 10-1 then performs a backward data processing, i.e., a model parameter update. Because the backward data processing for the first batch (data 1) fails to return in time, the first data executor group 10-1 must wait briefly in the startup phase for the model parameter update of that backward data processing, so that the updated model parameters are used in the forward data processing of subsequent batches. According to the data processing system of the present disclosure, such waiting occurs only in the initial startup phase and does not occur during subsequent steady operation. Because the first data executor group 10-1 waits briefly in the startup phase, the second data executor group 10-2 may not yet have received a backward data processing result in the startup phase; it therefore first executes forward data processing continuously a certain number of times, for example three times, and performs the backward data processing first once a backward data processing result is received. Similarly, because the second data executor group 10-2 waits briefly in the startup phase, the third data executor group 10-3 may not yet have received a backward data processing result in the startup phase; it therefore first executes forward data processing continuously, for example two times, and then executes backward data processing when a backward data processing result is received. Finally, the fourth data executor group 10-4 is the last group and therefore directly alternates forward and backward data processing during the startup phase.
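As a purely illustrative aid, the following small sketch (hypothetical names, assuming the predetermined number equals the number of groups) computes how many warm-up forward passes each group performs before it starts alternating with backward passes, matching the 4/3/2/1 pattern described for FIG. 2.

```python
def warmup_forward_count(stage_index: int, num_stages: int) -> int:
    """Number of consecutive forward passes group `stage_index` (0-based)
    runs at startup before it begins alternating forward and backward passes.
    The first group runs `num_stages` forward passes; the last group runs one."""
    return num_stages - stage_index


# For the four groups 10-1 .. 10-4 of FIG. 2:
print([warmup_forward_count(s, 4) for s in range(4)])  # [4, 3, 2, 1]
```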
FIG. 3 is a schematic diagram illustrating a second startup phase of a data processing process performed by data processing system 10 according to the present disclosure, used when memory space is sufficient. In FIG. 3, as in FIG. 2, a blank block indicates forward data processing and an upward-slanted block indicates backward data processing. The startup behavior is the same as described for FIG. 2: after startup, the data processing executors in the first data executor group 10-1 continuously execute forward data processing a predetermined number of times, for example four times, on the four batches data 1 to data 4 based on the initial model parameters, transmitting the resulting data in turn to the second data executor group 10-2, the third data executor group 10-3, and the fourth data executor group 10-4; the first group waits briefly for the model parameter update of the backward data processing for the first batch, the second and third groups first execute forward data processing continuously (for example three times and two times, respectively) before a backward data processing result arrives, and the fourth group, being the last, directly alternates forward and backward data processing during the startup phase. As in FIG. 2, such waiting occurs only in the initial startup phase and not during subsequent steady operation.
In the startup phase, because the first data executor group 10-1 waits briefly, there are cases in which the second data executor group 10-2 and the third data executor group 10-3 receive backward data processing results first and therefore perform backward data processing twice in succession. After the last data executor group, for example the fourth data executor group in FIG. 3, has performed the forward data processing of the fourth batch (data 4), the data processing system 10 enters the stable phase. As shown in FIG. 2, when the first data executor group 10-1 has not obtained a first-version model update parameter after continuously executing forward data processing the predetermined number of times after startup, it waits for the first-version model update parameter of the second data executor group, generated by the backward data processing executed by the second data executor group, and then executes backward data processing based on that parameter so as to output the first-version model update parameter of the first data executor group. That is, although every data executor group has a first version of the model update parameters, the specific parameters differ from group to group, and the same holds for the other versions.
FIG. 4 is a schematic diagram illustrating a data processing system 10 performing the stable phase of a data processing process according to the present disclosure. As shown in FIG. 4, after the data processing system 10 enters the stable phase, each data executor group alternates between one forward data processing and one backward data processing, and each forward data processing uses the most recently updated model data that the group previously obtained through backward data processing. Because the data processing executors in the first data executor group 10-1 execute forward data processing continuously a predetermined number of times after startup, as shown in FIG. 2 or FIG. 3, the times for forward data processing and backward data processing shown in FIGS. 2-4 are not exactly equal; the difference is indicated by the length of the schematic blocks. A backward data processing update takes longer than the corresponding forward data processing because backward data processing must also update the parameters while propagating the gradient; for data processing dominated by matrix multiplication, the amount of computation of backward data processing is roughly twice that of forward data processing, whereas for some linear data processing the two amounts are substantially the same. Therefore, the fourth data executor group 10-4 directly performs the first backward data processing without waiting after finishing the first forward data processing, and as the fourth stage of data processing it then enters, from the beginning, a state in which forward data processing and backward data processing are executed alternately. The predetermined number of times that the data processing executors in the first data executor group 10-1 execute continuously after startup may be greater than or less than the number of groups of data executor groups, and parallel processing can still be achieved. However, when the predetermined number is greater than the number of groups, during the startup phase and the steady state the forward data processing uses older model parameters, which may cause the data training process to take longer to achieve good convergence; and when the predetermined number is less than the number of groups, in the steady state each data executor group will, after a certain fixed processing step, wait for input data, so that the groups cannot reach a fully loaded working state and computing resources are wasted.
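A minimal sketch, assuming the number of warm-up forward passes equals the number of groups, that generates the per-group operation sequence (warm-up forward passes followed by strict forward/backward alternation) described for the stable phase of FIG. 4. Function and variable names are illustrative only, not the patent's implementation.

```python
from typing import List


def stage_schedule(stage_index: int, num_stages: int, num_batches: int) -> List[str]:
    """Operation sequence for one data executor group: warm-up forward passes,
    then alternating backward/forward until every batch has been processed
    both forward and backward."""
    warmup = num_stages - stage_index          # warm-up forward passes
    ops: List[str] = []
    fwd, bwd = 0, 0
    # Warm-up: forward passes only.
    while fwd < min(warmup, num_batches):
        ops.append(f"F{fwd + 1}")
        fwd += 1
    # Stable phase: one backward then one forward, so the pipeline stays full.
    while bwd < num_batches:
        ops.append(f"B{bwd + 1}")
        bwd += 1
        if fwd < num_batches:
            ops.append(f"F{fwd + 1}")
            fwd += 1
    return ops


# Four groups, eight batches: group 10-1 warms up with 4 forwards, group 10-4 with 1.
for s in range(4):
    print(f"group 10-{s + 1}:", " ".join(stage_schedule(s, num_stages=4, num_batches=8)))
```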
Although in the stable phase shown in FIG. 4 each data executor group alternates between one forward data processing and one backward data processing, in practice the data required for forward data processing and the data required for backward data processing may not arrive in strict alternation; for example, while a data executor group is performing a forward data processing, both kinds of required data may arrive at the same time, or while it is performing a backward data processing, the data required for the next backward data processing may arrive before the data required for the next forward data processing. The data executor group that performs the backward data processing for a batch shares data with the data executor group that performed the forward data processing for that batch. To speed up the release of the storage space of the group that performed forward data processing, and to prevent the data processing flow from stalling because storage space used by a previous batch's forward data processing remains unreleased for a long time in some group and the stall propagates to other groups, the data processing system of the present disclosure provides a state machine in each data executor group. When the state machine of a data executor group obtains the backward data processing result of that group for any batch of data, it triggers a state condition so that the group performs the backward data processing for that batch immediately in its next execution step, regardless of whether there is a pending forward data processing task and regardless of whether that pending forward processing task arrived before the backward processing task. A newly arrived pending forward data processing task can only be executed once all pending backward processing tasks that have arrived have been executed. By prioritizing backward data processing, the backward data processing can be prevented from waiting for a long time, so that each data executor group can use updated model parameters when performing forward data processing. Because backward data processing is executed first, the data executor group able to execute backward data processing can notify the corresponding data executor group executing forward data processing as early as possible to release the storage space of the data needed by the backward data processing, so that the group executing forward data processing can reuse that storage space. When storage space is tight, a data executor group whose storage space cannot be released will be unable to perform forward data processing on the next batch of data, which may cause the whole streaming data processing to stall at that group and thereby reduce the speed of the whole data processing system.
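The backward-first rule described above can be illustrated with a small sketch of an executor-group loop. This is only a schematic rendering of the priority behavior under the stated assumptions; the queue and method names are hypothetical and do not describe the patent's actual state machine.

```python
from collections import deque


class ExecutorGroupLoop:
    """Illustrative scheduling loop: pending backward tasks are always drained
    before any newly arrived forward task is started."""

    def __init__(self):
        self.pending_forward = deque()   # forward result data from the preceding group
        self.pending_backward = deque()  # model update parameters from the following group

    def enqueue_forward(self, forward_result):
        self.pending_forward.append(forward_result)

    def enqueue_backward(self, model_update):
        self.pending_backward.append(model_update)

    def step(self):
        """Execute one data processing; backward has priority when both are pending."""
        if self.pending_backward:
            task = self.pending_backward.popleft()
            return ("backward", task)    # frees the stashed data for this batch sooner
        if self.pending_forward:
            task = self.pending_forward.popleft()
            return ("forward", task)
        return None                      # nothing to do yet


loop = ExecutorGroupLoop()
loop.enqueue_forward("batch-5 activations")
loop.enqueue_backward("batch-1 gradients")
print(loop.step())  # ('backward', 'batch-1 gradients') -- backward runs first
print(loop.step())  # ('forward', 'batch-5 activations')
```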
To this end, the present disclosure selects the number of times that the data processing executors in the first data executor group 10-1 execute continuously after startup to be equal to the number of groups of data executor groups, so that each data executor group can alternate forward data processing and backward data processing in the stable phase, and each data executor group uses model parameters that are as new as possible when executing forward data processing. FIG. 5 is a schematic diagram illustrating a first embodiment of a data processing system 10 performing the stable phase of a data processing process according to the present disclosure, used when memory space is insufficient. As shown in FIG. 5, with four data executor groups, the model parameters of version 11 may be used when the fourth data executor group 10-4 processes data 12, the model parameters of version 10 when the third data executor group 10-3 processes data 12, the model parameters of version 9 when the second data executor group 10-2 processes data 12, and the model parameters of version 8 when the first data executor group 10-1 processes data 12, which improves convergence during the data training process. Moreover, the storage space for storing model parameters can be gradually reduced for the data executor groups corresponding to the later scenes: for example, the first data executor group 10-1 requires four memory spaces to store the latest four versions of model parameters, the second data executor group 10-2 requires only three memory spaces for the latest three versions, the third data executor group 10-3 requires two memory spaces for the latest two versions, and the fourth data executor group 10-4, corresponding to the last scene, requires at most one memory space for a single version of model parameters. This significantly reduces the need for memory space.
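For illustration only, the FIG. 5 arrangement can be summarized with the small sketch below: each group keeps a shrinking window of parameter versions, and the version used for a batch lags the batch index by a stage-dependent offset. The formulas are an illustrative reading of the example values described for FIG. 5 (versions 8-11 for data 12), not text from the patent.

```python
def stored_versions(stage_index: int, num_stages: int) -> int:
    """Number of model-parameter versions group `stage_index` (0-based) keeps
    in the FIG. 5 scheme: 4, 3, 2, 1 for a four-stage pipeline."""
    return num_stages - stage_index


def forward_version(stage_index: int, batch_id: int, num_stages: int) -> int:
    """Parameter version used by group `stage_index` for batch `batch_id`
    in the FIG. 5 scheme: later stages see newer versions."""
    return batch_id - num_stages + stage_index


# Four stages, batch 12: versions 8, 9, 10, 11 as in FIG. 5.
print([forward_version(s, 12, 4) for s in range(4)])  # [8, 9, 10, 11]
print([stored_versions(s, 4) for s in range(4)])      # [4, 3, 2, 1]
```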
FIG. 6 is a schematic diagram illustrating a second embodiment of a data processing system 10 performing the stable phase of a data processing process according to the present disclosure. When the memory space of the data processing system allows, each data executor group can be configured with a number of memory spaces equal to the number of data executor groups, and every data executor group executes forward data processing on the same batch of data using the same version of the model parameters. As shown in FIG. 6, with four data executor groups and sufficient memory space, the fourth data executor group 10-4, the third data executor group 10-3, the second data executor group 10-2, and the first data executor group 10-1 may all use the model parameters of version 8 when processing data 12, so that each batch of input data is processed with the same model parameters (weights) in the different data executor groups. As described above, because the first data executor group 10-1 uses the version-8 model parameters when it begins processing data 12, the data executor groups 10-2 to 10-4 must also use the version-8 model parameters; for this reason, every data executor group must reserve four memory spaces for the model parameters and the data. Unlike the system shown in FIG. 5, the system shown in FIG. 6 reserves the same amount of memory space for model parameters in every data executor group. On the one hand, this structure gives the data processing system greater stability in the results of data processing, and the convergence of the data processing result is better because the versions of the model parameters used within one iteration are the same. On the other hand, by providing enough memory space to store consecutive model parameter versions, it can be ensured that every batch of data uses the same model parameter version, so that data processing in the stable phase can essentially alternate between one forward data processing and one backward data processing; the data processing is therefore more stable, and the waiting time of each data executor group of the parallel data processing system when executing data processing in the stable phase is eliminated, so that every data executor group is in a fully loaded processing state.
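A corresponding sketch for the FIG. 6 arrangement, again purely illustrative: every group reserves the same number of version slots, and the version used for a batch depends only on the batch index, lagging it by the number of stages (so data 12 uses version 8, data 13 version 9, and so on).

```python
def stored_versions_uniform(num_stages: int) -> int:
    """In the FIG. 6 scheme every group reserves the same number of
    parameter slots, equal to the number of groups."""
    return num_stages


def forward_version_uniform(batch_id: int, num_stages: int) -> int:
    """All groups use the same version for a given batch; the version
    number lags the batch number by a fixed interval of `num_stages`."""
    return batch_id - num_stages


# With four stages: data 12 -> version 8, data 13 -> version 9, data 14 -> version 10.
print([forward_version_uniform(b, 4) for b in (12, 13, 14)])  # [8, 9, 10]
```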
Referring back to FIG. 1, in order for the data processing system 10 to achieve convergence of the training process and to remain as fully loaded as possible during parallel data processing, a processing stage dividing component 11 is provided. The processing stage dividing component 11 divides the data processing process evenly into a plurality of data processing scenes corresponding to the processing stages, based on the amount of tasks to be executed, so that the data processing time of each scene is uniform. The division need not be exactly even, however; according to the actual needs of data processing and the amount of computing resources configured, the division may be uneven as long as the computing resources configured for each processing stage are load-balanced. To this end, the data processing system 10 is further provided with a data executor grouping component 12 that groups a plurality of data executors into the plurality of data executor groups based on the amount of computation of each of the processing stages of the data to be processed, so as to balance the amount of computation that each data executor performs for each batch of data.
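A minimal, hypothetical sketch of the kind of stage division these two components might perform: given per-layer compute-cost estimates, layers are split into contiguous stages whose total costs are roughly equal. The greedy split below is only one possible heuristic under that assumption and is not taken from the patent.

```python
from typing import List


def divide_into_stages(layer_costs: List[float], num_stages: int) -> List[List[int]]:
    """Greedily assign consecutive layers to stages so each stage's total
    cost approaches total_cost / num_stages."""
    target = sum(layer_costs) / num_stages
    stages: List[List[int]] = [[] for _ in range(num_stages)]
    stage, acc = 0, 0.0
    for i, cost in enumerate(layer_costs):
        remaining_layers = len(layer_costs) - i
        remaining_stages = num_stages - stage
        # Move to the next stage if the current one is full enough and there
        # are still enough layers left for the remaining stages.
        if (stages[stage] and acc + cost > target
                and remaining_layers >= remaining_stages
                and stage < num_stages - 1):
            stage, acc = stage + 1, 0.0
        stages[stage].append(i)
        acc += cost
    return stages


# Example: 8 layers with uneven costs split across 4 stages.
print(divide_into_stages([1, 3, 2, 2, 4, 1, 1, 2], num_stages=4))  # [[0, 1], [2, 3], [4], [5, 6, 7]]
```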
As described above, in order to enable each forward data processing to use the latest model parameters, the model parameter memory in each data executor group stores at most a number of versions of the model update parameters equal to the predetermined number of times, so that the forward data processing being executed uses the latest version of the model update parameters stored in the model parameter memory. FIG. 5 also illustrates the model parameter storage process of this stable phase. As shown in FIG. 5, for data 12, the first data executor group 10-1 uses the version-8 model parameters when executing forward data processing and thereafter, as forward and backward processing alternate, stores four versions of model parameters (versions 8-11) before performing the version-12 parameter update. In this process, the first data executor group 10-1 uses the latest version of the model parameters each time it performs forward data processing; for example, for data 13 it uses the version-9 model parameters, and so on. Likewise, for data 12, the second data executor group 10-2 uses the version-9 model parameters when performing forward data processing: before that forward data processing it has already completed the version-9 parameter update, and it therefore uses the latest version of the model parameters. Thereafter, as forward and backward processing alternate, it stores four versions of model parameters before the version-12 parameter update is performed. Similarly, for data 12, the third data executor group 10-3 uses the version-10 model parameters and the fourth data executor group 10-4 uses the version-11 model parameters when performing forward data processing. As forward and backward data processing progress, the oldest model parameters in the model parameter memory are replaced with the newest ones.
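The replacement behavior just described (keep at most the predetermined number of versions; forward passes always read the newest stored version; the oldest version is evicted when a new one arrives) can be sketched as a small bounded version store. The class below is hypothetical and only mirrors that behavior.

```python
from collections import OrderedDict


class ModelParameterMemory:
    """Keeps at most `max_versions` model-parameter versions; forward data
    processing always reads the latest stored version, and storing a new
    version evicts the oldest one."""

    def __init__(self, max_versions: int):
        self.max_versions = max_versions
        self._versions: "OrderedDict[int, object]" = OrderedDict()

    def store(self, version: int, params: object) -> None:
        self._versions[version] = params
        if len(self._versions) > self.max_versions:
            self._versions.popitem(last=False)  # drop the oldest version

    def latest(self):
        version = next(reversed(self._versions))
        return version, self._versions[version]


mem = ModelParameterMemory(max_versions=4)
for v in range(8, 13):          # backward passes produce versions 8..12
    mem.store(v, params=f"weights-v{v}")
print(mem.latest())             # (12, 'weights-v12'); versions 9-12 are retained
```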
Also, for the embodiment shown in FIG. 6, a predetermined interval is specified between the batch number of the data and the version number of the model parameters used for its forward data processing, so that for the same batch of data all data executor groups use the latest model parameter version from a predetermined number of intervals earlier: for example, for data 13 all data executor groups use the version-9 model parameters, for data 14 the version-10 model parameters, for data 15 the version-11 model parameters, and so on. In this way the newest available version of the model parameters is used. Each time a backward data processing is executed, the model parameter versions in the memory space of the corresponding data executor group are updated; that is, the oldest version of the model parameters is replaced by the newest version.
Returning to FIG. 1, the data processing system according to the present disclosure performs data processing in the following manner. First, the data processing process is divided evenly into a plurality of data processing scenes corresponding to the processing stages so that the data processing time of each scene is consistent. Second, a plurality of data executors are grouped into a plurality of data executor groups based on the amount of computation of each of the processing stages of the data to be processed, each data executor group comprising at least one data processing executor. Then, after startup, the data processing executors in the first data executor group of the plurality of data executor groups continuously execute forward data processing a predetermined number of times and then alternately execute forward data processing and backward data processing. Finally, the data processing executors in the data executor groups after the first data executor group execute, in the order in which data is received, either forward data processing on the forward result data generated by the data processing executors in the preceding data executor group executing forward data processing, or backward data processing on the model update parameters generated by the data processing executors in the following data executor group executing backward data processing. It is noted that the predetermined number is preferably selected equal to the number of groups of data executor groups. Further, the model parameter memory in each data executor group stores at most a number of versions of the model update parameters equal to the predetermined number of times, so that the forward data processing being executed uses the latest version of the model update parameters stored in the model parameter memory.
The basic principles of the present disclosure have been described in connection with specific embodiments, but it should be noted that it will be understood by those skilled in the art that all or any of the steps or components of the method and apparatus of the present disclosure may be implemented in any computing device (including processors, storage media, etc.) or network of computing devices, in hardware, firmware, software, or a combination thereof, which can be implemented by those skilled in the art using their basic programming skills after reading the description of the present disclosure.
Thus, the objects of the present disclosure may also be achieved by running a program or a set of programs on any computing device. The computing device may be a general purpose device as is well known. Thus, the object of the present disclosure can also be achieved merely by providing a program product containing program code for implementing the method or apparatus. That is, such a program product also constitutes the present disclosure, and a storage medium storing such a program product also constitutes the present disclosure. It is to be understood that the storage medium may be any known storage medium or any storage medium developed in the future.
It is also noted that in the apparatus and methods of the present disclosure, it is apparent that individual components or steps may be disassembled and/or re-assembled. These decompositions and/or recombinations are to be considered equivalents of the present disclosure. Also, the steps of executing the series of processes described above may naturally be executed chronologically in the order described, but need not necessarily be executed chronologically. Some steps may be performed in parallel or independently of each other.
The above detailed description should not be construed as limiting the scope of the disclosure. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (12)

1. A data processing system comprising: a plurality of data executor groups, the number of which equals the predetermined number of processing stages of the data to be processed, wherein each data executor group comprises at least one data processing executor; the data processing executors in the first data executor group of the plurality of data executor groups, after startup, continuously execute forward data processing a predetermined number of times and then alternately execute forward data processing and backward data processing; and the data processing executors in the data executor groups after the first data executor group execute, in the order in which data is received, either forward data processing on the obtained forward result data generated by the data processing executors in the preceding data executor group executing forward data processing, or backward data processing on the obtained model update parameters generated by the data processing executors in the following data executor group executing backward data processing.
2. The data processing system of claim 1, wherein, when each data executor group performs one forward data processing or one backward data processing, if there are one or more pieces of forward result data generated by the data processing executors in its preceding data executor group executing forward data processing that require forward data processing and one or more model update parameters generated by the data processing executors in its following data executor group executing backward data processing that require backward data processing, the backward data processing is performed first on the received model update parameters.
3. A data processing system according to claim 1 or 2, wherein the predetermined number of times is equal to the number of groups of the set of data executors.
4. The data processing system of claim 1, further comprising: a data executor grouping component that groups a plurality of data executors into the plurality of data executor groups based on the amount of computation of each of the processing stages of the data to be processed, so as to balance the amount of computation that each data executor performs for each batch of data.
5. The data processing system of claim 1, further comprising: a processing stage dividing component that divides the data processing process evenly into a plurality of data processing scenes corresponding to the processing stages such that the data processing time of each scene is uniform.
6. The data processing system according to claim 1, wherein the model parameter memory in each data executor group stores at most a number of versions of the model update parameter equal to the predetermined number of times, so that the forward data processing being executed performs the forward data processing with the latest version of the model update parameter stored in the model parameter memory.
7. The data processing system according to claim 1, wherein, when the first data executor group has not obtained a first-version model update parameter after continuously executing forward data processing the predetermined number of times after startup, the first data executor group waits for the first-version model update parameter of the second data executor group, generated by the backward data processing executed by the second data executor group, and then executes backward data processing based on that parameter so as to output the first-version model update parameter of the first data executor group.
8. A data processing method, comprising:
evenly dividing the data processing procedure into a plurality of data processing segments corresponding to the processing stages, such that the data processing time of each segment is substantially uniform;
grouping a plurality of data processing executors into a plurality of data executor groups based on the computation amount of each of the processing stages of the data to be processed, wherein each data executor group comprises at least one data processing executor;
after start-up, a data processing executor in a first data executor group of the plurality of data executor groups continuously executing forward data processing a predetermined number of times and then alternately executing forward data processing and backward data processing; and
a data processing executor in each data executor group following the first data executor group, in the order in which it receives data, executing forward data processing on the forward result data it obtains, generated by the forward data processing of a data processing executor in the preceding data executor group, or executing backward data processing on the model update parameters it obtains, generated by the backward data processing of a data processing executor in the succeeding data executor group.
9. The data processing method of claim 8, further comprising:
when there are both one or more forward result data, generated by the forward data processing of the data processing executors in its preceding data executor group, awaiting forward data processing, and one or more model update parameters, generated by the backward data processing of the data processing executors in its succeeding data executor group, awaiting backward data processing, each data executor group executing backward data processing on the received model update parameters first.
10. A data processing method according to claim 8 or 9, wherein the predetermined number of times is equal to the number of data executor groups.
11. The data processing method of claim 8, further comprising: storing, in a model parameter memory in each data executor group, at most as many versions of the model update parameters as the predetermined number of times, so that the forward data processing currently being executed uses the latest version of the model update parameters stored in the model parameter memory.
12. The data processing method according to claim 8, wherein, when the first data executor group has not obtained the first-version model update parameter of the second data executor group, generated by the backward data processing of the second data executor group, after continuously performing forward data processing the predetermined number of times following start-up, the first data executor group waits for that first-version model update parameter before performing backward data processing based on it, so as to output the first-version model update parameter of the first data executor group.
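As a purely illustrative companion to method claims 8 and 9 (not the claimed method itself), the sketch below models a data executor group after the first one as a worker that handles messages in the order received; the queue-based message format and all names are assumptions, and the blocking inbox.get() call also mirrors the waiting behavior described in claims 7 and 12.

import queue
import threading

def stage_worker(inbox: "queue.Queue", forward_fn, backward_fn):
    """Illustrative loop for a data executor group after the first (claims 1 and 8):
    handle forward result data and model update parameters in order of arrival."""
    while True:
        kind, payload = inbox.get()      # blocks until data arrives (cf. the waiting in claims 7/12)
        if kind == "forward":            # forward result data from the preceding group
            forward_fn(payload)
        elif kind == "backward":         # model update parameters from the succeeding group
            backward_fn(payload)
        elif kind == "stop":
            break

# Usage sketch: one worker thread per data executor group after the first.
# inbox = queue.Queue()
# threading.Thread(target=stage_worker, args=(inbox, fwd, bwd), daemon=True).start()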
CN202110568222.3A 2021-05-25 2021-05-25 Data processing system and method thereof Active CN113254206B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110568222.3A CN113254206B (en) 2021-05-25 2021-05-25 Data processing system and method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110568222.3A CN113254206B (en) 2021-05-25 2021-05-25 Data processing system and method thereof

Publications (2)

Publication Number Publication Date
CN113254206A true CN113254206A (en) 2021-08-13
CN113254206B CN113254206B (en) 2021-09-28

Family

ID=77184267

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110568222.3A Active CN113254206B (en) 2021-05-25 2021-05-25 Data processing system and method thereof

Country Status (1)

Country Link
CN (1) CN113254206B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111582451A (en) * 2020-05-08 2020-08-25 University of Science and Technology of China Image recognition interlayer parallel pipeline type binary convolution neural network array architecture
CN112116084A (en) * 2020-09-15 2020-12-22 University of Science and Technology of China Convolution neural network hardware accelerator capable of solidifying full network layer on reconfigurable platform
CN112154462A (en) * 2018-05-23 2020-12-29 Microsoft Technology Licensing, LLC High performance pipeline parallel deep neural network training
US20210133591A1 (en) * 2019-11-04 2021-05-06 Baidu Usa Llc Reducing training times of deep neural networks through efficient hybrid parallelism

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112154462A (en) * 2018-05-23 2020-12-29 Microsoft Technology Licensing, LLC High performance pipeline parallel deep neural network training
US20210133591A1 (en) * 2019-11-04 2021-05-06 Baidu Usa Llc Reducing training times of deep neural networks through efficient hybrid parallelism
CN111582451A (en) * 2020-05-08 2020-08-25 University of Science and Technology of China Image recognition interlayer parallel pipeline type binary convolution neural network array architecture
CN112116084A (en) * 2020-09-15 2020-12-22 University of Science and Technology of China Convolution neural network hardware accelerator capable of solidifying full network layer on reconfigurable platform

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
REN WEIXIN: "FPGA-based hardware acceleration system", Practical Electronics (电子制作) *
WANG WEI: "FPGA parallel structure design for convolutional neural network (CNN) algorithms", Microelectronics & Computer (微电子学与计算机) *
ZHANG JINWEN: "Design of a DSP-based neural network computer", Computer Engineering and Design (计算机工程与设计) *
ZHANG JINWEN: "Parallelism of neural network computation and parallel computers", Microprocessors (微处理机) *
MO ZEYAO et al.: "Parallel algorithms and parallel programming: from individuality and commonality to software reuse", Scientia Sinica Informationis (中国科学:信息科学) *
CHEN PENG: "Optimization method for an FPGA convolutional neural network accelerator based on improved dynamic configuration", Chinese High Technology Letters (高技术通讯) *

Also Published As

Publication number Publication date
CN113254206B (en) 2021-09-28

Similar Documents

Publication Publication Date Title
Norrie et al. The design process for Google's training chips: TPUv2 and TPUv3
CN107636638B (en) General parallel computing architecture
Hegde et al. Parallel and distributed deep learning
EP3797385B1 (en) Highly performant pipeline parallel deep neural network training
JP6698784B2 (en) Multi-thread binding state in multi-thread processor
US10949746B2 (en) Efficient parallel training of a network model on multiple graphics processing units
US20160321776A1 (en) Model Parallel Processing Method and Apparatus Based on Multiple Graphic Processing Units
US20160321777A1 (en) Data parallel processing method and apparatus based on multiple graphic processing units
Park et al. A GPU-based application framework supporting fast discrete-event simulation
CN109697186A (en) Time determinability compiler
KR102178190B1 (en) Instruction set
CN109697185A (en) Synchronization in more tile processing arrays
CN110018817A (en) The distributed operation method and device of data, storage medium and processor
CN110214317A (en) Synchronization in more tile processing arrangements
US20190303149A1 (en) Sequence alignment method of vector processor
CN104765589A (en) Grid parallel preprocessing method based on MPI
EP4035080A1 (en) Pipelined neural network processing with continuous and asynchronous updates
CN113342525A (en) Distributed data processing system and method thereof
CN108197075B (en) Multi-core implementation method of Inceptation structure
CN113254206B (en) Data processing system and method thereof
JP2023015205A (en) General-purpose parallel computing architecture
KR102463147B1 (en) Massively parallel deep learning method and apparatus
Harlap et al. PipeDream: Pipeline parallelism for DNN training
GB2593756A (en) Control of data transfer between processing nodes
Eckstein et al. Efficient distributed-memory parallel matrix-vector multiplication with wide or tall unstructured sparse matrices

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant