CN111158907A - Data processing method and device, electronic equipment and storage medium


Info

Publication number: CN111158907A
Authority: CN (China)
Prior art keywords: data, computing unit, processed, processing, computing
Legal status: Granted
Application number: CN201911364894.1A
Other languages: Chinese (zh)
Other versions: CN111158907B (en)
Inventors: 李英晗, 张行程
Current Assignee: Shenzhen Sensetime Technology Co Ltd
Original Assignee: Shenzhen Sensetime Technology Co Ltd
Application filed by Shenzhen Sensetime Technology Co Ltd
Priority to CN201911364894.1A
Publication of CN111158907A
Application granted
Publication of CN111158907B
Legal status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 - Multiprogramming arrangements
    • G06F 9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 - Allocation of resources to service a request
    • G06F 9/5011 - Allocation of resources to service a request, the resources being hardware resources other than CPUs, servers and terminals
    • G06F 9/5016 - Allocation of resources to service a request, the resource being the memory
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 1/00 - General purpose image data processing
    • G06T 1/20 - Processor architectures; processor configuration, e.g. pipelining
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 1/00 - General purpose image data processing
    • G06T 1/60 - Memory management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The present disclosure relates to a data processing method and apparatus, an electronic device, and a storage medium. The method includes: acquiring the data volume of at least one group of data in data to be processed; allocating a computing unit to each group of data in the data to be processed according to the data volume and the cache capacity of at least one computing unit; and reading the data to be processed from memory and caching it in the cache of the allocated computing unit, so that the data to be processed in the cache is processed by the allocated computing unit.

Description

Data processing method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a data processing method and apparatus, an electronic device, and a storage medium.
Background
With the rapid development of artificial intelligence technology, artificial intelligence enterprises have sprung up like bamboo shoots after a spring rain, and Graphics Processing Unit (GPU) computing tasks have gradually grown from small-scale jobs to large-scale workloads interconnected by high-speed networks. To meet the different requirements that different tasks place on the GPU, the GPU may be divided into a plurality of computing units, and the plurality of computing units may be used to process different tasks.
In general, multiple computing units on one GPU are relatively independent, and they may share the global memory of the GPU. However, how to use the computing units to process tasks efficiently remains a technical problem to be solved.
Disclosure of Invention
The present disclosure proposes a data processing technical solution.
According to an aspect of the present disclosure, there is provided a data processing method including:
acquiring the data volume of at least one group of data in data to be processed; allocating a computing unit to each group of data in the data to be processed according to the data volume and the cache capacity of at least one computing unit; and reading the data to be processed from the memory and caching it in the cache of the allocated computing unit, so as to process the data to be processed in the cache through the allocated computing unit.
In one possible implementation manner, after the data to be processed is read from the memory, the method further includes: performing first processing on the data to be processed through the allocated computing unit to obtain a processing result. The processing of the data to be processed in the cache through the allocated computing unit includes: acquiring the data to be processed from the cache; and performing, through the allocated computing unit, second processing on the processing result and the data to be processed acquired from the cache.
In a possible implementation manner, the allocating of a computing unit to each group of data in the data to be processed according to the data volume and the cache capacity of at least one computing unit includes: in a case where the data volume is less than or equal to the cache capacity of each computing unit in the at least one computing unit, allocating each group of data in the memory to one computing unit.
In a possible implementation manner, the allocating of a computing unit to each group of data in the data to be processed according to the data volume and the cache capacity of at least one computing unit includes: in a case where the data volume is greater than the cache capacity of each computing unit in the at least one computing unit, allocating each group of data in the memory to a plurality of computing units.
In a possible implementation manner, the allocating of a computing unit to each group of data in the data to be processed according to the data volume and the cache capacity of at least one computing unit includes: in a case where the data volume of some groups of data is less than or equal to the cache capacity of part of the computing units in the at least one computing unit, and/or the data volume of some groups of data is greater than the cache capacity of part of the computing units, allocating each group of data in the partial groups of data in the memory to one computing unit, and allocating each group of data in the memory other than the partial groups of data to a plurality of computing units.
In one possible implementation manner, for a group of data allocated to a plurality of computing units, the processing of the data to be processed in the cache through the allocated computing units includes: acquiring, from the cache of each of the plurality of allocated computing units, target data belonging to the group of data; and performing, by each of the plurality of computing units, second processing on the processing result and the target data, respectively.
In one possible implementation manner, the performing, by each of the plurality of computing units, of the second processing on the processing result and the target data respectively includes: performing, by each of the plurality of computing units, second processing on the processing result and the target data respectively in a case where each of the plurality of computing units has completed the first processing.
In a possible implementation manner, before the second processing is performed on the processing result and the target data by each of the plurality of computing units, the method further includes: determining the number of computing units among the plurality of computing units that have completed the first processing; and determining that each of the plurality of computing units has completed the first processing in a case where the number of computing units that have completed the first processing is greater than or equal to the total number of the plurality of computing units.
In one possible implementation, the data to be processed includes image data; the image data includes data of a plurality of channels; each set of data corresponds to data of one channel in the image data.
According to an aspect of the present disclosure, there is provided a data processing apparatus including:
the acquisition module is configured to acquire the data volume of at least one group of data in data to be processed;
the allocation module is configured to allocate a computing unit to each group of data in the data to be processed according to the data volume and the cache capacity of at least one computing unit;
and the processing module is configured to read the data to be processed from the memory and cache it in the cache of the allocated computing unit, so as to process the data to be processed in the cache through the allocated computing unit.
In a possible implementation manner, the processing module is further configured to perform first processing on the data to be processed through the allocated computing unit to obtain a processing result;
the processing module is specifically configured to acquire the data to be processed from the cache, and to perform, through the allocated computing unit, second processing on the processing result and the data to be processed acquired from the cache.
In a possible implementation manner, the allocation module is specifically configured to allocate each group of data in the memory to one computing unit in a case where the data volume is less than or equal to the cache capacity of each computing unit in the at least one computing unit.
In a possible implementation manner, the allocation module is specifically configured to allocate each group of data in the memory to a plurality of computing units in a case where the data volume is greater than the cache capacity of each computing unit in the at least one computing unit.
In a possible implementation manner, the allocation module is specifically configured to, in a case where the data volume of some groups of data is less than or equal to the cache capacity of part of the computing units in the at least one computing unit and/or the data volume of some groups of data is greater than the cache capacity of part of the computing units, allocate each group of data in the partial groups of data in the memory to one computing unit, and allocate each group of data in the memory other than the partial groups of data to a plurality of computing units.
In a possible implementation manner, the processing module is specifically configured to acquire, from the cache of each of the plurality of allocated computing units, target data belonging to a group of data, and to perform, by each of the plurality of computing units, second processing on the processing result and the target data, respectively.
In a possible implementation manner, the processing module is specifically configured to, in a case where each of the plurality of computing units has completed the first processing, perform, by each of the plurality of computing units, second processing on the processing result and the target data, respectively.
In a possible implementation manner, the processing module is further configured to determine the number of computing units among the plurality of computing units that have completed the first processing, and to determine that each of the plurality of computing units has completed the first processing in a case where the number of computing units that have completed the first processing is greater than or equal to the total number of the plurality of computing units.
In one possible implementation, the data to be processed includes image data; the image data includes data of a plurality of channels; each set of data corresponds to data of one channel in the image data.
According to another aspect of the present disclosure, there is provided an electronic device including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to: the above-described data processing method is performed.
According to another aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described data processing method.
In the embodiments of the present disclosure, the data volume of at least one group of data in the data to be processed may be acquired; a computing unit is then allocated to each group of data in the data to be processed according to the data volume of the at least one group of data and the cache capacity of the at least one computing unit; the data to be processed is read from the memory and cached in the cache of the allocated computing unit, and the data to be processed in the cache is processed by the allocated computing unit. In this way, by caching the data to be processed in the cache of the allocated computing unit, the data can be read from the computing unit's cache whenever it needs to be read again or multiple times, which improves the processing efficiency of the data to be processed.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure.
Fig. 1 shows a flow diagram of a data processing method according to an embodiment of the present disclosure.
Fig. 2 shows a block diagram of an example of a data processing process according to an embodiment of the present disclosure.
Fig. 3 shows a flow diagram of an example of a data processing procedure according to an embodiment of the present disclosure.
FIG. 4 shows a block diagram of an example of a compute unit synchronization process according to an embodiment of the present disclosure.
Fig. 5 shows a flow diagram of an example of a data processing procedure according to an embodiment of the present disclosure.
Fig. 6 shows a block diagram of a data processing apparatus according to an embodiment of the present disclosure.
Fig. 7 shows a block diagram of an example of an electronic device according to an embodiment of the present disclosure.
Detailed Description
Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
The term "and/or" herein is merely an association describing an associated object, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, and may mean including any one or more elements selected from the group consisting of A, B and C.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.
According to the data processing scheme provided by the embodiments of the present disclosure, the data volume of at least one group of data in the data to be processed may be acquired, and a computing unit may be allocated to each group of data in the data to be processed according to the data volume of the at least one group of data and the cache capacity of at least one computing unit; the data to be processed is then read from the memory and cached in the cache of the allocated computing unit, so that the data to be processed in the cache is processed through the allocated computing unit. In this way, during processing, the data to be processed is held in the cache of the computing unit, and whenever the data needs to be read again or multiple times it can be obtained directly from that cache, which reduces the number of times the computing unit must request the data to be processed from the memory.
Here, a computing unit may be one of the smaller physical units into which a GPU is divided, and a plurality of computing units may process the GPU's data to be processed in parallel. Ordinarily, a computing unit processes the data to be processed by reading it from the memory, and during processing the computing unit may have to read the data multiple times, that is, obtain it from the memory repeatedly, which degrades processing efficiency. The data processing scheme provided by the embodiments of the present disclosure reduces the number of times the computing unit accesses the memory, thereby reducing the time the computing unit spends on memory accesses and improving the processing efficiency of the data to be processed. Memory is understood herein as global memory.
The technical solutions provided by the embodiments of the present disclosure may be applied to GPU task processing, scheduling of computing units, the use of neural networks, the extension of training, and the like, and the embodiments of the present disclosure do not limit this.
Fig. 1 shows a flow diagram of a data processing method according to an embodiment of the present disclosure. The data processing method may be performed by a terminal device, a server, or other types of electronic devices, where the terminal device may be a User Equipment (UE), a mobile device, a user terminal, a cellular phone, a cordless phone, a Personal Digital Assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, or the like. In some possible implementations, the data processing method may be implemented by a processor calling computer readable instructions stored in a memory. The data processing method according to the embodiment of the present disclosure is described below by taking an electronic device as an execution subject.
Step S11, acquiring the data volume of at least one group of data in the data to be processed.
In the embodiment of the present disclosure, the data to be processed may be various types of data such as image data, text data, audio data, and the like. The data to be processed may include one or more sets of data, and the data amount of each set of data may be the same or different. The electronic device may divide the data to be processed into at least one group of data according to a preset rule, for example, divide the data to be processed into at least one group of data according to a preset data amount. Alternatively, the electronic device may also receive data to be processed that already comprises at least one set of data.
In the case where the data to be processed includes image data, the image data may include data of a plurality of channels, each set of the data corresponding to data of one channel in the image data. Here, a channel may represent one dimensional parameter of data to be processed. For example, in the training process of the neural network, the data to be processed may be a plurality of sample images, the data to be processed may be represented as N × C × H × W, where N may represent the number of the plurality of sample images, C may represent the number of channels, H may represent the height of each sample image, W may represent the width of each sample image, and the data of one channel may be represented as N × H × W, where N, C, H, W are integers greater than 0. In this way, by using the data of one channel in the image data as a set of data, the data of at least one channel included in the image data can be processed in parallel by the computing unit, and the efficiency of processing the data is improved.
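To make the channel-wise grouping concrete, the following host-side sketch computes the number of groups and the data volume per group for image data in N x C x H x W layout. It assumes float elements; the structure and function names are illustrative and not part of the disclosed method.

    #include <cstddef>

    // Illustrative sketch: for image data in N x C x H x W layout, each group
    // of data corresponds to one channel and holds N * H * W elements.
    // The float element type and all names are assumptions for illustration.
    struct GroupInfo {
        std::size_t numGroups;     // one group per channel (C)
        std::size_t bytesPerGroup; // data volume of each group of data
    };

    GroupInfo describeGroups(std::size_t N, std::size_t C,
                             std::size_t H, std::size_t W) {
        return GroupInfo{ C, N * H * W * sizeof(float) };
    }

For the N = 2, C = 2 scenario discussed with Fig. 2 below, describeGroups would report 2 groups of N * H * W floats each.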
Here, in a case where the data volume of each group of data is the same, the data volume of one group of data in the data to be processed may be acquired, and the data volume of every group of data in the data to be processed may be determined from it. For example, in a case where the data to be processed includes image data, each group of data may correspond to the data of one channel in the image data; the data volume of the data of one channel included in the image data may be acquired, and the data volume of the data of each channel may be determined from the data volume of the data of that one channel.
Step S12, allocating a computing unit to each group of data in the to-be-processed data according to the data amount and the buffer capacity of at least one computing unit.
In the embodiment of the present disclosure, the cache capacities of different computing units may be the same or different. The cache capacity of each computing unit may be the maximum capacity that the computing unit can devote to the data, that is, the maximum cache capacity. Alternatively, the cache capacity of each computing unit may be a standard capacity that leaves some headroom below the maximum cache capacity, for example 80% of the maximum cache capacity.
Here, one or more computing units that process at least one set of data may be allocated for each set of data according to the data amount of the set of data and the buffer capacity of at least one computing unit. For example, the data amount of at least one group of data may be compared with the buffer capacity of at least one computing unit, and one or more computing units having buffer capacities matching the data amount of the group of data may be allocated to each group of data.
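Step S12 can be sketched as a small host-side helper that compares each group's data volume with the cache capacity and assigns the smallest number of computing units whose caches jointly hold the group. The sketch assumes a uniform cache capacity across computing units; all names are illustrative.

    #include <cstddef>
    #include <vector>

    // Illustrative sketch of step S12: a group that fits in one cache gets one
    // computing unit; a larger group gets the smallest number of units whose
    // combined cache capacity covers its data volume. Uniform capacity assumed.
    std::vector<int> unitsPerGroup(const std::vector<std::size_t>& groupBytes,
                                   std::size_t cachePerUnit) {
        std::vector<int> counts;
        counts.reserve(groupBytes.size());
        for (std::size_t bytes : groupBytes) {
            std::size_t n = (bytes + cachePerUnit - 1) / cachePerUnit; // ceiling division
            counts.push_back(static_cast<int>(n == 0 ? 1 : n));
        }
        return counts;
    }

A group whose data volume is less than or equal to cachePerUnit yields 1, matching the single-unit case below; larger groups yield the multi-unit case.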
Step S13, reading the to-be-processed data from the memory, and caching the to-be-processed data in a cache of an allocated computing unit, so as to process the to-be-processed data in the cache through the allocated computing unit.
In the embodiment of the present disclosure, after one or more computing units have been allocated to each group of data, each allocated group of data may be read from the memory by the allocated computing unit or units, processed, and cached in the cache of the allocated computing unit or units. While a group of data is being processed by its allocated computing units, whenever the data to be processed needs to be read again or multiple times, the allocated computing units can read the group of data directly from the cache, which reduces the number of accesses to the global memory and saves time in processing the data to be processed.
Here, each computing unit may have a corresponding cache, and the caches corresponding to different computing units may be different. In the case of caching the data to be processed in the caches of one or more computing units, each set of data in the data to be processed may be cached in a plurality of computing units, each computing unit caching partial data of one set of data. In the case of processing the data to be processed in the cache with the allocated computing units, a part of the data cached by the computing unit may be processed with each computing unit.
In one possible implementation manner, the allocated computing unit performs first processing on the data to be processed to obtain a processing result, then acquires the data to be processed from the cache, and performs second processing on the processing result and the data to be processed acquired from the cache.
In this implementation, the data to be processed may be processed by the one or more allocated computing units, and the processing procedure of each computing unit may be the same. The procedure may include first processing and second processing. The first processing may be the first-stage processing; the first processing of different computing units may be identical and mutually independent, and after the first processing has been performed on the data to be processed by one or more computing units, the global parameters in the memory may be updated with the first-processing result obtained by each computing unit, so that the overall processing result is obtained once the global parameters have been updated. The second processing may be the second-stage processing, and the second processing of different computing units may likewise be identical and mutually independent. After the overall processing result is obtained, the data to be processed can be acquired from the cache by the one or more computing units and subjected, together with the processing result, to second processing by those computing units. Using the second-processing results obtained by each computing unit in this way reduces the number of memory accesses and improves data processing efficiency.
For example, each group of data may be subjected to batch normalization by at least one computing unit. Batch normalization is a common operation in convolutional neural networks. During the training of a neural network model, the numerical distribution of the input data of certain network layers may be uneven, with excessively large variance; this can cause the back-propagated gradients to vanish during training and thus slow down the training of the neural network model. Batch normalization processes the input data (the data to be processed) of such a network layer into a distribution with a mean of 0 and a variance of 1. Here, each group of input data may be batch-normalized by at least one computing unit. Because the batch normalization procedure reads the input data twice, the input data may be read from the memory and cached in the cache of the allocated computing unit; the allocated computing unit first performs first processing on the input data to obtain its mean and variance (the processing result), then reads the input data back from the cache of the computing unit and performs second processing on the read input data together with the mean and variance to obtain the batch normalization result. This reduces the number of memory accesses and hence the time spent on batch normalization.
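The two-pass structure of this example can be illustrated with a minimal CUDA kernel for the case, described below, in which each group of data (one channel) fits in the cache of a single computing unit: one thread block stands in for one computing unit and shared memory for its cache. The group is read from global memory once, the mean and variance are computed (first processing), and the normalization re-reads the shared-memory copy instead of global memory (second processing). The kernel name, block size, single-thread reduction, and eps parameter are simplifying assumptions, not the disclosed implementation.

    #include <cuda_runtime.h>

    // Illustrative sketch: batch normalization of one group (channel) per
    // computing unit, with the group staged in the unit's on-chip cache.
    __global__ void batchNormPerChannel(const float* in, float* out,
                                        int n, float eps) {
        extern __shared__ float cache[];   // this unit's cache (shared memory)
        __shared__ float mean, invStd;
        const float* g = in  + (size_t)blockIdx.x * n;  // this unit's group
        float*       o = out + (size_t)blockIdx.x * n;

        // single read of the group from global memory into the cache
        for (int i = threadIdx.x; i < n; i += blockDim.x)
            cache[i] = g[i];
        __syncthreads();

        // first processing: mean and variance (one thread reduces, for brevity)
        if (threadIdx.x == 0) {
            float s = 0.f, sq = 0.f;
            for (int i = 0; i < n; ++i) { s += cache[i]; sq += cache[i] * cache[i]; }
            mean   = s / n;
            invStd = rsqrtf(sq / n - mean * mean + eps);
        }
        __syncthreads();

        // second processing: normalize by re-reading the cache, not global memory
        for (int i = threadIdx.x; i < n; i += blockDim.x)
            o[i] = (cache[i] - mean) * invStd;
    }

    // illustrative launch: one computing unit (block) per channel
    // batchNormPerChannel<<<C, 256, n * sizeof(float)>>>(d_in, d_out, n, 1e-5f);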
By the data processing scheme, the data to be processed can be cached in at least one computing unit, the data to be processed is processed by the computing unit, and the data to be processed can be read in the cache of the computing unit under the condition of reading the data to be processed again, so that the access times of the memory are reduced, and the processing efficiency of the data to be processed is improved.
In a possible implementation manner, in a case that the data amount is less than or equal to a cache capacity of each computing unit in the at least one computing unit, each group of data in the memory is allocated to one computing unit.
In this possible implementation manner, the data volume of each group of data may be compared with the cache capacity of each computing unit. If the data volume of each group of data is less than or equal to the cache capacity of each computing unit, the cache of a single computing unit can hold one group of data; each group of the data to be processed in the memory may then be allocated to one computing unit, and one computing unit is used to process one group of data. By comparing the data volume of each group of data with the cache capacity of each computing unit, the computing unit allocated to each group of data can be determined quickly.
For example, suppose the cache capacity of one computing unit is 64 KB; in a case where the data volume of each group of data is less than or equal to 64 KB, each group of data may be allocated to one computing unit. Here, the cache of one computing unit may include a shared-memory portion and a register portion, where the shared-memory portion may be the entire storage space of the computing unit's shared memory, and the register portion may be part of the storage space of the computing unit's registers. Suppose the shared memory of one computing unit is 96 KB and its registers total 256 KB; since some register storage is occupied while the computing unit processes the data to be processed, the cache of the computing unit may consist of the 96 KB shared-memory portion plus a 16 KB register portion, and the cache capacity of the computing unit may be set to the sum of the entire shared-memory storage space and that part of the register storage space.
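As rough arithmetic under the figures assumed in this paragraph (not a statement about any particular GPU), the cache capacity could be expressed as:

    #include <cstddef>

    // Illustrative capacity arithmetic following the example above: the usable
    // cache of one computing unit is the whole 96 KB shared memory plus the
    // 16 KB spare portion of the register file; the 80% "standard capacity"
    // variant mentioned earlier is also shown.
    constexpr std::size_t kSharedBytes   = 96 * 1024;                     // shared-memory portion
    constexpr std::size_t kSpareRegBytes = 16 * 1024;                     // register portion
    constexpr std::size_t kMaxCacheBytes = kSharedBytes + kSpareRegBytes; // 112 KB maximum
    constexpr std::size_t kStdCacheBytes = kMaxCacheBytes * 80 / 100;     // standard capacity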
In an example of this implementation, after the to-be-processed data is read from the memory, each set of data may be buffered in the buffer of one allocated computing unit, and in a case where each set of data is read again or multiple times by the allocated computing unit, each set of data is read in the buffer of the allocated computing unit.
In this example, after each set of data stored in the memory is assigned to a computing unit, the set of data may be buffered and processed by the computing unit. In the process of processing the set of data by using the computing unit, each set of data may be read again or multiple times, and in this case, each set of data may be read again or multiple times in the cache of the computing unit by using the computing unit, so that the number of accesses to the memory may be reduced, and the processing efficiency of the data to be processed may be improved.
Fig. 2 shows a block diagram of an example of a data processing process according to an embodiment of the present disclosure. In an application scenario in which batch normalization is performed on the data to be processed, the data to be processed may be 2 sample images corresponding to 2 channels, that is, the number of sample images N is 2 and the number of channels C is 2. Each group of data included in the data to be processed may correspond to the data of one channel, and the data to be processed is first stored in the memory. In a case where the data volume of each group of data is less than or equal to the cache capacity of one computing unit, each group of data may be allocated to one computing unit, that is, the data of channel 1 is allocated to computing unit 1 and the data of channel 2 is allocated to computing unit 2. Each group of data may then be cached in the cache of the computing unit allocated to it, and the mean and variance of each group of data are calculated by that computing unit. Subsequently, when the normalized values of the group of data are calculated from the mean and variance, the group of data can be read directly from the cache by the computing unit, which reduces the number of memory accesses and improves the efficiency of batch normalization. Here, the cache of the computing unit may be an on-chip cache.
In one possible implementation manner, in a case that the data amount is larger than a cache capacity of each of the at least one computing unit, each set of data in the memory is allocated to a plurality of computing units.
In this implementation manner, the data volume of each group of data may be compared with the cache capacity of each computing unit in the at least one computing unit. When the data volume of each group of data is greater than the cache capacity of each computing unit, the cache of a single computing unit is insufficient to hold one group of data, and the caches of multiple computing units are required to hold it jointly; each group of data in the memory may therefore be allocated to a plurality of computing units, which process that group of data together. In this way, each group of data can be allocated to multiple computing units when the cache of one computing unit is insufficient to cache a group of data.
Here, each group of data may be divided into a plurality of partial data. For example, each group of data may be divided evenly into partial data of a preset data volume, and the partial data of each group may be allocated to the plurality of computing units respectively. Alternatively, a group of data may be divided into partial data matched to the cache capacities of the plurality of computing units: a computing unit with a large cache capacity may be allocated partial data with a large data volume, and a computing unit with a small cache capacity may be allocated partial data with a small data volume.
In a possible implementation manner, in a case where the data volume of some groups of data is less than or equal to the cache capacity of part of the computing units in the at least one computing unit, and/or the data volume of some groups of data is greater than the cache capacity of part of the computing units, each group of data in the partial groups of data in the memory is allocated to one computing unit, and each group of data in the memory other than the partial groups of data is allocated to a plurality of computing units.
In this implementation, because the data volume of each group of data may differ, or because the cache capacity of each computing unit differs, it may happen that the data volume of one or more groups of the data to be processed is less than or equal to the cache capacity of part of the computing units while, at the same time, the data volume of one or more other groups is greater than the cache capacity of part of the computing units. In this case, each group of data whose data volume is less than or equal to the cache capacity of part of the computing units may be allocated to one computing unit, and correspondingly, each group of data whose data volume is greater than the cache capacity of part of the computing units may be allocated to a plurality of computing units.
Here, to conserve computing resources, the smallest number of computing units may be allocated to each group of data. For example, when the cache of one computing unit can hold a group of data, that group may be allocated to a single computing unit; when the cache of one computing unit cannot hold a group of data, that group may be allocated to the smallest sufficient number of computing units. In this way computing resources are saved and their configuration optimized.
For example, the data to be processed includes 3 sets of data, the data amount of the first set of data is less than or equal to the buffer capacity of the first part of the at least one computing unit, and the first set of data may be allocated to one of the first part of the computing units. The amount of the second set of data is less than or equal to the buffer capacity of a second portion of the at least one computing unit, and the second set of data may be allocated to one of the second portion of the computing units. The data amount of the third group of data is larger than the buffer capacity of the first partial computing unit and the second partial computing unit in the at least one computing unit, the third group of data may be allocated to a plurality of computing units in the first partial computing unit and the second partial computing unit, and the third group of data may be allocated to a plurality of computing units with the smallest number in order to save computing resources. In this way, a reasonable configuration of computing resources may be achieved.
In one example, the target data belonging to the group of data may be acquired from the cache of each of the plurality of allocated computing units, and the second processing may be performed on the processing result and the target data by each of the plurality of computing units.
In this example, when a group of the data to be processed is allocated to a plurality of computing units, the group may be cached jointly by those computing units, that is, each of the plurality of computing units caches part of the allocated group of data, and the partial data cached by each computing unit is that unit's target data. Thus, when the data to be processed in the cache is processed by the allocated computing units, the target data belonging to a group of data can be acquired from the cache of each of the computing units allocated to that group, and each computing unit performs second processing on its own target data. The plurality of computing units thereby complete the second processing of the group of data cooperatively, which reduces the number of memory accesses and improves the processing efficiency of the data to be processed.
In one example, the second processing may be performed on the processing result and the target data by each of the plurality of calculation units, respectively, in a case where each of the plurality of calculation units completes the first processing.
In this example, each of the plurality of computing units may perform the same processing on its target data. The processing procedure may include the first processing and the second processing described above. Because the second processing may involve the processing results obtained by the plurality of computing units after the first processing, the second processing is performed on the cached target data and the processing result by each of the plurality of computing units only once every one of them has completed the first processing. This synchronizes the plurality of computing units that process one group of data, so that the processing result used in the second processing is consistent across them.
In one example, before the second processing is performed on the processing result and the target data by each of the plurality of calculation units, the number of calculation units that complete the first processing among the plurality of calculation units may be determined, and in a case where the number of calculation units that complete the first processing is greater than or equal to the total number of the plurality of calculation units, it is determined that each of the plurality of calculation units completes the first processing.
In this example, whether each of the plurality of computing units has completed the first processing may be determined from the number of computing units that have completed it. For example, one computing unit may monitor the task progress of each of the plurality of computing units, and upon detecting that every unit has completed the first processing, notify the other computing units to perform the second processing. Alternatively, the number of computing units that have completed the first processing may be indicated by a cooperative parameter in the memory: each time one of the plurality of computing units completes the first processing, it updates the cooperative parameter. For example, the initial value of the cooperative parameter may be set to 0, and after a computing unit performs the first processing on its allocated target data, the value of the cooperative parameter is increased by 1, accumulating the cooperative parameter. When the cooperative parameter read by the computing units is greater than or equal to the total number of the plurality of computing units, it may be determined that all of the computing units have completed the first processing.
Fig. 3 shows a flow diagram of an example of a data processing procedure according to an embodiment of the present disclosure. Assume there are two computing units, computing unit 1 and computing unit 2. After the first processing, computing unit 1 may first perform a memory fence operation. A memory fence separates two groups of asynchronously executed operations: the second group begins only after the first group has finished executing. Here, a memory fence operation may be inserted between the first processing and the second processing, so that the second processing starts only after the first processing has been executed and its processing result has been written to the memory. After computing unit 1 completes the first processing, the cooperative parameter counter may be increased by 1; it is then determined whether counter is greater than or equal to 2, and if so, computing unit 1 may perform the second processing. Otherwise, the check of whether counter is greater than or equal to 2 is repeated. Correspondingly, computing unit 2 performs the same processing as computing unit 1 on its allocated target data. Here, the operation of adding 1 to the cooperative parameter counter may be an atomic operation. An atomic operation is understood to mean that while the current computing unit is operating on a parameter in the memory, no other computing unit may operate on that parameter; only after the current computing unit's operation on the parameter completes may the other computing units operate on it.
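In CUDA terms, the fence-then-count pattern of Fig. 3 might look like the device function below, with one thread block per computing unit, __threadfence() as the memory fence, and atomicAdd on a global counter as the cooperative parameter. The spin-wait assumes that all participating blocks are resident on the GPU at the same time; the function and variable names are illustrative.

    // Illustrative sketch of the Fig. 3 synchronization. Call after a unit's
    // first processing has written its result to global memory.
    __device__ void syncComputeUnits(volatile int* counter, int numUnits) {
        __syncthreads();                    // all threads of this unit finished phase 1
        if (threadIdx.x == 0) {
            __threadfence();                // memory fence: make phase-1 writes visible
            atomicAdd((int*)counter, 1);    // atomically add 1 to the cooperative parameter
            while (*counter < numUnits) {}  // repeat the check until every unit arrived
        }
        __syncthreads();                    // release the unit's threads into phase 2
    }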
In one example, different cooperative parameters may be set for computing units allocated different groups of data.
FIG. 4 shows a block diagram of an example of a computing-unit synchronization process according to an embodiment of the present disclosure. In this example, computing units allocated different groups of data may correspond to different cooperative parameters, so that the computing units allocated different groups of data can be synchronized separately using those parameters. For example, assume there are 5 computing units, where computing unit 1, computing unit 3, and computing unit 5 correspond to data group 1 and need to be synchronized with one another, while computing unit 2 and computing unit 4 correspond to data group 2 and need to be synchronized with one another. In this case two cooperative parameters, counter1 and counter2, may be set, counter1 corresponding to data group 1 and counter2 corresponding to data group 2. After computing unit 1, computing unit 3, and computing unit 5 each perform the first processing on their target data of data group 1, each adds 1 to counter1, and when counter1 is greater than or equal to 3, computing unit 1, computing unit 3, and computing unit 5 may each perform the second processing. After computing unit 2 and computing unit 4 each perform the first processing on their target data of data group 2, each adds 1 to counter2, and when counter2 is greater than or equal to 2, computing unit 2 and computing unit 4 may each perform the second processing. In this way, any number of computing units can be synchronized flexibly so that multiple computing units work cooperatively. Moreover, during synchronization one synchronization scheme can be used for one part of the computing units and another scheme for another part, and during the execution of each scheme only the computing units synchronized by the current scheme need be considered; for example, computing units 1, 3, and 5 are synchronized together, and computing units 2 and 4 are synchronized together.
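Extending the sketch above to Fig. 4 only requires one cooperative parameter per data group, e.g. an array indexed by group, so that computing units 1, 3, and 5 spin on counters[0] while computing units 2 and 4 spin on counters[1]. Again the names are assumptions:

    // Illustrative per-group variant of the synchronization sketch: each data
    // group has its own cooperative parameter, so units assigned to different
    // groups synchronize independently of one another.
    __device__ void syncGroup(volatile int* counters, int groupId, int unitsInGroup) {
        syncComputeUnits(&counters[groupId], unitsInGroup);
    }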
Fig. 5 shows a flow diagram of an example of a data processing procedure according to an embodiment of the present disclosure. In one example, each group of data may be batch-normalized by a plurality of computing units. In this example, for a group of data with a large data volume, the cache of a single computing unit cannot hold the group, so the group may be allocated to multiple computing units, whose caches jointly cache the data of one channel. During batch normalization, each of the plurality of computing units may first read its allocated target data from the memory; each computing unit then calculates the sum sum1 and the sum of squares squareSum1 of its target data, and updates the group's sum and sum of squares stored in the memory with the calculated values, that is, sum = sum + sum1 and squareSum = squareSum + squareSum1. Here, the operations updating the sum and the sum of squares of each group of data may both be atomic operations, and the initial values of sum and squareSum may be 0. Then, each time a computing unit has updated the sum and the sum of squares of its group of data, it may first perform a memory fence operation and then add 1 to the cooperative parameter in the memory. When the value of the cooperative parameter is greater than or equal to n, each computing unit can read sum and squareSum from the memory, acquire its corresponding target data from its own cache, and normalize each element of the target data using sum and squareSum, thereby completing the batch normalization of the group of data.
During the batch normalization of the data to be processed, each computing unit caches its allocated target data, and when part of the data is read again it can be obtained directly from the cache of the corresponding computing unit, which reduces the number of memory accesses and improves the efficiency of batch normalization.
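Putting the pieces together, the Fig. 5 procedure might be sketched in CUDA as follows: each computing unit (thread block) stages its target data in shared memory, accumulates its partial sum and sum of squares into global memory with atomic additions (first processing), synchronizes through the cooperative parameter, and then normalizes its cached portion (second processing). gSum, gSumSq, and gCounter are assumed to be zero-initialized before launch, the spin-wait again assumes all blocks are co-resident, and nothing here should be read as the patented implementation itself.

    #include <cuda_runtime.h>

    // Illustrative sketch of Fig. 5: one group of groupLen elements split
    // across gridDim.x computing units; gSum, gSumSq, gCounter live in global
    // memory and must be zero before launch. All names are assumptions.
    __global__ void batchNormMultiUnit(const float* in, float* out, int groupLen,
                                       volatile float* gSum, volatile float* gSumSq,
                                       volatile int* gCounter, float eps) {
        extern __shared__ float cache[];                      // this unit's cache
        int numUnits = gridDim.x;
        int perUnit  = (groupLen + numUnits - 1) / numUnits;  // target-data length
        int base     = blockIdx.x * perUnit;
        int len      = min(perUnit, groupLen - base);         // last unit may get less

        // first processing: one global read, staged into the cache, with
        // per-thread partial sum (sum1) and sum of squares (squareSum1)
        float s = 0.f, sq = 0.f;
        for (int i = threadIdx.x; i < len; i += blockDim.x) {
            float v = in[base + i];
            cache[i] = v;
            s += v; sq += v * v;
        }
        atomicAdd((float*)gSum,   s);    // sum       = sum       + sum1       (atomic)
        atomicAdd((float*)gSumSq, sq);   // squareSum = squareSum + squareSum1 (atomic)
        __syncthreads();

        // memory fence, then add 1 to the cooperative parameter, once per unit
        if (threadIdx.x == 0) {
            __threadfence();
            atomicAdd((int*)gCounter, 1);
            while (*gCounter < numUnits) {}   // wait until all n units finished phase 1
        }
        __syncthreads();

        // second processing: read sum / sum of squares, normalize from the cache
        float mean   = *gSum / groupLen;
        float invStd = rsqrtf(*gSumSq / groupLen - mean * mean + eps);
        for (int i = threadIdx.x; i < len; i += blockDim.x)
            out[base + i] = (cache[i] - mean) * invStd;
    }

A launch for one group split over k computing units would use k blocks with dynamic shared memory of perUnit * sizeof(float) bytes; separate groups can use separate counters as in the Fig. 4 sketch.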
It is understood that the above-mentioned method embodiments of the present disclosure can be combined with one another to form combined embodiments without departing from principle and logic; owing to space limitations, the details are not repeated in the present disclosure.
In addition, the present disclosure also provides an apparatus, an electronic device, a computer-readable storage medium, and a program, each of which can be used to implement any data processing method provided by the present disclosure; for the corresponding technical solutions and descriptions, refer to the method sections, which are not repeated here.
It will be understood by those skilled in the art that, in the above methods of the present disclosure, the order in which the steps are written does not imply a strict order of execution or impose any limitation on the implementation; the specific order of execution of the steps should be determined by their functions and possible inherent logic.
Fig. 6 shows a block diagram of a data processing apparatus according to an embodiment of the present disclosure, which, as shown in fig. 6, includes:
the acquiring module 61 is configured to acquire a data amount of at least one group of data in the data to be processed;
the allocation module 62 is configured to allocate a computing unit to each group of data in the data to be processed according to the data volume and the cache capacity of at least one computing unit;
the processing module 63 is configured to read the data to be processed from the memory, and cache the data to be processed in a cache of an allocated computing unit, so as to process the data to be processed in the cache through the allocated computing unit.
In a possible implementation manner, the processing module 63 is further configured to perform a first processing on the data to be processed through the allocated computing unit to obtain a processing result;
the processing module 63 is specifically configured to acquire the data to be processed from the cache, and to perform, through the allocated computing unit, second processing on the processing result and the data to be processed acquired from the cache.
In a possible implementation manner, the allocation module 62 is specifically configured to allocate each group of data in the memory to one computing unit in a case where the data volume is less than or equal to the cache capacity of each computing unit in the at least one computing unit.
In a possible implementation manner, the allocation module 62 is specifically configured to allocate each group of data in the memory to a plurality of computing units in a case where the data volume is greater than the cache capacity of each computing unit in the at least one computing unit.
In a possible implementation manner, the allocation module 62 is specifically configured to, in a case where the data volume of some groups of data is less than or equal to the cache capacity of part of the computing units in the at least one computing unit and/or the data volume of some groups of data is greater than the cache capacity of part of the computing units, allocate each group of data in the partial groups of data in the memory to one computing unit, and allocate each group of data in the memory other than the partial groups of data to a plurality of computing units.
In a possible implementation manner, the processing module 63 is specifically configured to acquire, from the cache of each of the plurality of allocated computing units, target data belonging to a group of data, and to perform, by each of the plurality of computing units, second processing on the processing result and the target data, respectively.
In a possible implementation manner, the processing module 63 is specifically configured to, in a case where each of the plurality of computing units has completed the first processing, perform, by each of the plurality of computing units, second processing on the processing result and the target data, respectively.
In a possible implementation manner, the processing module 63 is further configured to determine the number of computing units among the plurality of computing units that have completed the first processing, and to determine that each of the plurality of computing units has completed the first processing in a case where the number of computing units that have completed the first processing is greater than or equal to the total number of the plurality of computing units.
In one possible implementation, the data to be processed includes image data; the image data includes data of a plurality of channels; each set of data corresponds to data of one channel in the image data.
In some embodiments, functions of or modules included in the apparatus provided in the embodiments of the present disclosure may be used to execute the method described in the above method embodiments, and specific implementation thereof may refer to the description of the above method embodiments, and for brevity, will not be described again here.
An embodiment of the present disclosure further provides an electronic device, including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured as the above method.
The electronic device may be provided as a terminal, server, or other form of device.
Fig. 7 is a block diagram illustrating an electronic device 1900 according to an example embodiment. For example, the electronic device 1900 may be provided as a server. Referring to fig. 7, electronic device 1900 includes a processing component 1922 further including one or more processors and memory resources, represented by memory 1932, for storing instructions, e.g., applications, executable by processing component 1922. The application programs stored in memory 1932 may include one or more modules that each correspond to a set of instructions. Further, the processing component 1922 is configured to execute instructions to perform the above-described method.
The electronic device 1900 may also include a power component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958. The electronic device 1900 may operate based on an operating system stored in the memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
In an exemplary embodiment, a non-transitory computer readable storage medium, such as the memory 1932, is also provided that includes computer program instructions executable by the processing component 1922 of the electronic device 1900 to perform the above-described methods.
The present disclosure may be systems, methods, and/or computer program products. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied thereon for causing a processor to implement various aspects of the present disclosure.
The computer-readable storage medium may be a tangible device that can retain and store instructions for use by an instruction execution device. The computer-readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as a punch card or a raised structure in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer-readable storage medium, as used herein, is not to be construed as being a transitory signal per se, such as a radio wave or other freely propagating electromagnetic wave, an electromagnetic wave propagating through a waveguide or other transmission medium (e.g., a light pulse passing through a fiber-optic cable), or an electrical signal transmitted through a wire.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present disclosure may be assembler instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case involving a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as a programmable logic circuit, a field-programmable gate array (FPGA), or a programmable logic array (PLA), may be personalized by utilizing state information of the computer-readable program instructions, and the electronic circuitry may execute the computer-readable program instructions to implement various aspects of the present disclosure.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Having described the embodiments of the present disclosure, the foregoing description is exemplary rather than exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application, or the technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (20)

1. A data processing method, comprising:
acquiring a data amount of at least one group of data in data to be processed;
allocating a computing unit to each group of data in the data to be processed according to the data amount and a cache capacity of at least one computing unit; and
reading the data to be processed from a memory, caching the data to be processed into a cache of an allocated computing unit, and processing the data to be processed in the cache through the allocated computing unit.
2. The method according to claim 1, wherein, after the reading of the data to be processed from the memory, the method further comprises:
performing, through the allocated computing unit, first processing on the data to be processed to obtain a processing result;
wherein the processing the data to be processed in the cache through the allocated computing unit comprises:
acquiring the data to be processed from the cache; and
performing, through the allocated computing unit, second processing on the processing result and the data to be processed acquired from the cache.
3. The method according to claim 1 or 2, wherein the allocating a computing unit to each group of data in the data to be processed according to the data amount and a cache capacity of at least one computing unit comprises:
allocating, in a case where the data amount is less than or equal to the cache capacity of each of the at least one computing unit, each group of data in the memory to one computing unit.
4. The method according to claim 1 or 2, wherein the allocating a computing unit to each group of data in the data to be processed according to the data amount and a cache capacity of at least one computing unit comprises:
allocating, in a case where the data amount is greater than the cache capacity of each of the at least one computing unit, each group of data in the memory to a plurality of computing units.
5. The method according to claim 1 or 2, wherein the allocating a computing unit to each group of data in the data to be processed according to the data amount and a cache capacity of at least one computing unit comprises:
allocating, in a case where the data amount is less than or equal to the cache capacity of some of the at least one computing unit and/or greater than the cache capacity of some of the at least one computing unit, each group of data that fits within the cache capacity of a single computing unit to one computing unit, and allocating each of the remaining groups of data in the memory to a plurality of computing units.
6. The method according to claim 4 or 5, wherein, for a group of data allocated to a plurality of computing units, the processing the data to be processed in the cache through the allocated computing units comprises:
acquiring target data belonging to the group of data from the cache of each of the plurality of allocated computing units; and
performing, through each of the plurality of computing units, second processing on the processing result and the target data, respectively.
7. The method according to claim 6, wherein the performing, through each of the plurality of computing units, second processing on the processing result and the target data, respectively, comprises:
performing, in a case where each of the plurality of computing units completes the first processing, second processing on the processing result and the target data through each of the plurality of computing units, respectively.
8. The method according to claim 7, wherein, before the performing, through each of the plurality of computing units, second processing on the processing result and the target data, respectively, the method further comprises:
determining the number of computing units that have completed the first processing among the plurality of computing units; and
determining that each of the plurality of computing units has completed the first processing in a case where the number of computing units that have completed the first processing is greater than or equal to the total number of the plurality of computing units.
9. The method according to any one of claims 1 to 8, wherein the data to be processed comprises image data, the image data comprises data of a plurality of channels, and each group of data corresponds to the data of one channel in the image data.
10. A data processing apparatus, comprising:
an acquisition module, configured to acquire a data amount of at least one group of data in data to be processed;
an allocation module, configured to allocate a computing unit to each group of data in the data to be processed according to the data amount and a cache capacity of at least one computing unit; and
a processing module, configured to read the data to be processed from a memory and cache the data to be processed into a cache of an allocated computing unit, so as to process the data to be processed in the cache through the allocated computing unit.
11. The apparatus according to claim 10, wherein the processing module is further configured to perform, through the allocated computing unit, first processing on the data to be processed to obtain a processing result;
and the processing module is specifically configured to acquire the data to be processed from the cache, and to perform, through the allocated computing unit, second processing on the processing result and the data to be processed acquired from the cache.
12. The apparatus according to claim 10 or 11, wherein the allocation module is specifically configured to allocate each group of data in the memory to one computing unit in a case where the data amount is less than or equal to the cache capacity of each of the at least one computing unit.
13. The apparatus according to claim 10 or 11, wherein the allocation module is specifically configured to allocate each group of data in the memory to a plurality of computing units in a case where the data amount is greater than the cache capacity of each of the at least one computing unit.
14. The apparatus according to claim 10 or 11, wherein the allocation module is specifically configured to, in a case where the data amount is less than or equal to the cache capacity of some of the at least one computing unit and/or greater than the cache capacity of some of the at least one computing unit, allocate each group of data that fits within the cache capacity of a single computing unit to one computing unit, and allocate each of the remaining groups of data in the memory to a plurality of computing units.
15. The apparatus according to claim 13 or 14, wherein the processing module is specifically configured to acquire target data belonging to one group of data from the cache of each of the plurality of allocated computing units, and to perform, through each of the plurality of computing units, second processing on the processing result and the target data, respectively.
16. The apparatus according to claim 15, wherein the processing module is specifically configured to perform, in a case where each of the plurality of computing units completes the first processing, second processing on the processing result and the target data through each of the plurality of computing units, respectively.
17. The apparatus according to claim 16, wherein the processing module is further configured to determine the number of computing units that have completed the first processing among the plurality of computing units, and to determine that each of the plurality of computing units has completed the first processing in a case where the number of computing units that have completed the first processing is greater than or equal to the total number of the plurality of computing units.
18. The apparatus according to any one of claims 10 to 17, wherein the data to be processed comprises image data, the image data comprises data of a plurality of channels, and each group of data corresponds to the data of one channel in the image data.
19. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to invoke the instructions stored in the memory to perform the method according to any one of claims 1 to 9.
20. A computer-readable storage medium having computer program instructions stored thereon, wherein the computer program instructions, when executed by a processor, implement the method according to any one of claims 1 to 9.
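Taken together, claims 1, 2, and 6 to 8 describe a two-phase pipeline: measure the group sizes, allocate computing units by cache capacity, stage each group from memory into the cache(s) of its unit(s), run the first processing, synchronize, and run the second processing on the first-pass result together with the cached data. A hedged end-to-end sketch, reusing the allocate helper from the description above and keeping the mean/centering stand-ins (all names and computations are illustrative assumptions, not the claimed method itself):

```python
import numpy as np

def run(to_be_processed, units):
    """to_be_processed maps group id -> numpy array; units maps
    computing-unit id -> cache capacity in bytes."""
    amounts = {gid: data.nbytes for gid, data in to_be_processed.items()}
    plan = allocate(amounts, units)                 # allocation by cache capacity
    results = {}
    for gid, assigned in plan.items():
        data = to_be_processed[gid]
        # Stage the group into the cache(s) of its assigned unit(s):
        slices = np.array_split(data.ravel(), max(1, len(assigned)))
        # First processing: a global statistic over all cached slices.
        mean = sum(s.sum() for s in slices) / data.size
        # Second processing: combine the result with each unit's cached
        # target data, then reassemble the group.
        out = np.concatenate([s - mean for s in slices])
        results[gid] = out.reshape(data.shape)
    return results
```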
CN201911364894.1A 2019-12-26 2019-12-26 Data processing method and device, electronic equipment and storage medium Active CN111158907B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911364894.1A CN111158907B (en) 2019-12-26 2019-12-26 Data processing method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111158907A true CN111158907A (en) 2020-05-15
CN111158907B CN111158907B (en) 2024-05-17

Family

ID=70558151

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911364894.1A Active CN111158907B (en) 2019-12-26 2019-12-26 Data processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111158907B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113011563A (en) * 2021-03-19 2021-06-22 Peking University Convolutional neural network batch normalization processing method based on GPU
WO2022105473A1 (en) * 2020-11-17 2022-05-27 Zhejiang Dahua Technology Co., Ltd. Systems and methods for data storage and computing

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103608768A (en) * 2013-04-01 2014-02-26 Huawei Technologies Co., Ltd. Data prefetching method, and related devices and systems
CN103778071A (en) * 2014-01-20 2014-05-07 Huawei Technologies Co., Ltd. Cache space distribution method and device
CN105426322A (en) * 2015-12-31 2016-03-23 Huawei Technologies Co., Ltd. Data prefetching method and device
CN106201980A (en) * 2015-01-21 2016-12-07 MediaTek Singapore Pte. Ltd. Processing unit and processing method thereof
CN107810490A (en) * 2015-06-18 2018-03-16 Huawei Technologies Co., Ltd. System and method for directory-based cache coherency


Also Published As

Publication number Publication date
CN111158907B (en) 2024-05-17


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant