WO2024067207A1 - Scheduling method, scheduling device, electronic device and storage medium - Google Patents

Scheduling method, scheduling device, electronic device and storage medium

Info

Publication number: WO2024067207A1
Authority: WO — WIPO (PCT)
Prior art keywords: calculation, data, computing, convolution, units
Application number: PCT/CN2023/119412
Other languages: English (en), French (fr)
Inventors: 袁航剑, 杨东明, 李涛, 施云峰, 王剑
Original Assignee: 北京有竹居网络技术有限公司
Application filed by 北京有竹居网络技术有限公司


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 — Arrangements for program control, e.g. control units
    • G06F 9/06 — Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 — Multiprogramming arrangements
    • G06F 9/48 — Program initiating; program switching, e.g. by interrupt
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 — Computing arrangements based on biological models
    • G06N 3/02 — Neural networks
    • G06N 3/04 — Architecture, e.g. interconnection topology
    • G06N 3/0464 — Convolutional networks [CNN, ConvNet]
    • G06N 3/06 — Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 — Physical realisation using electronic means

Definitions

  • Embodiments of the present disclosure relate to a scheduling method, a scheduling device, an electronic device, and a storage medium.
  • AI chips are chips dedicated to neural network operations and are specially designed to accelerate the execution of neural networks. With the development of artificial intelligence (AI), the number of parameters in algorithm models has increased dramatically, and the demand for computing power has increased.
  • At least one embodiment of the present disclosure provides a scheduling method for a multi-layer convolutional neural network, the scheduling method comprising: a plurality of computing units respectively perform first convolution calculations on corresponding multiple data groups to obtain corresponding multiple first calculation result groups, wherein the multiple first calculation result groups are used to constitute a first convolution layer obtained by the first convolution calculation, and the multiple computing units include a first computing unit and a second computing unit; according to a configuration rule of the second convolution layer obtained after the multiple computing units perform a second convolution calculation on the first convolution layer, the data replication and transmission modes corresponding to the multiple first calculation result groups on the multiple computing units are determined; and a first computing unit among the multiple computing units that needs to perform valid data row filling obtains, based on the corresponding data replication and transmission mode, the first intermediate data row it requires for filling during the second convolution calculation from the first calculation result group on the second computing unit.
  • At least one embodiment of the present disclosure further provides a scheduling device, which includes: a computing control module, configured to enable multiple computing units to perform first convolution calculations on corresponding multiple data groups respectively to obtain corresponding multiple first calculation result groups, wherein the multiple first calculation result groups are used to constitute a first convolution layer obtained by the first convolution calculation, and the multiple computing units include a first computing unit and a second computing unit; an allocation scheduling module, configured to determine a data copying and transmission mode corresponding to the multiple first calculation result groups on the multiple computing units according to a configuration rule of the second convolution layer obtained after the multiple computing units perform the second convolution calculation on the first convolution layer, and a data transmission module, configured to enable a first computing unit among the multiple computing units that needs to fill valid data rows to obtain, based on the corresponding data copying and transmission mode, the first intermediate data row required for the first computing unit to fill in the second convolution calculation process from the first calculation result group on the second computing unit.
  • At least one embodiment of the present disclosure further provides an electronic device, including the scheduling device provided by any embodiment of the present disclosure.
  • At least one embodiment of the present disclosure further provides an electronic device, comprising: a processor; a memory, comprising at least one computer program module; wherein the at least one computer program module is stored in the memory and configured to be executed by the processor, and the at least one computer program module is used to implement the scheduling method described in any embodiment of the present disclosure.
  • At least one embodiment of the present disclosure further provides a storage medium storing non-transitory computer-readable instructions, which, when executed by a computer, implements the scheduling method described in any embodiment of the present disclosure.
  • FIG. 1A is a schematic diagram of model parallelism;
  • FIG. 1B is a schematic diagram of data parallelism;
  • FIG. 2 is a schematic diagram of a convolution calculation process based on data parallelism represented on a calculation graph;
  • FIG. 3 is a flow chart of a scheduling method provided by at least one embodiment of the present disclosure;
  • FIG. 4 is a schematic flow chart of step S20 in FIG. 3;
  • FIG. 5 is a schematic flow chart of step S30 in FIG. 3;
  • FIGS. 6A-6D are schematic diagrams of a data replication and transmission process provided by at least one embodiment of the present disclosure;
  • FIG. 7 is a schematic block diagram of a scheduling device provided by at least one embodiment of the present disclosure;
  • FIG. 8 is a schematic block diagram of an electronic device provided by at least one embodiment of the present disclosure;
  • FIG. 9 is a schematic block diagram of another electronic device provided by at least one embodiment of the present disclosure;
  • FIG. 10 is a schematic block diagram of still another electronic device provided by at least one embodiment of the present disclosure;
  • FIG. 11 is a schematic diagram of a storage medium provided by at least one embodiment of the present disclosure.
  • DSA: domain-specific accelerator
  • PE: processing unit
  • a DSA accelerator refers to a chip that uses multiple computing units (or processing units (PE) or processing cores) to interconnect to achieve computing power expansion and computing acceleration.
  • the multiple processing units are connected to each other through communication lines, such as buses, on-chip networks, etc.
  • Model parallelism: different devices are responsible for computing different parts of the computation graph. For example, in some domain-specific accelerators, each operator in the computation graph must first be statically assigned to a different PE inside the chip, and the input data is then fed to the PE holding the first operator. When that PE completes its calculation, the result is transferred to the next PE, and so on until all operators have been computed. Under model parallelism, each PE executes only the operators in its local memory, and each PE must wait for the previous operator to finish before it can continue; operations are thus executed serially, in a pipelined manner.
  • 3 different operators are loaded on PE0, PE1, and PE2, respectively.
  • these 3 operators are convolution operators corresponding to different layers of the neural network.
  • PE0 has convolution operator A
  • PE1 has convolution operator B
  • PE2 has convolution operator C.
  • the server first loads the input data onto PE0, where convolution operator A is located, and waits for PE0 to complete the first convolution calculation on the input data; the result of the first convolution calculation is then passed to PE1, which performs the second convolution calculation corresponding to convolution operator B, and so on, until convolution operator C on PE2 has been executed and the final result obtained on PE2 is returned to the host.
  • model parallelism requires static compilation, and some operators in the computational graph need to be assigned to multiple fixed PEs, so it cannot process computational graphs with dynamic shapes.
  • Dynamic shape means that the shape of a tensor depends on the specific operations and cannot be obtained by calculation in advance; in other words, a dynamic computation graph is built from the calculations of each step. A model-parallel method that relies on a statically compiled computation graph therefore cannot process dynamic computation graphs.
  • each device has a complete calculation graph.
  • each operator in the calculation graph needs to be loaded into the GPU device in sequence, and each operator is placed on multiple PEs in parallel. Then the input data is split into multiple parts and loaded on multiple PEs respectively. Multiple PEs process the input data in parallel until the calculation of all operators is completed, and the calculation results on multiple PEs are summarized and returned to the host.
  • For example, PE0, PE1, and PE2 are each loaded with the same three operators,
  • where these three operators are convolution operators corresponding to different layers of the neural network:
  • PE0, PE1, and PE2 all hold convolution operators A, B, and C.
  • the server first splits the input data into multiple parts, for example, evenly splits it into three parts of input data, and then loads the three parts of input data to PE0, PE1, and PE2 respectively.
  • PE0, PE1, and PE2 perform multiple convolution calculations of the corresponding convolution operators A, B, and C on the input data in parallel. After the calculation is completed, the system gathers the final results on PE0, PE1, and PE2 and returns them to the host.
  • In the current data-parallel method, for input data with a large data volume, a PE cannot store all the input data required for the calculation at one time. The input data therefore usually has to be split into multiple groups (Group), with the data of each group placed on a different PE, and the multiple PEs calculate in parallel. After the calculation is completed, the calculation results on the multiple PEs are aggregated to obtain the final output result corresponding to the input data.
  • an exemplary computation graph contains three convolution operators (conv) that successively transform tensor %0 into tensors %5, %10, and %15,
  • where %0, %5, %10, and %15 represent tensor data.
  • the input image of the first convolution layer of a neural network can be represented as [B, H, W, C], that is, the dimension of the input data is B ⁇ H ⁇ W ⁇ C.
  • %0 indicates that the size of the input image of the first convolution layer of a neural network is [1, 224, 224, 64], that is, the input image has 64 channels, and the image size of each channel (height H ⁇ width W) is 224 ⁇ 224, for a total of 1 batch.
  • This calculation uses floating-point arithmetic with FP32 precision.
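  • For scale, the memory footprint of this tensor follows directly from its shape and precision; the following quick check is plain arithmetic, not part of the original text:

```python
# Footprint of the %0 input tensor [1, 224, 224, 64] stored as FP32
# (4 bytes per element).
B, H, W, C = 1, 224, 224, 64
bytes_total = B * H * W * C * 4
print(bytes_total, bytes_total / 2**20)  # 12845056 bytes, i.e. ~12.25 MiB
```

  • A single tensor of this size already suggests why, given the limited local memory of each PE, the input data must be split into groups across multiple PEs.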
  • FIG2 shows a schematic diagram of a convolution calculation process using data parallelism represented on a calculation graph, which contains the three convolution operators described above.
  • the input data needs to be split into multiple data groups according to the memory and number of PEs.
  • the size of the input data is 224 ⁇ 224, and the input data is split into three data groups to be loaded into the three PEs for convolution calculation.
  • the size and range of the split data and the data groups need to be selected according to the output data.
  • the size of the output data %15 is 224 ⁇ 224.
  • the data is roughly evenly split into three parts in the height H dimension of the output data.
  • the three output data groups are assigned to three rounds, round1, round2, and round3, corresponding to the three PEs.
  • the (height) sizes of round1, round2, and round3 are 75 rows, 75 rows, and 74 rows, respectively.
  • the data shown in the calculation diagram of FIG. 2 are all "valid data".
  • the "valid data" here refers to the data obtained from each convolution calculation that is of practical significance.
  • after the split, the area at the "edge" of each data group, composed of the edge data together with the padded "0" values used for the convolution calculation with the convolution kernel, is a non-valid area; the result obtained from the convolution calculation over this area is non-valid data and needs to be discarded.
  • the output data obtained by the convolution calculation of the complete input data is valid data, so, corresponding to the split data group, only the data obtained after the convolution calculation of the data in the valid area is valid data.
  • the data shown on the calculation diagram are all valid data.
  • the range of the read data of the input data %0 of round1 in the height direction is [0,77] (row number starts from 0), a total of 78 rows.
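  • This read range can be reproduced by inverting the convolution output-size relation. The sketch below assumes three stacked 3×3 convolutions with stride 1 and padding 1, which is consistent with the row counts quoted above but is an assumption here; the helper name is hypothetical:

```python
# Hypothetical helper reproducing the quoted read range. It assumes every
# operator is a 3x3 convolution with stride 1 and padding 1; rows supplied
# by zero-padding are clamped away so only real image rows are counted.
def input_rows_for_output(first_out: int, last_out: int,
                          kernel: int = 3, stride: int = 1, pad: int = 1,
                          height: int = 224) -> tuple[int, int]:
    """Input rows needed to compute output rows [first_out, last_out]."""
    first_in = max(first_out * stride - pad, 0)
    last_in = min(last_out * stride - pad + kernel - 1, height - 1)
    return first_in, last_in

# round1 must produce rows [0, 74] of %15; walk the requirement back
# through the three convolutions %15 <- %10 <- %5 <- %0:
rows = (0, 74)
for _ in range(3):
    rows = input_rows_for_output(*rows)
print(rows)  # (0, 77): 78 rows of %0, matching the range quoted above
```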
  • At least one embodiment of the present disclosure provides a scheduling method for a multi-layer convolutional neural network.
  • the scheduling method includes: a plurality of computing units respectively perform first convolution calculations on corresponding multiple data groups to obtain corresponding multiple first calculation result groups, wherein the multiple first calculation result groups are used to constitute a first convolution layer obtained by the first convolution calculation, and the multiple computing units include a first computing unit and a second computing unit; according to the configuration rules of the second convolution layer obtained by the multiple computing units after the second convolution calculation is performed on the first convolution layer, the data copying and transmission mode corresponding to the multiple first calculation result groups on the multiple computing units is determined; the first computing unit in the multiple computing units that needs to perform valid data row filling obtains the first intermediate data row required for the first computing unit to fill in the second convolution calculation process from the first calculation result group on the second computing unit based on the corresponding data copying and transmission mode.
  • the scheduling method can achieve balanced utilization of the computing power of the computing units by copying and transmitting valid data from one computing unit to another, reducing repeated calculation of data and improving the utilization rate of computing power.
  • At least one embodiment of the present disclosure further provides a scheduling device, an electronic device, and a storage medium. Similarly, by copying and transmitting valid data on a computing unit to another computing unit, the scheduling device, the electronic device, and the storage medium can achieve balanced utilization of the computing power of the computing unit, reduce repeated calculations of data, and improve the utilization rate of computing power.
  • Step S10: multiple computing units respectively perform first convolution calculations on corresponding multiple data groups to obtain corresponding multiple first calculation result groups, wherein the multiple first calculation result groups are used to constitute a first convolution layer obtained by the first convolution calculation, and the multiple computing units include a first computing unit and a second computing unit;
  • Step S20: determining the data replication and transmission modes corresponding to the multiple first calculation result groups on the multiple computing units according to the configuration rule of the second convolution layer obtained after the multiple computing units perform the second convolution calculation on the first convolution layer;
  • Step S30: the first computing unit among the multiple computing units that needs to fill valid data rows obtains, based on the corresponding data replication and transmission mode, the first intermediate data row required for its filling during the second convolution calculation from the first calculation result group on the second computing unit. A minimal sketch of this three-step flow is given below.
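  • The following Python sketch illustrates the three steps only; ComputeUnit, convolve, and plan_copies are hypothetical stand-ins, not structures or names defined by the disclosure:

```python
# Hypothetical sketch of steps S10-S30 for one scheduling round.
from dataclasses import dataclass, field

@dataclass
class ComputeUnit:
    rows: list                                # local data group (rows)
    halo: dict = field(default_factory=dict)  # rows copied in from peers

def schedule_round(units: list[ComputeUnit], convolve, plan_copies):
    # Step S10: each unit runs the first convolution on its own data group.
    results = [convolve(u.rows) for u in units]
    # Step S20: from the planned layout (configuration rule) of the second
    # convolution layer, derive which boundary rows must be copied where.
    copy_plan = plan_copies(results)          # e.g. [(src_pe, dst_pe, row), ...]
    # Step S30: units that need valid-row filling fetch those rows from a
    # peer's first calculation result group instead of recomputing them.
    for src_pe, dst_pe, row in copy_plan:
        units[dst_pe].halo[(src_pe, row)] = results[src_pe][row]
    return results
```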
  • the scheduling method can be used in computing devices such as AI chips and general-purpose graphics processing units (GPGPUs) that perform convolution operations on multi-layer convolutional neural networks.
  • the AI chip can be an AI chip using a DSA accelerator or an AI chip with multiple PE units.
  • it can be used in data-parallel distributed training, and the embodiments of the present disclosure are not limited to this.
  • the "first convolution calculation” may represent a convolution calculation performed on a complete input image, or may represent a convolution calculation performed on a current input data group on a computing unit.
  • the "first convolution calculation” may be any convolution calculation in a multi-layer convolution calculation, rather than a convolution calculation performed only on the first layer of convolution.
  • the "first convolution calculation” may be a convolution calculation performed on the first layer of convolution, or may be a convolution calculation performed on the second layer, the third layer, and so on.
  • the "second convolution calculation” refers to the next convolution calculation corresponding to the "first convolution calculation”, that is, a convolution calculation performed using the "first convolution layer” obtained after the first convolution calculation as input.
  • the "first convolution layer” is a convolution layer composed of the calculation results obtained by the "first convolution calculation”
  • the “first convolution layer” here refers to the actual convolution layer obtained by performing the first convolution calculation on the complete input image that has not been split.
  • the first convolution layer can be the input data of the second convolution calculation.
  • the "second convolution layer” is a convolution layer composed of the calculation results obtained by the “second convolution calculation”
  • the “second convolution layer” here refers to the actual convolution layer obtained by performing the second convolution calculation on the first convolution layer composed of data obtained without splitting.
  • The scheduling method of the embodiments of the present disclosure will be described in detail below in conjunction with the calculation diagrams shown in FIGS. 6A-6D.
  • three PEs (FIG. 6A-6C) are used as an example for description, but the embodiments of the present disclosure are not limited thereto, and the computing device may include, for example, 2 PEs, 4 PEs (FIG. 6D) or a greater number of PEs.
  • the long strip areas shown in FIGS. 6A-6D are abstract representations of the range of at least part of the valid data on a certain dimension (e.g., height H or width W) of the input image for each convolution calculation.
  • the diagonal area, grid area, and diamond area are abstract representations of the data groups on the computing units (PE0, PE1, PE2, etc.), and their respective positions on the long strip areas are only schematic representations of the corresponding positions of their respective data groups on the input image, and do not constitute a limitation on the embodiments of the present disclosure.
  • the "multiple data groups" corresponding to the multiple computing units may be multiple original input data groups obtained by splitting the original input matrix (these original input data groups are input into the computing device), or may be multiple input data groups for any convolution calculation in the computing device.
  • the multiple data groups may be multiple data groups 10, 20, 30 for performing convolution calculation on %0 on PE0, PE1, and PE2 in FIG6A, or may be multiple data groups for performing convolution calculation on %5.
  • the data group for performing convolution calculation on %5 by each computing unit includes a first calculation result group and an intermediate data row obtained from another computing unit. That is, in at least one embodiment of the present disclosure, the first calculation result group on the first computing unit and the first intermediate data row obtained from the second computing unit constitute the data group required for the first computing unit to perform the second convolution calculation.
  • multiple computing units perform first convolution calculations on corresponding multiple data groups to obtain corresponding multiple first calculation result groups.
  • multiple first calculation result groups can be used to constitute a first convolution layer obtained by the first convolution calculation.
  • multiple computing units PE0, PE1, and PE2 perform first convolution calculations on corresponding multiple data groups 10, 20, and 30 to obtain multiple first calculation result groups 11, 21, and 31, and the multiple first calculation result groups 11, 21, and 31 can be used to constitute a first convolution layer obtained by the first convolution calculation.
  • multiple first calculation result groups in multiple calculation units can directly constitute the first convolution layer obtained by the first convolution calculation, or can constitute a part of the first convolution layer obtained by the first convolution calculation.
  • When the input image is small, the multiple calculation units can hold all the input data at one time, and the first calculation result groups in the multiple calculation units can then be aggregated to form the first convolution layer. When the input image is large, the multiple calculation units cannot hold all the input data at one time; the input data then needs to be loaded into the computing units in multiple batches, with the input data loaded in each batch split into multiple groups placed on the respective computing units. In that case, the multiple first calculation result groups in the multiple computing units constitute only the part of the first convolution layer corresponding to the input data loaded in the current batch.
  • the data rows in the plurality of first calculation result groups are continuous without overlap in the first convolution layer.
  • the data rows in the plurality of first calculation result groups 11, 21, and 31 are continuous without overlap in the first convolution layer, that is, the data in the plurality of first calculation result groups 11, 21, and 31 obtained by performing the first convolution calculation on %0 on the plurality of calculation units PE0, PE1, and PE2 have no overlapping parts.
  • a "data row” may represent a row of data or a column of data in the input image.
  • a "data row” represents a row of data in the height direction of the input image.
  • a "data row” may also represent a column of data in the width direction of the input image, and the embodiments of the present disclosure are not limited to this.
  • In step S20, the data replication and transmission modes corresponding to the multiple first calculation result groups on the multiple calculation units are determined according to the configuration rules of the second convolutional layer obtained after the multiple calculation units perform the second convolutional calculation on the first convolutional layer.
  • the "performing a second convolutional calculation on the first convolutional layer” here does not refer to the actual convolutional calculation in the calculation unit, but refers to the rule based on the convolutional calculation, and the size of the second convolutional layer to be obtained after the second convolutional calculation can be known according to the size of the first convolutional layer. Therefore, the "second convolutional layer to be obtained” does not refer to the specific calculation results of the data in the second convolutional layer.
  • the splitting mode of the second convolutional layer can be determined according to the size of the second convolutional layer, that is, the configuration rules of the second convolutional layer in multiple calculation units are determined.
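  • The size relation used here is the standard convolution output-size formula (well known; stated here as background rather than quoted from the text): for input height $H_{\mathrm{in}}$, kernel size $k$, stride $s$, and padding $p$,

$$H_{\mathrm{out}} = \left\lfloor \frac{H_{\mathrm{in}} + 2p - k}{s} \right\rfloor + 1.$$

  • With the 3×3, stride-1, padding-1 convolutions assumed in the running example, $H_{\mathrm{out}} = H_{\mathrm{in}}$, so the second convolutional layer has the same 224-row height as the first, and its split can be planned before any data is actually computed.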
  • Fig. 4 is a schematic flow chart of step S20 in Fig. 3.
  • step S20 may further include steps S21 to S23.
  • Step S21: determining a configuration rule of the second convolutional layer in the multiple computing units based on the memory size of the memories on the multiple computing units;
  • Step S22: determining the distribution of the second convolutional layer on the multiple computing units based on the configuration rule;
  • Step S23: determining, based on the distribution, the data replication and transmission modes corresponding to the first calculation result groups on the multiple computing units.
  • In step S21, the size of the storage (e.g., memory) on the multiple computing units is obtained, and the configuration rules of the second convolutional layer in the multiple computing units are determined based on the memory size of each computing unit.
  • the second convolutional layer is split into multiple pieces of data, and the size of each piece of data shall not exceed the available capacity of the memory allocated to the corresponding computing unit, which is the premise for determining the configuration rules.
  • In step S22, based on the determined configuration rule, it is determined how to split the second convolutional layer; that is, the distribution, on the multiple computing units, of the multiple data groups into which the second convolutional layer is split is determined. For example, corresponding to the distribution of the multiple first calculation result groups on the multiple computing units and the determined configuration rule, the distribution of the multiple data groups split from the second convolutional layer on the corresponding multiple computing units can be determined.
  • the data copying and transmission mode corresponding to the first computing result group on the multiple computing units is determined.
  • the data copying and transmission mode may include a unidirectional data transmission mode and a bidirectional data transmission mode.
  • the unidirectional data transmission mode means that the computing unit can only obtain copied data from another unit
  • the bidirectional data transmission mode means that the computing unit can not only obtain copied data from another computing unit, but also copy and transmit the data in itself to the other computing unit.
  • multiple computing units can be set to have the same data copying and transmission mode, or multiple computing units can be set to have different data copying and transmission modes.
  • the data copying and transmission mode is not limited to the above two modes, and can also be other feasible implementation methods, and the embodiments of the present disclosure are not limited to this.
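  • As a purely illustrative encoding of the two modes named above (the enum and the decision rule are hypothetical simplifications, not an API defined by the disclosure):

```python
# Illustrative model of the copy modes; not the disclosure's terminology.
from enum import Enum
from typing import Optional

class CopyMode(Enum):
    UNIDIRECTIONAL = "fetch only"     # the unit only receives copied rows
    BIDIRECTIONAL = "fetch and send"  # the unit both receives and sends rows

def mode_for_unit(fetches_rows: bool, sends_rows: bool) -> Optional[CopyMode]:
    # A unit that exchanges rows in both directions (as in FIG. 6A) is
    # bidirectional; one that only pulls or only pushes rows (as in FIG. 6B)
    # is unidirectional; a unit with no exchange needs no copy mode.
    if fetches_rows and sends_rows:
        return CopyMode.BIDIRECTIONAL
    if fetches_rows or sends_rows:
        return CopyMode.UNIDIRECTIONAL
    return None
```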
  • the configuration rule may be: make the distribution of the data of the second convolution layer on multiple computing units the same as the data range of the first calculation result group in the multiple computing units.
  • multiple computing units PE0, PE1, and PE2 perform a first convolution calculation on the data group of %0 to obtain a first convolution layer composed of multiple first calculation result groups 11, 21, and 31.
  • the size of the second convolution layer obtained by performing a second convolution calculation on the first convolution layer can be calculated. Therefore, it is proposed to split the data of the second convolution layer and distribute it to multiple computing units PE0, PE1, and PE2 according to the distribution method on %10 as shown in FIG6A.
  • the data copy and transmission mode when each computing unit performs a convolution calculation on %5 can be determined.
  • the data replication and transmission modes of the multiple computing units PE0, PE1, and PE2 can be determined according to the configuration rule; in this example, the transmission modes of all three computing units are bidirectional data transmission modes.
  • the configuration rule may be: let the distribution of the data of the second convolutional layer on some of the multiple computing units be the same as the data range of the first calculation result group in the part of the computing units.
  • In FIG. 6B, after the size of the second convolutional layer is known, it is planned to distribute the data of the second convolutional layer to the multiple computing units PE0, PE1, and PE2 according to the distribution on %10 shown in FIG. 6B; that is, the data of the second convolutional layer is split into 3 parts, where the data range allocated to PE1 equals the data range of the first calculation result group on PE1. The data copy transmission mode used when each computing unit performs the convolution calculation on %5 can therefore be determined according to the configuration rule; for example, it is determined that the data copy transmission modes of the multiple computing units PE0, PE1, and PE2 are all unidirectional data transmission modes.
  • the configuration rules can be set not only according to the memory size of the computing unit, but also flexibly set according to factors such as the computing power distribution of multiple computing units and data transmission overhead.
  • the embodiments of the present disclosure do not limit the specific content of the configuration rules.
  • the first computing unit that needs to perform valid data row filling among multiple computing units is first determined.
  • multiple computing units may all need to perform valid data row filling, or only some computing units among multiple computing units may need to perform valid data row filling.
  • in the example of FIG. 6A, PE0, PE1, and PE2 are all first computing units that need to perform valid data row filling;
  • in the example of FIG. 6B, PE1 and PE2 are first computing units that need to perform valid data row filling;
  • in the example of FIG. 6C, only PE1 is the first computing unit that needs to perform valid data row filling.
  • In step S30, after determining the first computing unit that needs to perform valid data row filling and the data copy transmission mode corresponding to the first computing unit, the first computing unit obtains at least one first intermediate data row from the first calculation result group of another computing unit.
  • The first computing unit can obtain two first intermediate data rows from the first calculation result groups of two different computing units, both of which are required for filling during the second convolution calculation performed by the first computing unit.
  • PE0, as the first computing unit, obtains one or more first intermediate data rows from the first calculation result group 21 of PE1;
  • PE1, as the first computing unit, obtains one or more first intermediate data rows from the first calculation result group 11 of PE0, and obtains one or more first intermediate data rows from the first calculation result group 31 of PE2;
  • PE2, as the first computing unit, obtains one or more first intermediate data rows from the first calculation result group 21 of PE1.
  • Fig. 5 is a schematic flow chart of step S30 in Fig. 3.
  • step S30 may further include steps S31 and S32.
  • Step S31 determining the size of the first intermediate data row based on the relationship between the second calculation result group obtained after the first calculation unit performs the second convolution calculation and the first calculation result group obtained after the first calculation unit performs the first convolution calculation;
  • Step S32 based on the data replication transmission mode and the size of the first intermediate data row, obtain the first intermediate data row from the first calculation result group on the second calculation unit.
  • In step S31, the size of the first intermediate data row obtained by the first computing unit from another computing unit is determined by the relationship between the second calculation result group and the first calculation result group; this relationship is in turn determined by factors such as the distribution of the second convolutional layer on the computing units, the size of the convolution kernel, and the padding operation.
  • the first intermediate data row can be obtained from the first calculation result group on the second calculation unit through inter-core transmission between the calculation units. For example, by obtaining the address of the first intermediate data row in the first calculation result group of the second calculation unit, the first intermediate data row is read and copied to the first calculation unit, and the copied first intermediate data row is written to the corresponding address of the first calculation unit.
  • For example, PE1 holds rows [74,150] of the valid data of %0, a total of 77 rows, and the storage addresses corresponding to these 77 rows of data are address_0 to address_76.
  • The first calculation result group on PE1 corresponds to rows [75,149] of the first convolution layer, a total of 75 rows, which can be allocated to addresses address_1 to address_75; the first intermediate data row obtained from the first calculation result group 11 of PE0 is written to address address_0, and the first intermediate data row obtained from the first calculation result group 31 of PE2 is written to address address_76.
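  • A minimal sketch of this assembly step follows, using a plain Python list as a stand-in for PE1's local buffer; the 77-slot address layout (address_0 through address_76) follows the example above as interpreted here:

```python
# Sketch of the inter-core copy: PE1 assembles rows [74, 150] for its
# second convolution from its own results plus one boundary row from each
# neighbour. Dicts of strings stand in for the PEs' result groups.
pe0_results = {row: f"pe0[{row}]" for row in range(0, 75)}     # rows [0, 74]
pe1_results = {row: f"pe1[{row}]" for row in range(75, 150)}   # rows [75, 149]
pe2_results = {row: f"pe2[{row}]" for row in range(150, 224)}  # rows [150, 223]

buffer_pe1 = [None] * 77                  # address_0 .. address_76
buffer_pe1[0] = pe0_results[74]           # intermediate row from PE0 -> address_0
for i, row in enumerate(range(75, 150)):  # own results -> address_1 .. address_75
    buffer_pe1[1 + i] = pe1_results[row]
buffer_pe1[76] = pe2_results[150]         # intermediate row from PE2 -> address_76

assert all(slot is not None for slot in buffer_pe1)  # rows [74, 150] assembled
```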
  • the embodiments of the present disclosure do not limit the specific implementation method of the data replication and transmission process between computing units.
  • the scheduling method also includes step S40 (not shown in the figure): splitting the original input matrix to obtain multiple original input data groups, and transmitting the multiple original input data groups to multiple computing units respectively for performing a first convolution calculation.
  • the splitting mode of the original input matrix can be determined according to the size of the first convolution layer obtained after performing the first convolution calculation on the original input matrix, the number of computing units, the memory size, etc.
  • the splitting mode is to evenly split the original input matrix into multiple original input data groups.
  • the original input matrix may be split into a plurality of original input data groups according to the memory size of the computing unit, so as to be respectively placed in a plurality of computing units.
  • the original input data group may be used as the data group for performing the first convolution calculation.
  • the scheduling method further includes step S50 (not shown in the figure): in response to the second convolution layer being the output layer in the multi-layer convolutional neural network, aggregating multiple second calculation result groups in multiple calculation units to obtain at least part of the calculation output of the multi-layer convolutional neural network.
  • the second convolution layer is the last convolution layer after the last convolution operator is calculated
  • the second convolution layer is the output result. That is, the second convolution layer composed of the second calculation result groups in multiple calculation units is the calculation output or at least part of the calculation output of the multi-layer convolutional neural network.
  • the plurality of computing units include at least two computing units, such as a first computing unit and a second computing unit, and the multi-layer convolutional neural network includes at least two convolutional layers.
  • the example shown in FIGS. 6A-6C includes three computing units PE0, PE1, and PE2, and two convolution operators (conv), which successively transform tensor %0 into tensors %5 and %10.
  • the chip uses three computing units PE0, PE1, and PE2 to perform data-parallel convolution calculations on 224 ⁇ 224 input data.
  • the original input matrix is roughly evenly split into three original input data groups 10, 20, and 30 according to the height H dimension of the input data, and loaded onto the three computing units PE0, PE1, and PE2, respectively, as shown on %0.
  • the original input data group 10 includes [0,75], a total of 76 rows of data;
  • the original input data group 20 includes [74,150], a total of 77 rows of data;
  • the original input data group 30 includes [149,223], a total of 75 rows of data.
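  • The split quoted above can be reproduced with a short helper, assuming each group needs one extra boundary row of input on every interior side (consistent with a 3×3 kernel, stride 1, padding 1); the function name and structure are illustrative only:

```python
# Hypothetical reconstruction of the quoted split: 224 input rows, three
# PEs, one halo row added on each side of a group that has a neighbour.
def split_with_halo(height: int, parts: int, halo: int = 1):
    base, rem = divmod(height, parts)  # 224 rows -> 75 + 75 + 74 output rows
    groups, start = [], 0
    for p in range(parts):
        out_rows = base + (1 if p < rem else 0)
        first = max(start - halo, 0)                         # extend upward
        last = min(start + out_rows - 1 + halo, height - 1)  # extend downward
        groups.append((first, last))
        start += out_rows
    return groups

print(split_with_halo(224, 3))  # [(0, 75), (74, 150), (149, 223)]
```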
  • multiple computing units PE0, PE1, and PE2 perform the first convolution calculation on the corresponding three original input data groups 10, 20, and 30, respectively, to obtain the first calculation result groups 11 (row [0,74]), 21 (row [75,149]), and 31 (row [150,223]) as shown in %5.
  • the first calculation result groups constitute the first convolution layer corresponding to the first convolution calculation performed on the original input matrix.
  • step S20 is executed, and according to the configuration rule that the distribution of the data of the second convolutional layer on multiple computing units is the same as the data range of the first calculation result group in multiple computing units, it is determined that the three computing units PE0, PE1, and PE2 all have the same data copy transmission mode, that is, a bidirectional data transmission mode.
  • step S30 is executed to determine that the three computing units PE0, PE1, and PE2 are all computing units that need to perform data copy transmission, and it is determined that the first intermediate data row required by each computing unit is one row, and then each computing unit obtains the first intermediate data row required to fill it when performing the second convolution calculation from another computing unit.
  • PE0 serves as the first computing unit and PE1 serves as the second computing unit.
  • the first computing unit PE0 obtains the first intermediate data row [75] required for PE0 to fill in the second convolution calculation process from the first computing result group 21 of the second computing unit PE1;
  • the second computing unit PE1 obtains the second intermediate data row [74] required for PE1 to fill in the second convolution calculation process from the first computing result group 11 of the first computing unit PE0, and the first intermediate data row and the second intermediate data row obtained from the first computing unit PE0 and the second computing unit PE1 are of the same size, both being 1 row of data.
  • PE1 serves as the first computing unit and PE2 serves as the second computing unit.
  • the first computing unit PE1 obtains the first intermediate data row [150] required for PE1 to fill in the second convolution calculation process from the first computing result group 31 on the second computing unit PE2; as shown by arrow 4, the second computing unit PE2 obtains the second intermediate data row [149] required for PE2 to fill in the second convolution calculation process from the first computing result group 21 of the first computing unit PE1, and the first intermediate data row and the second intermediate data row obtained by the first computing unit PE1 and the second computing unit PE2 from each other are the same in size, both being 1 row of data.
  • the first calculation result group [0, 74] on PE0 and the first intermediate data row [75] obtained from PE1 constitute the data group [0, 75] required for PE0 to perform the second convolution calculation.
  • the first calculation result group [75, 149] on PE1 and the second intermediate data row [74] and the first intermediate data row [150] obtained from PE0 and PE2 respectively constitute the data group [74, 150] required for PE1 to perform the second convolution calculation.
  • the first calculation result group [150, 223] on PE2 and the second intermediate data row [149] obtained from PE1 constitute the data group [149, 223] required for PE2 to perform the second convolution calculation.
  • step S50 is executed to aggregate the second calculation result groups [0,74], [75,149], [150,223] on PE0, PE1, and PE2, which are the calculation results of the output layer.
  • the scheduling method provided by at least one embodiment of the present disclosure can reduce the amount of calculation of repeated data in the convolutional neural network and improve the calculation efficiency of the multi-layer convolutional neural network by copying and transmitting valid data between computing units and using the valid data as padding rows.
  • In the example shown in FIG. 6B, in step S20 it can be determined that the three computing units PE0, PE1, and PE2 all have the same unidirectional data transfer mode.
  • In step S30 it is determined that computing units PE1 and PE2 are the computing units that need to perform data replication and transmission, and that the first intermediate data rows required by each of these computing units amount to two rows; that is, PE1 obtains two rows of valid data from PE0 for filling, and PE2 obtains two rows of valid data from PE1 for filling.
  • In the example shown in FIG. 6C, only computing unit PE1 is a computing unit that needs to perform data replication and transmission; that is, PE1 obtains two rows of valid data from PE0 and from PE2, respectively, for filling.
  • the scheduling method of the examples shown in FIG. 6B and FIG. 6C can not only reduce the repeated calculation of data and improve the calculation efficiency, but also reduce the number of data transmissions between computing units and reduce the overhead required for data transmission.
  • the computing device may further include more computing units
  • the convolution network may further include more convolution operators, which are not limited in the embodiments of the present disclosure.
  • the example shown in FIG. 6D includes four computing units PE0, PE1, PE2, and PE3, and three convolution operators (conv).
  • In the example shown in FIG. 6D, the multiple computing units PE0, PE1, PE2, and PE3 may have different data replication and transmission modes.
  • the size of the first intermediate data row obtained by each computing unit from another computing unit may be different, and the embodiments of the present disclosure are not limited to this.
  • the first intermediate data row obtained by PE2 from PE1 is 1 row
  • the first intermediate data row obtained from PE3 is 2 rows.
  • the steps of the scheduling method in the example shown in Figure 6D can be found in the description of Figure 6A, and will not be repeated here. Using the scheduling method of the example shown in Figure 6D can not only reduce the repeated calculation of data and improve the utilization of computing power, but also realize the balanced distribution of memory, computing power and data transmission overhead between different computing units based on the comprehensive performance of the system.
  • Fig. 7 is a schematic block diagram of a scheduling device provided by some embodiments of the present disclosure.
  • the scheduling device 100 includes a computing control module 110, an allocation scheduling module 120 and a data transmission module 130. These components may be interconnected via a bus and/or other forms of connection mechanisms (not shown).
  • the computing control module 110 is configured to enable the plurality of computing units to respectively perform first convolution computing on the corresponding plurality of data groups to obtain corresponding plurality of first computing result groups, wherein the plurality of first computing result groups are used to constitute a first convolution layer obtained by the first convolution computing, and the plurality of computing units include a first computing unit and a second computing unit;
  • the allocation scheduling module 120 is configured to determine the data copy transmission mode corresponding to the plurality of first calculation result groups on the plurality of calculation units according to the configuration rule of the second convolution layer in the plurality of calculation units obtained after the plurality of calculation units perform the second convolution calculation on the first convolution layer;
  • the data transmission module 130 is configured to enable the first computing unit among multiple computing units that needs to fill valid data rows to obtain the first intermediate data row required for the first computing unit to fill in the second convolution calculation process from the first computing result group on the second computing unit based on the corresponding data copy transmission mode.
  • each module of the scheduling device 100 corresponds to each step of the aforementioned scheduling method.
  • For the specific functions of the scheduling device 100, reference can be made to the relevant description of the scheduling method above, which will not be repeated here.
  • the components and structures of the scheduling device 100 shown in FIG7 are exemplary, not restrictive, and the scheduling device 100 may also include other components and structures as needed.
  • FIG8 is a schematic block diagram of an electronic device provided by some embodiments of the present disclosure.
  • the electronic device 200 includes a scheduling device 210.
  • the scheduling device 210 may be a scheduling device provided by any embodiment of the present disclosure, such as the aforementioned scheduling device 100.
  • the electronic device 200 may be any device with a computing function, such as a server, a terminal device, a personal computer, etc., and the embodiments of the present disclosure are not limited in this regard.
  • FIG9 is a schematic block diagram of another electronic device provided in some embodiments of the present disclosure.
  • the electronic device 300 includes a processor 310 and a memory 320, which can be used to implement a client or a server.
  • the memory 320 is used to non-transiently store computer-executable instructions (e.g., one or more computer program modules).
  • the processor 310 is used to run the computer-executable instructions, and when the computer-executable instructions are run by the processor 310, one or more steps of the scheduling method described above can be executed, thereby implementing the scheduling method described above.
  • the memory 320 and the processor 310 can be interconnected via a bus system and/or other forms of connection mechanisms (not shown).
  • the processor 310 may be a central processing unit (CPU), a graphics processing unit (GPU), or other forms of processing units having data processing capabilities and/or program execution capabilities.
  • the central processing unit (CPU) may be an X86 or ARM architecture, etc.
  • the processor 310 may be a general-purpose processor or a dedicated processor, and may control other components in the electronic device 300 to perform desired functions.
  • the memory 320 may include one or more computer program products in any combination, and the computer program products may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory.
  • Volatile memory may include, for example, random access memory (RAM) and/or cache memory, etc.
  • Non-volatile memory may include, for example, read-only memory (ROM), hard disk, erasable programmable read-only memory (EPROM), portable compact disk read-only memory (CD-ROM), USB memory, flash memory, etc.
  • At least one (e.g., one or more) computer program modules may be stored on the computer-readable storage medium, and the processor 310 may run at least one (e.g., one or more) computer program modules to implement various functions of the electronic device 300.
  • Various applications and various data, as well as various data used and/or generated by the application, etc. may also be stored in the computer-readable storage medium.
  • FIG10 is a schematic block diagram of another electronic device provided in some embodiments of the present disclosure.
  • the electronic device 400 is suitable for implementing the scheduling method provided in the embodiments of the present disclosure.
  • the electronic device 400 may be a terminal device, etc., and may be used to implement a client or a server.
  • the electronic device 400 may include, but is not limited to, mobile terminals such as mobile phones, laptop computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), vehicle-mounted terminals (such as vehicle-mounted navigation terminals), wearable electronic devices, etc., and fixed terminals such as digital TVs, desktop computers, smart home devices, etc.
  • the electronic device 400 shown in FIG. 10 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
  • the electronic device 400 may include a processing device (e.g., a central processing unit, a graphics processing unit, etc.) 410, which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 420 or a program loaded from a storage device 480 into a random access memory (RAM) 430.
  • Various programs and data required for the operation of the electronic device 400 are also stored in the RAM 430.
  • the processing device 410, the ROM 420, and the RAM 430 are connected to each other via a bus 440.
  • An input/output (I/O) interface 450 is also connected to the bus 440.
  • the following devices may be connected to the I/O interface 450: an input device 460 including, for example, a touch screen, a touchpad, a keyboard, a mouse, a camera, a microphone, an accelerometer, a gyroscope, etc.; an output device 470 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, etc.; a storage device 480 including, for example, a magnetic tape, a hard disk, etc.; and a communication device 490.
  • the communication device 490 may allow the electronic device 400 to communicate with other electronic devices wirelessly or by wire to exchange data.
  • FIG. 10 shows an electronic device 400 having various devices, it should be understood that it is not required to implement or have all of the devices shown, and the electronic device 400 may alternatively implement or have more or fewer devices.
  • the above-mentioned scheduling method can be implemented as a computer software program.
  • an embodiment of the present disclosure includes a computer program product, which includes a computer program carried on a non-transitory computer-readable medium, and the computer program includes a program code for executing the above-mentioned scheduling method.
  • the computer program can be downloaded and installed from the network through the communication device 490, or installed from the storage device 480, or installed from the ROM 420.
  • When the computer program is executed by the processing device 410, the functions defined in the scheduling method provided in the embodiments of the present disclosure can be implemented.
  • At least one embodiment of the present disclosure further provides a storage medium.
  • the utilization rate of the matrix operation unit can be improved, its computing power can be effectively utilized, and the convolution operation time can be shortened, thereby improving operation efficiency and saving data transmission time.
  • FIG11 is a schematic diagram of a storage medium provided by some embodiments of the present disclosure.
  • the storage medium 500 may be a non-transitory computer-readable storage medium storing non-transitory computer-readable instructions 510.
  • For example, when the non-transitory computer-readable instructions 510 are executed by a processor, one or more steps of the scheduling method described above may be executed, thereby implementing the scheduling method described in the embodiments of the present disclosure.
  • the storage medium 500 may be applied to the above-mentioned electronic device.
  • the storage medium 500 may include the memory 320 in the electronic device 300 .
  • the storage medium may include a memory card of a smart phone, a storage component of a tablet computer, a hard disk of a personal computer, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a portable compact disk read-only memory (CD-ROM), flash memory, or any combination of the above storage media, or other applicable storage media.
  • the description of the storage medium 500 can refer to the description of the memory in the embodiments of the electronic device, and the repeated description is omitted here.
  • the specific functions and technical effects of the storage medium 500 can refer to the description of the scheduling method above, and are not repeated here.
  • the scheduling method provided by the embodiments of the present disclosure can reduce the amount of calculation of repeated data in the convolutional neural network and improve the calculation efficiency of the multi-layer convolutional neural network by copying and transmitting valid data between computing units and using the valid data as padding rows.
  • a computer-readable medium may be a tangible medium that may contain or store a program for use by an instruction execution system, device or equipment or used in conjunction with an instruction execution system, device or equipment.
  • a computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium or any combination of the above.
  • a computer-readable storage medium may be, for example, but not limited to: an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above.
  • Computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • a computer-readable storage medium may be any tangible medium containing or storing a program that can be used by or in combination with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in a baseband or as part of a carrier wave, which carries a computer-readable program code. Such propagated data signals may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the above.
  • a computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, which may send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device.
  • the program code contained on the computer-readable medium may be transmitted using any suitable medium, including but not limited to: wires, optical cables, RF (radio frequency), etc., or any suitable combination of the above.
  • the client and server may communicate using any currently known or future developed network protocol such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network).
  • Examples of communication networks include a local area network ("LAN”), a wide area network ("WAN”), an internet (e.g., the Internet), and a peer-to-peer network (e.g., an ad hoc peer-to-peer network), as well as any currently known or future developed network.
  • the computer-readable medium may be included in the electronic device, or may exist independently without being installed in the electronic device.
  • Computer program code for performing the operations of the present disclosure may be written in one or more programming languages or combinations thereof, including, but not limited to, object-oriented programming languages, such as Java, Smalltalk, C++, and conventional procedural programming languages, such as "C" language or similar programming languages.
  • the program code may be executed entirely on a user's computer, partly on a user's computer, as a stand-alone software package, partly on a user's computer and partly on a remote computer, or entirely on a remote computer or server.
  • the remote computer may be connected to the user's computer via any type of network (including a local area network (LAN) or a wide area network (WAN)), or may be connected to an external computer (e.g., via the Internet using an Internet service provider).
  • Each block in a flowchart or block diagram can represent a module, a program segment, or a part of code, and the module, program segment, or part of code contains one or more executable instructions for implementing the specified logical function.
  • In some alternative implementations, the functions marked in the blocks can also occur in an order different from that marked in the accompanying drawings. For example, two blocks shown in succession can actually be executed substantially in parallel, and they can sometimes be executed in the reverse order, depending on the functions involved.
  • Each block in a block diagram and/or flowchart, and the combination of blocks in a block diagram and/or flowchart, can be implemented by a dedicated hardware-based system that performs the specified function or operation, or by a combination of dedicated hardware and computer instructions.
  • the units involved in the embodiments described in the present disclosure may be implemented by software or hardware, wherein the name of a unit does not, in some cases, constitute a limitation on the unit itself.
  • exemplary types of hardware logic components include: field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chips (SOCs), complex programmable logic devices (CPLDs), and the like.
  • Example 1 provides a scheduling method for a multi-layer convolutional neural network, including: multiple computing units respectively perform first convolution calculations on corresponding multiple data groups to obtain corresponding multiple first calculation result groups, wherein the multiple first calculation result groups are used to constitute a first convolution layer obtained by the first convolution calculation, and the multiple computing units include a first computing unit and a second computing unit; according to a configuration rule of the second convolution layer obtained after the multiple computing units perform a second convolution calculation on the first convolution layer, the data copy and transmission mode corresponding to the multiple first calculation result groups on the multiple computing units is determined; the first computing unit among the multiple computing units that needs to fill valid data rows obtains the first intermediate data row required for the first computing unit to fill in the second convolution calculation process from the first calculation result group on the second computing unit based on the corresponding data copy and transmission mode.
  • Example 2 A scheduling method according to Example 1, wherein the data rows in the multiple first calculation result groups are continuous without overlap in the first convolutional layer.
  • Example 3 The scheduling method according to Example 1, wherein the first calculation result group on the first computing unit and the first intermediate data row obtained from the second computing unit constitute a data group required for the first computing unit to perform the second convolution calculation.
  • Example 4 A scheduling method according to Example 1, wherein the corresponding multiple data groups used by the multiple computing units in performing the first convolution calculation are multiple original input data groups.
  • Example 5 The scheduling method according to Example 4 further includes:
  • the original input matrix is split to obtain the multiple original input data groups, and the multiple original input data groups are respectively transmitted to the multiple computing units to perform the first convolution calculation.
  • Example 6 The scheduling method according to Example 1, wherein the step of determining the data copy transmission mode corresponding to the plurality of first calculation result groups on the plurality of computing units according to a configuration rule of the second convolutional layer obtained after the plurality of computing units perform the second convolution calculation on the first convolutional layer comprises:
  • determining, based on the memory sizes of the memories on the plurality of computing units, the configuration rule of the second convolutional layer among the plurality of computing units; determining, based on the configuration rule, the distribution of the second convolutional layer on the plurality of computing units; and determining, based on the distribution, the data copy transmission mode corresponding to the first calculation result groups on the plurality of computing units.
  • Example 7 The scheduling method according to Example 1, wherein the step of obtaining, from the first calculation result group on the second computing unit, the first intermediate data row required by the first computing unit for padding during the second convolution calculation comprises:
  • determining the size of the first intermediate data row based on the relationship between the second calculation result group to be obtained after the first computing unit performs the second convolution calculation and the first calculation result group obtained by the first computing unit through the first convolution calculation; and obtaining, based on the data copy transmission mode and the size of the first intermediate data row, the first intermediate data row from the first calculation result group on the second computing unit.
  • Example 8 A scheduling method according to Example 7, wherein the size of the second calculation result group obtained by the first calculation unit is the same as the size of the first calculation result group on the first calculation unit.
  • Example 9 The scheduling method according to Example 7, further comprising:
  • in response to the second convolutional layer being the output layer in the multi-layer convolutional neural network, a plurality of second calculation result groups in the plurality of computing units are aggregated to obtain at least a portion of the calculation output of the multi-layer convolutional neural network.
  • Example 10 The scheduling method according to any one of the above examples 1 to 9, further comprising:
  • the second computing unit obtains, based on the corresponding data copy and transfer mode, a second intermediate data row required for the second computing unit to fill in the second convolution calculation process from the first calculation result group on the first computing unit.
  • Example 11 A scheduling method according to Example 10, wherein the first intermediate data row and the second intermediate data row obtained by the first computing unit and the second computing unit from each other are of the same size.
  • Example 12 provides a scheduling device for a multi-layer convolutional neural network, including:
  • a calculation control module configured to enable a plurality of computing units to respectively perform a first convolution calculation on a corresponding plurality of data groups to obtain a corresponding plurality of first calculation result groups, wherein the plurality of first calculation result groups are used to constitute a first convolutional layer obtained by the first convolution calculation, and the plurality of computing units include a first computing unit and a second computing unit;
  • an allocation scheduling module configured to determine, according to a configuration rule of the second convolutional layer obtained after the plurality of computing units perform the second convolution calculation on the first convolutional layer, the data copy transmission modes corresponding to the plurality of first calculation result groups on the plurality of computing units;
  • the data transmission module is configured to enable the first computing unit among multiple computing units that needs to fill valid data rows to obtain the first intermediate data row required for the first computing unit to fill in the second convolution calculation process from the first calculation result group on the second computing unit based on the corresponding data copy transmission mode.
  • Example 13 A scheduling device according to Example 12, wherein the data rows in the multiple first calculation result groups are continuous without overlap in the first convolutional layer.
  • Example 14 A scheduling device according to Example 12, wherein the first calculation result group on the first computing unit and the first intermediate data row obtained from the second computing unit constitute a data group required for the first computing unit to perform the second convolution calculation.
  • Example 15 A scheduling device according to Example 12, wherein the corresponding multiple data groups used by the multiple computing units in performing the first convolution calculation are multiple original input data groups.
  • Example 16 The scheduling device according to Example 15, further comprising:
  • the data splitting module is configured to split the original input matrix to obtain the multiple original input data groups, and transmit the multiple original input data groups to the multiple computing units respectively to perform the first convolution calculation.
  • Example 17 The scheduling device according to Example 12, wherein the determining the data copy transmission mode corresponding to the plurality of first calculation result groups on the plurality of computing units according to the configuration rule of the second convolutional layer obtained after the plurality of computing units perform the second convolution calculation on the first convolutional layer comprises:
  • determining, based on the memory sizes of the memories on the plurality of computing units, the configuration rule of the second convolutional layer among the plurality of computing units; determining, based on the configuration rule, the distribution of the second convolutional layer on the plurality of computing units; and determining, based on the distribution, the data copy transmission mode corresponding to the first calculation result groups on the plurality of computing units.
  • Example 18 The scheduling device according to Example 12, wherein the step of obtaining, from the first calculation result group on the second computing unit, the first intermediate data row required by the first computing unit for padding during the second convolution calculation comprises:
  • determining the size of the first intermediate data row based on the relationship between the second calculation result group to be obtained after the first computing unit performs the second convolution calculation and the first calculation result group obtained by the first computing unit through the first convolution calculation; and obtaining, based on the data copy transmission mode and the size of the first intermediate data row, the first intermediate data row from the first calculation result group on the second computing unit.
  • Example 19 A scheduling device according to Example 18, wherein the size of the second calculation result group obtained by the first calculation unit is the same as the size of the first calculation result group on the first calculation unit.
  • Example 20 The scheduling device according to Example 18, further comprising:
  • a data output module is configured to, in response to the second convolutional layer being the output layer in the multi-layer convolutional neural network, aggregate multiple second calculation result groups in the multiple computing units to obtain at least part of the calculation output of the multi-layer convolutional neural network.
  • Example 21 The scheduling device according to any one of Examples 12 to 20, wherein the data transmission module is further configured as follows:
  • the second computing unit is enabled to obtain, from the first computing result group on the first computing unit based on the corresponding data copy transmission mode, a second intermediate data row required for the second computing unit to fill in the second convolution computing process.
  • Example 22 A scheduling device according to Example 21, wherein the first intermediate data row and the second intermediate data row obtained by the first computing unit and the second computing unit from each other are of the same size.
  • Example 23 provides an electronic device, comprising the scheduling device described in any one of Examples 12-22 above.
  • Example 24 provides an electronic device, comprising: a processor; a memory, comprising at least one computer program module; wherein the at least one computer program module is stored in the memory and configured to be executed by the processor, and the at least one computer program module is used to implement the scheduling method described in any one of the above Examples 1-11.
  • Example 25 provides a storage medium storing non-transitory computer-readable instructions, which, when executed by a computer, implements the scheduling method described in any one of Examples 1-11 above.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

A scheduling method, a scheduling apparatus, an electronic device, and a storage medium. The scheduling method comprises: a plurality of computing units respectively perform a first convolution calculation on a corresponding plurality of data groups to obtain a corresponding plurality of first calculation result groups, wherein the plurality of first calculation result groups are used to constitute a first convolutional layer obtained by the first convolution calculation; according to a configuration rule, among the plurality of computing units, of a second convolutional layer to be obtained after the plurality of computing units perform a second convolution calculation on the first convolutional layer, a data copy transmission mode corresponding to the plurality of first calculation result groups on the plurality of computing units is determined; a first computing unit, among the plurality of computing units, that needs valid data row padding obtains, based on the corresponding data copy transmission mode and from the first calculation result group on a second computing unit, a first intermediate data row required by the first computing unit for padding during the second convolution calculation. The scheduling method can effectively reduce repeated calculation of data and improve the utilization of chip computing power.

Description

Scheduling method, scheduling apparatus, electronic device, and storage medium
This application claims priority to Chinese Patent Application No. 202211188739.0, filed on September 27, 2022; the entire disclosure of the above Chinese patent application is incorporated herein by reference as part of this application.
Technical Field
Embodiments of the present disclosure relate to a scheduling method, a scheduling apparatus, an electronic device, and a storage medium.
Background
Artificial intelligence (AI) chips are chips dedicated to neural network operations and are specially designed to accelerate the execution of neural networks. With the development of AI, the number of parameters in algorithm models has increased dramatically, and the demand for computing power keeps growing.
Because traditional hardware architectures (for example, the central processing unit (CPU)) balance different business requirements at the architecture design stage, the computing power they can provide for AI applications is limited. For high computing power, current AI chips not only adopt general-purpose graphics processing units (GPU) but also widely adopt domain-specific accelerators (Domain Specific ASIC, DSA) with homogeneous multi-core architectures.
Summary
At least one embodiment of the present disclosure provides a scheduling method for a multi-layer convolutional neural network. The scheduling method comprises: a plurality of computing units respectively perform a first convolution calculation on a corresponding plurality of data groups to obtain a corresponding plurality of first calculation result groups, wherein the plurality of first calculation result groups are used to constitute a first convolutional layer obtained by the first convolution calculation, and the plurality of computing units comprise a first computing unit and a second computing unit; according to a configuration rule, among the plurality of computing units, of a second convolutional layer to be obtained after the plurality of computing units perform a second convolution calculation on the first convolutional layer, a data copy transmission mode corresponding to the plurality of first calculation result groups on the plurality of computing units is determined; the first computing unit, among the plurality of computing units, that needs valid data row padding obtains, based on the corresponding data copy transmission mode and from the first calculation result group on the second computing unit, a first intermediate data row required by the first computing unit for padding during the second convolution calculation.
At least one embodiment of the present disclosure further provides a scheduling apparatus. The scheduling apparatus comprises: a calculation control module configured to cause a plurality of computing units to respectively perform a first convolution calculation on a corresponding plurality of data groups to obtain a corresponding plurality of first calculation result groups, wherein the plurality of first calculation result groups are used to constitute a first convolutional layer obtained by the first convolution calculation, and the plurality of computing units comprise a first computing unit and a second computing unit; an allocation scheduling module configured to determine, according to a configuration rule, among the plurality of computing units, of a second convolutional layer to be obtained after the plurality of computing units perform a second convolution calculation on the first convolutional layer, a data copy transmission mode corresponding to the plurality of first calculation result groups on the plurality of computing units; and a data transmission module configured to cause the first computing unit, among the plurality of computing units, that needs valid data row padding to obtain, based on the corresponding data copy transmission mode and from the first calculation result group on the second computing unit, a first intermediate data row required by the first computing unit for padding during the second convolution calculation.
At least one embodiment of the present disclosure further provides an electronic device, comprising the scheduling apparatus provided by any embodiment of the present disclosure.
At least one embodiment of the present disclosure further provides an electronic device, comprising: a processor; and a memory comprising at least one computer program module; wherein the at least one computer program module is stored in the memory and configured to be executed by the processor, and the at least one computer program module is used to implement the scheduling method according to any embodiment of the present disclosure.
At least one embodiment of the present disclosure further provides a storage medium storing non-transitory computer-readable instructions, wherein when the non-transitory computer-readable instructions are executed by a computer, the scheduling method according to any embodiment of the present disclosure is implemented.
Brief Description of the Drawings
To describe the technical solutions of the embodiments of the present disclosure more clearly, the accompanying drawings of the embodiments are briefly introduced below. Apparently, the drawings described below relate only to some embodiments of the present disclosure and are not limitations of the present disclosure.
FIG. 1A is a schematic diagram of model parallelism;
FIG. 1B is a schematic diagram of data parallelism;
FIG. 2 is a schematic diagram of a data-parallel convolution calculation process represented on a computation graph;
FIG. 3 is a schematic flowchart of a scheduling method provided by at least one embodiment of the present disclosure;
FIG. 4 is a schematic flowchart of step S20 in FIG. 3;
FIG. 5 is a schematic flowchart of step S30 in FIG. 3;
FIGS. 6A-6D are schematic diagrams of a data copy and transfer process provided by at least one embodiment of the present disclosure;
FIG. 7 is a schematic block diagram of a scheduling apparatus provided by at least one embodiment of the present disclosure;
FIG. 8 is a schematic block diagram of an electronic device provided by at least one embodiment of the present disclosure;
FIG. 9 is a schematic block diagram of another electronic device provided by at least one embodiment of the present disclosure;
FIG. 10 is a schematic block diagram of another electronic device provided by at least one embodiment of the present disclosure; and
FIG. 11 is a schematic diagram of a storage medium provided by at least one embodiment of the present disclosure.
Detailed Description
To make the objectives, technical solutions, and advantages of the embodiments of the present disclosure clearer, the technical solutions of the embodiments of the present disclosure are described clearly and completely below with reference to the accompanying drawings. Apparently, the described embodiments are part of, not all of, the embodiments of the present disclosure. Based on the described embodiments of the present disclosure, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the protection scope of the present disclosure.
Unless otherwise defined, the technical or scientific terms used in the present disclosure shall have the ordinary meanings understood by those with ordinary skill in the field to which the present disclosure belongs. The words "first", "second", and similar words used in the present disclosure do not denote any order, quantity, or importance, but are merely used to distinguish different components. Words such as "comprise" or "include" mean that the element or item preceding the word covers the elements or items listed after the word and their equivalents, without excluding other elements or items. Words such as "connected" or "coupled" are not limited to physical or mechanical connections and may include electrical connections, whether direct or indirect. "Up", "down", "left", "right", and the like are only used to indicate relative positional relationships; when the absolute position of the described object changes, the relative positional relationship may change accordingly.
The present disclosure is described below through several specific embodiments. To keep the following description of the embodiments of the present disclosure clear and concise, detailed descriptions of known functions and known components (elements) may be omitted. When any component (element) of an embodiment of the present disclosure appears in more than one drawing, the component (element) is denoted by the same or similar reference numeral in each drawing.
The core idea of a domain-specific accelerator (Domain Specific ASIC, DSA) is to use dedicated hardware for dedicated work; for example, a DSA serves the applications within a domain (Domain) rather than one fixed application. A DSA therefore achieves a trade-off between flexibility and specialization. A DSA accelerator interconnects multiple computing units (also called processing elements (Process Element, PE) or processing cores) within one chip to expand computing power and accelerate computation; the multiple processing elements are connected to each other, for example, through communication lines, which may be, for example, a bus, a network-on-chip, or the like. For such an AI accelerator with multiple interconnected PEs, how to effectively schedule hardware resources to efficiently complete the inference of a deep neural network model is the main challenge faced by AI compilers.
At present, compilation scheduling for multiple cores can be implemented in the following two ways.
One approach is model parallelism (Model Parallel): different devices (Device) are responsible for the calculation of different parts of the computation graph. For example, in some domain-specific accelerators, the operators in the computation graph must first be statically assigned to different PEs inside the chip, and the input data is then fed to the PE holding the first operator; when that PE finishes its calculation, the computed data is transferred to the next PE, until all operators are completed. In model parallelism, each PE executes only the operators in its local memory, and each PE must wait for the preceding operator to finish before it can continue; this is a way of serializing execution in a pipeline (Pipeline) manner.
For example, as shown in FIG. 1A, PE0, PE1, and PE2 are respectively loaded with 3 different operators, for example, convolution operators corresponding to different layers of a neural network. For example, PE0 holds convolution operator A, PE1 holds convolution operator B, and PE2 holds convolution operator C. During calculation, the server first loads the input data onto PE0, where convolution operator A resides; after PE0 completes the first convolution calculation on the input data, the result of the first convolution calculation is passed to PE1, and the second convolution calculation corresponding to convolution operator B is performed on PE1, and so on, until convolution operator C on PE2 has been executed and the final result obtained on PE2 is returned to the host.
However, model parallelism requires static compilation: some operators in the computation graph must be assigned to a fixed set of PEs, so it cannot handle computation graphs containing dynamic shapes (Shape). A dynamic shape means that the shape of a tensor (Tensor) depends on the specific operation and cannot be obtained by calculation in advance; that is, a dynamic computation graph is built according to each step of the calculation. Therefore, a model-parallel approach covering only part of the computation graph cannot handle dynamic computation graphs.
The other approach is data parallelism (Data Parallel): each device (Device) holds the complete computation graph. For example, in some GPUs, each operator in the computation graph is first loaded into the GPU device in turn, with each operator placed on multiple PEs in parallel; the input data is then split into multiple shares and loaded onto the multiple PEs, which process the input data in parallel; after all operators are completed, the calculation results on the multiple PEs are aggregated and returned to the host.
For example, as shown in FIG. 1B, PE0, PE1, and PE2 are all loaded with 3 different operators, for example, convolution operators corresponding to different layers of a neural network. For example, PE0, PE1, and PE2 all hold convolution operators A, B, and C. During calculation, the server first splits the input data into multiple shares, for example, evenly into 3 shares, and then loads the 3 shares onto PE0, PE1, and PE2 respectively; PE0, PE1, and PE2 perform, in parallel, the successive convolution calculations corresponding to convolution operators A, B, and C on the input data until the calculation ends, after which the system aggregates the final results on PE0, PE1, and PE2 and returns them to the host.
There is currently no fixed, feasible pattern for implementing data parallelism; scheduling must be optimized for the characteristics of different hardware. In actual model deployment, the operators in some models involve large amounts of data while those in other models involve small amounts, so different deployment schemes of the AI chip are needed for different models to obtain optimal inference performance.
In the current data-parallel approach, for input data of a very large volume, a PE cannot hold all the input data required for the calculation at once. The input data therefore usually needs to be split into multiple groups (Group), the data of the multiple groups is placed on different PEs, the multiple PEs calculate in parallel, and after the calculation ends, the calculation results on the multiple PEs are aggregated to obtain the final output corresponding to the input data.
However, in the inference calculation of a multi-layer neural network, considering the padding (Padding) operation of convolution, directly splitting a large picture or a large volume of data into multiple groups easily introduces the problem of repeated calculation of data.
For example, an exemplary computation graph contains the following 3 convolution operators (conv):
%5:[(1 224 224 64),F,FP32]=conv(%0:[(1 224 224 64),F,FP32],%1:[(64 3 3 64),W,FP32],kh=3,kw=3,pad_h_top=1,pad_h_bottom=1,pad_w_left=1,pad_w_right=1,stride_h=1,stride_w=1);
%10:[(1 224 224 64),F,FP32]=conv(%5:[(1 224 224 64),F,FP32],%6:[(64 3 3 64),W,FP32],kh=3,kw=3,pad_h_top=1,pad_h_bottom=1,pad_w_left=1,pad_w_right=1,stride_h=1,stride_w=1);
%15:[(1 224 224 64),F,FP32]=conv(%10:[(1 224 224 64),F,FP32],%11:[(64 3 3 64),W,FP32],kh=3,kw=3,pad_h_top=1,pad_h_bottom=1,pad_w_left=1,pad_w_right=1,stride_h=1,stride_w=1).
%0, %5, %10, and %15 denote tensor (Tensor) data. For example, the input image of the first convolutional layer of the neural network can be expressed as [B, H, W, C], i.e., the dimensions of the input data are B×H×W×C. For example, %0 indicates that the input image of the first convolutional layer has a size of [1, 224, 224, 64]; that is, the input image has 64 channels (Channel), the image size of each channel (height H × width W) is 224×224, and there is 1 batch (Batch) in total. For example, the size of the convolution kernel used by each convolution operator conv can be expressed as kh×kw; that is, kh=3, kw=3 means the kernel size is 3×3. For example, the padding operation of each convolution operator conv is denoted by pad; for example, pad_h_top=1 means that in the height direction h of the picture, 1 row is padded above the picture, for example with 0; pad_h_bottom=1 means that 1 row is padded below the picture, for example with 0; pad_w_left=1 means that in the width direction w, 1 column is padded on the left of the picture, for example with 0; pad_w_right=1 means that 1 column is padded on the right of the picture, for example with 0. For example, the sliding stride of the kernel in each convolution operator conv is denoted by stride; for example, stride_h=1 means that when the kernel moves in the height direction h of the picture, each move is 1 pixel; stride_w=1 means that when the kernel moves in the width direction w of the picture, each move is 1 pixel. The calculation uses floating-point arithmetic with FP32 precision.
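As an illustration of this shape arithmetic, the following is a minimal sketch (Python is assumed for illustration and is not part of the disclosure; the helper name conv_out_size is hypothetical) that computes the spatial output size of one conv operator from the parameters kh, kw, pad, and stride named above:

    def conv_out_size(n, k, pad_a, pad_b, stride):
        # Output length along one spatial axis:
        # floor((n + pad_a + pad_b - k) / stride) + 1.
        return (n + pad_a + pad_b - k) // stride + 1

    # With kh=kw=3, one row/column of padding on each side, and stride 1,
    # a 224x224 input keeps its size, so %5, %10, and %15 are all 224x224.
    assert conv_out_size(224, 3, 1, 1, 1) == 224

This is why each of the three conv operators above maps a [1, 224, 224, 64] tensor to another [1, 224, 224, 64] tensor.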
For example, FIG. 2 shows a schematic diagram of a data-parallel convolution calculation process represented on a computation graph; the computation graph contains the 3 convolution operators described above. For example, when the memory of one PE cannot hold all the data required for the calculation at once, the input data needs to be split into multiple data groups according to the memory and the number of PEs. For example, in FIG. 2 there are 3 PEs, the size of the input data is 224×224, and the input data is split into 3 data groups to be loaded into the 3 PEs for convolution calculation.
The splitting of the input data and the size, range, etc. of the resulting data groups need to be chosen according to the output data. For example, on the computation graph shown in FIG. 2, the size of the output data %15 is 224×224, and the data is split roughly evenly into 3 shares along the height dimension H of the output data, so the (height) sizes of the 3 output data groups in the three rounds round1, round2, and round3, corresponding to the 3 PEs, are 75 rows, 75 rows, and 74 rows, respectively. When dividing the input data, each share of output data needs a depth-first traversal of the whole data group, recursing backward to obtain the range of each share of input data.
It should be noted that the data shown in the computation graph in FIG. 2 are all "valid data". For simplification, "valid data" here refers to the practically meaningful, valid data among the data obtained by each convolution calculation. In actual calculation, because a padding operation is required after each convolution calculation to keep the image size unchanged, for the data at the "edge" of each split data group, the region composed of that data and the padded "0"s used for the convolution with the kernel is a non-valid region, and the result obtained by convolving this region is non-valid data, which must be discarded. That is, the output data obtained by performing convolution calculation on the complete input data is valid data; accordingly, in the split data groups, only the data obtained by convolving the valid region is valid data. For clarity of illustration and brevity of description, in the drawings of this application, the data shown on the computation graphs are all valid data.
For example, round1 in FIG. 2 shows that the output data %15 has 75 valid data rows, i.e., [0, 74]. Because the padding operation is pad_h_top=1, pad_h_bottom=1, the intermediate data %10 obtained by recursing backward from %15 needs 76 valid data rows, i.e., rows [0, 75]; the intermediate data %5 obtained by recursing backward from %10 needs 77 valid data rows, i.e., rows [0, 76]; and so on, the resulting input data %0 needs 78 valid data rows. Thus the range of data read by round1's input data %0 in the height direction is [0, 77] (row numbers start from 0), 78 rows in total.
Likewise, round2 in FIG. 2 shows that the output data %15 has 75 valid data rows, i.e., [75, 149]. Because the padding operation is pad_h_top=1, pad_h_bottom=1, the intermediate data %10 obtained by recursing backward from %15 needs 77 valid data rows, i.e., rows [74, 150]; the intermediate data %5 obtained by recursing backward from %10 needs 79 valid data rows, i.e., rows [73, 151]; and so on, the resulting input data %0 needs 81 valid data rows. Thus the range of data read by round2's input data %0 in the height direction is [72, 152], 81 rows in total.
Likewise, round3 in FIG. 2 shows that the output data %15 has 74 valid data rows, i.e., [150, 223]. Because the padding operation is pad_h_top=1, pad_h_bottom=1, the intermediate data %10 obtained by recursing backward from %15 needs 75 valid data rows, i.e., rows [149, 223]; the intermediate data %5 obtained by recursing backward from %10 needs 76 valid data rows, i.e., rows [148, 223]; and so on, the resulting input data %0 needs 77 valid data rows. Thus the range of data read by round3's input data %0 in the height direction is [147, 223], 77 rows in total.
It can be clearly seen from the computation graph in FIG. 2 that the repeated calculation region of round1 and round2 is rows [72, 77], i.e., 6 rows are repeated, and the repeated calculation region of round2 and round3 is rows [147, 152], also 6 rows. It follows that the deeper the convolutional network and the more backward recursion steps there are, the more valid data is needed at the edges of the data groups of the preceding layer due to padding; that is, the larger the region that must be recomputed, which leads to a large amount of redundant computing-power overhead.
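The backward recursion described above can be made concrete with a short sketch (Python is assumed; the function names are hypothetical illustrations). It walks the required row interval back through the three conv layers of FIG. 2 and reproduces the 6-row overlaps between rounds:

    def required_rows(lo, hi, kh=3, stride=1, pad=1, height=224):
        # Rows of the previous layer needed to produce valid output rows [lo, hi];
        # true image borders are covered by zero padding, hence the clipping.
        return max(lo * stride - pad, 0), min(hi * stride - pad + kh - 1, height - 1)

    def back_propagate(lo, hi, depth):
        # Recurse through `depth` conv layers back to the original input %0.
        for _ in range(depth):
            lo, hi = required_rows(lo, hi)
        return lo, hi

    # The three rounds of FIG. 2: output %15 is split as [0,74], [75,149], [150,223].
    print(back_propagate(0, 74, 3))      # (0, 77): round1 reads 78 rows of %0
    print(back_propagate(75, 149, 3))    # (72, 152): round2 reads 81 rows
    print(back_propagate(150, 223, 3))   # (147, 223): round3 reads 77 rows
    # Rounds 1 and 2 both read rows [72, 77]; rounds 2 and 3 both read [147, 152].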
At least one embodiment of the present disclosure provides a scheduling method for a multi-layer convolutional neural network. The scheduling method comprises: a plurality of computing units respectively perform a first convolution calculation on a corresponding plurality of data groups to obtain a corresponding plurality of first calculation result groups, wherein the plurality of first calculation result groups are used to constitute a first convolutional layer obtained by the first convolution calculation, and the plurality of computing units comprise a first computing unit and a second computing unit; according to a configuration rule, among the plurality of computing units, of a second convolutional layer to be obtained after the plurality of computing units perform a second convolution calculation on the first convolutional layer, a data copy transmission mode corresponding to the plurality of first calculation result groups on the plurality of computing units is determined; the first computing unit, among the plurality of computing units, that needs valid data row padding obtains, based on the corresponding data copy transmission mode and from the first calculation result group on the second computing unit, a first intermediate data row required by the first computing unit for padding during the second convolution calculation. By copying valid data on one computing unit and transferring it to another computing unit, the scheduling method can achieve balanced utilization of the computing power of the computing units, reduce repeated calculation of data, and improve the utilization of computing power.
At least one embodiment of the present disclosure further provides a scheduling apparatus, an electronic device, and a storage medium. Likewise, by copying valid data on one computing unit and transferring it to another computing unit, the scheduling apparatus, electronic device, and storage medium can achieve balanced utilization of the computing power of the computing units, reduce repeated calculation of data, and improve the utilization of computing power.
Embodiments of the present disclosure are described in detail below with reference to the accompanying drawings. It should be noted that the same reference numerals in different drawings will be used to refer to the same elements already described.
FIG. 3 is a schematic flowchart of a scheduling method provided by at least one embodiment of the present disclosure. As shown in FIG. 3, the scheduling method includes steps S10 to S30.
Step S10: a plurality of computing units respectively perform a first convolution calculation on a corresponding plurality of data groups to obtain a corresponding plurality of first calculation result groups, wherein the plurality of first calculation result groups are used to constitute a first convolutional layer obtained by the first convolution calculation, and the plurality of computing units include a first computing unit and a second computing unit;
Step S20: according to a configuration rule, among the plurality of computing units, of a second convolutional layer to be obtained after the plurality of computing units perform a second convolution calculation on the first convolutional layer, determine the data copy transmission mode corresponding to the plurality of first calculation result groups on the plurality of computing units;
Step S30: the first computing unit, among the plurality of computing units, that needs valid data row padding obtains, based on the corresponding data copy transmission mode and from the first calculation result group on the second computing unit, the first intermediate data row required by the first computing unit for padding during the second convolution calculation.
For example, the scheduling method may be used in computing devices such as AI chips and general-purpose graphics processing units (GPGPU) that perform convolution operations for multi-layer convolutional neural networks; for example, the AI chip may be an AI chip adopting a DSA accelerator or an AI chip with multiple PE units; for example, the method may be used in data-parallel distributed training, which is not limited by the embodiments of the present disclosure.
It should be noted that, in at least one embodiment of the present disclosure, the "first convolution calculation" may refer to one convolution calculation performed on the complete input image, or to one convolution calculation performed on the current input data group on a computing unit. The "first convolution calculation" may be any convolution calculation among the multiple layers of convolution calculations, not only the convolution calculation performed for the first convolutional layer. For example, the "first convolution calculation" may be a convolution calculation for the first layer, or a convolution calculation for the 2nd layer, the 3rd layer, and so on.
In at least one embodiment of the present disclosure, the "second convolution calculation" refers to the next convolution calculation corresponding to the "first convolution calculation"; that is, a convolution calculation that takes the "first convolutional layer" obtained by the first convolution calculation as input. The "first convolutional layer" is the convolutional layer composed of the calculation results of the "first convolution calculation"; here, the "first convolutional layer" refers to the actual convolutional layer obtained by performing the first convolution calculation on the complete, un-split input image. For example, the first convolutional layer may be the input data of the second convolution calculation. Likewise, the "second convolutional layer" is the convolutional layer composed of the calculation results of the "second convolution calculation"; here, the "second convolutional layer" refers to the actual convolutional layer obtained by performing the second convolution calculation on the first convolutional layer composed of the un-split data.
The scheduling method of the embodiments of the present disclosure is described in detail below with reference to the computation graphs shown in FIGS. 6A-6D. The following exemplary description takes three PEs (FIGS. 6A-6C) as an example, but embodiments of the present disclosure are not limited thereto; the computing device may include, for example, 2 PEs, 4 PEs (FIG. 6D), or more.
It should be noted that the elongated regions shown in FIGS. 6A-6D are abstract representations of the range of at least part of the valid data in one dimension (for example, height H or width W) of the input image of each convolution calculation. The diagonal-line, grid-line, and diamond-line regions are abstract representations of the data groups on the computing units (PE0, PE1, PE2, etc.); their respective positions in the elongated regions are only schematic representations of the positions of their respective data groups in the input image and do not constitute limitations of the embodiments of the present disclosure.
For example, in step S10, the corresponding "plurality of data groups" on the plurality of computing units may be a plurality of original input data groups obtained by splitting an original input matrix (these original input data groups are fed into the computing device), or may be the plurality of input data groups targeted by any convolution calculation in the computing device. For example, the plurality of data groups may be the data groups 10, 20, 30 on PE0, PE1, PE2 in FIG. 6A for the convolution calculation on %0, or the data groups for the convolution calculation on %5; for example, the data group of each computing unit for the convolution calculation on %5 includes the first calculation result group and the intermediate data row obtained from another computing unit. That is, in at least one embodiment of the present disclosure, the first calculation result group on the first computing unit and the first intermediate data row obtained from the second computing unit constitute the data group required for the first computing unit to perform the second convolution calculation.
For example, in step S10, the plurality of computing units respectively perform the first convolution calculation on the corresponding plurality of data groups to obtain the corresponding plurality of first calculation result groups. For example, the plurality of first calculation result groups may be used to constitute the first convolutional layer obtained by the first convolution calculation. For example, as shown in FIG. 6A, the computing units PE0, PE1, PE2 respectively perform the first convolution calculation on the corresponding data groups 10, 20, 30 to obtain first calculation result groups 11, 21, 31, and these first calculation result groups 11, 21, 31 may be used to constitute the first convolutional layer obtained by the first convolution calculation.
For example, in embodiments of the present disclosure, the plurality of first calculation result groups in the plurality of computing units may directly constitute the first convolutional layer obtained by the first convolution calculation, or may constitute part of it. For example, when the input image is small, the plurality of computing units can carry all the input data at once, and aggregating the plurality of first calculation result groups in the plurality of computing units composes the first convolutional layer; when the input image is large, the plurality of computing units cannot carry all the input data at once, the input data needs to be loaded into the computing units in multiple passes, and the input data loaded each time is split into multiple groups placed into the computing units respectively; the plurality of first calculation result groups in the plurality of computing units then constitute only the part of the first convolutional layer corresponding to the input data loaded in that pass.
For example, in step S10, the data rows in the plurality of first calculation result groups are contiguous without overlap in the first convolutional layer. For example, as shown in FIG. 6A, the data rows in the first calculation result groups 11, 21, 31 are contiguous without overlap in the first convolutional layer; that is, the data in the first calculation result groups 11, 21, 31 obtained on PE0, PE1, PE2 by the first convolution calculation on %0 have no overlapping part.
For example, in embodiments of the present disclosure, a "data row" may denote a row or a column of data in the input image. For example, in the data groups of FIG. 6A split along the height dimension H, a "data row" denotes a row of data of the input image in the height direction; of course, a "data row" may also denote a column of data of the input image in the width direction, which is not limited by the embodiments of the present disclosure.
For example, in step S20, the data copy transmission mode corresponding to the plurality of first calculation result groups on the plurality of computing units is determined according to the configuration rule, among the plurality of computing units, of the second convolutional layer to be obtained after the plurality of computing units perform the second convolution calculation on the first convolutional layer. It should be noted that "performing the second convolution calculation on the first convolutional layer" here does not refer to an actual convolution calculation in the computing units; rather, based on the rules of convolution calculation, the size of the second convolutional layer to be obtained after the second convolution calculation can be known from the size of the first convolutional layer. Therefore, the "second convolutional layer to be obtained" does not mean that the specific calculation results of the data in the second convolutional layer need to be obtained. For example, the splitting pattern of the second convolutional layer can be determined according to its size; that is, the configuration rule of the second convolutional layer among the plurality of computing units can be determined.
FIG. 4 is a schematic flowchart of step S20 in FIG. 3. For example, in some examples, as shown in FIG. 4, step S20 may further include steps S21 to S23.
Step S21: based on the memory sizes of the memories on the plurality of computing units, determine the configuration rule of the second convolutional layer among the plurality of computing units;
Step S22: based on the configuration rule, determine the distribution of the second convolutional layer on the plurality of computing units;
Step S23: based on the distribution, determine the data copy transmission mode corresponding to the first calculation result groups on the plurality of computing units.
For example, in step S21, the sizes of the memories on the plurality of computing units are obtained, and the configuration rule of the second convolutional layer among the plurality of computing units is determined based on the memory size of each computing unit. For example, the second convolutional layer is split into multiple shares of data, and the size of each share must not exceed the available memory capacity of the computing unit to which it is assigned; this is the premise for determining the configuration rule.
For example, in step S22, based on the determined configuration rule, how to split the second convolutional layer is determined; that is, the distribution on the plurality of computing units of the multiple data groups into which the second convolutional layer is split is determined. For example, corresponding to the distribution of the plurality of first calculation result groups on the plurality of computing units and the determined configuration rule, the distribution of the split data groups of the second convolutional layer on the corresponding computing units can be determined.
For example, in step S23, based on the distribution of the second convolutional layer on the plurality of computing units determined according to the configuration rule, the data copy transmission mode corresponding to the first calculation result groups on the plurality of computing units is determined. For example, the data copy transmission modes may include a unidirectional data transmission mode and a bidirectional data transmission mode. For example, the unidirectional data transmission mode means that a computing unit can only obtain copied data from another unit, while the bidirectional data transmission mode means that a computing unit can both obtain copied data from another computing unit and copy and transmit its own data to that other computing unit. For example, the plurality of computing units may be set to have the same data copy transmission mode, or different data copy transmission modes; the data copy transmission modes are not limited to the above two modes and may be implemented in other feasible ways, which is not limited by the embodiments of the present disclosure.
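A sketch of how a copy plan could be derived from such a distribution follows (Python is assumed; the function copy_plan and the one-halo-row-per-side convention for a 3×3 kernel are assumptions of this illustration, not the disclosure's algorithm). Each PE i holds first-layer result rows own[i]; cfg[i] is the second-layer range assigned to PE i by the configuration rule; rows needed beyond own[i] are fetched from the other PEs, and pairwise fetches in both directions correspond to the bidirectional mode:

    def copy_plan(own, cfg, pad=1, height=224):
        plan = []
        for i, (lo, hi) in enumerate(cfg):
            # Rows PE i must hold (including halo rows) before the second convolution.
            need_lo, need_hi = max(lo - pad, 0), min(hi + pad, height - 1)
            for j, (jlo, jhi) in enumerate(own):
                if j == i:
                    continue
                rows = list(range(max(need_lo, jlo), min(need_hi, jhi) + 1))
                if rows:
                    plan.append((j, i, rows))   # copy `rows` from PE j to PE i
        return plan

    own = [(0, 74), (75, 149), (150, 223)]   # first calculation result groups (FIG. 6A, %5)
    cfg = [(0, 74), (75, 149), (150, 223)]   # second layer keeps the same ranges
    for src, dst, rows in copy_plan(own, cfg):
        print(f"PE{src} -> PE{dst}: rows {rows}")
    # PE1 -> PE0: [75]; PE0 -> PE1: [74]; PE2 -> PE1: [150]; PE1 -> PE2: [149]
    # Every transfer is paired with one in the opposite direction: bidirectional mode.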
For example, in one example, the configuration rule may be: let the distribution of the data of the second convolutional layer on the plurality of computing units be identical to the data ranges of the first calculation result groups in the plurality of computing units. For example, as shown in FIG. 6A, the computing units PE0, PE1, PE2 perform the first convolution calculation on the data groups of %0 to obtain the first convolutional layer composed of the first calculation result groups 11, 21, 31, from which the size of the second convolutional layer to be obtained by performing the second convolution calculation on the first convolutional layer can be calculated. Accordingly, the data of the second convolutional layer is to be split and assigned to PE0, PE1, PE2 according to the distribution shown on %10 in FIG. 6A; according to this configuration rule, the data copy transmission mode of each computing unit for the convolution calculation on %5 can be determined. For example, it can be determined according to this configuration rule that the data copy transmission modes of PE0, PE1, PE2 are all the bidirectional data transmission mode.
For example, in another example, the configuration rule may be: let the distribution of the data of the second convolutional layer on some of the plurality of computing units be identical to the data ranges of the first calculation result groups in those computing units. For example, as shown in FIG. 6B, after the size of the second convolutional layer is known, the data of the second convolutional layer is to be assigned to PE0, PE1, PE2 according to the distribution shown on %10 in FIG. 6B; that is, the data of the second convolutional layer is split into 3 shares, of which the data range assigned to PE1 equals the data range of the first calculation result group on PE1. Accordingly, the data copy transmission mode of each computing unit for the convolution calculation on %5 can be determined according to this configuration rule; for example, it is determined that the data copy transmission modes of PE0, PE1, PE2 are all the unidirectional data transmission mode.
In embodiments of the present disclosure, the configuration rule can be set flexibly not only according to the memory sizes of the computing units but also according to factors such as the computing-power allocation of the computing units and the data transmission overhead; embodiments of the present disclosure do not limit the specific content of the configuration rule.
For example, in step S30, the first computing unit, among the plurality of computing units, that needs valid data row padding is determined first. For example, all of the computing units may need valid data padding, or only some of the computing units may need valid data row padding. For example, in FIG. 6A, PE0, PE1, and PE2 are all first computing units that need valid data row padding; in FIG. 6B, PE1 and PE2 are the first computing units that need valid data row padding; in FIG. 6C, only PE1 is a first computing unit that needs valid data row padding.
For example, in step S30, after the first computing unit that needs valid data row padding and the data copy transmission mode corresponding to the first computing unit have been determined, the first computing unit obtains at least one share of first intermediate row data from the first calculation result group of another computing unit. For example, the first computing unit may obtain two shares of first intermediate row data from the first calculation result groups of two different computing units, both of which are needed by the first computing unit for padding during the second convolution calculation. For example, in FIG. 6A, PE0, as a first computing unit, obtains one or more first intermediate data rows from the first calculation result group 21 of PE1; PE1, as a first computing unit, obtains one or more first intermediate data rows from the first calculation result group 11 of PE0 and one or more first intermediate data rows from the first calculation result group 31 of PE2; PE2, as a first computing unit, obtains one or more first intermediate data rows from the first calculation result group 21 of PE1.
FIG. 5 is a schematic flowchart of step S30 in FIG. 3. For example, in some examples, as shown in FIG. 5, step S30 may further include steps S31 and S32.
Step S31: based on the relationship between the second calculation result group to be obtained after the first computing unit performs the second convolution calculation and the first calculation result group obtained by the first computing unit through the first convolution calculation, determine the size of the first intermediate data row;
Step S32: based on the data copy transmission mode and the size of the first intermediate data row, obtain the first intermediate data row from the first calculation result group on the second computing unit.
For example, in step S31, the size of the first intermediate data row obtained by the first computing unit from another computing unit is determined by the relationship between the second calculation result group and the first calculation result group, and this relationship is in turn determined by factors such as the distribution of the second convolutional layer on the computing units, the size of the convolution kernel, and the padding operation.
For example, in step S32, obtaining the first intermediate data row from the first calculation result group on the second computing unit can be implemented through inter-core transfer between computing units. For example, by obtaining the address of the first intermediate data row in the first calculation result group of the second computing unit, the first intermediate data row is read and copied to the first computing unit, and the copied first intermediate data row is written to the corresponding address of the first computing unit. For example, in FIG. 6A, PE1 holds valid data rows [74, 150] of %0, and the storage addresses corresponding to these 77 rows are address_0 to address_76; after the first convolution calculation, the first calculation result group on PE1 consists of rows [75, 149] of the first convolutional layer, 75 rows in total, which can be assigned to addresses address_1 to address_75, while the first intermediate data row obtained from the first calculation result group 11 of PE0 is written to address_0 and the first intermediate data row obtained from the first calculation result group 31 of PE2 is written to address_76. Embodiments of the present disclosure do not limit the specific implementation of the copy-and-transfer process of data between computing units.
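A minimal sketch of such an inter-core copy follows (assumptions: Python with NumPy, each PE's local memory modeled as an array indexed by the addresses above; a real chip would issue an inter-core transfer over the bus or network-on-chip instead of an array copy):

    import numpy as np

    W, C = 224, 64
    # PE1's buffer: address_0..address_76, with the 75 result rows at address_1..address_75.
    pe1_buf = np.zeros((77, W, C), dtype=np.float32)
    pe0_results = np.ones((75, W, C), dtype=np.float32)   # rows [0, 74] of %5 on PE0
    pe2_results = np.ones((74, W, C), dtype=np.float32)   # rows [150, 223] of %5 on PE2

    # Inter-core copies of FIG. 6A: the last row on PE0 (row 74 of %5) goes to
    # address_0, and the first row on PE2 (row 150 of %5) goes to address_76.
    pe1_buf[0] = pe0_results[-1]
    pe1_buf[76] = pe2_results[0]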
For example, the scheduling method further includes step S40 (not shown in the figures): splitting the original input matrix to obtain a plurality of original input data groups, and transmitting the plurality of original input data groups respectively to the plurality of computing units for the first convolution calculation.
For example, the splitting pattern of the original input matrix can be determined according to the size of the first convolutional layer obtained after the first convolution calculation on the original input matrix, the number of computing units, their memory sizes, and so on. For example, the splitting pattern is to split the original input matrix evenly into a plurality of original input data groups, or to split the original input matrix into a plurality of original input data groups according to the memory sizes of the computing units, to be placed into the plurality of computing units respectively. For example, an original input data group can serve as the data group targeted by the first convolution calculation.
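For the even-split option, a minimal sketch could look as follows (Python is assumed; split_rows is a hypothetical helper splitting a number of rows into nearly equal contiguous groups):

    def split_rows(height, n_units):
        # Split `height` rows into n_units nearly equal, contiguous groups.
        base, rem = divmod(height, n_units)
        groups, lo = [], 0
        for i in range(n_units):
            hi = lo + base + (1 if i < rem else 0) - 1
            groups.append((lo, hi))
            lo = hi + 1
        return groups

    print(split_rows(224, 3))   # [(0, 74), (75, 149), (150, 223)]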
For example, the scheduling method further includes step S50 (not shown in the figures): in response to the second convolutional layer being the output layer in the multi-layer convolutional neural network, aggregating the plurality of second calculation result groups in the plurality of computing units to obtain at least part of the calculation output of the multi-layer convolutional neural network. For example, when the last convolution operator has finished calculating and the second convolutional layer is the last convolutional layer, the second convolutional layer is the output result. That is, the second convolutional layer composed of the second calculation result groups in the plurality of computing units is the calculation output, or at least part of the calculation output, of the multi-layer convolutional neural network.
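A sketch of the aggregation in step S50 (assumption: each second calculation result group is modeled as a NumPy array of shape (rows, W, C), listed in row order):

    import numpy as np

    parts = [np.zeros((75, 224, 64), dtype=np.float32),   # PE0: rows [0, 74]
             np.zeros((75, 224, 64), dtype=np.float32),   # PE1: rows [75, 149]
             np.zeros((74, 224, 64), dtype=np.float32)]   # PE2: rows [150, 223]
    output = np.concatenate(parts, axis=0)                # the full 224-row output layer
    assert output.shape == (224, 224, 64)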
In embodiments of the present disclosure, the plurality of computing units includes at least two computing units, for example a first computing unit and a second computing unit, and the multi-layer convolutional neural network includes at least two convolutional layers. For example, the examples shown in FIGS. 6A-6C include 3 computing units PE0, PE1, and PE2 and contain the following 2 convolution operators (conv):
%5:[(1 224 224 64),F,FP32]=conv(%0:[(1 224 224 64),F,FP32],%1:[(64 3 3 64),W,FP32],kh=3,kw=3,pad_h_top=1,pad_h_bottom=1,pad_w_left=1,pad_w_right=1,stride_h=1,stride_w=1);
%10:[(1 224 224 64),F,FP32]=conv(%5:[(1 224 224 64),F,FP32],%6:[(64 3 3 64),W,FP32],kh=3,kw=3,pad_h_top=1,pad_h_bottom=1,pad_w_left=1,pad_w_right=1,stride_h=1,stride_w=1);
For example, as shown in FIG. 6A, the chip uses 3 computing units PE0, PE1, PE2 to perform data-parallel convolution calculation on the 224×224 input data. For example, step S40 is executed to split the original input matrix roughly evenly into 3 original input data groups 10, 20, 30 along the height dimension H of the input data and to load them onto the 3 computing units PE0, PE1, PE2 respectively, as shown on %0. For example, original input data group 10 includes rows [0, 75], 76 rows in total; original input data group 20 includes rows [74, 150], 77 rows in total; original input data group 30 includes rows [149, 223], 75 rows in total. For example, step S10 is executed: the computing units PE0, PE1, PE2 respectively perform the first convolution calculation on the corresponding 3 original input data groups 10, 20, 30 to obtain the first calculation result groups 11 (rows [0, 74]), 21 (rows [75, 149]), and 31 (rows [150, 223]) shown on %5; these first calculation result groups constitute the first convolutional layer corresponding to the first convolution calculation on the original input matrix.
For example, step S20 is executed: according to the configuration rule that the distribution of the data of the second convolutional layer on the computing units is identical to the data ranges of the first calculation result groups in the computing units, it is determined that the 3 computing units PE0, PE1, PE2 all have the same data copy transmission mode, namely the bidirectional data transmission mode. Step S30 is then executed: it is determined that PE0, PE1, PE2 are all computing units that need data copy and transfer, and that each computing unit needs one first intermediate data row; then each computing unit obtains from another computing unit the first intermediate data row it needs for padding in the second convolution calculation.
For example, with PE0 as the first computing unit and PE1 as the second computing unit, as shown by arrow 1, the first computing unit PE0 obtains from the first calculation result group 21 of the second computing unit PE1 the first intermediate data row [75] that PE0 needs for padding during the second convolution calculation; as shown by arrow 2, the second computing unit PE1 obtains from the first calculation result group 11 of the first computing unit PE0 the second intermediate data row [74] that PE1 needs for padding during the second convolution calculation; moreover, the first intermediate data row and the second intermediate data row obtained by the first computing unit PE0 and the second computing unit PE1 from each other are of the same size, both being 1 row of data.
For example, with PE1 as the first computing unit and PE2 as the second computing unit, as shown by arrow 3, the first computing unit PE1 obtains from the first calculation result group 31 on the second computing unit PE2 the first intermediate data row [150] that PE1 needs for padding during the second convolution calculation; as shown by arrow 4, the second computing unit PE2 obtains from the first calculation result group 21 of the first computing unit PE1 the second intermediate data row [149] that PE2 needs for padding during the second convolution calculation; moreover, the first intermediate data row and the second intermediate data row obtained by the first computing unit PE1 and the second computing unit PE2 from each other are of the same size, both being 1 row of data.
For example, the first calculation result group [0, 74] on PE0 and the first intermediate data row [75] obtained from PE1 constitute the data group [0, 75] required for PE0's second convolution calculation.
For example, the first calculation result group [75, 149] on PE1 and the second intermediate data row [74] and first intermediate data row [150] obtained from PE0 and PE2 respectively constitute the data group [74, 150] required for PE1's second convolution calculation.
For example, the first calculation result group [150, 223] on PE2 and the second intermediate data row [149] obtained from PE1 constitute the data group [149, 223] required for PE2's second convolution calculation.
The 3 computing units PE0, PE1, PE2 perform the second convolution calculation on the data groups [0, 75], [74, 150], [149, 223] obtained after the data copy and transfer, obtaining the second calculation result groups [0, 74], [75, 149], [150, 223] shown on %10. Finally, step S50 is executed to aggregate the second calculation result groups [0, 74], [75, 149], [150, 223] on PE0, PE1, PE2, which is the calculation result of the output layer.
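The whole FIG. 6A schedule can be checked with row-range arithmetic alone (a sketch; Python is assumed, only row indices are tracked rather than actual convolution results, and the helpers valid_after_conv/add_halo encode the kh=3, pad=1, stride=1 case):

    def valid_after_conv(lo, hi, height=224, pad=1):
        # Valid result rows of a group: an edge at a split boundary (not a true
        # image border) loses `pad` rows; true borders are covered by zero padding.
        return (lo if lo == 0 else lo + pad, hi if hi == height - 1 else hi - pad)

    def add_halo(lo, hi, height=224, pad=1):
        # Working range after fetching `pad` rows from each neighbouring group.
        return max(lo - pad, 0), min(hi + pad, height - 1)

    groups = [(0, 75), (74, 150), (149, 223)]            # %0 on PE0, PE1, PE2
    results1 = [valid_after_conv(*g) for g in groups]    # [(0,74), (75,149), (150,223)]
    working = [add_halo(*r) for r in results1]           # [(0,75), (74,150), (149,223)]
    results2 = [valid_after_conv(*w) for w in working]   # [(0,74), (75,149), (150,223)]
    assert sum(hi - lo + 1 for lo, hi in results2) == 224   # full output, no overlap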
Therefore, by copying and transferring valid data between computing units and using the valid data as padding rows, the scheduling method provided by at least one embodiment of the present disclosure can reduce the amount of repeated data calculation in the convolutional neural network and improve the calculation efficiency of the multi-layer convolutional neural network.
In another example, shown in FIG. 6B, in step S20 it can be determined that the 3 computing units PE0, PE1, PE2 all have the same unidirectional data transmission mode. In step S30, it is determined that the computing units PE1 and PE2 are the computing units that need data copy and transfer, and that each such computing unit needs two first intermediate data rows. That is, PE1 obtains two valid data rows from PE0 for padding, and PE2 obtains two valid data rows from PE1 for padding. For example, compared with FIG. 6B, in FIG. 6C only the computing unit PE1 may be the computing unit that needs data copy and transfer; that is, PE1 obtains two valid data rows from each of PE0 and PE2 for padding. The scheduling methods of the examples shown in FIGS. 6B and 6C can not only reduce repeated calculation of data and improve calculation efficiency, but also reduce the number of data transfers between computing units and the overhead required for data transmission.
For example, the computing device may include more computing units, and the convolutional network may include more convolution operators, which is not limited by embodiments of the present disclosure. For example, the example shown in FIG. 6D includes 4 computing units PE0, PE1, PE2, and PE3 and contains the following 3 convolution operators (conv):
%5:[(1 224 224 64),F,FP32]=conv(%0:[(1 224 224 64),F,FP32],%1:[(64 3 3 64),W,FP32],kh=3,kw=3,pad_h_top=1,pad_h_bottom=1,pad_w_left=1,pad_w_right=1,stride_h=1,stride_w=1);
%10:[(1 224 224 64),F,FP32]=conv(%5:[(1 224 224 64),F,FP32],%6:[(64 3 3 64),W,FP32],kh=3,kw=3,pad_h_top=1,pad_h_bottom=1,pad_w_left=1,pad_w_right=1,stride_h=1,stride_w=1);
%15:[(1 224 224 64),F,FP32]=conv(%10:[(1 224 224 64),F,FP32],%11:[(64 3 3 64),W,FP32],kh=3,kw=3,pad_h_top=1,pad_h_bottom=1,pad_w_left=1,pad_w_right=1,stride_h=1,stride_w=1).
For example, as shown in FIG. 6D, the computing units PE0, PE1, PE2, and PE3 may have different data copy transmission modes. For example, the size of the first intermediate data rows each computing unit obtains from another computing unit may differ, which is not limited by embodiments of the present disclosure. For example, PE2 obtains 1 first intermediate data row from PE1 but 2 first intermediate data rows from PE3. For the steps of the scheduling method in the example shown in FIG. 6D, reference may be made to the description of FIG. 6A, which is not repeated here. The scheduling method of the example shown in FIG. 6D can not only reduce repeated calculation of data and improve the utilization of computing power, but also achieve, in consideration of the overall performance of the system, a balanced allocation of memory, computing power, and data transmission overhead between different computing units.
FIG. 7 is a schematic block diagram of a scheduling apparatus provided by some embodiments of the present disclosure. As shown in FIG. 7, the scheduling apparatus 100 includes a calculation control module 110, an allocation scheduling module 120, and a data transmission module 130. These components may be interconnected by a bus and/or other forms of connection mechanisms (not shown).
The calculation control module 110 is configured to cause a plurality of computing units to respectively perform a first convolution calculation on a corresponding plurality of data groups to obtain a corresponding plurality of first calculation result groups, wherein the plurality of first calculation result groups are used to constitute a first convolutional layer obtained by the first convolution calculation, and the plurality of computing units include a first computing unit and a second computing unit;
the allocation scheduling module 120 is configured to determine, according to a configuration rule, among the plurality of computing units, of a second convolutional layer to be obtained after the plurality of computing units perform a second convolution calculation on the first convolutional layer, the data copy transmission mode corresponding to the plurality of first calculation result groups on the plurality of computing units;
the data transmission module 130 is configured to cause the first computing unit, among the plurality of computing units, that needs valid data row padding to obtain, based on the corresponding data copy transmission mode and from the first calculation result group on the second computing unit, the first intermediate data row required by the first computing unit for padding during the second convolution calculation.
It should be noted that, in embodiments of the present disclosure, the modules of the scheduling apparatus 100 correspond to the steps of the aforementioned scheduling method; for the specific functions of the scheduling apparatus 100, reference may be made to the related description of the scheduling method above, which is not repeated here. The components and structure of the scheduling apparatus 100 shown in FIG. 7 are exemplary rather than limitative; the scheduling apparatus 100 may further include other components and structures as needed.
FIG. 8 is a schematic block diagram of an electronic device provided by some embodiments of the present disclosure. As shown in FIG. 8, the electronic device 200 includes a scheduling apparatus 210, which may be the scheduling apparatus provided by any embodiment of the present disclosure, for example the aforementioned scheduling apparatus 100. The electronic device 200 may be any device with a computing function, for example a server, a terminal device, a personal computer, etc., which is not limited by embodiments of the present disclosure.
FIG. 9 is a schematic block diagram of another electronic device provided by some embodiments of the present disclosure. As shown in FIG. 9, the electronic device 300 includes a processor 310 and a memory 320 and may be used to implement a client or a server. The memory 320 is used to store, non-transitorily, computer-executable instructions (for example, at least one (one or more) computer program module(s)). The processor 310 is used to run the computer-executable instructions; when run by the processor 310, the computer-executable instructions can execute one or more steps of the scheduling method described above, thereby implementing the scheduling method described above. The memory 320 and the processor 310 may be interconnected by a bus system and/or other forms of connection mechanisms (not shown).
For example, the processor 310 may be a central processing unit (CPU), a graphics processing unit (GPU), or another form of processing unit with data processing capability and/or program execution capability. For example, the central processing unit (CPU) may be of the X86 or ARM architecture, etc. The processor 310 may be a general-purpose processor or a special-purpose processor, and may control other components in the electronic device 300 to perform desired functions.
For example, the memory 320 may include any combination of at least one (for example, one or more) computer program product(s), and the computer program products may include various forms of computer-readable storage media, for example volatile memory and/or non-volatile memory. Volatile memory may include, for example, random access memory (RAM) and/or cache. Non-volatile memory may include, for example, read-only memory (ROM), hard disk, erasable programmable read-only memory (EPROM), portable compact disk read-only memory (CD-ROM), USB memory, flash memory, etc. At least one (for example, one or more) computer program module(s) may be stored on the computer-readable storage medium, and the processor 310 may run the at least one computer program module to implement various functions of the electronic device 300. Various applications and various data, as well as various data used and/or generated by the applications, may also be stored on the computer-readable storage medium.
It should be noted that, in embodiments of the present disclosure, for the specific functions and technical effects of the electronic device 300, reference may be made to the description of the scheduling method above, which is not repeated here.
FIG. 10 is a schematic block diagram of another electronic device provided by some embodiments of the present disclosure. The electronic device 400 is, for example, suitable for implementing the scheduling method provided by embodiments of the present disclosure. The electronic device 400 may be a terminal device or the like and may be used to implement a client or a server. The electronic device 400 may include, but is not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), vehicle-mounted terminals (for example, vehicle-mounted navigation terminals), wearable electronic devices, and the like, and fixed terminals such as digital TVs, desktop computers, smart home devices, and the like. It should be noted that the electronic device 400 shown in FIG. 10 is only an example and does not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
As shown in FIG. 10, the electronic device 400 may include a processing device (for example, a central processing unit, a graphics processor, etc.) 410, which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 420 or a program loaded from a storage device 480 into a random access memory (RAM) 430. The RAM 430 also stores various programs and data required for the operation of the electronic device 400. The processing device 410, the ROM 420, and the RAM 430 are connected to each other through a bus 440. An input/output (I/O) interface 450 is also connected to the bus 440.
Generally, the following devices may be connected to the I/O interface 450: input devices 460 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 470 including, for example, a liquid crystal display (LCD), speaker, vibrator, etc.; storage devices 480 including, for example, magnetic tape, hard disk, etc.; and a communication device 490. The communication device 490 may allow the electronic device 400 to communicate wirelessly or by wire with other electronic devices to exchange data. Although FIG. 10 shows the electronic device 400 with various devices, it should be understood that it is not required to implement or have all the devices shown; the electronic device 400 may alternatively implement or have more or fewer devices.
For example, according to embodiments of the present disclosure, the scheduling method described above may be implemented as a computer software program. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer-readable medium, the computer program containing program code for executing the scheduling method described above. In such embodiments, the computer program may be downloaded and installed from a network through the communication device 490, or installed from the storage device 480, or installed from the ROM 420. When the computer program is executed by the processing device 410, the functions defined in the scheduling method provided by embodiments of the present disclosure can be implemented.
At least one embodiment of the present disclosure further provides a storage medium. With this storage medium, the utilization rate of the matrix operation unit can be improved, the computing power of the matrix operation unit can be effectively utilized, the time of the convolution operation can be shortened, the operation efficiency can be improved, and data transmission time can be saved.
FIG. 11 is a schematic diagram of a storage medium provided by some embodiments of the present disclosure. For example, as shown in FIG. 11, the storage medium 500 may be a non-transitory computer-readable storage medium storing non-transitory computer-readable instructions 510. When the non-transitory computer-readable instructions 510 are executed by a processor, the scheduling method described in embodiments of the present disclosure can be implemented; for example, when the non-transitory computer-readable instructions 510 are executed by a processor, one or more steps of the scheduling method described above may be executed.
For example, the storage medium 500 may be applied in the electronic devices described above; for example, the storage medium 500 may include the memory 320 in the electronic device 300.
For example, the storage medium may include a memory card of a smart phone, a storage component of a tablet computer, a hard disk of a personal computer, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), portable compact disk read-only memory (CD-ROM), flash memory, any combination of the above storage media, or other applicable storage media.
For example, for the description of the storage medium 500, reference may be made to the description of the memory in the embodiments of the electronic device, and repeated parts are not described again. For the specific functions and technical effects of the storage medium 500, reference may be made to the description of the scheduling method above, which is not repeated here.
In the foregoing, the scheduling method, scheduling apparatus, electronic device, and storage medium provided by embodiments of the present disclosure have been described with reference to FIGS. 1 to 11. By copying and transferring valid data between computing units and using the valid data as padding rows, the scheduling method provided by embodiments of the present disclosure can reduce the amount of repeated data calculation in the convolutional neural network and improve the calculation efficiency of the multi-layer convolutional neural network.
It should be noted that, in the context of the present disclosure, a computer-readable medium may be a tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium or any combination of the two. A computer-readable storage medium may be, for example, but is not limited to: an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present disclosure, a computer-readable storage medium may be any tangible medium containing or storing a program that can be used by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, which carries computer-readable program code. Such a propagated data signal may take a variety of forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium; the computer-readable signal medium may send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium may be transmitted by any suitable medium, including but not limited to: an electric wire, an optical cable, RF (radio frequency), etc., or any suitable combination of the above.
In some implementations, the client and the server may communicate using any currently known or future-developed network protocol such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital data communication in any form or medium (for example, a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (for example, the Internet), and a peer-to-peer network (for example, an ad hoc peer-to-peer network), as well as any currently known or future-developed network.
The computer-readable medium may be included in the electronic device described above, or may exist separately without being assembled into the electronic device.
Computer program code for performing the operations of the present disclosure may be written in one or more programming languages or combinations thereof. The programming languages include, but are not limited to, object-oriented programming languages such as Java, Smalltalk, and C++, and also include conventional procedural programming languages such as the "C" language or similar programming languages. The program code may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case involving a remote computer, the remote computer may be connected to the user's computer through any kind of network (including a local area network (LAN) or a wide area network (WAN)), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the accompanying drawings illustrate the possible architectures, functions, and operations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or part of code, and the module, program segment, or part of code contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions marked in the blocks may occur in an order different from that marked in the drawings. For example, two blocks shown in succession may actually be executed substantially in parallel, and they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and the combinations of blocks in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units involved in the embodiments described in the present disclosure may be implemented by software or by hardware. The name of a unit does not, in some cases, constitute a limitation of the unit itself.
The functions described herein above may be performed at least in part by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that can be used include: field programmable gate arrays (FPGA), application specific integrated circuits (ASIC), application specific standard products (ASSP), systems on chip (SOC), complex programmable logic devices (CPLD), and so on.
According to one or more embodiments of the present disclosure, Example 1 provides a scheduling method for a multi-layer convolutional neural network, comprising: performing, by a plurality of computing units respectively, a first convolution calculation on a corresponding plurality of data groups to obtain a corresponding plurality of first calculation result groups, wherein the plurality of first calculation result groups are used to constitute a first convolutional layer obtained by the first convolution calculation, and the plurality of computing units comprise a first computing unit and a second computing unit; determining, according to a configuration rule, among the plurality of computing units, of a second convolutional layer to be obtained after the plurality of computing units perform a second convolution calculation on the first convolutional layer, a data copy transmission mode corresponding to the plurality of first calculation result groups on the plurality of computing units; and obtaining, by the first computing unit, among the plurality of computing units, that needs valid data row padding, based on the corresponding data copy transmission mode and from the first calculation result group on the second computing unit, a first intermediate data row required by the first computing unit for padding during the second convolution calculation.
Example 2. The scheduling method according to Example 1, wherein the data rows in the plurality of first calculation result groups are contiguous without overlap in the first convolutional layer.
Example 3. The scheduling method according to Example 1, wherein the first calculation result group on the first computing unit and the first intermediate data row obtained from the second computing unit constitute a data group required for the first computing unit to perform the second convolution calculation.
Example 4. The scheduling method according to Example 1, wherein the corresponding plurality of data groups respectively used by the plurality of computing units in performing the first convolution calculation are a plurality of original input data groups.
Example 5. The scheduling method according to Example 4, further comprising:
splitting an original input matrix to obtain the plurality of original input data groups, and transmitting the plurality of original input data groups respectively to the plurality of computing units for the first convolution calculation.
Example 6. The scheduling method according to Example 1, wherein the determining, according to the configuration rule, among the plurality of computing units, of the second convolutional layer to be obtained after the plurality of computing units perform the second convolution calculation on the first convolutional layer, the data copy transmission mode corresponding to the plurality of first calculation result groups on the plurality of computing units comprises:
determining, based on memory sizes of memories on the plurality of computing units, the configuration rule of the second convolutional layer among the plurality of computing units;
determining, based on the configuration rule, a distribution of the second convolutional layer on the plurality of computing units;
determining, based on the distribution, the data copy transmission mode corresponding to the first calculation result groups on the plurality of computing units.
Example 7. The scheduling method according to Example 1, wherein the obtaining, from the first calculation result group on the second computing unit, the first intermediate data row required by the first computing unit for padding during the second convolution calculation comprises:
determining a size of the first intermediate data row based on a relationship between a second calculation result group to be obtained after the first computing unit performs the second convolution calculation and the first calculation result group obtained by the first computing unit through the first convolution calculation;
obtaining, based on the data copy transmission mode and the size of the first intermediate data row, the first intermediate data row from the first calculation result group on the second computing unit.
Example 8. The scheduling method according to Example 7, wherein a size of the second calculation result group to be obtained by the first computing unit is the same as a size of the first calculation result group on the first computing unit.
Example 9. The scheduling method according to Example 7, further comprising:
in response to the second convolutional layer being an output layer in the multi-layer convolutional neural network, aggregating a plurality of second calculation result groups in the plurality of computing units to obtain at least part of a calculation output of the multi-layer convolutional neural network.
Example 10. The scheduling method according to any one of Examples 1-9, further comprising:
obtaining, by the second computing unit, based on the corresponding data copy transmission mode and from the first calculation result group on the first computing unit, a second intermediate data row required by the second computing unit for padding during the second convolution calculation.
Example 11. The scheduling method according to Example 10, wherein the first intermediate data row and the second intermediate data row obtained by the first computing unit and the second computing unit from each other are of the same size.
According to one or more embodiments of the present disclosure, Example 12 provides a scheduling apparatus for a multi-layer convolutional neural network, comprising:
a calculation control module configured to cause a plurality of computing units to respectively perform a first convolution calculation on a corresponding plurality of data groups to obtain a corresponding plurality of first calculation result groups, wherein the plurality of first calculation result groups are used to constitute a first convolutional layer obtained by the first convolution calculation, and the plurality of computing units comprise a first computing unit and a second computing unit;
an allocation scheduling module configured to determine, according to a configuration rule, among the plurality of computing units, of a second convolutional layer to be obtained after the plurality of computing units perform a second convolution calculation on the first convolutional layer, a data copy transmission mode corresponding to the plurality of first calculation result groups on the plurality of computing units;
a data transmission module configured to cause the first computing unit, among the plurality of computing units, that needs valid data row padding to obtain, based on the corresponding data copy transmission mode and from the first calculation result group on the second computing unit, a first intermediate data row required by the first computing unit for padding during the second convolution calculation.
Example 13. The scheduling apparatus according to Example 12, wherein the data rows in the plurality of first calculation result groups are contiguous without overlap in the first convolutional layer.
Example 14. The scheduling apparatus according to Example 12, wherein the first calculation result group on the first computing unit and the first intermediate data row obtained from the second computing unit constitute a data group required for the first computing unit to perform the second convolution calculation.
Example 15. The scheduling apparatus according to Example 12, wherein the corresponding plurality of data groups respectively used by the plurality of computing units in performing the first convolution calculation are a plurality of original input data groups.
Example 16. The scheduling apparatus according to Example 15, further comprising:
a data splitting module configured to split an original input matrix to obtain the plurality of original input data groups, and to transmit the plurality of original input data groups respectively to the plurality of computing units for the first convolution calculation.
Example 17. The scheduling apparatus according to Example 12, wherein the determining, according to the configuration rule, among the plurality of computing units, of the second convolutional layer to be obtained after the plurality of computing units perform the second convolution calculation on the first convolutional layer, the data copy transmission mode corresponding to the plurality of first calculation result groups on the plurality of computing units comprises:
determining, based on memory sizes of memories on the plurality of computing units, the configuration rule of the second convolutional layer among the plurality of computing units;
determining, based on the configuration rule, a distribution of the second convolutional layer on the plurality of computing units;
determining, based on the distribution, the data copy transmission mode corresponding to the first calculation result groups on the plurality of computing units.
Example 18. The scheduling apparatus according to Example 12, wherein the obtaining, from the first calculation result group on the second computing unit, the first intermediate data row required by the first computing unit for padding during the second convolution calculation comprises:
determining a size of the first intermediate data row based on a relationship between a second calculation result group to be obtained after the first computing unit performs the second convolution calculation and the first calculation result group obtained by the first computing unit through the first convolution calculation;
obtaining, based on the data copy transmission mode and the size of the first intermediate data row, the first intermediate data row from the first calculation result group on the second computing unit.
Example 19. The scheduling apparatus according to Example 18, wherein a size of the second calculation result group to be obtained by the first computing unit is the same as a size of the first calculation result group on the first computing unit.
Example 20. The scheduling apparatus according to Example 18, further comprising:
a data output module configured to, in response to the second convolutional layer being an output layer in the multi-layer convolutional neural network, aggregate a plurality of second calculation result groups in the plurality of computing units to obtain at least part of a calculation output of the multi-layer convolutional neural network.
Example 21. The scheduling apparatus according to any one of Examples 12-20, wherein the data transmission module is further configured to:
cause the second computing unit to obtain, based on the corresponding data copy transmission mode and from the first calculation result group on the first computing unit, a second intermediate data row required by the second computing unit for padding during the second convolution calculation.
Example 22. The scheduling apparatus according to Example 21, wherein the first intermediate data row and the second intermediate data row obtained by the first computing unit and the second computing unit from each other are of the same size.
According to one or more embodiments of the present disclosure, Example 23 provides an electronic device, comprising the scheduling apparatus according to any one of Examples 12-22.
According to one or more embodiments of the present disclosure, Example 24 provides an electronic device, comprising: a processor; and a memory comprising at least one computer program module; wherein the at least one computer program module is stored in the memory and configured to be executed by the processor, and the at least one computer program module is used to implement the scheduling method according to any one of Examples 1-11.
According to one or more embodiments of the present disclosure, Example 25 provides a storage medium storing non-transitory computer-readable instructions, wherein when the non-transitory computer-readable instructions are executed by a computer, the scheduling method according to any one of Examples 1-11 is implemented.
Although the present disclosure has been described in detail above with general descriptions and specific embodiments, some modifications or improvements may be made on the basis of the embodiments of the present disclosure, which is obvious to those skilled in the art. Therefore, such modifications or improvements made without departing from the spirit of the present disclosure all fall within the scope of protection claimed by the present disclosure.
For the present disclosure, the following points also need to be noted:
(1) The drawings of the embodiments of the present disclosure relate only to the structures involved in the embodiments of the present disclosure; other structures may refer to common designs.
(2) For the sake of clarity, in the drawings used to describe the embodiments of the present disclosure, the thickness of a layer or region is enlarged or reduced; that is, these drawings are not drawn to actual scale.
(3) Without conflict, the embodiments of the present disclosure and the features in the embodiments may be combined with each other to obtain new embodiments.
The above are only specific implementations of the present disclosure, but the protection scope of the present disclosure is not limited thereto; the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (15)

  1. A scheduling method for a multi-layer convolutional neural network, comprising:
    performing, by a plurality of computing units respectively, a first convolution calculation on a corresponding plurality of data groups to obtain a corresponding plurality of first calculation result groups, wherein the plurality of first calculation result groups are used to constitute a first convolutional layer obtained by the first convolution calculation, and the plurality of computing units comprise a first computing unit and a second computing unit;
    determining, according to a configuration rule, among the plurality of computing units, of a second convolutional layer to be obtained after the plurality of computing units perform a second convolution calculation on the first convolutional layer, a data copy transmission mode corresponding to the plurality of first calculation result groups on the plurality of computing units; and
    obtaining, by the first computing unit, among the plurality of computing units, that needs valid data row padding, based on the corresponding data copy transmission mode and from the first calculation result group on the second computing unit, a first intermediate data row required by the first computing unit for padding during the second convolution calculation.
  2. The scheduling method according to claim 1, wherein the data rows in the plurality of first calculation result groups are contiguous without overlap in the first convolutional layer.
  3. The scheduling method according to claim 1 or 2, wherein the first calculation result group on the first computing unit and the first intermediate data row obtained from the second computing unit constitute a data group required for the first computing unit to perform the second convolution calculation.
  4. The scheduling method according to any one of claims 1-3, wherein the corresponding plurality of data groups respectively used by the plurality of computing units in performing the first convolution calculation are a plurality of original input data groups.
  5. The scheduling method according to claim 4, further comprising:
    splitting an original input matrix to obtain the plurality of original input data groups, and transmitting the plurality of original input data groups respectively to the plurality of computing units for the first convolution calculation.
  6. The scheduling method according to any one of claims 1-5, wherein the determining, according to the configuration rule, among the plurality of computing units, of the second convolutional layer to be obtained after the plurality of computing units perform the second convolution calculation on the first convolutional layer, the data copy transmission mode corresponding to the plurality of first calculation result groups on the plurality of computing units comprises:
    determining, based on memory sizes of memories on the plurality of computing units, the configuration rule of the second convolutional layer among the plurality of computing units;
    determining, based on the configuration rule, a distribution of the second convolutional layer on the plurality of computing units;
    determining, based on the distribution, the data copy transmission mode corresponding to the first calculation result groups on the plurality of computing units.
  7. The scheduling method according to any one of claims 1-6, wherein the obtaining, from the first calculation result group on the second computing unit, the first intermediate data row required by the first computing unit for padding during the second convolution calculation comprises:
    determining a size of the first intermediate data row based on a relationship between a second calculation result group to be obtained after the first computing unit performs the second convolution calculation and the first calculation result group obtained by the first computing unit through the first convolution calculation;
    obtaining, based on the data copy transmission mode and the size of the first intermediate data row, the first intermediate data row from the first calculation result group on the second computing unit.
  8. The scheduling method according to claim 7, wherein a size of the second calculation result group to be obtained by the first computing unit is the same as a size of the first calculation result group on the first computing unit.
  9. The scheduling method according to claim 7, further comprising:
    in response to the second convolutional layer being an output layer in the multi-layer convolutional neural network, aggregating a plurality of second calculation result groups in the plurality of computing units to obtain at least part of a calculation output of the multi-layer convolutional neural network.
  10. The scheduling method according to any one of claims 1-9, further comprising:
    obtaining, by the second computing unit, based on the corresponding data copy transmission mode and from the first calculation result group on the first computing unit, a second intermediate data row required by the second computing unit for padding during the second convolution calculation.
  11. The scheduling method according to claim 10, wherein the first intermediate data row and the second intermediate data row obtained by the first computing unit and the second computing unit from each other are of the same size.
  12. A scheduling apparatus for a multi-layer convolutional neural network, comprising:
    a calculation control module configured to cause a plurality of computing units to respectively perform a first convolution calculation on a corresponding plurality of data groups to obtain a corresponding plurality of first calculation result groups, wherein the plurality of first calculation result groups are used to constitute a first convolutional layer obtained by the first convolution calculation, and the plurality of computing units comprise a first computing unit and a second computing unit;
    an allocation scheduling module configured to determine, according to a configuration rule, among the plurality of computing units, of a second convolutional layer to be obtained after the plurality of computing units perform a second convolution calculation on the first convolutional layer, a data copy transmission mode corresponding to the plurality of first calculation result groups on the plurality of computing units;
    a data transmission module configured to cause the first computing unit, among the plurality of computing units, that needs valid data row padding to obtain, based on the corresponding data copy transmission mode and from the first calculation result group on the second computing unit, a first intermediate data row required by the first computing unit for padding during the second convolution calculation.
  13. An electronic device, comprising the scheduling apparatus according to claim 12.
  14. An electronic device, comprising:
    a processor;
    a memory comprising at least one computer program module;
    wherein the at least one computer program module is stored in the memory and configured to be executed by the processor, and the at least one computer program module is used to implement the scheduling method according to any one of claims 1-11.
  15. A storage medium storing non-transitory computer-readable instructions, wherein when the non-transitory computer-readable instructions are executed by a computer, the scheduling method according to any one of claims 1-11 is implemented.
PCT/CN2023/119412 2022-09-27 2023-09-18 Scheduling method, scheduling apparatus, electronic device, and storage medium WO2024067207A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211188739.0 2022-09-27
CN202211188739.0A CN117827386A (zh) Scheduling method, scheduling apparatus, electronic device, and storage medium

Publications (1)

Publication Number Publication Date
WO2024067207A1 true WO2024067207A1 (zh) 2024-04-04

Family

ID=90476147

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/119412 WO2024067207A1 (zh) 2022-09-27 2023-09-18 Scheduling method, scheduling apparatus, electronic device, and storage medium

Country Status (2)

Country Link
CN (1) CN117827386A (zh)
WO (1) WO2024067207A1 (zh)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109740732A (zh) * 2018-12-27 2019-05-10 深圳云天励飞技术有限公司 Neural network processor, convolutional neural network data multiplexing method, and related device
CN111260020A (zh) * 2018-11-30 2020-06-09 深圳市海思半导体有限公司 Method and apparatus for convolutional neural network computation
CN111597029A (zh) * 2020-05-20 2020-08-28 上海商汤智能科技有限公司 Data processing method and apparatus, electronic device, and storage medium
CN112200300A (zh) * 2020-09-15 2021-01-08 厦门星宸科技有限公司 Convolutional neural network operation method and apparatus
KR20210029595A (ko) * 2019-09-06 2021-03-16 주식회사 하이퍼커넥트 Keyword spotting apparatus, method, and computer-readable recording medium
CN112633470A (zh) * 2020-12-11 2021-04-09 苏州浪潮智能科技有限公司 Method, system, device, and medium for optimizing the convolution residual structure of a neural network
CN114548354A (zh) * 2020-11-26 2022-05-27 珠海格力电器股份有限公司 Calculation method, apparatus, device, and computer-readable medium for a multi-layer neural network

Also Published As

Publication number Publication date
CN117827386A (zh) 2024-04-05

Similar Documents

Publication Publication Date Title
JP7029554B2 (ja) Method and apparatus for training a deep learning model, electronic device, computer-readable storage medium, and computer program
CN110262901B (zh) Data processing method and data processing system
US20050071404A1 (en) System and method for solving a large system of dense linear equations
US11868243B2 (en) Topological scheduling
WO2022108527A1 (zh) Model processing method, system and apparatus, medium, and electronic device
CN111580974B (zh) GPU instance allocation method and apparatus, electronic device, and computer-readable medium
US20220012578A1 (en) Methods, apparatus, and articles of manufacture to increase utilization of neural network (nn) accelerator circuitry for shallow layers of an nn by reformatting one or more tensors
WO2024067207A1 (zh) Scheduling method, scheduling apparatus, electronic device, and storage medium
CN111680791B (zh) Communication method, apparatus, and system suitable for heterogeneous environments
CN117271136A (zh) Data processing method, apparatus, device, and storage medium
CN112416303A (zh) Software development kit hot-fix method and apparatus, and electronic device
US20220350863A1 (en) Technology to minimize the negative impact of cache conflicts caused by incompatible leading dimensions in matrix multiplication and convolution kernels without dimension padding
Prades et al. Multi-tenant virtual GPUs for optimising performance of a financial risk application
WO2023087227A1 (zh) Data processing apparatus and method
CN116108787A (zh) Routing cost optimization method and system between multiple FPGAs, storage medium, and electronic device
CN111694672B (zh) Resource allocation method, task submission method, apparatus, electronic device, and medium
Igual et al. Scheduling algorithms‐by‐blocks on small clusters
CN114424174A (zh) Parameter caching for neural network accelerators
WO2021077364A1 (zh) Data processing method and apparatus, electronic device, and computer-readable storage medium
WO2023231999A1 (zh) Convolution operation method, convolution operation apparatus, electronic device, and storage medium
Zeng et al. Parallel multi-GPU implementation of fast decoupled power flow solver with hybrid architecture
US20220303333A1 (en) Utilizing reinforcement learning for serverless function tuning
WO2024061162A1 (zh) Data processing method, data processing apparatus, and storage medium
US20210255866A1 (en) Acceleration unit, system-on-chip, server, data center, and related method
CN118035618A (zh) Data processor, data processing method, electronic device, and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23870441

Country of ref document: EP

Kind code of ref document: A1