WO2019206161A1 - Pooling operation device - Google Patents

Pooling operation device

Info

Publication number
WO2019206161A1
WO2019206161A1 PCT/CN2019/084004 CN2019084004W WO2019206161A1 WO 2019206161 A1 WO2019206161 A1 WO 2019206161A1 CN 2019084004 W CN2019084004 W CN 2019084004W WO 2019206161 A1 WO2019206161 A1 WO 2019206161A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
pooling
register
result
storage module
Prior art date
Application number
PCT/CN2019/084004
Other languages
English (en)
French (fr)
Inventor
梁晓峣
景乃锋
崔晓松
陈云
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2019206161A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Definitions

  • The present application relates to the field of neural networks, and in particular to a pooling operation device.
  • Convolutional neural networks are commonly used for image recognition.
  • Convolutional neural networks generally include a convolutional layer, a pooling layer, and a fully connected layer.
  • The pooling layer is generally placed after the convolutional layer.
  • The operation of the pooling layer is called pooling.
  • Pooling slides a fixed-size window across the entire image plane and, at each position, performs an operation on the data covered by the window, such as taking the maximum or the average, and outputs the result. This window is called a pooling window.
  • The size of the pooling window is k1*k2, where k1 and k2 are each integers not less than 2. Pooling includes maximum pooling and average pooling.
  • The present application provides a pooling operation device that can avoid repeated reads and writes to a certain extent, thereby effectively improving data read/write efficiency.
  • A pooling operation device is provided, comprising: a plurality of register groups for storing a plurality of data; and a plurality of computing units for performing a pooling operation on the plurality of data, wherein the data operated on by different computing units is located in different register groups of the plurality of register groups. A first computing unit of the plurality of computing units is configured to: perform a first pooling operation on first data and second data of the plurality of data to obtain a first operation result; store the first operation result; acquire third data from a first register group of the plurality of register groups; and perform a second pooling operation on the first operation result and the third data.
  • Because the data acquired by different computing units is located in different register groups of the plurality of register groups, data read conflicts are effectively avoided, so that parallel pooling can be better realized and the efficiency of the pooling operation improved.
  • The first operation result is an intermediate result of the pooling operation corresponding to a pooling window, and this intermediate result also takes part in a subsequent operation (for example, the second pooling operation).
  • The first computing unit stores the intermediate result, so that subsequent operations can use it directly without reading data from an external register file. Compared with the prior art, in which the pooling operation is performed by an image processor, the pooling operation device provided by the present application can effectively improve data read/write efficiency, thereby improving pooling efficiency as a whole.
  • Each computing unit reads one pooling operand at a time and operates on two data at a time, so the pooling operation device provided by the present application is not affected by changes in the size of the pooling window.
  • In other words, the pooling operation device provided by the embodiments of the present application can be applied to pooling operations with a pooling window of any size.
  • The pooling operation device can implement parallel pooling through multiple computing units and multiple register groups, which improves pooling efficiency. In addition, each computing unit can store the intermediate result of the pooling operation, which improves data read/write efficiency, so that pooling efficiency is improved as a whole and the pooling operation is accelerated as far as possible.
  • The data acquired in different clock cycles may be located in different register groups or in the same register group.
  • The data acquired by different computing units may be located in different register groups or in the same register group.
  • Each of the plurality of register groups has one read port. That is, one data can be read out of a register group per clock cycle.
  • The number of computing units is less than or equal to the number of register groups.
  • Any one of the plurality of computing units is capable of reading data from any one of the plurality of register groups.
  • The connection relationship between the plurality of register groups and the plurality of computing units is: each of the plurality of computing units is connected to all of the plurality of register groups.
  • This connection relationship can be referred to as a full connection.
  • The connection relationship between the plurality of register groups and the plurality of computing units may also be: each of the plurality of computing units is connected to a subset of the plurality of register groups.
  • The first computing unit includes a storage module and an operation module.
  • The operation module is configured to perform the first pooling operation on the first data and the second data acquired from the plurality of register groups to obtain the first operation result, and to store the first operation result in the storage module. The operation module is further configured to perform the second pooling operation on the first operation result stored in the storage module and the third data acquired from the plurality of register groups.
  • When the first pooling operation is a comparison operation, that is, the first data is compared with the second data, the operation module may include an adder or a comparator.
  • When the first pooling operation is an accumulation operation, that is, the first data and the second data are accumulated, the operation module includes an adder.
  • The operation module further includes a multiplier for averaging the total accumulated result of all operands in the pooling window.
  • The pooling operation device implements the operations involved in pooling in hardware rather than through instruction control.
  • The first pooling operation includes a maximum pooling operation, and the first computing unit includes: a first data interface, configured to receive the first data acquired from the plurality of register groups; a second data interface, configured to receive the second data acquired from the plurality of register groups; a first storage module, configured to store the first data; a second storage module, configured to store the second data; an operation module, configured to compare the first data with the second data to obtain the first operation result and store the first operation result in a latch, where the comparison result is that the first data is larger than the second data; and the latch, configured to store the first operation result and, according to the first operation result, send a feedback signal to the first data interface and the second data interface, where the feedback signal instructs the first data interface to close and the second data interface to open, and the opened second data interface is configured to receive the third data acquired from the first register group. The operation module is further configured to perform the second pooling operation on the first operation result and the third data.
  • The pooling operation device provided by the present application can be used to implement maximum pooling.
  • The first pooling operation includes an average pooling operation, and the first computing unit specifically includes: a first data interface, configured to receive the first data from the plurality of register groups; a second data interface, configured to receive the second data from the plurality of register groups; a first storage module, configured to store the first data; a second storage module, configured to store the second data; and an adder, configured to accumulate the first data and the second data to obtain the first operation result. The second storage module is further configured to store the first operation result; the first data interface is further configured to acquire the third data from the first register group; and the adder is further configured to perform the second pooling operation on the first operation result and the third data.
  • The first computing unit further includes a multiplier, configured to multiply the accumulated result of k1*k2 data by 1/(k1*k2) once the adder has obtained the accumulated result of the k1*k2 data, to obtain the average of the k1*k2 data, where k1*k2 is the size of the pooling window corresponding to the pooling operation, and k1 and k2 are each integers not less than 2.
  • The pooling operation device provided by the present application can be used to implement average pooling.
  • The pooling operation device further includes a control unit, configured to send a control signal to the plurality of computing units, where the control signal indicates whether the pooling operation is maximum pooling or average pooling.
  • The first computing unit is further configured to receive the control signal. When performing the first pooling operation on the first data and the second data of the plurality of data, the first computing unit is specifically configured to: when the control signal indicates that the pooling operation is maximum pooling, perform a maximum pooling operation on the first data and the second data; and when the control signal indicates that the pooling operation is average pooling, perform an average pooling operation on the first data and the second data.
  • The pooling operation device provided by the present application can handle both average pooling and maximum pooling, which improves hardware utilization and reduces hardware cost.
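  • The mode selection described above can be illustrated in software. The following minimal Python sketch assumes a two-valued control signal, modelled by the hypothetical enum `PoolMode`, and a hypothetical function `pool_window` that processes the operands of one pooling window two at a time; it is an illustration under these assumptions, not the claimed circuit.

```python
from enum import Enum
from typing import Sequence


class PoolMode(Enum):
    """Hypothetical encoding of the control signal sent by the control unit."""
    MAX = 0
    AVG = 1


def pool_window(operands: Sequence[float], mode: PoolMode) -> float:
    """Pool one window, combining the stored result with one new operand per step."""
    acc = operands[0]
    for x in operands[1:]:
        if mode is PoolMode.MAX:
            # Maximum pooling: keep the larger of the stored value and the new operand.
            acc = acc if acc >= x else x
        else:
            # Average pooling: accumulate; the scaling happens once at the end.
            acc = acc + x
    if mode is PoolMode.AVG:
        # Multiply the accumulated sum by 1/(k1*k2), as a multiplier would.
        acc = acc * (1.0 / len(operands))
    return acc


max_result = pool_window([9, 5, 10, 32], PoolMode.MAX)
avg_result = pool_window([9, 5, 10, 32], PoolMode.AVG)
```

  • The same loop serves both modes; only the combining step and the final scaling depend on the control signal, which mirrors the hardware-sharing argument made in the text.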
  • A second aspect provides a computer device, including a memory and the pooling operation device provided by the first aspect, where the memory is used to store the data for the pooling operation performed by the pooling operation device.
  • Figure 1 is a schematic diagram of a pooling operation.
  • FIG. 2 is a schematic block diagram of a pooling operation device according to an embodiment of the present application.
  • FIG. 4 and FIG. 6 are schematic structural diagrams of a computing unit in an embodiment of the present application.
  • FIG. 7 is a schematic diagram of data storage in a plurality of register groups in an embodiment of the present application.
  • FIG. 8 to FIG. 11 are schematic diagrams of a computing unit reading data from a register group in an embodiment of the present application.
  • FIG. 12 and FIG. 13 are further schematic diagrams of data storage in a plurality of register groups in an embodiment of the present application.
  • Pooling refers to the operation of the pooling layer in a neural network.
  • The pooling operation slides a fixed-size window across the entire image plane and, at each position, computes the maximum or the average of the data covered by the window as the output.
  • This window can be called a pooling window.
  • The size of the pooling window may be k1*k2, where k1 and k2 are integers not less than 2. The values of k1 and k2 may be the same or different, which is not limited in the embodiments of the present invention.
  • Figure 1 is a schematic diagram of a pooling operation.
  • The size of the input image (i.e., the image to be pooled) is 4*4.
  • The pooling operation slides a 2*2 pooling window over the 4*4 image with a stride of 2. The 4 data covered by the pooling window at each position produce one output result, and all output results constitute the output image.
  • The size of the output image is 2*2.
  • d1-d4 represent image data (i.e., pixel values) in the input image, and o1 represents image data (i.e., a pixel value) in the output image.
  • The operator op may compute either the maximum (max) or the average (avg).
  • When the operator op computes the maximum (max), the corresponding pooling operation is called maximum pooling.
  • When the operator op computes the average (avg), the corresponding pooling operation is called average pooling.
  • The data in the input image covered by a pooling window can be referred to as pooling operands. For example, if the size of the pooling window is k1*k2, then one pooling operation includes k1*k2 pooling operands.
  • The pooling operations referred to herein may be average pooling operations or maximum pooling operations.
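  • As an illustration of the window operation described above, the following minimal Python sketch applies a 2*2 window with a stride of 2 to two rows of example data. The function names `pool2d` and `avg` are hypothetical, and the row-major arrangement of the values (9, 5, 5, 3 in the first row; 10, 32, 2, 2 in the second) is an assumption based on the example windows given in this description. It is a software reference only, not the hardware device.

```python
from typing import Callable, List

Matrix = List[List[float]]


def pool2d(image: Matrix, k1: int, k2: int, stride: int,
           op: Callable[[List[float]], float]) -> Matrix:
    """Slide a k1*k2 window over `image` and apply `op` to each window."""
    rows, cols = len(image), len(image[0])
    out: Matrix = []
    for r in range(0, rows - k1 + 1, stride):
        out_row = []
        for c in range(0, cols - k2 + 1, stride):
            # Collect the k1*k2 pooling operands covered by the window.
            window = [image[r + i][c + j] for i in range(k1) for j in range(k2)]
            out_row.append(op(window))
        out.append(out_row)
    return out


def avg(window: List[float]) -> float:
    """Average of the pooling operands in one window."""
    return sum(window) / len(window)


# Assumed arrangement of the first two rows of the example input image.
two_rows = [[9, 5, 5, 3],
            [10, 32, 2, 2]]

max_out = pool2d(two_rows, 2, 2, 2, max)   # maximum pooling
avg_out = pool2d(two_rows, 2, 2, 2, avg)   # average pooling
```

  • Each window yields one output value, so two rows of four values produce a single output row with two entries under either operator.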
  • FIG. 2 is a schematic block diagram of a pooling operation device 200 according to an embodiment of the present application. As shown in FIG. 2, the device 200 includes a plurality of register groups 210 and a plurality of computing units 220.
  • The plurality of register groups 210 are used to store a plurality of data.
  • The plurality of data stored by the plurality of register groups is the data to be subjected to a pooling operation.
  • For example, the plurality of data is the data of the first row and the second row of the input image. It can be understood that each register group includes a plurality of registers.
  • Each register group 210 has one read port.
  • One data can be read from a register group at a time.
  • The register group in this embodiment may be referred to as a bank, and the plurality of register groups are a plurality of banks.
  • The plurality of computing units 220 are configured to perform a pooling operation on the plurality of data, wherein the data operated on by different computing units is located in different register groups of the plurality of register groups.
  • The plurality of computing units 220 are configured to perform pooling operations on the data of different pooling windows in parallel.
  • The plurality of computing units 220 includes two computing units (for example, computing unit 1 and computing unit 2). Assume that the data of the first row and the second row of the input image shown in FIG. 1 is stored in the plurality of register groups 210, and that the size of the pooling window is 2*2. The two rows of data can then be divided into two pooling windows, where the data of the first pooling window includes 9, 5, 10, and 32, and the data of the second pooling window includes 5, 3, 2, and 2.
  • Computing unit 1 and computing unit 2 can perform pooling operations on the data of the two pooling windows, respectively.
  • Computing unit 1 performs the pooling operation on the data of the first pooling window, and computing unit 2 performs the pooling operation on the data of the second pooling window.
  • At clock cycle T, computing unit 1 reads pooling operand 9 from the register groups and computing unit 2 reads pooling operand 5; at the next clock cycle T+1, computing unit 1 reads pooling operand 5 and computing unit 2 reads pooling operand 3; at the next clock cycle T+2, computing unit 1 reads pooling operand 10 and computing unit 2 reads pooling operand 2; at the next clock cycle T+3, computing unit 1 reads pooling operand 32 and computing unit 2 reads pooling operand 2.
  • Computing unit 1 and computing unit 2 can thus obtain the pooling results of the two pooling windows at the same time.
  • The pooling operation device provided by the embodiments of the present invention can implement parallel pooling, which effectively improves the efficiency of the pooling operation.
  • In the same clock cycle, the data acquired by different computing units is located in different register groups.
  • At clock cycle T, computing unit 1 reads pooling operand 9 and computing unit 2 reads pooling operand 5, where pooling operands 9 and 5 are located in different register groups; at the next clock cycle T+1, computing unit 1 reads pooling operand 5 and computing unit 2 reads pooling operand 3, where pooling operands 5 and 3 are located in different register groups; at the next clock cycle T+2, computing unit 1 reads pooling operand 10 and computing unit 2 reads pooling operand 2, where pooling operands 10 and 2 are located in different register groups; at the next clock cycle T+3, computing unit 1 reads pooling operand 32 and computing unit 2 reads pooling operand 2, where pooling operands 32 and 2 are located in different register groups.
  • The data acquired by the same computing unit in different clock cycles may be located in different register groups or in the same register group, which is not limited in the embodiments of the present application.
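  • The requirement that different computing units read from different register groups in the same clock cycle can be checked with a small model. The Python sketch below assumes a hypothetical layout in which column `col` of every image row is stored in bank `col % num_banks`; the text excerpted here does not specify the actual layout, so `bank_of`, `window_reads`, `read_schedule`, and `conflict_free` are illustrative names only.

```python
from typing import List, Tuple


def bank_of(row: int, col: int, num_banks: int) -> int:
    """Hypothetical layout: column `col` of every row lives in bank col % num_banks."""
    return col % num_banks


def window_reads(unit: int, k1: int, k2: int) -> List[Tuple[int, int]]:
    """(row, col) read by computing unit `unit` in cycles T, T+1, ..., row by row."""
    return [(i, unit * k2 + j) for i in range(k1) for j in range(k2)]


def read_schedule(num_units: int, k1: int, k2: int, num_banks: int) -> List[List[int]]:
    """schedule[t][u] is the bank that unit u reads in cycle T+t."""
    reads = [window_reads(u, k1, k2) for u in range(num_units)]
    return [
        [bank_of(reads[u][t][0], reads[u][t][1], num_banks) for u in range(num_units)]
        for t in range(k1 * k2)
    ]


def conflict_free(schedule: List[List[int]]) -> bool:
    """True if no two units read the same single-ported bank in the same cycle."""
    return all(len(set(banks)) == len(banks) for banks in schedule)


# Two computing units, 2*2 windows, four banks (assumed values).
schedule = read_schedule(num_units=2, k1=2, k2=2, num_banks=4)
ok = conflict_free(schedule)
```

  • Under this assumed layout, four banks keep the two units on disjoint banks in every cycle, whereas two banks would force both units onto the same bank, which a single read port per register group cannot serve.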
  • The function and structure of each of the plurality of computing units 220 are the same. For ease of understanding and description, one of the plurality of computing units 220, hereinafter referred to as the first computing unit 220, is taken as an example to describe the structure and function of the computing units in the pooling operation device provided by the embodiments of the present application. The description below of the first computing unit 220 applies to each of the plurality of computing units 220.
  • The first computing unit 220 is configured to: perform a first pooling operation on first data and second data of the plurality of data to obtain a first operation result; store the first operation result; acquire third data from a first register group of the plurality of register groups; and perform a second pooling operation on the first operation result and the third data.
  • The first data and the second data respectively represent the first pooling operand and the second pooling operand in the pooling window that the first computing unit is responsible for, and the third data represents the third pooling operand in that pooling window.
  • The first operation result is an intermediate result of the pooling operation corresponding to a pooling window, and this intermediate result also takes part in a subsequent operation (for example, the second pooling operation).
  • The first computing unit stores the intermediate result, so that subsequent operations can use it directly without reading data from an external register file. Compared with the prior art, in which the pooling operation is performed by an image processor, the pooling operation device provided by the present application can effectively improve data read/write efficiency, thereby improving pooling efficiency as a whole.
  • Each computing unit reads one pooling operand at a time and operates on two data at a time, so the pooling operation device provided by the embodiments of the present application is not affected by changes in the size of the pooling window. In other words, the pooling operation device provided by the embodiments of the present application can be applied to pooling operations with a pooling window of any size.
  • The pooling operation device provided by the embodiments of the present application can implement parallel pooling through multiple computing units and multiple register groups, which improves pooling efficiency. In addition, each computing unit can store the intermediate result of the pooling operation, which improves data read/write efficiency, so that pooling efficiency is improved as a whole and the pooling operation is accelerated as far as possible.
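  • The reason a unit that keeps an intermediate result is independent of the window size can be sketched as follows. `StreamingPoolUnit` and `run` are hypothetical names; the sketch assumes one operand arrives per clock cycle and that the stored intermediate result is combined with it by a two-input operation.

```python
from typing import Callable, Iterable, Optional


class StreamingPoolUnit:
    """Behavioural sketch: one operand per cycle, one stored intermediate result."""

    def __init__(self, combine: Callable[[float, float], float]) -> None:
        self.combine = combine                      # two-input operation (e.g. max or add)
        self.intermediate: Optional[float] = None   # result kept inside the unit
        self.count = 0                              # operands consumed so far

    def feed(self, operand: float) -> None:
        """Consume one pooling operand (one clock cycle)."""
        if self.intermediate is None:
            self.intermediate = operand
        else:
            # Only the stored result and the new operand are involved,
            # so no earlier operand has to be read again.
            self.intermediate = self.combine(self.intermediate, operand)
        self.count += 1


def run(unit: StreamingPoolUnit, operands: Iterable[float]) -> Optional[float]:
    """Feed all operands of one pooling window and return the final result."""
    for x in operands:
        unit.feed(x)
    return unit.intermediate


max_2x2 = run(StreamingPoolUnit(max), [9, 5, 10, 32])
sum_3x3 = run(StreamingPoolUnit(lambda a, b: a + b), range(1, 10))
```

  • The same unit handles a 2*2 window (four operands) and a 3*3 window (nine operands) without any structural change; only the number of cycles differs.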
  • The data in the plurality of register groups 210 is loaded from a dynamic memory 240.
  • The dynamic memory is, for example, a dynamic random access memory (DRAM).
  • The dynamic memory 240 may be located inside or outside the pooling operation device 200.
  • The number of computing units 220 is less than or equal to the number of register groups 210.
  • The pooling operation device 200 includes n computing units 220 and n register groups 210.
  • The connection relationship between the plurality of register groups 210 and the plurality of computing units 220 is that each of the plurality of computing units is connected to all of the plurality of register groups. It should be understood that this connection relationship enables each of the plurality of computing units to acquire data stored in any one of the plurality of register groups. This connection relationship can be referred to as a full connection.
  • Where the plurality of computing units are described below as fully connected to the plurality of register groups, this means that each of the plurality of computing units is connected to all of the register groups in the plurality of register groups.
  • The connection relationship between the plurality of register groups 210 and the plurality of computing units 220 may also be: each of the plurality of computing units is connected to a subset of the plurality of register groups.
  • The register groups to which different computing units are connected may be identical, completely different, or partially the same.
  • The first computing unit 220 includes a storage module and an operation module.
  • The operation module is configured to perform the first pooling operation on the first data and the second data acquired from the plurality of register groups to obtain the first operation result, and to store the first operation result in the storage module. The operation module is further configured to perform the second pooling operation on the first operation result stored in the storage module and the third data acquired from the plurality of register groups.
  • When the first pooling operation is a comparison operation, that is, the first data is compared with the second data, the operation module may include an adder or a comparator.
  • When the first pooling operation is an accumulation operation, that is, the first data and the second data are accumulated, the operation module includes an adder. It should be understood that the operation module further includes a multiplier for averaging the total accumulated result of all operands in the pooling window.
  • The first computing unit 220 includes:
  • a first data interface, configured to receive the first data acquired from the plurality of register groups; a second data interface, configured to receive the second data acquired from the plurality of register groups; a first storage module, configured to store the first data; a second storage module, configured to store the second data; an operation module, configured to compare the first data with the second data to obtain the first operation result and store the first operation result in a latch, where the comparison result is that the first data is greater than the second data; and the latch, configured to store the first operation result and, according to the first operation result, send a feedback signal to the first data interface and the second data interface, where the feedback signal instructs the first data interface to close and the second data interface to open, and the opened second data interface is configured to receive the third data acquired from the first register group. The operation module is further configured to perform the second pooling operation on the first operation result and the third data.
  • The first computing unit 220 includes a data interface 311, a data interface 312, a storage module 321, a storage module 322, an operation module 330, and a latch 340.
  • The data interface 311 is used to acquire data from the plurality of register groups.
  • The data interface 312 is used to acquire data from the plurality of register groups.
  • The storage module 321 is configured to store the data acquired by the data interface 311.
  • The storage module 322 is configured to store the data acquired by the data interface 312.
  • The operation module 330 is configured to acquire the first operand from the storage module 321, acquire the second operand from the storage module 322, compare the first operand with the second operand, and store the comparison result in the latch 340.
  • The latch 340 is configured to: when the comparison result is that the first operand is greater than or equal to the second operand, send a first feedback signal to the data interface 311 and the data interface 312; and when the comparison result is that the first operand is smaller than the second operand, send a second feedback signal to the data interface 311 and the data interface 312. The first feedback signal closes the data interface 311 and opens the data interface 312; the second feedback signal opens the data interface 311 and closes the data interface 312.
  • When the data interface 311 (or the data interface 312) is closed, it does not fetch data from the register groups; when it is open, it fetches data from the register groups.
  • At clock cycle T, the data interface 311 receives the first data from a register group, and the storage module 321 stores the first data. At clock cycle T+1, the data interface 312 receives the second data from a register group, the storage module 322 stores the second data, and the operation module 330 obtains the first operand (i.e., the first data) from the storage module 321 and the second operand (i.e., the second data) from the storage module 322, compares the first operand with the second operand, and stores the comparison result in the latch 340. When the comparison result is that the first operand is greater than or equal to the second operand, the latch 340 sends the first feedback signal to the data interface 311 and the data interface 312; when the comparison result is that the first operand is smaller than the second operand, the latch 340 sends the second feedback signal to the data interface 311 and the data interface 312.
  • Take as an example the case in which the first operand is greater than or equal to the second operand and the latch 340 sends the first feedback signal to the data interfaces 311 and 312. At clock cycle T+2, the data interface 311 is closed, and the data interface 312 is open and receives the third data from a register group. The storage module 322 stores the third data. The operation module 330 obtains the first operand (i.e., the larger value from the comparison at clock cycle T+1: the first data) from the storage module 321 and the second operand (i.e., the third data) from the storage module 322, and compares the first operand with the second operand. Assuming the comparison result is that the first operand is greater than or equal to the second operand, the comparison result is stored in the latch 340, and the latch 340 sends the first feedback signal to the data interface 311 and the data interface 312.
  • At clock cycle T+3, the data interface 311 is closed, and the data interface 312 is open and receives the fourth data from a register group. The storage module 322 stores the fourth data. The operation module 330 obtains the first operand (i.e., the larger value from the comparison at clock cycle T+2: the first data) from the storage module 321 and the second operand (i.e., the fourth data) from the storage module 322, and compares the first operand with the second operand. If the comparison result is that the first operand is greater than or equal to the second operand, the pooling result of the pooling operation is the first data.
  • Both the storage module 321 and the storage module 322 may be registers.
  • The data interface 311 and the data interface 312 may both be multiplexers.
  • The number of inputs of each multiplexer is equal to the number of register groups to which the first computing unit is connected.
  • The operation module 330 may be an adder, configured to subtract the second operand obtained from the storage module 322 from the first operand obtained from the storage module 321, and to store the subtraction result in the latch 340 as the comparison result.
  • The operation module 330 may also be a comparator, configured to compare the first operand obtained from the storage module 321 with the second operand obtained from the storage module 322, and to store the comparison result in the latch 340.
  • The computing unit of the first implementation is applicable to the scenario in which the pooling operation is maximum pooling. That is, the pooling operation device provided by the embodiments of the present application can be used to implement maximum pooling.
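  • The feedback mechanism of the maximum pooling unit can be traced with a small behavioural model. In the Python sketch below, the attribute names mirror the reference numerals 311, 312, 321, 322, and 340, but the class `MaxPoolUnit`, its methods, and the rule that routes the very first operand through interface 311 and the second through interface 312 are assumptions made for illustration; the model abstracts away timing and the multiplexer wiring.

```python
from typing import Optional


class MaxPoolUnit:
    """Behavioural sketch of a comparator-plus-latch maximum pooling unit."""

    def __init__(self) -> None:
        self.store_321: Optional[int] = None   # storage module 321
        self.store_322: Optional[int] = None   # storage module 322
        self.open_311 = True                   # data interface 311 open?
        self.open_312 = False                  # data interface 312 open?
        self.latch_340: Optional[bool] = None  # True if store_321 >= store_322

    def clock(self, data: int) -> None:
        """One clock cycle: the open interface receives `data`, then compare."""
        if self.open_311:
            self.store_321 = data
        else:
            self.store_322 = data
        if self.store_322 is None:
            # Only the first operand has arrived; the second goes through 312.
            self.open_311, self.open_312 = False, True
            return
        # Compare and latch the result.
        self.latch_340 = self.store_321 >= self.store_322
        # First feedback signal (latch True): close 311, open 312.
        # Second feedback signal (latch False): open 311, close 312.
        self.open_311, self.open_312 = (not self.latch_340), self.latch_340

    def result(self) -> Optional[int]:
        """Larger of the two stored values according to the latch."""
        if self.latch_340 is None:
            return self.store_321
        return self.store_321 if self.latch_340 else self.store_322


unit = MaxPoolUnit()
for value in [9, 5, 10, 32]:
    unit.clock(value)
pooled_max = unit.result()
```

  • In every cycle the interface attached to the smaller stored value is the one left open, so the next operand overwrites the smaller value and the larger one is retained as the intermediate result.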
  • The first computing unit 220 specifically includes:
  • a first data interface, configured to receive the first data from the plurality of register groups; a second data interface, configured to receive the second data from the plurality of register groups; a first storage module, configured to store the first data; a second storage module, configured to store the second data; and an adder, configured to accumulate the first data and the second data to obtain the first operation result. The second storage module is further configured to store the first operation result; the first data interface is further configured to acquire the third data from the first register group; and the adder is further configured to perform the second pooling operation on the first operation result and the third data.
  • The first computing unit 220 further includes a multiplier, configured to multiply the accumulated result of k1*k2 data by 1/(k1*k2) once the adder has obtained the accumulated result of the k1*k2 data, to obtain the average of the k1*k2 data.
  • Specifically, the first computing unit 220 includes a data interface 411, a data interface 412, a storage module 421, a storage module 422, an adder 430, and a multiplier 440.
  • The data interface 411 is used to acquire data from a register group.
  • The data interface 412 is used to acquire data from a register group.
  • The storage module 421 is configured to store data acquired by the data interface 411.
  • The storage module 422 is configured to store data acquired by the data interface 412.
  • The adder 430 is configured to obtain the first operand from the storage module 421, obtain the second operand from the storage module 422, accumulate the first operand and the second operand, and store the accumulated result in the storage module 422. After the first computing unit has read k1*k2 data from the register groups, the adder 430 sends the accumulated result to the multiplier 440, where k1*k2 is the size of the pooling window.
  • Once the storage module 422 holds the accumulated result, the data interface 412 is closed, after which only the data interface 411 is used to receive data from the register groups.
  • The multiplier 440 is configured to multiply the accumulated result sent by the adder 430 by 1/(k1*k2), thereby obtaining the pooling result of the current pooling operation.
  • Taking a pooling window of size 2*2 as an example: in clock cycle T, the data interface 411 receives the first data from a register group, and the storage module 421 stores the first data. In clock cycle T+1, the data interface 412 receives the second data from a register group, and the storage module 422 stores the second data; the adder 430 obtains the first operand (i.e., the first data) from the storage module 421 and the second operand (i.e., the second data) from the storage module 422, accumulates the two operands, and stores the accumulated result (denoted accumulated value 1) in the storage module 422, at which point the data interface 412 is closed. In clock cycle T+2, the data interface 412 remains closed and the data interface 411 is open and receives the third data from a register group; the storage module 421 stores the third data; the adder 430 obtains the first operand (i.e., the third data) from the storage module 421 and the second operand (i.e., accumulated value 1) from the storage module 422, accumulates them, and stores the result (denoted accumulated value 2) in the storage module 422.
  • In clock cycle T+3, the data interface 411 receives the fourth data from a register group, and the storage module 421 stores the fourth data; the adder 430 obtains the first operand (i.e., the fourth data) from the storage module 421 and the second operand (i.e., accumulated value 2) from the storage module 422, accumulates them, and sends the accumulated result (denoted accumulated value 3) to the multiplier 440, which multiplies accumulated value 3 by 1/4.
  • The multiplication result is the pooling result of this pooling operation.
  • The above description takes the case in which the storage module 422 stores the accumulated result of the adder 430 as an example.
  • Optionally, the storage module 421 may instead be configured to store the accumulated result of the adder 430 (in which case the data interface 411 is closed and the data interface 412 is open). This embodiment imposes no limitation on this.
  • Both the storage module 421 and the storage module 422 may be registers.
  • Optionally, the data interface 411 and the data interface 412 may both be multiplexers.
  • The number of inputs of each multiplexer is equal to the number of register groups to which the first computing unit is connected.
  • The computing unit of the second implementation is applicable to the scenario in which the pooling operation is average pooling. That is, the pooling operation device provided by the embodiments of the present application can be used to implement average pooling.
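The accumulate-then-scale behaviour of the second implementation can be sketched in the same style. The class `AvgPoolUnit` and the helper `avg_pool_window` are hypothetical names, and the model assumes the variant in which storage module 422 keeps the running sum.

```python
class AvgPoolUnit:
    """Behavioural sketch of the average-pooling computing unit (hypothetical names)."""

    def __init__(self, k1, k2):
        self.window = k1 * k2   # number of operands in one pooling window
        self.store_a = None     # models storage module 421 (latest operand)
        self.store_b = None     # models storage module 422 (running sum)
        self.count = 0

    def step(self, value):
        """Feed one operand per clock cycle; return the average once the window is complete."""
        self.count += 1
        if self.count == 1:
            self.store_a = value      # received through interface 411
            return None
        if self.count == 2:
            self.store_b = value      # received through interface 412, which then closes
        else:
            self.store_a = value      # only interface 411 remains in use
        acc = self.store_a + self.store_b          # models the adder 430
        if self.count == self.window:
            return acc * (1.0 / self.window)       # models the multiplier 440
        self.store_b = acc                         # intermediate result stays inside the unit
        return None


def avg_pool_window(operands, k1, k2):
    unit = AvgPoolUnit(k1, k2)
    result = None
    for value in operands:
        result = unit.step(value)
    return result
```

For a 2*2 window the intermediate sums correspond to accumulated values 1, 2, and 3 of the walk-through, and only the last one is passed to the multiplier.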
  • The first computing unit 220 includes a data interface 511, a data interface 512, a storage module 521, a storage module 522, an adder 530, a multiplier 540, and a latch 550.
  • the data interface 511 is used to acquire data from a register set.
  • Data interface 512 is used to acquire data from a register bank.
  • the storage module 521 is configured to store data acquired by the data interface 511.
  • the storage module 522 is configured to store data acquired by the data interface 512.
  • In the maximum pooling state, the adder 530 is configured to obtain the first operand from the storage module 521, obtain the second operand from the storage module 522, compare the first operand with the second operand, and store the comparison result in the latch 550.
  • The latch 550 is configured to send a first feedback signal to the data interface 511 and the data interface 512 when the comparison result is that the first operand is greater than or equal to the second operand, and to send a second feedback signal to the data interface 511 and the data interface 512 when the comparison result is that the first operand is smaller than the second operand. The first feedback signal closes the data interface 511 and opens the data interface 512; the second feedback signal opens the data interface 511 and closes the data interface 512.
  • When the data interface 511 (or the data interface 512) is closed, it does not fetch data from the register groups; when it is open, it fetches data from the register groups.
  • In the average pooling state, the adder 530 is configured to obtain the first operand from the storage module 521, obtain the second operand from the storage module 522, accumulate the first operand and the second operand, and store the accumulated result in the storage module 522. After the first computing unit has read k1*k2 data from the register groups, the adder 530 sends the accumulated result to the multiplier 540, where k1*k2 is the size of the pooling window.
  • In this state, once the storage module 522 holds the accumulated result, the data interface 512 is closed, after which only the data interface 511 is used to receive data from the register groups.
  • The multiplier 540 is configured to multiply the accumulated result sent by the adder 530 by 1/(k1*k2), thereby obtaining the pooling result of the current pooling operation.
  • The first computing unit of the third implementation supports two states: one for maximum pooling (as shown in FIG. 5) and one for average pooling (as shown in FIG. 6).
  • Optionally, the pooling operation device further includes a control unit 230, configured to send a control signal to the plurality of computing units, the control signal indicating whether the pooling operation is maximum pooling or average pooling.
  • The first computing unit 220 is further configured to receive the control signal. When the control signal indicates that the pooling operation is maximum pooling, the first computing unit 220 performs a maximum pooling operation on the first data and the second data; when the control signal indicates that the pooling operation is average pooling, it performs an average pooling operation on the first data and the second data.
  • Specifically, when the control signal indicates that the pooling operation is maximum pooling, the first computing unit 220 switches to the state shown in FIG. 5; when the control signal indicates that the pooling operation is average pooling, the first computing unit 220 switches to the state shown in FIG. 6.
  • control unit 230 may also be located outside the pooling computing device 200 provided by the present application, which is not limited in this application.
  • Both storage module 521 and storage module 522 can be registers.
  • the data interface 511 and the data interface 512 may both be multiplexers.
  • the number of inputs of the multiplexer is equal to the number of register banks to which the first computing unit 220 is connected.
  • The computing unit of the third implementation is applicable to both the average pooling scenario and the maximum pooling scenario. That is, the pooling operation device provided in the embodiments of the present application can handle both average pooling and maximum pooling, which improves hardware utilization and reduces hardware cost.
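How one adder can serve both states may be easier to see in a compact sketch. The names `Mode` and `pool_window` are hypothetical; the model assumes, as in the adder variant described for the first implementation, that comparison is done by subtraction and that the sign of the difference plays the role of the latched comparison result.

```python
from enum import Enum


class Mode(Enum):
    """Models the control signal sent by the control unit (hypothetical encoding)."""
    MAX = "max"
    AVG = "avg"


def pool_window(operands, mode):
    """Process one pooling window operand by operand in the selected state."""
    kept = operands[0]                 # value currently held inside the unit
    for x in operands[1:]:
        if mode is Mode.MAX:
            diff = kept - x            # adder used as a subtractor
            kept = kept if diff >= 0 else x
        else:
            kept = kept + x            # adder used as an accumulator
    if mode is Mode.AVG:
        return kept * (1.0 / len(operands))   # multiplier is used only in the AVG state
    return kept
```

In both states only one new operand is consumed per step, which is why the same data path works for any window size.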
  • The following describes how the plurality of register groups store data.
  • For ease of description, the case in which the plurality of computing units are fully connected to the plurality of register groups is taken as an example.
  • With reasonable adaptation, the embodiments described below also apply to the scenario in which each computing unit is connected to only some of the register groups, and such variants also fall within the scope of the present application.
  • The plurality of register groups 210 (banks) are taken as an example for description.
  • The data to be pooled are stored in the plurality of register groups such that, in each read (i.e., each clock cycle), the data read by different computing units are located in different register groups. This guarantees that different computing units never read data from the same register group in the same clock cycle.
  • Assume that the plurality of computing units are n computing units, the plurality of register groups are n banks, the size of the pooling window is k*k, and the size of the image to be pooled is m*m, where k, n, and m are each integers greater than 1, and m = n*k.
  • The data to be pooled (k rows of the image) are stored in the n banks as follows: for the jth row of the k rows, the data in columns 1, k+1, 2k+1, ..., (n-1)*k+1 are stored in different register groups of the n register groups; the data in columns 2, k+2, 2k+2, ..., (n-1)*k+2 are stored in different register groups of the n register groups; ...; and the data in columns k, k+k, 2k+k, ..., (n-1)*k+k are stored in different register groups of the n register groups, where j = 1, 2, ..., k.
  • FIG. 7 is a schematic diagram of how specific data to be pooled are stored in a plurality of banks.
  • In this example k is 2 and n is 9; that is, the size of the pooling window is 2*2 and there are 9 computing units and 9 banks, as shown in FIG. 7.
  • Each bank includes a plurality of registers (FIG. 7 schematically shows five registers in each bank), so the registers of the nine banks form a register array with nine rows (FIG. 7 schematically shows a register array of 9 rows and 5 columns).
  • The size of the image to be processed is 18*18, and the data to be pooled are the data in the first and second rows of the image, as shown in FIG. 7, where four data items drawn with the same pattern belong to the same pooling window.
  • The data in the first row of the image are loaded starting from the first row of column r1 of the register array until two register columns (columns r1 and r2 in FIG. 7) are filled, which completes the loading of the first row of data.
  • The data in the second row of the image are loaded starting from the first row of column (r1+2) (i.e., column r3) of the register array until two register columns (columns r3 and r4) are filled, which completes the loading of the second row of data.
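The loading order of the FIG. 7 example can be written down as a small mapping. This is a sketch that assumes each image row is written bank by bank down one register column and then continues in the next column; the function `load_rows` and the use of the labels 1 to 36 for the data items are illustrative choices, not part of the application.

```python
def load_rows(rows, n_banks=9):
    """Return {(bank, register_column): value} with 1-based bank and column indices."""
    layout = {}
    first_col = 1                       # register column where the current image row starts
    for row in rows:
        for i, value in enumerate(row):
            bank = i % n_banks + 1              # consecutive elements go to consecutive banks
            reg_col = first_col + i // n_banks  # wrap into the next register column
            layout[(bank, reg_col)] = value
        first_col += -(-len(row) // n_banks)    # ceiling division: columns used by this row
    return layout


# Rows 1 and 2 of the 18*18 example image, labelled 1..18 and 19..36.
row1 = list(range(1, 19))
row2 = list(range(19, 37))
layout = load_rows([row1, row2])
```

With this layout, data "11" sits in Bank2 (column r2) and data "19" in Bank1 (column r3), which is what the read-out description of the example relies on.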
  • FIG. 8, FIG. 9, FIG. 10, and FIG. 11 show the flow in which the nine computing units read data from the nine banks.
  • In clock cycle T, the data in the dashed boxes of the nine banks are read out simultaneously, as shown in FIG. 8. These data are the data in columns 1, 3, 5, ..., 17 of the first row of the image, that is, the first pooling operand of each pooling window.
  • Specifically, computing unit 1 reads data "1" from Bank1, computing unit 2 reads data "3" from Bank3, computing unit 3 reads data "5" from Bank5, computing unit 4 reads data "7" from Bank7, computing unit 5 reads data "9" from Bank9, computing unit 6 reads data "11" from Bank2, computing unit 7 reads data "13" from Bank4, computing unit 8 reads data "15" from Bank6, and computing unit 9 reads data "17" from Bank8.
  • In clock cycle T+1, the data in the dashed boxes of the nine banks are read out simultaneously, as shown in FIG. 9. These data are the data in columns 2, 4, 6, ..., 18 of the first row of the image, that is, the second pooling operand of each pooling window.
  • Specifically, computing unit 1 reads data "2" from Bank2, computing unit 2 reads data "4" from Bank4, computing unit 3 reads data "6" from Bank6, computing unit 4 reads data "8" from Bank8, computing unit 5 reads data "10" from Bank1, computing unit 6 reads data "12" from Bank3, computing unit 7 reads data "14" from Bank5, computing unit 8 reads data "16" from Bank7, and computing unit 9 reads data "18" from Bank9.
  • In clock cycle T+2, the data in the dashed boxes of the nine banks are read out simultaneously, as shown in FIG. 10. These data are the data in columns 1, 3, 5, ..., 17 of the second row of the image, that is, the third pooling operand of each pooling window.
  • Specifically, computing unit 1 reads data "19" from Bank1, computing unit 2 reads data "21" from Bank3, computing unit 3 reads data "23" from Bank5, computing unit 4 reads data "25" from Bank7, computing unit 5 reads data "27" from Bank9, computing unit 6 reads data "29" from Bank2, computing unit 7 reads data "31" from Bank4, computing unit 8 reads data "33" from Bank6, and computing unit 9 reads data "35" from Bank8.
  • In clock cycle T+3, the data in the dashed boxes of the nine banks are read out simultaneously, as shown in FIG. 11. These data are the data in columns 2, 4, 6, ..., 18 of the second row of the image, that is, the fourth pooling operand of each pooling window.
  • Specifically, computing unit 1 reads data "20" from Bank2, computing unit 2 reads data "22" from Bank4, computing unit 3 reads data "24" from Bank6, computing unit 4 reads data "26" from Bank8, computing unit 5 reads data "28" from Bank1, computing unit 6 reads data "30" from Bank3, computing unit 7 reads data "32" from Bank5, computing unit 8 reads data "34" from Bank7, and computing unit 9 reads data "36" from Bank9. At this point, the 9 computing units can simultaneously output 9 pooling results.
  • It can be seen from the descriptions of FIG. 8, FIG. 9, FIG. 10, and FIG. 11 that, in every clock cycle, the data read by different computing units are located in different banks. In addition, the banks from which the same computing unit reads data in different clock cycles may also differ.
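The conflict-free property of this example can be checked mechanically. The sketch below assumes the layout in which consecutive elements of an image row go to consecutive banks; `read_schedule` is a hypothetical helper that lists, for each of the k*k clock cycles, which bank every computing unit reads and which data label it obtains.

```python
def read_schedule(n=9, k=2):
    """Return one list per clock cycle of (unit, bank, label) triples, all 1-based."""
    m = n * k                                  # image width when m = n*k
    schedule = []
    for cycle in range(k * k):
        r, c = divmod(cycle, k)                # row and column offset inside the window
        reads = []
        for unit in range(n):
            col = unit * k + c                 # 0-based image column read by this unit
            label = r * m + col + 1            # data labels 1..36 as used in the example
            bank = col % n + 1                 # assumed layout: column index modulo n
            reads.append((unit + 1, bank, label))
        schedule.append(reads)
    return schedule


schedule = read_schedule()
# In every cycle the nine units touch nine distinct banks.
conflict_free = all(len({bank for _, bank, _ in reads}) == 9 for reads in schedule)
```

Under this assumed modulo layout the reads of one cycle hit distinct banks exactly when k and n share no common factor, which holds for k = 2 and n = 9; other combinations would need a different placement.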
  • The structure of the computing unit in the above embodiment may be the structure shown in FIG. 3, the structure shown in FIG. 4, or the structure shown in FIG. 5, which is not limited in this application.
  • Taking maximum pooling as an example, the nine computing units shown in FIG. 7 are all switched to the state shown in FIG. 5.
  • The following takes computing unit 1 as an example; the description also applies to computing units 2 to 9 and, for brevity, is not repeated.
  • In clock cycle T, as shown in FIG. 8, the data interface 511 of computing unit 1 receives data "1" from Bank1, and the storage module 521 stores data "1".
  • In clock cycle T+1, as shown in FIG. 9, the data interface 512 receives data "2" from Bank2, and the storage module 522 stores data "2". The adder 530 obtains the first operand "1" from the storage module 521 and the second operand "2" from the storage module 522, compares the two operands, and stores the comparison result (the first operand is smaller than the second operand) in the latch 550. The latch 550 sends the second feedback signal to the data interface 511 and the data interface 512, which opens the data interface 511 and closes the data interface 512.
  • In clock cycle T+2, as shown in FIG. 10, the data interface 511 receives data "19" from Bank1, and the storage module 521 stores data "19". The adder 530 obtains the first operand "19" from the storage module 521 and the second operand "2" from the storage module 522 (i.e., the larger value determined in clock cycle T+1), compares the two operands, and stores the comparison result (the first operand is greater than the second operand) in the latch 550. The latch 550 sends the first feedback signal to the data interface 511 and the data interface 512, which closes the data interface 511 and opens the data interface 512.
  • In clock cycle T+3, as shown in FIG. 11, the data interface 512 receives data "20" from Bank2, and the storage module 522 stores data "20". The adder 530 obtains the first operand "19" from the storage module 521 (i.e., the larger value determined in clock cycle T+2) and the second operand "20" from the storage module 522, and compares the two operands; the comparison result is that the first operand is smaller than the second operand.
  • The pooling result of this pooling operation is therefore "20".
  • Taking average pooling as an example, the nine computing units shown in FIG. 7 are all switched to the state shown in FIG. 6.
  • The flow in which the computing units read data from the register groups is the same as described above for maximum pooling; the only difference is that data are processed differently in the state shown in FIG. 6 than in the state shown in FIG. 5.
  • For details, see the description above given in conjunction with FIG. 5; for brevity, they are not repeated here.
  • Optionally, the pooling results obtained by the computing units may be written back to the plurality of register groups. For example, pooling results belonging to the same row of the original image are written to different register groups.
  • The pooling operation device performs pooling in parallel using a plurality of computing units and a plurality of register groups, and different computing units read data from different register groups without blocking one another, which improves the efficiency of the pooling operation.
  • In addition, each computing unit includes a storage module for storing intermediate calculation results, which improves data read/write efficiency and thereby improves the efficiency of the pooling operation as a whole.
  • FIG. 7 and FIG. 8 to FIG. 11 are merely examples and are not limiting. On this basis, corresponding processing methods for other application scenarios can be derived by adaptation, and such solutions also fall within the protection scope of the present application.
  • The above description takes m = n*k as an example. In practical applications, m > n*k or m < n*k may occur.
  • For these cases, the embodiments of the present application provide the following solution.
  • When the n computing units are not fully loaded, part of the data in other rows of the image is also stored in the n register groups, so that the n computing units can read these data for parallel pooling.
  • Take n = 9, k = 2, and m > n*k as an example.
  • The data of rows 1 to 4 of the image are shown in FIG. 13, in which one pattern represents the data of rows 1 and 2 and the other pattern represents the data of rows 3 and 4.
  • First, the first 9*2 columns of data in rows 1 and 2 of the image are pooled. After 2*2 clock cycles, the last (m-2*9) columns of data in rows 1 and 2 are stored in the nine register groups (banks), and the first x columns of data in rows 3 and 4 are also stored in the nine register groups, with the storage positions of these x columns spliced onto the positions of the last (m-2*9) columns of rows 1 and 2 (see the accompanying figure).
  • With the data spliced in this way, the nine computing units can be fully loaded in the next 2*2 clock cycles.
  • When m < n*k, the processing is similar.
  • The data of rows 1 and 2 of the image are stored in the nine register groups together with the first y columns of data of rows 3 and 4, so that the nine computing units can run at full load for the next k*k clock cycles.
  • The way data are stored in the plurality of register groups in this embodiment may be referred to as a splicing method.
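The effect of splicing on unit utilization can be illustrated at the level of pooling windows. This sketch only models which windows are processed together in one group of k*k clock cycles; it does not model register positions, it assumes m is divisible by k, and the function `window_batches` and the width m = 20 are made-up illustrations.

```python
def window_batches(m, n, k, bands):
    """Group the pooling windows of `bands` consecutive k-row bands into batches of n.

    Each window is identified as (band_index, window_index_within_band), 0-based.
    """
    per_band = m // k                     # windows per band of k image rows
    windows = [(band, w) for band in range(bands) for w in range(per_band)]
    return [windows[i:i + n] for i in range(0, len(windows), n)]


# Hypothetical width m = 20 > n*k = 18: rows 1-2 hold 10 windows, one more than 9 units.
batches = window_batches(m=20, n=9, k=2, bands=2)
```

Here the second batch consists of the one leftover window of rows 1 and 2 followed by the first eight windows of rows 3 and 4, so all nine units stay busy instead of one unit working alone.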
  • With the splicing method, the pooling operation device provided by the present application can implement parallel pooling for images and pooling windows of different sizes, that is, it accelerates the pooling operation.
  • In summary, the pooling operation device provided in the embodiments of the present application implements parallel pooling through multiple computing units and multiple register groups, which improves pooling efficiency. In addition, each computing unit can store the intermediate calculation results of the pooling operation, which improves data read/write efficiency and thus overall pooling efficiency, so that the pooling operation is accelerated as much as possible.
  • The description above, in which the pooling operation device includes nine computing units and nine register groups, is merely an example and not a limitation. In practical applications, the numbers of computing units and register groups in the pooling operation device can be designed according to actual needs.
  • The pooling operation device provided by the embodiments of the present application may take the form of a chip.
  • The embodiments of the present application further provide a computer device, including a memory and the pooling operation device provided by the above embodiments, where the memory is configured to store the data on which the pooling operation device is to perform the pooling operation.
  • The disclosed systems, devices, and methods may be implemented in other manners.
  • The device embodiments described above are merely illustrative.
  • For example, the division into units is only a division by logical function.
  • In actual implementation there may be other divisions; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device, or unit, and may be electrical, mechanical, or in another form.
  • The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.
  • If the functions are implemented in the form of a software functional unit and sold or used as a standalone product, they may be stored in a computer-readable storage medium.
  • Based on this understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium and including several instructions.
  • The instructions are used to cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application.
  • The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.


Abstract

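The present application provides a pooling operation device, including: a plurality of register groups, configured to store a plurality of data; and a plurality of computing units, configured to perform a pooling operation on the plurality of data, where the data operated on by different computing units are located in different register groups of the plurality of register groups. A first computing unit of the plurality of computing units is configured to: perform a first pooling operation on first data and second data of the plurality of data to obtain a first operation result; store the first operation result; obtain third data from a first register group of the plurality of register groups; and perform a second pooling operation on the first operation result and the third data. The present application can implement parallel pooling, and the computing units can store intermediate calculation results, which improves data read/write efficiency and therefore pooling efficiency.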

Description

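Pooling operation device
Technical Field
The present application relates to the field of neural networks, and in particular to a pooling operation device.
Background
Convolutional neural networks are commonly applied to image recognition. A convolutional neural network generally includes convolutional layers, pooling layers, and fully connected layers. As an important tool for dimensionality reduction, a pooling layer is generally placed after a convolutional layer. The computation of a pooling layer is called a pooling operation. In a pooling operation, a window of fixed size slides over the entire image plane, and at each position an operation is performed on the data covered by the window, for example taking the maximum or the average as the output. This window may be called a pooling window; its size is typically k1*k2, where k1 and k2 are each integers not less than 2. Pooling includes maximum pooling and average pooling.
At present, some prior art performs pooling with a general-purpose image processor; specifically, the pooling operation is implemented under the control of general-purpose instructions. The drawback of this approach is that each operation result must be written back to the register file and read out of the register file again when the next operation needs it, which causes repeated reads and writes and reduces data read/write efficiency.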
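Summary
The present application provides a pooling operation device that can avoid repeated reads and writes to some extent and thus effectively improve data read/write efficiency.
In a first aspect, a pooling operation device is provided, including: a plurality of register groups, configured to store a plurality of data; and a plurality of computing units, configured to perform a pooling operation on the plurality of data, where the data operated on by different computing units are located in different register groups of the plurality of register groups. A first computing unit of the plurality of computing units is configured to: perform a first pooling operation on first data and second data of the plurality of data to obtain a first operation result; store the first operation result; obtain third data from a first register group of the plurality of register groups; and perform a second pooling operation on the first operation result and the third data.
In the same clock cycle, the data obtained by different computing units are located in different register groups of the plurality of register groups, which effectively avoids data read conflicts, so that parallel pooling can be better implemented and the efficiency of the pooling operation improved.
It should be understood that the first operation result is an intermediate calculation result of the pooling operation corresponding to one pooling window, and this intermediate result also takes part in subsequent operations (for example, the second pooling operation). Because the first computing unit stores the intermediate result, it can use the result directly in subsequent operations without reading data from an external register file. Compared with the prior art, in which pooling is performed by an image processor, the pooling operation device provided by the present application can effectively improve data read/write efficiency and thus improve pooling efficiency as a whole.
In addition, each computing unit reads one pooling operand at a time and operates on two data at a time, so the pooling operation device provided by the present application is not affected by changes in the size of the pooling window. In other words, the pooling operation device provided by the embodiments of the present application is applicable to pooling operations with pooling windows of any size.
Therefore, the pooling operation device provided by the present application implements parallel pooling through the plurality of computing units and the plurality of register groups, which improves pooling efficiency. In addition, because each computing unit can store the intermediate calculation results of the pooling operation, data read/write efficiency is improved, which in turn improves overall pooling efficiency, so that the pooling operation is accelerated as much as possible.
For a given computing unit, the data it obtains in different clock cycles may be located in different register groups or in the same register group.
Across different clock cycles, the data obtained by different computing units may be located in different register groups or in the same register group.
Each of the plurality of register groups has one read port. That is, in each clock cycle, one data item can be read from a register group.
Optionally, the number of computing units is less than or equal to the number of register groups.
With reference to the first aspect, in a possible implementation of the first aspect, any one of the plurality of computing units can read the data in any one of the plurality of register groups.
Specifically, the connection relationship between the plurality of register groups and the plurality of computing units is that each computing unit is connected to all of the register groups. This connection relationship may be called full connection.
Optionally, the connection relationship between the plurality of register groups and the plurality of computing units may also be that each computing unit is connected to some of the register groups.
With reference to the first aspect, in a possible implementation of the first aspect, the first computing unit includes a storage module and an operation module. The operation module is configured to perform the first pooling operation on the first data and the second data obtained from the plurality of register groups, obtain the first operation result, and store the first operation result in the storage module. The operation module is further configured to perform the second pooling operation on the first operation result stored in the storage module and the third data obtained from the plurality of register groups.
In the scenario in which the pooling operation is maximum pooling, the first pooling operation is a comparison, that is, the first data is compared with the second data; accordingly, the operation module may include an adder or a comparator. In the scenario in which the pooling operation is average pooling, the first pooling operation is an accumulation, that is, the first data and the second data are accumulated; accordingly, the operation module includes an adder. It should be understood that the operation module then also includes a multiplier, used to average the total accumulated result of all operands in the pooling window.
The pooling operation device provided by the present application implements the operations involved in pooling in hardware rather than under instruction control.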
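With reference to the first aspect, in a possible implementation of the first aspect, the first pooling operation includes a maximum pooling operation, and the first computing unit includes: a first data interface, configured to receive the first data obtained from the plurality of register groups; a second data interface, configured to receive the second data obtained from the plurality of register groups; a first storage module, configured to store the first data; a second storage module, configured to store the second data; an operation module, configured to compare the first data with the second data to obtain the first operation result and store the first operation result in a latch, the comparison result being that the first data is greater than the second data; and the latch, configured to store the first operation result and, according to the first operation result, send a feedback signal to the first data interface and the second data interface, the feedback signal instructing the first data interface to close and the second data interface to open, where the opened second data interface is configured to receive the third data obtained from the first register group. The operation module is further configured to perform the second pooling operation on the first operation result and the third data.
The pooling operation device provided by the present application can be used to implement maximum pooling.
With reference to the first aspect, in a possible implementation of the first aspect, the first pooling operation includes an average pooling operation, and the first computing unit specifically includes: a first data interface, configured to receive the first data from the plurality of register groups; a second data interface, configured to receive the second data from the plurality of register groups; a first storage module, configured to store the first data; a second storage module, configured to store the second data; and an adder, configured to accumulate the first data and the second data to obtain the first operation result. The second storage module is further configured to store the first operation result; the first data interface is further configured to obtain the third data from the first register group; and the adder is further configured to perform the second pooling operation on the first operation result and the third data.
The first computing unit further includes a multiplier, configured to, when the adder obtains the accumulated result of k1*k2 data, multiply that accumulated result by 1/(k1*k2) to obtain the average of the k1*k2 data, where k1*k2 is the size of the pooling window corresponding to the pooling operation, and k1 and k2 are each integers not less than 2.
The pooling operation device provided by the present application can be used to implement average pooling.
With reference to the first aspect, in a possible implementation of the first aspect, the pooling operation device further includes a control unit, configured to send a control signal to the plurality of computing units, the control signal indicating whether the pooling operation is maximum pooling or average pooling. The first computing unit is further configured to receive the control signal. In performing the first pooling operation on the first data and the second data of the plurality of data, the first computing unit is specifically configured to: perform a maximum pooling operation on the first data and the second data when the control signal indicates that the pooling operation is maximum pooling; and perform an average pooling operation on the first data and the second data when the control signal indicates that the pooling operation is average pooling.
The pooling operation device provided by the present application can handle both average pooling and maximum pooling, which improves hardware utilization and reduces hardware cost.
Therefore, the pooling operation device provided by the present application implements parallel pooling through the plurality of computing units and the plurality of register groups, which improves pooling efficiency. In addition, because each computing unit can store the intermediate calculation results of the pooling operation, data read/write efficiency is improved, which in turn improves overall pooling efficiency, so that the pooling operation is accelerated as much as possible.
In a second aspect, a computer device is provided, including a memory and the pooling operation device provided in the first aspect, where the memory is configured to store the data on which the pooling operation device is to perform the pooling operation.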
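Brief Description of the Drawings
FIG. 1 is a schematic diagram of a pooling operation.
FIG. 2 is a schematic block diagram of a pooling operation device according to an embodiment of the present application.
FIG. 3, FIG. 4, FIG. 5, and FIG. 6 are schematic structural diagrams of a computing unit in the embodiments of the present application.
FIG. 7 is a schematic diagram of a plurality of register groups storing data in an embodiment of the present application.
FIG. 8 to FIG. 11 are schematic diagrams of computing units reading data from register groups in an embodiment of the present application.
FIG. 12 and FIG. 13 are further schematic diagrams of a plurality of register groups storing data in an embodiment of the present application.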
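Detailed Description
The technical solutions of the present application are described below with reference to the accompanying drawings.
To facilitate understanding of the solutions provided by the embodiments of the present application, the concept of pooling is first described with reference to FIG. 1.
Pooling refers to the computation of a pooling layer in a neural network. In a pooling operation, a window of fixed size slides over the entire image plane, and at each position an operation is performed on the data covered by the window, taking the maximum or the average as the output. This window may be called a pooling window. The size of the pooling window may be k1*k2, where k1 and k2 are each integers not less than 2; k1 and k2 may be equal or different, which is not limited in the embodiments of the present invention.
FIG. 1 is a schematic diagram of a pooling operation. The size of the input image (i.e., the image to be pooled) is 4*4, and the size of the pooling window is 2*2. In the pooling operation, a 2*2 pooling window slides over the 4*4 image with a stride of 2; the 4 data covered by each pooling window yield one output result, and all output results form the output image. As shown in FIG. 1, the size of the output image is 2*2.
The image data in the output image shown in FIG. 1 are obtained by the following formula:
o1=op{d1,d2,d3,d4},
where d1-d4 denote image data (i.e., pixel values) in the input image, and o1 denotes image data (i.e., a pixel value) in the output image.
The operator op may take the maximum (max) or the average (avg). When op takes the maximum (max), the corresponding pooling operation is called maximum pooling. When op takes the average (avg), the corresponding pooling operation is called average pooling.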
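The formula o1=op{d1,d2,d3,d4} can be illustrated with a short software reference model. This is only a sketch for checking results, not the hardware described in this application: the function names `pool2d` and `avg` are hypothetical, the stride is assumed to equal the window size, and the bottom two rows of the sample image are made-up values (the top two rows follow the operands 9, 5, 10, 32 and 5, 3, 2, 2 quoted for FIG. 1).

```python
def pool2d(image, k1, k2, op):
    """Apply `op` to every non-overlapping k1*k2 window of a 2-D list."""
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(0, rows - k1 + 1, k1):
        out_row = []
        for c in range(0, cols - k2 + 1, k2):
            # Gather the k1*k2 pooling operands covered by this window.
            window = [image[r + i][c + j] for i in range(k1) for j in range(k2)]
            out_row.append(op(window))
        out.append(out_row)
    return out


def avg(window):
    return sum(window) / len(window)


image = [
    [9, 5, 5, 3],
    [10, 32, 2, 2],
    [1, 2, 3, 4],     # made-up values for illustration
    [5, 6, 7, 8],     # made-up values for illustration
]
max_pooled = pool2d(image, 2, 2, max)
avg_pooled = pool2d(image, 2, 2, avg)
```

A 4*4 input with a 2*2 window gives a 2*2 output in both cases, matching the sizes stated for FIG. 1.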
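The data of the input image covered by one pooling window may be called pooling operands. For example, if the size of the pooling window is k1*k2, one pooling operation involves k1*k2 pooling operands.
The pooling operation referred to herein may be an average pooling operation or a maximum pooling operation.
For ease of understanding and description, some embodiments herein take a pooling window of size k*k (i.e., k1=k2=k) as an example, where k is an integer not less than 2; this does not limit the present application. In practical applications, k1 and k2 may be equal or different, which is not limited here.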
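FIG. 2 is a schematic block diagram of a pooling operation device 200 according to an embodiment of the present application. As shown in FIG. 2, the device 200 includes a plurality of register groups 210 and a plurality of computing units 220.
The plurality of register groups 210 are configured to store a plurality of data.
Specifically, the plurality of data stored in the plurality of register groups are the data to be pooled. For example, in the scenario shown in FIG. 1, the plurality of data are the data in rows 1 and 2 of the input image. It can be understood that each register group includes a plurality of registers.
Specifically, each register group 210 has one read port. In other words, one data item can be read from a register group at a time. As an example, the register groups in this embodiment may be called banks, and the plurality of register groups are a plurality of banks.
The plurality of computing units 220 are configured to perform a pooling operation on the plurality of data, where the data operated on by different computing units are located in different register groups of the plurality of register groups.
Specifically, the plurality of computing units 220 are configured to pool the data of different pooling windows in parallel.
Taking the scenario shown in FIG. 1 as an example, assume that the plurality of computing units 220 include 2 computing units (for example, computing unit 1 and computing unit 2), that the plurality of register groups 210 store the data in rows 1 and 2 of the input image shown in FIG. 1, and that the size of the pooling window is 2*2. The two rows of data can then be divided into two pooling windows, where the data of the first pooling window are 9, 5, 10, and 32, and the data of the second pooling window are 5, 3, 2, and 2. Computing unit 1 and computing unit 2 can each pool the data of one of these windows; for example, computing unit 1 pools the data of the first pooling window and computing unit 2 pools the data of the second pooling window. Specifically, in clock cycle T, computing unit 1 reads pooling operand 9 from a register group and computing unit 2 reads pooling operand 5 from a register group; in the next clock cycle T+1, computing unit 1 reads pooling operand 5 and computing unit 2 reads pooling operand 3; in the following clock cycle T+2, computing unit 1 reads pooling operand 10 and computing unit 2 reads pooling operand 2; in the following clock cycle T+3, computing unit 1 reads pooling operand 32 and computing unit 2 reads pooling operand 2. It should be understood that after clock cycle T+3, computing unit 1 and computing unit 2 can obtain the pooling results of the two pooling windows at the same time.
As can be seen from the above, the pooling operation device provided by the embodiments of the present application can implement parallel pooling, which effectively improves the efficiency of the pooling operation.
It should be noted that, to reduce the time the computing units spend reading data from the register groups and to improve computing efficiency, in the embodiments of the present invention the data obtained by different computing units in the same clock cycle are located in different register groups of the plurality of register groups.
For example, in the above example of FIG. 1, in clock cycle T, computing unit 1 reads pooling operand 9 and computing unit 2 reads pooling operand 5, where pooling operands 9 and 5 are located in different register groups; in the next clock cycle T+1, computing unit 1 reads pooling operand 5 and computing unit 2 reads pooling operand 3, where pooling operands 5 and 3 are located in different register groups; in the following clock cycle T+2, computing unit 1 reads pooling operand 10 and computing unit 2 reads pooling operand 2, where pooling operands 10 and 2 are located in different register groups; in the following clock cycle T+3, computing unit 1 reads pooling operand 32 and computing unit 2 reads pooling operand 2, where pooling operands 32 and 2 are located in different register groups.
It should be understood that, because the data obtained by different computing units in the same clock cycle are located in different register groups, data read conflicts are effectively avoided, so that parallel pooling can be better implemented and the efficiency of the pooling operation improved.
It should be noted that the data obtained by the same computing unit in different clock cycles may be located in different register groups or in the same register group, which is not limited in the embodiments of the present application.
All of the computing units 220 have the same function and structure. For ease of understanding and description, the structure and function of the computing units in the pooling operation device provided by the embodiments of the present application are described below using one of the computing units 220 (denoted the first computing unit 220) as an example. The following description of the first computing unit 220 applies to each of the computing units 220.
The first computing unit 220 is configured to: perform a first pooling operation on first data and second data of the plurality of data to obtain a first operation result; store the first operation result; obtain third data from a first register group of the plurality of register groups; and perform a second pooling operation on the first operation result and the third data.
The first data and the second data are, respectively, the first and second pooling operands of a pooling window that the first computing unit is responsible for processing, and the third data is the third pooling operand of that pooling window.
It should be understood that the first operation result is an intermediate calculation result of the pooling operation corresponding to one pooling window, and this intermediate result also takes part in subsequent operations (for example, the second pooling operation). Because the first computing unit stores the intermediate result, it can use the result directly in subsequent operations without reading data from an external register file. Compared with the prior art, in which pooling is performed by an image processor, the pooling operation device provided by the present application can effectively improve data read/write efficiency and thus improve pooling efficiency as a whole.
In addition, in the embodiments of the present application, each computing unit reads one pooling operand at a time and operates on two data at a time, so the pooling operation device provided by the embodiments of the present application is not affected by changes in the size of the pooling window. In other words, the device is applicable to pooling operations with pooling windows of any size.
Therefore, the pooling operation device provided by the embodiments of the present application implements parallel pooling through the plurality of computing units and the plurality of register groups, which improves pooling efficiency. In addition, because each computing unit can store the intermediate calculation results of the pooling operation, data read/write efficiency is improved, which in turn improves overall pooling efficiency, so that the pooling operation is accelerated as much as possible.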
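Specifically, as shown in FIG. 2, the data in the plurality of register groups 210 are loaded from a dynamic memory 240. The dynamic memory is, for example, a dynamic random access memory (DRAM). The dynamic memory 240 may be located inside or outside the pooling operation device 200.
The number of computing units 220 is less than or equal to the number of register groups 210. For example, the pooling operation device 200 includes n computing units 220 and n register groups 210.
Optionally, the connection relationship between the plurality of register groups 210 and the plurality of computing units 220 is that each computing unit is connected to all of the register groups. It should be understood that this connection relationship enables each computing unit to obtain the data stored in any one of the register groups. This connection relationship may be called full connection.
Hereinafter, "the plurality of computing units are fully connected to the plurality of register groups" means that each of the computing units is connected to all of the register groups.
Optionally, the connection relationship between the plurality of register groups 210 and the plurality of computing units 220 is that each computing unit is connected to some of the register groups. Specifically, the register groups connected to different computing units may be the same, completely different, or partially different, which is not limited in the embodiments of the present application.
To explain clearly how the embodiments of the present invention perform pooling, the computing unit provided by the embodiments is described below. The first computing unit 220 includes a storage module and an operation module. The operation module is configured to perform the first pooling operation on the first data and the second data obtained from the plurality of register groups, obtain the first operation result, and store the first operation result in the storage module. The operation module is further configured to perform the second pooling operation on the first operation result stored in the storage module and the third data obtained from the plurality of register groups.
In the scenario in which the pooling operation is maximum pooling, the first pooling operation is a comparison, that is, the first data is compared with the second data; accordingly, the operation module may include an adder or a comparator.
In the scenario in which the pooling operation is average pooling, the first pooling operation is an accumulation, that is, the first data and the second data are accumulated; accordingly, the operation module includes an adder. It should be understood that the operation module then also includes a multiplier, used to average the total accumulated result of all operands in the pooling window.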
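Optionally, in a first implementation, the first computing unit 220 includes:
a first data interface, configured to receive the first data obtained from the plurality of register groups; a second data interface, configured to receive the second data obtained from the plurality of register groups; a first storage module, configured to store the first data; a second storage module, configured to store the second data; an operation module, configured to compare the first data with the second data to obtain the first operation result and to store the first operation result in a latch, the first operation result being that the first data is greater than the second data; the latch, configured to store the first operation result and, according to the first operation result, to send a feedback signal to the first data interface and the second data interface, the feedback signal instructing the first data interface to close and the second data interface to open, where the opened second data interface is configured to receive the third data obtained from the first register group; the operation module being further configured to perform the second pooling operation on the first operation result and the third data.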
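Specifically, as shown in Figure 3, the first computing unit 220 includes data interface 311, data interface 312, storage module 321, storage module 322, operation module 330 and latch 340.
Data interface 311 is configured to obtain data from the plurality of register groups.
Data interface 312 is configured to obtain data from the plurality of register groups.
Storage module 321 is configured to store the data obtained by data interface 311.
Storage module 322 is configured to store the data obtained by data interface 312.
Operation module 330 is configured to obtain a first operand from storage module 321 and a second operand from storage module 322, compare the first operand with the second operand, and store the comparison result in latch 340.
Latch 340 is configured to send a first feedback signal to data interface 311 and data interface 312 when the comparison result is that the first operand is greater than or equal to the second operand, and to send a second feedback signal to data interface 311 and data interface 312 when the comparison result is that the first operand is less than the second operand, where the first feedback signal closes data interface 311 and opens data interface 312, and the second feedback signal opens data interface 311 and closes data interface 312.
It should be understood that when data interface 311 (or data interface 312) is closed, it does not obtain data from the register groups, and when it is open, it obtains data from the register groups.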
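Taking a pooling window of size 2*2 as an example:
In clock cycle T, data interface 311 receives the first data from a register group, and storage module 321 stores the first data.
In clock cycle T+1, data interface 312 receives the second data from a register group, and storage module 322 stores the second data. Operation module 330 obtains the first operand (the first data) from storage module 321 and the second operand (the second data) from storage module 322, compares them, and stores the comparison result in latch 340. Latch 340 sends the first feedback signal to data interface 311 and data interface 312 when the first operand is greater than or equal to the second operand, and sends the second feedback signal when the first operand is less than the second operand. For ease of description and understanding, the following assumes throughout that the first operand is greater than or equal to the second operand, so that latch 340 sends the first feedback signal to data interfaces 311 and 312.
In clock cycle T+2, data interface 311 is closed and data interface 312 is open and receives the third data from a register group, and storage module 322 stores the third data. Operation module 330 obtains the first operand (the larger value found in clock cycle T+1, namely the first data) from storage module 321 and the second operand (the third data) from storage module 322 and compares them. The comparison result is that the first operand is greater than or equal to the second operand; it is stored in latch 340, and latch 340 sends the first feedback signal to data interface 311 and data interface 312.
In clock cycle T+3, data interface 311 is closed and data interface 312 is open and receives the fourth data from a register group, and storage module 322 stores the fourth data. Operation module 330 obtains the first operand (the larger value found in clock cycle T+2, namely the first data) from storage module 321 and the second operand (the fourth data) from storage module 322 and compares them. The comparison result is that the first operand is greater than or equal to the second operand. The pooling result of this pooling operation is thus obtained: the first data.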
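Storage module 321 and storage module 322 may both be registers.
Optionally, data interface 311 and data interface 312 may both be multiplexers. The number of inputs of each multiplexer equals the number of register groups connected to the first computing unit.
Optionally, operation module 330 is an adder configured to subtract the second operand obtained from storage module 322 from the first operand obtained from storage module 321 and to store the difference in latch 340 as the comparison result.
Optionally, operation module 330 is a comparator configured to compare the first operand obtained from storage module 321 with the second operand obtained from storage module 322 and to store the comparison result in latch 340.
It should be understood that the computing unit of the first implementation is suitable for the case where the pooling operation is max pooling; that is, the pooling operation device provided in the embodiments of this application can be used to implement max pooling.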
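The feedback mechanism of the first implementation can be sketched as a small behavioural model. This is an illustration under assumptions, not a description of the actual circuit: the class name `MaxPoolUnit`, its attribute names and the `max_pool` helper are hypothetical, and the hand-over from data interface 311 to data interface 312 after the first read is modelled explicitly because the text only states which interface receives which operand.

```python
class MaxPoolUnit:
    """Behavioural model of the max-pooling computing unit (Figure 3)."""

    def __init__(self):
        self.store_321 = None   # storage module 321
        self.store_322 = None   # storage module 322
        self.open_311 = True    # data interface 311 receives the first operand
        self.open_312 = False
        self.reads = 0

    def step(self, value):
        """Feed one pooling operand, i.e. simulate one clock cycle."""
        if self.reads == 0:
            # Cycle T: interface 311 receives the first data.
            self.store_321 = value
            # Assumption: interface 312 is then opened for the second data.
            self.open_311, self.open_312 = False, True
        else:
            # The open interface overwrites its own storage module, so the
            # module holding the larger value so far is left untouched.
            if self.open_311:
                self.store_321 = value
            else:
                self.store_322 = value
            # Operation module 330 compares; latch 340 holds the outcome.
            latch_first_ge_second = self.store_321 >= self.store_322
            # First feedback signal: close 311, open 312.
            # Second feedback signal: open 311, close 312.
            self.open_311 = not latch_first_ge_second
            self.open_312 = latch_first_ge_second
        self.reads += 1

    def result(self):
        """The larger value sits behind the interface that is closed."""
        return self.store_321 if self.open_312 else self.store_322


def max_pool(operands):
    """Run one pooling window through a fresh unit."""
    unit = MaxPoolUnit()
    for value in operands:
        unit.step(value)
    return unit.result()
```

Closing the interface in front of the larger value is what lets the unit keep the running maximum without any extra copy: the next operand always lands in the storage module that holds the smaller value.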
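Optionally, in a second implementation, the first computing unit 220 specifically includes:
a first data interface, configured to receive the first data from the plurality of register groups; a second data interface, configured to receive the second data from the plurality of register groups; a first storage module, configured to store the first data; a second storage module, configured to store the second data; an adder, configured to add the first data and the second data to obtain the first operation result; the second storage module being further configured to store the first operation result; the first data interface being further configured to obtain the third data from the first register group; the adder being further configured to perform the second pooling operation on the first operation result and the third data.
The first computing unit 220 further includes a multiplier, configured to multiply the accumulated result of k1*k2 data items by 1/(k1*k2) when the adder has obtained that accumulated result, so as to obtain the average of the k1*k2 data items, where k1*k2 is the size of the pooling window corresponding to the pooling operation, and k1 and k2 are each integers not less than 2.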
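Specifically, as shown in Figure 4, the first computing unit 220 includes data interface 411, data interface 412, storage module 421, storage module 422, adder 430 and multiplier 440.
Data interface 411 is configured to obtain data from the register groups.
Data interface 412 is configured to obtain data from the register groups.
Storage module 421 is configured to store the data obtained by data interface 411.
Storage module 422 is configured to store the data obtained by data interface 412.
Adder 430 is configured to obtain a first operand from storage module 421 and a second operand from storage module 422, add the first operand and the second operand, and store the accumulated result in storage module 422. After the first computing unit has read k1*k2 data items from the register groups, adder 430 sends the accumulated result to multiplier 440, where k1*k2 is the size of the pooling window.
Once an accumulated result is stored in storage module 422, data interface 412 is closed, and from then on only data interface 411 receives data from the register groups.
Multiplier 440 is configured to multiply the accumulated result sent by adder 430 by 1/(k1*k2), which yields the pooling result of this pooling operation.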
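Taking a pooling window of size 2*2 as an example:
In clock cycle T, data interface 411 receives the first data from a register group, and storage module 421 stores the first data.
In clock cycle T+1, data interface 412 receives the second data from a register group, and storage module 422 stores the second data. Adder 430 obtains the first operand (the first data) from storage module 421 and the second operand (the second data) from storage module 422, adds them, and stores the accumulated result (denoted accumulated value 1) in storage module 422. Data interface 412 is then closed.
In clock cycle T+2, data interface 412 is closed and data interface 411 is open and receives the third data from a register group, and storage module 421 stores the third data. Adder 430 obtains the first operand (the third data) from storage module 421 and the second operand (accumulated value 1) from storage module 422, adds them, and stores the accumulated result (denoted accumulated value 2) in storage module 422.
In clock cycle T+3, data interface 412 is closed and data interface 411 is open and receives the fourth data from a register group, and storage module 421 stores the fourth data. Adder 430 obtains the first operand (the fourth data) from storage module 421 and the second operand (accumulated value 2) from storage module 422, adds them, and sends the accumulated result (denoted accumulated value 3) to multiplier 440. Multiplier 440 multiplies accumulated value 3 by 1/4, and the product is the pooling result of this pooling operation.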
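By way of example and not limitation, the description above uses storage module 422 to store the accumulated result of adder 430. In practice, the design may instead have storage module 421 store the accumulated result of adder 430 (in which case data interface 411 is closed and data interface 412 is opened), which is not limited in this embodiment.
Storage module 421 and storage module 422 may both be registers.
Optionally, data interface 411 and data interface 412 may both be multiplexers. The number of inputs of each multiplexer equals the number of register groups connected to the first computing unit.
It should be understood that the computing unit of the second implementation is suitable for the case where the pooling operation is average pooling; that is, the pooling operation device provided in the embodiments of this application can be used to perform average pooling.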
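The accumulate-then-scale behaviour of the second implementation can be sketched in the same style. Again this is a hedged behavioural model rather than the circuit itself: `AvgPoolUnit` and its attribute names are hypothetical, and a method call stands in for a clock cycle.

```python
class AvgPoolUnit:
    """Behavioural model of the average-pooling computing unit (Figure 4)."""

    def __init__(self, k1, k2):
        self.window = k1 * k2    # number of operands per pooling window
        self.store_421 = 0.0     # storage module 421: newly read operand
        self.store_422 = 0.0     # storage module 422: running sum
        self.reads = 0

    def step(self, value):
        """Feed one operand; return the average once the window is complete."""
        self.reads += 1
        if self.reads == 1:
            # Cycle T: interface 411 receives the first data.
            self.store_421 = value
            return None
        if self.reads == 2:
            # Cycle T+1: interface 412 receives the second data, then closes.
            self.store_422 = value
        else:
            # Later cycles: only interface 411 receives data.
            self.store_421 = value
        total = self.store_421 + self.store_422      # adder 430
        if self.reads == self.window:
            self.reads = 0                           # ready for the next window
            return total * (1.0 / self.window)       # multiplier 440
        self.store_422 = total                       # keep intermediate sum
        return None
```

Feeding the operands 1, 2, 19, 20 of a 2*2 window returns `None` for the first three cycles and the average on the fourth, which corresponds to accumulated values 1, 2 and 3 in the walk-through above.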
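Optionally, in a third implementation, as shown in Figure 5 and Figure 6, the first computing unit 220 includes data interface 511, data interface 512, storage module 521, storage module 522, adder 530, multiplier 540 and latch 550.
Data interface 511 is configured to obtain data from the register groups.
Data interface 512 is configured to obtain data from the register groups.
Storage module 521 is configured to store the data obtained by data interface 511.
Storage module 522 is configured to store the data obtained by data interface 512.
The case where the pooling operation is max pooling is shown in Figure 5.
Adder 530 is configured to obtain a first operand from storage module 521 and a second operand from storage module 522, compare the first operand with the second operand, and store the comparison result in latch 550.
Latch 550 is configured to send a first feedback signal to data interface 511 and data interface 512 when the comparison result is that the first operand is greater than or equal to the second operand, and to send a second feedback signal to data interface 511 and data interface 512 when the comparison result is that the first operand is less than the second operand, where the first feedback signal closes data interface 511 and opens data interface 512, and the second feedback signal opens data interface 511 and closes data interface 512.
It should be understood that when data interface 511 (or data interface 512) is closed, it does not obtain data from the register groups, and when it is open, it obtains data from the register groups.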
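The case where the pooling operation is average pooling is shown in Figure 6.
Adder 530 is configured to obtain a first operand from storage module 521 and a second operand from storage module 522, add the first operand and the second operand, and store the accumulated result in storage module 522. After the first computing unit has read k1*k2 data items from the register groups, adder 530 sends the accumulated result to multiplier 540, where k1*k2 is the size of the pooling window.
Once an accumulated result is stored in storage module 522, data interface 512 is closed, and from then on only data interface 511 receives data from the register groups.
Multiplier 540 is configured to multiply the accumulated result sent by adder 530 by 1/(k1*k2), which yields the pooling result of this pooling operation.
It should be understood that the first computing unit of the third implementation supports two states: one for max pooling (Figure 5) and one for average pooling (Figure 6).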
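When the first computing unit 220 is implemented according to the third implementation, as shown in Figure 2, the pooling operation device further includes a control unit 230, configured to send a control signal to the plurality of computing units, the control signal indicating whether the pooling operation is max pooling or average pooling.
The first computing unit 220 is further configured to receive the control signal. When the control signal indicates max pooling, the first computing unit 220 performs a max pooling operation on the first data and the second data; when the control signal indicates average pooling, it performs an average pooling operation on the first data and the second data.
Specifically, when the control signal indicates max pooling, the first computing unit 220 switches to the state shown in Figure 5; when the control signal indicates average pooling, the first computing unit 220 switches to the state shown in Figure 6.
It should be understood that control unit 230 may also be located outside the pooling operation device 200 provided in this application, which is not limited in this application.
Storage module 521 and storage module 522 may both be registers.
Optionally, data interface 511 and data interface 512 may both be multiplexers. The number of inputs of each multiplexer equals the number of register groups connected to the first computing unit 220.
It should be understood that the computing unit shown in Figure 5 and Figure 6 is suitable for both average pooling and max pooling; that is, the pooling operation device provided in the embodiments of this application can perform both average pooling and max pooling, which improves hardware utilization and reduces hardware cost.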
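The mode switch of the third implementation can be illustrated with one function that shares a single "adder" between both modes. This is a sketch under assumptions: the function name `configurable_pool` and the string values `"max"` and `"avg"` for the control signal are hypothetical, and using the sign of a subtraction as the comparison result follows the optional adder-based comparison described for the first implementation.

```python
def configurable_pool(control, operands):
    """Pool one window in the mode selected by the control signal."""
    if control not in ("max", "avg"):
        raise ValueError("control must be 'max' or 'avg'")
    values = list(operands)
    acc = values[0]                      # first pooling operand
    for value in values[1:]:
        if control == "max":
            # Adder used as a subtractor: the sign of the difference
            # plays the role of the latched comparison result.
            diff = acc - value
            if diff < 0:
                acc = value
        else:
            # Adder used as an accumulator.
            acc = acc + value
    if control == "avg":
        # Multiplier scales the total by 1/(k1*k2).
        acc = acc * (1.0 / len(values))
    return acc
```

Because both modes read operands in the same order and differ only in what the adder output is used for, the same read schedule from the register groups serves max pooling and average pooling alike.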
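The way in which the plurality of register groups store data is described below. For ease of understanding and description, the following embodiments take full connection between the computing units and the register groups as an example. With reasonable adaptation, the embodiments described below also apply to the case where each computing unit is connected to only some of the register groups, and such variants also fall within the protection scope of this application. The following embodiments take the case where the plurality of register groups 210 are a plurality of Banks as an example.
In the embodiments of this application, the data to be pooled is stored in the register groups in such a way that, in each read (that is, in each clock cycle), the data read by different computing units is located in different register groups. This guarantees that different computing units do not conflict when reading data from the register groups.
Assume that there are n computing units and n Banks, that the pooling window is k*k, and that the image to be pooled is m*m, where k, n and m are integers greater than 1. Assume m=n*k. For example, if the data to be pooled is k rows by n*k columns of the image, the data is stored in the n Banks as follows: the data in columns 1, k+1, 2k+1, …, (n-1)*k+1 of row j of the k rows is stored in different register groups of the n register groups; the data in columns 2, k+2, 2k+2, …, (n-1)*k+2 of row j is stored in different register groups of the n register groups; …; and the data in columns k, k+k, 2k+k, …, (n-1)*k+k of row j is stored in different register groups of the n register groups, where j is 1, 2, ..., k.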
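Figure 7 is a schematic diagram of one specific way of storing the data to be pooled in a plurality of Banks. In Figure 7, k is 2 and n is 9, that is, the pooling window is 2*2 and there are 9 computing units and 9 Banks, as shown in Figure 7. Each Bank includes a plurality of registers (Figure 7 schematically shows 5 registers per Bank), so the registers of the 9 Banks form a register array with 9 rows and multiple columns (Figure 7 schematically shows a register array with 9 rows and 5 columns). The image to be processed is 18*18. Assume that the data to be pooled is the data in rows 1 and 2 of the image, as shown in Figure 7, where 4 data items with the same pattern belong to the same pooling window.
The data in rows 1 and 2 of the image is stored in the 9 Banks as follows:
The data in row 1 of the image is loaded starting from row 1 of column r1 of the register array and occupies 2 columns of registers (columns r1 and r2 in Figure 7).
The data in row 2 of the image is loaded starting from row 1 of column (r1+2) of the register array (that is, column r3) and occupies 2 columns of registers (columns r3 and r4 in Figure 7).
The details are shown in Figure 7.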
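The loading order described above can be sketched as a simple mapping from image columns to Banks. This is an assumption-laden illustration: the function name `load_rows_into_banks` is hypothetical, Banks are indexed from 0 in the code (Bank1 in the text is index 0), and the sketch assumes, as in the 18-column example with 9 Banks, that every image row fills a whole number of register columns so that each row starts at the first Bank.

```python
def load_rows_into_banks(rows, n_banks):
    """Fill the register array column by column, one register per Bank.

    Returns a list where entry b holds, in order, the values stored in
    Bank b+1 (register columns r1, r2, ...).
    """
    banks = [[] for _ in range(n_banks)]
    for row in rows:
        if len(row) % n_banks != 0:
            raise ValueError("row length must be a multiple of n_banks")
        for col_index, value in enumerate(row):
            # Consecutive image columns go to consecutive Banks and wrap
            # around to the next register column after the last Bank.
            banks[col_index % n_banks].append(value)
    return banks


# Rows 1 and 2 of the example image, numbered 1..18 and 19..36.
example_banks = load_rows_into_banks(
    [list(range(1, 19)), list(range(19, 37))], n_banks=9)
```

With this mapping, Bank1 holds the values 1, 10, 19 and 28 in register columns r1 to r4, which agrees with the reads listed for Figures 8 to 11.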
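The flow in which the 9 computing units read data from the 9 Banks is shown in Figures 8, 9, 10 and 11.
As shown in Figure 8, in clock cycle T, the data in the dashed boxes in the 9 Banks is read out simultaneously. These data items are the data in columns 1, 3, 5, …, 17 of row 1 of the image, that is, the first pooling operand of each pooling window. Computing unit 1 reads data "1" from Bank1, computing unit 2 reads data "3" from Bank3, computing unit 3 reads data "5" from Bank5, computing unit 4 reads data "7" from Bank7, computing unit 5 reads data "9" from Bank9, computing unit 6 reads data "11" from Bank2, computing unit 7 reads data "13" from Bank4, computing unit 8 reads data "15" from Bank6, and computing unit 9 reads data "17" from Bank8.
As shown in Figure 9, in clock cycle T+1, the data in the dashed boxes in the 9 Banks is read out simultaneously. These data items are the data in columns 2, 4, 6, …, 18 of row 1 of the image, that is, the second pooling operand of each pooling window. Computing unit 1 reads data "2" from Bank2, computing unit 2 reads data "4" from Bank4, computing unit 3 reads data "6" from Bank6, computing unit 4 reads data "8" from Bank8, computing unit 5 reads data "10" from Bank1, computing unit 6 reads data "12" from Bank3, computing unit 7 reads data "14" from Bank5, computing unit 8 reads data "16" from Bank7, and computing unit 9 reads data "18" from Bank9.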
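As shown in Figure 10, in clock cycle T+2, the data in the dashed boxes in the 9 Banks is read out simultaneously. These data items are the data in columns 1, 3, 5, …, 17 of row 2 of the image, that is, the third pooling operand of each pooling window. Computing unit 1 reads data "19" from Bank1, computing unit 2 reads data "21" from Bank3, computing unit 3 reads data "23" from Bank5, computing unit 4 reads data "25" from Bank7, computing unit 5 reads data "27" from Bank9, computing unit 6 reads data "29" from Bank2, computing unit 7 reads data "31" from Bank4, computing unit 8 reads data "33" from Bank6, and computing unit 9 reads data "35" from Bank8.
As shown in Figure 11, in clock cycle T+3, the data in the dashed boxes in the 9 Banks is read out simultaneously. These data items are the data in columns 2, 4, 6, …, 18 of row 2 of the image, that is, the fourth pooling operand of each pooling window. Computing unit 1 reads data "20" from Bank2, computing unit 2 reads data "22" from Bank4, computing unit 3 reads data "24" from Bank6, computing unit 4 reads data "26" from Bank8, computing unit 5 reads data "28" from Bank1, computing unit 6 reads data "30" from Bank3, computing unit 7 reads data "32" from Bank5, computing unit 8 reads data "34" from Bank7, and computing unit 9 reads data "36" from Bank9. At this point, the 9 computing units can output 9 pooling results simultaneously.
As can be seen from the description above with reference to Figures 8, 9, 10 and 11, in each clock cycle the data read by different computing units is located in different Banks. In addition, the Bank from which one and the same computing unit reads data may differ between clock cycles.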
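The conflict-free property of the read schedule can be checked with a short sketch. It assumes the same hypothetical mapping as the example (image column c, counted from 0, lives in Bank (c mod n) + 1, with each image row filling whole register columns); the function names `read_schedule` and `conflict_free` are illustrative and not part of the application.

```python
def read_schedule(n_units, k, n_banks):
    """List, for each of the k*k cycles, the (unit, Bank) pairs read.

    Unit u (1-based) handles the pooling window covering image columns
    (u-1)*k+1 .. u*k. Units and Banks are numbered from 1 as in the text.
    """
    schedule = []
    for _row_in_window in range(k):
        for col_in_window in range(k):
            cycle = []
            for unit in range(n_units):
                image_col = unit * k + col_in_window   # 0-based column
                bank = image_col % n_banks             # assumed mapping
                cycle.append((unit + 1, bank + 1))
            schedule.append(cycle)
    return schedule


def conflict_free(schedule):
    """True if no two units read the same Bank in any single cycle."""
    return all(len({bank for _, bank in cycle}) == len(cycle)
               for cycle in schedule)
```

For n = 9 and k = 2 the schedule reproduces the Bank assignments listed for Figures 8 to 11 and contains no conflicts. Under this particular modulo mapping with as many units as Banks, a cycle is conflict-free exactly when k and n share no common factor; for example, k = 3 with 9 Banks would collide, so a different placement would be needed there.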
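The structure of the computing units in the above embodiment may be the structure shown in Figure 3, the structure shown in Figure 4, or the structure shown in Figure 5, which is not limited in this application.
As an example, assume that in the embodiment described above with reference to Figure 7, the 9 computing units have the structure shown in Figure 5, as shown in Figure 7. It should be understood that, to keep the drawing simple, Figure 7 shows only the structure of computing unit 9; the other 8 computing units have the same structure as computing unit 9.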
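When the pooling operation is max pooling, the 9 computing units shown in Figure 7 all switch to the state shown in Figure 5. Computing unit 1 is taken as an example below; the description applies equally to computing units 2-9 and is not repeated for them.
In clock cycle T, as shown in Figure 8, data interface 511 of computing unit 1 receives data "1" from Bank1, and storage module 521 stores data "1".
In clock cycle T+1, as shown in Figure 9, data interface 512 of computing unit 1 receives data "2" from Bank2, and storage module 522 stores data "2". Adder 530 obtains the first operand "1" from storage module 521 and the second operand "2" from storage module 522, compares the two operands, and stores the comparison result (the first operand is less than the second operand) in latch 550. Latch 550 sends the second feedback signal to data interface 511 and data interface 512, which opens data interface 511 and closes data interface 512.
In clock cycle T+2, as shown in Figure 10, data interface 511 receives data "19" from Bank1, and storage module 521 stores data "19". Adder 530 obtains the first operand "19" from storage module 521 and the second operand "2" (the larger value found in clock cycle T+1) from storage module 522, compares the two operands, and stores the comparison result (the first operand is greater than the second operand) in latch 550. Latch 550 sends the first feedback signal to data interface 511 and data interface 512, which closes data interface 511 and opens data interface 512.
In clock cycle T+3, as shown in Figure 11, data interface 512 receives data "20" from Bank2, and storage module 522 stores data "20". Adder 530 obtains the first operand "19" (the larger value found in clock cycle T+2) from storage module 521 and the second operand "20" from storage module 522, compares the two operands, and obtains the comparison result (the first operand is less than the second operand), which gives the pooling result of this pooling operation: "20".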
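When the pooling operation is average pooling, the 9 computing units shown in Figure 7 all switch to the state shown in Figure 6. The flow in which the computing units read data from the register groups is the same as described above for max pooling. The difference is that the state shown in Figure 6 processes the data differently from the state shown in Figure 5; for details, see the description above with reference to Figure 6, which is not repeated here.
It should be understood that the pooling result obtained by each computing unit may be written back to the plurality of register groups. For example, the pooling results of the same row of the original image are written to different register groups.
As can be seen from the above, in the pooling operation device provided in the embodiments of this application, the plurality of computing units can read data from different register groups without blocking one another, so pooling can be performed in parallel, which improves the efficiency of the pooling operation. In addition, each computing unit includes a storage module for storing intermediate results, which improves data read/write efficiency and therefore the overall efficiency of the pooling operation.
It should be understood that Figure 7 and Figures 8 to 11 are examples only and are not limiting. On this basis, corresponding processing methods for different application scenarios can be derived by adaptation, and such solutions also fall within the protection scope of this application.
The description above with reference to Figure 7 and Figures 8 to 11 takes m=n*k as an example. In practice, m>n*k or m<n*k may occur.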
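For example, when m>n*k, again taking rows 1 to k of the image as an example, the pooling operation on the first n*k columns of rows 1 to k keeps the n computing units fully loaded, that is, the n computing units operate in parallel. After k*k clock cycles, the remaining (m-k*n) columns of rows 1 to k can only supply data for the pooling operations of computing units 1-5, and computing units 6-9 have no data, as shown in Figure 12 (where n is 9 and k is 2). Some computing units are therefore idle, which wastes resources. When m<n*k, the same problem of idle computing units arises.
To address this problem, the embodiments of this application provide a solution. When the n computing units are not fully loaded, part of the data in other rows of the image is stored in the n register groups, so that all n computing units can read data and perform pooling in parallel.
Take m>n*k as an example, with n being 9 and k being 2 as before, as shown in Figure 13. Figure 13 shows the data in rows 1 to 4 of the image, where one pattern represents the data in rows 1 and 2 and another pattern represents the data in rows 3 and 4.
The first 9*2 columns of rows 1 and 2 of the image are pooled first. After 2*2 clock cycles, the remaining (m-2*9) columns of rows 1 and 2 are stored in the 9 register groups (Banks), and the first x columns of rows 3 and 4 are also stored in the 9 register groups, such that the storage locations of the first x columns of rows 3 and 4 are spliced directly onto the locations of the remaining (m-2*9) columns of rows 1 and 2, as shown in Figure 13. After this splicing, the 9 computing units can run fully loaded during the next 2*2 clock cycles.
When m is less than n*k, the processing is similar. When processing of rows 1 and 2 of the image starts, the data in rows 1 and 2 is stored in the 9 register groups, and the first y columns of rows 3 and 4 are also stored in the 9 register groups, so that the 9 computing units run fully loaded during the next k*k clock cycles.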
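The splicing idea can be sketched by treating the pooling windows of consecutive row groups as one continuous stream that is handed out n windows at a time. This is only an illustration under assumptions: the function name `spliced_passes` is hypothetical, m is assumed to be a multiple of k, and the image width m = 28 in the example is chosen only because it leaves 5 windows over after the first pass, matching the "computing units 1-5" case mentioned above; the source does not state m.

```python
def spliced_passes(m, k, n, row_groups):
    """Assign pooling windows to n computing units, pass by pass.

    Each pass takes k*k clock cycles. A window is identified as
    (row_group, window_index), both 0-based; row group g covers image
    rows g*k+1 .. (g+1)*k.
    """
    windows_per_group = m // k
    stream = [(group, window)
              for group in range(row_groups)
              for window in range(windows_per_group)]
    # Splicing: a pass is simply the next n windows of the stream, even
    # when that crosses from one row group into the next.
    return [stream[i:i + n] for i in range(0, len(stream), n)]


# Hypothetical example: 28 columns, 2*2 window, 9 units, rows 1 to 4.
passes = spliced_passes(m=28, k=2, n=9, row_groups=2)
```

In this example the second pass contains the last 5 windows of rows 1 and 2 followed by the first 4 windows of rows 3 and 4, so x would be 4*2 = 8 columns and all 9 units stay busy.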
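The method of storing data in the plurality of register groups provided in this embodiment may be referred to as the splicing method.
It should be understood that, with the solution provided in the embodiments of this application, the pooling operation device can perform pooling in parallel, that is, accelerate the pooling operation, for images and pooling windows of different sizes.
In summary, the pooling operation device provided in the embodiments of this application can perform pooling operations in parallel by means of the plurality of computing units and the plurality of register groups, which improves pooling efficiency. In addition, because each computing unit can store the intermediate results of the pooling operation, data read/write efficiency is improved, which in turn improves overall pooling efficiency and accelerates the pooling operation as much as possible.
It should also be understood that the above embodiments may be reasonably combined according to their inherent logical relationships, which is not limited in this application.
It should also be understood that some of the embodiments above describe a pooling operation device that includes 9 computing units and 9 register groups; this is an example only and is not limiting. In practice, the numbers of computing units and register groups in the pooling operation device can be designed as required.
Optionally, the pooling operation device provided in the embodiments of this application may take the form of a chip.
An embodiment of this application further provides a computer device, including a memory and the pooling operation device provided in the embodiments above, where the memory is configured to store the data on which the pooling operation device is to perform the pooling operation.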
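A person of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and the design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each specific application, but such implementations should not be considered to go beyond the scope of this application.
A person skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, devices and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, devices and methods may be implemented in other ways. For example, the device embodiments described above are merely illustrative. For example, the division into units is merely a division by logical function, and other divisions are possible in an actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual couplings, direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, devices or units, and may be electrical, mechanical or of another form.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected as actually required to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of this application may be integrated in one processing unit, each unit may exist physically on its own, or two or more units may be integrated in one unit.
If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to perform all or some of the steps of the methods described in the embodiments of this application. The storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk or an optical disc.
The above are merely specific implementations of this application, but the protection scope of this application is not limited thereto. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed in this application shall be covered by the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.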

Claims (7)

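  1. A pooling operation device, comprising:
    a plurality of register groups, configured to store a plurality of data items;
    a plurality of computing units, configured to perform a pooling operation on the plurality of data items, wherein the data operated on by different computing units is located in different register groups of the plurality of register groups;
    wherein a first computing unit of the plurality of computing units is configured to:
    perform a first pooling operation on first data and second data of the plurality of data items to obtain a first operation result;
    store the first operation result;
    obtain third data from a first register group of the plurality of register groups; and
    perform a second pooling operation on the first operation result and the third data.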
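  2. The pooling operation device according to claim 1, wherein:
    each computing unit of the plurality of computing units is able to obtain the data stored in any one of the plurality of register groups.
  3. The pooling operation device according to claim 1 or 2, further comprising:
    a control unit, configured to send a control signal to the plurality of computing units, the control signal indicating whether the pooling operation is max pooling or average pooling;
    wherein the first computing unit is further configured to receive the control signal; and
    in performing the first pooling operation on the first data and the second data of the plurality of data items, the first computing unit is specifically configured to:
    when the control signal indicates that the pooling operation is max pooling, perform a max pooling operation on the first data and the second data; and
    when the control signal indicates that the pooling operation is average pooling, perform an average pooling operation on the first data and the second data.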
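  4. The pooling operation device according to any one of claims 1 to 3, wherein the first pooling operation comprises a max pooling operation, and
    the first computing unit comprises:
    a first data interface, configured to receive the first data obtained from the plurality of register groups;
    a second data interface, configured to receive the second data obtained from the plurality of register groups;
    a first storage module, configured to store the first data;
    a second storage module, configured to store the second data;
    an operation module, configured to compare the first data with the second data to obtain the first operation result and to store the first operation result in a latch, the first operation result being that the first data is greater than the second data;
    the latch, configured to store the first operation result and, according to the first operation result, to send a feedback signal to the first data interface and the second data interface, the feedback signal instructing the first data interface to close and the second data interface to open, wherein the opened second data interface is configured to receive the third data obtained from the first register group;
    the operation module being further configured to perform the second pooling operation on the first operation result and the third data.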
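  5. The pooling operation device according to any one of claims 1 to 3, wherein the first pooling operation comprises an average pooling operation;
    the first computing unit specifically comprises:
    a first data interface, configured to receive the first data from the plurality of register groups;
    a second data interface, configured to receive the second data from the plurality of register groups;
    a first storage module, configured to store the first data;
    a second storage module, configured to store the second data;
    an adder, configured to add the first data and the second data to obtain the first operation result;
    the second storage module being further configured to store the first operation result;
    the first data interface being further configured to obtain the third data from the first register group;
    the adder being further configured to perform the second pooling operation on the first operation result and the third data.
  6. The pooling operation device according to claim 5, wherein the first computing unit further comprises:
    a multiplier, configured to multiply the accumulated result of k1*k2 data items by 1/(k1*k2) when the adder has obtained that accumulated result, so as to obtain the average of the k1*k2 data items, wherein k1*k2 is the size of the pooling window corresponding to the pooling operation, and k1 and k2 are each integers not less than 2.
  7. A computer device, comprising a memory and the pooling operation device according to any one of claims 1 to 6, wherein the memory is configured to store the data on which the pooling operation device is to perform the pooling operation.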
PCT/CN2019/084004 2018-04-25 2019-04-24 Pooling operation device WO2019206161A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810377097.6A CN110399977A (zh) 2018-04-25 2018-04-25 Pooling operation device
CN201810377097.6 2018-04-25

Publications (1)

Publication Number Publication Date
WO2019206161A1 true WO2019206161A1 (zh) 2019-10-31

Family

ID=68293755

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/084004 WO2019206161A1 (zh) 2018-04-25 2019-04-24 Pooling operation device

Country Status (2)

Country Link
CN (1) CN110399977A (zh)
WO (1) WO2019206161A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111340085A (zh) * 2020-02-20 2020-06-26 深圳市商汤科技有限公司 数据处理方法及装置、处理器、电子设备、存储介质

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021092941A1 (zh) * 2019-11-15 2021-05-20 深圳市大疆创新科技有限公司 感兴趣区域-池化层的计算方法与装置、以及神经网络系统
CN113255897B (zh) * 2021-06-11 2023-07-07 西安微电子技术研究所 一种卷积神经网络的池化计算单元
CN116681114A (zh) * 2022-02-22 2023-09-01 深圳鲲云信息科技有限公司 池化计算芯片、方法、加速器及系统

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106940815A (zh) * 2017-02-13 2017-07-11 西安交通大学 一种可编程卷积神经网络协处理器ip核
CN107679620A (zh) * 2017-04-19 2018-02-09 北京深鉴科技有限公司 人工神经网络处理装置
CN107704923A (zh) * 2017-10-19 2018-02-16 珠海格力电器股份有限公司 卷积神经网络运算电路

Family Cites Families (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6822885B2 (en) * 2003-04-14 2004-11-23 International Business Machines Corporation High speed latch and compare function
US6944079B2 (en) * 2003-12-31 2005-09-13 Micron Technology, Inc. Digital switching technique for detecting data
CN101222216A (zh) * 2007-01-13 2008-07-16 曹先国 比较器
US8891319B2 (en) * 2010-11-30 2014-11-18 Micron Technology, Inc. Verify or read pulse for phase change memory and switch
US8536908B2 (en) * 2011-09-29 2013-09-17 Spansion Llc Apparatus and method for smart VCC trip point design for testability
CN103338200B (zh) * 2013-06-28 2016-03-02 国家电网公司 基于FPGA的网络Smurf攻击特征瞬时防御电路实现方法
DE102015112852B4 (de) * 2015-08-05 2021-11-18 Lantiq Beteiligungs-GmbH & Co. KG Wägeverfahren mit nichtlinearer Charakteristik
US10803401B2 (en) * 2016-01-27 2020-10-13 Microsoft Technology Licensing, Llc Artificial intelligence engine having multiple independent processes on a cloud based platform configured to scale
CN106056212B (zh) * 2016-05-25 2018-11-23 清华大学 一种人工神经网络计算核
CN106598688B (zh) * 2016-12-09 2019-10-18 曙光信息产业(北京)有限公司 一种深度学习汇编优化中的寄存器冲突避免方法
CN106779060B (zh) * 2017-02-09 2019-03-08 武汉魅瞳科技有限公司 一种适于硬件设计实现的深度卷积神经网络的计算方法
CN106897131B (zh) * 2017-02-22 2020-05-29 浪潮(北京)电子信息产业有限公司 一种用于天文软件Gridding的并行计算方法及其装置
CN106991473A (zh) * 2017-03-30 2017-07-28 中国人民解放军国防科学技术大学 面向向量处理器的基于simd的平均值值池化并行处理方法
CN107797962B (zh) * 2017-10-17 2021-04-16 清华大学 基于神经网络的计算阵列
CN107749044A (zh) * 2017-10-19 2018-03-02 珠海格力电器股份有限公司 图像信息的池化方法及装置
CN107862650B (zh) * 2017-11-29 2021-07-06 中科亿海微电子科技(苏州)有限公司 加速计算二维图像cnn卷积的方法

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111340085A (zh) * 2020-02-20 2020-06-26 深圳市商汤科技有限公司 数据处理方法及装置、处理器、电子设备、存储介质
CN111340085B (zh) * 2020-02-20 2024-03-08 深圳市商汤科技有限公司 数据处理方法及装置、处理器、电子设备、存储介质

Also Published As

Publication number Publication date
CN110399977A (zh) 2019-11-01

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19792269

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19792269

Country of ref document: EP

Kind code of ref document: A1