WO2020238106A1 - Data processing method, electronic device, and computer-readable storage medium - Google Patents

Data processing method, electronic device, and computer-readable storage medium Download PDF

Info

Publication number
WO2020238106A1
Authority
WO
WIPO (PCT)
Prior art keywords
memory
processing element
processing elements
output data
processing
Prior art date
Application number
PCT/CN2019/121602
Other languages
English (en)
French (fr)
Inventor
李炜
曹庆新
Original Assignee
深圳云天励飞技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳云天励飞技术有限公司 filed Critical 深圳云天励飞技术有限公司
Priority to US17/257,324 priority Critical patent/US11061621B2/en
Publication of WO2020238106A1 publication Critical patent/WO2020238106A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0655Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
    • G06F3/0659Command handling arrangements, e.g. command buffers, queues, command scheduling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0604Improving or facilitating administration, e.g. storage management
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0668Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671In-line storage system
    • G06F3/0673Single storage device
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks


Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Human Computer Interaction (AREA)
  • Neurology (AREA)
  • Image Analysis (AREA)
  • Advance Control (AREA)
  • Multi Processors (AREA)
  • Memory System (AREA)

Abstract

A data processing method, an electronic device, and a computer-readable storage medium. The method includes: the electronic device sends, through the processor, N storage requests to the memory in parallel Q times within each polling round, where the N storage requests are used to request the memory to store one row of output data generated by each of N consecutively-identified processing elements among the M processing elements, and Q is determined according to the number M of processing elements and the number N of storage requests; and the electronic device stores back, through the memory in the P-th polling round, the P-th row of output data generated by each of the M processing elements according to the Q×N received storage requests. This method solves the problem of low data write-back efficiency in existing neural network models: through the triggering of parallel requests, multiple pieces of data can be stored simultaneously, thereby improving write-back efficiency.

Description

Data processing method, electronic device, and computer-readable storage medium

This application claims priority to Chinese Patent Application No. 201910444607.1, filed with the Chinese Patent Office on May 24, 2019 and entitled "Data processing method, electronic device, and computer-readable storage medium", the entire contents of which are incorporated herein by reference.

Technical Field

This application relates to the field of information processing technology, and in particular to a data processing method, an electronic device, and a computer-readable storage medium.

Background

A neural network is an algorithmic mathematical model that imitates the behavioral characteristics of animal neural networks and performs distributed, parallel information processing. Such a network is composed of a large number of interconnected nodes (also called neurons); by adjusting the interconnections among its many internal nodes, it uses input data and weights to produce output data, simulating the information-processing process of the human brain to process information and generate pattern-recognition results.

For a neural network operation with multiple layers, the input neurons and output neurons of the multi-layer operation do not refer to the neurons in the input layer and the neurons in the output layer of the entire neural network; rather, for any two adjacent layers of the network, the neurons in the lower layer of the forward operation are the input neurons, and the neurons in the upper layer of the forward operation are the output neurons. Taking a convolutional neural network model as an example, suppose a convolutional neural network has L layers, K = 1, 2, ..., L-1; for layer K and layer K+1, layer K is called the input layer, whose neurons are the input neurons, and layer K+1 is called the output layer, whose neurons are the output neurons. That is, except for the topmost layer, every layer can serve as an input layer, and its next layer is the corresponding output layer.

During the computation of a neural network model, the output result computed by the previous layer serves as the input of the computation of the next layer. In general, the processor stores the output result computed by the previous layer of the neural network model into the corresponding output data buffer; when computing the next layer, the processor must first read the stored output result of the previous layer from the output data buffer and then use it as the input of the current layer. In the prior art, this process of storing the previous layer's output result into the corresponding output data buffer has low write-back efficiency.
Summary

Embodiments of this application provide a data processing method, an electronic device, and a computer-readable storage medium to solve the problem of low data write-back efficiency in existing neural network models; through the triggering of parallel requests, multiple pieces of data can be stored simultaneously, which improves write-back efficiency.

In a first aspect, an embodiment of this application provides a data processing method applied to an electronic device. The electronic device includes a processor and a memory, the processor includes M processing elements arranged in order of identifier, and M is a positive integer. The method includes:

the electronic device sending, through the processor, N storage requests to the memory in parallel Q times within each polling round, where the N storage requests are used to request the memory to store one row of output data generated by each of N consecutively-identified processing elements among the M processing elements, and Q is determined according to the number M of processing elements and the number N of storage requests; and

the electronic device storing back, through the memory in the P-th polling round, the P-th row of output data generated by each of the M processing elements according to the Q×N received storage requests.

In one possible implementation, the N storage requests correspond one-to-one to the N consecutively-identified processing elements, and each storage request includes a row of output data generated by the corresponding processing element and the first address at which that row of output data is to be stored in the memory;

the electronic device storing back, through the memory in the P-th polling round, the P-th row of output data generated by each of the M processing elements according to the Q×N received storage requests includes:

the electronic device storing, through the memory in the P-th polling round, the P-th row of output data generated by each of the M processing elements at the first addresses at which those rows of output data are to be stored in the memory.

In one possible implementation, the method further includes:

obtaining the flag-bit parameter corresponding to each of the M processing elements, and determining, according to the flag-bit parameters, the first address at which the row of output data generated by each of the M processing elements is to be stored in the memory.

In one possible implementation, the flag-bit parameter includes a first flag-bit parameter, which is the flag-bit parameter corresponding to the i-th processing element, where i is a positive integer less than or equal to M; the method further includes:

grouping the M processing elements to obtain T processing element groups;

determining, according to the flag-bit parameters, the first address at which the row of output data generated by each of the M processing elements is to be stored in the memory includes:

when the first flag-bit parameter is the first parameter, the first address at which the row of output data generated by the i-th processing element is to be stored in memory is: addr_start(i) = dm_init_addr + co_size*n (n = 1, 2, ..., T), where dm_init_addr denotes the initialization address, co_size denotes the size of one output channel of each network layer in the neural network model, and n denotes the sequence number of the processing element group; the first parameter is used to determine the lowest processing element within each of the T processing element groups;

when the first flag-bit parameter is the second parameter, the first address at which the row of output data generated by the i-th processing element is to be stored in memory is: addr_start(i) = addr_start(i-1) + 16, where addr_start(i-1) denotes the first address at which the row of output data generated by the previous processing element is to be stored in memory; the second parameter is used to exclude invalid processing elements among the M processing elements; and

when the first flag-bit parameter is the third parameter, the first address at which the row of output data generated by the i-th processing element is to be stored in memory is: addr_start(i) = addr_start(i-1) + 16, where addr_start(i-1) denotes the first address at which the row of output data generated by the previous processing element is to be stored in memory; the third parameter is used to determine the highest processing element within each of the T processing element groups.

In one possible implementation, grouping the M processing elements includes:

obtaining the width of the output channel of each network layer in the neural network model;

determining, according to the width of the output channel of each network layer, the number S of processing elements contained in one processing element group; and

grouping the M processing elements according to the number S of processing elements contained in one processing element group to obtain the T processing element groups.

In one possible implementation, Q is obtained by dividing M by N and rounding up.

In one possible implementation, a processing element generating output data includes:

obtaining input data and a computation instruction, where the input data includes weight data, input neuron data, and the configuration parameters required for the computation; and

performing a neural network computation according to the input data and the computation instruction to obtain output data.

In a second aspect, an embodiment of this application provides an electronic device including a processor and a memory, where the processor, an input device, an output device, and the memory are interconnected; the memory is used to store a computer program supporting the terminal in executing the above method, the computer program includes program instructions, and the processor is configured to invoke the program instructions to execute the method of the first aspect above.

In a third aspect, an embodiment of this application provides a computer-readable storage medium; the computer storage medium stores a computer program, the computer program includes program instructions, and the program instructions, when executed by a processor, cause the processor to execute the method of the first aspect above.

In a fourth aspect, an embodiment of this application provides a computer program; the computer program includes program instructions, and the program instructions, when executed by a processor, cause the processor to execute the method of the first aspect above.

Implementing the embodiments of this application has the following beneficial effects:

In the embodiments of this application, the processor sends multiple storage requests to the memory in parallel, multiple times within one polling round; the memory can then simultaneously store, according to the multiple storage requests, the output data generated by each of multiple processing elements, which solves the problem of low data write-back efficiency in existing neural network models and improves data write-back efficiency. On this basis, when performing neural network computations, the computational efficiency of the neural network model can be improved.
Brief Description of the Drawings

To describe the technical solutions of the embodiments of this application more clearly, the drawings needed in the description of the embodiments are briefly introduced below.

FIG. 1 is a schematic structural diagram of a processing element provided by an embodiment of this application;

FIG. 2 is a schematic structural diagram of 32 processing elements provided by an embodiment of this application;

FIG. 3 is a schematic diagram of a data storage format provided by an embodiment of this application;

FIG. 4 is a schematic flowchart of a data processing method provided by an embodiment of this application;

FIG. 5 is a schematic diagram of determining the first address at which output data is to be stored in a memory, provided by an embodiment of this application;

FIG. 6 is a schematic structural diagram of an electronic device provided by an embodiment of this application.
Detailed Description

In the embodiments of this application, referring to FIG. 1, a processing element PE for implementing neural network operations provided by an embodiment of this application includes: a first buffer 11 (that is, an input buffer), configured to store input data and the weights corresponding to the input data; an arithmetic unit 12, configured to perform a neural network computation based on the input data and generate output data, where the neural network computation may be a convolutional neural network computation or another type of neural network computation; and a second buffer 13 (that is, an output buffer), configured to store the output data.

Further, as shown in FIG. 1, the first buffer 11 may include, for example, an input data buffer 111 and a weight data buffer 112, where the input data buffer 111 is configured to store the input data, and the weight data buffer 112 is configured to store the weights corresponding to the input data.

When performing a convolution operation on, for example, an image, multiple PEs are usually used, with convolution performed separately on the image data of different parts of the image. Each PE is a SIMD processor of digit width m (or a vector processor of digit width m). In addition, each PE has its own instruction buffer IQ, instruction decoding, and control logic. Each PE can execute an independent convolutional neural network (CNN) computation; alternatively, multiple adjacent PEs can be combined to jointly execute one CNN computation.

In one possible implementation, the processor includes multiple processing elements (Processing Element, PE) arranged in order of identifier; the ordering by identifier can be expressed as PE0, PE1, ..., PEn. As shown in FIG. 2, suppose the processor has 32 PEs (PE0-PE31) and each PE has 7 MAC units, so the processor has 224 MACs. Each PE is a 7-wide SIMD processor. Each PE has its own instruction buffer (IQ), instruction decoding, and control logic.

In each PE there are three local buffers: i) IBUF (corresponding to the input data buffer), used to store the input data ci; ii) WBUF (corresponding to the weight data buffer), used to store the weights; and iii) OBUF (corresponding to the second buffer), used to store the output data co.

In the embodiments of this application, the storage format of data in the memory may be as shown in FIG. 3. Taking a convolutional neural network as the neural network model, each element of a feature map is 16 bits, and the data of each channel is stored contiguously in the memory row by row. It can be understood that the input feature map and output feature map computed by each network layer of the convolutional neural network model are stored in this contiguous row-by-row format.
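Under this layout, the byte address of a feature-map element follows directly from its channel, row, and column indices. The following is a minimal Python sketch of that arithmetic, assuming a bare base address and no padding between rows or channels; the patent only states that elements are 16 bits and each channel is stored contiguously in rows, so the function and parameter names here are illustrative.

```python
# A minimal sketch of the row-major, per-channel layout of FIG. 3 (assumed
# padding-free; only the 16-bit element size and contiguous rows come from
# the text).

ELEM_BYTES = 2  # each feature-map element is 16 bits

def element_address(base, channel, row, col, height, width):
    """Byte address of feature-map element (channel, row, col)."""
    channel_size = height * width * ELEM_BYTES  # one channel, stored contiguously
    row_size = width * ELEM_BYTES               # rows laid out one after another
    return base + channel * channel_size + row * row_size + col * ELEM_BYTES

# Element (channel 1, row 2, col 3) of a 4x8 feature map based at 0x1000:
print(hex(element_address(0x1000, channel=1, row=2, col=3, height=4, width=8)))
```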
The first, second, third, and fourth parameters involved in the embodiments of this application are introduced next.

The first parameter, pe_low_vld_mask[M-1:0], marks the low-order valid processing element of each processing element group. Taking M = 32 processing elements divided into 8 processing element groups as an example, when pe_low_vld_mask[M-1:0] = 0x11111111, the flag-bit parameters corresponding to the 32 processing elements may be as shown in Table 1:

Table 1
[Table image in the original; with pe_low_vld_mask = 0x11111111, the flag bit is 1 for PE0, PE4, PE8, ..., PE28 and 0 for all other PEs.]

As Table 1 shows, the value of the first parameter determines which PE is the lowest within each processing element group. Taking the PE group PE0-PE3 (PE_GROUP0) as an example, PE0 is the lowest PE in PE_GROUP0. Since the grouping of the M processing elements can be determined from the first parameter, this implementation facilitates the subsequent determination of the first address of each of the M processing elements and can increase storage speed.

The second parameter, pe_high_vld_mask[M-1:0], marks the high-order valid processing element of each processing element group. With M = 32 processing elements divided into 8 groups, when pe_high_vld_mask[M-1:0] = 0x88888888, the flag-bit parameters corresponding to the 32 processing elements may be as shown in Table 2:

Table 2
[Table image in the original; with pe_high_vld_mask = 0x88888888, the flag bit is 1 for PE3, PE7, PE11, ..., PE31 and 0 for all other PEs.]

As Table 2 shows, the value of the second parameter determines which PE is the highest within each processing element group. Taking the PE group PE0-PE3 (PE_GROUP0) as an example, PE3 is the highest PE in PE_GROUP0. Since the grouping of the M processing elements can be determined from the second parameter, this implementation facilitates the subsequent determination of the first address of each of the M processing elements and can increase storage speed.

The third parameter, pe_mac_mask[M-1:0], indicates whether a processing element is valid. With M = 32 processing elements divided into 8 groups, when pe_mac_mask[M-1:0] = 0x77777777, the flag-bit parameters corresponding to the 32 processing elements may be as shown in Table 3:

Table 3
[Table image in the original: flag-bit parameters of the 32 processing elements for pe_mac_mask = 0x77777777.]

As Table 3 shows, taking the PE group PE0-PE3 (PE_GROUP0) as an example, the value of the third parameter determines that the actually valid processing elements are PE0-PE3, while PE4 is an invalid processing element that produces no valid result. In the embodiments of this application, no storage request is generated for a processing element PE whose bit in pe_mac_mask[M-1:0] is 0. This implementation excludes invalid PEs, avoids useless data writes and reads, and can improve the computational efficiency of the neural network.

The fourth parameter, mac_boundary, indicates how many MACs are valid within the high-order valid processing element PE. For example, if the high-order valid PE in a pe_group has 8 MACs, then when mac_boundary = 0x7f, the flag-bit parameters corresponding to these 8 MACs may be as shown in Table 4:

Table 4
mac                   7  6  5  4  3  2  1  0
Flag-bit parameter    0  1  1  1  1  1  1  1

As Table 4 shows, the finally produced co uses only 7 MACs; in this case, mac7 is invalid. It should be noted that data produced by invalid MACs does not need to be stored in the memory. This implementation excludes invalid MACs, avoids useless data writes and reads, and can improve the computational efficiency of the neural network.
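As a quick check of how these masks read out, the Python sketch below decodes them bit by bit. M = 32 and the group size of 4 match the running example; the bit ordering (PE i at bit i) and the function name are illustrative assumptions.

```python
# A minimal sketch of reading the four mask parameters described above.

M = 32

def set_bits(mask, width=M):
    """Indices whose flag bit is 1, assuming index i sits at bit i."""
    return [i for i in range(width) if (mask >> i) & 1]

pe_low_vld_mask  = 0x11111111  # first parameter: lowest PE of each group
pe_high_vld_mask = 0x88888888  # second parameter: highest PE of each group
pe_mac_mask      = 0x77777777  # third parameter: valid PEs (no request for 0-bits)
mac_boundary     = 0x7f        # fourth parameter: valid MACs in the high PE

print("group-low PEs: ", set_bits(pe_low_vld_mask))   # PE0, PE4, ..., PE28
print("group-high PEs:", set_bits(pe_high_vld_mask))  # PE3, PE7, ..., PE31
print("valid PEs:     ", set_bits(pe_mac_mask))
print("valid MACs:    ", set_bits(mac_boundary, width=8))  # Table 4: mac7 invalid
```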
In the prior art, the processor can send only one storage request at a time, so the memory can store the output data generated by only one processing element at a time, which easily causes low data write-back efficiency in the neural network model. To solve this technical problem of the prior art, the present invention provides a data processing method, an electronic device, and a computer-readable storage medium that trigger multiple storage requests in parallel, so as to improve the write-back efficiency of data in the neural network model and thereby increase the computation speed of the neural network model.

On this basis, with reference to the schematic flowchart of a data processing method shown in FIG. 4, the following describes specifically how data is stored back in the embodiments of this application. The method may include, but is not limited to, the following steps:
Step S301: the electronic device sends, through the processor, N storage requests to the memory in parallel Q times within each polling round, where the N storage requests are used to request the memory to store one row of output data generated by each of N consecutively-identified processing elements among the M processing elements, and Q is determined according to the number M of processing elements and the number N of storage requests.

In the embodiments of this application, polling is a periodically repeated process. In practical applications, one polling round may include Q individual polling operations; in terms of this application, one polling round may include Q triggerings of N storage requests. When the processor sends N storage requests Q times in parallel to the memory within one polling round, this means that the processor instructs the memory to store, in the P-th polling round, the P-th row of output data corresponding to each of the M processing elements.

In the embodiments of this application, the number of polling rounds may be determined according to the number of rows of output data generated by the processing elements. For example, if the number of rows of output data generated by a processing element is J, then once the processor has sent storage requests over J polling rounds, the memory can store the J-th row of output data generated by the processing element according to the storage requests; at the same time, this also means that the memory has finished storing the output data generated by the processing element according to the storage requests.

In one possible implementation, the number of polls Q included in one polling round may be determined according to the number M of processing elements and the number N of storage requests. For example, M = 32 and N = 4 mean the number of polls within one polling round is 8. It can further be seen that, in this case, the N storage requests sent in parallel in every poll of a polling round are all valid.

In one possible implementation, Q is obtained by dividing M by N and rounding up.

In practical applications, for example, with the number of processing elements M = 32 and the number of storage requests N = 7, the number of polls within one polling round is 5. It can be understood that in the 5th poll of a polling round the processor sends 7 storage requests to the memory (for example, the 7 storage requests can be denoted A1, A2, ..., A7); of these 7 storage requests, A1-A4 are valid and A5-A7 are invalid. Here, a storage request being valid means that the memory can store, according to the storage request, the output data generated by the processing element.
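As a quick illustration of this arithmetic, the Python sketch below computes Q = ceil(M / N) and lists which request slots in each poll map to a real processing element; the shape of the yielded values is an illustrative assumption.

```python
# A minimal sketch of the polling arithmetic above: Q = ceil(M / N) polls per
# polling round, with request slots past PE (M - 1) in the last poll invalid.
import math

def polling_plan(m_pes, n_requests):
    """Yield (poll_index, valid_pe_ids) for one polling round."""
    q = math.ceil(m_pes / n_requests)  # Q polls per polling round
    for poll in range(q):
        first = poll * n_requests
        # Request slots beyond the last PE carry no storable output data.
        yield poll, [pe for pe in range(first, first + n_requests) if pe < m_pes]

# M = 32, N = 7 -> 5 polls; the 5th poll has only 4 valid requests (PE28-PE31).
for poll, pes in polling_plan(32, 7):
    print(f"poll {poll + 1}: valid requests for PEs {pes}")
```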
In one possible implementation, a processing element generating output data includes:

obtaining input data and a computation instruction, where the input data includes weight data and input neuron data; and

performing a neural network computation according to the input data and the computation instruction to obtain output data.

In the embodiments of this application, the input data and the computation instruction may be obtained through a data input/output unit, which may specifically be one or more data I/O interfaces or I/O pins.

Further, the above computation instruction may include, but is not limited to, neural network operation instructions (for example, convolutional neural network operation instructions), forward operation instructions, backward operation instructions, and so on; the specific embodiments of this application do not limit the specific form of the above computation instruction.

An operation in the neural network may be the operation of one layer of the neural network. For a multi-layer neural network, the implementation process is as follows: in the forward operation, after execution of the previous layer of the neural network finishes, the operation instruction of the next layer takes the output neurons computed in the arithmetic unit (that is, the output data) as the input neurons of the next layer for computation (or performs certain operations on those output neurons before using them as the input neurons of the next layer), and at the same time replaces the weights with the weights of the next layer; in the backward operation, after the backward operation of the previous layer of the neural network finishes, the operation instruction of the next layer takes the input-neuron gradients computed in the arithmetic unit as the output-neuron gradients of the next layer for computation (or performs certain operations on those input-neuron gradients before using them as the output-neuron gradients of the next layer), and at the same time replaces the weights with the weights of the next layer.

For a neural network operation with multiple layers, the input neurons and output neurons of the multi-layer operation do not refer to the neurons in the input layer and the neurons in the output layer of the entire neural network; rather, for any two adjacent layers of the network, the neurons in the lower layer of the forward operation are the input neurons, and the neurons in the upper layer of the forward operation are the output neurons. Taking a convolutional neural network as an example, suppose a convolutional neural network has L layers, K = 1, 2, ..., L-1; for layer K and layer K+1, layer K is called the input layer, whose neurons are the input neurons, and layer K+1 is called the output layer, whose neurons are the output neurons. That is, except for the topmost layer, every layer can serve as an input layer, and its next layer is the corresponding output layer.

As described above, in the embodiments of this application, for the i-th processing element (where i is a positive integer less than or equal to M), the input data it obtains is cached in the first buffer, and the output data it generates according to the input data and the computation instruction is cached in the second buffer.

In the embodiments of this application, determining the first address at which a row of output data generated by the M processing elements is to be stored in the memory includes:

obtaining the flag-bit parameter corresponding to each of the M processing elements, and determining, according to the flag-bit parameters, the first address at which the row of output data generated by each of the M processing elements is to be stored in the memory.

Specifically, the flag-bit parameters corresponding to the M processing elements are scanned simultaneously in a set scan order. In the embodiments of this application, the set scan order may be from low to high, from high to low, and so on, which is not specifically limited in the embodiments of this application.

In the embodiments of this application, taking the i-th processing element as an example, its corresponding flag-bit parameter may be any one of the first parameter, the second parameter, or the third parameter.

Further optionally, the flag-bit parameters may also include a fourth parameter, which is used to exclude invalid MACs within a processing element.
In a specific implementation, the flag-bit parameter includes a first flag-bit parameter, which is the flag-bit parameter corresponding to the i-th processing element, where i is a positive integer less than or equal to M. The method further includes:

grouping the M processing elements to obtain T processing element groups.

Determining, according to the flag-bit parameters, the first address at which the row of output data generated by each of the M processing elements is to be stored in memory includes:

When the first flag-bit parameter is the first parameter, the first address at which the row of output data generated by the i-th processing element is to be stored in memory is: addr_start(i) = dm_init_addr + co_size*n (n = 1, 2, ..., T), where dm_init_addr denotes the initialization address, co_size denotes the size of one output channel of each network layer in the neural network model, and n denotes the sequence number of the processing element group; the first parameter is used to determine the lowest processing element within each of the T processing element groups.

In one possible implementation, the sequence numbers of the processing element groups may start from 0. For example, when T = 8, the 8 groups may be denoted processing element group 0, processing element group 1, ..., processing element group 7.

In another possible implementation, the sequence numbers of the processing element groups may start from 1. For example, when T = 8, the 8 groups may be denoted processing element group 1, processing element group 2, ..., processing element group 8. For ease of exposition, in the embodiments of this application, n = 1, 2, ..., T.

As shown in FIG. 5, taking processing element group 0 and processing element group 1 as examples, group 0 contains processing elements PE0-PE3 and group 1 contains processing elements PE4-PE7. In practical applications, the OBUFs corresponding to group 0 and group 1 each store output data produced by two output channels; for example, the OBUF of PE0 stores co0 and co8. In one scenario, the flag-bit parameters corresponding to these 8 processing elements may be as shown in Table 5:

Table 5
Processing element    7  6  5  4  3  2  1  0
Flag-bit parameter    0  0  0  1  0  0  0  1

As Table 5 shows, the lowest processing element in group 0 is PE0. When the memory stores co0, the first address corresponding to PE0 is addr0; when the memory stores co8, the first address corresponding to PE0 is addr0 + co_size*1.

When the first flag-bit parameter is the second parameter, the first address at which the row of output data generated by the i-th processing element is to be stored in memory is: addr_start(i) = addr_start(i-1) + 16, where addr_start(i-1) denotes the first address at which the row of output data generated by the previous processing element is to be stored in memory; the second parameter is used to exclude invalid processing elements among the M processing elements.

As shown in FIG. 5, taking processing element group 0 (containing processing elements PE0-PE3) as an example, in one scenario the flag-bit parameters corresponding to these 4 processing elements may be as shown in Table 6:

Table 6
Processing element    3  2  1  0
Flag-bit parameter    1  1  1  1

As Table 6 shows, PE0-PE3 are all valid processing elements. Assuming the first address corresponding to PE0 is addr0, the first address corresponding to PE1 is addr1 = addr0 + 16, the first address corresponding to PE2 is addr2 = addr1 + 16, and the first address corresponding to PE3 is addr3 = addr2 + 16.

When the first flag-bit parameter is the third parameter, the first address at which the row of output data generated by the i-th processing element is to be stored in memory is: addr_start(i) = addr_start(i-1) + 16, where addr_start(i-1) denotes the first address at which the row of output data generated by the previous processing element is to be stored in memory; the third parameter is used to determine the highest processing element within each of the T processing element groups.

As shown in FIG. 5, taking processing element group 0 (containing PE0-PE3) and processing element group 1 (containing PE4-PE7) as examples, in one scenario the flag-bit parameters corresponding to these 8 processing elements may be as shown in Table 7:

Table 7
Processing element    7  6  5  4  3  2  1  0
Flag-bit parameter    1  0  0  0  1  0  0  0

As Table 7 shows, the highest processing element in group 0 is PE3 and the highest processing element in group 1 is PE7. Assuming the first address corresponding to PE3 is addr0, then, since PE3-PE6 are invalid processing elements, the first address corresponding to PE4 is addr4 = addr0 + 16.
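The three rules above amount to a single left-to-right pass over the PEs. The Python sketch below walks through that pass; the per-PE flag encoding ("low" / "mid" / "high") is an illustrative assumption, while the stride of 16 and the formula addr_start(i) = dm_init_addr + co_size*n come from the text.

```python
# A minimal sketch of the three first-address rules above.

def first_addresses(flags, dm_init_addr, co_size, stride=16):
    """flags[i] in {"low", "mid", "high"}; returns addr_start for each PE."""
    addrs, group_no = [], 0
    for i, flag in enumerate(flags):
        if flag == "low":            # first parameter: lowest PE of a group
            group_no += 1            # n = 1, 2, ..., T in the patent's notation
            addrs.append(dm_init_addr + co_size * group_no)
        else:                        # second/third parameter: previous + 16
            addrs.append(addrs[i - 1] + stride)
    return addrs

# Group 0 = PE0..PE3 (PE0 is the group-low PE, PE3 the group-high PE):
print([hex(a) for a in first_addresses(["low", "mid", "mid", "high"],
                                       dm_init_addr=0x0, co_size=0x100)])
```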
In a specific implementation, grouping the M processing elements includes:

obtaining the width of the output channel of each network layer in the neural network model;

determining, according to the width of the output channel of each network layer, the number S of processing elements contained in one processing element group; and

grouping the M processing elements according to the number S of processing elements contained in one processing element group to obtain the T processing element groups.

In the embodiments of this application, to meet computational requirements, the number of processing elements contained in one processing element group may be determined according to the width of the output channel of each network layer of the neural network. Specifically, for a convolutional neural network, each convolution kernel has three dimensions: length, width, and depth. During computation of the convolutional neural network, the width of the output channel equals the width of the convolution kernel. For example, suppose the output feature map of a certain layer in the convolutional neural network model has 10 output channels, and 4 processing elements must be combined into one processing element group to complete the computation of one output channel. In this case, when M = 32, the 32 processing elements are divided into 8 groups, each processing element group contains 4 processing elements, and each group completes the computation of a different output channel; for example, PE_GROUP0 completes the computation of output channel 1, PE_GROUP1 completes the computation of output channel 2, and so on.
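The text does not give an explicit formula for S. A natural reading, given the 7-MAC PEs of FIG. 2, is that enough PEs are ganged together to cover the channel width, so the Python sketch below assumes S = ceil(channel_width / macs_per_pe) purely for illustration:

```python
# A minimal sketch of the grouping step under the assumption
# S = ceil(channel_width / macs_per_pe); the patent only says S is
# determined from the output-channel width.
import math

def group_pes(m_pes, channel_width, macs_per_pe=7):
    s = math.ceil(channel_width / macs_per_pe)  # assumed PEs per group
    t = m_pes // s                              # number of groups T
    return [list(range(g * s, (g + 1) * s)) for g in range(t)]

# A channel 28 elements wide needs 4 seven-MAC PEs per group: 32 PEs -> 8 groups.
for n, group in enumerate(group_pes(32, 28)):
    print(f"PE_GROUP{n}: PEs {group}")
```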
Step S302: the electronic device stores back, through the memory in the P-th polling round, the P-th row of output data generated by each of the M processing elements according to the Q×N received storage requests.

In a specific implementation, the N storage requests correspond one-to-one to the N consecutively-identified processing elements, and each storage request includes a row of output data generated by the corresponding processing element and the first address at which that row of output data is to be stored in the memory.

The electronic device storing back, through the memory in the P-th polling round, the P-th row of output data generated by each of the M processing elements according to the Q×N received storage requests includes:

the electronic device storing, through the memory in the P-th polling round, the P-th row of output data generated by each of the M processing elements at the first addresses at which those rows of output data are to be stored in the memory.

In the embodiments of this application, each storage request further includes identification information of the corresponding processing element, which can be used to distinguish different storage requests.

In the embodiments of this application, within one polling round, the electronic device may send N storage requests Q times in parallel from the processor to the memory. Taking M = 32, N = 4 storage requests, and Q = 8 polls as an example: in the 1st poll within a polling round, the 4 storage requests are used to request the memory to store one row of output data generated by each of the 4 consecutively-identified processing elements PE0-PE3; in the 2nd poll within the polling round, the 4 storage requests are used to request the memory to store one row of output data generated by each of the 4 consecutively-identified processing elements PE4-PE7; and, it can be understood, in the 8th poll within the polling round, the 4 storage requests are used to request the memory to store one row of output data generated by each of the 4 consecutively-identified processing elements PE28-PE31. It can thus be seen that after one polling round the memory has stored one row of output data generated by each of the 32 processing elements.

Then, after one polling round, one row of output data generated by each of the M processing elements has been stored. In this case, the address of the second buffer corresponding to each of the M processing elements is updated; for example, the updated address is:

addr(Q) = addr_start + co_line_num * co_line_size, where addr_start denotes the first address corresponding to each of the 32 processing elements, co_line_num denotes the row index of the output data, and co_line_size denotes the size of each row of output data.

It can be understood that in the next polling round the memory stores another row of output data generated by each of the M processing elements according to the preset rule; for example, in the 2nd polling round the memory stores the 2nd row of output data generated by each of the M processing elements. When the number of polling rounds equals the number of rows of output data generated by the processing elements, the memory has finished storing the multiple rows of output data generated by each of the M processing elements.
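A minimal Python sketch of this per-round address update, using the names from the formula above (the concrete base address and row size are illustrative assumptions):

```python
# Per-round address update: addr(Q) = addr_start + co_line_num * co_line_size.

def row_address(addr_start, co_line_num, co_line_size):
    """First address of output-data row co_line_num (0-based) for one PE."""
    return addr_start + co_line_num * co_line_size

addr_start, co_line_size, total_rows = 0x2000, 64, 4
for p in range(total_rows):  # one polling round per row of output data
    print(f"polling round {p + 1}: row stored at "
          f"{hex(row_address(addr_start, p, co_line_size))}")
```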
In the embodiments of this application, the memory may be composed of multiple static random access memories (SRAMs). If two of the addresses of the N storage requests (for example, 4 storage requests) map to the same SRAM while the other two map to other SRAMs, then two of the 4 requests will access the same SRAM, and a conflict occurs. To resolve the conflict, the storage requests that access the same SRAM at the same time must access that SRAM in two separate cycles. So, in this case, the memory controller completes the 3 non-conflicting SRAM storage requests in the first cycle and completes the remaining SRAM storage request in the second cycle. This implementation avoids storage conflicts during data write-back in the neural network model.
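The greedy per-cycle arbitration this describes can be sketched in a few lines of Python; the bank-mapping function (address // bank_size % num_banks) is an illustrative assumption, as the patent does not specify how addresses map to SRAMs.

```python
# A minimal sketch of the conflict resolution above: at most one request per
# SRAM per cycle, deferring same-SRAM requests to the next cycle.

def schedule(addresses, num_banks=4, bank_size=0x1000):
    """Split request addresses into per-cycle batches with no bank conflicts."""
    pending, cycles = list(addresses), []
    while pending:
        used_banks, batch, deferred = set(), [], []
        for addr in pending:
            bank = (addr // bank_size) % num_banks
            if bank in used_banks:
                deferred.append(addr)   # same SRAM already accessed this cycle
            else:
                used_banks.add(bank)
                batch.append(addr)
        cycles.append(batch)
        pending = deferred
    return cycles

# Two of the four requests map to bank 0 -> 3 requests in cycle 1, 1 in cycle 2.
print(schedule([0x0000, 0x1000, 0x2000, 0x4000]))
```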
For ease of understanding, the following is a specific example. Suppose a certain network layer in the neural network model has 10 output channels, and 4 processing elements must be combined into one processing element group to complete the computation of one output channel. In this case, when M = 32, the 32 processing elements are divided into 8 groups (processing element group 0 to processing element group 7), each group contains 4 processing elements, and each group completes the computation of a different output channel. After the computation of this layer finishes, the output data held in the second buffer of each processing element is as shown in FIG. 5: the OBUFs of processing element group 0 and processing element group 1 store output data produced by 2 output channels each, while the OBUFs of the other processing element groups store output data produced by 1 output channel each. It should be noted that, in FIG. 5, the number of processing elements actually participating in the computation in each group is two and a half; the highest processing element PE of each group produces no valid result, but this PE performs computation that supplies raw data to the lower-order PEs. The memory stores the output data co corresponding to each of the 8 processing element groups into the memory according to the storage requests, and this data participates in computation as the input data ci of the next layer.

Taking the number of storage requests sent in parallel from the processor to the memory being 4 as an example, the process by which the memory stores the output data co corresponding to each of the 8 processing element groups into the memory according to the storage requests may include:

The processor polls, sending 4 storage requests to the memory in parallel, with 8 polls in one polling round. In the 1st poll within a polling round, the memory stores one row of output data generated by each of the 4 processing elements PE0-PE3 according to the preset rule. When the memory stores the rows of output data generated by PE0-PE3 according to the 4 storage requests, the flag-bit parameters corresponding to the 4 processing elements are scanned from low to high. For example, the flag-bit parameters corresponding to the 4 processing elements are all the second parameter. Taking the 1st processing element as an example, when its flag-bit parameter is obtained as the second parameter, the first address at which the row of output data generated by the 1st processing element is to be stored in the memory is determined to be addr0. By analogy, the first address at which the row of output data generated by the 2nd processing element is to be stored in the memory is determined to be addr1, where addr1 = addr0 + 16; the first address for the 3rd processing element is addr2, where addr2 = addr1 + 16; and the first address for the 4th processing element is addr3, where addr3 = addr2 + 16. Afterwards, the memory stores the first row of output data generated by the 1st processing element (that is, PE0) at the first address (addr0) at which its output data is to be stored in the memory. Likewise, the memory stores the first row of output data generated by the 2nd processing element (that is, PE1) at the first address (addr1) at which its output data is to be stored; stores the first row of output data generated by the 3rd processing element (that is, PE2) at the first address (addr2); and stores the first row of output data generated by the 4th processing element (that is, PE3) at the first address (addr3).

Then, after the first polling round, the memory can complete the storage of the first row of output data corresponding to each of the 32 processing elements. In this case, the address of the second buffer corresponding to each of the 32 processing elements is updated.

Afterwards, the second row of output data corresponding to each of the 32 processing elements is stored in the second polling round according to the updated addresses, and the above flow is repeated until all rows of output data in co0-co7 in the OBUFs have been stored.

Further, after co0-co7 in the OBUFs have been stored, co8-co9 are stored. At this point, the co address must be switched; the switched address is: addr = addr_start(K) + obuf_co_num * co_size, where addr_start(K) denotes the initial address of the co, obuf_co_num denotes the sequence number of the co within the OBUF, and co_size denotes the size of the co. In this case, as shown in FIG. 5, since M = 8 and N = 4, the processor polls, sending 4 storage requests to the memory in parallel, with 2 polls in one polling round. In the 1st poll within the 1st polling round, the 4 storage requests are used to request the memory to store the 1st row of output data generated by each of the 4 consecutively-identified processing elements PE0-PE3; in the 2nd poll within the 1st polling round, the 4 storage requests are used to request the memory to store the 1st row of output data generated by each of the 4 consecutively-identified processing elements PE4-PE7. It can thus be seen that after the 1st polling round the memory can store the 1st row of output data generated by each of the 8 processing elements. In practical applications, for the specific implementation of storing the 2nd row of output data of the 8 processing elements, refer to the foregoing description, which is not repeated here.
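A minimal Python sketch of the co address switch just described, using the names from the formula (the concrete base address and co_size values are illustrative assumptions):

```python
# co address switch: addr = addr_start(K) + obuf_co_num * co_size.

def co_base_address(addr_start_k, obuf_co_num, co_size):
    """Base address of the co with index obuf_co_num held in one OBUF."""
    return addr_start_k + obuf_co_num * co_size

CO_SIZE = 0x400
for obuf_co_num, co in enumerate(("co0", "co8")):  # PE0's OBUF holds co0 and co8
    print(co, "->", hex(co_base_address(0x8000, obuf_co_num, CO_SIZE)))
```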
By implementing the embodiments of this application, the processor sends multiple storage requests to the memory in parallel within one cycle, and the memory stores the output data generated by multiple processing elements simultaneously according to the multiple storage requests, which solves the problem of low data write-back efficiency in existing neural network models and improves write-back efficiency, thereby improving the computational efficiency of the neural network model.

As shown in FIG. 6, which is a schematic structural diagram of an electronic device provided by an embodiment of this application, the electronic device may include: a processor 601, a memory 602, a communication bus 603, and a communication interface 604; the processor 601 is connected to the memory 602 and the communication interface 604 through the communication bus 603.

Optionally, the electronic device 60 may further include an artificial intelligence processor 605. The artificial intelligence processor 605 may be mounted on a host CPU as a coprocessor, with the host CPU assigning tasks to it. The artificial intelligence processor 605 may implement one or more of the operations involved in the data processing method described above. For example, taking a neural-network processing unit (NPU) as an example, the core part of the NPU is an arithmetic circuit, and a controller controls the arithmetic circuit to fetch the matrix data in the memory 602 and perform multiply-accumulate operations.

In the embodiments of this application, the electronic device sends, through the processor 601, N storage requests to the memory 602 in parallel Q times within each polling round, where the N storage requests are used to request the memory 602 to store one row of output data generated by each of N consecutively-identified processing elements among the M processing elements, and Q is determined according to the number M of processing elements and the number N of storage requests;

the electronic device stores back, through the memory 602 in the P-th polling round, the P-th row of output data generated by each of the M processing elements according to the Q×N received storage requests.

The N storage requests correspond one-to-one to the N consecutively-identified processing elements, and each storage request includes a row of output data generated by the corresponding processing element and the first address at which that row of output data is to be stored in the memory;

the electronic device storing back, through the memory 602 in the P-th polling round, the P-th row of output data generated by each of the M processing elements according to the Q×N received storage requests includes:

the electronic device storing, through the memory 602 in the P-th polling round, the P-th row of output data generated by each of the M processing elements at the first addresses at which those rows of output data are to be stored in the memory 602.

The method further includes:

obtaining the flag-bit parameter corresponding to each of the M processing elements, and determining, according to the flag-bit parameters, the first address at which the row of output data generated by each of the M processing elements is to be stored in the memory.

The flag-bit parameter includes a first flag-bit parameter, which is the flag-bit parameter corresponding to the i-th processing element, where i is a positive integer less than or equal to M; the method further includes:

the processor 601 grouping the M processing elements to obtain T processing element groups;

the processor 601 determining, according to the flag-bit parameters, the first address at which the row of output data generated by each of the M processing elements is to be stored in memory includes:

when the first flag-bit parameter is the first parameter, the first address at which the row of output data generated by the i-th processing element is to be stored in memory is: addr_start(i) = dm_init_addr + co_size*n (n = 1, 2, ..., T), where dm_init_addr denotes the initialization address, co_size denotes the size of one output channel of each network layer in the neural network model, and n denotes the sequence number of the processing element group; the first parameter is used to determine the lowest processing element within each of the T processing element groups;

when the first flag-bit parameter is the second parameter, the first address at which the row of output data generated by the i-th processing element is to be stored in memory is: addr_start(i) = addr_start(i-1) + 16, where addr_start(i-1) denotes the first address at which the row of output data generated by the previous processing element is to be stored in memory; the second parameter is used to exclude invalid processing elements among the M processing elements;

when the first flag-bit parameter is the third parameter, the first address at which the row of output data generated by the i-th processing element is to be stored in memory is: addr_start(i) = addr_start(i-1) + 16, where addr_start(i-1) denotes the first address at which the row of output data generated by the previous processing element is to be stored in memory; the third parameter is used to determine the highest processing element within each of the T processing element groups.

When grouping the M processing elements, the processor 601:

obtains the width of the output channel of each network layer in the neural network model;

determines, according to the width of the output channel of each network layer, the number S of processing elements contained in one processing element group; and

groups the M processing elements according to the number S of processing elements contained in one processing element group to obtain the T processing element groups.

An embodiment of this application further provides a computer storage medium for storing the computer software instructions used by the electronic device shown in FIG. 6 above, which contain the programs involved in executing the above method embodiments. By executing the stored programs, the write-back efficiency of data in the neural network model can be improved.

As can be seen from the above, the embodiments of this application provide a data processing method, an electronic device, and a computer-readable storage medium in which, through the triggering of parallel requests, multiple pieces of data can be stored simultaneously, thereby improving write-back efficiency.

Claims (10)

  1. A data processing method, applied to an electronic device, wherein the electronic device comprises a processor and a memory, the processor comprises M processing elements arranged in order of identifier, and M is a positive integer, the method comprising:
    sending, by the electronic device through the processor, N storage requests to the memory in parallel Q times within each polling round, wherein the N storage requests are used to request the memory to store one row of output data generated by each of N consecutively-identified processing elements among the M processing elements, and Q is determined according to the number M of processing elements and the number N of storage requests; and
    storing back, by the electronic device through the memory in a P-th polling round, the P-th row of output data generated by each of the M processing elements according to Q×N received storage requests.
  2. The method according to claim 1, wherein the N storage requests correspond one-to-one to the N consecutively-identified processing elements, and each storage request comprises a row of output data generated by the corresponding processing element and a first address at which that row of output data is to be stored in the memory;
    the storing back, by the electronic device through the memory in the P-th polling round, of the P-th row of output data generated by each of the M processing elements according to the Q×N received storage requests comprises:
    storing, by the electronic device through the memory in the P-th polling round, the P-th row of output data generated by each of the M processing elements at the first addresses at which those rows of output data are to be stored in the memory.
  3. The method according to claim 1, further comprising:
    obtaining a flag-bit parameter corresponding to each of the M processing elements, and determining, according to the flag-bit parameters, a first address at which the row of output data generated by each of the M processing elements is to be stored in the memory.
  4. The method according to claim 3, wherein the flag-bit parameter comprises a first flag-bit parameter, the first flag-bit parameter being the flag-bit parameter corresponding to an i-th processing element, where i is a positive integer less than or equal to M, the method further comprising:
    grouping the M processing elements to obtain T processing element groups;
    wherein the determining, according to the flag-bit parameters, of the first address at which the row of output data generated by each of the M processing elements is to be stored in memory comprises:
    when the first flag-bit parameter is a first parameter, the first address at which the row of output data generated by the i-th processing element is to be stored in memory is: addr_start(i) = dm_init_addr + co_size*n (n = 1, 2, ..., T), wherein dm_init_addr denotes an initialization address, co_size denotes the size of one output channel of each network layer in a neural network model, and n denotes the sequence number of the processing element group; the first parameter is used to determine the lowest processing element within each of the T processing element groups;
    when the first flag-bit parameter is a second parameter, the first address at which the row of output data generated by the i-th processing element is to be stored in memory is: addr_start(i) = addr_start(i-1) + 16, wherein addr_start(i-1) denotes the first address at which the row of output data generated by the previous processing element is to be stored in memory; the second parameter is used to exclude invalid processing elements among the M processing elements; and
    when the first flag-bit parameter is a third parameter, the first address at which the row of output data generated by the i-th processing element is to be stored in memory is: addr_start(i) = addr_start(i-1) + 16, wherein addr_start(i-1) denotes the first address at which the row of output data generated by the previous processing element is to be stored in memory; the third parameter is used to determine the highest processing element within each of the T processing element groups.
  5. The method according to claim 4, wherein the grouping of the M processing elements comprises:
    obtaining the width of the output channel of each network layer in the neural network model;
    determining, according to the width of the output channel of each network layer, the number S of processing elements contained in one processing element group; and
    grouping the M processing elements according to the number S of processing elements contained in one processing element group to obtain the T processing element groups.
  6. The method according to claim 1, wherein Q is obtained by dividing M by N and rounding up.
  7. The method according to claim 1, wherein a processing element generating output data comprises:
    obtaining input data and a computation instruction, wherein the input data comprises weight data, input neuron data, and configuration parameters required for the computation; and
    performing a neural network computation according to the input data and the computation instruction to obtain the output data.
  8. An electronic device, comprising a processor and a memory, wherein the processor, an input device, an output device, and the memory are interconnected, the memory is configured to store a computer program, the computer program comprises program instructions, and the processor is configured to invoke the program instructions to execute the method according to any one of claims 1-7.
  9. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, the computer program comprises program instructions, and the program instructions, when executed by a processor, cause the processor to execute the method according to any one of claims 1-7.
  10. A computer program, wherein the computer program comprises program instructions, and the program instructions, when executed by a processor, cause the processor to execute the method according to any one of claims 1-7.
PCT/CN2019/121602 2019-05-24 2019-11-28 Data processing method, electronic device, and computer-readable storage medium WO2020238106A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/257,324 US11061621B2 (en) 2019-05-24 2019-11-28 Data processing method, electronic apparatus, and computer-readable storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910444607.1 2019-05-24
CN201910444607.1A CN110298441B (zh) 2019-05-24 2019-05-24 Data processing method, electronic device, and computer-readable storage medium

Publications (1)

Publication Number Publication Date
WO2020238106A1 true WO2020238106A1 (zh) 2020-12-03

Family

ID=68027193

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/121602 WO2020238106A1 (zh) 2019-05-24 2019-11-28 Data processing method, electronic device, and computer-readable storage medium

Country Status (3)

Country Link
US (1) US11061621B2 (zh)
CN (1) CN110298441B (zh)
WO (1) WO2020238106A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110298441B (zh) * 2019-05-24 2022-01-11 深圳云天励飞技术有限公司 Data processing method, electronic device, and computer-readable storage medium
CN113032483B (zh) * 2021-03-12 2023-08-08 北京百度网讯科技有限公司 Cross-platform data asset sharing method and apparatus, and electronic device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160179434A1 (en) * 2014-12-19 2016-06-23 Intel Corporation Storage device and method for performing convolution operations
CN106569727A (zh) * 2015-10-08 2017-04-19 福州瑞芯微电子股份有限公司 Parallel data read/write apparatus with multiple memories shared among multiple controllers, and writing and reading methods therefor
CN109284130A (zh) * 2017-07-20 2019-01-29 上海寒武纪信息科技有限公司 Neural network operation apparatus and method
CN109740732A (zh) * 2018-12-27 2019-05-10 深圳云天励飞技术有限公司 Neural network processor, convolutional neural network data reuse method, and related device
CN109799959A (zh) * 2019-01-22 2019-05-24 华中科技大学 Method for improving write parallelism of open-channel solid-state drives
CN110298441A (zh) * 2019-05-24 2019-10-01 深圳云天励飞技术有限公司 Data processing method, electronic device, and computer-readable storage medium

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9626624B2 (en) * 2010-07-20 2017-04-18 Analog Devices, Inc. Programmable probability processing
US10839289B2 (en) * 2016-04-28 2020-11-17 International Business Machines Corporation Neural network processing with von-Neumann cores
JP6961011B2 (ja) * 2016-12-09 2021-11-05 Beijing Horizon Information Technology Co., Ltd. Systems and methods for data management
EP3998539A1 (en) * 2016-12-30 2022-05-18 INTEL Corporation Deep learning hardware
US20190095776A1 (en) * 2017-09-27 2019-03-28 Mellanox Technologies, Ltd. Efficient data distribution for parallel processing
CN109117184A (zh) * 2017-10-30 2019-01-01 上海寒武纪信息科技有限公司 Artificial intelligence processor and method of executing plane-rotation instructions using the processor
CN109416755B (zh) * 2018-01-15 2021-11-23 深圳鲲云信息科技有限公司 Artificial intelligence parallel processing method and apparatus, readable storage medium, and terminal
CN108491924B (zh) * 2018-02-11 2022-01-07 江苏金羿智芯科技有限公司 Neural network data serial pipeline processing apparatus for artificial intelligence computation
CN108470009B (zh) * 2018-03-19 2020-05-29 上海兆芯集成电路有限公司 Processing circuit and neural network operation method thereof
CN109213772B (zh) * 2018-09-12 2021-03-26 华东师范大学 Data storage method and NVMe storage system
US10831702B2 (en) * 2018-09-20 2020-11-10 Ceva D.S.P. Ltd. Efficient utilization of systolic arrays in computational processing

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160179434A1 (en) * 2014-12-19 2016-06-23 Intel Corporation Storage device and method for performing convolution operations
CN106569727A (zh) * 2015-10-08 2017-04-19 福州瑞芯微电子股份有限公司 Parallel data read/write apparatus with multiple memories shared among multiple controllers, and writing and reading methods therefor
CN109284130A (zh) * 2017-07-20 2019-01-29 上海寒武纪信息科技有限公司 Neural network operation apparatus and method
CN109740732A (zh) * 2018-12-27 2019-05-10 深圳云天励飞技术有限公司 Neural network processor, convolutional neural network data reuse method, and related device
CN109799959A (zh) * 2019-01-22 2019-05-24 华中科技大学 Method for improving write parallelism of open-channel solid-state drives
CN110298441A (zh) * 2019-05-24 2019-10-01 深圳云天励飞技术有限公司 Data processing method, electronic device, and computer-readable storage medium

Also Published As

Publication number Publication date
CN110298441A (zh) 2019-10-01
US11061621B2 (en) 2021-07-13
CN110298441B (zh) 2022-01-11
US20210173590A1 (en) 2021-06-10

Similar Documents

Publication Publication Date Title
US10943167B1 (en) Restructuring a multi-dimensional array
CN110390385B (zh) BNRP-based configurable parallel general-purpose convolutional neural network accelerator
US11960934B2 (en) Systems and methods for improved neural network execution
US11960566B1 (en) Reducing computations for data including padding
US11775430B1 (en) Memory access for multiple circuit components
US10846591B2 (en) Configurable and programmable multi-core architecture with a specialized instruction set for embedded application based on neural networks
EP3496007B1 (en) Device and method for executing neural network operation
US20200327079A1 (en) Data processing method and device, dma controller, and computer readable storage medium
US10678479B1 (en) Registers for restricted memory
TWI766396 (zh) Data buffering device, data buffering method, and computing method
US20220083857A1 (en) Convolutional neural network operation method and device
US20190095791A1 (en) Learning affinity via a spatial propagation neural network
WO2022179074A1 (zh) Data processing apparatus and method, computer device, and storage medium
WO2020238106A1 (zh) Data processing method, electronic device, and computer-readable storage medium
CN111191784A (zh) Transposed sparse matrix multiplied by dense matrix for neural network training
CN111465943A (zh) On-chip computing network
EP3844610B1 (en) Method and system for performing parallel computation
WO2024027039A1 (zh) Data processing method, apparatus, and device, and readable storage medium
CN111667542A (zh) Decompression techniques for processing compressed data suitable for artificial neural networks
CN114429214A (zh) Operation unit and related apparatus and method
WO2022179075A1 (zh) Data processing method and apparatus, computer device, and storage medium
US20210097396A1 (en) Neural network training in a distributed system
CN110738317A (zh) FPGA-based deformable convolutional network operation method, apparatus, and system
CN111199276B (zh) Data processing method and related products
CN116185937B (zh) Binary-operation memory-access optimization method and apparatus based on a many-core processor multi-layer interconnect architecture

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19931470

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19931470

Country of ref document: EP

Kind code of ref document: A1