WO2023004762A1 - Computer system and data processing method - Google Patents

Computer system and data processing method

Info

Publication number
WO2023004762A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
buffer
unit
read
input
Prior art date
Application number
PCT/CN2021/109650
Other languages
English (en)
French (fr)
Inventor
高立稳
李震桁
陈艾德
袁宏辉
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Priority to PCT/CN2021/109650 priority Critical patent/WO2023004762A1/zh
Priority to CN202180096718.3A priority patent/CN117223008A/zh
Publication of WO2023004762A1 publication Critical patent/WO2023004762A1/zh

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application relates to the field of computer technology, in particular to a computer system and a data processing method.
  • CNN: Convolutional Neural Network
  • Embodiments of the present application provide a computer system and a data processing method, which can speed up deep convolution processing and reduce system energy consumption.
  • the first aspect of the present application provides a computer system, including a matrix calculation unit composed of two-dimensional computing nodes; the matrix calculation unit is used to perform convolution operations on input data according to convolution parameters, and each computing node completes the computation of a single convolution window.
  • This application uses a matrix calculation unit composed of two-dimensional computing nodes to perform the convolution operations of depthwise convolution; the calculation of multiple data points can be completed in each clock cycle, which speeds up depthwise convolution processing and reduces system energy consumption.
  • the number of columns of the matrix calculation unit is equal to the number of channels of the input data.
  • the computer system further includes an external memory, and the external memory is used for storing the input data and the convolution parameters.
  • the computer system further includes a controller, a data buffer write unit, a data buffer read unit, a data supply unit, a data write-back unit, a data buffer, a parameter buffer, and an output result buffer
  • the parameter buffer is connected to the external memory and is used to cache the convolution parameters;
  • the controller is used to control the data buffer writing unit to read the input data from the external memory, and write the read input data into the data buffer for caching;
  • the controller is further configured to control the data supply unit to send an input data read request to the data buffer reading unit;
  • the data buffer reading unit is configured to read the input data from the data buffer according to the input data read request, and transmit the input data to the matrix calculation unit through the data supply unit;
  • the controller is further used to control the matrix calculation unit to read the convolution parameters from the parameter buffer, perform convolution operations on the input data according to the convolution parameters to obtain output data, and cache the output data to the output result buffer;
  • the controller is also used to control the data write-back unit to read the output data cached in the output result buffer, send the read output data to the data buffer writing unit, and send an output data write request to the data buffer writing unit; the data buffer writing unit is further configured to write the output data into the data buffer for caching according to the output data write request.
  • the input data includes subgraphs obtained by slicing the feature map
  • the external memory is also used to store filling data
  • the controller is also used to control the data buffer writing unit to read the filling data from the external memory and generate a filling data write request, and write the read filling data into the data buffer according to the filling data write request;
  • the controller is further used to control the data buffer reading unit to generate a filling data read request, read new filling data from the output data cached in the data buffer according to the filling data read request, and store the new filling data from the data buffer to the external memory.
  • the data buffer writing unit is further configured to perform a write conflict check on the output data write request and the padding data write request.
  • the priority of the output data write request is higher than the priority of the filling data write request.
  • the data buffer writing unit is further configured to perform a read conflict check on the input data read request and the filling data read request.
  • the priority of the input data read request is higher than the priority of the filling data read request.
  • the sub-graph is obtained by segmenting along the height direction or the width direction of the feature map.
  • the computer system performs depthwise convolution on the input data in a pipeline manner.
  • the input data is divided into data blocks along the width direction or the height direction
  • the data block is divided into sub-data blocks along the height direction or the width direction
  • the matrix calculation unit performs a convolution operation on one sub-data block per clock cycle.
  • the height of the sub-data block is 1.
  • the second aspect of the present application provides a data processing method, which is applied to a computer system.
  • the computer system includes a matrix computing unit composed of two-dimensional computing nodes.
  • the method includes: the matrix calculation unit performs a convolution operation on the input data according to the convolution parameters, and each computing node completes the calculation of a single convolution window.
  • the number of columns of the matrix calculation unit is equal to the number of channels of the input data.
  • the computer system further includes an external memory, and the external memory is used for storing the input data and the convolution parameters.
  • the computer system further includes a controller, a data buffer write unit, a data buffer read unit, a data supply unit, a data write-back unit, a data buffer, a parameter buffer, and an output result buffer
  • the parameter buffer is connected to the external memory for caching the convolution parameters
  • the method further includes: the controller controls the data buffer writing unit to read the input data from the external memory, and write the read input data into the data buffer for caching; the controller controls the data supply unit to send an input data read request to the data buffer reading unit; the data buffer reading unit reads the input data from the data buffer according to the input data read request, and transmits the input data to the matrix calculation unit through the data supply unit; the controller controls the matrix calculation unit to read the convolution parameters from the parameter buffer, perform convolution operations on the input data according to the convolution parameters to obtain output data, and cache the output data into the output result buffer; the controller controls the data write-back unit to read the output data cached in the output result buffer, send the read output data to the data buffer writing unit, and send an output data write request to the data buffer writing unit; the data buffer writing unit writes the output data into the data buffer for caching according to the output data write request.
  • the input data includes sub-maps obtained by segmenting the feature map, and the external memory is also used to store filling data; the method further includes: the controller controls the data buffer writing unit to read the filling data from the external memory and generate a filling data write request, and write the read filling data into the data buffer according to the filling data write request; the controller controls the data buffer reading unit to generate a filling data read request, read new filling data from the output data cached in the data buffer according to the filling data read request, and store the new filling data from the data buffer to the external memory.
  • the method further includes: the data buffer writing unit is further configured to perform a write conflict check on the output data write request and the padding data write request.
  • the priority of the output data write request is higher than the priority of the filling data write request.
  • the method further includes: the data buffer writing unit performs a read conflict check on the input data read request and the filling data read request.
  • the priority of the input data read request is higher than the priority of the filling data read request.
  • the sub-graph is obtained by segmenting along the height direction or the width direction of the feature map.
  • the data processing method performs depthwise convolution on the input data in a pipeline manner.
  • the input data is divided into data blocks along the width direction or the height direction
  • the data block is divided into sub-data blocks along the height direction or the width direction
  • the matrix calculation unit performs a convolution operation on one sub-data block per clock cycle.
  • the height of the sub-data block is 1.
  • FIG. 1 is a schematic diagram of a computer system provided by an embodiment of the present application.
  • FIG. 2 is a schematic diagram of the data supply of the matrix calculation unit in FIG. 1.
  • FIG. 3 is a schematic diagram of a computing node in an embodiment of the present application.
  • FIG. 4 is a schematic diagram of data multiplexing within the matrix calculation unit in an embodiment of the present application.
  • FIG. 5 is a schematic diagram of dividing a feature map into subgraphs and processing the subgraphs.
  • FIG. 6 is a flowchart of processing subgraphs in an embodiment of the present application.
  • FIG. 7 is a schematic diagram of the relationship among a feature map, sub-maps, data blocks, and sub-data blocks in an embodiment of the present application.
  • FIG. 8 is a schematic diagram of performing depthwise convolution on the data blocks in an input feature map/input sub-map in an embodiment of the present application.
  • FIG. 9 is a timing diagram of depthwise convolution performed in a pipeline manner in an embodiment of the present application.
  • FIG. 10 is a flowchart of a data processing method provided by an embodiment of the present application.
  • FIG. 11 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • FIG. 1 is a schematic diagram of a computer system provided by an embodiment of the present application.
  • the computer system 10 includes a chip 100 (that is, a system on a chip, System On Chip, SOC) and an external memory 101.
  • the chip 100 includes a controller 1001, a data buffer writing unit 1002, a data buffer reading unit 1003, a data supply unit 1004, a data write-back unit 1005, a matrix calculation unit 1006, a data buffer 1007, a parameter buffer 1008, and an output result buffer 1009.
  • the external memory 101 is connected to the data buffer reading unit 1003, the data buffer writing unit 1002, and the parameter buffer 1008; the data buffer 1007 is connected to the data buffer writing unit 1002 and the data buffer reading unit 1003; the controller 1001 is connected to the data buffer writing unit 1002, the data buffer reading unit 1003, the data supply unit 1004, the data write-back unit 1005, and the matrix calculation unit 1006; the data supply unit 1004 is connected to the data buffer reading unit 1003 and the matrix calculation unit 1006; the data write-back unit 1005 is connected to the data buffer writing unit 1002 and the output result buffer 1009; and the matrix calculation unit 1006 is connected to the parameter buffer 1008 and the output result buffer 1009.
  • the controller 1001 may include at least one of the following types: a central processing unit (CPU), a microcontroller unit (MCU), a digital signal processor (DSP), an application processor (AP), a graphics processing unit (GPU), and a neural network processing unit (NPU).
  • the computer system 10 may include a robot, a mobile phone, a vehicle-mounted computer, etc., and the corresponding chip 100 may be a robot chip, a mobile phone chip, a vehicle-mounted chip, etc.
  • the computer system 10 can implement machine vision tasks such as recognition and classification through related software.
  • the computer system 10 uses the matrix computing unit 1006 to perform convolution operations of depth convolution, which can realize fast and low-energy consumption depth convolution.
  • the computer system 10 utilizes the cooperation of various units in the chip 100 to increase storage bandwidth.
  • computer system 10 may be used for MobileNet.
  • MobileNet is an efficient model proposed for mobile and embedded devices.
  • MobileNet is based on a streamlined architecture and uses depthwise separable convolutions to build a lightweight deep neural network.
  • Depthwise separable convolution includes two parts: depthwise convolution and pointwise convolution.
  • the computer system 10 provided in this embodiment of the present application can be used for the depthwise convolution of MobileNet; it can be understood that the computer system 10 can also be used to implement other convolutional neural networks.
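  • as an illustrative aside (not part of the claimed system), the following minimal numpy sketch shows the semantics of depthwise separable convolution, i.e., a depthwise convolution followed by a pointwise (1*1) convolution; all shapes and names here are assumptions for illustration only:

```python
import numpy as np

def depthwise_conv(x, w):
    """Depthwise convolution: one k*k filter per channel, no channel mixing.
    x: (H, W, C) input; w: (k, k, C) weights; 'valid' padding."""
    H, W, C = x.shape
    k = w.shape[0]
    out = np.zeros((H - k + 1, W - k + 1, C))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # each output point is one convolution window per channel
            out[i, j, :] = np.sum(x[i:i + k, j:j + k, :] * w, axis=(0, 1))
    return out

def pointwise_conv(x, w):
    """Pointwise (1*1) convolution: mixes channels, no spatial extent.
    x: (H, W, C_in); w: (C_in, C_out)."""
    return x @ w

x = np.random.rand(8, 8, 16)
y = pointwise_conv(depthwise_conv(x, np.random.rand(3, 3, 16)), np.random.rand(16, 32))
print(y.shape)  # (6, 6, 32)
```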
  • the external memory 101 is used to store input data and convolution parameters.
  • the chip 100 is configured to perform depthwise convolution on the input data stored in the external memory 101 according to convolution parameters.
  • the input data can be a feature map (feature map), or a submap (ie sub-feature map) obtained by segmenting the feature map.
  • the feature map has three dimensions: width (Width, W), height (Height, H), and number of channels (Channel, C). The width, height, and number of channels of the feature map can be expressed as w, h, and c, respectively, so the feature map size can be expressed as w*h*c.
  • Convolution parameters can include weights and biases.
  • the input data can be an image (the image can be regarded as a special feature map), and the chip 100 can perform deep convolution on the image stored in the external memory 101 according to the convolution parameters.
  • the feature map can be divided into sub-maps, and the sub-maps can be stored in the external memory 101 .
  • the feature map can be divided into subgraphs along the height direction, and the feature map can also be divided into subgraphs along the width direction.
  • for example, if the size of the feature map is 1024*1024*64 (width 1024, height 1024, 64 channels), the feature map can be divided along the height direction into four sub-maps of size 1024*256*64, and the four sub-maps are stored in the external memory 101.
  • the chip 100 can perform one-layer depth convolution on the input data (ie feature map or sub-image), and can also perform multi-layer depth convolution on the input data.
  • the chip 100 performs depthwise convolution on each sub-image. If multiple layers of depth convolution are required, the chip 100 performs depth convolution on all layers of a sub-image, and then performs depth convolution on the next sub-image.
  • the filling data required for the next layer of depthwise convolution can be obtained from the output data of each layer of depthwise convolution (that is, the output sub-map), and the filling data can be stored in the external memory 101 for the depthwise convolution of the next layer.
  • the size of the convolution kernel is k*k, and the bottom k-1 rows of data in the height direction of the output subimage are used as padding data.
  • for example, if the size of the convolution kernel is 3*3, the padding data is the bottom 2 rows of data in the height direction of the output sub-map.
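  • as a sketch of this step (illustrative only; the function name is assumed), the padding rows can be taken directly from the bottom of the output sub-map:

```python
import numpy as np

def extract_padding(output_submap, k):
    """Bottom k-1 rows (height axis) of an output sub-map, kept as the
    filling data for the next layer's convolution of the following sub-map."""
    return output_submap[-(k - 1):, :, :]

submap = np.random.rand(256, 1024, 64)      # (h, w, c)
print(extract_padding(submap, k=3).shape)   # (2, 1024, 64) for a 3*3 kernel
```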
  • the data buffer 1007 is used for buffering input data, output data and filling data.
  • the data buffer 1007 is composed of multiple random access memories (Random Access Memory, RAM).
  • the random access memory constituting the data buffer 1007 may be a single-port random access memory or a dual-port random access memory. Single-port random access memory has only one set of data lines and address lines, which cannot be read and written at the same time; dual-port random access memory has two sets of data lines and address lines, which can be read and written at the same time.
  • the random access memories in the data buffer 1007 are mapped according to certain addresses, and are used to cache the input data (i.e., input feature map/input sub-map), output data (i.e., output feature map/output sub-map), and filling data of the depthwise convolution.
  • the input feature map/input submap and output feature map/output submap are relative to a depth convolution.
  • the input feature map/input sub-map refers to the feature map/sub-map before the depthwise convolution, such as a feature map/sub-map from the external memory 101; the output feature map/output sub-map is the feature map/sub-map after the depthwise convolution.
  • the data buffer 1007 includes at least two groups of random access memories to support simultaneous reading and writing. Each group of random access memories includes multiple random access memories to support reading/writing multiple data points in the width direction of the feature map/submap at one time.
  • for example, the data buffer 1007 includes two groups of random access memories; if the width of the input feature map/input sub-map is w and the size of the convolution kernel is k*k, each group includes at least w+k-1 random access memories.
  • one group of random access memories stores the input feature map/input sub-map, and the other group stores the output feature map/output sub-map, supporting independent-address reads/writes of at least w+k-1 random access memories at the same time.
  • the more groups of random access memories the data buffer 1007 supports for simultaneous reading and writing, the smaller the probability of data read-write conflicts.
  • the data buffer 1007 can read and write data in a ping-pong manner.
  • Ping-pong is a means of data buffering that uses two data buffers to achieve continuous data transmission and improve the data transmission rate. Since the data in a single buffer is easily overwritten during transmission and processing, the ping-pong method keeps reading data from one buffer while writing data to the other; that is, the two buffers alternate between reading and writing.
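  • a minimal software model of ping-pong buffering (illustrative only; in hardware the load of one buffer and the compute on the other proceed concurrently rather than in sequence, and `load`/`compute` are assumed stand-ins for the write and read sides):

```python
def pingpong_stream(blocks, load, compute):
    """Alternate two buffers: while one is consumed, the other is filled."""
    buffers = [load(blocks[0]), None]             # prime the first buffer
    for i in range(len(blocks)):
        ping, pong = i % 2, 1 - i % 2             # consumed / filled buffer
        if i + 1 < len(blocks):
            buffers[pong] = load(blocks[i + 1])   # fill ahead
        compute(buffers[ping])                    # consume current data
```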
  • the data in the data buffer 1007 is stored in the NHWC data format, where N represents the number of pictures in a batch, H represents the height, W represents the width, and C represents the number of channels.
  • To store data according to the NHWC data format is to store data in the order of C direction, W direction, H direction, and N direction.
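  • for illustration, the linear offset of element (n, h, w, c) under NHWC storage is sketched below (the actual address mapping of the data buffer 1007 is implementation-specific):

```python
def nhwc_offset(n, h, w, c, H, W, C):
    """C varies fastest, then W, then H, then N."""
    return ((n * H + h) * W + w) * C + c

# adjacent channels of the same pixel are adjacent in memory
assert nhwc_offset(0, 0, 0, 1, 4, 4, 8) - nhwc_offset(0, 0, 0, 0, 4, 4, 8) == 1
```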
  • Parameter buffer 1008 is used to cache convolution parameters.
  • the parameter buffer 1008 may be composed of a small-capacity random access memory, and the parameter buffer 1008 may cache convolution parameters in a ping-pong manner.
  • the computer system 10 may include a DMA (Direct Memory Access, DMA) controller (not shown in the figure), and the computer system may write the convolution parameters from the external memory 101 to the parameter buffer through the DMA controller.
  • DMA Direct Memory Access
  • the output result buffer 1009 is used for buffering the operation result (ie output data) of the matrix calculation unit 1006 .
  • the output result buffer 1009 supports the reading of the data write-back unit 1005 .
  • the output result buffer 1009 may consist of a small-capacity random access memory.
  • the output result buffer 1009 may buffer the operation result of the matrix calculation unit 1006 in a ping-pong manner.
  • the controller 1001 is used to send control instructions to the data buffer writing unit 1002, the data buffer reading unit 1003, the data supply unit 1004, the data write-back unit 1005, and the matrix calculation unit 1006 to indicate the size and location of the data to be processed, etc.
  • the controller 1001 controls the data buffer writing unit 1002 to read input data from the external memory 101, and write the read input data into the data buffer 1007 for buffering.
  • the controller 1001 is further configured to control the data supply unit 1004 to send an input data read request to the data buffer read unit 1003 .
  • the data buffer reading unit 1003 is configured to read input data from the data buffer 1007 according to the input data reading request, and transmit the input data to the matrix calculation unit 1006 through the data supply unit 1004 .
  • the data supply unit 1004 can process the input data by using an activation function.
  • for example, the activation function used by the data supply unit 1004 is a ReLU function.
  • the controller 1001 is also used to control the matrix calculation unit 1006 to read convolution parameters from the parameter buffer 1008 , perform convolution operation on the input data according to the convolution parameters to obtain output data, and cache the output data to the output result buffer 1009 .
  • the controller 1001 is also used to control the data write-back unit 1005 to read the output data cached in the output result buffer 1009, send the read output data to the data buffer writing unit 1002, and send an output data write request to the data buffer writing unit 1002.
  • the data write-back unit 1005 can quantize the operation result. Quantizing the operation result converts its format, for example, from a 32-bit floating-point number to an 8-bit fixed-point number.
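  • a sketch of such a format conversion (the rounding mode and the scale value here are assumptions for illustration, not the patent's specification):

```python
import numpy as np

def quantize_to_int8(x, scale):
    """Convert float32 results to 8-bit fixed point: value ~ q * scale."""
    q = np.round(x / scale)
    return np.clip(q, -128, 127).astype(np.int8)

out = np.array([0.5, -1.25, 3.9], dtype=np.float32)
print(quantize_to_int8(out, scale=0.05))  # [ 10 -25  78]
```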
  • the data buffer writing unit 1002 is further configured to write the output data into the data buffer 1007 for caching according to the output data write request.
  • the external memory 101 is also used to store filling data.
  • the controller 1001 is further configured to control the data buffer writing unit 1002 to read filling data from the external memory 101 and generate a filling data write request, and write the read filling data into the data buffer 1007 for caching according to the filling data write request.
  • the controller 1001 is also used to control the data buffer reading unit 1003 to generate a filling data read request, read new filling data from the output data cached in the data buffer 1007 according to the filling data read request, and store the new filling data from the data buffer 1007 to the external memory 101.
  • the data buffer reading unit 1003 performs a read conflict check on the input data read request and the filling data read request. Since the data volume of the filling data is small, the priority of the input data reading request may be higher than that of the filling data reading request.
  • the data buffer writing unit 1002 performs write conflict check on the output data write request and the padding data write request. Since the data volume of the padding data is small, the priority of the output data write request may be higher than that of the padding data write request.
  • the addresses requested by the input data read requests of the data supply unit 1004 and by the output data write requests of the data write-back unit 1005 are consecutive, so as to avoid read conflicts for the data supply unit 1004 and write conflicts for the data write-back unit 1005.
  • the data supply unit 1004 reads from the data buffer, and the data write-back unit 1005 writes to the data buffer.
  • the data supply unit 1004 and the data write-back unit 1005 access multiple RAMs of the data buffer 1007 at the same time.
  • a single group of random access memories may receive only read requests or only write requests at one time, or may receive read requests and write requests at the same time; in the latter case, the read requests and write requests are executed serially, with write requests given priority.
  • the data buffer writing unit 1002 and the data buffer reading unit 1003 alternately occupy two groups of random access memories, so as to avoid read-write conflicts and improve the utilization rate of the data buffer 1007 .
  • the matrix computing unit 1006 is composed of two-dimensional computing nodes. As shown in FIG. 1 , the matrix computing unit 1006 includes m*n computing nodes, m is the number of rows of the matrix computing unit 1006 , and n is the number of columns of the matrix computing unit 1006 .
  • the number of rows m of the matrix calculation unit 1006 is equal to the width of the input data provided by the data supply unit 1004, and the number of columns n is equal to the number of channels c of that input data; that is, the width and channels of the input data are mapped to the rows and columns of the matrix calculation unit 1006, respectively.
  • Using the matrix calculation unit 1006 can improve the performance of depthwise convolution.
  • FIG. 2 is a schematic diagram of data supply of the matrix computing unit in FIG. 1 .
  • the data supply unit provides input data (input feature map/input submap) for the matrix calculation unit.
  • the data supply unit transmits the input data of one sub-data block to the matrix calculation unit at a time.
  • the sub-data block is, for example, the gray part in the upper left corner in the figure.
  • the height of the sub-data block is 1. Sub-data blocks will be described in detail later.
  • if the size of the convolution kernel is k*k, the data supply unit transmits k data points in the sub-data block to each computing node of the matrix calculation unit per transmission, and the matrix calculation unit needs k data transmissions to complete one convolution operation.
  • for example, if the size of the convolution kernel is 3*3, the data supply unit transmits 3 data points in the sub-data block to each computing node each time, and the matrix calculation unit needs 3 data transmissions to complete one convolution operation.
  • as shown in the figure, the data supply unit transmits a0, a1, and a2 to computing node (0,0), transmits a1, a2, and a3 to computing node (1,0), transmits a2, a3, and a4 to computing node (2,0), and so on.
  • the parameter buffer provides convolution parameters for the matrix computation unit.
  • the convolution parameters provided by the parameter buffer to the matrix calculation unit include weights (represented as weight in the figure) and biases (represented as bias in the figure).
  • the parameter buffer provides a set of weights and a bias for each column of the matrix calculation unit. For example, weights d0, d1, d2, d3, d4, d5, d6, d7, d8 and bias e0 are provided for the first column of the matrix calculation unit.
  • each column of the matrix calculation unit completes the data calculation of one channel of the input data, and each row completes the data calculation of one width position of the input data; therefore, the matrix calculation unit generates m*n operation results per clock cycle (cycle).
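  • the following behavioral model (an illustrative sketch, not a hardware description) reproduces this dataflow: over k cycles, each cycle supplies one row of m+k-1 width points for n channels; node (i, j) multiplies the k width-adjacent points of channel j by the corresponding kernel row; after k cycles each node adds the bias, yielding m*n results per window:

```python
import numpy as np

def matrix_unit_window(rows, weight, bias):
    """rows:   k arrays of shape (m + k - 1, n), one per clock cycle
    weight: (k, k, n), one k*k filter per channel (column)
    bias:   (n,), one bias per column
    returns (m, n): one result per computing node."""
    k = weight.shape[0]
    m, n = rows[0].shape[0] - k + 1, rows[0].shape[1]
    acc = np.zeros((m, n))
    for r in range(k):                    # one data transmission per cycle
        for i in range(m):                # node row <-> width position
            # node (i, j) gets k width-adjacent points of its channel j
            acc[i] += np.sum(rows[r][i:i + k, :] * weight[r], axis=0)
    return acc + bias

rows = [np.random.rand(6, 8) for _ in range(3)]   # m=4, n=8, k=3
print(matrix_unit_window(rows, np.random.rand(3, 3, 8), np.zeros(8)).shape)  # (4, 8)
```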
  • Data multiplexing in the width or height direction of input data can be realized through the matrix calculation unit.
  • for example, the size of the convolution kernel in the figure is 3*3; along the width (W) direction of the input feature map/input sub-map, the first to third data points a0, a1, a2 are input to computing node (0,0), the second to fourth data points a1, a2, a3 are input to computing node (1,0), and the second and third data points a1, a2 are multiplexed.
  • the data of the input feature map/input submap is read out and expanded inside the matrix calculation unit 1006 , and data multiplexing in the width/height direction saves the connection overhead between the data buffer 1007 and the matrix calculation unit 1006 .
  • Fig. 3 is a schematic diagram of computing nodes in the embodiment of the present application.
  • each computing node may include a first register, a second register, multipliers, an addition tree (the multipliers and the addition tree are collectively represented as MAC in the figure), and an accumulator (represented in the figure as bias adder).
  • the first register is used to store the data points of the input data (represented as data in the figure); the second register is used to store the convolution parameters (the weight is represented as weight in the figure, and the bias as bias); the multipliers are used to multiply each data point by the corresponding weight; the addition tree is used to sum the results of the multipliers; and the accumulator is used to add the bias to the result of the addition tree.
  • the number of multipliers contained in each computing node is equal to the convolution kernel size.
  • the size of the convolution kernel is k*k, and the number of multipliers included in each computing node is k*k.
  • the size of the convolution kernel is 3*3, and the number of multipliers included in each computing node is 9.
  • Each calculation node of the matrix calculation unit completes the calculation of a single convolution window.
  • for example, if the data of the input feature map/input subgraph is expressed as in[0], in[1], ..., in[8], the weights (i.e., the data of the convolution kernel) are expressed as w[0], w[1], ..., w[8], and the bias is expressed as bias, the calculation completed by the computing node can be expressed as: out = in[0]*w[0] + in[1]*w[1] + ... + in[8]*w[8] + bias.
  • each computation node may include a first register, a second register, a multiplier, and an addition tree.
  • the addition tree will sum the results of the individual multipliers and add them to the bias.
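  • a one-line model of a single computing node for a 3*3 kernel (illustrative sketch only):

```python
def compute_node(data, weight, bias):
    """k*k multipliers, an addition tree, and a bias accumulator:
    out = in[0]*w[0] + ... + in[8]*w[8] + bias."""
    products = [d * w for d, w in zip(data, weight)]   # 9 multipliers
    return sum(products) + bias                        # addition tree + bias

print(compute_node(list(range(9)), [1] * 9, bias=0.5))  # 36.5
```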
  • the matrix calculation unit may adopt a window data circular register method (see FIG. 4 ) to reduce the line width from the data buffer to the matrix calculation unit.
  • Fig. 4 is a schematic diagram of data multiplexing in the matrix calculation unit in the embodiment of the present application.
  • if the matrix calculation unit does not buffer the data, the data supply unit is required to buffer and retransmit the data, which would greatly increase the number of wires between the data supply unit and the matrix calculation unit.
  • the embodiment of the present application greatly reduces the wiring between the data supply unit and the matrix calculation unit.
  • the current mainstream computer architecture does not support deep convolution well.
  • the mainstream architecture deals with general-purpose convolution operations, that is, three-dimensional (3D) convolution operations, which realize data multiplexing in the channel direction to reduce system energy consumption, while depthwise convolution does not require data multiplexing in the channel direction.
  • the computer system provided by the embodiment of the present application performs convolution operation through the matrix calculation unit, and can complete the calculation of multiple data points in each clock cycle, which speeds up the deep convolution processing and reduces the energy consumption of the system.
  • the computer system provided by the embodiment of the present application multiplexes width or height data to further reduce system energy consumption.
  • the computer system provided by the embodiment of the present application utilizes the data buffer to buffer the input data and output data, and the data entering and exiting the data buffer is mainly filling data with a small amount of data, which greatly alleviates the bandwidth requirement of the memory.
  • Fig. 5 is a schematic diagram of dividing a feature map into subgraphs and processing the subgraphs.
  • Figure 5 shows a three-layer depthwise convolution.
  • the feature map can be divided into submaps along the height or width direction.
  • Figure 5 shows the segmentation of the feature map into sub-graphs along the height direction.
  • subgraph [0,i-1], subgraph [0,i], and subgraph [0,i+1] represent the (i-1)th, ith, and (i+1)th input subgraphs of the first-layer depthwise convolution;
  • subgraph [1,i-1], subgraph [1,i], and subgraph [1,i+1] represent the (i-1)th, ith, and (i+1)th input subgraphs of the second-layer depthwise convolution;
  • subgraph [2,i-1], subgraph [2,i], and subgraph [2,i+1] represent the (i-1)th, ith, and (i+1)th input subgraphs of the third-layer depthwise convolution.
  • the filling data is obtained from the output subgraph (that is, the output data), and the filling data is stored in the external memory for splicing the input subgraph of the next layer of depthwise convolution.
  • the input subgraph (e.g. subgraph [0,i]) of the first layer of depthwise convolution can be loaded directly without using padded data concatenation.
  • the data buffer writing unit caches the filling data of the output subgraph (such as the bottom 2 rows of the output subgraph of subgraph [0,i]) from the external memory to the data buffer, and the filling data is spliced with the output subgraph into the input subgraph of the next layer of depthwise convolution (such as subgraph [1,i]); alternatively, the data buffer reading unit obtains the filling data directly from the output subgraph cached in the data buffer for the splicing, in which case the filling data no longer needs to be read from or stored in the external memory.
  • when the padding data is stitched with the output sub-map, the padding data is added on top of the output sub-map. It should be noted that for the first subgraph it is not necessary to read filling data from the external memory and write it into the data buffer, and for the last subgraph it is not necessary to read filling data from the data buffer and store it in the external memory.
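  • the splicing step can be sketched as a concatenation along the height axis (illustrative only; `splice_input_submap` is a name assumed for this description):

```python
import numpy as np

def splice_input_submap(padding, output_submap):
    """Stack the carried-over padding rows on top of the current output
    sub-map to form the next layer's input sub-map."""
    if padding is None:                    # first sub-map: nothing to splice
        return output_submap
    return np.concatenate([padding, output_submap], axis=0)

out = np.random.rand(256, 1024, 64)
pad = np.random.rand(2, 1024, 64)           # bottom 2 rows of the previous output
print(splice_input_submap(pad, out).shape)  # (258, 1024, 64)
```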
  • FIG. 6 is a flow chart of processing subgraphs in the embodiment of the present application.
  • after the chip performs the depthwise convolution of all layers on one subgraph, it performs depthwise convolution on the next subgraph. For example, after the chip performs the depthwise convolution of all layers on the (i-1)th subgraph, it performs the depthwise convolution of all layers on the ith subgraph, then on the (i+1)th subgraph, and so on.
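  • this processing order can be sketched as the following loop (illustrative; `layers` are assumed per-layer convolution callables, and `splice_input_submap` is the helper sketched above):

```python
def process_feature_map(submaps, layers, k=3):
    """All depthwise-convolution layers are applied to one sub-map before
    moving on to the next sub-map; padding rows are carried per layer."""
    padding = {}                              # per-layer carried padding rows
    data = None
    for submap in submaps:
        data = submap                         # first-layer input loads directly
        for l, conv in enumerate(layers):
            if l > 0:                         # later layers splice carried padding
                data = splice_input_submap(padding.get(l), data)
            data = conv(data)
            padding[l + 1] = data[-(k - 1):]  # bottom k-1 rows for the next sub-map
    return data
```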
  • in this way, the main input data and output data always stay in the data buffer, and the data entering and leaving the chip is mainly the filling data, whose amount is small, which can greatly alleviate the bandwidth requirements of the memory.
  • the chip may perform deep convolution on the input feature map/input sub-map in a pipeline (pipe-line) manner.
  • data processing is performed in units of data blocks (data blocks can be expressed as tiles).
  • Each input feature map/input submap includes multiple data blocks.
  • the matrix calculation unit performs the convolution operation, data processing is performed in units of sub-data blocks, and each data block includes multiple sub-data blocks.
  • Fig. 7 is a schematic diagram of the relationship between a feature map, a sub-graph, a data block, and a sub-data block in the embodiment of the present application.
  • a feature map is a single layer of data to be processed in a convolutional neural network, with three dimensions of width (W), height (H) and number of channels (C).
  • the size of the feature map may be too large to be stored in the memory of the chip (such as the data buffer in the embodiments of this application) for processing; it then needs to be segmented, so that the divided data can be stored in the memory of the chip.
  • the number of channels of the submap is the same as that of the feature map.
  • the feature map can be divided into subgraphs along the height direction (for example, three subgraphs along the height direction in the figure), or the feature map can be divided into subgraphs along the width direction.
  • although the sub-map can be stored in the memory of the chip, it is not processed all at once; it needs to be divided into data blocks, which can be expressed as tiles.
  • the number of channels of the data block is the same as that of the submap and feature map. If the feature map is divided into subgraphs along the height direction, the subgraph can be divided into data blocks along the width direction; if the feature map is divided into subgraphs along the width direction, the subgraph can be divided into data blocks along the height direction. Every time a data block is processed, the controller sends a control instruction to the data buffer reading unit, the data buffer writing unit, the data supply unit, the data write-back unit, and the matrix calculation unit. After the data buffer reading unit, data buffer writing unit, data supply unit, data write-back unit, and matrix calculation unit receive the control command, the entire data flow works according to the ping-pong pipeline.
  • when the matrix calculation unit performs convolution operations, the calculation of an entire data block cannot be completed at one time, so the data block is divided into sub-data blocks (which can be expressed as slices), and one sub-data block is passed to the matrix calculation unit per clock cycle.
  • the number of channels of the subblock is the same as that of the block, submap, and feature map.
  • the data block may be divided into sub-data blocks along the height direction.
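  • the segmentation hierarchy can be sketched as nested slicing (an illustration of the data organization, not of the hardware control flow; parameter names are assumed):

```python
import numpy as np

def tile_hierarchy(feature_map, sub_h, tile_w):
    """Feature map -> sub-maps (height) -> data blocks/tiles (width)
    -> height-1 sub-data blocks/slices. Channels are never split."""
    H, W, C = feature_map.shape
    for h0 in range(0, H, sub_h):                 # sub-maps
        submap = feature_map[h0:h0 + sub_h]
        for w0 in range(0, W, tile_w):            # data blocks (tiles)
            tile = submap[:, w0:w0 + tile_w]
            for row in range(tile.shape[0]):      # sub-data blocks (slices)
                yield tile[row:row + 1]           # shape (1, tile_w, C)

slices = tile_hierarchy(np.random.rand(1024, 1024, 64), sub_h=256, tile_w=128)
print(next(slices).shape)   # (1, 128, 64)
```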
  • Fig. 8 is a schematic diagram of performing depthwise convolution on data blocks in an input feature map/input sub-map in an embodiment of the present application.
  • each sub-data block of a data block is obtained one by one for processing; after all the sub-data blocks of the data block are processed, the chip moves on to the next data block. For example, it may move along the height direction, acquiring and processing each sub-data block of the data block in turn.
  • FIG. 9 is a timing diagram of depth convolution performed in a pipeline manner in an embodiment of the present application.
  • the figure shows the time relationship of the data processing performed by the data buffer writing unit, the data supply unit, the matrix calculation unit, the data writing back unit, and the data buffer reading unit.
  • Tile0, Tile1, Tile2, ..., Tilez represent data block 0 to data block z.
  • the data buffer writing unit shown in the figure means that the data buffer writing unit reads filling data from the external memory, and writes the read filling data into the data buffer.
  • the data supply unit shown in the figure means that the data supply unit transmits the input data to the matrix calculation unit.
  • the matrix operation unit shown in the figure means that the matrix operation unit performs convolution operation on the input data.
  • the data write-back unit shown in the figure means that the data write-back unit writes output data back to the data buffer.
  • the data buffer reading unit shown in the figure means that the data buffer reading unit acquires filling data from the output sub-map cached in the data buffer, and stores the acquired filling data in the external memory.
  • FIG. 10 is a flowchart of a data processing method provided by an embodiment of the present application.
  • the data processing method provided in the embodiment of the present application is applied to a computer system, such as the computer system 10 shown in FIG. 1 .
  • the controller controls the data buffer writing unit to read input data from an external memory, and write the read input data into a data buffer for caching.
  • the input data stored in the external memory can be a feature map or sub-maps; if it is sub-maps, the data buffer writing unit starts reading from the first sub-map.
  • the controller can send a first control instruction to the data buffer writing unit, and the data buffer writing unit reads input data (such as the first sub-graph) from the external memory according to the first control instruction, and writes the read input data into the data buffer.
  • the computer system may include a DMA controller, and the computer system may write the convolution parameters into the parameter buffer through the DMA controller.
  • the controller controls the data buffer writing unit to read filling data from the external memory, and write the read filling data into the data buffer for caching.
  • for example, the data buffer writing unit can issue a read request for the filling data at the beginning, and wait for it to complete before starting the depthwise convolution of the next layer of the subgraph (for example, the depthwise convolution of subgraph [1,i]).
  • for example, the controller can send a second control instruction to the data buffer writing unit; the data buffer writing unit reads the filling data from the external memory according to the second control instruction, generates a filling data write request, and writes the read filling data into the data buffer.
  • the controller controls the data supply unit to send an input data read request to the data buffer read unit.
  • the controller may send a third control instruction to the data supply unit, and the data supply unit sends an input data read request to the data buffer reading unit according to the third control instruction.
  • the data buffer reading unit reads the input data from the data buffer according to the input data reading request, and transmits the input data to the matrix calculation unit through the data supply unit.
  • the data buffer reading unit reads one sub-data block of input data from the data buffer each time, and transmits the sub-data block to the matrix calculation unit.
  • the sub-data block may include (m+k-1)*n data points, and the data buffer reading unit reads one sub-data block from the data buffer each time.
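  • this is because m+k-1 width-adjacent points yield exactly the m overlapping k-point groups fed to the m node rows, as the following sketch shows (illustrative only):

```python
def windows_from_slice(points, k):
    """a0,a1,a2 -> node 0; a1,a2,a3 -> node 1; ... with k-1 points
    reused between neighboring nodes."""
    m = len(points) - k + 1
    return [points[i:i + k] for i in range(m)]

print(windows_from_slice(['a0', 'a1', 'a2', 'a3', 'a4'], k=3))
# [['a0', 'a1', 'a2'], ['a1', 'a2', 'a3'], ['a2', 'a3', 'a4']]
```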
  • the controller controls the matrix calculation unit to read convolution parameters from the parameter buffer, performs convolution operation on the input data according to the convolution parameters to obtain output data, and caches the output data to the output result buffer.
  • for example, the controller may send a fourth control instruction to the matrix calculation unit, and the matrix calculation unit performs the convolution operation according to the fourth control instruction to obtain output data and stores the output data in the output result buffer.
  • the matrix calculation unit performs convolution operation on a sub-block of input data every clock cycle to obtain an operation result corresponding to a sub-block.
  • the controller controls the data write-back unit to read the output data cached in the output result buffer, sends the read output data to the data buffer write unit, and sends an output data write request to the data buffer write unit.
  • for example, the controller may send a fifth control instruction to the data write-back unit; the data write-back unit reads the output data cached in the output result buffer according to the fifth control instruction, sends the read output data to the data buffer writing unit, and sends an output data write request to the data buffer writing unit.
  • the controller may write the operation result of the data block into the data buffer after performing the convolution operation on each data block of the input data.
  • Each data block includes a plurality of sub-data blocks.
  • the data buffer writing unit writes the output data into the data buffer for caching according to the output data write request.
  • the controller controls the data buffer reading unit to generate a filling data read request, read new filling data from the output data cached in the data buffer according to the filling data read request, and store the new filling data from the data buffer to the external memory.
  • for example, the controller may send a sixth control instruction to the data buffer reading unit; the data buffer reading unit generates a filling data read request according to the sixth control instruction, reads new filling data from the output data cached in the data buffer according to the filling data read request, and stores the new filling data from the data buffer to the external memory.
  • in the embodiments of the present application, a matrix calculation unit composed of a two-dimensional array of computing nodes is used for the convolution operation, and the width and channel dimensions of the feature map are mapped to the two dimensions of the matrix, respectively.
  • the embodiment of the present application solves the storage bandwidth problem of the convolution operation, improves the utilization rate of the multiplier in the computing node, and saves the input subgraph and output subgraph in the data buffer, which can greatly relieve the bandwidth pressure and release the hardware computing power.
  • FIG. 11 is a schematic structural diagram of an electronic device provided by an embodiment of the present application (for example, the computer system 10 in FIG. 1 ).
  • the electronic device 110 may include components such as a radio frequency (RF) circuit 1101, a memory 1102, an input unit 1103, a display unit 1104, a sensor 1105, an audio circuit 1106, a Wi-Fi module 1107, a processor 1108, and a power supply 1109.
  • the structure shown in FIG. 11 does not constitute a limitation on the electronic device, which may include more or fewer components than shown in the figure, combine some components, or arrange components differently.
  • the RF circuit 1101 can be used to send and receive information, or to receive and send signals during a call; in particular, after receiving downlink information from a base station, it passes the information to the processor 1108 for processing, and it also sends uplink data to the base station.
  • the RF circuit 1101 includes, but is not limited to: an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (Low Noise Amplifier, LNA), a duplexer, and the like.
  • the memory 1102 can be used to store software programs and modules, and the processor 1108 executes various functional applications and data processing of the electronic device by running the software programs and modules stored in the memory 1102 .
  • the memory 1102 can mainly include a program storage area and a data storage area, where the program storage area can store an operating system, at least one application program required by a function (such as a sound playback function or an image playback function), etc., and the data storage area can store data created according to the use of the electronic device (such as audio data or a phonebook), etc.
  • the memory 1102 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage devices.
  • the input unit 1103 can be used to receive input numbers or character information, and generate key signal input related to user settings and function control of the electronic device.
  • the input unit 1103 may include a touch panel 11031 and other input devices 11032 .
  • the touch panel 11031, also referred to as a touch screen, can collect the user's touch operations on or near it (for example, operations performed by the user on or near the touch panel 11031 using a finger, a stylus, or any other suitable object or accessory), and drive the corresponding connection device according to a preset program.
  • the touch panel 11031 may include two parts, a touch detection device and a touch controller.
  • the touch detection device detects the user's touch orientation and the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into contact coordinates, sends them to the processor 1108, and receives and executes commands sent by the processor 1108.
  • the touch panel 11031 can be implemented in various types such as resistive, capacitive, infrared, and surface acoustic wave.
  • the input unit 1103 may also include other input devices 11032 .
  • other input devices 11032 may include, but are not limited to, one or more of physical keyboards, function keys (such as volume control keys, switch keys, etc.), trackballs, mice, joysticks, and the like.
  • the display unit 1104 may be used to display information input by or provided to the user and various menus of the electronic device.
  • the display unit 1104 may include a display panel 11041.
  • the display panel 11041 may be configured in the form of a liquid crystal display (Liquid Crystal Display, LCD), an organic light-emitting diode (Organic Light-Emitting Diode, OLED), or the like.
  • the touch panel 11031 can cover the display panel 11041; when the touch panel 11031 detects a touch operation on or near it, it passes the operation to the processor 1108 to determine the type of the touch event, and the processor 1108 then provides a corresponding visual output on the display panel 11041 according to the type of the touch event.
  • although in FIG. 11 the touch panel 11031 and the display panel 11041 are used as two independent components to implement the input and output functions of the electronic device, in some embodiments the touch panel 11031 and the display panel 11041 can be integrated to implement the input and output functions of the electronic device.
  • the electronic device may also include at least one sensor 1105, such as a light sensor, a motion sensor, and other sensors.
  • the light sensor may include an ambient light sensor and a proximity sensor; the ambient light sensor may adjust the brightness of the display panel 11041 according to the brightness of the ambient light, and the proximity sensor may turn off the display panel 11041 and/or the backlight.
  • as one kind of motion sensor, the accelerometer sensor can detect the magnitude of acceleration in various directions (generally three axes), and can detect the magnitude and direction of gravity when stationary; it can be used to identify the posture of the electronic device (such as landscape/portrait switching, related games, and magnetometer attitude calibration) and for vibration-recognition-related functions (such as a pedometer or tapping). In addition, the electronic device can also be configured with other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which will not be repeated here.
  • the audio circuit 1106, the speaker 11061, and the microphone 11062 can provide an audio interface between the user and the electronic device.
  • the audio circuit 1106 can transmit the electrical signal converted from the received audio data to the speaker 11061, and the speaker 11061 converts it into a sound signal for output; conversely, the microphone 11062 converts the collected sound signal into an electrical signal, which is received by the audio circuit 1106 and converted into audio data; after the audio data is processed by the audio data output processor 1108, it is sent through the RF circuit 1101 to another electronic device, or output to the memory 1102 for further processing.
  • Wi-Fi is a short-distance wireless transmission technology.
  • the electronic device 110 can help users send and receive emails, browse web pages, and access streaming media through the Wi-Fi module 1107, which provides users with wireless broadband Internet access.
  • although FIG. 11 shows the Wi-Fi module 1107, it can be understood that it is not an essential component of the electronic device and can be omitted as needed without changing the essence of the invention.
  • the processor 1108 is the control center of the electronic device; it uses various interfaces and lines to connect the various parts of the entire electronic device, and performs various functions of the electronic device and processes data by running or executing software programs and/or modules stored in the memory 1102 and calling data stored in the memory 1102, so as to monitor the electronic device as a whole.
  • the processor 1108 may include one or more processing units; preferably, the processor 1108 may integrate an application processor and a modem, where the application processor mainly handles the operating system, user interface, application programs, etc., and the modem mainly handles wireless communication. It can be understood that the modem may alternatively not be integrated into the processor 1108.
  • the electronic device also includes a power supply 1109 (such as a battery) for supplying power to the various components.
  • preferably, the power supply can be logically connected to the processor 1108 through a power management system, so that functions such as charging, discharging, and power consumption management are implemented through the power management system.
  • the electronic device may also include a camera, a Bluetooth module, etc., which will not be repeated here.
  • the electronic device described in FIG. 11 may be used to implement the data processing method introduced in the embodiment of the present application. For details, reference may be made to relevant descriptions in the foregoing embodiments, and details are not repeated here.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

The present application provides a computer system and a data processing method. The computer system includes a matrix calculation unit composed of two-dimensional computing nodes. The matrix calculation unit is configured to perform a convolution operation on input data according to convolution parameters, and each computing node completes the calculation of a single convolution window. The present application can speed up depthwise convolution processing and reduce system energy consumption.

Description

Computer system and data processing method

Technical Field

The present application relates to the field of computer technology, and in particular to a computer system and a data processing method.

Background

Current mainstream machine vision solutions mostly use convolutional neural networks (Convolutional Neural Network, CNN). However, classic CNNs have many parameters and a large amount of computation; on edge computing platforms with limited computing power and battery capacity, real-time performance and battery life are the main obstacles to bringing products to market. Depthwise convolution can significantly reduce the number of CNN parameters and bring a huge increase in computing speed at a limited loss of accuracy, which has great engineering value in the field of edge computing.
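As a worked illustration of this parameter reduction (the figures below are chosen for illustration and are not taken from this application), a k*k standard convolution with c_in input channels and c_out output channels uses k*k*c_in*c_out weights, while its depthwise separable counterpart uses k*k*c_in + c_in*c_out:

```python
def conv_params(k, c_in, c_out):
    """Weight counts (biases omitted) for one layer."""
    standard = k * k * c_in * c_out            # full 3-D convolution
    separable = k * k * c_in + c_in * c_out    # depthwise + 1*1 pointwise
    return standard, separable

std, sep = conv_params(k=3, c_in=64, c_out=128)
print(std, sep, round(std / sep, 1))   # 73728 8768 8.4
```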
However, current mainstream computer architectures do not support depthwise convolution well, and suffer from slow computation speed and high system energy consumption.
Summary of the invention

Embodiments of the present application provide a computer system and a data processing method, which can speed up depthwise convolution processing and reduce system energy consumption.

A first aspect of the present application provides a computer system, including a matrix computing unit composed of two-dimensional computing nodes. The matrix computing unit is configured to perform a convolution operation on input data according to convolution parameters, and each computing node completes the computation of a single convolution window.

The present application performs the convolution operation of depthwise convolution with a matrix computing unit composed of two-dimensional computing nodes; the computation of multiple data points can be completed in each clock cycle, which speeds up depthwise convolution processing and reduces system energy consumption.
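For concreteness, the following is a minimal NumPy sketch of the operation being parallelized (the function name and array layout are our own illustrative assumptions, not part of the claimed hardware): a depthwise convolution applies one k*k filter per channel, and each output point corresponds to exactly one convolution window, which is what one computing node produces.

```python
import numpy as np

def depthwise_conv(x, w, b, stride=1):
    """Depthwise convolution sketch.

    x: input of shape (H, W, C); w: one k*k filter per channel,
    shape (k, k, C); b: per-channel bias, shape (C,).
    Each output point is the result of a single convolution window,
    i.e. what one computing node computes.
    """
    h, wd, c = x.shape
    k = w.shape[0]
    oh = (h - k) // stride + 1
    ow = (wd - k) // stride + 1
    out = np.empty((oh, ow, c))
    for i in range(oh):
        for j in range(ow):
            win = x[i * stride:i * stride + k, j * stride:j * stride + k, :]
            out[i, j, :] = (win * w).sum(axis=(0, 1)) + b  # one window per node
    return out
```

In the claimed system these windows are computed in parallel across the two-dimensional node array rather than in the nested loops shown here.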
In some optional implementations, the number of columns of the matrix computing unit is equal to the number of channels of the input data.

In some optional implementations, the computer system further includes an external memory, and the external memory is configured to store the input data and the convolution parameters.

In some optional implementations, the computer system further includes a controller, a data buffer write unit, a data buffer read unit, a data supply unit, a data write-back unit, a data buffer, a parameter buffer, and an output result buffer. The parameter buffer is connected to the external memory and is configured to cache the convolution parameters. The controller is configured to control the data buffer write unit to read the input data from the external memory and write the read input data into the data buffer for caching. The controller is further configured to control the data supply unit to send an input data read request to the data buffer read unit. The data buffer read unit is configured to read the input data from the data buffer according to the input data read request and transfer the input data to the matrix computing unit through the data supply unit. The controller is further configured to control the matrix computing unit to read the convolution parameters from the parameter buffer, perform a convolution operation on the input data according to the convolution parameters to obtain output data, and cache the output data in the output result buffer. The controller is further configured to control the data write-back unit to read the output data cached in the output result buffer, send the read output data to the data buffer write unit, and send an output data write request to the data buffer write unit. The data buffer write unit is further configured to write the output data into the data buffer for caching according to the output data write request.
In some optional implementations, the input data includes sub-maps obtained by splitting a feature map, and the external memory is further configured to store padding data. The controller is further configured to control the data buffer write unit to read the padding data from the external memory, generate a padding data write request, and write the read padding data into the data buffer according to the padding data write request. The controller is further configured to control the data buffer read unit to generate a padding data read request, read new padding data from the output data cached in the data buffer according to the padding data read request, and store the new padding data from the data buffer into the external memory.

In some optional implementations, the data buffer write unit is further configured to perform a write conflict check on the output data write request and the padding data write request.

In some optional implementations, the priority of the output data write request is higher than the priority of the padding data write request.
In some optional implementations, the data buffer read unit is further configured to perform a read conflict check on the input data read request and the padding data read request.
In some optional implementations, the priority of the input data read request is higher than the priority of the padding data read request.

In some optional implementations, the sub-maps are obtained by splitting the feature map along its height direction or width direction.

In some optional implementations, the computer system performs depthwise convolution on the input data in a pipelined manner.

In some optional implementations, the input data is split into data blocks along the width direction or the height direction, the data blocks are split into sub-data blocks along the height direction or the width direction, and the matrix computing unit performs a convolution operation on one sub-data block in each clock cycle.

In some optional implementations, the height of the sub-data block is 1.
A second aspect of the present application provides a data processing method applied to a computer system, the computer system including a matrix computing unit composed of two-dimensional computing nodes. The method includes: the matrix computing unit performing a convolution operation on input data according to convolution parameters, with each computing node completing the computation of a single convolution window.

In some optional implementations, the number of columns of the matrix computing unit is equal to the number of channels of the input data.

In some optional implementations, the computer system further includes an external memory, and the external memory is configured to store the input data and the convolution parameters.

In some optional implementations, the computer system further includes a controller, a data buffer write unit, a data buffer read unit, a data supply unit, a data write-back unit, a data buffer, a parameter buffer, and an output result buffer, the parameter buffer being connected to the external memory and configured to cache the convolution parameters, and the method further includes: the controller controlling the data buffer write unit to read the input data from the external memory and write the read input data into the data buffer for caching; the controller controlling the data supply unit to send an input data read request to the data buffer read unit; the data buffer read unit reading the input data from the data buffer according to the input data read request and transferring the input data to the matrix computing unit through the data supply unit; the controller controlling the matrix computing unit to read the convolution parameters from the parameter buffer, perform a convolution operation on the input data according to the convolution parameters to obtain output data, and cache the output data in the output result buffer; the controller controlling the data write-back unit to read the output data cached in the output result buffer, send the read output data to the data buffer write unit, and send an output data write request to the data buffer write unit; and the data buffer write unit writing the output data into the data buffer for caching according to the output data write request.

In some optional implementations, the input data includes sub-maps obtained by splitting a feature map, and the external memory is further configured to store padding data. The method further includes: the controller controlling the data buffer write unit to read the padding data from the external memory, generate a padding data write request, and write the read padding data into the data buffer according to the padding data write request; and the controller controlling the data buffer read unit to generate a padding data read request, read new padding data from the output data cached in the data buffer according to the padding data read request, and store the new padding data from the data buffer into the external memory.

In some optional implementations, the method further includes: the data buffer write unit performing a write conflict check on the output data write request and the padding data write request.

In some optional implementations, the priority of the output data write request is higher than the priority of the padding data write request.
In some optional implementations, the method further includes: the data buffer read unit performing a read conflict check on the input data read request and the padding data read request.
In some optional implementations, the priority of the input data read request is higher than the priority of the padding data read request.

In some optional implementations, the sub-maps are obtained by splitting the feature map along its height direction or width direction.

In some optional implementations, the data processing method performs depthwise convolution on the input data in a pipelined manner.

In some optional implementations, the input data is split into data blocks along the width direction or the height direction, the data blocks are split into sub-data blocks along the height direction or the width direction, and the matrix computing unit performs a convolution operation on one sub-data block in each clock cycle.

In some optional implementations, the height of the sub-data block is 1.

It should be understood that the data processing method provided in the second aspect corresponds to the computer system of the first aspect; therefore, for the beneficial effects it can achieve, reference may be made to the beneficial effects of the corresponding computer system provided above, and details are not repeated here.
Brief description of the drawings

FIG. 1 is a schematic diagram of a computer system provided by an embodiment of the present application.

FIG. 2 is a schematic diagram of the data supply of the matrix computing unit in FIG. 1.

FIG. 3 is a schematic diagram of a computing node in an embodiment of the present application.

FIG. 4 is a schematic diagram of data reuse within the matrix computing unit in an embodiment of the present application.

FIG. 5 is a schematic diagram of splitting a feature map into sub-maps and processing the sub-maps.

FIG. 6 is a flowchart of processing a sub-map in an embodiment of the present application.

FIG. 7 is a schematic diagram of the relationship among feature maps, sub-maps, data blocks, and sub-data blocks in an embodiment of the present application.

FIG. 8 is a schematic diagram of performing depthwise convolution on the data blocks in an input feature map/input sub-map in an embodiment of the present application.

FIG. 9 is a timing diagram of depthwise convolution performed in a pipelined manner in an embodiment of the present application.

FIG. 10 is a flowchart of a data processing method provided by an embodiment of the present application.

FIG. 11 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.

Detailed description of the embodiments

For ease of understanding, explanations of some concepts related to the embodiments of the present application are given as examples for reference.

It should be noted that in the present application, "at least one" means one or more, and "multiple" means two or more than two. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may represent the following cases: A exists alone, both A and B exist, and B exists alone, where A and B may be singular or plural. The terms "first", "second", "third", "fourth", and the like (if any) in the specification, claims, and drawings of the present application are used to distinguish similar objects and are not used to describe a specific order or sequence.
FIG. 1 is a schematic diagram of a computer system provided by an embodiment of the present application.

The computer system 10 provided by this embodiment of the present application includes a chip 100 (that is, a System on Chip, SOC) and an external memory 101. The chip 100 includes a controller 1001, a data buffer write unit 1002, a data buffer read unit 1003, a data supply unit 1004, a data write-back unit 1005, a matrix computing unit 1006, a data buffer 1007, a parameter buffer 1008, and an output result buffer 1009. The external memory 101 is connected to the data buffer read unit 1003, the data buffer write unit 1002, and the parameter buffer 1008; the data buffer 1007 is connected to the data buffer write unit 1002 and the data buffer read unit 1003; the controller 1001 is connected to the data buffer write unit 1002, the data buffer read unit 1003, the data supply unit 1004, the data write-back unit 1005, and the matrix computing unit 1006; the data supply unit 1004 connects the data buffer read unit 1003 and the matrix computing unit 1006; the data write-back unit 1005 connects the data buffer write unit 1002 and the output result buffer 1009; and the matrix computing unit 1006 connects the parameter buffer 1008 and the output result buffer 1009.

Exemplarily, the controller 1001 may include at least one of the following types: a Central Processing Unit (CPU), a Microcontroller Unit (MCU), a Digital Signal Processor (DSP), an Application Processor (AP), a Graphics Processing Unit (GPU), and a Neural-network Processing Unit (NPU).

The computer system 10 may be a robot, a mobile phone, an on-board computer, or the like, and the corresponding chip 100 may be a robot chip, a mobile phone chip, an automotive chip, or the like. The computer system 10 can implement machine-vision tasks such as recognition and classification through related software.

The computer system 10 performs the convolution operation of depthwise convolution with the matrix computing unit 1006, which enables fast, low-energy depthwise convolution. By having the units in the chip 100 cooperate with one another, the computer system 10 can increase the effective storage bandwidth. In an embodiment of the present application, the computer system 10 may be used for MobileNet. MobileNet is an efficient model proposed for mobile and embedded devices. MobileNet is based on a streamlined architecture and uses depthwise separable convolutions to build a lightweight deep neural network. A depthwise separable convolution consists of two parts: a depthwise convolution and a pointwise convolution. The computer system 10 provided by the embodiments of the present application may be used for the depthwise convolution of MobileNet. It can be understood that the computer system 10 provided by the embodiments of the present application may be used to implement other convolutional neural networks.
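As background on the MobileNet building block mentioned above, a depthwise separable convolution is the composition of the depthwise step with a pointwise (1x1) step. The sketch below reuses the depthwise_conv function from the earlier illustration; it is again our own simplification for exposition, not the claimed implementation.

```python
def pointwise_conv(x, w, b):
    """Pointwise (1x1) convolution: mixes channels at each spatial
    position. x: (H, W, C_in); w: (C_in, C_out); b: (C_out,)."""
    return x @ w + b

def depthwise_separable_conv(x, w_dw, b_dw, w_pw, b_pw):
    """Depthwise separable convolution as used in MobileNet: a
    depthwise convolution (spatial filtering, per channel) followed
    by a pointwise convolution (channel mixing)."""
    return pointwise_conv(depthwise_conv(x, w_dw, b_dw), w_pw, b_pw)
```

The system described here accelerates the depthwise half of this decomposition, which is the part that mainstream channel-reuse architectures handle poorly.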
The external memory 101 is configured to store the input data and the convolution parameters. The chip 100 is configured to perform depthwise convolution on the input data stored in the external memory 101 according to the convolution parameters.

The input data may be a feature map, or a sub-map (that is, a sub-feature-map) obtained by splitting a feature map.

A feature map has three dimensions: width (W), height (H), and number of channels (C). If the width of a feature map is denoted w, its height h, and its number of channels c, the size of the feature map can be expressed as w*h*c. The convolution parameters may include weights and biases.

It should be noted that the input data may be an image (an image can be regarded as a special feature map), and the chip 100 may perform depthwise convolution on an image stored in the external memory 101 according to the convolution parameters.

To support depthwise convolution of large feature maps, when the size of the feature map is large, the feature map can be split into sub-maps and the sub-maps stored in the external memory 101. The feature map may be split into sub-maps along the height direction or along the width direction. For example, a feature map of size 1024*1024*64 (width 1024, height 1024, 64 channels) can be split along the height direction into four 1024*256*64 sub-maps, and the four 1024*256*64 sub-maps are stored in the external memory 101.
The chip 100 may perform one layer of depthwise convolution on the input data (that is, the feature map or sub-maps), or multiple layers of depthwise convolution.

If the input data includes multiple sub-maps, the chip 100 performs depthwise convolution on each sub-map. If multiple layers of depthwise convolution are required, the chip 100 performs all layers of depthwise convolution on one sub-map before performing depthwise convolution on the next sub-map.

When the chip 100 performs multiple layers of depthwise convolution on sub-maps, the padding data required by the next layer of depthwise convolution can be obtained from the output data of each layer (that is, the output sub-map), and the padding data is stored in the external memory 101 for the next layer of depthwise convolution. In an embodiment of the present application, with a convolution kernel of size k*k, the bottom k-1 rows of the output sub-map along the height direction are used as the padding data. For example, with a 3*3 convolution kernel, the padding data is the bottom 2 rows of the output sub-map along the height direction.
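The padding-data rule just described can be sketched as follows (helper names are our own; NumPy as in the earlier sketch; the k-1-row extraction follows this paragraph, and the top-stitching follows the later description of FIG. 5):

```python
def padding_rows(out_submap, k):
    """Bottom k-1 rows of an output sub-map along the height axis;
    these are stored to the external memory as padding data for the
    next layer. out_submap has shape (H, W, C), k >= 2."""
    return out_submap[-(k - 1):, :, :]

def stitch(pad, submap):
    """Prepend the padding rows on top of a sub-map to form the next
    layer's input sub-map (illustrative; see FIG. 5 below)."""
    return np.concatenate([pad, submap], axis=0)
```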
The data buffer 1007 is configured to cache the input data, the output data, and the padding data. The data buffer 1007 is composed of multiple Random Access Memories (RAM). The RAMs composing the data buffer 1007 may be single-port RAMs or dual-port RAMs. A single-port RAM has only one set of data lines and address lines and cannot read and write at the same time; a dual-port RAM has two sets of data lines and address lines and can read and write at the same time. The RAMs in the data buffer 1007 follow a certain address mapping and are used to cache the input data of the depthwise convolution (that is, the input feature map/input sub-map), the output data (that is, the output feature map/output sub-map), and the padding data. "Input feature map/input sub-map" and "output feature map/output sub-map" are relative to one depthwise convolution: the input feature map/input sub-map is the feature map/sub-map before the depthwise convolution, for example one coming from the external memory 101, and the output feature map/output sub-map is the feature map/sub-map after the depthwise convolution. The data buffer 1007 includes at least two groups of RAMs to support simultaneous reading and writing. Each group of RAMs includes multiple RAMs to support reading/writing multiple data points along the width direction of the feature map/sub-map in one access. In FIG. 1, the data buffer 1007 includes two groups of RAMs; with an input feature map/input sub-map of width w and a convolution kernel of size k*k, each group includes at least w+k-1 RAMs. One group of RAMs stores the input feature map/input sub-map and the other group stores the output feature map/output sub-map, supporting at a minimum simultaneous independent-address reads/writes across w+k-1 RAMs. The more groups of RAMs the data buffer 1007 can read and write simultaneously, the smaller the probability of data read/write conflicts.
In an embodiment of the present application, the data buffer 1007 may read and write data in a ping-pong manner. Ping-pong is a data-buffering technique that uses two data buffers to achieve continuous data transfer and increase the data transfer rate. Data in a single buffer is easily overwritten during transfer and processing, whereas the ping-pong approach keeps reading from one buffer while writing to the other, that is, the two buffers alternate between reading and writing.
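A minimal software analogue of the ping-pong scheme (illustrative only; the class name and interface are our own, not the hardware's):

```python
class PingPongBuffer:
    """Two banks: while the consumer reads one bank, the producer
    fills the other; swap() exchanges the roles each iteration so
    transfers stay continuous and data is never overwritten mid-read."""

    def __init__(self):
        self.banks = [None, None]
        self.write_sel = 0  # index of the bank currently being written

    def write(self, data):
        self.banks[self.write_sel] = data

    def read(self):
        return self.banks[1 - self.write_sel]  # the bank not being written

    def swap(self):
        self.write_sel = 1 - self.write_sel
```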
In an embodiment of the present application, the data in the data buffer 1007 is stored in the NHWC data format, where N denotes the number of images in a batch, H denotes height, W denotes width, and C denotes the number of channels. Storing data in the NHWC format means fetching and storing data in the order of the C direction first, then the W direction, then the H direction, and then the N direction.
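The NHWC ordering implies the following linear addressing (a sketch; the actual address mapping of the data buffer 1007 across its RAM groups is not specified here):

```python
def nhwc_offset(n, h, w, c, H, W, C):
    """Linear offset of element (n, h, w, c) in NHWC layout:
    C varies fastest, then W, then H, then N."""
    return ((n * H + h) * W + w) * C + c

# e.g. with H=1, W=4, C=64, moving one step along W skips C elements:
assert nhwc_offset(0, 0, 1, 0, H=1, W=4, C=64) == 64
```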
The parameter buffer 1008 is configured to cache the convolution parameters. The parameter buffer 1008 may be composed of small-capacity RAMs and may cache the convolution parameters in a ping-pong manner. The computer system 10 may include a DMA (Direct Memory Access) controller (not shown in the figure), through which the convolution parameters can be written from the external memory 101 into the parameter buffer by DMA.

The output result buffer 1009 is configured to cache the operation results of the matrix computing unit 1006 (that is, the output data). The output result buffer 1009 supports reads by the data write-back unit 1005. The output result buffer 1009 may be composed of small-capacity RAMs and may cache the operation results of the matrix computing unit 1006 in a ping-pong manner.

The controller 1001 is configured to send control instructions to the data buffer write unit 1002, the data buffer read unit 1003, the data supply unit 1004, the data write-back unit 1005, and the matrix computing unit 1006 to indicate information such as the size and location of the data to be processed.
The controller 1001 controls the data buffer write unit 1002 to read the input data from the external memory 101 and write the read input data into the data buffer 1007 for caching.

The controller 1001 is further configured to control the data supply unit 1004 to send an input data read request to the data buffer read unit 1003.

The data buffer read unit 1003 is configured to read the input data from the data buffer 1007 according to the input data read request and transfer the input data to the matrix computing unit 1006 through the data supply unit 1004. The data supply unit 1004 may process the input data with an activation function. In an embodiment of the present application, the activation function used by the data supply unit 1004 is the ReLU function.

The controller 1001 is further configured to control the matrix computing unit 1006 to read the convolution parameters from the parameter buffer 1008, perform a convolution operation on the input data according to the convolution parameters to obtain output data, and cache the output data in the output result buffer 1009.

The controller 1001 is further configured to control the data write-back unit 1005 to read the output data cached in the output result buffer 1009, send the read output data to the data buffer write unit 1002, and send an output data write request to the data buffer write unit 1002. The data write-back unit 1005 may quantize the operation results. Quantizing the operation results means converting their format, for example converting the operation results from 32-bit floating-point numbers to 8-bit fixed-point numbers.

The data buffer write unit 1002 is further configured to write the output data into the data buffer 1007 for caching according to the output data write request.
If the input data is a sub-map obtained by splitting a feature map, the external memory 101 is further configured to store padding data. The controller 1001 is further configured to control the data buffer write unit 1002 to read the padding data from the external memory 101, generate a padding data write request, and write the read padding data into the data buffer 1007 for caching according to the padding data write request. The controller 1001 is further configured to control the data buffer read unit 1003 to generate a padding data read request, read new padding data from the output data cached in the data buffer 1007 according to the padding data read request, and store the new padding data from the data buffer 1007 into the external memory 101.

In an embodiment of the present application, the data buffer read unit 1003 performs a read conflict check on the input data read request and the padding data read request. Since the amount of padding data is small, the priority of the input data read request may be higher than the priority of the padding data read request.

In an embodiment of the present application, the data buffer write unit 1002 performs a write conflict check on the output data write request and the padding data write request. Since the amount of padding data is small, the priority of the output data write request may be higher than the priority of the padding data write request.

In an embodiment of the present application, the addresses requested by the input data read requests of the data supply unit 1004 and by the output data write requests of the data write-back unit 1005 are consecutive, which avoids read conflicts at the data supply unit 1004 and write conflicts at the data write-back unit 1005. The data supply unit 1004 reads the data buffer, the data write-back unit 1005 writes the data buffer, and the two units access the multiple RAMs of the data buffer 1007 at the same time.

In the data buffer 1007, a single group of RAMs may accept only a read request or only a write request at a time, or may accept a read request and a write request at the same time, in which case the read request and the write request are executed serially with the write request taking priority.

In the embodiment shown in FIG. 1, the data buffer write unit 1002 and the data buffer read unit 1003 alternately occupy the two groups of RAMs to avoid read/write conflicts and improve the utilization of the data buffer 1007.
The matrix computing unit 1006 is composed of two-dimensional computing nodes. As shown in FIG. 1, the matrix computing unit 1006 includes m*n computing nodes, where m is the number of rows and n is the number of columns of the matrix computing unit 1006. The number of rows m of the matrix computing unit 1006 equals the width of the input data provided by the data supply unit 1004, and the number of columns n equals the number of channels c of the input data provided by the data supply unit 1004. In other words, the width and the channels of the input data are mapped to the rows and columns of the matrix computing unit 1006, respectively. Using the matrix computing unit 1006 can improve depthwise convolution performance.

FIG. 2 is a schematic diagram of the data supply of the matrix computing unit in FIG. 1.

The data supply unit provides the matrix computing unit with input data (the input feature map/input sub-map). In the embodiment shown in FIG. 2, the data supply unit transfers the input data of one sub-data block to the matrix computing unit at a time. A sub-data block is, for example, the gray part in the upper left corner of the figure. In an embodiment of the present application, the height of a sub-data block is 1. Sub-data blocks are described in detail later. In an embodiment of the present application, with a convolution kernel of size k*k, the data supply unit transfers k data points of the sub-data block to each computing node of the matrix computing unit per transfer, and the matrix computing unit completes one convolution operation after k data transfers. For example, with a 3*3 convolution kernel, the data supply unit transfers 3 data points of the sub-data block to each computing node per transfer, and one convolution operation takes 3 data transfers. For instance, in one data transfer the data supply unit sends a0, a1, a2 to computing node (0,0), a1, a2, a3 to computing node (1,0), a2, a3, a4 to computing node (2,0), and so on.
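The per-transfer supply pattern can be sketched as follows (an illustrative helper with our own naming; it models only the width direction for one column of nodes):

```python
def row_windows(row, m, k):
    """One data transfer: from a slice row of width m+k-1, computing
    node (i, col) receives the k consecutive points row[i:i+k].
    Adjacent nodes share k-1 of those points; e.g. a1 and a2 go to
    both node (0,0) and node (1,0)."""
    assert len(row) == m + k - 1
    return [row[i:i + k] for i in range(m)]

# Example matching FIG. 2 (m = 3, k = 3):
# row_windows(['a0', 'a1', 'a2', 'a3', 'a4'], 3, 3)
# -> [['a0','a1','a2'], ['a1','a2','a3'], ['a2','a3','a4']]
```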
The parameter buffer provides the matrix computing unit with the convolution parameters. In the embodiment shown in FIG. 2, the convolution parameters provided to the matrix computing unit include weights (denoted weight in the figure) and biases (denoted bias in the figure). The parameter buffer provides one set of weights and one bias for each column of the matrix computing unit. For example, the first column of the matrix computing unit is provided with weights d0, d1, d2, d3, d4, d5, d6, d7, d8 and bias e0.

Each column of the matrix computing unit computes the data of one channel of the input data, and each row computes the data of one width position, so the matrix computing unit produces m*n operation results per clock cycle.

The matrix computing unit enables data reuse along the width or height direction of the input data. Referring to FIG. 2, with a 3*3 convolution kernel, along the width (W) direction of the input feature map/input sub-map, the first to third data points a0, a1, a2 are input to computing node (0,0) and the second to fourth data points a1, a2, a3 are input to (1,0), so the second and third data points a1, a2 are reused. The data of the input feature map/input sub-map is expanded inside the matrix computing unit 1006 after being read out, and the data reuse along the width/height direction saves wiring overhead between the data buffer 1007 and the matrix computing unit 1006.
FIG. 3 is a schematic diagram of a computing node in an embodiment of the present application.

In an embodiment of the present application, each computing node may include a first register, a second register, multipliers, an adder tree (the multipliers and the adder tree are together denoted MAC in the figure), and an accumulator (denoted bias adder in the figure). The first register is used to store the data points of the input data (denoted data in the figure), and the second register is used to store the convolution parameters (including the weights and the bias, denoted weight and bias in the figure). The multipliers multiply each data point by its corresponding weight, the adder tree sums the results of the multipliers, and the accumulator adds the bias to the result of the adder tree. The number of multipliers contained in each computing node equals the convolution kernel size: with a kernel of size k*k, each computing node contains k*k multipliers. For example, with a 3*3 kernel, each computing node contains 9 multipliers.

Each computing node of the matrix computing unit completes the computation of a single convolution window. Taking a 3*3 convolution as an example, with the data of the input feature map/input sub-map denoted in[0], in[1], ..., in[8], the weights (that is, the data of the convolution kernel) denoted w[0], w[1], ..., w[8], and the bias denoted bias, the computation completed by a computing node can be expressed as:

Out = (in[0]*w[0] + in[1]*w[1] + ... + in[8]*w[8]) + bias.

For example, if in[0]=a0, in[1]=a1, in[2]=a2, in[3]=b0, in[4]=b1, in[5]=b2, in[6]=c0, in[7]=c1, in[8]=c2, w[0]=d0, w[1]=d1, w[2]=d2, w[3]=d3, w[4]=d4, w[5]=d5, w[6]=d6, w[7]=d7, w[8]=d8, and bias=e0, the computation completed by the computing node is:

Out = (a0*d0 + a1*d1 + a2*d2 + b0*d3 + b1*d4 + b2*d5 + c0*d6 + c1*d7 + c2*d8) + e0.
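The same per-node computation as a runnable sketch (dummy numeric values stand in for a0..c2, d0..d8, and e0, which FIG. 2 leaves symbolic):

```python
def compute_node(data, weight, bias):
    """One computing node: k*k multipliers, an adder tree summing the
    products, and a bias adder, i.e. Out = sum(in[i]*w[i]) + bias."""
    return sum(d * w for d, w in zip(data, weight)) + bias

# Example with k = 3 (9 multipliers per node):
window = list(range(9))    # stands in for a0, a1, a2, b0, b1, b2, c0, c1, c2
weights = [0.1] * 9        # stands in for d0 .. d8
out = compute_node(window, weights, bias=0.5)   # bias stands in for e0
```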
在本申请的另一个实施例中,每个计算节点可以包括第一寄存器、第二寄存器、乘法器和加法树。加法树将对各个乘法器的结果相加,并且与偏置相加。
在本申请的一个实施例中,矩阵计算单元可以采用窗口数据循环寄存(参见图4)的方法,减少数据缓冲器到矩阵计算单元的线宽。
图4是本申请实施例中矩阵计算单元内数据复用的示意图。
在深度卷积过程中,在输入数据(输入特征图/输入子图)上沿着高度(H)或宽度(W)方向滑动,以获取提供给矩阵计算单元的数据点(即存入第一寄存器的数据)。 矩阵计算单元每个时钟周期在第一寄存器存储k*k(例如9)个数据,下一时钟周期在输入数据上沿着高度/宽度方向滑动时,计算节点内的第一寄存器已经存储了k*k个数据,只需将顶部1*k个数据替换为新输入的1*k个数据,组成新的窗口进行下一次的计算,这样可以充分利用输入特征图/输入子图高度方向的数据,实现了数据复用(例如3*3窗口,步长(stride)=1,上下/左右相邻两个窗口有2*3/3*2的数据复用)。例如,参阅图4中step4、step5,在step4进行一次卷积运算后,只需将a0、a1、a2替换为新输入的d0、d1、d2,即可进行下一次的卷积运算。本申请实施例中,不需要重复到数据缓冲器中取数据,每个时钟周期都能进行有效计算并能节省存储器开销,减少系统能耗。如果矩阵计算单元不对数据进行缓冲,就需要数据供应单元缓冲并传送数据,这会大大增加数据供应单元与矩阵计算单元之间的走线数量。本申请实施例大大降低数据供应单元与矩阵计算单元之间的走线。
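A software analogue of the window-data cyclic registering (our own sketch using a deque; the hardware uses shift registers rather than a queue):

```python
from collections import deque

def slide_window(window, new_row):
    """Moving one step along the height replaces only the oldest 1*k
    row of the k*k window with the newly supplied row, so k*(k-1)
    points are reused without re-reading the data buffer."""
    window.popleft()        # drop the top row, e.g. a0, a1, a2
    window.append(new_row)  # shift in the new row, e.g. d0, d1, d2
    return window

window = deque([['a0', 'a1', 'a2'],
                ['b0', 'b1', 'b2'],
                ['c0', 'c1', 'c2']])
slide_window(window, ['d0', 'd1', 'd2'])  # next window: rows b*, c*, d*
```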
Current mainstream computer architectures do not support depthwise convolution well. On the one hand, mainstream architectures target general convolution operations, that is, three-dimensional (3D) convolution, and reuse data along the channel direction to reduce system energy consumption, whereas depthwise convolution does not need data reuse along the channel direction. On the other hand, depthwise convolution has little data dependency and computes quickly, so storage becomes the main bottleneck.

The computer system provided by the embodiments of the present application performs the convolution operation with the matrix computing unit and can complete the computation of multiple data points in each clock cycle, which speeds up depthwise convolution processing and reduces system energy consumption. The computer system provided by the embodiments of the present application also reuses data along the width or height direction, further reducing system energy consumption. In addition, it uses the data buffer to cache the input data and the output data, so that what moves in and out of the data buffer is mainly the small amount of padding data, which greatly relieves the memory bandwidth requirement.
FIG. 5 is a schematic diagram of splitting a feature map into sub-maps and processing the sub-maps. FIG. 5 shows three layers of depthwise convolution.

A large feature map can be split into sub-maps along the height or width direction. FIG. 5 shows splitting a feature map into sub-maps along the height direction.

Referring to FIG. 5, sub-map [0,i-1] denotes the (i-1)-th input sub-map of the first layer of depthwise convolution, sub-map [0,i] the i-th input sub-map of the first layer, and sub-map [0,i+1] the (i+1)-th input sub-map of the first layer; sub-map [1,i-1] denotes the (i-1)-th input sub-map of the second layer, sub-map [1,i] the i-th input sub-map of the second layer, and sub-map [1,i+1] the (i+1)-th input sub-map of the second layer; sub-map [2,i-1] denotes the (i-1)-th input sub-map of the third layer, sub-map [2,i] the i-th input sub-map of the third layer, and sub-map [2,i+1] the (i+1)-th input sub-map of the third layer. After an input sub-map goes through one layer of depthwise convolution to produce an output sub-map, padding data is obtained from the output sub-map (that is, the output data) and stored in the external memory for stitching the input sub-map of the next layer of depthwise convolution.

The input sub-maps of the first layer of depthwise convolution (for example, sub-map [0,i]) can be loaded directly, without stitching with padding data. After one layer of depthwise convolution is completed for a sub-map (for example, sub-map [0,i]), the data buffer write unit writes that sub-map's output sub-map (for example, the output sub-map of sub-map [0,i]) into the data buffer, and the padding data obtained from the output sub-map of the previous sub-map (for example, the bottom 2 rows of the output sub-map of sub-map [0,i-1]) is loaded from the external memory into the data buffer and stitched with that sub-map's output sub-map to form the input sub-map of the next layer of depthwise convolution (for example, sub-map [1,i]); meanwhile, the data buffer read unit obtains padding data from that sub-map's output sub-map cached in the data buffer, for stitching with the output sub-map of the next sub-map (for example, sub-map [0,i+1]). For the last layer of depthwise convolution, since no convolution operation follows, the padding data no longer needs to be read out and stored in the external memory. In an embodiment of the present application, when the padding data is stitched with an output sub-map, the padding data is added at the top of the output sub-map. It should be noted that for the first sub-map there is no need to read padding data from the external memory and write it into the data buffer, and for the last sub-map there is no need to read padding data out of the data buffer and store it in the external memory.
FIG. 6 is a flowchart of processing a sub-map in an embodiment of the present application.

As shown in FIG. 6, the chip performs all layers of depthwise convolution on one sub-map before performing depthwise convolution on the next sub-map. For example, after performing all layers of depthwise convolution on the (i-1)-th sub-map, the chip performs all layers of depthwise convolution on the i-th sub-map, then on the (i+1)-th sub-map, and so on.

In the embodiments of the present application, except for the first layer of depthwise convolution, the main input data and output data (input feature maps/input sub-maps and output feature maps/output sub-maps) always remain in the data buffer, and what moves in and out of the data buffer is mainly the small amount of padding data; this approach greatly relieves the memory bandwidth requirement.

In the embodiments of the present application, the chip can perform depthwise convolution on the input feature map/input sub-map in a pipelined manner. When performing depthwise convolution in a pipelined manner, the chip processes data in units of data blocks (a data block may be denoted tile). Each input feature map/input sub-map includes multiple data blocks. When performing convolution operations, the matrix computing unit processes data in units of sub-data blocks; each data block includes multiple sub-data blocks.
FIG. 7 is a schematic diagram of the relationship among feature maps, sub-maps, data blocks, and sub-data blocks in an embodiment of the present application.

A feature map is a single layer of data to be processed in a convolutional neural network, with three dimensions: width (W), height (H), and number of channels (C). A feature map may be too large to fit entirely into the chip's memory (for example, the data buffer in the embodiments of the present application) for processing, and needs to be split; the pieces obtained by splitting, which can be stored in the chip's memory, are called sub-maps. A sub-map has the same number of channels as the feature map. The feature map can be split into sub-maps along the height direction (for example, into three sub-maps along the height direction in the figure) or along the width direction.

Although a sub-map can be stored in the chip's memory, it is not processed all at once; it is further split into data blocks, which may be denoted tile. A data block has the same number of channels as the sub-map and the feature map. If the feature map is split into sub-maps along the height direction, the sub-maps can be split into data blocks along the width direction; if the feature map is split into sub-maps along the width direction, the sub-maps can be split into data blocks along the height direction. For each data block to be processed, the controller sends one control instruction to the data buffer read unit, the data buffer write unit, the data supply unit, the data write-back unit, and the matrix computing unit. After these units receive the control instruction, the whole data flow operates in a ping-pong pipelined manner.

When performing convolution operations, the matrix computing unit cannot finish an entire data block at once; the data block is further split into sub-data blocks (a sub-data block may be denoted slice), and the data of one sub-data block is delivered to the matrix computing unit per clock cycle. A sub-data block has the same number of channels as the data block, the sub-map, and the feature map. A data block can be split into sub-data blocks along the height direction. In an embodiment of the present application, the height of a sub-data block is 1. For example, referring to FIG. 7, a sub-data block has width m+k-1, height 1, and n channels, with m<=w.
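The decomposition can be sketched as a generator (our own indexing, shown only for illustration; boundary handling at the right edge of the sub-map is simplified here):

```python
def iter_slices(sub_map, m, k):
    """Tile/slice decomposition after FIG. 7: the sub-map (H, W, C) is
    cut into data blocks (tiles) along the width; each tile is cut
    into slices of height 1 and width m+k-1, consumed one per clock
    cycle. Adjacent tiles overlap by k-1 columns so that every
    convolution window is covered."""
    H, W, C = sub_map.shape
    for w0 in range(0, W - (k - 1), m):   # one tile per outer step
        for h in range(H):                # one slice per clock cycle
            yield sub_map[h, w0:w0 + m + k - 1, :]
```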
FIG. 8 is a schematic diagram of performing depthwise convolution on the data blocks in an input feature map/input sub-map in an embodiment of the present application. As shown in FIG. 8, after finishing one data block of the input feature map/input sub-map, the chip processes the next data block. For each data block, its sub-data blocks are fetched and processed one by one, and only after all sub-data blocks of the data block have been processed does the chip move on to the next data block. For example, the sub-data blocks of a data block can be fetched and processed one by one by moving along the height direction.

FIG. 9 is a timing diagram of depthwise convolution performed in a pipelined manner in an embodiment of the present application. The figure shows the temporal relationship of the data processing performed by the data buffer write unit, the data supply unit, the matrix computing unit, the data write-back unit, and the data buffer read unit. In the figure, Tile0, Tile1, Tile2, ..., Tilez denote data blocks 0 to z.

In the figure, "data buffer write unit" refers to the data buffer write unit reading padding data from the external memory and writing the read padding data into the data buffer; "data supply unit" refers to the data supply unit transferring input data to the matrix computing unit; "matrix computing unit" refers to the matrix computing unit performing convolution operations on the input data; "data write-back unit" refers to the data write-back unit writing output data back to the data buffer; and "data buffer read unit" refers to the data buffer read unit obtaining padding data from the output sub-map cached in the data buffer and storing the obtained padding data in the external memory.
FIG. 10 is a flowchart of a data processing method provided by an embodiment of the present application. The data processing method provided by this embodiment of the present application is applied to a computer system, for example the computer system 10 shown in FIG. 1.

1001. The controller controls the data buffer write unit to read input data from the external memory and write the read input data into the data buffer for caching.

The input data stored in the external memory may be a feature map or sub-maps. If the input data stored in the external memory is sub-maps, the data buffer write unit starts reading from the first sub-map.

The controller may send a first control instruction to the data buffer write unit, and the data buffer write unit reads the input data (for example, the first sub-map) from the external memory according to the first control instruction and writes the read input data into the data buffer.

1002. Write the convolution parameters from the external memory into the parameter buffer for caching.

The computer system may include a DMA controller, through which the convolution parameters can be written into the parameter buffer by DMA.

1003. If the input data is a sub-map, the controller controls the data buffer write unit to read padding data from the external memory and write the read padding data into the data buffer for caching.

Since loading the padding data has no data dependency on the depthwise convolution of the sub-map's current layer (for example, the depthwise convolution of sub-map [0,i]), the data buffer write unit can issue the read request for the padding data right at the start, and the depthwise convolution of the sub-map's next layer (for example, the depthwise convolution of sub-map [1,i]) is started only after that request completes.

If the read input data is not a sub-map, go to 1004.
The controller may send a second control instruction to the data buffer write unit, and the data buffer write unit reads the padding data from the external memory according to the second control instruction, generates a padding data write request, and writes the read padding data into the data buffer.

1004. The controller controls the data supply unit to send an input data read request to the data buffer read unit.

In an embodiment of the present application, the controller may send a third control instruction to the data supply unit, and the data supply unit sends the input data read request to the data buffer read unit according to the third control instruction.

1005. The data buffer read unit reads the input data from the data buffer according to the input data read request and transfers the input data to the matrix computing unit through the data supply unit.

In an embodiment of the present application, the data buffer read unit reads one sub-data block of the input data from the data buffer at a time and transfers the sub-data block to the matrix computing unit.

For example, referring to FIG. 7, a sub-data block may include (m+k-1)*n data points, and the data buffer read unit reads one sub-data block from the data buffer at a time.

1006. The controller controls the matrix computing unit to read the convolution parameters from the parameter buffer, perform a convolution operation on the input data according to the convolution parameters to obtain output data, and cache the output data in the output result buffer.

In an embodiment of the present application, the controller may send a fourth control instruction to the matrix computing unit, and the matrix computing unit performs the convolution operation according to the fourth control instruction to obtain the output data and stores the output data in the output result buffer.
In an embodiment of the present application, the matrix computing unit performs a convolution operation on one sub-data block of the input data per clock cycle, obtaining the operation result corresponding to that sub-data block.

1007. The controller controls the data write-back unit to read the output data cached in the output result buffer, send the read output data to the data buffer write unit, and send an output data write request to the data buffer write unit.

In an embodiment of the present application, the controller may send a fifth control instruction to the data write-back unit, and the data write-back unit reads the output data cached in the output result buffer according to the fifth control instruction, sends the read output data to the data buffer write unit, and sends the output data write request to the data buffer write unit.

In an embodiment of the present application, after the convolution operation on each data block of the input data is completed, the controller may have the operation result of that data block written into the data buffer. Each data block includes multiple sub-data blocks.

1008. The data buffer write unit writes the output data into the data buffer for caching according to the output data write request.

1009. If the input data is a sub-map, the controller controls the data buffer read unit to generate a padding data read request, read new padding data from the output data cached in the data buffer according to the padding data read request, and store the new padding data from the data buffer into the external memory.

In an embodiment of the present application, the controller may send a sixth control instruction to the data buffer read unit, and the data buffer read unit generates the padding data read request according to the sixth control instruction, reads the new padding data from the output data cached in the data buffer according to the padding data read request, and stores the new padding data from the data buffer into the external memory.

The embodiments of the present application use a two-dimensional matrix composed of computing nodes for the convolution operation, mapping the width and the channels of the feature map to the two dimensions of the matrix. The embodiments of the present application solve the storage bandwidth problem of the convolution operation and improve the utilization of the multipliers in the computing nodes; keeping the input sub-maps and output sub-maps in the data buffer greatly relieves the bandwidth pressure and unleashes the hardware computing power.
FIG. 11 is a schematic structural diagram of an electronic device (for example, the computer system 10 in FIG. 1) provided by an embodiment of the present application. As shown in FIG. 11, the electronic device 110 may include components such as a Radio Frequency (RF) circuit 1101, a memory 1102, an input unit 1103, a display unit 1104, a sensor 1105, an audio circuit 1106, a Wi-Fi module 1107, a processor 1108, and a power supply 1109. Those skilled in the art can understand that the structure shown in FIG. 11 does not constitute a limitation on the electronic device, which may include more or fewer components than shown, combine certain components, or use a different arrangement of components.

The RF circuit 1101 may be used to send and receive information, or to receive and send signals during a call; in particular, after receiving downlink information from a base station, it passes the information to the processor 1108 for processing, and it sends uplink data to the base station. Generally, the RF circuit 1101 includes, but is not limited to: an antenna, at least one amplifier, a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, and the like.

The memory 1102 may be used to store software programs and modules, and the processor 1108 executes the various functional applications and data processing of the electronic device by running the software programs and modules stored in the memory 1102. The memory 1102 may mainly include a program storage area and a data storage area, where the program storage area may store the operating system, application programs required by at least one function (such as a sound playing function and an image playing function), and the like, and the data storage area may store data created according to the use of the electronic device (such as audio data and a phone book). In addition, the memory 1102 may include a high-speed random access memory, and may also include a non-volatile memory such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
The input unit 1103 may be used to receive input digital or character information and to generate key signal inputs related to the user settings and function control of the electronic device. Specifically, the input unit 1103 may include a touch panel 11031 and other input devices 11032. The touch panel 11031, also called a touch screen, can collect the user's touch operations on or near it (for example, operations performed by the user on or near the touch panel 11031 with a finger, a stylus, or any other suitable object or accessory) and drive the corresponding connected apparatus according to a preset program. Optionally, the touch panel 11031 may include two parts: a touch detection apparatus and a touch controller. The touch detection apparatus detects the user's touch orientation, detects the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives touch information from the touch detection apparatus, converts it into contact coordinates, sends them to the processor 1108, and receives and executes commands sent by the processor 1108. In addition, the touch panel 11031 may be implemented in various types such as resistive, capacitive, infrared, and surface acoustic wave. Besides the touch panel 11031, the input unit 1103 may further include other input devices 11032. Specifically, the other input devices 11032 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys and a power key), a trackball, a mouse, and a joystick.

The display unit 1104 may be used to display information input by the user or provided to the user, as well as the various menus of the electronic device. The display unit 1104 may include a display panel 11041; optionally, the display panel 11041 may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like. Further, the touch panel 11031 may cover the display panel 11041; when the touch panel 11031 detects a touch operation on or near it, it transmits the operation to the processor 1108 to determine the type of the touch event, and the processor 1108 then provides a corresponding visual output on the display panel 11041 according to the type of the touch event. Although in FIG. 11 the touch panel 11031 and the display panel 11041 are implemented as two independent components for the input and output functions of the electronic device, in some embodiments the touch panel 11031 and the display panel 11041 may be integrated to implement the input and output functions.

The electronic device may further include at least one sensor 1105, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor and a proximity sensor, where the ambient light sensor can adjust the brightness of the display panel 11041 according to the brightness of the ambient light, and the proximity sensor can turn off the display panel 11041 and/or the backlight when the electronic device is moved to the ear. As one kind of motion sensor, an accelerometer sensor can detect the magnitude of acceleration in all directions (generally three axes), and can detect the magnitude and direction of gravity when stationary; it can be used in applications for recognizing the posture of the electronic device (such as switching between landscape and portrait modes, related games, and magnetometer posture calibration) and in vibration-recognition-related functions (such as a pedometer and tapping). The electronic device may also be equipped with other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which are not described in detail here.
The audio circuit 1106, a speaker 11061, and a microphone 11062 can provide an audio interface between the user and the electronic device. The audio circuit 1106 can transmit the electrical signal converted from the received audio data to the speaker 11061, and the speaker 11061 converts it into a sound signal for output; on the other hand, the microphone 11062 converts the collected sound signal into an electrical signal, which is received by the audio circuit 1106 and converted into audio data; the audio data is then output to the processor 1108 for processing and sent through the RF circuit 1101 to another electronic device, or output to the memory 1102 for further processing.

Wi-Fi is a short-distance wireless transmission technology. Through the Wi-Fi module 1107, the electronic device 110 can help users send and receive emails, browse web pages, access streaming media, and so on; it provides users with wireless broadband Internet access. Although FIG. 11 shows the Wi-Fi module 1107, it can be understood that it is not a necessary component of the electronic device and can be omitted as needed without changing the essence of the invention.

The processor 1108 is the control center of the electronic device; it connects all parts of the entire electronic device through various interfaces and lines, and performs the various functions of the electronic device and processes data by running or executing the software programs and/or modules stored in the memory 1102 and invoking the data stored in the memory 1102, thereby monitoring the electronic device as a whole. Optionally, the processor 1108 may include one or more processing units; preferably, the processor 1108 may integrate an application processor and a modem, where the application processor mainly handles the operating system, user interface, application programs, and the like, and the modem mainly handles wireless communication. It can be understood that the modem may also not be integrated into the processor 1108.

The electronic device further includes a power supply 1109 (such as a battery) that supplies power to the various components. Optionally, the power supply may be logically connected to the processor 1108 through a power management system, so that functions such as charging, discharging, and power consumption management are implemented through the power management system.
Although not shown, the electronic device may further include a camera, a Bluetooth module, and the like, which are not described in detail here.

The electronic device described in FIG. 11 may be used to implement the data processing method introduced in the embodiments of the present application; for details, reference may be made to the relevant descriptions in the foregoing embodiments, which are not repeated here.

The above are only specific implementations of the present application, but the protection scope of the present application is not limited thereto. Any variation or replacement within the technical scope disclosed in the present application shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (26)

  1. A computer system, comprising a matrix computing unit composed of two-dimensional computing nodes, wherein the matrix computing unit is configured to:
    perform a convolution operation on input data according to convolution parameters, wherein each computing node completes the computation of a single convolution window.
  2. The computer system according to claim 1, wherein the number of columns of the matrix computing unit is equal to the number of channels of the input data.
  3. The computer system according to claim 1, wherein the computer system further comprises an external memory, and the external memory is configured to store the input data and the convolution parameters.
  4. The computer system according to claim 3, wherein the computer system further comprises a controller, a data buffer write unit, a data buffer read unit, a data supply unit, a data write-back unit, a data buffer, a parameter buffer, and an output result buffer, wherein
    the parameter buffer is connected to the external memory and is configured to cache the convolution parameters;
    the controller is configured to control the data buffer write unit to read the input data from the external memory and write the read input data into the data buffer for caching;
    the controller is further configured to control the data supply unit to send an input data read request to the data buffer read unit;
    the data buffer read unit is configured to read the input data from the data buffer according to the input data read request and transfer the input data to the matrix computing unit through the data supply unit;
    the controller is further configured to control the matrix computing unit to read the convolution parameters from the parameter buffer, perform a convolution operation on the input data according to the convolution parameters to obtain output data, and cache the output data in the output result buffer;
    the controller is further configured to control the data write-back unit to read the output data cached in the output result buffer, send the read output data to the data buffer write unit, and send an output data write request to the data buffer write unit; and
    the data buffer write unit is further configured to write the output data into the data buffer for caching according to the output data write request.
  5. The computer system according to claim 4, wherein the input data comprises sub-maps obtained by splitting a feature map,
    the external memory is further configured to store padding data;
    the controller is further configured to control the data buffer write unit to read the padding data from the external memory, generate a padding data write request, and write the read padding data into the data buffer for caching according to the padding data write request; and
    the controller is further configured to control the data buffer read unit to generate a padding data read request, read new padding data from the output data cached in the data buffer according to the padding data read request, and store the new padding data from the data buffer into the external memory.
  6. The computer system according to claim 5, wherein the data buffer write unit is further configured to perform a write conflict check on the output data write request and the padding data write request.
  7. The computer system according to claim 6, wherein the priority of the output data write request is higher than the priority of the padding data write request.
  8. The computer system according to claim 5, wherein the data buffer read unit is further configured to perform a read conflict check on the input data read request and the padding data read request.
  9. The computer system according to claim 8, wherein the priority of the input data read request is higher than the priority of the padding data read request.
  10. The computer system according to claim 5, wherein the sub-maps are obtained by splitting the feature map along its height direction or width direction.
  11. The computer system according to any one of claims 1 to 10, wherein the computer system performs depthwise convolution on the input data in a pipelined manner.
  12. The computer system according to claim 11, wherein the input data is split into data blocks along the width direction or the height direction, the data blocks are split into sub-data blocks along the height direction or the width direction, and the matrix computing unit performs a convolution operation on one sub-data block in each clock cycle.
  13. The computer system according to claim 12, wherein the height of the sub-data block is 1.
  14. A data processing method, applied to a computer system, the computer system comprising a matrix computing unit composed of two-dimensional computing nodes, wherein the method comprises:
    the matrix computing unit performing a convolution operation on input data according to convolution parameters, wherein each computing node completes the computation of a single convolution window.
  15. The data processing method according to claim 14, wherein the number of columns of the matrix computing unit is equal to the number of channels of the input data.
  16. The data processing method according to claim 14, wherein the computer system further comprises an external memory, and the external memory is configured to store the input data and the convolution parameters.
  17. The data processing method according to claim 16, wherein the computer system further comprises a controller, a data buffer write unit, a data buffer read unit, a data supply unit, a data write-back unit, a data buffer, a parameter buffer, and an output result buffer, the parameter buffer being connected to the external memory and configured to cache the convolution parameters, and the method further comprises:
    the controller controlling the data buffer write unit to read the input data from the external memory and write the read input data into the data buffer for caching;
    the controller controlling the data supply unit to send an input data read request to the data buffer read unit;
    the data buffer read unit reading the input data from the data buffer according to the input data read request and transferring the input data to the matrix computing unit through the data supply unit;
    the controller controlling the matrix computing unit to read the convolution parameters from the parameter buffer, perform a convolution operation on the input data according to the convolution parameters to obtain output data, and cache the output data in the output result buffer;
    the controller controlling the data write-back unit to read the output data cached in the output result buffer, send the read output data to the data buffer write unit, and send an output data write request to the data buffer write unit; and
    the data buffer write unit writing the output data into the data buffer for caching according to the output data write request.
  18. The data processing method according to claim 17, wherein the input data comprises sub-maps obtained by splitting a feature map, and the external memory is further configured to store padding data;
    the method further comprises:
    the controller controlling the data buffer write unit to read the padding data from the external memory, generate a padding data write request, and write the read padding data into the data buffer according to the padding data write request; and
    the controller controlling the data buffer read unit to generate a padding data read request, read new padding data from the output data cached in the data buffer according to the padding data read request, and store the new padding data from the data buffer into the external memory.
  19. The data processing method according to claim 18, wherein the method further comprises:
    the data buffer write unit performing a write conflict check on the output data write request and the padding data write request.
  20. The data processing method according to claim 19, wherein the priority of the output data write request is higher than the priority of the padding data write request.
  21. The data processing method according to claim 18, wherein the method further comprises:
    the data buffer read unit performing a read conflict check on the input data read request and the padding data read request.
  22. The data processing method according to claim 21, wherein the priority of the input data read request is higher than the priority of the padding data read request.
  23. The data processing method according to claim 18, wherein the sub-maps are obtained by splitting the feature map along its height direction or width direction.
  24. The data processing method according to any one of claims 14 to 23, wherein the data processing method performs depthwise convolution on the input data in a pipelined manner.
  25. The data processing method according to claim 24, wherein the input data is split into data blocks along the width direction or the height direction, the data blocks are split into sub-data blocks along the height direction or the width direction, and the matrix computing unit performs a convolution operation on one sub-data block in each clock cycle.
  26. The data processing method according to claim 25, wherein the height of the sub-data block is 1.
PCT/CN2021/109650 2021-07-30 2021-07-30 Computer system and data processing method WO2023004762A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2021/109650 WO2023004762A1 (zh) 2021-07-30 2021-07-30 Computer system and data processing method
CN202180096718.3A CN117223008A (zh) 2021-07-30 2021-07-30 Computer system and data processing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/109650 WO2023004762A1 (zh) 2021-07-30 2021-07-30 Computer system and data processing method

Publications (1)

Publication Number Publication Date
WO2023004762A1 true WO2023004762A1 (zh) 2023-02-02

Family

ID=85086125

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/109650 WO2023004762A1 (zh) 2021-07-30 2021-07-30 计算机系统和数据处理方法

Country Status (2)

Country Link
CN (1) CN117223008A (zh)
WO (1) WO2023004762A1 (zh)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190294413A1 (en) * 2018-03-23 2019-09-26 Amazon Technologies, Inc. Accelerated quantized multiply-and-add operations
CN110210610A (zh) * 2018-03-27 2019-09-06 腾讯科技(深圳)有限公司 Convolution computation accelerator, convolution computation method, and convolution computation device
WO2020196407A1 (ja) * 2019-03-28 2020-10-01 株式会社エヌエスアイテクス Convolution operation device
CN111797882A (zh) * 2019-07-30 2020-10-20 华为技术有限公司 Image classification method and apparatus
CN111178519A (zh) * 2019-12-27 2020-05-19 华中科技大学 Convolutional neural network acceleration engine, acceleration system, and method

Also Published As

Publication number Publication date
CN117223008A (zh) 2023-12-12

Similar Documents

Publication Publication Date Title
US10877924B2 (en) Instruction set processing method based on a chip architecture and apparatus, and storage medium
CN109768926B (zh) 一种数据处理方法、终端设备及计算机可读存储介质
EP3333733B1 (en) Method and device for use in parallel execution of terminal database
WO2021109703A1 (zh) 数据处理方法、芯片、设备及存储介质
CN110018970B (zh) 缓存预取方法、装置、设备及计算机可读存储介质
CN110147347B (zh) 用于矩阵处理的芯片、矩阵处理方法、装置及存储介质
WO2020108457A1 (zh) 目标对象的控制方法、装置、设备及存储介质
CN111078172B (zh) 一种显示流畅度的调整方法、装置、电子设备及存储介质
CN107533450A (zh) 一种显示方法及终端设备
CN110110045B (zh) 一种检索相似文本的方法、装置以及存储介质
WO2021109709A1 (zh) 场景更新控制方法、装置、电子设备及存储介质
CN107783709A (zh) 一种图像的查看方法及移动终端
CN111209423A (zh) 一种基于电子相册的图像管理方法、装置以及存储介质
CN107329838A (zh) 一种业务交互方法、终端和计算机可读存储介质
CN104090849A (zh) 用于图形处理单元的存储器映射
CN112381020A (zh) 一种视频场景识别方法、系统及电子设备
CN115237618A (zh) 请求处理方法、装置、计算机设备及可读存储介质
CN109361864B (zh) 一种拍摄参数设置方法及终端设备
CN109189576B (zh) 基于Redis的请求处理方法、服务器及计算机可读存储介质
WO2023004762A1 (zh) 计算机系统和数据处理方法
WO2021016932A1 (zh) 数据处理方法、装置及计算机可读存储介质
CN107766464A (zh) 一种文件存储方法、终端及计算机可读存储介质
CN113469322B (zh) 确定模型的可执行程序的方法、装置、设备及存储介质
CN115437776A (zh) 绘图线程排程方法、装置及计算机设备
CN109857821B (zh) 一种运动轨迹的记录方法、终端及计算机可读存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21951356

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 202180096718.3

Country of ref document: CN

NENP Non-entry into the national phase

Ref country code: DE