WO2022021073A1 - Multi-operator computing method and apparatus for a neural network model

Multi-operator computing method and apparatus for a neural network model

Info

Publication number
WO2022021073A1
WO2022021073A1 (PCT/CN2020/105217)
Authority
WO
WIPO (PCT)
Prior art keywords
image data
original image
data
read
operator
Prior art date
Application number
PCT/CN2020/105217
Other languages
English (en)
Chinese (zh)
Inventor
刘敏丽
张楠赓
Original Assignee
嘉楠明芯(北京)科技有限公司
Application filed by 嘉楠明芯(北京)科技有限公司
Priority to CN202080102306.1A
Priority to PCT/CN2020/105217
Publication of WO2022021073A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models

Definitions

  • the present application relates to the field of artificial intelligence, in particular to the field of multi-operator operations.
  • In addition to convolution, a convolutional neural network also includes operations such as activation, pooling, and batch normalization. Although these operations account for a small proportion of the entire network, they are very important. At present, there are two ways to implement them. The first way is to combine the hardware of multiple individual operation modules to complete the computing task. However, because each operation corresponds to the hardware of a single operation module, this not only increases the chip area but also increases the production cost. Moreover, hardware that implements only one type of operation module can realize only conventional simple calculations; complex operations cannot be realized.
  • The second way is to use general-purpose hardware such as a CPU (central processing unit), DSP (digital signal processor), or GPU (graphics processing unit) to implement operations such as activation, pooling, and batch normalization.
  • However, CPUs, DSPs, and GPUs are not specially designed for operations such as activation, pooling, and batch normalization in neural networks, resulting in a low operation rate.
  • the embodiments of the present application provide a multi-operator computing method and device for a neural network model to solve the problems existing in the related art, and the technical solutions are as follows:
  • an embodiment of the present application provides a multi-operator computing method for a neural network model, including:
  • obtaining a configuration instruction and determining, according to the configuration instruction, multiple operation devices corresponding to multiple operators and the execution order of the multiple operation devices, where the multiple operators are obtained by decomposing an operation formula and the multiple operation devices are selected from a set of operation devices;
  • reading the pixels contained in the tensor corresponding to the original image to obtain original image data;
  • according to the execution order, controlling the plurality of operation devices to process the original image data in a serial execution manner to output final image data.
  • In one embodiment, the configuration instruction includes a preset data length, and reading the pixels contained in the tensor corresponding to the original image to obtain the original image data includes:
  • sending a read request to the external memory and/or the internal local buffer;
  • in the case that the read request is passed, reading the pixels contained in the tensor corresponding to the original image to obtain the original image data;
  • in the case that the length of the original image data is equal to the preset data length, stopping the reading of pixels.
  • the configuration instruction includes a preset vector length
  • reading the pixels contained in the tensor corresponding to the original image to obtain the original image data includes:
  • dividing the tensor into a plurality of vectors according to the preset vector length, the vectors including a plurality of pixels;
  • within each vector, reading the pixels in their arrangement order and, when each pixel is read, repeatedly reading it M1 times;
  • Each of the vectors is repeatedly read M2 times to obtain the original image data, wherein both M1 and M2 are greater than or equal to 1.
  • In one embodiment, controlling the plurality of operation devices to process the original image data in a serial execution manner and outputting the final image data includes:
  • within one clock cycle, controlling the plurality of operation devices to process in parallel, in a serial execution manner, the original image data corresponding to a plurality of pixels, and outputting the final image data.
  • In one embodiment, the configuration instruction includes a mapping relationship table between the output terminal of each operation device and the input terminals of the remaining operation devices, and controlling the plurality of operation devices to process the original image data in a serial execution manner according to the execution order and outputting the final image data includes:
  • controlling the original image data to be input to the first operation device for operation to obtain first intermediate data;
  • inputting the first intermediate data to the second operation device for operation to obtain second intermediate data;
  • and so on, until the (N-1)-th intermediate data is input to the N-th operation device for operation;
  • outputting the final image data, where N is a positive integer greater than or equal to 1.
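The serial chaining described above can be sketched in software (a hypothetical Python model; plain functions stand in for the hardware operation devices):

```python
from typing import Callable, List

def run_serial_chain(devices: List[Callable[[float], float]], pixel: float) -> float:
    """Feed one original datum through N operation devices in execution order:
    each device's output (the intermediate data) becomes the next device's input."""
    data = pixel
    for device in devices:      # device 1 .. device N, in execution order
        data = device(data)     # i-th intermediate data
    return data                 # final image data

# Hypothetical two-device chain: add a constant, then square.
chain = [lambda x: x + 1.0, lambda x: x * x]
print(run_serial_chain(chain, 2.0))  # (2 + 1)^2 = 9.0
```

The chain itself is what the configuration instruction selects and orders; swapping or reordering the device list changes the realized formula without changing the hardware.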
  • the configuration instruction includes a constant, and the constant is obtained by decomposing the operation formula.
  • it also includes:
  • the final image data is downsampled.
  • the original image data, the first intermediate data to the N-1th intermediate data, and the final image data are all in a data format of 16-bit floating point numbers.
  • an embodiment of the present application provides a multi-operator computing device for a neural network model, including:
  • a configuration instruction acquisition module configured to acquire a configuration instruction and determine, according to the configuration instruction, a plurality of operation devices corresponding to a plurality of operators and the execution order of the plurality of operation devices, where the multiple operators are obtained by decomposing the operation formula and the plurality of operation devices are selected from the set of operation devices;
  • the data reading module is used to read the pixels contained in the tensor corresponding to the original image to obtain the original image data;
  • the multi-operator operation module is configured to control the plurality of operation devices to process the original image data in a serial execution manner according to the execution sequence, and output final image data.
  • the configuration instruction includes a preset data length
  • the data reading module includes:
  • a read request sending submodule used to send a read request to the external memory and/or the internal local buffer
  • a data reading submodule configured to read the pixels contained in the tensor corresponding to the original image to obtain the original image data when the read request is passed;
  • a data reading stop sub-module is configured to stop reading the pixel point when the length of the original image data is equal to the preset data length.
  • the configuration instruction includes a preset vector length
  • the data reading submodule includes:
  • a vector dividing unit configured to divide the tensor into a plurality of vectors according to the preset vector length, and the vectors include a plurality of pixels;
  • a first reading unit configured to read, within the vector, the pixels in their arrangement order and, when each pixel is read, to repeatedly read it M1 times;
  • the second reading unit is used to repeatedly read each of the vectors M2 times to obtain the original image data, wherein both M1 and M2 are greater than or equal to 1.
  • the multi-operator operation module is configured to control the plurality of operation devices, within one clock cycle, to process in parallel, in a serial execution manner, the original image data corresponding to the plurality of pixels, and to output the final image data.
  • the configuration instruction includes a mapping relationship table between the output terminal of each operation device and the input terminals of the remaining operation devices
  • the multi-operator operation module includes:
  • an execution order determination submodule configured to determine the execution order according to the mapping relationship table
  • a multi-operator operation sub-module configured to control, according to the execution order, the input of the original image data to the first operation device for operation to obtain first intermediate data, and to input the first intermediate data to the second operation device for operation to obtain second intermediate data, until the (N-1)-th intermediate data is input to the N-th operation device for operation and the final image data is output, where N is a positive integer greater than or equal to 1.
  • the configuration instruction includes a constant, and the constant is obtained by decomposing the operation formula.
  • it also includes:
  • a down-sampling module for down-sampling the final image data.
  • the original image data, the first intermediate data to the N-1th intermediate data, and the final image data are all in a data format of 16-bit floating point numbers.
  • an electronic device comprising:
  • At least one processor and a memory communicatively coupled to the at least one processor;
  • the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform any one of the above methods.
  • a non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to execute any of the above methods.
  • An embodiment in the above application has the following advantages or beneficial effects: since any complex operation formula can be decomposed into multiple operators, the multiple operators are configured with corresponding multiple operation devices, and the multiple operation devices process the original image data serially and output the final image data, various types of complex operations in various neural networks can be supported, and the operations are programmable, which improves the operation efficiency.
  • Since the operation devices corresponding to the multiple operators are selected from the operation device set, the multiple operation devices are configurable and reusable when various complex operations are performed, and there is no need to design dedicated hardware for each complex operation, which effectively saves chip area and reduces chip cost.
  • General-purpose hardware accelerators such as a CPU, DSP, or GPU are not used directly to perform the various operations of the neural network model; instead, the multi-operator computing device provided in the present application is used, which avoids communication with such general-purpose hardware accelerators and improves computational timeliness.
  • FIG. 1 is a schematic diagram of a multi-operator computing method of a neural network model according to an embodiment of the present application
  • FIG. 2 is a scene diagram of a multi-operator computing device of a neural network model according to an embodiment of the present application
  • FIG. 3 is a schematic diagram of a multi-operator operation method of a neural network model according to another embodiment of the present application.
  • FIG. 4 is a structural diagram of an internal local buffer according to an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a method for reading pixels included in a tensor corresponding to an original image according to another embodiment of the present application;
  • FIG. 6 is a scene diagram of a multi-operator computing method according to an embodiment of the present application.
  • FIG. 7 is a scene diagram of a multi-operator computing device of a neural network model according to another embodiment of the present application.
  • FIG. 8 is a schematic diagram of a multi-operator computing device of a neural network model according to an embodiment of the present application.
  • FIG. 9 is a schematic diagram of a multi-operator computing device of a neural network model according to another embodiment of the present application.
  • FIG. 10 is a block diagram of an electronic device used to implement a multi-operator operation method of a neural network model according to an embodiment of the present application.
  • a multi-operator operation method of a neural network model including the following steps:
  • Step S110: acquire a configuration instruction, and determine, according to the configuration instruction, multiple operation devices corresponding to the multiple operators and the execution order of the multiple operation devices, where the multiple operators are obtained by decomposing the operation formula and the multiple operation devices are selected from the set of operation devices;
  • Step S120: read the pixels contained in the tensor corresponding to the original image to obtain original image data;
  • Step S130: according to the execution order, control a plurality of operation devices to process the original image data in a serial execution manner, and output the final image data.
  • the multi-operator computing device of the neural network model may include a data reading module, a multi-operator computing module and a data writing module which are connected in sequence.
  • the multi-operator operation module can be set based on a mesh network (Meshnet).
  • the convolution accelerator includes a multi-operator computing device and a GLB (global local buffer, an internal local buffer), and the multi-operator computing device is connected to the GLB.
  • DDR: double data rate synchronous dynamic random-access memory.
  • the data reading module in the multi-operator computing device can read data from GLB and/or DDR.
  • GLB and/or DDR are provided with multiple storage areas, and each storage area can store tensors (Tensors) corresponding to different original images.
  • the tensor includes four dimensions of the original image: N (batch, the number of frames), C (the number of channels), H (height), and W (width).
  • NCHW is used to represent a four-dimensional image.
  • N represents the number of frames of this batch of images
  • C represents the number of channels of the image
  • H represents the number of pixels in the vertical direction of the image
  • W represents the number of pixels in the horizontal direction
  • the data reading module may read each pixel in the tensor corresponding to the original image from the GLB and/or DDR, and the original image data may include the value of one pixel or the values of multiple pixels.
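To make the layout concrete, the following sketch flattens an NCHW index to the one-dimensional offset used when pixels are read in arrangement order (the 1*2*30*40 dimensions match the example used later in the description; the helper name is hypothetical):

```python
# Hypothetical NCHW dimensions: 1 frame, 2 channels, 30 rows, 40 columns.
N, C, H, W = 1, 2, 30, 40

def flat_index(n: int, c: int, h: int, w: int) -> int:
    """Offset of pixel (n, c, h, w) when the tensor is viewed as a one-dimensional
    vector in NCHW order: W varies fastest, then H, then C, then N."""
    return ((n * C + c) * H + h) * W + w

total_pixels = N * C * H * W          # 2400 values in the flattened vector
print(flat_index(0, 0, 0, 39))        # last pixel of the first row
print(flat_index(0, 1, 0, 0))         # first pixel of the second channel
```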
  • the configuration module provided by the upper-layer software is used to split any complex operation formula into multiple basic operators that the Meshnet network can support.
  • Basic operators may include addition operators, multiplication operators, square root operators, square operators, sine and cosine operators, and base logarithm operators.
  • a corresponding computing device is required to perform the operation of each operator. Therefore, in this embodiment, a set of computing devices is set, and the set of computing devices is used to implement the operations of commonly used operators in operations such as activation, pooling, and batch normalization in the neural network.
  • the set of operation devices can include adders, multipliers, one-to-two copy operators, sixteen-segment piece-wise linear fitting operators, one-of-two selectors, comparators, dividers, binary logic operators, unary logic operators, rounding operators, square root operators, square operators, sine and cosine operators, base-e exponentiation operators, base-e logarithm operators, etc.
  • the set of computing devices can be adaptively adjusted according to actual needs, which are all within the protection scope of this embodiment.
  • the input end of each operation device can be used as the input end of the multi-operator operation module for receiving original image data.
  • the output terminal of each operation device in the operation device set can be connected to the input terminals of the remaining operation devices to ensure that the intermediate data output by the previous operation device serves as the input data of the next operation device and is input to the next operation device to continue the operation.
  • the output end of each operation device can also be used as the output end of the multi-operator operation module for outputting the final image data.
  • the configuration module queries the operation device set for the multiple operation devices corresponding to the multiple operators (a given operation does not necessarily use all the operation devices in the set), and determines the execution order of the multiple operation devices according to the mathematical operation order of the multiple operators. Then, the configuration module sends a configuration instruction to the multi-operator operation module, where the configuration instruction includes the multiple operation devices corresponding to the multiple operators and the execution order of the multiple operation devices.
  • The multi-operator operation module obtains the configuration instruction from the configuration module and reads the pixels contained in the tensor corresponding to the original image from the GLB and/or DDR to obtain the original image data. According to the execution order, the original image data is input to the first operation device for operation to obtain first intermediate data, the first intermediate data is input to the second operation device for operation to obtain second intermediate data, and so on, until the (N-1)-th intermediate data is input to the N-th operation device for operation and the final image data is output.
  • the multi-operator operation module is provided with a logic control unit, which controls the entire process: reading the original image data, controlling the multiple operation devices to process the original image data in a serial execution manner according to the execution order, and outputting the final image data. Finally, the final image data is written into the GLB and/or DDR by the data write-out module.
  • the tanh_shrink activation function (a type of activation function in neural network structures) is: tanh_shrink(x) = x - tanh(x) = x - (e^(2x) - 1) / (e^(2x) + 1).
  • the corresponding operation devices are as follows: the first operation device is a base-e exponentiation operator, the second operation device is a one-to-two copy operator, the third operation device is another one-to-two copy operator, the fourth operation device is an adder, the fifth operation device is a subtractor, the sixth operation device is a divider, and the seventh operation device is another subtractor.
  • the original image data is controlled to be input to the e-base exponentiation operator to obtain the first intermediate data, and the first intermediate data is input to the one-to-two copy operator, The second intermediate data is obtained, and so on, and the final image data is output from the subtractor (the seventh operation device).
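A software sketch of this seven-device pipeline (assuming the exponentiation operator produces e^(2x), under which the chain reproduces x - tanh(x); the patent text does not spell the formula out, so that choice is an assumption):

```python
import math

def tanh_shrink_pipeline(x: float) -> float:
    """Hypothetical model of the seven-device tanh_shrink chain described above."""
    t = math.exp(2.0 * x)    # device 1: base-e exponentiation operator (assumed e^(2x))
    a, b = t, t              # devices 2-3: one-to-two copy operators
    den = a + 1.0            # device 4: adder      -> e^(2x) + 1
    num = b - 1.0            # device 5: subtractor -> e^(2x) - 1
    tanh_x = num / den       # device 6: divider    -> tanh(x)
    return x - tanh_x        # device 7: subtractor -> tanh_shrink(x)

print(tanh_shrink_pipeline(0.5))  # matches 0.5 - math.tanh(0.5)
```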
  • Since any complex operation formula can be decomposed into multiple operators, multiple operation devices corresponding to the multiple operators are configured, and the multiple operation devices process the original image data in a serial execution manner and output the final image data, various types of complex operations in various neural networks can be supported, and the operation is programmable, which improves the operation efficiency.
  • Since the operation devices corresponding to the multiple operators are selected from the operation device set, the multiple operation devices are configurable when various complex operations are performed. For different complex mathematical operations, each operation device in the set may be reused, and there is no need to design dedicated hardware for each complex operation, which effectively saves chip area and reduces chip cost.
  • the configuration instruction includes a preset data length
  • step S120 includes:
  • Step S121 send a read request to the external memory and/or the internal local buffer
  • Step S122 in the case that the read request is passed, read the pixel points contained in the tensor corresponding to the original image to obtain the original image data;
  • Step S123 In the case that the length of the original image data is equal to the preset data length, stop reading pixels.
  • the data reading module may send one or more read requests to the DDR (external memory) and/or GLB (internal local buffer). For example, if the tensors of two original images are to be read, one read request can be sent to the DDR and another to the GLB; alternatively, two read requests can be sent to the DDR; alternatively, two read requests can be sent to the GLB.
  • the data reading module can read tensors corresponding to multiple frames of original images, and the reading method and the number of read tensors can be adaptively adjusted according to actual needs, which are all within the protection scope of this embodiment.
  • After the DDR and/or GLB receives the read request, it feeds back a result allowing the read to the data reading module. Then, the data reading module reads the pixels contained in the tensor corresponding to the original image, obtains the original image data, and sends the original image data to the multi-operator operation module.
  • the data reading module includes a mapping function (Map) unit and/or a broadcasting function (Broadcast) unit, which can implement the reading method of the mapping function and the reading method of the broadcast function.
  • GLB is the data cache SRAM (Static Random-Access Memory) of the convolution accelerator. It has a large storage space, and the data reading module can directly obtain data from GLB.
  • the GLB can include eight independent RAMs (random access memories), each with a depth of 512 and a width of 128 bits; the eight independent RAMs are numbered bank0 to bank7.
  • the mapping functional unit needs one input in a single clock cycle, mapping one input to the eight independent RAMs of the GLB.
  • the GLB responds to a read request in a single clock cycle, and the mapping functional unit selects a bankA (A is any integer from 0 to 7) from bank0 to bank7 to read the tensors stored in bankA.
  • the broadcast function unit requires two inputs in a single clock cycle, mapping the two inputs to the eight independent RAMs.
  • the GLB responds to two read requests in a single clock cycle, and the broadcast function unit selects two banks, bankB and bankC, from bank0 to bank7 (B and C are any integers from 0 to 7, and B is not equal to C) to read the data stored in bankB and bankC.
  • the data write-out module will send a write request to an independent RAM of the DDR and/or GLB in a single clock cycle. After the write request is passed, the data-write-out module will write the final image data into the DDR and/or GLB.
  • the data reading module can read the pixel points included in the tensor of the original image according to the configuration instruction to obtain the original image data. Since the configuration instruction sent by the configuration module to the data reading module includes the preset data length, the data reading module stops reading pixels when the length of the read original image data is equal to the preset data length.
  • the mapping function unit supports reading the tensor corresponding to one original image from the GLB or DDR, treating the tensor corresponding to the single original image as a one-dimensional vector with the pixels arranged in NCHW order.
  • the mapping function unit reads pixels from the GLB or DDR in row-first order (along each row, then down the columns) until the entire four-dimensional image is read, and sends the read original image data to the multi-operator operation module in turn.
  • the NCHW in the tensor corresponding to the original image is 1*2*30*40, and the tensor is regarded as a one-dimensional vector.
  • the mapping functional unit does not necessarily have to read all the pixels of the NCHW in the tensor, but reads according to the preset data length.
  • the NCHW in the tensor is 1*2*30*40, which means that one line contains 40 pixels. If the preset data length is 120, then only three lines of pixels need to be read.
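A minimal sketch of this length-limited read (the helper is hypothetical; with 40 pixels per line and a preset data length of 120, exactly three lines are read before the reader stops):

```python
def read_with_limit(tensor_1d, preset_len):
    """Read pixels in NCHW order and stop once the preset data length is reached."""
    out = []
    for px in tensor_1d:            # flattened tensor, row by row
        out.append(px)
        if len(out) == preset_len:  # stop condition from the configuration instruction
            break
    return out

W = 40
tensor = list(range(2 * 30 * W))    # flattened 1*2*30*40 tensor
data = read_with_limit(tensor, 120)
print(len(data), len(data) // W)    # 120 pixels, i.e. exactly three lines
```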
  • the configuration instruction includes a preset vector length
  • step S122 includes:
  • S1221 Divide the tensor into multiple vectors according to the preset vector length, and the vectors include multiple pixels;
  • the broadcast functional unit supports reading tensors corresponding to multiple frames of raw images in GLB and/or DDR.
  • the configuration module sends a configuration instruction to the data reading module, and the configuration instruction includes a preset vector length.
  • the configuration instruction may also include information such as the number of times M1 of repeated reading of pixel points, the number of times of repeated reading of vectors M2, and the like. Different tensors, vector lengths, pixel point repeated reading times M1, and vector repeated reading times M2 can be configured according to actual conditions, which are all within the protection scope of this embodiment.
  • the NCHW in the tensor corresponding to the first original image is 1*2*30*40
  • the NCHW in the tensor corresponding to the second original image is 1*3*20*40.
  • the first pixel point X0 is repeatedly read three times to obtain (X0, X0, X0)
  • the second pixel X1 is repeatedly read three times to obtain (X1, X1, X1), and so on, until every pixel in the vector has been read.
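The repeated-read pattern can be sketched as follows (hypothetical helper name; M1 repeats each pixel inside the vector, then M2 repeats the expanded vector, with both at least 1):

```python
def broadcast_read(vector, m1, m2):
    """Expand a vector the way the broadcast read does: each pixel repeated M1
    times in arrangement order, then the whole expanded vector repeated M2 times."""
    expanded = [px for px in vector for _ in range(m1)]  # (X0,X0,X0, X1,X1,X1, ...)
    return expanded * m2

print(broadcast_read(["X0", "X1"], 3, 1))  # pixels tripled, vector read once
print(broadcast_read(["X0"], 1, 2))        # vector read twice
```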
  • In one embodiment, step S130 includes: within one clock cycle, controlling the multiple operation devices to process in parallel, in a serial execution manner, the original image data corresponding to the multiple pixels, and outputting the final image data.
  • the data reading module reads multiple pixels from the DDR and/or GLB at a time. In one clock cycle, the data reading module sends the read original image data corresponding to multiple pixels to the multi-operator operation module, so that the module can operate on the original image data corresponding to the multiple pixels in parallel. For example, in one clock cycle, the data reading module can send the original image data corresponding to four pixels to the multi-operator operation module.
  • In the 0th clock cycle, the data reading module can send the four points X00 to X03 in the first row to the multi-operator operation module; in the first clock cycle, it can send the four points X04 to X07, and so on.
  • It is assumed that the number of pixel repetitions M1 is 3.
  • In the 0th clock cycle, the data reading module can send (X00, X00, X00, X01) to the multi-operator operation module; in the first clock cycle, it can send (X01, X01, X02, X02), and so on.
  • Four instances of each operation device can be provided, and the four identical operation devices work at the same time to process the original image data corresponding to the four pixels in parallel.
  • Each operation device can also be provided in a larger number, for example eight adders and eight subtractors, or in a smaller number, for example two or three adders and two or three subtractors. Adaptive adjustment according to actual needs is within the protection scope of this embodiment.
  • the data reading module sends the original image data corresponding to multiple pixels to the multi-operator operation module, so as to realize parallel operation and effectively improve the operation efficiency.
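This per-clock-cycle parallelism can be sketched as a four-lane model (hypothetical; in hardware the four device instances run simultaneously rather than in a loop):

```python
def process_cycle(lane_pixels, chain):
    """Apply the same serial operator chain to every pixel delivered this cycle.
    Each lane models one of the identical operation-device instances."""
    results = []
    for px in lane_pixels:      # the lanes work simultaneously in hardware
        for device in chain:    # same serial execution order in every lane
            px = device(px)
        results.append(px)
    return results

row = [0.0, 1.0, 2.0, 3.0]      # X00..X03 delivered in clock cycle 0
chain = [lambda x: x + 1.0]     # hypothetical single-adder chain
print(process_cycle(row, chain))
```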
  • the configuration instruction includes a mapping relationship table between the output terminal of each computing device and the input terminals of the remaining computing devices, and step 130 includes:
  • Step 131: determine the execution order according to the mapping relationship table;
  • Step 132: according to the execution order, control the input of the original image data to the first operation device for operation to obtain first intermediate data, input the first intermediate data to the second operation device for operation to obtain second intermediate data, until the (N-1)-th intermediate data is input to the N-th operation device for operation, and output the final image data, where N is a positive integer greater than or equal to 1.
  • the configuration instruction includes a constant, and the constant is obtained by decomposing the operation formula.
  • the computing device set includes 27 computing devices, from computing device 0 to computing device 26 .
  • Each computing device may include two or three input terminals, one output terminal or two output terminals.
  • the operation device may include two tensor input terminals, or two constant input terminals, or two tensor input terminals and one constant input terminal, etc., which can be set according to requirements.
  • the tensor input is used to input raw image data or intermediate image data.
  • Constant inputs are used to enter constants.
  • the operation formula may contain constants. For example, if an operation formula contains the decimal values 3.73 and 5.89, the first constant is configured as 3.73 and the second constant as 5.89. The number of constants is therefore related to the operation formula; if the operation formula contains no constants, no constants need to be configured.
  • a multi-operator operation method of a neural network model is provided.
  • the configuration module is used to decompose the operation formula to obtain multiple operators, and determine multiple operation devices corresponding to the multiple operators.
  • a mapping relationship table is generated between the output terminal of the previous operation device and an input terminal of the adjacent next operation device.
  • the input end of each computing device will have a corresponding output end number.
  • the preferred numbering method is to number the multiple tensor input terminals (e.g., two), the multiple constant input terminals (e.g., four), and the output terminals of all operation devices uniformly, so as to facilitate the mapping between the input terminals of each operation device and the output terminals of all devices.
  • the multi-operator operation module receives the configuration instruction sent by the configuration module, and determines the execution sequence according to the mapping relationship between the output end of each operation device and the input end of the remaining operation devices in the mapping relationship table.
  • a first-level register (reg) is inserted after each operation is performed; to ensure that the timing of the first tensor and second tensor inputs remains consistent, additional computing devices can be called to perform calculations on the second tensor.
  • different storage areas in the GLB store tensors (first tensor and second tensor) corresponding to different original images
  • the data reading module reads, from the GLB, the pixel points contained in the tensors corresponding to the two original images to obtain two channels of original image data, and the original image data corresponding to the two tensors is sent to the multi-operator operation module.
  • Four computing devices are determined according to the configuration instruction, including the first computing device to the fourth computing device.
  • the first operation device is an adder 0, the second operation device is an adder 1, the third operation device is a square operator, and the fourth operation device is a comparator.
  • in the mapping relationship table of Table 1, the two tensors and four constants are numbered uniformly together with the output terminals of each operation device.
  • the two tensors and four constants are uniformly numbered 1 to 6: the first tensor is numbered 1, the second tensor 2, the first constant 3, the second constant 4, the third constant 5, and the fourth constant 6. See the mapping relationship table for the output terminal numbers of each computing device.
  • the execution order is obtained according to the mapping relationship table: the tensor input terminal of adder 0 inputs the original image data corresponding to the first tensor, its constant input terminal inputs the second constant, and the output terminal of adder 0 is connected to the tensor input terminal of adder 1; the tensor input terminal of adder 1 inputs the first intermediate data corresponding to the first tensor, its constant input terminal inputs the first constant, and the output terminal of adder 1 is connected to the tensor input terminal of the square operator; the tensor input terminal of the square operator inputs the second intermediate data corresponding to the first tensor, and the output terminal of the square operator is connected to the tensor input terminal of the comparator; the tensor input terminal of the comparator inputs the third intermediate data corresponding to the first tensor, its constant input terminal inputs the third constant, and the output terminal of the comparator outputs the final image data.
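The four-device execution order can be modeled as a small software pipeline. This is a minimal sketch assuming the comparator outputs the larger of its two inputs (the comparison semantics are not fixed by the text), with illustrative constant values:

```python
def run_chain(x, c1, c2, c3):
    """Per-pixel model of the adder0 -> adder1 -> square -> comparator chain."""
    d1 = x + c2         # adder 0: first tensor plus second constant
    d2 = d1 + c1        # adder 1: first intermediate data plus first constant
    d3 = d2 * d2        # square operator: square of second intermediate data
    return max(d3, c3)  # comparator: third intermediate data vs third constant

print(run_chain(2.0, 1.0, 3.0, 10.0))  # ((2+3)+1)^2 = 36 -> max(36, 10) = 36.0
```

Each intermediate value corresponds to the first, second, and third intermediate data in the text, with one register stage between devices in hardware.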
  • any input terminal of each operation device can be a constant input, a tensor input, or a tensor output of other devices.
  • the output terminal and input terminal of each computing device in the computing device set can be controlled by the logic control unit to be turned on or off, indicating whether they can be used to transmit data.
  • the logic control unit controls the working state of each computing device according to the mapping relationship table. If the output terminal number corresponding to the input terminal of a certain computing device is 0, it means that the computing device will not work and can be in a closed state.
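The on/off control just described can be sketched as a table lookup; here an input wired to output number 0 marks the device as switched off. Device names and table contents are illustrative assumptions:

```python
def device_states(mapping_table):
    """Derive each device's working state from the mapping relationship
    table: a device whose input is wired to output number 0 does not work
    and can be placed in the closed (off) state."""
    return {dev: ("off" if 0 in inputs else "on")
            for dev, inputs in mapping_table.items()}

# inputs listed as the output numbers feeding each device (illustrative)
table = {"adder0": [1, 4], "adder1": [7, 3], "comparator": [0, 0]}
print(device_states(table))  # {'adder0': 'on', 'adder1': 'on', 'comparator': 'off'}
```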
  • Step S140: down-sample the final image data.
  • the down-sampling (Reduce) module down-samples the final image data
  • the input data can come from the GLB, the DDR, or the multi-operator operation module, and a down-sampling operation can be performed along any one of the N, C, H, or W dimensions, for example, finding the maximum value, minimum value, sum, difference, or product along any of these dimensions.
  • the input of the down-sampling module can be data read back from the GLB or DDR, or the final image data output by the multi-operator operation module. If the multi-operator operation module does not work, the original image data corresponding to the tensor can be down-sampled directly, that is, the pixel points of any dimension of the tensor are down-sampled.
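The Reduce step can be illustrated with NumPy, applying a reduction along any one of the N, C, H, or W dimensions of an NCHW tensor; the function name and the set of supported operators below are assumptions for illustration:

```python
import numpy as np

def reduce_dim(data, dim, op):
    """Down-sample an NCHW tensor along the named dimension with the named
    reduction (sketch of the Reduce module's behavior)."""
    axis = "NCHW".index(dim)
    ops = {"max": np.max, "min": np.min, "sum": np.sum}
    return ops[op](data, axis=axis)

x = np.arange(2 * 1 * 2 * 2).reshape(2, 1, 2, 2)  # N=2, C=1, H=2, W=2
print(reduce_dim(x, "W", "max").shape)  # (2, 1, 2)
```

Reducing along one dimension removes it from the output shape, which is the behavior a hardware Reduce module would expose regardless of whether its input comes from the GLB, the DDR, or the multi-operator operation module.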
  • the format of the original image data, the format of the first intermediate data to the N-1th intermediate data, and the final image data are all data formats of 16-bit floating point numbers.
  • the operations involved in the multi-operator operation module and the downsampling module are both floating-point operations in BF16 format, which can effectively improve the operation precision.
  • floating-point operations in BF16 (16-bit brain floating point, bfloat16) format can be replaced with fixed-point operations in INT8 format, which can save hardware area in the multi-operator operation module or the down-sampling module.
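The BF16 format referred to above keeps the 8-bit exponent of float32 but only 7 mantissa bits. A minimal sketch of converting a float to bfloat16 by truncating the low 16 bits of its float32 representation (real hardware may round rather than truncate):

```python
import struct

def to_bf16(x):
    """Truncate a float32 to bfloat16 precision by keeping the top 16 bits
    (sign, 8-bit exponent, 7-bit mantissa); rounding mode is simplified."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

print(to_bf16(3.14159))  # 3.140625, the 7-bit mantissa loses precision
```

The same dynamic range as float32 with half the storage is what makes BF16 attractive for neural-network arithmetic, while INT8 fixed point trades further precision for smaller hardware.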
  • a multi-operator computing device of a neural network model including:
  • the configuration instruction acquisition module 110 is configured to acquire a configuration instruction and, according to the configuration instruction, determine multiple computing devices corresponding to multiple operators and the execution order of the multiple computing devices, where the multiple operators are obtained by decomposing the operation formula and the multiple computing devices are selected from the computing device set;
  • the data reading module 120 is used to read the pixel points contained in the tensor corresponding to the original image to obtain the original image data;
  • the multi-operator operation module 130 is configured to control the plurality of operation devices to process the original image data in a serial execution manner according to the execution sequence, and output final image data.
  • in this embodiment, general hardware accelerators such as a CPU, DSP, or GPU are not used directly to perform the various operations of the neural network model; instead, the multi-operator computing device of the neural network model provided by this application is used, which avoids communication with such general hardware accelerators and improves the time efficiency of the operations.
  • the configuration instruction includes a preset data length
  • the data reading module 120 includes:
  • a read request sending submodule 121 is used to send a read request to the external memory and/or the internal local buffer;
  • the data reading submodule 122 is configured to read the pixel points contained in the tensor corresponding to the original image to obtain the original image data when the read request is passed;
  • the data reading stop sub-module 123 is configured to stop reading the pixel point when the length of the original image data is equal to the preset data length.
  • the configuration instruction includes a preset vector length
  • the data reading sub-module 122 includes:
  • a vector dividing unit 1221 configured to divide the tensor into multiple vectors according to the preset vector length, and the vector includes multiple pixels;
  • the first reading unit 1222 is configured to read the pixel points in each vector in their arrangement order, repeatedly reading each pixel point M1 times as it is read;
  • the second reading unit 1223 is configured to repeatedly read each of the vectors M2 times to obtain the original image data, wherein both M1 and M2 are greater than or equal to 1.
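The two reading units can be sketched as follows: a flat pixel sequence is split into vectors of a preset length, each pixel is repeated M1 times in order, and each whole vector is repeated M2 times (all names are illustrative):

```python
def read_with_repeats(tensor, vec_len, m1, m2):
    """Model of the vector dividing unit plus the first and second reading
    units: split into vectors, repeat each pixel m1 times, repeat each
    vector m2 times."""
    out = []
    for start in range(0, len(tensor), vec_len):
        vector = tensor[start:start + vec_len]
        expanded = [p for p in vector for _ in range(m1)]  # pixel-level repeat
        out.extend(expanded * m2)                          # vector-level repeat
    return out

print(read_with_repeats([1, 2, 3, 4], vec_len=2, m1=2, m2=2))
# [1, 1, 2, 2, 1, 1, 2, 2, 3, 3, 4, 4, 3, 3, 4, 4]
```

With M1 = M2 = 1 the routine degenerates to a plain sequential read, matching the constraint that both factors are greater than or equal to 1.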
  • the multi-operator operation module is configured to control the plurality of operation devices to perform parallel processing on the original image data corresponding to the plurality of pixels in a serial execution manner within one clock cycle, and to output the final image data.
  • the configuration instruction includes a mapping relationship table between the output terminal of each operation device and the input terminals of the remaining operation devices
  • the multi-operator operation module 130 includes:
  • an execution order determination submodule 131 configured to determine the execution order according to the mapping relationship table
  • the multi-operator operation sub-module 132 is configured to, according to the execution sequence, input the original image data to the first operation device to obtain first intermediate data, input the first intermediate data to the second operation device to obtain second intermediate data, and so on, until the (N-1)th intermediate data is input to the Nth operation device and the final image data is output, where N is a positive integer greater than or equal to 1.
  • the down-sampling module 140 is configured to down-sample the final image data.
  • the original image data, the first intermediate data to the N-1th intermediate data, and the final image data are all in a data format of 16-bit floating point numbers.
  • the present application further provides an electronic device and a readable storage medium.
  • FIG. 10 it is a block diagram of an electronic device of a multi-operator operation method of a neural network model according to an embodiment of the present application.
  • Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers.
  • Electronic devices may also represent various forms of mobile devices, such as personal digital processors, cellular phones, smart phones, wearable devices, and other similar computing devices.
  • the components shown herein, their connections and relationships, and their functions are by way of example only, and are not intended to limit implementations of the application described and/or claimed herein.
  • the electronic device includes: one or more processors 1001, a memory 1002, and interfaces for connecting various components, including a high-speed interface and a low-speed interface.
  • the various components are interconnected using different buses and may be mounted on a common motherboard or otherwise as desired.
  • the processor may process instructions for execution within the electronic device, including instructions stored in or on the memory, to display graphical information for a Graphical User Interface (GUI) on an external input/output device such as a display device coupled to the interface.
  • multiple processors and/or multiple buses may be used with multiple memories, if desired.
  • multiple electronic devices may be connected, each providing some of the necessary operations (eg, as a server array, a group of blade servers, or a multiprocessor system).
  • a processor 1001 is used as an example.
  • the memory 1002 is the non-transitory computer-readable storage medium provided by the present application.
  • the memory stores instructions executable by at least one processor, so that the at least one processor executes the multi-operator operation method of the neural network model provided by the present application.
  • the non-transitory computer-readable storage medium of the present application stores computer instructions, and the computer instructions are used to make the computer execute the multi-operator operation method of the neural network model provided by the present application.
  • the memory 1002 can be used to store non-transitory software programs, non-transitory computer-executable programs and modules, such as corresponding to the multi-operator operation method of a neural network model in the embodiments of the present application.
  • Program instructions/modules for example, the configuration instruction acquisition module 110, the data reading module 120, and the multi-operator operation module 130 shown in FIG. 8).
  • the processor 1001 executes various functional applications and data processing of the server by running the non-transitory software programs, instructions and modules stored in the memory 1002, that is, to realize the multi-operator operation of a neural network model in the above method embodiments. method.
  • the memory 1002 can include a stored program area and a stored data area, wherein the stored program area can store an operating system and an application program required by at least one function, and the stored data area can store data created according to the use of the electronic device, etc. Additionally, memory 1002 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, memory 1002 may optionally include memory located remotely relative to processor 1001, and these remote memories may be connected to the aforementioned electronic device through a network. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
  • the above electronic device may further include: an input device 1003 and an output device 1004 .
  • the processor 1001 , the memory 1002 , the input device 1003 and the output device 1004 may be connected by a bus or in other ways, and the connection by a bus is taken as an example in FIG. 10 .
  • the input device 1003 can receive input numerical or character information, and generate key signal input related to user settings and function control of the above-mentioned electronic equipment, such as a touch screen, a keypad, a mouse, a trackpad, a touchpad, a pointing stick, one or more Input devices such as mouse buttons, trackballs, joysticks, etc.
  • Output devices 1004 may include display devices, auxiliary lighting devices (eg, LEDs), haptic feedback devices (eg, vibration motors), and the like.
  • the display device may include, but is not limited to, a liquid crystal display (Liquid Crystal Display, LCD), a light-emitting diode (Light Emitting Diode, LED) display, and a plasma display. In some implementations, the display device may be a touch screen.
  • Various implementations of the systems and techniques described herein can be implemented in digital electronic circuitry, integrated circuit systems, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof.
  • These various embodiments may be implemented in one or more computer programs executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor, and which may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
  • The terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (for example, magnetic disks, optical disks, memories, programmable logic devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including machine-readable media that receive machine instructions as machine-readable signals.
  • The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
  • the systems and techniques described herein may be implemented on a computer having: a display device (for example, a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and pointing device (for example, a mouse or trackball) through which the user can provide input to the computer.
  • Other kinds of devices can also be used to provide interaction with the user; for example, the feedback provided to the user can be any form of sensory feedback (eg, visual feedback, auditory feedback, or tactile feedback); and can be in any form (including acoustic input, voice input, or tactile input) to receive input from the user.
  • the systems and techniques described herein may be implemented in a computing system that includes back-end components (for example, as a data server), or a computing system that includes middleware components (for example, an application server), or a computing system that includes front-end components (for example, a user's computer having a graphical user interface or web browser through which a user may interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components.
  • the components of the system may be interconnected by any form or medium of digital data communication (eg, a communication network). Examples of communication networks include: Local Area Network (LAN), Wide Area Network (WAN), and the Internet.
  • a computer system can include clients and servers.
  • Clients and servers are generally remote from each other and usually interact through a communication network.
  • the relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

Abstract

In the present application, a multi-operator operation method and apparatus for a neural network model are disclosed. The specific implementation is: obtaining a configuration instruction, and determining, according to the configuration instruction, a plurality of operation devices corresponding to a plurality of operators and an execution sequence of the plurality of operation devices, the plurality of operators being obtained by decomposing an operation formula, and the plurality of operation devices being selected from an operation device set; reading the pixel points contained in a tensor corresponding to an original image to obtain original image data; and controlling, according to the execution sequence, the plurality of operation devices to process the original image data in a serial execution manner, and outputting final image data. Various types of complex operations in various neural networks can be supported, and the operations are programmable. Moreover, the operation device is configurable and reusable, which effectively reduces chip area and chip cost.
PCT/CN2020/105217 2020-07-28 2020-07-28 Multi-operator operation method and apparatus for neural network model WO2022021073A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202080102306.1A CN116134446A (zh) 2020-07-28 2020-07-28 Multi-operator operation method and apparatus for neural network model
PCT/CN2020/105217 WO2022021073A1 (fr) 2020-07-28 2020-07-28 Multi-operator operation method and apparatus for neural network model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/105217 WO2022021073A1 (fr) 2020-07-28 2020-07-28 Multi-operator operation method and apparatus for neural network model

Publications (1)

Publication Number Publication Date
WO2022021073A1 true WO2022021073A1 (fr) 2022-02-03

Family

ID=80037244

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/105217 WO2022021073A1 (fr) 2020-07-28 2020-07-28 Multi-operator operation method and apparatus for neural network model

Country Status (2)

Country Link
CN (1) CN116134446A (fr)
WO (1) WO2022021073A1 (fr)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190354400A1 (en) * 2017-05-10 2019-11-21 Atlantic Technical Organization, Llc System and method of schedule validation and optimization of machine learning flows for cloud computing
CN110503199A (zh) * 2019-08-14 2019-11-26 北京中科寒武纪科技有限公司 运算节点的拆分方法和装置、电子设备和存储介质
CN110503195A (zh) * 2019-08-14 2019-11-26 北京中科寒武纪科技有限公司 利用人工智能处理器执行任务的方法及其相关产品
CN110826708A (zh) * 2019-09-24 2020-02-21 上海寒武纪信息科技有限公司 一种用多核处理器实现神经网络模型拆分方法及相关产品
CN111126558A (zh) * 2018-10-31 2020-05-08 北京嘉楠捷思信息技术有限公司 一种卷积神经网络计算加速方法及装置、设备、介质
CN111242321A (zh) * 2019-04-18 2020-06-05 中科寒武纪科技股份有限公司 一种数据处理方法及相关产品


Also Published As

Publication number Publication date
CN116134446A (zh) 2023-05-16

Similar Documents

Publication Publication Date Title
EP3144805B1 Method and processing apparatus for performing an arithmetic operation
WO2022000802A1 Deep learning model adaptation method and apparatus, and electronic device
US9818170B2 (en) Processing unaligned block transfer operations
US20210223048A1 (en) Method and apparatus for updating point cloud
EP4071619A1 Address generation method, related apparatus, and storage medium
KR20210131225A (ko) 영상 프레임 처리 방법, 장치, 전자 기기, 저장 매체 및 프로그램
CN111340905B (zh) 图像风格化方法、装置、设备和介质
US20220292638A1 (en) Video resolution enhancement method, storage medium, and electronic device
KR20210090558A (ko) 행렬식 텍스트를 저장하는 방법, 장치 및 전자기기
CN111027704B (zh) 量子资源估计方法、装置和电子设备
US20210209727A1 (en) Method for displaying electronic map, electronic device and readable storage medium
WO2022021073A1 Multi-operator operation method and apparatus for neural network model
CN111553962A (zh) 一种图表显示方法、系统及显示设备
US20210357151A1 (en) Dynamic processing memory core on a single memory chip
US11393068B2 (en) Methods and apparatus for efficient interpolation
US11704896B2 (en) Method, apparatus, device and storage medium for image processing
CN112036561B (zh) 数据处理方法、装置、电子设备及存储介质
WO2023284130A1 Chip and convolution calculation control method, and electronic device
CN111931937B (zh) 图像处理模型的梯度更新方法、装置及系统
US20220309395A1 (en) Method and apparatus for adapting deep learning model, and electronic device
CN118154486B (zh) 基于频域分解的双流水下图像增强方法、装置和设备
WO2022206138A1 Neural network-based operation method and apparatus
JP7403586B2 (ja) 演算子の生成方法および装置、電子機器、記憶媒体並びにコンピュータプログラム
US9218647B2 (en) Image processing apparatus, image processing method, and storage medium
CN113436325B (zh) 一种图像处理方法、装置、电子设备及存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20946659

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20946659

Country of ref document: EP

Kind code of ref document: A1