WO2024001841A1 - Data computing method and related device - Google Patents

Data computing method and related device Download PDF

Info

Publication number
WO2024001841A1
Authority
WO
WIPO (PCT)
Prior art keywords
matrix
network device
data
column vector
ratio
Prior art date
Application number
PCT/CN2023/101090
Other languages
French (fr)
Chinese (zh)
Inventor
Fu Guangning
Xie Xinghua
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Publication of WO2024001841A1 publication Critical patent/WO2024001841A1/en

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization

Definitions

  • This application relates to the field of computers, and further relates to the application of artificial intelligence (AI) technology in the field of computer networks, and in particular, to a data calculation method and related equipment.
  • AI: artificial intelligence
  • the mainstream sparse acceleration technology is 2:4 fine-grained structured matrix multiplication acceleration technology.
  • This application provides a data calculation method and related equipment.
  • the network device can flexibly set the value of the sparsity ratio according to the configuration instructions. Therefore, the computing performance of the hardware can be more fully utilized for different models, with a wide range of applicable scenarios and strong compatibility.
  • a first aspect of this application provides a data calculation method.
  • the network device obtains a first matrix and a second matrix.
  • the first matrix is used to represent the pruned weight matrix in the target model.
  • the second matrix is used to represent the data input to the target model;
  • the network device receives configuration instructions, and the configuration instructions are used to set the sparse ratio;
  • the network device compresses the second matrix according to the sparse ratio to obtain a third matrix;
  • the network device calculates the product of the first matrix and the third matrix.
  • the network device obtains a first matrix and a second matrix, where the first matrix is the pruned weight matrix in the target model, and the second matrix is the data input to the target model.
  • Network devices receive configuration instructions that set the sparsity ratio.
  • the network device compresses the second matrix according to the sparse ratio to obtain a third matrix, and then calculates the product of the first matrix and the third matrix.
  • Network devices can flexibly set the sparsity ratio value according to configuration instructions. Therefore, the computing performance of the hardware can be more fully utilized for different models. The applicable scenarios are wider and the compatibility is stronger.
  • the network device compresses the second matrix according to the sparse ratio to obtain a third matrix, including: the network device compresses the first column vector according to the sparse ratio to obtain a second column vector, where the first column vector belongs to the second matrix and the second column vector belongs to the third matrix; the network device calculates the product of the first matrix and the third matrix, including: the network device calculates the product of a first row vector and the second column vector, where the first row vector belongs to the first matrix.
  • the first column vector belongs to the second matrix
  • the second column vector belongs to the third matrix.
  • the process in which the network device compresses the second matrix according to the sparse ratio to obtain the third matrix includes the process in which the network device compresses the first column vector according to the sparse ratio to obtain the second column vector.
  • the first row vector belongs to the first matrix.
  • the process of the network device calculating the product of the first matrix and the third matrix will also include the process of the network device calculating the product of the first row vector and the second column vector.
  • This possible implementation method illustrates a specific calculation process of multiplying the first matrix and the third matrix, which improves the realizability of the solution.
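The row-by-column calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the point is that multiplying the pruned first row vector by the compressed second column vector gives the same result as the unpruned dot product, because only zero-valued weights are skipped. All variable names and values here are hypothetical.

```python
def dot_compressed(pruned_row, nz_index, full_col):
    """Multiply pruned weights against only the column entries that the
    non-zero data index (bitmap) selects."""
    selected = [x for x, keep in zip(full_col, nz_index) if keep]
    return sum(w * x for w, x in zip(pruned_row, selected))

# Hypothetical 2:4 example: the bitmap marks which weights survived pruning.
nz_index = [1, 0, 0, 1]   # non-zero data index
full_row = [5, 0, 0, 7]   # weight row before pruning
pruned   = [5, 7]         # first row vector (pruned weights)
col      = [1, 2, 3, 4]   # first column vector of the input data

# The compressed product equals the full product: 5*1 + 7*4 = 33.
assert dot_compressed(pruned, nz_index, col) == \
       sum(w * x for w, x in zip(full_row, col))
```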
  • the method further includes: the network device calculates a first move number based on a non-zero data index, where the non-zero data index is used to represent the distribution, in the weight matrix before pruning, of the data in the first matrix, the first move number is used to represent the number of steps by which the selected data is moved, and the selected data is the data in the first column vector corresponding to the non-zero data index; the network device compressing the first column vector according to the non-zero data index to obtain the second column vector includes: the network device compresses the first column vector according to the first move number to obtain the second column vector.
  • the non-zero data index represents the distribution of the data in the first matrix in the weight matrix before pruning. Since the first matrix includes the first row vector, the non-zero data index includes the distribution of the first row vector in the weight matrix before pruning.
  • the network device can obtain the selected data in the first column vector based on the non-zero data index; the selected data is obtained from the distribution of the first row vector in the weight matrix before pruning. The network device can also obtain, from that distribution, the number of steps by which the selected data must move, that is, the first move number, and then compress the first column vector according to the first move number to obtain the second column vector.
  • the network device calculates the first number of moving steps through the non-zero data index, and then compresses the first column vector according to the first number of moving steps to obtain the second column vector. In this way, the compression of the first column vector is achieved, which improves the computing efficiency of the network device.
  • the network device compressing the first column vector according to the first move number to obtain a second column vector includes: the network device selects the selected data from the first column vector according to the non-zero data index; the network device shifts the selected data according to the first move number to obtain the second column vector.
  • the network device can obtain the selected data in the first column vector based on the non-zero data index, and the selected data is obtained based on the distribution of the first row vector in the weight matrix before pruning.
  • the network device can also obtain the first move number based on the distribution of the first row vector in the weight matrix before pruning, and then shift the selected data according to the first move number to obtain the second column vector.
  • the method provides a specific way to obtain the second column vector, which improves the achievability of the solution.
  • the method further includes: the network device selects multiple groups of selected data from the first column vector according to the non-zero data index, each group including one or more selected data; the network device shifts the multiple groups of selected data according to the first move number to obtain multiple groups of selected vectors; and the network device compresses the multiple groups of selected vectors according to the sparsity ratio to obtain the second column vector.
  • the network device can obtain multiple sets of selected data in the first column vector based on the non-zero data index, where each set of selected data is based on the first row Obtained from the distribution of vectors in the weight matrix before pruning.
  • the network device can also obtain the first move number corresponding to each group of selected data based on the distribution of the first row vector in the weight matrix before pruning, and then shift each group of selected data according to its first move number to obtain multiple groups of selected vectors.
  • the network device compresses the multiple groups of selected vectors according to the sparsity ratio to obtain the second column vector.
  • In this way, the product of the first row vector and the second column vector is completed within a single processing component, making full use of the computing power of the processing component, avoiding wasted computing power, and improving calculation efficiency.
  • the sparsity ratio is determined by the computing power of the processing component in the network device and/or the weight matrix before pruning.
  • the value of the sparse ratio can be determined based on the computing power of the processing component, the complexity of the weight matrix before pruning, the training accuracy required for the weight matrix before pruning, and other factors.
  • This possible implementation provides multiple factors that need to be considered when determining the sparse ratio.
  • different values can be set based on different factors, which increases the flexibility of the solution.
  • a second aspect of the present application provides a network device, which includes at least one processor, a memory, and a communication interface.
  • the processor is coupled to the memory and the communication interface.
  • the memory is used to store instructions
  • the processor is used to execute the instructions
  • the communication interface is used to communicate with other network devices under the control of the processor.
  • the instruction causes the network device to execute the method in the above first aspect or any possible implementation of the first aspect.
  • a third aspect of the present application provides a computer-readable storage medium that stores a program that causes a terminal device to execute the method in the above-mentioned first aspect or any possible implementation of the first aspect.
  • a fourth aspect of the present application provides a computer program product that stores one or more computer-executable instructions.
  • when the instructions are executed, the processor performs the method in the above first aspect or any possible implementation of the first aspect.
  • a fifth aspect of the present application provides a chip.
  • the chip includes a processor and a communication interface.
  • the processor is coupled to the communication interface.
  • the processor is configured to read instructions to execute the method in the above first aspect or any possible implementation of the first aspect.
  • a sixth aspect of this application provides a network system.
  • the network system includes a network device, and the network device can execute the method described in the first aspect or any possible implementation of the first aspect.
  • Figure 1 is a schematic structural diagram of a data computing system provided by this application.
  • Figure 2 is a schematic flow chart of a data calculation method provided by this application.
  • Figure 3 is a schematic diagram of an embodiment of a data calculation method provided by this application.
  • Figure 4 is a schematic diagram of another embodiment of a data calculation method provided by this application.
  • Figure 5 is a schematic diagram of another embodiment of a data calculation method provided by this application.
  • Figure 6 is a schematic diagram of another embodiment of a data calculation method provided by this application.
  • Figure 7 is a schematic diagram of another embodiment of a data calculation method provided by this application.
  • Figure 8 is a schematic diagram of another embodiment of a data calculation method provided by this application.
  • Figure 9 is a schematic diagram of another embodiment of a data calculation method provided by this application.
  • Figure 10 is a schematic structural diagram of a network device provided by this application.
  • Figure 11 is a schematic structural diagram of another network device provided by this application.
  • At least one of a, b, or c can mean: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b, and c can each be single or multiple.
  • the mainstream sparse acceleration technology is 2:4 fine-grained structured matrix multiplication acceleration technology.
  • this application provides a data calculation method, a data calculation system and a network device.
  • the network device obtains the first matrix and the second matrix, where the first matrix is the pruned weight matrix in the target model, and the second matrix is the data input to the target model.
  • Network devices receive configuration instructions that set the sparsity ratio.
  • the network device compresses the second matrix according to the sparse ratio to obtain a third matrix, and then calculates the product of the first matrix and the third matrix.
  • Network devices can flexibly set the value of the sparse ratio according to configuration instructions. Therefore, the computing performance of the hardware can be more fully utilized for different models, the applicable scenarios are wider, and the compatibility is stronger.
  • Figure 1 is a schematic structural diagram of a data computing system provided by this application.
  • the data computing system includes a computing engine.
  • the computing engine obtains a first matrix and a second matrix from a register, where the first matrix is the pruned weight matrix in the target model.
  • Assume the scale of the first matrix is M×K, that is, the first matrix has M rows of data and K columns of data.
  • the second matrix is the data input to the target model. It is assumed that the scale of the second matrix is WK×N, that is, the second matrix has WK rows of data and N columns of data, where W is the sparse ratio.
  • the calculation engine can flexibly set the value of the sparse ratio according to the configuration instruction, and further, can compress the second matrix according to the sparse ratio to obtain the third matrix.
  • the size of the third matrix is K×N, that is, the third matrix has K rows of data and N columns of data.
  • the size of the matrix obtained after the calculation engine calculates the product of the first matrix and the third matrix is M×N, that is, the matrix has M rows of data and N columns of data.
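The dimension bookkeeping above can be checked with a short sketch. This is an illustration only: the compression step here is a stand-in (it keeps one row out of every W, whereas the real engine keeps the rows selected by the non-zero data index), and W is the compression factor implied by the sparsity ratio (e.g. W = 2 for a 4:8 ratio).

```python
# First matrix: M×K; second matrix: WK×N; third matrix: K×N; product: M×N.
M, K, N, W = 2, 4, 3, 2

first  = [[1.0] * K for _ in range(M)]       # M×K pruned weight matrix
second = [[1.0] * N for _ in range(W * K)]   # WK×N input data

# Stand-in compression: keep every W-th row (placeholder for index-driven
# selection), yielding the K×N third matrix.
third = second[::W]

# Standard matrix product: (M×K) · (K×N) → M×N.
product = [[sum(first[i][k] * third[k][j] for k in range(K))
            for j in range(N)] for i in range(M)]

assert len(third) == K and len(third[0]) == N
assert len(product) == M and len(product[0]) == N
```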
  • the data computing system includes a computing engine.
  • the computing engine can flexibly set the value of the sparsity ratio according to the configuration instructions. Therefore, the computing performance of the hardware can be more fully utilized for different models, the applicable scenarios are wider, and the compatibility is stronger.
  • Figure 2 is a schematic diagram of the calculation flow of a data calculation method provided by this application.
  • the network device obtains the first matrix and the second matrix.
  • the first matrix is used to represent the pruned weight matrix in the target model
  • the second matrix is used to represent the data input to the target model
  • the size of the first matrix is M×K, that is, the first matrix has M rows of data and K columns of data.
  • the first matrix may be obtained by pruning the weight matrix before pruning, and the scale of the weight matrix before pruning is WM×K.
  • the second matrix is the data input to the target model. It is assumed that the scale of the second matrix is WK×N, that is, the second matrix has WK rows of data and N columns of data, where W is the sparse ratio.
  • the network device receives the configuration instruction.
  • the configuration instruction is used to set the sparse ratio, and the sparse ratio can be changed.
  • There are many ways for network devices to configure the sparsity ratio according to configuration instructions, which are explained in detail below.
  • Method 1: include the sparsity ratio value in the configuration instruction.
  • the configuration instruction may include the value of the sparse ratio, and the sparse ratio may be configured by operation and maintenance personnel according to requirements.
  • the computing power of the processing component can be considered when configuring the sparse ratio. Assuming that the processing component PE can complete 8×8 data calculations, the sparse ratio can be chosen from solutions such as 8:16, 4:16, 4:8, and 2:8, based on the available computing power (without exceeding the computing power of the processing component), the complexity of the weight matrix, and the training accuracy requirements of the model.
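A configuration check along the lines described above could look as follows. This is a hypothetical helper, not part of the patent: the function name, the `pe_width=8` limit (taken from the 8×8 PE assumption in the text), and the acceptance rule are all illustrative.

```python
def ratio_supported(keep, group, pe_width=8):
    """Accept an N:M sparsity ratio (keep `keep` of every `group` elements)
    only if the kept portion fits within the PE's assumed width."""
    return 0 < keep <= min(group, pe_width)

# The solutions listed in the text all fit an 8-wide PE.
assert ratio_supported(8, 16)
assert ratio_supported(4, 16)
assert ratio_supported(4, 8)
assert ratio_supported(2, 8)
# A ratio keeping more data than the PE can compute is rejected.
assert not ratio_supported(16, 16, pe_width=8)
```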
  • the computing power of the processing component supports it, you can also consider the training accuracy required for the unpruned weight matrix when configuring the sparse ratio value, and flexibly adjust the sparse ratio value based on the training accuracy.
  • the computing power of the processing component supports it, the complexity of the unpruned weight matrix can also be considered when configuring the value of the sparse ratio, and the value of the sparse ratio can be flexibly adjusted according to the complexity of the weight matrix.
  • operation and maintenance personnel can also configure the sparsity ratio value based on other factors, which are not limited here.
  • Method 2: generate the sparse ratio according to the configuration instruction.
  • the configuration instruction may not include the sparse ratio value, and the sparse ratio value may be generated by the network device after receiving the configuration instruction.
  • the network device can determine the sparsity ratio value based on the data amount of the unpruned weight matrix, and generate different sparse ratio values for the unpruned weight matrix with different data amounts. For example, when the network device confirms that the data amount of the unpruned weight matrix is greater than the first threshold A, the value of the sparse ratio can be set to ratio A.
  • otherwise, the value of the sparse ratio can be set to ratio B, where the value of ratio A is smaller than the value of ratio B.
  • In this way, the sparsity ratio is small when the data amount of the unpruned weight matrix is large, which can improve the training efficiency of the weight matrix.
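A sketch of this "Method 2" rule follows. The threshold and the two ratio values are hypothetical placeholders for the text's "first threshold A", "ratio A", and "ratio B" (ratio A being the smaller value, used for larger matrices).

```python
def choose_sparsity_ratio(data_amount, threshold=1_000_000):
    """Derive a sparsity ratio from the data amount of the unpruned
    weight matrix, per the thresholding rule described in the text."""
    ratio_a = (2, 8)   # smaller value: keep 2 of every 8 (large matrices)
    ratio_b = (4, 8)   # larger value: keep 4 of every 8
    return ratio_a if data_amount > threshold else ratio_b

assert choose_sparsity_ratio(2_000_000) == (2, 8)
assert choose_sparsity_ratio(10_000) == (4, 8)
```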
  • the network device can determine the sparsity ratio value based on the training accuracy required for the unpruned weight matrix. For example, an unpruned weight matrix can be accompanied by an identifier that describes its required training accuracy.
  • the network device can then generate a corresponding sparsity ratio value based on the identifier, so that different sparsity ratio values are generated for different training accuracies.
  • the network device can also generate the sparsity ratio value based on other factors, which are not limited here.
  • the network device compresses the second matrix according to the sparsity ratio to obtain the third matrix.
  • the calculation engine can compress the second matrix according to the sparsity ratio to obtain the third matrix. Assuming the scale of the second matrix is WK×N, that is, the second matrix has WK rows of data and N columns of data, where W is the sparse ratio, then the scale of the third matrix is K×N, that is, the third matrix has K rows of data and N columns of data.
  • the network device calculates the product of the first matrix and the third matrix.
  • the network device may use the processing component PE to calculate the product of the first matrix and the third matrix to obtain a convolution result.
  • the network device obtains the first matrix and the second matrix.
  • the first matrix is the pruned weight matrix in the target model.
  • the second matrix is the data input to the target model.
  • the network device can compress the second matrix according to the sparse ratio to obtain a third matrix, and the network device calculates the product of the first matrix and the third matrix.
  • the network device compresses the second matrix according to the sparse ratio to obtain a third matrix, and then calculates the product of the first matrix and the third matrix.
  • Network devices can flexibly set the sparsity ratio value according to configuration instructions. Therefore, the computing performance of the hardware can be more fully utilized for different models. The applicable scenarios are wider and the compatibility is stronger.
  • the network device compresses the second matrix according to the sparse ratio to obtain the third matrix.
  • the network device compresses the second matrix according to the sparse ratio to obtain the third matrix.
  • the process of the network device compressing the second matrix according to the sparse ratio to obtain the third matrix includes the process of the network device compressing the first column vector according to the sparse ratio to obtain the second column vector, wherein the first column vector belongs to the second matrix and the second column vector belongs to the third matrix.
  • the process of the network device calculating the product of the first matrix and the third matrix will also include the process of the network device calculating the product of the first row vector and the second column vector, where the first row vector belongs to the first matrix.
  • Figure 3 is a schematic diagram of another calculation flow of a data calculation method provided by this application.
  • the network device can dynamically configure the sparse ratio according to the computing power of the computing component and the complexity of the unpruned weight matrix, ensuring that a fine-grained structured sparse acceleration process is achieved when the model adopts an N:M sparse ratio configuration.
  • Input the bitmap (non-zero data index) data of matrix A (the first matrix) into the Prefix Sum module, calculate the number of local moving steps (the first move number) of the selected data in the column vector of matrix B (the second matrix) for each row of PEs, then input the calculation result and the sparse ratio into the non-zero data compressor, and complete the data compression of the column vector of matrix B on a PE through local non-zero data compression and global data compression.
  • the sparse matrix accelerator can parse and compute the bitmap of matrix A by adding Prefix Sum and None-Zero Shifter units to each PE module, and perform non-zero data movement on the column vectors of matrix B based on the results, thereby accelerating matrix operations.
  • the following describes in detail the process by which the network device compresses the first column vector according to the sparsity ratio to obtain the second column vector.
  • the network device can calculate the first move number based on the non-zero data index.
  • the non-zero data index is used to represent the distribution of the data in the first matrix in the weight matrix before pruning.
  • the first move number is used to represent the number of moving steps of the selected data.
  • the selected data is used to represent the data corresponding to the non-zero data index in the first column vector.
  • the network device can introduce the Prefix Sum and None-Zero Shifter modules in the matrix calculation engine to ensure that the fine-grained structured sparse acceleration process can be achieved when the model adopts the configuration of N:M sparse ratio.
  • the specific operation process of the Prefix-Sum module is shown in Figure 3: invert each bit of data in the bitmap (non-zero data index) and accumulate the data before each bitmap position, so that the number of zero-valued elements preceding each element is obtained as the distance that element will subsequently move (the first move number).
  • Figure 4 is a schematic diagram of another calculation flow of a calculation method provided by this application.
  • the data included in the non-zero data index in Figure 4 are 1, 0, 0, 1, 1, 0, 1, 0 respectively.
  • the inverted results are 0, 1, 1, 0, 0, 1, 0, 1.
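The Prefix-Sum step on this example can be reproduced directly. The sketch below (an illustration, not the hardware implementation) inverts the bitmap and takes the running sum of the bits before each position; at each non-zero position, the result is the count of earlier zeros, i.e. how many steps left that element must shift.

```python
def move_numbers(bitmap):
    """Exclusive prefix sum over the inverted bitmap: moves[i] is the
    number of zero-valued elements strictly before position i."""
    inverted = [1 - b for b in bitmap]
    moves, acc = [], 0
    for bit in inverted:
        moves.append(acc)
        acc += bit
    return moves

bitmap = [1, 0, 0, 1, 1, 0, 1, 0]   # the non-zero data index from Figure 4
assert [1 - b for b in bitmap] == [0, 1, 1, 0, 0, 1, 0, 1]

moves = move_numbers(bitmap)
# Non-zero elements sit at positions 0, 3, 4, 6 and shift left by
# 0, 2, 2, 3 steps, landing in positions 0, 1, 2, 3 of the queue.
assert [moves[i] for i, b in enumerate(bitmap) if b] == [0, 2, 2, 3]
```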
  • after the network device obtains the first move number, it can compress the first column vector into the second column vector.
  • the specific compression process will be described in detail in the following example.
  • the non-zero data movement control signal required to be moved is obtained from Prefix Sum and sent to the Local None-Zero Shifter unit.
  • the network device first selects the selected data from the first column vector according to the non-zero data index. Then, the network device shifts the selected data according to the first move number to obtain the second column vector, that is, it moves each selected data unit to the appropriate position according to the first move number.
  • Figure 5 is a schematic diagram of another calculation flow of a data calculation method provided by this application.
  • the None-Zero Shifter unit at least includes the local None-Zero Shifter unit.
  • the None-Zero Shifter can complete the data compression of the column vector of matrix B on a PE based on the number of local non-zero moving steps (the first move number) calculated by Prefix Sum. As shown in Figure 5, if the sparse ratio is 4:8 and the non-zero data index indicates that the selected data are the 1s in the first, fourth, fifth, and seventh positions, then the selected data in every 8 elements are moved to the first 4 positions of the queue.
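The local compression for this 4:8 case can be sketched as follows. The queue values are hypothetical; what matters is that the entries at the indexed positions end up packed into the first 4 slots.

```python
def local_shift(queue, nz_index, keep):
    """Local None-Zero Shifter sketch: pack the entries selected by the
    non-zero data index into the first `keep` positions, zero-padding
    any unused slots."""
    selected = [x for x, b in zip(queue, nz_index) if b]
    return selected + [0] * (keep - len(selected))

queue    = [11, 12, 13, 14, 15, 16, 17, 18]   # one 8-element column segment
nz_index = [1, 0, 0, 1, 1, 0, 1, 0]           # 1s in positions 1, 4, 5, 7 (1-based)

# With a 4:8 sparse ratio, the four selected entries fill the queue head.
assert local_shift(queue, nz_index, keep=4) == [11, 14, 15, 17]
```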
  • the non-zero data movement control signal required to be moved is obtained from Prefix Sum and sent to the Local None-Zero Shifter unit.
  • the network device selects multiple groups of selected data from the first column vector according to the non-zero data index, where each group The selected data includes one or more selected data.
  • the network device shifts the multiple sets of selected data according to the first move number to obtain multiple sets of selected vectors.
  • the network device compresses the multiple sets of selected vectors according to the sparsity ratio to obtain the second column vector.
  • Figure 8 is a schematic diagram of another calculation flow of a data calculation method provided by this application.
  • the None-Zero Shifter unit can be divided into two sub-modules, namely local None-Zero Shifter and global None-Zero Shifter.
  • the None-Zero Shifter, based on the number of local non-zero moving steps (the first move number) calculated by Prefix Sum and the sparse ratio, completes the compression of the column-vector data of matrix B on a PE through the two modules of local non-zero data compression and global data compression.
  • Local None-Zero Shifter completes local compression.
  • the specific implementation method is similar to the implementation method mentioned in the above example.
  • the Global None-Zero Shifter uses the result of local compression as input to perform global data compression, and sends the compressed data to the PE.
  • Figure 9 is a schematic diagram of another calculation flow of a data calculation method provided by this application.
  • the global data compression module has a specific implementation method for compressing the second matrix, which will be explained in detail in the following example.
  • the non-zero data movement control signal is obtained from Prefix Sum and sent to the Local None-Zero Shifter unit, which moves the non-zero data of each data unit to the appropriate location. If the sparse ratio is 4:8, the non-zero data in every 8 elements are moved to the first 4 positions of the queue; if the sparse ratio is 2:8, the non-zero data in every 8 elements are moved to the first 2 positions of the queue; if the sparse ratio is 1:8, the non-zero data in every 8 elements are moved to the first position of the queue.
  • the non-zero data in multiple queues are then assembled to form the 16 data inputs for subsequent calculations. If the sparse ratio is 4:8, the first 4 non-zero data in 4 queues are moved to form a 16-element non-zero data vector; if the sparse ratio is 2:8, the first 2 non-zero data in 8 queues are moved to form a 16-element non-zero data vector; if the sparse ratio is 1:8, the first non-zero data in 16 queues are moved to form a 16-element non-zero data vector.
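The global assembly step above can be sketched as follows, under the assumption of a 16-wide PE input as stated in the text; the function and its queue contents are illustrative, not the patent's implementation.

```python
def global_assemble(queues, keep, width=16):
    """Global None-Zero Shifter sketch: concatenate the first `keep`
    entries of width//keep locally-compressed queues into one
    `width`-element PE input vector (4:8 -> 4 queues, 2:8 -> 8 queues,
    1:8 -> 16 queues)."""
    needed = width // keep
    vec = []
    for q in queues[:needed]:
        vec.extend(q[:keep])
    return vec

# 4:8 case: the first 4 entries of 4 locally-compressed queues.
queues = [[1, 2, 3, 4, 0, 0, 0, 0],
          [5, 6, 7, 8, 0, 0, 0, 0],
          [9, 10, 11, 12, 0, 0, 0, 0],
          [13, 14, 15, 16, 0, 0, 0, 0]]
assert global_assemble(queues, keep=4) == list(range(1, 17))
```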
  • the network device obtains a first matrix and a second matrix, where the first matrix is the pruned weight matrix in the target model, and the second matrix is the data input to the target model.
  • Network devices receive configuration instructions that set the sparsity ratio.
  • the network device compresses the second matrix according to the sparse ratio to obtain a third matrix, and then calculates the product of the first matrix and the third matrix.
  • Network devices can flexibly set the sparsity ratio value according to configuration instructions. Therefore, the computing performance of the hardware can be more fully utilized for different models. The applicable scenarios are wider and the compatibility is stronger.
  • the above examples provide different implementations of a data calculation method.
  • the following provides a network device 20, as shown in Figure 10.
  • the network device 20 is used to execute the data calculation method involved in the above examples.
  • For the execution steps and the corresponding beneficial effects, please refer to the corresponding examples above; they are not repeated here. The network device 20 includes:
  • the processing unit 201 is configured to obtain a first matrix and a second matrix, the first matrix is used to represent the pruned weight matrix in the target model, and the second matrix is used to represent the data input to the target model;
  • the receiving unit 202 is used to receive configuration instructions, where the configuration instructions are used to set the sparse ratio;
  • the processing unit 201 is used for:
  • calculate the product of the first matrix and the third matrix.
  • the processing unit 201 is used for:
  • the processing unit 201 is also configured to calculate the first move number based on the non-zero data index, where the non-zero data index is used to represent the distribution of the data in the first matrix in the weight matrix before pruning, the first move number is used to represent the number of moving steps of the selected data, and the selected data is used to represent the data corresponding to the non-zero data index in the first column vector;
  • the processing unit 201 is configured to compress the first column vector according to the first move number to obtain a second column vector.
  • the processing unit 201 is used for:
  • the selected data is shifted according to the first move number to obtain the second column vector.
  • the processing unit 201 is used for:
  • the second column vector is obtained by compressing the plurality of selected vectors according to the sparsity ratio.
  • the sparsity ratio is determined by the computing power of the processing element in the network device and/or the weight matrix before pruning.
  • the above examples provide different implementations of the network device 20.
  • the following provides a network device 30, as shown in Figure 11.
  • The network device 30 is used to perform the data calculation method in the above examples.
  • For the execution steps and the specific beneficial effects, please refer to the corresponding examples above; they are not repeated here.
  • the network device 30 includes: a processor 302 , a communication interface 303 , and a memory 301 .
  • bus 304 may be included.
  • the communication interface 303, the processor 302 and the memory 301 can be connected to each other through the bus 304;
  • the bus 304 can be a peripheral component interconnect standard (Peripheral Component Interconnect, PCI) bus or an extended industry standard architecture (EISA) bus etc.
  • the bus can be divided into address bus, data bus, control bus, etc. For ease of presentation, only one thick line is used in Figure 11, but it does not mean that there is only one bus or one type of bus.
  • the network device 30 can implement the functions of any network device in the example shown in FIG. 11 .
  • the processor 302 and the communication interface 303 can perform corresponding operations of the network device in the above method examples.
  • the memory 301 may be a volatile memory, such as a random-access memory (RAM); or a non-volatile memory, such as a read-only memory (ROM), a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD).
  • the processor 302 is the control center of the network device 30, and may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the examples provided in this application, for example, one or more digital signal processors (DSP) or one or more field-programmable gate arrays (FPGA).
  • Communication interface 303 is used to communicate with other network devices.
  • the processor 302 can perform the operations performed by the network device in the example shown in FIG. 10, which will not be described again here.
  • the chip includes a processor and a communication interface.
  • the processor is coupled to the communication interface.
  • the processor is used to read instructions and perform the operations performed by the network device in the embodiments described in Figures 1 to 11.
  • This application provides a network system, which includes the network device described in the embodiment shown in Figure 10.
  • the disclosed systems, devices and methods can be implemented in other ways.
  • the device examples described above are only illustrative.
  • the division of the units is only a logical function division; in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the coupling or direct coupling or communication connection between each other shown or discussed may be through some interfaces, and the indirect coupling or communication connection of the devices or units may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of this example.
  • each functional unit in each example of this application can be integrated into one processing unit, or each unit can exist physically alone, or two or more units can be integrated into one unit.
  • the above integrated units can be implemented in the form of hardware or software functional units.
  • the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium.
  • the technical solution of this application, in essence, or the part that contributes to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the examples of this application.
  • the aforementioned storage media include: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Computational Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Stored Programmes (AREA)
  • Complex Calculations (AREA)

Abstract

The present application provides a data computing method and a related device. In the present application, a network device obtains a first matrix and a second matrix, wherein the first matrix is a pruned weight matrix in a target model, and the second matrix is data input to the target model. The network device receives a configuration instruction, the configuration instruction being used for setting a sparsity ratio. The network device compresses the second matrix according to the sparsity ratio to obtain a third matrix, and then calculates the product of the first matrix and the third matrix. The network device can flexibly set the value of the sparsity ratio according to the configuration instruction; therefore, the computing performance of the hardware can be exploited more fully for different models, with a wide range of applicable scenarios and strong compatibility.
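As an illustrative aid (not part of the original application text), the flow described in the abstract can be sketched in plain Python. The function names, the 2:4 example values, and the list representation of the non-zero index are assumptions for illustration only; the patent itself targets hardware, not Python.

```python
# Sketch of the abstract's flow: the second matrix is compressed by keeping
# only the rows that correspond to the surviving (non-zero) positions of the
# pruned weight matrix, and the product is then computed on the compressed
# operands. All names here are illustrative assumptions.

def compress_columns(second_matrix, nonzero_index):
    """Keep only the rows of the input matrix selected by the non-zero
    index of the pruned weight matrix (yielding the 'third matrix')."""
    return [second_matrix[i] for i in nonzero_index]

def matmul(a, b):
    """Plain dense matrix product of the pruned weights and the
    compressed input."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[r][k] * b[k][c] for k in range(inner)) for c in range(cols)]
            for r in range(rows)]

# Pruned 1x2 weight row taken from an original 1x4 row [0, 5, 0, 7]:
first_matrix = [[5, 7]]                # non-zero values only
nonzero_index = [1, 3]                 # where they sat before pruning
second_matrix = [[1], [2], [3], [4]]   # 4x1 input (the "second matrix")

third_matrix = compress_columns(second_matrix, nonzero_index)
result = matmul(first_matrix, third_matrix)
# 5*2 + 7*4 = 38, identical to multiplying the unpruned row by the full input
```

The key point of the sketch is that the result equals the product of the unpruned row and the full input, because the zeroed weight positions contribute nothing.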

Description

A data calculation method and related device
This application claims priority to the Chinese patent application No. CN202210736691.6, filed with the China Patent Office on June 27, 2022 and entitled "A data calculation method and related device", which is incorporated herein by reference in its entirety.
Technical Field
This application relates to the field of computers, further relates to the application of artificial intelligence (AI) technology in the field of computer networks, and in particular relates to a data calculation method and related device.
Background
In the current AI field, there are many types of models, and the models are becoming increasingly complex. According to the lottery ticket hypothesis, a sparse model has the same or even better learning capability than a complex model. Moreover, training a complex model is expensive and time-consuming, while training a sparse model incurs less overhead and takes less time. How to sparsify a complex model while ensuring that the sparse model runs smoothly is the key to accelerating model training.
To sparsify a complex model, the weight matrix in the complex model is usually pruned, and operations are performed on the pruned weight matrix and the data. Currently, the mainstream sparse acceleration technology is the 2:4 fine-grained structured matrix multiplication acceleration technology. The weight matrix in the AI model is pruned with a 2:4 structure, that is, 2 elements are retained out of every 4 elements in the weight matrix, completing the sparsification of the weight matrix. This technology thereby accelerates AI model computation and saves the time required for model training.
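As an illustrative sketch (not part of the original application text), 2:4 structured pruning can be expressed in a few lines of Python; the magnitude-based selection rule shown here is one common choice and is an assumption, since the paragraph above does not prescribe how the 2 surviving elements are picked.

```python
# Illustrative 2:4 structured pruning: within every group of 4 weights,
# keep the 2 with the largest magnitude and zero the other 2, giving a
# 50% structured-sparse row. The magnitude criterion is an assumption.

def prune_2_of_4(row):
    pruned = list(row)
    for g in range(0, len(row), 4):
        group = row[g:g + 4]
        # indices of the two smallest-magnitude elements in this group
        drop = sorted(range(len(group)), key=lambda i: abs(group[i]))[:2]
        for i in drop:
            pruned[g + i] = 0
    return pruned

weights = [0.9, -0.1, 0.4, 0.05, -0.7, 0.2, 0.1, 0.8]
print(prune_2_of_4(weights))   # exactly two survivors per group of four
```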
However, in traditional model acceleration technology, only a fixed 2:4 sparse structure can be used to sparsify the weight matrix, and the sparsity ratio of the weight matrix cannot be changed; the applicable scenarios are relatively limited and the compatibility is poor.
Summary
This application provides a data calculation method and related device. In this application, a network device can flexibly set the value of the sparsity ratio according to a configuration instruction; therefore, the computing performance of the hardware can be exploited more fully for different models, with a wide range of applicable scenarios and strong compatibility.
A first aspect of this application provides a data calculation method. In the method, a network device obtains a first matrix and a second matrix, where the first matrix represents a pruned weight matrix in a target model and the second matrix represents data input to the target model; the network device receives a configuration instruction, where the configuration instruction is used to set a sparsity ratio; the network device compresses the second matrix according to the sparsity ratio to obtain a third matrix; and the network device calculates the product of the first matrix and the third matrix.
In this application, the network device obtains a first matrix and a second matrix, where the first matrix is the pruned weight matrix in the target model and the second matrix is the data input to the target model. The network device receives a configuration instruction, and the configuration instruction is used to set the sparsity ratio. The network device compresses the second matrix according to the sparsity ratio to obtain a third matrix, and then calculates the product of the first matrix and the third matrix. The network device can flexibly set the value of the sparsity ratio according to the configuration instruction; therefore, the computing performance of the hardware can be exploited more fully for different models, with a wide range of applicable scenarios and strong compatibility.
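The configurability described in the first aspect can be illustrated with a small sketch (not from the application text; the n:m encoding of the ratio and all names are assumptions): the same compression routine serves a 2:4 ratio, a 1:4 ratio, or any other ratio set by the configuration instruction, instead of a hardwired 2:4 structure.

```python
# Sketch of a configurable sparsity ratio n:m: the "configuration
# instruction" is modelled as the (n, m) arguments, and the per-group
# non-zero index of the pruned weight matrix tells which n of every m
# input elements survive compression. All names are illustrative.

def compress_by_ratio(column, index, n, m):
    """Pick the n surviving positions out of every m elements, as
    recorded in the per-group non-zero index."""
    assert all(len(g) == n for g in index), "index must list n picks per group"
    out = []
    for g, picks in enumerate(index):
        base = g * m
        out.extend(column[base + p] for p in picks)
    return out

column = [10, 11, 12, 13, 20, 21, 22, 23]
# ratio 2:4 -> keep 2 of every 4 (positions taken from the index)
print(compress_by_ratio(column, [[0, 2], [1, 3]], 2, 4))
# ratio 1:4 -> same code, different configuration
print(compress_by_ratio(column, [[3], [0]], 1, 4))
```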
In a possible implementation of the first aspect, the network device compressing the second matrix according to the sparsity ratio to obtain the third matrix includes: the network device compresses a first column vector according to the sparsity ratio to obtain a second column vector, where the first column vector belongs to the second matrix and the second column vector belongs to the third matrix; and the network device calculating the product of the first matrix and the third matrix includes: the network device calculates the product of a first row vector and the second column vector, where the first row vector belongs to the first matrix.
In this possible implementation, the first column vector belongs to the second matrix and the second column vector belongs to the third matrix. It can be understood that the process in which the network device compresses the second matrix according to the sparsity ratio to obtain the third matrix includes the process in which the network device compresses the first column vector according to the sparsity ratio to obtain the second column vector. In addition, the first row vector belongs to the first matrix, so the process in which the network device calculates the product of the first matrix and the third matrix also includes the process in which the network device calculates the product of the first row vector and the second column vector. This possible implementation sets out a specific calculation process for the product of the first matrix and the third matrix, which improves the realizability of the solution.
In a possible implementation of the first aspect, the method further includes: the network device calculates a first shift number based on a non-zero data index, where the non-zero data index represents the distribution of the data of the first matrix in the weight matrix before pruning, the first shift number represents the number of shift steps of the selected data, and the selected data is the data in the first column vector corresponding to the non-zero data index; and the network device compressing the first column vector according to the non-zero data index to obtain the second column vector includes: the network device compresses the first column vector according to the first shift number to obtain the second column vector.
In this possible implementation, the non-zero data index represents the distribution of the data of the first matrix in the weight matrix before pruning. Since the first matrix includes the first row vector, the non-zero data index includes the distribution of the first row vector in the weight matrix before pruning. The network device can obtain the selected data in the first column vector based on the non-zero data index; the selected data is obtained based on the distribution of the first row vector in the weight matrix before pruning. In addition, the number of shift steps of the selected data, that is, the first shift number, can also be obtained based on that distribution, and the first column vector is then compressed according to the first shift number to obtain the second column vector. In this possible implementation, the network device calculates the first shift number through the non-zero data index, and then compresses the first column vector according to the first shift number to obtain the second column vector. Implementing the compression of the first column vector in this way improves the computing efficiency of the network device.
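One concrete way to derive a shift number from the non-zero data index is sketched below; this is an assumption for illustration only (the application does not fix a formula), under which each selected element shifts left by the count of pruned positions in front of it, so its shift number is its original position minus its rank among the survivors.

```python
# Sketch (illustrative assumption, not the patent's exact hardware logic):
# derive per-element shift numbers from the non-zero index, then compress
# a column vector by shifting each selected element left by its number.

def shift_numbers(nonzero_index):
    """nonzero_index: sorted original positions of the kept weights."""
    return [pos - rank for rank, pos in enumerate(nonzero_index)]

def compress_with_shifts(column, nonzero_index):
    shifts = shift_numbers(nonzero_index)
    out = [0] * len(nonzero_index)
    for pos, s in zip(nonzero_index, shifts):
        out[pos - s] = column[pos]   # shifting left by s lands at its rank
    return out

index = [1, 3, 4, 6]                  # survivors of an 8-wide group
print(shift_numbers(index))           # shift counts per survivor
print(compress_with_shifts([0, 9, 0, 8, 7, 0, 6, 0], index))
```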
In a possible implementation of the first aspect, the network device compressing the first column vector according to the first shift number to obtain the second column vector includes: the network device selects the selected data from the first column vector according to the non-zero data index; and the network device shifts the selected data according to the first shift number to obtain the second column vector.
In this possible implementation, the network device can obtain the selected data in the first column vector based on the non-zero data index; the selected data is obtained based on the distribution of the first row vector in the weight matrix before pruning. The network device can also obtain the first shift number based on that distribution, and then shift the selected data according to the first shift number to obtain the second column vector. This possible implementation provides a specific way to obtain the second column vector, which improves the realizability of the solution.
In a possible implementation of the first aspect, the method further includes: the network device selects multiple groups of selected data from the first column vector according to the non-zero data index, where each group of selected data includes one or more pieces of selected data; the network device shifts the multiple groups of selected data according to the first shift number to obtain multiple groups of selected vectors; and the network device compresses the multiple groups of selected vectors according to the sparsity ratio to obtain the second column vector.
In this possible implementation, if the amount of data in the first matrix is large, the network device can obtain multiple groups of selected data in the first column vector based on the non-zero data index, where each group of selected data is obtained based on the distribution of the first row vector in the weight matrix before pruning. The network device can also obtain the first shift number corresponding to each group of selected data based on that distribution, shift each group of selected data according to its first shift number to obtain multiple groups of selected vectors, and compress the multiple groups of selected vectors according to the sparsity ratio to obtain the second column vector. In this case, based on the computing power of the processing element (PE), the network device can compress multiple groups of selected vectors into the second column vector, so that the product of the first row vector and the second column vector is completed in a single processing element. This makes full use of the computing power of the processing element, avoids wasting computing power, and improves calculation efficiency.
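The multi-group packing described above can be sketched as follows (an illustrative assumption: the PE width, group layout, and names are not taken from the application text). When each compressed group yields only 2 survivors but the PE can multiply vectors of width 8, four groups are compressed and packed into a single second column vector, so one PE pass replaces four quarter-utilised passes.

```python
# Sketch of packing several compressed groups into one PE-wide vector.
# PE_WIDTH and the 2:4 group layout are illustrative assumptions.

PE_WIDTH = 8

def pack_groups(column, group_indices, group_size):
    """Compress each group of the column by its non-zero index and
    concatenate the survivors into one vector that fills the PE."""
    packed = []
    for g, picks in enumerate(group_indices):
        base = g * group_size
        packed.extend(column[base + p] for p in picks)
    assert len(packed) <= PE_WIDTH, "packed vector must fit the PE"
    return packed

column = list(range(16))                    # 4 groups of 4 input values
indices = [[0, 3], [1, 2], [0, 1], [2, 3]]  # 2 survivors per group (2:4)
print(pack_groups(column, indices, 4))      # one full-width PE operand
```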
In a possible implementation of the first aspect, the sparsity ratio is determined by the computing power of the processing element in the network device and/or the weight matrix before pruning.
In this possible implementation, when configuring the sparsity ratio, the value of the sparsity ratio can be determined based on the computing power of the processing element, the complexity of the weight matrix before pruning, the precision required when training the weight matrix before pruning, and other factors. This possible implementation provides multiple factors to be considered when determining the sparsity ratio, and the network device can set different values based on different factors, which increases the flexibility of the solution.
A second aspect of this application provides a network device. The network device includes at least one processor, a memory, and a communication interface. The processor is coupled to the memory and the communication interface. The memory is used to store instructions, the processor is used to execute the instructions, and the communication interface is used to communicate with other network devices under the control of the processor. When executed by the processor, the instructions cause the network device to perform the method in the above first aspect or any possible implementation of the first aspect.
A third aspect of this application provides a computer-readable storage medium that stores a program, and the program causes a terminal device to perform the method in the above first aspect or any possible implementation of the first aspect.
A fourth aspect of this application provides a computer program product storing one or more computer-executable instructions. When the computer-executable instructions are executed by a processor, the processor performs the method in the above first aspect or any possible implementation of the first aspect.
A fifth aspect of this application provides a chip. The chip includes a processor and a communication interface, the processor is coupled to the communication interface, and the processor is configured to read instructions to perform the method in the above first aspect or any possible implementation of the first aspect.
A sixth aspect of this application provides a network system. The network system includes a network device, and the network device can perform the method described in the above first aspect or any possible implementation of the first aspect.
Description of the Drawings
Figure 1 is a schematic structural diagram of a data computing system provided by this application;
Figure 2 is a schematic flow chart of a data calculation method provided by this application;
Figure 3 is a schematic diagram of an embodiment of a data calculation method provided by this application;
Figure 4 is a schematic diagram of another embodiment of a data calculation method provided by this application;
Figure 5 is a schematic diagram of another embodiment of a data calculation method provided by this application;
Figure 6 is a schematic diagram of another embodiment of a data calculation method provided by this application;
Figure 7 is a schematic diagram of another embodiment of a data calculation method provided by this application;
Figure 8 is a schematic diagram of another embodiment of a data calculation method provided by this application;
Figure 9 is a schematic diagram of another embodiment of a data calculation method provided by this application;
Figure 10 is a schematic structural diagram of a network device provided by this application;
Figure 11 is a schematic structural diagram of another network device provided by this application.
Detailed Description
The examples provided in this application are described below with reference to the accompanying drawings. Obviously, the described examples are only some of the examples of this application, not all of them. Persons of ordinary skill in the art will appreciate that, with the development of technology and the emergence of new scenarios, the technical solutions provided in this application are also applicable to similar technical problems.
The terms "first", "second", and the like in the specification, the claims, and the above drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data used in this way are interchangeable under appropriate circumstances, so that the examples described herein can be implemented in an order other than that illustrated or described herein. In addition, the terms "including" and "having" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that includes a series of steps or units is not necessarily limited to those steps or units that are explicitly listed, but may include other steps or units that are not explicitly listed or that are inherent to the process, method, product, or device.
"And/or" in this application merely describes an association relationship between associated objects, indicating that three relationships may exist. For example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone, where A and B may be singular or plural. In addition, in the description of this application, unless otherwise specified, "multiple" means two or more than two. "At least one of the following" or a similar expression refers to any combination of these items, including any combination of a single item or multiple items. For example, at least one of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b, and c may be single or multiple.
In the current AI field, there are many types of models, and the models are becoming increasingly complex. According to the lottery ticket hypothesis, a sparse model has the same or even better learning capability than a complex model. Moreover, training a complex model is expensive and time-consuming, while training a sparse model incurs less overhead and takes less time. How to sparsify a complex model while ensuring that the sparse model runs smoothly is the key to accelerating model training.
To sparsify a complex model, the weight matrix in the complex model is usually pruned, and operations are performed on the pruned weight matrix and the data. Currently, the mainstream sparse acceleration technology is the 2:4 fine-grained structured matrix multiplication acceleration technology. The weight matrix in the AI model is pruned with a 2:4 structure, that is, 2 elements are retained out of every 4 elements in the weight matrix, completing the sparsification of the weight matrix. This technology thereby accelerates AI model computation and saves the time required for model training.
However, in traditional model acceleration technology, only a fixed 2:4 sparse structure can be used to sparsify the weight matrix, and the sparsity ratio of the weight matrix cannot be changed; the applicable scenarios are relatively limited and the compatibility is poor.
To solve the problems existing in the above solution, this application provides a data calculation method, a data computing system, and a network device. In this application, the network device obtains a first matrix and a second matrix, where the first matrix is the pruned weight matrix in the target model and the second matrix is the data input to the target model. The network device receives a configuration instruction, and the configuration instruction is used to set the sparsity ratio. The network device compresses the second matrix according to the sparsity ratio to obtain a third matrix, and then calculates the product of the first matrix and the third matrix. The network device can flexibly set the value of the sparsity ratio according to the configuration instruction; therefore, the computing performance of the hardware can be exploited more fully for different models, with a wide range of applicable scenarios and strong compatibility.
The following first introduces the structure of a data computing system provided by this application.
Figure 1 is a schematic structural diagram of a data computing system provided by this application.
In this application, the data computing system includes a computing engine. The computing engine obtains a first matrix and a second matrix from registers, where the first matrix is the pruned weight matrix in the target model. As shown in Figure 1, assume that the size of the first matrix is M×K, that is, the first matrix has M rows and K columns of data. The second matrix is the data input to the target model; assume that the size of the second matrix is WK×N, that is, the second matrix has WK rows and N columns of data, where W is the sparsity ratio. The computing engine can flexibly set the value of the sparsity ratio according to the configuration instruction, and can then compress the second matrix according to the sparsity ratio to obtain a third matrix. The size of the third matrix is K×N, that is, the third matrix has K rows and N columns of data. The size of the matrix obtained after the computing engine calculates the product of the first matrix and the third matrix is M×N, that is, the matrix has M rows and N columns of data.
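The dimension bookkeeping of the Figure 1 walkthrough can be checked with a few lines (an illustrative sketch; the concrete values M = 3, K = 4, N = 5 and writing the 2:4 structure as W = 2 are assumptions, since W here is the ratio of original rows to kept rows).

```python
# Dimension check for the Figure 1 walkthrough (illustrative values):
# first matrix M x K, second matrix (W*K) x N, compressed third matrix
# K x N, and the final product M x N.

M, K, N, W = 3, 4, 5, 2            # W = 2 corresponds to a 2:4 structure

first_shape = (M, K)
second_shape = (W * K, N)
third_shape = (second_shape[0] // W, second_shape[1])   # compression by W
product_shape = (first_shape[0], third_shape[1])

# inner dimensions must agree for the product to be defined
assert first_shape[1] == third_shape[0]
print(second_shape, third_shape, product_shape)
```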
In this application, the data computing system includes a computing engine that can flexibly set the value of the sparsity ratio according to configuration instructions. Therefore, the computing performance of the hardware can be exploited more fully for different models; the method applies to a wide range of scenarios and has strong compatibility.
Based on the data computing system described in Figure 1, the data calculation method provided by this application is introduced below.
Figure 2 is a schematic flowchart of a data calculation method provided by this application.
101. The network device obtains the first matrix and the second matrix.
In this application, the first matrix represents the pruned weight matrix of the target model, and the second matrix represents the data input to the target model.
For example, as shown in Figure 1, assume the size of the first matrix is M×K, that is, the first matrix has M rows and K columns of data. The first matrix may be obtained by pruning a weight matrix whose size before pruning is WM×K. The second matrix is the data input to the target model; assume its size is WK×N, that is, the second matrix has WK rows and N columns of data, where W is the sparsity ratio factor.
102. The network device receives a configuration instruction.
In this application, the configuration instruction is used to set the sparsity ratio, and the sparsity ratio can be changed. The network device can configure the sparsity ratio according to the configuration instruction in several ways, which are described in detail below.
Method 1: The configuration instruction includes the value of the sparsity ratio.
In this application, the configuration instruction may include the value of the sparsity ratio, which can be configured by operation and maintenance personnel as required. Optionally, the computing power of the processing element can be considered when configuring the sparsity ratio. Assuming the processing element (PE) can complete 8×8 data calculations, the sparsity ratio can be chosen, without exceeding the available computing power of the processing element, from options such as 8:16, 4:16, 4:8, or 2:8 according to the complexity of the weight matrix, the training-accuracy requirements of the model, and so on. Optionally, within the limits of the processing element's computing power, the training accuracy required by the unpruned weight matrix can also be considered when configuring the value of the sparsity ratio, and the value can be flexibly adjusted according to that accuracy. Optionally, within the limits of the processing element's computing power, the complexity of the unpruned weight matrix can also be considered, and the value of the sparsity ratio can be flexibly adjusted according to that complexity. Optionally, operation and maintenance personnel can also configure the value of the sparsity ratio according to other factors, which are not limited here.
Method 2: The sparsity ratio is generated according to the configuration instruction.
In this application, the configuration instruction may instead omit the value of the sparsity ratio; in that case, the value is generated by the network device after it receives the configuration instruction. Optionally, after receiving the configuration instruction, the network device can determine the value of the sparsity ratio according to the data volume of the unpruned weight matrix, generating different sparsity-ratio values for unpruned weight matrices with different data volumes. For example, when the network device determines that the data volume of the unpruned weight matrix is greater than a first threshold A, it can set the sparsity ratio to ratio A; when the data volume is less than or equal to the first threshold A, it can set the sparsity ratio to ratio B, where the value of ratio A is smaller than the value of ratio B. With this setting, the sparsity ratio is small when the unpruned weight matrix contains a large amount of data, which can improve the training efficiency of the weight matrix. Optionally, after receiving the configuration instruction, the network device can determine the value of the sparsity ratio according to the training accuracy required by the unpruned weight matrix. For example, the unpruned weight matrix can carry an identifier that describes its required training accuracy, and the network device can generate the corresponding sparsity-ratio value from that identifier, so that different accuracies lead to different sparsity-ratio values. Optionally, the network device can also generate the sparsity-ratio value according to other factors, which are not limited here.
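As a hedged illustration of the threshold rule above, the following sketch picks a sparser ratio (smaller retained fraction) for a large unpruned weight matrix. The concrete threshold and the particular N:M values are assumptions for illustration, not values fixed by this application.

```python
def choose_sparsity_ratio(num_weights, threshold_a=1_000_000):
    """Return an (N, M) sparsity ratio: ratio A (smaller retained fraction)
    when the unpruned weight matrix is large, else ratio B, with A < B."""
    ratio_a = (2, 8)   # retained fraction 2/8
    ratio_b = (4, 8)   # retained fraction 4/8
    return ratio_a if num_weights > threshold_a else ratio_b
```

Under these assumptions, a matrix with four million weights would get the 2:8 ratio, so a large matrix is compressed more aggressively, matching the stated goal of improving training efficiency for large weight matrices.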
103. The network device compresses the second matrix according to the sparsity ratio to obtain the third matrix.
In this application, the computing engine can compress the second matrix according to the sparsity ratio to obtain the third matrix. Assuming the size of the second matrix is WK×N, that is, the second matrix has WK rows and N columns of data, where W is the sparsity ratio factor, the size of the third matrix is K×N, that is, the third matrix has K rows and N columns of data.
104. The network device calculates the product of the first matrix and the third matrix.
In this application, the network device can use the processing element (PE) to calculate the product of the first matrix and the third matrix in order to obtain the convolution result.
In this application, the network device obtains the first matrix and the second matrix, where the first matrix is the pruned weight matrix of the target model and the second matrix is the data input to the target model. The network device compresses the second matrix according to the sparsity ratio to obtain the third matrix, and then calculates the product of the first matrix and the third matrix. Because the network device can flexibly set the value of the sparsity ratio according to the configuration instruction, the computing performance of the hardware can be exploited more fully for different models; the method applies to a wide range of scenarios and has strong compatibility.
In this application, step 103 above, in which the network device compresses the second matrix according to the sparsity ratio to obtain the third matrix, has specific implementations, which are described in detail below with examples.
In this application, the process in which the network device compresses the second matrix according to the sparsity ratio to obtain the third matrix includes a process in which the network device compresses a first column vector according to the sparsity ratio to obtain a second column vector, where the first column vector belongs to the second matrix and the second column vector belongs to the third matrix. In addition, the process in which the network device calculates the product of the first matrix and the third matrix also includes a process in which the network device calculates the product of a first row vector and the second column vector, where the first row vector belongs to the first matrix.
Figure 3 is another schematic flowchart of a data calculation method provided by this application.
For example, referring to Figure 3, in this application the network device can dynamically configure the sparsity ratio according to the computing power of the computing element and the complexity of the unpruned weight matrix, so that a fine-grained structured sparse acceleration process is achieved when the model is configured with an N:M sparsity ratio. The bitmap (non-zero data index) of matrix A (the first matrix) is input to the Prefix Sum module to calculate, for each row of PEs, the local shift distances (the first shift number) of the selected data in the column vectors of matrix B (the second matrix). The calculation result and the sparsity ratio are then input to the non-zero data compressor, which completes the data compression of the column vectors of matrix B on a PE through local non-zero data compression and global data compression. By adding Prefix Sum and Non-Zero Shifter units to each PE module, the sparse matrix accelerator can parse and process the bitmap of matrix A and, based on the results, shift the non-zero data of the column vectors of matrix B, thereby accelerating the matrix operation.
The process by which the network device compresses the first column vector according to the sparsity ratio to obtain the second column vector is described in detail below.
In this application, first, the network device can calculate the first shift number according to the non-zero data index. The non-zero data index represents the distribution of the data of the first matrix within the weight matrix before pruning. The first shift number represents the number of shift steps of the selected data, and the selected data is the data in the first column vector that corresponds to the non-zero data index.
For example, as shown in Figure 3, the network device can introduce Prefix Sum and Non-Zero Shifter modules into the matrix computing engine so that a fine-grained structured sparse acceleration process is achieved when the model is configured with an N:M sparsity ratio. In this application, the specific workflow of the Prefix Sum module is shown in Figure 3: each bit of the bitmap (non-zero data index) is inverted, and the inverted bits preceding each position are accumulated. This yields the number of zero-valued elements before that position, which is used as the distance by which that element is subsequently moved (the first shift number).
Figure 4 is another schematic flowchart of a calculation method provided by this application.
For example, the data in the non-zero data index in Figure 4 are 1, 0, 0, 1, 1, 0, 1, 0. After inverting these bits, the result is 0, 1, 1, 0, 0, 1, 0, 1. Accumulating these values yields 0, 0, 1, 2, 2, 2, 3, 3, which means that the shift distances for the first through eighth positions of the first column vector are 0, 0, 1, 2, 2, 2, 3, 3, respectively.
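The Prefix Sum step above can be sketched as follows. This is an illustrative software model of the hardware module, using the example bitmap from Figure 4; counting the zero bits preceding each position is equivalent to accumulating the inverted bits.

```python
def prefix_sum_shifts(bitmap):
    """For each position, count the zero bits that precede it: this is the
    distance (first shift number) by which that element later moves left."""
    shifts, zeros_seen = [], 0
    for bit in bitmap:
        shifts.append(zeros_seen)
        zeros_seen += 1 - bit   # accumulate the inverted bit
    return shifts

print(prefix_sum_shifts([1, 0, 0, 1, 1, 0, 1, 0]))  # [0, 0, 1, 2, 2, 2, 3, 3]
```

The result matches the worked example: a selected element at position six, for instance, has three zeros before it and therefore moves three steps toward the front.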
In this application, after the network device obtains the first shift number, it can compress the first column vector into the second column vector. The specific compression process is described in detail in the following examples.
In this application, the non-zero data movement control signal obtained from the Prefix Sum module is sent to the Local Non-Zero Shifter unit. The network device first selects the selected data from the first column vector according to the non-zero data index, and then shifts the selected data according to the first shift number to obtain the second column vector; that is, each selected data unit is moved to its proper position according to the first shift number.
Figure 5 is another schematic flowchart of a data calculation method provided by this application.
For example, referring to Figure 5, the Non-Zero Shifter unit includes at least a Local Non-Zero Shifter unit. The Non-Zero Shifter can complete the data compression of a column vector of matrix B on a PE according to the non-zero local shift distances (the first shift number) calculated by the Prefix Sum module. As shown in Figure 5, if the sparsity ratio is 4:8, then, as indicated by the non-zero data index, the selected data are the 1 in the first position, the 1 in the fourth position, the 1 in the fifth position, and the 1 in the seventh position, and the selected data in every 8 elements are moved to the first 4 positions of the queue. Similarly, as shown in Figure 6, if the sparsity ratio is 2:8, the selected data in every 8 elements are moved to the first 2 positions of the queue. As shown in Figure 7, if the sparsity ratio is 1:8, the selected data in every 8 elements are moved to the first position of the queue.
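A software model of the Local Non-Zero Shifter behavior just described might look like the following. It treats shifting each selected element left by its first shift number as equivalent to compacting the selected elements to the front of each m-element queue; zero-padding the tail of the queue is an assumption for illustration.

```python
def local_shift(queue, bitmap, n, m=8):
    """Move the n selected (bitmap == 1) elements of an m-element queue to
    its first n positions, as for an n:m sparsity ratio (4:8, 2:8, 1:8, ...)."""
    assert len(queue) == len(bitmap) == m
    selected = [v for v, bit in zip(queue, bitmap) if bit == 1]
    assert len(selected) == n
    return selected + [0] * (m - n)    # tail padding is illustrative

# 4:8 case as in Figure 5: positions 1, 4, 5 and 7 are selected
print(local_shift([11, 0, 0, 14, 15, 0, 17, 0], [1, 0, 0, 1, 1, 0, 1, 0], 4))
```

Each selected value ends up exactly first-shift-number positions to the left of where it started, which is what the hardware shifter achieves with the Prefix Sum control signals.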
In this application, after the network device obtains the first shift number, there are other ways to compress the first column vector into the second column vector. The specific compression process is described in detail in the following example.
In this application, the non-zero data movement control signal obtained from the Prefix Sum module is sent to the Local Non-Zero Shifter unit. The network device selects multiple groups of selected data from the first column vector according to the non-zero data index, where each group includes one or more selected data items. The network device shifts the multiple groups of selected data according to the first shift number to obtain multiple groups of selected vectors, and then compresses the multiple groups of selected vectors according to the sparsity ratio to obtain the second column vector.
Figure 8 is another schematic flowchart of a data calculation method provided by this application.
Referring to Figure 8, the Non-Zero Shifter unit can be divided into two submodules, a Local Non-Zero Shifter and a Global Non-Zero Shifter. The Non-Zero Shifter uses the non-zero local shift distances (the first shift number) calculated by the Prefix Sum module together with the sparsity ratio, and completes the data compression of the column vectors of matrix B on a PE through the two modules of local non-zero data compression and global data compression. The Local Non-Zero Shifter performs the local compression, implemented in a way similar to that described in the example above; the Global Non-Zero Shifter takes the result of the local compression as input, performs global data compression, and delivers the compressed data to the PE.
Figure 9 is another schematic flowchart of a data calculation method provided by this application.
Referring to Figure 9, in this application the global data compression module has a specific implementation for compressing the second matrix, which is described in detail in the following example.
In this application, the non-zero data movement control signal obtained from the Prefix Sum module is sent to the Local Non-Zero Shifter unit, which moves the non-zero data of each data unit to its proper position. If the sparsity ratio is 4:8, the non-zero data in every 8 elements are moved to the first 4 positions of the queue; if the sparsity ratio is 2:8, the non-zero data in every 8 elements are moved to the first 2 positions of the queue; if the sparsity ratio is 1:8, the non-zero data in every 8 elements are moved to the first position of the queue. Then, according to the queue data moved by the Local Non-Zero Shifter and the sparsity-ratio parameter W, the non-zero data of multiple queues are assembled into 16 data items that are input to the subsequent calculation. If the sparsity ratio is 4:8, the first 4 non-zero data items of each of 4 queues are assembled into a 16-element non-zero data vector; if the sparsity ratio is 2:8, the first 2 non-zero data items of each of 8 queues are assembled into a 16-element non-zero data vector; if the sparsity ratio is 1:8, the first non-zero data item of each of 16 queues is assembled into a 16-element non-zero data vector.
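The global compression step described above can be modeled as follows: a 16-element PE input vector is assembled from locally compacted queues. The 16-wide PE input comes from the text; representing the queues as plain lists is an illustrative assumption.

```python
def global_shift(local_queues, n, width=16):
    """Concatenate the first n (non-zero) entries of width // n locally
    compacted queues into one width-element vector for the PE."""
    assert width % n == 0
    needed = width // n            # 4:8 -> 4 queues, 2:8 -> 8, 1:8 -> 16
    vec = []
    for q in local_queues[:needed]:
        vec.extend(q[:n])          # take the compacted front of each queue
    assert len(vec) == width
    return vec
```

For a 4:8 ratio, four queues each contribute their first four entries; for 2:8, eight queues contribute two entries each; for 1:8, sixteen queues contribute one entry each, always filling the 16-wide PE input.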
In this application, the network device obtains the first matrix and the second matrix, where the first matrix is the pruned weight matrix of the target model and the second matrix is the data input to the target model. The network device receives a configuration instruction, where the configuration instruction is used to set the sparsity ratio. The network device compresses the second matrix according to the sparsity ratio to obtain the third matrix, and then calculates the product of the first matrix and the third matrix. Because the network device can flexibly set the value of the sparsity ratio according to the configuration instruction, the computing performance of the hardware can be exploited more fully for different models; the method applies to a wide range of scenarios and has strong compatibility.
The above examples provide different implementations of a data calculation method. A network device 20 is provided below, as shown in Figure 10. The network device 20 is configured to perform the data calculation method involved in the above examples. For the execution steps and the corresponding beneficial effects, refer to the corresponding examples above; they are not repeated here. The network device 20 includes:
a processing unit 201, configured to obtain a first matrix and a second matrix, where the first matrix represents the pruned weight matrix of the target model and the second matrix represents the data input to the target model; and
a receiving unit 202, configured to receive a configuration instruction, where the configuration instruction is used to set the sparsity ratio.
The processing unit 201 is configured to:
compress the second matrix according to the sparsity ratio to obtain a third matrix; and
calculate the product of the first matrix and the third matrix.
In one possible implementation,
the processing unit 201 is configured to:
compress a first column vector according to the sparsity ratio to obtain a second column vector, where the first column vector belongs to the second matrix and the second column vector belongs to the third matrix; and
calculate the product of a first row vector and the second column vector, where the first row vector belongs to the first matrix.
In one possible implementation,
the processing unit 201 is further configured to calculate a first shift number according to a non-zero data index, where the non-zero data index represents the distribution of the data of the first matrix within the weight matrix before pruning, the first shift number represents the number of shift steps of the selected data, and the selected data is the data in the first column vector that corresponds to the non-zero data index; and
the processing unit 201 is configured to compress the first column vector according to the first shift number to obtain the second column vector.
In one possible implementation,
the processing unit 201 is configured to:
select the selected data from the first column vector according to the non-zero data index; and
shift the selected data according to the first shift number to obtain the second column vector.
In one possible implementation,
the processing unit 201 is configured to:
select multiple groups of selected data from the first column vector according to the non-zero data index, where each group of selected data includes one or more selected data items;
shift the multiple groups of selected data according to the first shift number to obtain multiple groups of selected vectors; and
compress the multiple groups of selected vectors according to the sparsity ratio to obtain the second column vector.
In one possible implementation,
the sparsity ratio is determined by the computing power of the processing element in the network device and/or the weight matrix before pruning.
It should be noted that the information exchange and execution processes between the modules of the above network device 20 are based on the same concept as the method examples of this application, and the execution steps are consistent with the details of the above method steps; refer to the descriptions in the above method examples.
The above examples provide different implementations of the network device 20. A network device 30 is provided below, as shown in Figure 11. The network device 30 is configured to perform the data calculation method in the above examples. For the execution steps and the corresponding beneficial effects, refer to the corresponding examples above; they are not repeated here.
Referring to Figure 11, a schematic structural diagram of a network device is provided for this application. The network device 30 includes a processor 302, a communication interface 303, and a memory 301, and may optionally include a bus 304. The communication interface 303, the processor 302, and the memory 301 can be connected to each other through the bus 304. The bus 304 may be a Peripheral Component Interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like, and can be divided into an address bus, a data bus, a control bus, and so on. For ease of presentation, only one thick line is used in Figure 11, but this does not mean that there is only one bus or one type of bus. The network device 30 can implement the functions of any network device in the examples described above, and the processor 302 and the communication interface 303 can perform the operations corresponding to the network device in the above method examples.
Each component of the network device is introduced in detail below with reference to Figure 11:
The memory 301 may be a volatile memory, such as a random-access memory (RAM); a non-volatile memory, such as a read-only memory (ROM), a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); or a combination of the above types of memory, and is used to store program code, configuration files, or other content that can implement the method of this application.
The processor 302 is the control center of the controller and may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the examples provided by this application, for example, one or more digital signal processors (DSP) or one or more field-programmable gate arrays (FPGA).
The communication interface 303 is used to communicate with other network devices.
The processor 302 can perform the operations performed by the network device in the example shown in Figure 10 above, which are not described again here.
It should be noted that the information exchange and execution processes between the modules of the above network device 30 are based on the same concept as the method examples of this application, and the execution steps are consistent with the details of the above method steps; refer to the descriptions in the above method examples.
This application provides a chip. The chip includes a processor and a communication interface, the processor is coupled to the communication interface, and the processor is configured to read instructions to perform the operations performed by the network device in the embodiments described above in Figures 1 to 11.
This application provides a network system, which includes the network device described in the embodiment of Figure 10 above.
Those skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the systems, devices, and units described above, reference may be made to the corresponding processes in the foregoing examples; details are not repeated here.
In the several examples provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other ways. For example, the device examples described above are merely illustrative. The division into units is merely a division by logical function; in actual implementation there may be other divisions. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of this example.
In addition, the functional units in the examples of this application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the examples of this application. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The specific implementations described above further describe the objectives, technical solutions, and beneficial effects of the present invention in detail. It should be understood that different examples can be combined; the above descriptions are merely specific implementations of the present invention and are not intended to limit its protection scope. Any combination, modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present invention shall fall within its protection scope. The above examples are merely intended to describe the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing examples, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing examples or make equivalent replacements of some of their technical features, and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the examples of this application.

Claims (15)

  1. A data computing method, characterized by comprising:
    obtaining, by a network device, a first matrix and a second matrix, wherein the first matrix represents a pruned weight matrix in a target model, and the second matrix represents data input to the target model;
    receiving, by the network device, a configuration instruction, wherein the configuration instruction is used to set a sparsity ratio;
    compressing, by the network device, the second matrix according to the sparsity ratio to obtain a third matrix; and
    computing, by the network device, a product of the first matrix and the third matrix.
  2. The data computing method according to claim 1, characterized in that the compressing, by the network device, the second matrix according to the sparsity ratio to obtain the third matrix comprises:
    compressing, by the network device, a first column vector according to the sparsity ratio to obtain a second column vector, wherein the first column vector belongs to the second matrix and the second column vector belongs to the third matrix; and
    the computing, by the network device, the product of the first matrix and the third matrix comprises:
    computing, by the network device, a product of a first row vector and the second column vector, wherein the first row vector belongs to the first matrix.
  3. The data computing method according to claim 2, characterized in that the method further comprises:
    computing, by the network device, a first shift count according to a non-zero data index, wherein the non-zero data index indicates the distribution of the data of the first matrix within the weight matrix before pruning, the first shift count indicates the number of shift steps of selected data, and the selected data is the data in the first column vector that corresponds to the non-zero data index; and
    the compressing, by the network device, the first column vector according to the non-zero data index to obtain the second column vector comprises:
    compressing, by the network device, the first column vector according to the first shift count to obtain the second column vector.
  4. The data computing method according to claim 3, characterized in that the compressing, by the network device, the first column vector according to the first shift count to obtain the second column vector comprises:
    selecting, by the network device, the selected data from the first column vector according to the non-zero data index; and
    shifting, by the network device, the selected data according to the first shift count to obtain the second column vector.
  5. The data computing method according to claim 3, characterized in that the method further comprises:
    selecting, by the network device, multiple groups of selected data from the first column vector according to the non-zero data index, wherein each group of selected data includes one or more items of selected data;
    shifting, by the network device, the multiple groups of selected data according to the first shift count to obtain multiple groups of selected vectors; and
    compressing, by the network device, the multiple groups of selected vectors according to the sparsity ratio to obtain the second column vector.
  6. The data computing method according to any one of claims 1 to 5, characterized in that the sparsity ratio is determined by the computing power of a processing component in the network device and by the weight matrix before pruning.
  7. A network device, characterized by comprising:
    a processing unit, configured to obtain a first matrix and a second matrix, wherein the first matrix represents a pruned weight matrix in a target model, and the second matrix represents data input to the target model; and
    a receiving unit, configured to receive a configuration instruction, wherein the configuration instruction is used to set a sparsity ratio;
    wherein the processing unit is further configured to:
    compress the second matrix according to the sparsity ratio to obtain a third matrix; and
    compute a product of the first matrix and the third matrix.
  8. The network device according to claim 7, characterized in that the processing unit is configured to:
    compress a first column vector according to the sparsity ratio to obtain a second column vector, wherein the first column vector belongs to the second matrix and the second column vector belongs to the third matrix; and
    compute a product of a first row vector and the second column vector, wherein the first row vector belongs to the first matrix.
  9. The network device according to claim 8, characterized in that:
    the processing unit is further configured to compute a first shift count according to a non-zero data index, wherein the non-zero data index indicates the distribution of the data of the first matrix within the weight matrix before pruning, the first shift count indicates the number of shift steps of selected data, and the selected data is the data in the first column vector that corresponds to the non-zero data index; and
    the processing unit is configured to compress the first column vector according to the first shift count to obtain the second column vector.
  10. The network device according to claim 9, characterized in that the processing unit is configured to:
    select the selected data from the first column vector according to the non-zero data index; and
    shift the selected data according to the first shift count to obtain the second column vector.
  11. The network device according to claim 9, characterized in that the processing unit is configured to:
    select multiple groups of selected data from the first column vector according to the non-zero data index, wherein each group of selected data includes one or more items of selected data;
    shift the multiple groups of selected data according to the first shift count to obtain multiple groups of selected vectors; and
    compress the multiple groups of selected vectors according to the sparsity ratio to obtain the second column vector.
  12. The network device according to any one of claims 7 to 11, characterized in that the sparsity ratio is determined by the computing power of a processing component in the network device and/or by the weight matrix before pruning.
  13. A network device, characterized by comprising:
    a processor and a memory;
    wherein the processor is configured to execute instructions stored in the memory, so that the method according to any one of claims 1 to 6 is performed.
  14. A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed on a computer or a processor, the method according to any one of claims 1 to 6 is performed.
  15. A computer program product, characterized in that, when it is run on a computer or a processor, the method according to any one of claims 1 to 6 is performed.
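The core computation of claims 1 to 4 (a configurable sparsity ratio, a non-zero data index describing the pruned weight layout, shift counts that pack the selected input data, and the final row-by-column product) can be sketched as follows. This is a hypothetical illustration only: the function names, the index encoding, and the 2:4 example values are assumptions for exposition, not taken from the application.

```python
def shift_counts(nz_index):
    """Claim 3: for each selected position, the number of steps it must
    move left so that the surviving entries end up contiguous."""
    return [pos - rank for rank, pos in enumerate(sorted(nz_index))]

def compress_column(col, nz_index):
    """Claims 3-4: select the entries of one input column that line up
    with the non-zero weight positions, then shift each entry left by
    its computed step count to pack the second column vector."""
    shifts = shift_counts(nz_index)
    packed = [0] * len(nz_index)
    for rank, pos in enumerate(sorted(nz_index)):
        packed[pos - shifts[rank]] = col[pos]  # pos - shift == rank
    return packed

def sparse_dot(pruned_row, nz_index, col):
    """Claims 1-2: product of one pruned weight row with one input
    column, compressed to match the sparsity pattern."""
    compressed = compress_column(col, nz_index)
    return sum(w * x for w, x in zip(pruned_row, compressed))

# 2:4 sparsity example: weights [5, 0, 0, 7] prune to [5, 7];
# the non-zero data index records the surviving slots {0, 3}.
nz_index = [0, 3]
pruned_row = [5, 7]
col = [1, 2, 3, 4]                             # input column before compression
print(shift_counts(nz_index))                  # [0, 2]
print(compress_column(col, nz_index))          # [1, 4]
print(sparse_dot(pruned_row, nz_index, col))   # 5*1 + 7*4 = 33
```

Because only the surviving weight positions participate in the multiply, the amount of arithmetic scales with the configured sparsity ratio rather than with the dense matrix size, which matches the stated goal of setting that ratio flexibly per model.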
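The per-group variant of claim 5 (the column is cut into fixed-size groups, survivors are selected and shifted within each group, and the per-group vectors are then compressed together according to the sparsity ratio) could look like the sketch below. The names, the tuple encoding of per-group indices, and the example values are illustrative assumptions.

```python
def compress_by_groups(col, group_indices, group_size):
    """Select and pack survivors group by group (claim 5), then
    concatenate the per-group selected vectors into the final
    second column vector."""
    selected_vectors = []
    for g, idx in enumerate(group_indices):
        base = g * group_size
        # shifting the survivors to the front of their group
        selected_vectors.append([col[base + i] for i in sorted(idx)])
    compressed = []
    for vec in selected_vectors:       # final compression per the ratio
        compressed.extend(vec)
    return compressed

col = [1, 2, 3, 4, 5, 6, 7, 8]
# a 2:4 pattern: slots {0, 3} survive in group 0, slots {1, 2} in group 1
print(compress_by_groups(col, [(0, 3), (1, 2)], 4))  # [1, 4, 6, 7]
```

Grouping keeps each shift short (at most group_size - 1 steps), which is why a hardware implementation might prefer per-group packing over shifting across the whole column.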
PCT/CN2023/101090 2022-06-27 2023-06-19 Data computing method and related device WO2024001841A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210736691.6 2022-06-27
CN202210736691.6A CN117332197A (en) 2022-06-27 2022-06-27 Data calculation method and related equipment

Publications (1)

Publication Number Publication Date
WO2024001841A1 true WO2024001841A1 (en) 2024-01-04

Family

ID=89290739

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/101090 WO2024001841A1 (en) 2022-06-27 2023-06-19 Data computing method and related device

Country Status (2)

Country Link
CN (1) CN117332197A (en)
WO (1) WO2024001841A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107944555A (en) * 2017-12-07 2018-04-20 广州华多网络科技有限公司 Method, storage device and the terminal that neutral net is compressed and accelerated
CN112732222A (en) * 2021-01-08 2021-04-30 苏州浪潮智能科技有限公司 Sparse matrix accelerated calculation method, device, equipment and medium
US20210150362A1 (en) * 2019-11-15 2021-05-20 Microsoft Technology Licensing, Llc Neural network compression based on bank-balanced sparsity
CN113762493A (en) * 2020-06-01 2021-12-07 阿里巴巴集团控股有限公司 Neural network model compression method and device, acceleration unit and computing system
CN114341825A (en) * 2019-08-29 2022-04-12 阿里巴巴集团控股有限公司 Method and system for providing vector sparsification in neural networks


Also Published As

Publication number Publication date
CN117332197A (en) 2024-01-02

Similar Documents

Publication Publication Date Title
US11709672B2 (en) Computing device and method
US11630666B2 (en) Computing device and method
US11106598B2 (en) Computing device and method
CN111062472B (en) Sparse neural network accelerator based on structured pruning and acceleration method thereof
EP3651070B1 (en) Computation device and method
CN110489428B (en) Multi-dimensional sparse matrix compression method, decompression method, device, equipment and medium
WO2022041188A1 (en) Accelerator for neural network, acceleration method and device, and computer storage medium
CN107943756B (en) Calculation method and related product
CN107957977B (en) Calculation method and related product
CN110515587B (en) Multiplier, data processing method, chip and electronic equipment
CN108108189B (en) Calculation method and related product
WO2024012180A1 (en) Matrix calculation method and device
WO2024001841A1 (en) Data computing method and related device
JP2024028901A (en) Sparse matrix multiplication in hardware
WO2023065701A1 (en) Inner product processing component, arbitrary-precision computing device and method, and readable storage medium
CN108021393B (en) Calculation method and related product
CN110688087B (en) Data processor, method, chip and electronic equipment
CN113033788B (en) Data processor, method, device and chip
US20220318604A1 (en) Sparse machine learning acceleration
CN114282158A (en) Matrix calculation circuit, matrix calculation method, electronic device, and computer-readable storage medium
CN113836481A (en) Matrix calculation circuit, matrix calculation method, electronic device, and computer-readable storage medium
CN114416020A (en) Rapid sequencing method and device based on FPGA
CN117457042A (en) Crossbar architecture based on three-port nonvolatile device, working method and lossless compression method
CN114048839A (en) Acceleration method and device for convolution calculation and terminal equipment
CN113961871A (en) Matrix calculation circuit, matrix calculation method, electronic device, and computer-readable storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23830021

Country of ref document: EP

Kind code of ref document: A1