CN117978654A - Method and system for deploying INT8 quantized neural network on programmable switch

Info

Publication number: CN117978654A
Authority: CN (China)
Legal status: Pending
Application number: CN202410073576.4A
Original language: Chinese (zh)
Inventors: 郑嘉琦, 殷天润, 潘训涛, 陈贵海
Assignee (original and current): Nanjing University
Priority and filing date: 2024-01-18
Publication date: 2024-05-03
Application filed by Nanjing University

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention provides a method and system for deploying an INT8 quantized neural network on a programmable switch, relating to the fields of networking and artificial intelligence. The method comprises the following steps: preprocessing a neural network pre-training model into an INT8 quantized neural network model; configuring the parser of a programmable switch to parse the input parameters of the INT8 quantized neural network model; configuring the data plane of the programmable switch with matrix multiply-add computation and shift-and-truncate computation according to the structure of the INT8 quantized neural network model; issuing the specific parameters of the INT8 quantized neural network model from the control plane of the programmable switch; the programmable switch then parses the header of each incoming data packet, writes the result of the neural network computation into it, and transmits the packet onward. The method is compatible with the currently mature INT8 quantization technology, deploys the neural network structure on network devices of a protocol-independent architecture, and can compute neural network results in high-throughput scenarios.

Description

Method and system for deploying INT8 quantized neural network on programmable switch
Technical Field
The invention relates to the fields of networking and artificial intelligence, and in particular to a method and system for deploying an INT8 quantized neural network on a programmable switch.
Background
Neural networks are widely applied to classification problems and are well suited to network problems such as traffic classification and security detection. Data-driven methods can train targeted classification models on packet-header information. Previous work often deployed such models on servers, using a CPU or GPU for inference, which incurs excessive load and cost when handling high-throughput traffic.
Neural networks typically use 32-bit precision data to represent weights, biases, and other parameters during training. Deploying neural networks on platforms with limited computational power requires compression through model quantization. Model quantization methods include binary quantization, ternary quantization, and so on; the most widely applied method at present is INT8 quantization, which represents weights, biases, and similar information as 8-bit integers with little impact on inference accuracy. INT8 quantization scales the weights into the range -128 to 127 according to the model weights and calibration data known in advance. If quantization uses the power-of-two method, the scaling operation can be converted into shift and truncate operations, which better suits compute-constrained scenarios.
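As a plain-Python illustration of the power-of-two scheme just described (the function name and example weights are ours, for illustration only):

```python
import numpy as np

def power_of_two_quantize(weights: np.ndarray):
    """Quantize float weights to INT8 with a power-of-two scale, so that
    dequantization becomes a bit shift rather than a float multiply."""
    max_abs = np.abs(weights).max()
    exponent = int(np.ceil(np.log2(max_abs / 127.0)))  # may be negative
    scale = 2.0 ** exponent
    q = np.clip(np.round(weights / scale), -128, 127).astype(np.int8)
    return q, exponent        # dequantize: q * 2**exponent, i.e. a shift

q, e = power_of_two_quantize(np.array([0.52, -1.30, 0.08, 0.99]))
print(q, e)   # [ 33 -83  5  63], -6: weights scaled by 2**6 into [-128, 127]
```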
The advent of protocol-independent architecture (P4) network devices such as programmable switches has brought solutions for high-throughput traffic handling. They can process packet-header data at line rate and offer lower latency than a CPU or GPU. Deploying machine learning on a programmable switch or smart NIC is one solution for classifying network traffic; a common implementation trains a model on a server from existing data and then deploys and updates the model on the programmable switch or smart NIC. However, the hardware resources of protocol-independent devices are limited: the pipeline architecture has relatively stable processing speed but restricted usage of match-action tables and registers, while the run-to-completion (RTC) architecture is relatively rich in computing resources but overly complex computation degrades packet processing speed. Previous work has explored tree-structured machine-learning models such as decision trees, as well as SVMs, and has implemented binary neural networks, but there is still no feasible deployment scheme on pipeline-architecture programmable switches for the more commonly used and more accurate INT8 quantized neural network model.
Disclosure of Invention
The invention aims to address the problems that server-deployed neural network models cannot meet high-throughput demands and that existing on-switch work suffers large accuracy loss. It provides a deployment scheme for a higher-accuracy INT8 quantized neural network model on P4-architecture devices such as programmable switches, so that the system can perform neural network inference on the information carried in network packet headers at high throughput, thereby solving the problems in the prior art.
In a first aspect, a method for deploying an INT8 quantized neural network on a programmable switch is provided, comprising the following steps:
S1, preprocessing a neural network pre-training model into an INT8 quantized neural network model, and recording the model parameters of the INT8 quantized neural network model and the scaling coefficients of its quantization layer; the user trains the INT8 quantized neural network model on custom data packets for the target application;
S2, configuring the parser of the programmable switch to parse the header of the custom data packet and obtain the input of the INT8 quantized neural network model;
S3, configuring the data plane of the programmable switch with matrix multiply-add computation and shift-and-truncate computation according to the structure of the INT8 quantized neural network model;
S4, issuing the model parameters of the INT8 quantized neural network model and the scaling coefficients of the quantization layer from the control plane of the programmable switch;
and S5, after deployment of the programmable switch is complete, the user sends custom data packets into the programmable switch; the programmable switch performs model inference and stores the result in the packet header for subsequent use by the user.
In a further embodiment of the first aspect, preprocessing the neural network pre-training model into the INT8 quantized model requires the following (a sketch of this workflow follows below):
The neural network pre-training model (containing the model parameters, here the contents of the matrices B and C introduced later, as both belong to linear layers) is converted to the onnx format.
The data set used to train the neural network is reused for quantization calibration.
The pre-training model and data set are imported into the ppq quantization framework, which, given configuration parameters (the input structure of the neural network, the quantization method, here the integer power-of-two option, etc.), calibrates and quantizes the imported pre-training model against the imported data set.
The INT8 quantization process modifies the parameters in the pre-training model to the INT8 type and adds quantization and dequantization layers (whose parameters are the scaling coefficients mentioned later); after this step the model contains INT8 model parameters in addition to the scaling coefficients.
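The workflow above can be sketched as follows. This is a minimal sketch: the toy model, input shape, and calibration data are placeholders, and the ppq entry point and argument names (quantize_onnx_model, calib_dataloader, platform, TargetPlatform.PPL_CUDA_INT8) follow ppq's published API but should be treated as assumptions to check against the installed version.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for the pre-trained model: linear layers only, matching the
# patent's note that the quantized parameters belong to linear layers.
model = torch.nn.Sequential(
    torch.nn.Linear(16, 8), torch.nn.ReLU(), torch.nn.Linear(8, 2))

# Step 1: convert the pre-training model to the onnx format.
torch.onnx.export(model, torch.randn(1, 16), "model.onnx")

# Step 2: reuse (a slice of) the training data as the calibration set.
calib_loader = DataLoader(TensorDataset(torch.randn(256, 16)), batch_size=32)

# Step 3: calibrate and quantize with ppq. The entry point and arguments
# below follow ppq's published API but are assumptions here; the
# power-of-two behaviour is selected via ppq's quantization setting.
from ppq import TargetPlatform
from ppq.api import quantize_onnx_model

quantized = quantize_onnx_model(
    onnx_import_file="model.onnx",
    calib_dataloader=calib_loader,
    calib_steps=8,
    input_shape=[1, 16],
    platform=TargetPlatform.PPL_CUDA_INT8,  # assumed INT8 target platform
    collate_fn=lambda batch: batch[0],      # TensorDataset yields 1-tuples
    device="cpu",
)
```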
In a further embodiment of the first aspect, during quantization the parameters in the neural network pre-training model are modified to the INT8 type, and quantization and dequantization layers are added, the quantization layers containing the scaling coefficients; the INT8 quantized neural network model retains the model parameters.
In a further embodiment of the first aspect, the model parameters and scaling coefficients of the INT8 quantized neural network model obtained in step S1 are stored in a plurality of matrices.
In a further embodiment of the first aspect, step S2 entails determining the format of the packet header for subsequent computation, in one of two modes (sketched below):
mode one, the INT8 values are stored directly in the data packet header;
mode two, the protocol and port numbers in the data packet header are parsed and used as the input of the subsequent INT8 quantized neural network model.
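A minimal Python sketch of the two parsing modes; the patent does not fix a feature encoding for mode two, so the byte-level layout below is an assumption for illustration:

```python
import struct

def parse_inputs_mode1(header: bytes, n_features: int):
    """Mode one: the packet header carries the INT8 feature vector directly."""
    return list(struct.unpack(f"{n_features}b", header[:n_features]))

def parse_inputs_mode2(proto: int, sport: int, dport: int):
    """Mode two: derive features from the protocol and port numbers.

    Illustrative only: the patent does not fix an encoding, so here each
    16-bit port is split into two bytes, reinterpreted as signed INT8.
    """
    def to_int8(b):
        return b - 256 if b > 127 else b
    raw = [proto, sport >> 8, sport & 0xFF, dport >> 8, dport & 0xFF]
    return [to_int8(b & 0xFF) for b in raw]

print(parse_inputs_mode1(b"\x05\xfe\x7f", 3))  # [5, -2, 127]
print(parse_inputs_mode2(6, 443, 51324))       # e.g. a TCP/HTTPS flow
```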
In a further embodiment of the first aspect, in the matrix multiply-add computation of step S3, the multiplication operates on INT8 values and the accumulated result is stored as INT32;
multiplication is performed by table look-up, and 32-bit metadata variables store the intermediate multiply-add results.
In a further embodiment of the first aspect, the table look-up is implemented as follows:
using the exact match of the programmable switch, the products of each weight with every possible 8-bit input value (-128 to 127) are stored in a look-up table in advance.
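A sketch of the precomputation behind such a look-up table (plain Python; the dict stands in for the switch's exact-match table):

```python
def build_multiplication_table(weight: int):
    """Precompute the exact-match table for one INT8 weight: each possible
    8-bit input value maps to its product with the weight, so the data
    plane looks the product up instead of multiplying."""
    return {x: x * weight for x in range(-128, 128)}

table = build_multiplication_table(weight=-7)
print(len(table))   # 256 entries, one per 8-bit input value
print(table[96])    # -672, fetched by exact match rather than computed
```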
In a further embodiment of the first aspect, the shift-and-truncate computation in step S3 comprises:
implementation by mask matching, with the shift entries given higher priority than the truncation entries;
issuing entries to the programmable switch according to the specific parameters of the INT8 quantized neural network model, where each multiplication operation issues 256 entries implementing the INT8 multiplication, and each shift-and-truncate operation issues 256 shift entries (sharing a common mask) together with two truncation entries.
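A sketch of the entries this issues, under the assumptions that each entry is a (key, mask, priority, result) tuple and that the highest-priority hit wins, consistent with the ternary matching described in the detailed description below:

```python
def build_shift_truncate_entries(offset: int):
    """Generate the 256 shift entries plus the 2 truncation entries.

    Shift entries match any 32-bit value whose masked bits equal
    v << offset for some INT8 result v; the two lower-priority
    truncation entries catch everything else by sign bit.
    """
    shift_mask = 0xFFFFFFFF & ~((1 << offset) - 1)   # low offset bits ignored
    entries = [((v << offset) & 0xFFFFFFFF, shift_mask, 1, v)
               for v in range(-128, 128)]
    sign_mask = 0x80000000                           # match only the sign bit
    entries.append((0x00000000, sign_mask, 0, 127))  # positive overflow
    entries.append((0x80000000, sign_mask, 0, -128)) # negative overflow
    return entries

def ternary_lookup(value: int, entries):
    value &= 0xFFFFFFFF
    hits = [e for e in entries if (value & e[1]) == (e[0] & e[1])]
    return max(hits, key=lambda e: e[2])[3]          # highest priority wins

entries = build_shift_truncate_entries(offset=5)
print(len(entries))                         # 258 = 256 shift + 2 truncation
print(ternary_lookup(4033, entries))        # 126: a shift entry is hit
print(ternary_lookup(2147483647, entries))  # 127: positive truncation
```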
In a second aspect of the present invention, an INT8 quantized neural network deployment system is provided, the deployment system being adapted for use with a programmable switch, the deployment system comprising:
the preprocessing unit, used to preprocess the neural network pre-training model into an INT8 quantized neural network model and record the model parameters of the INT8 quantized neural network model and the scaling coefficients of its quantization layer, the user training the INT8 quantized neural network model on custom data packets for the target application;
the parsing unit, used to configure the parser of the programmable switch and parse the header of the custom data packet to obtain the input of the INT8 quantized neural network model;
the configuration unit, used to configure the data plane of the programmable switch with matrix multiply-add computation and shift-and-truncate computation according to the structure of the INT8 quantized neural network model;
the issuing unit, used to issue the model parameters of the INT8 quantized neural network model and the scaling coefficients of the quantization layer from the control plane of the programmable switch;
and the output unit, used so that after deployment of the programmable switch is complete, the user sends custom data packets into the programmable switch, the programmable switch performs model inference, and the result is stored in the packet header for subsequent use by the user.
In a third aspect of the present invention, a computer readable storage medium is provided, in which at least one executable instruction is stored, which when executed on an electronic device, causes the electronic device to perform the method of deploying an INT8 quantized neural network on a programmable switch as described in the first aspect.
Beneficial effects: the invention provides the first feasible method for deploying an INT8 quantized neural network on a programmable switch; it is compatible with the currently mature INT8 quantization technology, deploys the neural network structure on network devices of a protocol-independent architecture, and computes neural network results in high-throughput scenarios. The invention makes it possible to classify network packets and detect anomalies at the programmable-switch level, which should inspire subsequent work on deploying anomaly detection on programmable switches. The invention requires only the INT8 quantized neural network on the programmable switch, is compatible with existing network protocols, and functions when deployed on a single switch.
Drawings
Fig. 1 is a schematic diagram of the overall framework of the system of the present invention.
Fig. 2 is a custom header diagram of the present invention.
Fig. 3 is a schematic diagram of the packet processing procedure of the present invention.
Fig. 4 is a schematic diagram of the implementation of the present invention on a programmable switch.
Fig. 5 is a specific implementation of the shifting and truncation of the present invention.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a more thorough understanding of the present invention. It will be apparent, however, to one skilled in the art that the invention may be practiced without one or more of these details. In other instances, well-known features have not been described in detail in order to avoid obscuring the invention.
Example 1
The embodiment discloses a deployment scheme for an INT8 quantized neural network on a programmable switch, comprising the following steps:
step one, preprocessing a neural network pre-training model into an INT8 quantized neural network model;
step two, configuring the parser of the programmable switch to parse the input parameters of the quantized neural network;
step three, configuring multiply and multiply-accumulate computation according to the INT8 quantized neural network model structure on the data plane of the programmable switch, and configuring the shift and truncation operators;
and step four, issuing the specific parameters of the neural network model from the control plane of the programmable switch.
In step one, because of the programmable switch's limitations in implementing multiplication, the integer quantization option (power of 2) must be adopted in the INT8 quantization so that the scaling factor is a power of 2; this step yields the quantized model parameters and scaling factors, which are saved with the training model.
In step two, the format of the packet header must be determined for subsequent computation; there are two modes. Mode one stores the INT8 values directly in the data packet header; mode two parses the protocol, port numbers, and other packet information from the header as the input of the subsequent neural network model.
In step three, two functions must be implemented for the fully connected layers of the INT8 quantized neural network model: matrix multiply-add computation, and shift-and-truncate computation. Function one is the matrix multiply-add. For an input matrix A, weight matrix B, and bias matrix C, the output matrix D satisfies D = A×B + C. In a programmable switch the number of computations per stage is limited and only addition is natively available, so a table look-up replaces multiplication, and 32-bit metadata variables store the intermediate multiply-add results. The table look-up multiplication is realized as follows: using the exact match of the programmable switch, the products of each weight with every 8-bit input value from -128 to 127 are stored in a look-up table in advance. Function two shifts and truncates the 32-bit accumulated result: the value is first shifted, then kept if it lies between -128 and 127, and otherwise truncated to -128 or 127. The invention realizes this with the ternary match of the programmable switch. A ternary match entry consists of a priority, a mask, and a key value, and the shift can be expressed through the mask and key: for a 32-bit number, a right shift by five bits corresponds to a mask whose last five bits are 0 and whose remaining bits are 1. Assigning 127 or -128 according to sign is achieved by masking only the first (sign) bit. The truncation behaviour is obtained by giving the sign-assignment entries lower priority than the shift entries.
In step four, the control plane program parses the quantized model to obtain the required parameters and issues them as table entries: for each multiplication operation, the products of the weight with every value from -128 to 127 are computed and issued; for each shift-and-truncate operation, 256 shift-result entries and two truncation entries are computed, with the shift-result entries given higher priority than the truncation entries.
Example 2
On the basis of the foregoing embodiment 1, embodiment 2 further discloses details of a deployment method of an INT8 quantized neural network on a programmable switch, as follows:
S1, preprocessing a neural network pre-training model into an INT8 quantized neural network model, and recording the model parameters of the INT8 quantized neural network model and the scaling coefficients of its quantization layer; the user trains the INT8 quantized neural network model on custom data packets for the target application;
S2, configuring the parser of the programmable switch to parse the header of the custom data packet and obtain the input of the INT8 quantized neural network model;
S3, configuring the data plane of the programmable switch with matrix multiply-add computation and shift-and-truncate computation according to the structure of the INT8 quantized neural network model;
S4, issuing the model parameters of the INT8 quantized neural network model and the scaling coefficients of the quantization layer from the control plane of the programmable switch;
and S5, after deployment of the programmable switch is complete, the user sends custom data packets into the programmable switch; the programmable switch performs model inference and stores the result in the packet header for subsequent use by the user.
Specifically, preprocessing the neural network pre-training model into the INT8 quantized model requires the following. The neural network pre-training model (containing the model parameters, here the contents of the matrices B and C introduced above, as both belong to linear layers) is converted to the onnx format. The data set used to train the neural network is reused for quantization calibration. The pre-training model and data set are imported into the ppq quantization framework, which, given configuration parameters (the input structure of the neural network, the quantization method, here the integer power-of-two option, etc.), calibrates and quantizes the imported pre-training model against the imported data set. The INT8 quantization process modifies the parameters in the pre-training model to the INT8 type and adds quantization and dequantization layers (whose parameters are the scaling coefficients mentioned above); after this step the model contains INT8 model parameters in addition to the scaling coefficients.
Fig. 1 is a schematic diagram of the overall framework of the system of the present invention. As shown, it is divided into an application layer and a decision layer. The application layer is responsible for training and updating the neural network; the decision layer is responsible for deploying and executing it. Specifically: (1) the application layer trains the neural network on an offline data set and applies INT8 quantization to the trained network using the data set and a quantization tool; this step yields the structural information of the neural network together with the parameters and scaling factors of the INT8 quantized pre-training model. (2) In the decision layer, the control plane generates P4 code from the structural information of the neural network, compiles it, and loads it onto the programmable switch; the control plane also generates table-issuing programs from the parameters and scaling factors of the pre-training model and updates the table entries of the data plane. (3) In the decision layer, the data plane processes incoming data packets according to the code and the table entries issued by the control plane: it carries out the neural network computation and the shift-and-truncate operations on the parameter information carried in the packet, encapsulates the INT8 output in the packet header, and forwards the packet to the next switch or server. (4) In the decision layer, the control plane sends the collected network information to the application layer to update the offline data set; the updated data set is used to train a new neural network, which is returned to the control plane, and the control plane installs the new network's parameter information by issuing table entries as above.
Fig. 2 shows the custom packet header of the present invention. For data packets entering the switch, the transport layer is followed by the input parameters of the neural network; these input parameters can alternatively be obtained by parsing the IP/UDP/TCP header fields. For data packets processed by the switch, the transport layer is followed by the neural network's processing result in the form of an 8-bit signed integer. The input or output parameters are followed by the packet payload.
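A sketch of such a packet built with Scapy; the feature count, addresses, UDP port, and exact byte layout are assumptions for illustration, and only the position of the fields (directly after the transport layer) comes from the patent:

```python
from scapy.all import Ether, IP, UDP, Raw

# Assumed layout: 4 INT8 features right after UDP, then the payload.
features = [5, -2, 127, -128]
inputs = bytes((x & 0xFF) for x in features)   # two's-complement encoding
pkt = (Ether() / IP(dst="10.0.0.2") / UDP(dport=9999)
       / Raw(load=inputs + b"application payload"))
pkt.show()
# After inference, the switch overwrites this slot with the 8-bit signed
# result, still followed by the original payload.
```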
Fig. 3 illustrates the packet processing procedure. In the INT8 quantized neural network the data flow is as follows. The input features are 8-bit signed integers, as are the features (weights and biases) in the pre-trained model; the intermediate result of the matrix multiply-add is represented as a 32-bit signed integer; the result after shift-and-truncate is again an 8-bit signed integer. In the computation step, the matrix multiply-add multiplies the input features by the pre-trained model features and accumulates the products into the elements of the result matrix. The shift-and-truncate step then applies a shift and a truncation to each element: the 32-bit integer is shifted right, and the shifted result is recorded as 127 if it exceeds 127 and as -128 if it is below -128. This step converts the 32-bit signed integer into an 8-bit signed integer, which is placed in the packet header as the final result.
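A minimal NumPy sketch of this data flow, with an assumed 4-feature input, a 3-output layer, and a shift of 7:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.integers(-128, 128, size=(1, 4)).astype(np.int32)  # INT8 inputs
B = rng.integers(-128, 128, size=(4, 3)).astype(np.int32)  # INT8 weights
C = rng.integers(-128, 128, size=(1, 3)).astype(np.int32)  # INT8 biases
shift = 7                                # power-of-two scaling coefficient

D = A @ B + C                            # 32-bit intermediate, D = A x B + C
out = np.clip(D >> shift, -128, 127).astype(np.int8)  # shift, then truncate
print(D)    # INT32 values, typically far outside [-128, 127]
print(out)  # 8-bit signed result placed into the packet header
```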
Fig. 4 is a schematic diagram of the implementation of the present invention on a programmable switch. Programmable switches typically use a pipeline structure consisting of multiple computation stages, each of which can perform table look-ups in parallel. The first n stages implement the matrix multiply-add: each stage multiplies one input feature by m weights and accumulates the results into m intermediate variables. The multiplication is realized by the exact match of the match-action table; the control plane precomputes the products of each pre-trained model feature with the values -128 to 127 and places them in the multiplication table. The final stage applies the shift-and-truncate operation to the m intermediate multiply-accumulate results, as detailed in Fig. 5.
Fig. 5 shows the specific implementation of the shift and truncation of the present invention. The shift-truncate table is implemented with the ternary match table of the programmable switch. Keys in the ternary match table comprise a key value, a mask, and a priority. When a data value arrives, an entry is hit if the value agrees with the key value at every bit position where the mask is 1; if a value hits multiple entries, the higher-priority entry gives the final result. The shift-truncate table has two parts, shift and truncate, with the shift entries at higher priority than the truncate entries. For the shift operation, assume the right-shift amount is offset; the mask is then a 32-bit value whose last offset bits are 0 and whose remaining bits are 1, and 256 entries are installed in advance, the keys being the numbers -128 to 127 each shifted left by offset. For the truncate operation, the mask is a 32-bit value with only the first bit set to 1 and the remaining bits 0; the key of all 1s matches negative numbers and the key of all 0s matches positive numbers. In the shift case of Fig. 5, offset is 5, so the entry for the result 126 has the key 126 shifted left by 5 bits. The incoming value 4033 has the binary representation 0b00000000000000000000111111000001, which agrees, on the bits where the mask is 1, with the entry whose key is 0b00000000000000000000111111000000, so the result 126 is obtained. In the truncate case of Fig. 5, the value 2147483647 matches none of the 256 shift entries, and since its first bit is 0 it hits the positive-number truncation entry and is assigned the value 127.
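The Fig. 5 numbers can be checked directly in Python (a sanity check of the worked example, not switch code):

```python
mask = 0xFFFFFFFF & ~0b11111        # offset = 5: last five mask bits are 0
key = 126 << 5                      # 4032 = 0b111111000000
print(4033 & mask == key)           # True: the shift entry for 126 is hit
print(4033 >> 5)                    # 126, the arithmetic the entry encodes
print(max(-128, min(127, 2147483647 >> 5)))  # 127: positive truncation
```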
Example 3
The embodiment provides an INT8 quantized neural network deployment system applicable to a programmable switch, comprising a preprocessing unit, a parsing unit, a configuration unit, an issuing unit, and an output unit.
The preprocessing unit preprocesses the neural network pre-training model into an INT8 quantized neural network model and records the model parameters of the INT8 quantized neural network model and the scaling coefficients of its quantization layer; the user trains the INT8 quantized neural network model on custom data packets for the target application. The parsing unit configures the parser of the programmable switch and parses the header of the custom data packet to obtain the input of the INT8 quantized neural network model. The configuration unit configures the data plane of the programmable switch with matrix multiply-add computation and shift-and-truncate computation according to the structure of the INT8 quantized neural network model. The issuing unit issues the model parameters of the INT8 quantized neural network model and the scaling coefficients of the quantization layer from the control plane of the programmable switch. The output unit allows the user, after deployment of the programmable switch is complete, to send custom data packets into the programmable switch; the programmable switch performs model inference and stores the result in the packet header for subsequent use by the user.
Example 4
The present embodiment proposes a computer readable storage medium storing at least one executable instruction which, when run on an electronic device, causes the electronic device to execute the method of deploying an INT8 quantized neural network on a programmable switch described in the foregoing embodiments. More specific examples of the computer readable storage medium in the present disclosure may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. The computer readable storage medium may include a data signal propagated in baseband or as part of a carrier wave, with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electromagnetic or optical forms, or any suitable combination thereof. A readable signal medium may also be any readable medium, other than a readable storage medium, that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
The novelty of the invention lies in providing a neural network deployment method that is feasible on a programmable switch, is compatible with the currently mature INT8 quantization technology, and can compute neural network results in high-throughput network scenarios.
As described above, although the present invention has been shown and described with reference to certain preferred embodiments, it is not to be construed as limiting the invention itself. Various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A method for deploying an INT8 quantized neural network on a programmable switch, comprising the steps of:
S1, preprocessing a neural network pre-training model into an INT8 quantized neural network model, and recording the model parameters of the INT8 quantized neural network model and the scaling coefficients of its quantization layer; the user trains the INT8 quantized neural network model on custom data packets for the target application;
S2, configuring the parser of the programmable switch to parse the header of the custom data packet and obtain the input of the INT8 quantized neural network model;
S3, configuring the data plane of the programmable switch with matrix multiply-add computation and shift-and-truncate computation according to the structure of the INT8 quantized neural network model;
S4, issuing the model parameters of the INT8 quantized neural network model and the scaling coefficients of the quantization layer from the control plane of the programmable switch;
and S5, after deployment of the programmable switch is complete, the user sends custom data packets into the programmable switch; the programmable switch performs model inference and stores the result in the packet header for subsequent use by the user.
2. The method of deploying an INT8 quantized neural network on a programmable switch according to claim 1, wherein step S1 further comprises:
converting the neural network pre-training model into the onnx format;
importing the neural network pre-training model and the data set used for quantization calibration into the ppq quantization framework;
and calibrating and quantizing the imported neural network pre-training model against the imported data set using the integer quantization option.
3. The method of deploying an INT8 quantized neural network on a programmable switch according to claim 2, further comprising: during quantization, modifying the parameters in the neural network pre-training model to the INT8 type; and adding quantization and dequantization layers, wherein the quantization layers contain the scaling coefficients; the INT8 quantized neural network model retains the model parameters.
4. The method of deploying an INT8 quantized neural network on a programmable switch according to claim 3, wherein the model parameters and scaling coefficients of the INT8 quantized neural network model obtained in step S1 are stored in a plurality of matrices.
5. The method of deploying an INT8 quantized neural network on a programmable switch according to claim 1, wherein step S2 requires determining the format of the data packet header for subsequent computation, in one of two modes:
mode one, the INT8 values are stored directly in the data packet header;
mode two, the protocol and port numbers in the data packet header are parsed and used as the input of the subsequent INT8 quantized neural network model.
6. The method of claim 1, wherein in the matrix multiply-add computation of step S3, the multiplication operates on INT8 values and the accumulated result is stored as INT32;
multiplication is performed by table look-up, and 32-bit metadata variables store the intermediate multiply-add results.
7. The method of deploying an INT8 quantized neural network on a programmable switch according to claim 6, wherein the table look-up comprises the following step:
using the exact match of the programmable switch, the products of each weight with every possible 8-bit input value (-128 to 127) are stored in a look-up table in advance.
8. The method of deploying an INT8 quantized neural network on a programmable switch according to claim 1, wherein the shift-and-truncate computation in step S3 comprises:
implementation by mask matching, with the shift entries given higher priority than the truncation entries;
issuing entries to the programmable switch according to the specific parameters of the INT8 quantized neural network model, where each multiplication operation issues 256 entries implementing the INT8 multiplication, and each shift-and-truncate operation issues 256 shift entries (sharing a common mask) together with two truncation entries.
9. An INT8 quantized neural network deployment system, the deployment system adapted for use with a programmable switch, the deployment system comprising:
the preprocessing unit, used to preprocess the neural network pre-training model into an INT8 quantized neural network model and record the model parameters of the INT8 quantized neural network model and the scaling coefficients of its quantization layer, the user training the INT8 quantized neural network model on custom data packets for the target application;
the parsing unit, used to configure the parser of the programmable switch and parse the header of the custom data packet to obtain the input of the INT8 quantized neural network model;
the configuration unit, used to configure the data plane of the programmable switch with matrix multiply-add computation and shift-and-truncate computation according to the structure of the INT8 quantized neural network model;
the issuing unit, used to issue the model parameters of the INT8 quantized neural network model and the scaling coefficients of the quantization layer from the control plane of the programmable switch;
and the output unit, used so that after deployment of the programmable switch is complete, the user sends custom data packets into the programmable switch, the programmable switch performs model inference, and the result is stored in the packet header for subsequent use by the user.
10. A computer readable storage medium having stored therein at least one executable instruction that when run on an electronic device causes the electronic device to perform the method of deploying an INT8 quantized neural network on a programmable switch as claimed in any one of claims 1 to 8.
Priority Applications (1)

CN202410073576.4A, priority and filing date 2024-01-18: Method and system for deploying INT8 quantized neural network on programmable switch

Publications (1)

CN117978654A, published 2024-05-03, status pending

Family

ID=90855753 (CN)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination