WO2024087512A1 - Graph neural network compression method and apparatus, electronic device, and storage medium


Info

Publication number: WO2024087512A1
Authority: WIPO (PCT)
Application number: PCT/CN2023/085970
Other languages: English (en), French (fr)
Inventors: 胡克坤, 董刚, 赵雅倩, 李仁刚
Original Assignee: 浪潮电子信息产业股份有限公司
Application filed by 浪潮电子信息产业股份有限公司
Publication of WO2024087512A1


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/10: File systems; File servers
    • G06F 16/17: Details of further file system functions
    • G06F 16/174: Redundancy elimination performed by the file system
    • G06F 16/1744: Redundancy elimination performed by the file system using compression, e.g. sparse files
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/50: Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005: Allocation of resources to service a request
    • G06F 9/5011: Allocation of resources to service a request, the resources being hardware resources other than CPUs, servers and terminals
    • G06F 9/5016: Allocation of resources, the resource being the memory
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods

Definitions

  • the present application relates to the field of neural networks, and in particular to a graph neural network compression method, device, electronic device and storage medium.
  • graph neural networks (GNNs) are widely used in various fields such as graph-based vertex classification, molecular interactions, social networks, recommendation systems, and program comprehension.
  • although GNN models usually have few parameters, GNNs have high memory usage and high computational complexity (manifested as long training or inference times), because the storage and computation requirements of each application are closely related to the size of the input graph data. This makes GNNs impractical on the vast majority of resource-constrained devices, such as embedded systems and IoT devices. There are two main reasons behind this situation.
  • the input of GNNs consists of two types of data, graph structure (edge list) and vertex features (embedding).
  • quantization compression can emerge as a "kill two birds with one stone" solution for resource-constrained devices, which can: (1) effectively reduce the memory size of vertex features, thereby reducing memory usage; (2) minimize the size of operands to reduce power consumption.
  • related quantization methods have the following three problems: (1) they choose a simple but aggressive uniform quantization for all data to minimize memory and power costs, resulting in high accuracy loss; (2) they choose a very conservative quantization to maintain accuracy, which leads to suboptimal memory and energy performance; (3) they ignore different hardware architectures and quantize all layers of a GNN in a uniform manner.
  • the purpose of this application is to provide a graph neural network compression method, device, electronic device and storage medium, which can use reinforcement learning to automatically determine the optimal quantization bit width for the vertex features in the graph neural network and graph data under the constraints of preset resource constraints, so as to ensure that the obtained quantized graph neural network has both high accuracy and low resource consumption rate.
  • the present application provides a graph neural network compression method, comprising:
  • reinforcement learning and hardware accelerators are used to determine the optimal interval quantization bit width corresponding to each degree interval and the optimal network quantization bit width corresponding to the graph neural network;
  • the optimal interval quantization bit width is used to quantize and compress the vertex features of the graph vertices of the corresponding degree in the graph data
  • the optimal network quantization bit width is used to quantize and compress the graph neural network to obtain the optimal quantized graph data and the optimal quantized graph neural network.
  • the degree distribution range is divided according to the distribution of graph vertices within the range.
  • the optimal network quantization bit width corresponding to the graph neural network specifically refers to the optimal network quantization bit width corresponding to the graph convolution kernel matrix, weight matrix and activation matrix of the graph neural network.
  • preset resource constraints are used to limit the computing resources consumed in processing quantized graph data and quantized graph neural networks.
  • the preset resource constraint conditions include: a computing amount threshold, a memory usage threshold, and a delay threshold.
  • the degree distribution range corresponding to all graph vertices in the graph data is determined, and the degree distribution range is divided into multiple degree intervals, including:
  • the degree distribution range is divided by using the graph vertex sequence to obtain multiple degree intervals; the number of graph vertices contained in each degree interval is the same or the difference is less than a preset threshold.
  • the method further includes:
  • the optimal quantized graph neural network is trained using the optimal quantized graph data to obtain a fine-tuned quantized graph neural network, so as to deploy the fine-tuned quantized graph neural network to external service equipment.
  • the timing structure of the hardware accelerator is a reconfigurable bit-serial matrix multiplication overlay (BISMO), and the spatial structure is a BitFusion architecture
  • reinforcement learning and hardware accelerators are used to determine the optimal interval quantization bit width corresponding to each degree interval and the optimal network quantization bit width corresponding to the graph neural network, including:
  • the agent includes an actor module and a critic module;
  • the action sequence is used to save the interval quantization bit width corresponding to each degree interval and the network quantization bit width corresponding to the graph neural network;
  • the state vector is used to record the memory usage and computational complexity of the quantized graph neural network when processing quantized graph data, as well as the accuracy when performing the specified task;
  • the time step is set to 1, and under the constraints of the preset resource constraints, the actor module is used to determine the continuous actions, the continuous actions are used to numerically update the action sequence, and the memory usage and computation amount corresponding to the action sequence are determined after the update;
  • Quantize and compress vertex features and graph neural networks in graph data using action sequences and send the obtained quantized graph data and quantized graph neural networks to a hardware accelerator, so that the hardware accelerator trains the quantized graph neural network using the quantized graph data, and determines the current accuracy of the trained quantized graph neural network for performing a specified task;
  • the historical reward value is updated using the reward value, and the optimal interval quantization bit width and the optimal network quantization bit width are updated using the updated action sequence;
  • the time step is increased by 1, the historical state vector is updated using the current state vector, and the step of determining the continuous action using the actor module is entered under the constraints of the preset resource constraints;
  • the optimal interval quantization bit width and the optimal network quantization bit width are output.
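  • to make the search procedure above concrete, the following is a minimal Python sketch of the outer reinforcement-learning loop over episodes and time steps; the agent, quantization and accelerator interfaces (ddpg_agent, quantize_with, accelerator_eval) are hypothetical placeholders rather than the components defined in this application.

```python
# Illustrative sketch of the bit-width search loop (hypothetical interfaces).
def search_bit_widths(ddpg_agent, quantize_with, accelerator_eval,
                      init_actions, num_episodes, num_steps):
    best_reward = float("-inf")          # historical reward value
    best_actions = list(init_actions)    # optimal interval / network bit widths
    for episode in range(num_episodes):
        actions = list(init_actions)     # action sequence (one bit width per entry)
        state = [0.0, 0.0, 0.0]          # historical state vector [accuracy, memory, compute]
        for step in range(num_steps):
            # the agent updates the action sequence under the resource constraints
            actions = ddpg_agent.act(state, actions)
            quant_graph, quant_gnn = quantize_with(actions)
            # the hardware accelerator trains the quantized GNN and reports feedback
            accuracy, memory, compute, reward = accelerator_eval(quant_graph, quant_gnn)
            next_state = [accuracy, memory, compute]
            ddpg_agent.remember(state, actions, reward, next_state)
            ddpg_agent.train()
            if reward > best_reward:
                best_reward, best_actions = reward, list(actions)
            state = next_state
    return best_actions
```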
  • the actor module is used to determine continuous actions, the continuous actions are used to update the action sequence numerically, and the memory usage and calculation amount corresponding to the action sequence are determined after the update, including:
  • the actor module is used to select continuous actions according to the Behavior strategy, and the continuous actions are discretized in the following way to obtain discrete action values: a′_τ(i) = argmin_{q ∈ Q} |q - round(q_min + a_τ(i) × (q_max - q_min))|;
  • a_τ(i) represents the continuous action corresponding to the i-th quantization bit width in the action sequence of the τ-th time step;
  • a′_τ(i) represents the discrete action value corresponding to a_τ(i);
  • Q contains multiple preset quantization bit width values;
  • round(·) represents the rounding function;
  • q_min and q_max represent the preset minimum quantization bit width and maximum quantization bit width;
  • the argmin(·) function is used to select the target preset quantization bit width value q in Q so as to minimize the distance between q and the value obtained by mapping a_τ(i) into [q_min, q_max].
  • the step of using the action sequence to quantize and compress the vertex features and graph neural network in the graph data is entered;
  • the quantization bit width in the action sequence is reduced in sequence according to the preset order to update the action sequence again, and when each reduction action is completed, the step of determining the memory usage, calculation amount and delay amount corresponding to the updated action sequence is entered.
  • the actor module is used to select continuous actions according to the Behavior strategy, that is, a_τ = μ(O_τ | θ^μ) + N_τ, where:
  • N_τ represents the random OU noise corresponding to the τ-th time step;
  • O_τ represents the historical state vector corresponding to the τ-th time step;
  • μ represents the online actor network in the actor module;
  • θ^μ represents the online actor network parameters.
  • the hardware accelerator uses quantized graph data to train the quantized graph neural network, including:
  • the hardware accelerator uses quantized graph data to train the quantized graph neural network based on the mini-batch stochastic gradient descent method.
  • determining the memory usage, computation amount, and delay amount corresponding to the updated action sequence includes:
  • store_MB represents the memory usage;
  • n_b represents the number of graph vertices in a single mini-batch;
  • f_l represents the vertex dimension value corresponding to the l-th network layer of the quantized graph neural network;
  • L represents the number of all network layers of the quantized graph neural network;
  • q_max represents the maximum value of the interval quantization bit width assigned to all graph vertices in a single mini-batch;
  • S represents the total number of convolution kernels;
  • q_W and q_F represent the network quantization bit widths corresponding to the weight matrix and the convolution kernel of each network layer of the quantized graph neural network, respectively;
  • the calculation amount is calculated using the following formula:
  • compute_MB represents the amount of calculation;
  • q_a represents the network quantization bit width corresponding to the activation matrix of each network layer of the quantized graph neural network;
  • MAC_l represents the total number of multiplication and accumulation operations of the l-th layer of the quantized graph neural network;
  • the delay is calculated using the following formula:
  • latency_MB represents the delay;
  • Λ_l represents the delay of the l-th network layer of the quantized graph neural network in processing a mini-batch of graph data.
  • the vertex features in the graph data are quantized and compressed using an action sequence, including:
  • quantize(·) represents the quantization compression function;
  • round(·) represents the rounding function;
  • clip(x, y) represents the truncation function used to truncate x to [-y, y] (y > 0);
  • X_{i,:} represents the vertex feature;
  • X_{i,:}(j) (j ∈ [1, f_0]) represents the j-th component of the vertex feature;
  • s represents the scaling factor, s = c / (2^q - 1);
  • q represents the interval quantization bit width corresponding, in the action sequence, to the degree of the graph vertex to which X_{i,:} belongs.
  • before using the action sequence to quantize and compress the vertex features in the graph data, the method further includes:
  • the argmin(·) function is used to select the value of x that minimizes the KL divergence D_KL between the distribution of X_{i,:} before quantization and its distribution after quantization.
  • the actor module includes an online actor network and a target actor network
  • the critic module includes an online critic network and a target critic network
  • initializing the agent used for reinforcement learning includes:
  • the online critic network parameters of the online critic network are initialized, and the target critic network parameters of the target critic network are set to the same values as the online critic network parameters.
  • using the conversion data to train the actor module and the critic module includes:
  • loss_Q represents the loss function;
  • a_τ represents the continuous action;
  • O_τ represents the historical state vector corresponding to the τ-th time step;
  • Q represents the online critic network;
  • θ^Q represents the online critic network parameters;
  • N represents the preset number;
  • Q′ represents the target critic network;
  • θ^{Q′} represents the target critic network parameters;
  • μ′ represents the target actor network;
  • θ^{μ′} represents the target actor network parameters;
  • O_{τ+1} represents the current state vector corresponding to the τ-th time step.
  • the target critic network parameters and target actor network parameters are softly updated using the updated online critic network parameters and online actor network parameters, that is, each target parameter is moved a small step toward the corresponding online parameter;
  • the step size of this soft update is a preset value.
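  • the soft (exponential moving average) update of the target networks described above can be sketched in Python as follows; the name tau_soft for the preset update coefficient and the plain-dict parameter containers are illustrative assumptions.

```python
# Soft update of target network parameters toward the online network parameters.
# tau_soft is the preset coefficient; small values keep the target network nearly frozen.
def soft_update(online_params: dict, target_params: dict, tau_soft: float = 0.01) -> dict:
    return {
        name: tau_soft * online_params[name] + (1.0 - tau_soft) * target_params[name]
        for name in target_params
    }

# usage: the target critic and target actor slowly track their online counterparts
# target_critic_params = soft_update(online_critic_params, target_critic_params)
# target_actor_params  = soft_update(online_actor_params, target_actor_params)
```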
  • the present application also provides a graph neural network compression device, comprising:
  • An acquisition module is used to obtain the trained graph neural network and the graph data used in its training
  • An interval determination module is used to determine the degree distribution range corresponding to all graph vertices in the graph data, and divide the degree distribution range into multiple degree intervals;
  • a quantization bit width determination module is used to determine the optimal interval quantization bit width corresponding to each degree interval and the optimal network quantization bit width corresponding to the graph neural network by using reinforcement learning and hardware accelerator under the constraints of preset resource constraints;
  • the quantization compression module is used to quantize and compress the vertex features of graph vertices with corresponding degrees in the graph data using the optimal interval quantization bit width, and to quantize and compress the graph neural network using the optimal network quantization bit width, so as to obtain the optimal quantized graph data and the optimal quantized graph neural network.
  • the present application also provides an electronic device, comprising:
  • a processor is used to implement the above graph neural network compression method when executing a computer program.
  • the present application also provides a non-volatile readable storage medium, which stores computer executable instructions.
  • the present application provides a graph neural network compression method, including: obtaining a trained graph neural network and graph data used in its training; determining the degree distribution range corresponding to all graph vertices in the graph data, and dividing the degree distribution range into multiple degree intervals; under the constraints of preset resource constraints, using reinforcement learning and hardware accelerators to determine the optimal interval quantization bit width corresponding to each degree interval and the optimal network quantization bit width corresponding to the graph neural network; using the optimal interval quantization bit width to quantize and compress vertex features of graph vertices of corresponding degrees in the graph data, and using the optimal network quantization bit width to quantize and compress the graph neural network, so as to obtain optimal quantized graph data and optimal quantized graph neural network.
  • when the present application obtains the trained graph neural network and the graph data used for its training, it first counts the degree distribution range corresponding to all graph vertices in the graph data and divides this range into multiple degree intervals; then, the present application uses reinforcement learning and hardware accelerators, under the constraints of preset resource constraints, to determine the optimal interval quantization bit width corresponding to each degree interval and the optimal network quantization bit width corresponding to the graph neural network, and uses the above two quantization bit widths to quantize and compress the vertex features of the graph data and the graph neural network.
  • reinforcement learning can automatically search for the optimal quantization bit width allocation strategy corresponding to each degree interval and the graph neural network according to the feedback of the hardware accelerator, that is, it can realize the automatic search for the above-mentioned optimal interval quantization bit width and the optimal network quantization bit width; at the same time, the automatic search action of reinforcement learning is subject to the preset resource constraints, that is, it can ensure that the final optimal interval quantization bit width and the optimal network quantization bit width can be adapted to resource-constrained devices; finally, since the present application has divided the degree distribution range of the graph vertices into multiple degree intervals, and determined the corresponding optimal interval quantization bit width for each interval, that is, it can perform different degrees of quantization compression on the vertex features of graph vertices of different degrees, it can effectively avoid the problem of high accuracy loss caused by the simple but aggressive unified quantization of all data in the related schemes.
  • this application uses reinforcement learning to determine the optimal quantization bit width for the graph neural network and the graph data used in its training, it can not only realize the automatic determination of the quantization bit width, but also effectively balance the relationship between performance and network model accuracy, ensuring that the final quantized graph data and quantized graph neural network not only have high accuracy, but also can be adapted to resource-constrained devices.
  • the present application also provides a graph neural network compression device, an electronic device, and a non-volatile readable storage medium, which have the above-mentioned beneficial effects.
  • FIG1 is a flow chart of a graph neural network compression method provided in an embodiment of the present application.
  • FIG2 is a typical structural diagram of a graph neural network provided in an embodiment of the present application.
  • FIG3 is a structural block diagram of a graph neural network compression system provided in an embodiment of the present application.
  • FIG4 is a structural block diagram of a graph neural network compression device provided in an embodiment of the present application.
  • FIG5 is a structural block diagram of an electronic device provided in an embodiment of the present application.
  • FIG6 is a structural block diagram of a non-volatile readable storage medium provided in an embodiment of the present application.
  • the embodiment of the present application can provide a graph neural network compression method, which can use reinforcement learning to automatically determine the optimal quantization bit width for the graph neural network and graph data under the constraints of preset resource constraints to ensure that the obtained quantized graph neural network has both high accuracy and low resource consumption rate.
  • Figure 1 is a flowchart of a graph neural network compression method provided by an embodiment of the present application, and the method may include:
  • the graph neural network obtained in this step is the original, full-precision graph neural network
  • the graph data is the training data of the network.
  • the weights, convolution kernels and other parameters contained in the graph neural network, as well as the graph data, are all floating-point data, and most of them are represented in FP32. Floating-point data has high precision, but correspondingly, the memory space required to store it is large.
  • the goal of this application is to find a suitable quantization bit width for the weights of each layer of the graph neural network, convolution kernel parameters, etc., and graph data, while ensuring the inference accuracy of the graph neural network model, so as to reduce the storage space requirements.
  • the quantization bit width here is usually an integer with lower precision, such as int4, int8, etc.
  • Graph data is the basic input content of graph neural networks.
  • the graph data can be recorded as a graph G(V, E) with n vertices and m edges, that is, |V| = n and |E| = m
  • the degree matrix D is a diagonal matrix. The values of the n elements on the main diagonal represent the degrees of the n vertices, and the remaining elements are zero.
  • each vertex v_i has a feature vector of length f_0, and the feature vectors of all graph vertices constitute the feature matrix X_0
  • the specific part of the graph data to be compressed is a feature matrix composed of the feature vectors of all graph vertices. This matrix is of floating point type.
  • graph neural networks are a special type of neural network that can process irregularly structured data. Although the structure of graph neural networks can be designed following different guiding principles, almost all graph neural networks can be interpreted as performing message passing on vertex features, followed by feature transformation and activation.
  • Figure 2 shows the structure of a typical graph neural network: it consists of an input layer, L graph convolutional layers, and an output layer.
  • the input layer is responsible for reading the adjacency matrix A or adjacency list AdjList representing the graph topology, as well as the vertex feature matrix X0 .
  • the graph convolutional layer is responsible for vertex feature extraction.
  • for each graph convolutional layer l (l ∈ [1, L]), it reads in the adjacency matrix A or adjacency list AdjList, as well as the vertex feature matrix X_l, and outputs a new vertex feature matrix X_{l+1} through graph convolution operations and nonlinear transformations.
  • the output layer is freely set according to different tasks.
  • vertex classification can be implemented by a softmax function.
  • the embodiments of the present application are not limited to specific graph neural networks and graph data.
  • the structure of the network can be designed following different guiding principles; at the same time, it is understandable that for different tasks, the specific content of the graph data and even its complexity may be different, so the specific graph neural network and graph data can be selected according to the actual application requirements.
  • the reason why the present application can compress various types of graph neural networks and graph data is that the embodiment of the present application uses reinforcement learning to determine the optimal quantization bit width corresponding to the graph neural network and graph data, and reinforcement learning has strong adaptability to various environments. Therefore, the compression method provided in the embodiment of the present application is applicable to various types of graph neural networks.
  • S200 Determine the degree distribution range corresponding to all graph vertices in the graph data, and divide the degree distribution range into multiple degree intervals.
  • the quantization compression of the vertex features of each vertex in the graph data is usually performed using a unified quantization bit width.
  • although this effectively reduces the complexity and storage scale of the graph data, this indiscriminate quantization compression method brings significant accuracy loss to the graph neural network model. Therefore, in an embodiment of the present application, different quantization bit widths can be used to compress graph vertices with different degrees in the graph data, so as to alleviate the accuracy loss of the graph neural network model caused by the quantized graph data.
  • vertices with higher degrees usually obtain richer information from adjacent vertices, which makes them more robust to low quantization bits, because the random errors of quantization can usually be averaged to 0 through a large number of aggregation operations.
  • the quantization error Error_i of vertex v_i is a random variable and follows a uniform distribution.
  • a large number of Error_i and Error_j can be aggregated from vertex v_i and its adjacent vertices v_j, and the average result will converge to 0 according to the law of large numbers. Therefore, vertices with large degrees are more robust to quantization errors, and smaller quantization bit widths can be used for these high-degree vertices, while larger quantization bit widths can be used for low-degree vertices.
  • the embodiment of the present application can first count the degrees corresponding to each graph vertex in the graph data, obtain the degree distribution range corresponding to the graph data, and then divide this range into multiple degree intervals to determine the optimal interval quantization bit width for each interval.
  • the distribution law of the optimal interval quantization bit width should therefore be: the larger the degree value corresponding to the degree interval, the smaller the corresponding optimal interval quantization bit width.
  • the embodiment of the present application does not limit the method of dividing the degree distribution range.
  • the degree distribution range can be divided equally, or it can be divided according to the distribution of graph vertices within this range, for example, it can be ensured that the number of graph vertices corresponding to each degree interval is the same or close.
  • the degree distribution range can be divided according to the distribution of graph vertices within the range to ensure that the number of graph vertices contained in each interval is the same.
  • determining the degree distribution range corresponding to all graph vertices in the graph data and dividing the degree distribution range into a plurality of degree intervals may include:
  • Step S201 arranging all graph vertices in the graph data in ascending order of degree to obtain a graph vertex sequence
  • Step S202 Divide the degree distribution range by using the graph vertex sequence to obtain multiple degree intervals; the number of graph vertices included in each degree interval is the same or the difference is less than a preset threshold.
  • the embodiment of the present application does not limit the specific value of the preset threshold, which can be set according to actual application requirements.
  • for example, a vertex degree split point list split_point = [d_1, d_2, ..., d_{k-1}] can be determined in the sequence to divide all vertices into k intervals [d_j, d_{j+1}] (j ∈ [0, k-1]), so that the number of vertices falling in each interval is the same or close, as illustrated in the sketch below.
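  • a minimal Python sketch of this equal-size interval division is given below; the function and variable names are illustrative and are not taken from this application.

```python
# Divide graph vertices into k degree intervals containing (almost) equal numbers of vertices.
def split_degree_intervals(degrees, k):
    order = sorted(range(len(degrees)), key=lambda v: degrees[v])    # vertices in ascending degree
    split_points = []
    for j in range(1, k):
        boundary_vertex = order[j * len(order) // k]
        split_points.append(degrees[boundary_vertex])                 # d_1, ..., d_{k-1}
    # interval index of each vertex = number of split points not exceeding its degree
    interval_of = [sum(d >= p for p in split_points) for d in degrees]
    return split_points, interval_of

degrees = [1, 1, 2, 3, 3, 4, 7, 9, 15, 20]
print(split_degree_intervals(degrees, 3))   # split points [3, 7]; interval indices 0, 1, 2
```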
  • the embodiment of the present application will use reinforcement learning and hardware accelerators to determine the optimal interval quantization bit width corresponding to each degree interval and the optimal network quantization bit width corresponding to the graph neural network model under the constraints of preset resource constraints.
  • the optimal network quantization bit width corresponding to the graph neural network specifically refers to the optimal network quantization bit width corresponding to the graph convolution kernel matrix, weight matrix and activation matrix of the graph neural network.
  • the optimal network quantization bit widths corresponding to these three matrices may be the same or different; in addition, the optimal network quantization bit widths corresponding to the graph convolution kernel matrix, weight matrix and activation matrix of each network layer of the graph neural network may be the same or different, and can be selected according to actual application requirements, wherein the input layer and the output layer do not have a graph convolution kernel matrix and an activation matrix, while the convolution layer has a graph convolution kernel matrix and an activation matrix. It can be understood that although different optimal network quantization bit widths can bring higher network model accuracy, it is easy to increase the search calculation amount of the optimal network quantization bit width.
  • the setting of the optimal network quantization bit width of the above three matrices can be set as needed after balancing the network model accuracy and the search calculation amount.
  • the convolution layer has these three matrices, but the input layer and output layer do not have graph convolution kernel matrices and activation matrices. Therefore, when setting the network quantization bit width for the graph neural network, it can also be further set according to the specific structure of the graph neural network.
  • the preset resource restriction condition is used to limit the computing resources consumed for processing quantized graph data and quantized graph neural networks (such as training, executing specified tasks, etc.). This is because the graph neural network consumes a lot of computing resources. If the specific hardware framework is not considered and quantization compression is performed arbitrarily, it may cause the final quantized graph data and quantized graph neural network to have a large processing calculation amount, a large memory usage and a long processing delay, which is not conducive to deployment and application. Therefore, the embodiment of the present application will use preset resource restriction conditions to limit reinforcement learning.
  • the embodiment of the present application does not limit specific preset resource restriction conditions, for example, it may include a calculation amount threshold, a memory usage threshold and a delay threshold, and each threshold is set with a corresponding calculation formula for calculating the calculation amount, memory usage and delay amount corresponding to the quantized graph data and the quantized graph neural network. It can be understood that the calculation amount, memory usage and delay amount corresponding to the quantized graph data and the quantized graph neural network should be less than or equal to the corresponding calculation amount threshold, memory usage threshold and delay threshold.
  • the above thresholds and corresponding formulas are determined by direct feedback from the hardware accelerator, where the hardware accelerator is used to verify the quantization effect of graph data and graph neural network, such as verifying the consumption of computing resources by the quantized compression network and the accuracy of the network when performing a specified task.
  • the embodiments of the present application do not limit the specific values of the computation amount threshold, memory usage threshold and delay threshold, nor the calculation formulas corresponding to these thresholds, which can be set according to actual application requirements, or can refer to the description in the subsequent embodiments.
  • the embodiments of the present application do not limit the specific structure of the hardware accelerator.
  • the timing structure of the hardware accelerator can be a reconfigurable bit serial matrix multiplication overlay (BISMO), and the spatial structure can be a BitFusion architecture.
  • a preferred hardware accelerator configuration can be referred to the following table.
  • reinforcement learning is one of the paradigms and methodologies of machine learning, which is used to describe and solve the problem of how an agent learns strategies to maximize rewards or achieve specific goals during its interaction with the environment.
  • the problem to be solved by reinforcement learning is: to let the agent learn how to perform actions in an environment to obtain the maximum total reward.
  • This reward value is generally associated with the task goal defined by the agent.
  • the main learning content of the agent includes: first, the action policy, and second, planning.
  • the learning goal of the action policy is the optimal strategy, that is, using such a strategy, the agent's behavior in a specific environment can obtain the maximum reward value, thereby achieving its task goal.
  • Actions can be simply divided into: (1) continuous, such as the steering wheel angle, throttle, and brake control signals in racing games, and the joint servo motor control signals of robots; (2) discrete, such as Go and Snake games.
  • the embodiments of the present application specifically use a reinforcement learning method based on both value and policy, which can also be called an Actor-Critic method.
  • the Actor-Critic method combines the advantages of the value-based method and the policy-based method, using the value-based method to learn the Q-value function or the state value function V to improve the sampling efficiency (this part is handled by the critic), and using the policy-based method to learn the policy function (this part is handled by the actor), so that it is suitable for continuous or high-dimensional action spaces.
  • the Actor-Critic method can be seen as an extension of the value-based method in the continuous action space, and can also be seen as an improvement of the policy-based method in reducing sample variance and improving sampling efficiency.
  • FIG. 3 is a block diagram of a graph neural network compression system provided by an embodiment of the present application.
  • the system consists of a DDPG (Deep Deterministic Policy Gradient) agent based on the Actor-Critic framework, a policy, a quantization implementation, and a hardware accelerator.
  • the DDPG agent gives actions according to a specific strategy based on the current environment state O and on the premise of satisfying the constraints of the hardware accelerator resources (i.e., the preset resource constraints): allocate appropriate quantization bit widths for the features of the vertices of each degree interval and the graph convolution kernels (if any), weights, and activations (if any) of all layers of the graph neural network.
  • the host computer quantizes the trained floating-point graph neural network model and graph data according to the quantization bit width allocation scheme provided by the DDPG agent to obtain a quantized graph neural network model and quantized graph data. Subsequently, the quantized data and the quantized network will be mapped or distributed to the hardware accelerator together, and the latter will use the quantized graph data to train the quantized graph neural network, and after training, use the quantized graph neural network to perform the specified task, and then use the accuracy difference of the graph neural network before and after quantization as a reward, and feedback to the DDPG agent.
  • the DDPG agent adjusts its strategy based on the feedback from the environment and outputs new actions until the optimal strategy is obtained. To avoid a lengthy description, for the specific workflow of the system, including other workflows, please refer to the description in subsequent embodiments.
  • the vertex features of each graph vertex in the corresponding graph data and the graph neural network can be quantized and compressed to obtain the optimal quantized graph data and the optimal quantized graph neural network.
  • the embodiments of the present application do not limit the specific steps of quantization compression, which can be set according to actual application requirements, or can refer to the description in the subsequent embodiments. It should be pointed out that although the embodiments of the present application have tried their best to improve the accuracy of the optimal quantized graph neural network, the quantization compression itself will still have a negative impact on the accuracy of the optimal quantized graph neural network in performing designated tasks.
  • the optimal quantized graph data can be used again to train the quantized graph neural network to restore the accuracy of the optimal quantized graph neural network in performing designated tasks, so that the final fine-tuned quantized graph neural network can be deployed to the external service equipment for external service.
  • the following may also be included:
  • when the present application obtains the trained graph neural network and the graph data used for its training, it will first count the degree distribution range corresponding to all graph vertices in the graph data, and divide this range into multiple degree intervals; subsequently, the present application will use reinforcement learning and hardware accelerators to determine the optimal interval quantization bit width corresponding to each degree interval and the optimal network quantization bit width corresponding to the graph neural network under the constraints of preset resource constraints, and use the above two quantization bit widths to quantize and compress the vertex features of each graph vertex in the graph data and the graph neural network, wherein reinforcement learning can automatically search for the optimal quantization bit width allocation strategy corresponding to each degree interval and the graph neural network based on the feedback from the hardware accelerator, that is, it can realize the automatic search for the above-mentioned optimal interval quantization bit width and the optimal network quantization bit width; at the same time, the automatic search action of reinforcement learning is limited by the preset resource restriction conditions, that is, it can ensure that the optimal interval quantization bit width and the optimal network quantization bit width finally obtained can be adapted to resource-constrained devices; finally, since the present application has divided the degree distribution range of the graph vertices into multiple degree intervals, and has determined the corresponding optimal interval quantization bit width for each interval, that is, it can perform different degrees of quantization compression on the vertex features of graph vertices of different degrees, it can effectively avoid the problem of high accuracy loss caused by the simple but aggressive unified quantization of all data in the related schemes.
  • the present application uses reinforcement learning to determine the optimal quantization bit width for the graph neural network and the graph data used in its training, it can not only realize the automatic determination of the quantization bit width, but also effectively balance the relationship between performance and network model accuracy, ensuring that the final quantized graph data and quantized graph neural network not only have high accuracy, but also can be adapted to resource-constrained devices.
  • the specific workflow of the graph neural network compression system will be introduced below.
  • the action sequence is used to save the interval quantization bit width corresponding to each degree interval and the network quantization bit width corresponding to the graph neural network.
  • the length of the action sequence can be k+3.
  • the process of determining a complete action sequence is called an episode, and an episode contains N time steps, where the value of N is equal to the length of the action sequence.
  • the action sequence is updated once for each time step, so a strategy can usually generate N different action sequences.
  • the action sequence can be used for quantitative compression, and since the previous action sequence is not the same as the next action sequence, the compression effects corresponding to the two action sequences are also different.
  • the resource consumption (such as memory occupancy, amount of calculation, etc.) corresponding to the quantized graph data and quantized graph neural network generated by these two action sequences is not the same, and the corresponding accuracy when performing the specified task is also different.
  • a state vector can be used to record the changes in resource consumption and accuracy.
  • the corresponding memory occupancy, amount of calculation and the accuracy corresponding to the execution of the specified task can be recorded using the historical state vector, while the memory occupancy, amount of calculation and the accuracy corresponding to the execution of the specified task corresponding to the quantized graph data and quantized graph neural network compressed using the next action sequence can be recorded using the current state vector.
  • the reward value can be determined by using the benchmark accuracy of the original graph neural network to perform a specified task and the accuracy of the quantized graph neural network to perform the same task, where the benchmark accuracy specifically refers to the reasoning accuracy of the graph neural network after the original graph neural network is trained using the original graph data, such as the classification accuracy in the classification task.
  • the historical state vector, action sequence, reward value and current state vector corresponding to each time step constitute a transition data (transition).
  • this data contains the actions, rewards and state transfers of this quantization compression, and the agent can perceive the execution effect of the action through this data.
  • the transition data can be used to train the agent to update the strategy adopted by the agent when determining the action.
  • reinforcement learning and hardware accelerators are used to determine the optimal interval quantization bit width corresponding to each degree interval and the optimal network quantization bit width corresponding to the graph neural network, which may include:
  • the embodiments of the present application do not limit the specific tasks performed by the graph neural network, which can be set according to actual application requirements.
  • the embodiments of the present application will set the accuracy of the original graph neural network in performing the task as the benchmark accuracy.
  • the embodiments of the present application also do not limit the calculation method of the accuracy, which can be set according to actual application requirements.
  • each vertex has only one category label and all vertices have a total of c_T category labels
  • the number of vertices with category label i (i ∈ [1, c_T]) accounts for a proportion γ_i (γ_i ∈ (0,1)) of the total number of vertices
  • the classification accuracy of this multi-classification problem can be defined as:
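  • the exact expression is given in the original publication; one plausible reading, assuming the per-category accuracies are weighted by the category proportions γ_i, is sketched below with illustrative names only.

```python
# Class-proportion-weighted multi-class accuracy (an assumed form of the definition above).
def weighted_accuracy(true_labels, pred_labels, num_classes):
    total = len(true_labels)
    acc = 0.0
    for c in range(num_classes):
        members = [i for i, t in enumerate(true_labels) if t == c]
        if not members:
            continue
        gamma_c = len(members) / total                                    # proportion of label c
        acc_c = sum(pred_labels[i] == c for i in members) / len(members)  # per-class accuracy
        acc += gamma_c * acc_c
    return acc

print(weighted_accuracy([0, 0, 1, 2, 2, 2], [0, 1, 1, 2, 2, 0], num_classes=3))  # ~0.667
```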
  • the embodiment of the present application also specifically sets a historical reward value to record the highest reward value that appears during the search process. When a new highest reward value appears, the embodiment of the present application will update the historical reward value, the optimal interval quantization bit width, and the optimal network quantization bit width.
  • the historical reward value should also have an initial value, and the initialization process here is to set the initial value for it.
  • the embodiment of the present application does not limit the specific initial value of the historical reward value, as long as it is as small as possible.
  • the embodiments of the present application do not limit the specific process of initializing the intelligent agent.
  • the initialization here mainly refers to initializing the parameters in the intelligent agent. Please refer to the relevant technologies of the DDPG intelligent agent.
  • S320 Set the strategy times to 1, and initialize the action sequence and historical state vector; the action sequence is used to save the interval quantization bit width corresponding to each degree interval and the network quantization bit width corresponding to the graph neural network; the state vector is used to record the memory usage and computational complexity of the quantized graph neural network when processing quantized graph data, as well as the accuracy when performing specified tasks.
  • the state vector is [acc, store, compt],
  • acc represents accuracy
  • store represents memory usage
  • compt represents computational complexity.
  • the actor module's numerical update of the action sequence is equivalent to the actor module giving an action based on the current state and strategy. It is worth noting that the actor module (actor) will first determine the continuous action, and then use this continuous action to numerically update the action sequence.
  • the quantization bit width is usually a discrete value, for example, the conventional quantization bit width is 2, 4, 8, 16, 32 bits, etc. Therefore, after obtaining the continuous action, it is first necessary to discretize it to obtain the discrete action value, and then use this discrete action value to update the action sequence. This process is described in detail below.
  • the actor module is used to determine continuous actions, the continuous actions are used to numerically update the action sequence, and the memory usage and computation amount corresponding to the action sequence are determined after the update, including:
  • Step S331: use the actor module to select continuous actions according to the Behavior strategy, and discretize the continuous actions in the following way to obtain discrete action values: a′_τ(i) = argmin_{q ∈ Q} |q - round(q_min + a_τ(i) × (q_max - q_min))|;
  • a_τ(i) represents the continuous action corresponding to the i-th quantization bit width in the action sequence of the τ-th time step;
  • a′_τ(i) represents the discrete action value corresponding to a_τ(i);
  • Q contains multiple preset quantization bit width values;
  • round(·) represents the rounding function;
  • q_min and q_max represent the preset minimum quantization bit width and maximum quantization bit width;
  • the argmin(·) function is used to select the target preset quantization bit width value q in Q so as to minimize the distance between q and the value obtained by mapping a_τ(i) into [q_min, q_max].
  • Step S332 using the action value to numerically update the action sequence, determine the memory usage, calculation amount and delay amount corresponding to the updated action sequence, and judge whether the memory usage, calculation amount and delay amount meet the limit of the preset resource restriction condition; if the memory usage, calculation amount and delay amount meet the limit of the preset resource restriction condition, proceed to step S333; if the memory usage, calculation amount and delay amount do not meet the limit of the preset resource restriction condition, proceed to step S334;
  • Step S333 If the memory usage, calculation amount and delay amount meet the limits of the preset resource limit conditions, then enter the step of using the action sequence to quantize and compress the vertex features and graph neural network in the graph data;
  • Step S334 If the memory usage, calculation amount and delay amount do not meet the preset resource restriction conditions, the quantization bit width in the action sequence is reduced in sequence according to the preset order to update the action sequence again, and when each reduction action is completed, the step of determining the memory usage, calculation amount and delay amount corresponding to the updated action sequence is entered.
  • the embodiments of the present application hope to find a quantization bit width allocation scheme with optimal reasoning performance under given constraints.
  • the embodiments of the present application encourage the agent to meet the computing budget by limiting the action space. Specifically, each time the agent issues an action a_τ, the embodiments of the present application need to estimate the amount of hardware resources that the quantized graph neural network will use. If the current allocation scheme exceeds the hardware accelerator resource budget, the bit widths of the vertices of each degree interval and the graph convolution kernels (if any), weights, and activations (if any) of all layers of the graph neural network are reduced in turn until the hardware accelerator resource budget constraints are finally met, as sketched below. It can also be done in other orders, such as reducing in order from large to small according to the currently allocated bit width values, which is not limited by the embodiments of the present application.
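  • the two mechanisms above (mapping a continuous action to the nearest allowed bit width, then reducing bit widths one position at a time until the budget is met) can be sketched as follows; the allowed bit width set and the resource-estimation callback are hypothetical placeholders.

```python
# Discretize a continuous action in [0, 1] to the nearest allowed bit width, then greedily
# shrink bit widths until the estimated resource usage fits the accelerator budget.
Q_ALLOWED = [2, 4, 8, 16, 32]   # preset quantization bit width values
Q_MIN, Q_MAX = 2, 32

def discretize(action: float) -> int:
    target = round(Q_MIN + action * (Q_MAX - Q_MIN))
    return min(Q_ALLOWED, key=lambda q: abs(q - target))

def enforce_budget(bit_widths, estimate_resources, budget):
    # estimate_resources(bit_widths) -> (memory, compute, latency); hypothetical callback
    bit_widths = list(bit_widths)
    position = 0
    while any(used > limit for used, limit in zip(estimate_resources(bit_widths), budget)):
        idx = position % len(bit_widths)
        lower = [q for q in Q_ALLOWED if q < bit_widths[idx]]
        if lower:
            bit_widths[idx] = max(lower)          # reduce this entry by one allowed step
        position += 1
        if position > 100 * len(bit_widths):      # safety stop for the sketch
            break
    return bit_widths

print(discretize(0.4))   # 2 + 0.4 * 30 = 14, whose nearest allowed value is 16
```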
  • Behavior strategy β is a random process generated according to the strategy of the current actor module and random OU (Ornstein-Uhlenbeck) noise N_τ, and its specific process can be:
  • the actor module is used to select continuous actions according to the Behavior strategy, that is, a_τ = μ(O_τ | θ^μ) + N_τ;
  • N_τ represents the random OU noise corresponding to the τ-th time step;
  • O_τ represents the historical state vector corresponding to the τ-th time step;
  • μ represents the online actor network in the actor module;
  • θ^μ represents the online actor network parameters.
  • a strategy of an actor module can be specifically represented by a specific model parameter in the module. In other words, updating the strategy of an actor module is actually updating the parameters of the module.
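  • a minimal sketch of this Behavior strategy (deterministic actor output plus OU exploration noise, clipped to the valid action range) is given below; the actor is represented by an arbitrary callable rather than the actual online actor network of this application.

```python
import random

# Ornstein-Uhlenbeck noise process commonly used for exploration in DDPG.
class OUNoise:
    def __init__(self, theta=0.15, sigma=0.2, mu=0.0):
        self.theta, self.sigma, self.mu = theta, sigma, mu
        self.x = 0.0

    def sample(self):
        self.x += self.theta * (self.mu - self.x) + self.sigma * random.gauss(0.0, 1.0)
        return self.x

def behavior_action(actor, state, noise, low=0.0, high=1.0):
    # a_tau = mu(O_tau | theta_mu) + N_tau, kept inside the valid action range
    a = actor(state) + noise.sample()
    return max(low, min(high, a))

# usage with a dummy actor that ignores the state
print(behavior_action(lambda state: 0.5, [0.9, 0.1, 0.2], OUNoise()))
```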
  • S340 Use an action sequence to quantize and compress vertex features and graph neural networks in graph data, and send the obtained quantized graph data and quantized graph neural network to a hardware accelerator, so that the hardware accelerator uses the quantized graph data to train the quantized graph neural network and determines the current accuracy of the trained quantized graph neural network to perform a specified task.
  • acc_origin is the benchmark accuracy corresponding to the original graph neural network after it has been trained with the original training set
  • acc_quant is the accuracy of the quantized graph neural network after fine-tuning
  • the reward scales the difference between these two accuracies by a scaling factor, whose value can preferably be 0.1.
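  • assuming the reward is simply this scaled accuracy difference, it can be written as follows; the symbol name lam for the scaling factor is illustrative.

```python
# Reward as the scaled accuracy difference between the quantized and original networks
# (assumed form; the preferred value of the scaling factor given above is 0.1).
def reward(acc_quant: float, acc_origin: float, lam: float = 0.1) -> float:
    return lam * (acc_quant - acc_origin)

print(reward(acc_quant=0.91, acc_origin=0.93))   # approximately -0.002
```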
  • the embodiments of the present application do not limit the specific process of training the actor module and the critic module, and reference may be made to the introduction in the subsequent embodiments.
  • the significance of training is to update the model parameters of the actor module so that it can adopt a new strategy to determine the next action.
  • the embodiments of the present application do not limit the specific preset value, and can be set according to actual application requirements. It is understandable that the larger the preset value, the stronger the agent's perception of the environment, and the more appropriate the optimal interval quantization bit width and optimal network quantization bit width generated by it, but the corresponding calculation time is longer and the calculation amount is larger, so the preset upper limit corresponding to the number of strategies can be set as needed after balancing the accuracy and computing resource consumption.
  • the hardware accelerator processes the quantized graph data and the quantized graph neural network.
  • the main processing content is to train the quantized graph neural network using the quantized graph data, and the training process can be optimized in a variety of ways, such as full-batch, mini-batch or single-sample stochastic gradient descent (SGD) strategies.
  • in order to improve training efficiency, the hardware accelerator can use the mini-batch stochastic gradient descent method to optimize the training process of the quantized graph neural network.
  • the hardware accelerator uses quantized graph data to train the quantized graph neural network, which may include:
  • the hardware accelerator trains the quantized graph neural network using the quantized graph data based on the mini-batch stochastic gradient descent method.
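  • a generic mini-batch SGD loop of the kind described above is sketched below using PyTorch-style calls purely for illustration; it omits the graph message-passing details and is not the accelerator implementation of this application.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Generic mini-batch stochastic gradient descent loop (illustration only).
def train_minibatch_sgd(model, features, labels, epochs=5, batch_size=64, lr=0.01):
    loader = DataLoader(TensorDataset(features, labels), batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x_batch, y_batch in loader:      # each mini-batch contains n_b samples
            optimizer.zero_grad()
            loss = loss_fn(model(x_batch), y_batch)
            loss.backward()
            optimizer.step()
    return model
```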
  • determining the memory usage, computation amount and delay amount corresponding to the updated action sequence includes:
  • store_MB represents the memory usage;
  • n_b represents the number of graph vertices in a single mini-batch;
  • f_l represents the vertex dimension value corresponding to the l-th network layer of the quantized graph neural network;
  • L represents the number of all network layers of the quantized graph neural network;
  • q_max represents the maximum value of the interval quantization bit width assigned to all graph vertices in a single mini-batch;
  • S represents the total number of convolution kernels;
  • q_W and q_F represent the network quantization bit widths corresponding to the weight matrix and the convolution kernel of each network layer of the quantized graph neural network, respectively;
  • compute_MB represents the amount of calculation;
  • q_a represents the network quantization bit width corresponding to the activation matrix of each network layer of the quantized graph neural network;
  • MAC_l represents the total number of multiplication and accumulation operations of the l-th layer of the quantized graph neural network;
  • latency_MB represents the delay;
  • Λ_l represents the delay of the l-th network layer of the quantized graph neural network in processing a mini-batch of graph data.
  • Memory_limit, BOPS_limit and Latency_limit can be used to represent the memory usage threshold, computation amount threshold and latency threshold, where Memory_limit is the storage capacity that the hardware acceleration device can provide, BOPS_limit represents the upper limit of the total number of bit operations that the hardware accelerator can provide per second, and Latency_limit refers to the upper limit of the processing latency that the hardware accelerator allows;
  • Memory_limit, BOPS_limit and Latency_limit are all determined by the characteristics of the hardware accelerator itself and can be obtained directly or through measurement.
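  • the exact estimation formulas are given in the original publication; the sketch below only illustrates how such estimates would be compared against Memory_limit, BOPS_limit and Latency_limit, using simplified stand-in estimates (per-layer feature storage, bit operations and latency) that are assumptions rather than the patent's formulas.

```python
# Simplified, assumed resource estimates for a quantized GNN processing one mini-batch,
# compared against the accelerator budget (Memory_limit, BOPS_limit, Latency_limit).
def within_budget(n_b, f, q_feat_max, q_w, q_a, macs, layer_latency,
                  memory_limit, bops_limit, latency_limit):
    # f[l]: feature dimension of layer l; macs[l]: MACs of layer l; layer_latency[l]: its delay
    store_mb = n_b * sum(f_l * q_feat_max for f_l in f) / 8   # bits -> bytes (assumed form)
    compute_mb = sum(m * q_w * q_a for m in macs)             # bit operations (assumed form)
    latency_mb = sum(layer_latency)                           # accumulated per-layer delay
    return (store_mb <= memory_limit and
            compute_mb <= bops_limit and
            latency_mb <= latency_limit)

print(within_budget(n_b=256, f=[64, 64, 16], q_feat_max=8, q_w=4, q_a=4,
                    macs=[2_000_000, 2_000_000, 500_000], layer_latency=[0.8, 0.8, 0.3],
                    memory_limit=2**20, bops_limit=10**9, latency_limit=2.5))   # True
```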
  • the quantization compression of vertex features in the graph data using an action sequence may include:
  • quantize(·) represents the quantization compression function;
  • round(·) represents the rounding function;
  • clip(x, y) represents the truncation function used to truncate x to [-y, y] (y > 0);
  • X_{i,:} represents the vertex feature;
  • X_{i,:}(j) (j ∈ [1, f_0]) represents the j-th component of the vertex feature;
  • s represents the scaling factor, s = c / (2^q - 1);
  • q represents the interval quantization bit width corresponding, in the action sequence, to the degree of the graph vertex to which X_{i,:} belongs.
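  • under the definitions above, the clip-round-rescale quantization of a single vertex feature can be sketched as follows; the exact formula is given in the original publication, so this is an assumed but conventional instantiation of quantize(·).

```python
# Linear quantization of a vertex feature vector: clip to [-c, c], scale by s = c / (2**q - 1),
# round to the integer grid, and map back to floating point.
def quantize_feature(x, c, q):
    s = c / (2 ** q - 1)                        # scaling factor
    clipped = [max(-c, min(c, v)) for v in x]   # clip(x, c): truncate to [-c, c]
    return [round(v / s) * s for v in clipped]

print(quantize_feature([0.03, -1.7, 0.45, 2.9], c=2.0, q=4))
```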
  • the embodiment of the present application also designs a method based on minimizing the distribution distance of data features before and after quantization to determine the appropriate c value.
  • before using the action sequence to quantize and compress the vertex features in the graph data, the method can also include:
  • the argmin(·) function is used to select the value of x that minimizes the KL divergence D_KL between the feature distribution of X_{i,:} before quantization and its distribution after quantization.
  • the embodiment of the present application does not limit the calculation method of KL divergence (Kullback-Leibler divergence).
  • KL divergence Kullback-Leibler divergence
  • other methods can also be used to determine the distance between the above two feature distributions.
  • JS distance Joint-Shannon Divergence
  • Mutual Information Mutual Information
  • the embodiment of the present application also does not limit the specific acquisition method of the above feature distribution data.
  • the maximum value, minimum value, mean, and variance can be directly obtained through the target data; the sharpness and kurtosis are obtained by constructing a histogram of the target data.
  • the embodiment of the present application performs a similar quantization on the graph convolution kernels (if any), weights and activations (if any) of the different layers of the graph neural network. The difference is that activations are truncated to the range [0, c] instead of [-c, c], because the activation values (i.e., the outputs of the ReLU (rectified linear unit) layer) are non-negative; a minimal sketch of the feature quantization follows.
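A minimal sketch of this feature quantization in Python/NumPy is given below. The candidate grid for the truncation value c and the use of 64-bin histograms to compare the feature distributions are assumptions made for illustration; the text only requires that c minimize a distribution distance such as the KL divergence.

    import numpy as np

    def quantize_feature(x, q, c):
        # quantize(x, q, c) = round(clip(x, -c, c) / s) * s, with s = c / (2**q - 1)
        s = c / (2 ** q - 1)
        return np.round(np.clip(x, -c, c) / s) * s

    def kl_divergence(p_hist, q_hist, eps=1e-12):
        # KL divergence between two histogram-based feature distributions.
        p = p_hist / (p_hist.sum() + eps) + eps
        q = q_hist / (q_hist.sum() + eps) + eps
        return float(np.sum(p * np.log(p / q)))

    def choose_clip_value(x, q, candidates):
        # Pick the c that minimizes the KL divergence between the feature
        # distribution before quantization and after quantization.
        bins = np.histogram_bin_edges(x, bins=64)
        ref_hist, _ = np.histogram(x, bins=bins)
        best_c, best_d = candidates[0], float("inf")
        for c in candidates:
            quant_hist, _ = np.histogram(quantize_feature(x, q, c), bins=bins)
            d = kl_divergence(ref_hist, quant_hist)
            if d < best_d:
                best_c, best_d = c, d
        return best_c

For activations, the same sketch would clip to [0, c] (e.g. np.clip(x, 0, c)) instead of [-c, c], since ReLU outputs are non-negative.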
  • the Actor-Critic framework consists of an Actor (also called the policy network μ) and a Critic (also called the Q network or value network).
  • the Actor is responsible for interacting with the environment and learning a better strategy using the policy gradient method under the guidance of the Critic value function
  • the task of the Critic is to use the collected data of the Actor's interaction with the environment to learn a value function Q, which is used to judge the quality of the current state-action pair, and then assist the Actor in updating the strategy.
  • Both the Actor and the Critic contain two networks, one called online and the other called target.
  • the DDPG algorithm uses the technique of freezing the target network: the online network parameters are updated in real time, while the target network parameters are temporarily frozen. When the target network is frozen, the online network is allowed to try and explore.
  • the target network summarizes experience based on the samples generated by the online network, and then takes action and assigns the parameters of the online network to the target network.
  • the DDPG algorithm also uses the experience replay mechanism to remove data correlation and improve sample utilization efficiency.
  • the specific approach is to maintain an experience replay pool, store the transition quadruple (state, action, reward, next state) sampled from the environment each time into the pool, and randomly sample some data from the replay buffer when training the policy network and the Q network. This serves two purposes: (1) it makes the samples satisfy the independence assumption, because experience replay breaks the correlation between consecutive samples; (2) it improves sample utilization. A minimal replay-pool sketch is given below.
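A minimal replay-pool sketch (Python standard library only) is shown below; the capacity value is an illustrative assumption.

    import random
    from collections import deque

    class ReplayBuffer:
        # Stores (state, action, reward, next_state) quadruples and samples them
        # uniformly at random to break the correlation between consecutive steps.
        def __init__(self, capacity=10000):
            self.buffer = deque(maxlen=capacity)

        def push(self, state, action, reward, next_state):
            self.buffer.append((state, action, reward, next_state))

        def sample(self, batch_size):
            return random.sample(self.buffer, batch_size)

        def __len__(self):
            return len(self.buffer)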
  • the functions of the four networks of the DDPG agent are as follows:
  • Online Actor Network: responsible for iteratively updating the policy network parameters θ_μ, for selecting the current optimal action a_τ according to the current environment state O_τ, and for interacting with the environment to generate the next state O_τ+1 and the reward r;
  • Target Actor Network: responsible for selecting the next optimal action a_τ+1 according to the next state O_τ+1 sampled from the experience replay pool, and for regularly updating the parameters θ_μ of the Online Actor to the parameters θ_μ′ of the Target Actor Network through the exponential moving average method;
  • Online Critic Network: responsible for iteratively updating the value network parameters θ_Q, for calculating the online Q value Q(O_τ, a_τ | θ_Q) of the current state-action pair, and for calculating the estimate of the Target Critic Network's output;
  • Target Critic Network: responsible for calculating Q′(O_τ+1, a_τ+1 | θ_Q′) in the estimate of the Target Critic Network's output, and for regularly updating the parameters θ_Q of the Online Critic to the parameters θ_Q′ of the Target Critic Network through the exponential moving average method.
  • the actor module includes an online actor network and a target actor network
  • the critic module includes an online critic network and a target critic network
  • initializing the agent used in reinforcement learning may include:
  • S311: Initialize the online actor network parameters of the online actor network, and set the target actor network parameters of the target actor network and the online actor network parameters to the same value;
  • S312: Initialize the online critic network parameters of the online critic network, and set the target critic network parameters of the target critic network and the online critic network parameters to the same value.
  • training the actor module and the critic module using the transformed data may include:
  • loss_Q represents the loss function
  • a_τ represents the continuous action
  • O_τ represents the historical state vector corresponding to the τ-th time step
  • Q represents the online critic network
  • θ_Q represents the online critic network parameters
  • N represents the preset number
  • Q′ represents the target critic network
  • θ_Q′ represents the target critic network parameters
  • μ′ represents the target actor network
  • θ_μ′ represents the target actor network parameters
  • O_τ+1 represents the current state vector corresponding to the τ-th time step
  • the goal of the embodiment of the present application is to find an optimal policy network parameter such that, when the DDPG agent acts according to the optimal policy corresponding to this parameter, the expected cumulative reward generated in the environment is maximized.
  • to evaluate a policy μ, this application defines an objective function J called the performance objective:
  • Q(O, μ(O)) refers to the Q value generated in each state O if the action μ(O) is selected according to the policy μ.
  • J is the expected value of Q(O, μ(O)) when the environment state O obeys the distribution with distribution function ρ^β.
  • the gradient of the objective function with respect to the policy network parameters θ_μ (referred to as the policy gradient) can be calculated by the following formula:
  • the calculation of the policy gradient uses the chain rule: the derivative is first taken with respect to the action a, and then with respect to the policy network parameters θ_μ. The function Q is then maximized by gradient ascent to obtain the action with the largest value.
  • the Monte-Carlo method can be used to estimate the above expected value.
  • the state transition T_τ = (O_τ, a_τ, r_τ, O_τ+1) is stored in the experience replay pool P, where a_τ is generated by the DDPG agent according to the Behavior strategy β and is converted into a discrete action value using the method provided in the above embodiment.
  • N conversion data are randomly sampled from the experience replay pool P to form a single batch
  • a single batch of data can be substituted into the above policy gradient formula, which can be used as an unbiased estimate of the above expected value, so the policy gradient can be rewritten as:
  • S376: Update the target critic network parameters and the target actor network parameters using the updated online critic network parameters and online actor network parameters through a soft (moving-average) update, where α is a preset value; a compact sketch of one full DDPG update step is given below.
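The sketch below (Python/PyTorch) shows one such DDPG update on a sampled mini-batch: the critic is regressed onto the target y = r + γ·Q′(O_τ+1, μ′(O_τ+1)), the actor is updated by ascending Q(O, μ(O)) via the chain rule, and the target networks are soft-updated with coefficient α. The concrete network classes, optimizers and tensor shapes are assumptions made for illustration.

    import torch
    import torch.nn.functional as F

    def ddpg_update(batch, actor, critic, target_actor, target_critic,
                    actor_opt, critic_opt, gamma=0.99, alpha=0.01):
        states, actions, rewards, next_states = batch

        # Critic: minimize (Q(O, a) - y)^2 with y = r + gamma * Q'(O', mu'(O')).
        with torch.no_grad():
            y = rewards + gamma * target_critic(next_states, target_actor(next_states))
        critic_loss = F.mse_loss(critic(states, actions), y)
        critic_opt.zero_grad()
        critic_loss.backward()
        critic_opt.step()

        # Actor: gradient ascent on Q(O, mu(O)), i.e. descent on its negative.
        actor_loss = -critic(states, actor(states)).mean()
        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()

        # Soft update of the frozen target networks: theta' <- alpha*theta + (1-alpha)*theta'.
        for net, target in ((critic, target_critic), (actor, target_actor)):
            for p, tp in zip(net.parameters(), target.parameters()):
                tp.data.mul_(1.0 - alpha).add_(alpha * p.data)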
  • a Graph Convolutional Network (GCN) is selected as the graph neural network.
  • a graph dataset is constructed from Pubmed (an abstract database), the graph learning task is chosen as vertex classification, and the objective function and evaluation criteria matching the learning task are then designed. A GNN instance containing L graph convolutional layers is constructed, and the GNN model is trained on the host computer using the CPU or GPU with the mini-batch stochastic gradient descent method, yielding a trained floating-point GNN model.
  • the graph data and the trained floating-point GNN model are the objects to be quantified in this application.
  • the host computer uses the quantization bit widths specified by a′_τ to quantize the features of all graph vertices and the graph convolution kernels (if any), weights and activations (if any) of all layers of the GNN, using a quantization method based on minimizing the distance between the data feature distributions before and after quantization.
  • the quantized graph vertex feature data and GNN model are obtained, and the latter is mapped to the hardware accelerator;
  • the hardware accelerator reads the quantized graph vertex features and the adjacency matrix from the host computer, trains the GNN model using the mini-batch stochastic gradient descent method, tests its classification accuracy, calculates the value of the reward function r_τ, and outputs O_τ+1; r_τ and O_τ+1 are returned to the host computer;
  • the host computer updates r_best and a_best: it compares the returned r_τ with r_best, and if r_τ > r_best, then r_best ← r_τ and a_best ← a′_τ.
  • the host computer randomly samples N transition data from the experience replay pool P as a batch training data for the online Actor and online Critic networks.
  • the host computer updates the gradients of the online Actor network and the online Critic network: it computes the gradient of loss_Q with respect to the online Critic network parameters θ_Q and the policy gradient with respect to the online Actor network parameters θ_μ; the Adam optimizer is used to update the online Critic network parameters θ_Q and the online Actor network parameters θ_μ;
  • the host computer soft-updates the parameters of the target Actor network and the target Critic network: using the moving average method, the corresponding online network parameters of the two are soft-updated to the target network parameters:
  • the hardware accelerator retrains the quantized model for one epoch based on a_best to restore the performance, and obtains the final fixed-point GNN quantized model and quantized graph vertex feature data.
  • the following is an introduction to the graph neural network compression device, electronic device and non-volatile readable storage medium provided in the embodiments of the present application.
  • the graph neural network compression device, electronic device and non-volatile readable storage medium described below can be referenced to each other with the graph neural network compression method described above.
  • FIG 4 is a structural block diagram of a graph neural network compression device provided in an embodiment of the present application.
  • the device may include:
  • An acquisition module 401 is used to acquire a trained graph neural network and graph data used in its training;
  • An interval determination module 402 is used to determine the degree distribution range corresponding to all graph vertices in the graph data, and divide the degree distribution range into a plurality of degree intervals;
  • a quantization bit width determination module 403 is used to determine the optimal interval quantization bit width corresponding to each degree interval and the optimal network quantization bit width corresponding to the graph neural network by using reinforcement learning and hardware accelerator under the constraints of preset resource constraints;
  • the quantization compression module 404 is used to quantize and compress the vertex features of the graph vertices of corresponding degrees in the graph data using the optimal interval quantization bit width, and to quantize and compress the graph neural network using the optimal network quantization bit width, so as to obtain the optimal quantized graph data and the optimal quantized graph neural network.
  • the interval determination module 402 may include:
  • the arrangement submodule is used to arrange all graph vertices in the graph data from small to large degrees to obtain a graph vertex sequence
  • the partitioning submodule is used to partition the degree distribution range using the graph vertex sequence to obtain multiple degree intervals; the number of graph vertices contained in each degree interval is the same or the difference is less than a preset threshold.
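A minimal sketch of this equal-count partitioning is given below (Python/NumPy). How ties at interval boundaries are resolved is an implementation choice not fixed by the text; here vertices are simply split by their position in the degree-sorted order.

    import numpy as np

    def split_by_degree(degrees, k):
        # Sort vertices by degree and split them into k groups of (nearly) equal size.
        degrees = np.asarray(degrees)
        order = np.argsort(degrees)
        chunks = np.array_split(order, k)
        split_points = [int(degrees[c[-1]]) for c in chunks[:-1]]   # interval upper bounds
        interval_of = np.empty(len(degrees), dtype=int)
        for j, c in enumerate(chunks):
            interval_of[c] = j   # interval index used to look up the assigned bit width
        return split_points, interval_of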
  • the device may further include:
  • the training module is used to train the optimal quantized graph neural network using the optimal quantized graph data to obtain a fine-tuned quantized graph neural network, so as to deploy the fine-tuned quantized graph neural network to the external service equipment.
  • the timing structure of the hardware accelerator is a reconfigurable Bit-Serial Matrix Multiplication Overlay (BISMO)
  • the spatial structure is a BitFusion architecture
  • the quantization bit width determination module 403 includes:
  • the initialization submodule is used to obtain the benchmark accuracy of the graph neural network for executing the specified task and initialize the agent and historical reward values used in reinforcement learning;
  • the agent includes the actor module and the critic module;
  • the first setting submodule is used to set the number of strategies to 1 and initialize the action sequence and the historical state vector; the action sequence is used to save the interval quantization bit width corresponding to each degree interval and the network quantization bit width corresponding to the graph neural network; the state vector is used to record the memory usage and computation amount of the quantized graph neural network when processing quantized graph data, as well as the accuracy when performing the specified task;
  • the second setting submodule is used to set the time step to 1, and under the constraint of the preset resource constraint condition, use the actor module to determine the continuous action, use the continuous action to numerically update the action sequence, and determine the memory usage and calculation amount corresponding to the action sequence after the update;
  • a compression and training submodule is used to quantize and compress vertex features and graph neural networks in graph data using action sequences, and send the obtained quantized graph data and quantized graph neural networks to a hardware accelerator, so that the hardware accelerator trains the quantized graph neural network using the quantized graph data, and determines the current accuracy of the trained quantized graph neural network for performing a specified task;
  • a calculation submodule for determining a current state vector using a memory usage, a calculation amount, and an accuracy corresponding to an action sequence, and determining a reward value using a baseline accuracy and a current accuracy
  • an updating submodule for updating the historical reward value with the reward value when it is determined that the reward value is greater than the historical reward value, and updating the optimal interval quantization bit width and the optimal network quantization bit width with the updated action sequence;
  • An agent training submodule for generating conversion data using historical state vectors, continuous actions, reward values, and current state vectors, and using the conversion data to train the actor module and the critic module, so that the critic module updates the strategy used by the actor module when performing numerical updates;
  • the third setting submodule is used for, when it is determined that the time step has not reached the length of the action sequence, adding 1 to the time step, updating the historical state vector using the current state vector, and entering the step of using the actor module, under the constraint of the preset resource restriction conditions, to numerically update the action sequence for the new time step;
  • the fourth setting submodule is used to add 1 to the strategy number when it is determined that the time step reaches the length of the action sequence and the strategy number does not reach the preset value, and enter the step of initializing the action sequence and the historical state vector;
  • the output submodule is used to output the optimal interval quantization bit width and the optimal network quantization bit width when the number of determined strategies reaches a preset value.
  • the second setting submodule may include:
  • the discrete action determination unit is used to select a continuous action according to the Behavior strategy using the actor module, and to discretize the continuous action to obtain a discrete action value in the following way:
  • a′_τ(i) = argmin_{q∈Q} |q − round(q_min − 0.5 + a_τ(i) × (q_max − q_min + 1))|
  • a_τ(i) represents the continuous action corresponding to the i-th quantization bit width in the action sequence of the τ-th time step
  • a′_τ(i) represents the discrete action value corresponding to a_τ(i)
  • Q contains multiple preset quantization bit width values
  • round(·) represents the rounding function
  • q_min and q_max represent the preset minimum quantization bit width and maximum quantization bit width
  • the argmin(·) function is used to select the target preset quantization bit width value q in Q that minimizes |q − round(q_min − 0.5 + a_τ(i) × (q_max − q_min + 1))|;
  • An updating unit used to update the action sequence numerically using the action value, determine the memory usage, calculation amount and delay amount corresponding to the updated action sequence, and determine whether the memory usage, calculation amount and delay amount meet the limits of the preset resource restriction condition;
  • the first processing unit is used to enter the step of using the action sequence to quantize and compress the vertex features and the graph neural network in the graph data if the memory usage, computation amount and delay amount meet the limits of the preset resource restriction conditions;
  • the second processing unit is used to reduce the quantization bit widths in the action sequence in sequence according to a preset order if the memory usage, computation amount and delay amount do not meet the preset resource constraints, so as to update the action sequence again, and to enter the step of determining the memory usage, computation amount and delay amount corresponding to the updated action sequence each time a reduction is completed; a small sketch of the action discretization used by this submodule is given below.
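A small sketch of the continuous-to-discrete action mapping is given below (Python); the preset set Q = {2, 4, 8, 16, 32} with q_min = 2 and q_max = 32 follows the example values given elsewhere in the text.

    def discretize_action(a, q_values=(2, 4, 8, 16, 32), q_min=2, q_max=32):
        # a in [0, 1]; pick the preset bit width closest to
        # round(q_min - 0.5 + a * (q_max - q_min + 1)).
        target = round(q_min - 0.5 + a * (q_max - q_min + 1))
        return min(q_values, key=lambda q: abs(q - target))

    # Example: a = 0.1 gives target = round(4.6) = 5, whose nearest preset bit width is 4.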
  • the discrete action determination unit may include a continuous action determination subunit, used to select the continuous action according to the Behavior strategy using the actor module in the following way:
  • a_τ = μ(O_τ | θ_μ) + N_τ
  • N_τ represents the random UO noise corresponding to the τ-th time step
  • O_τ represents the historical state vector corresponding to the τ-th time step
  • μ represents the online actor network in the actor module
  • θ_μ represents the online actor network parameters.
  • the compression and training submodule may include:
  • a hardware accelerator unit is used for training a quantized graph neural network using quantized graph data based on a mini-batch stochastic gradient descent method.
  • the updating unit may include:
  • the first calculation subunit is used to calculate the memory usage using the following formula:
  • store_MB represents the memory usage
  • n_b represents the number of graph vertices in a single mini-batch
  • f_l represents the vertex feature dimension corresponding to the l-th network layer of the quantized graph neural network
  • L represents the number of network layers of the quantized graph neural network
  • q_max represents the maximum interval quantization bit width assigned to the graph vertices in a single mini-batch
  • S represents the total number of convolution kernels
  • q_W and q_F represent the network quantization bit widths corresponding to the weight matrices and the convolution kernels of each network layer of the quantized graph neural network, respectively;
  • the second calculation subunit is used to calculate the calculation amount using the following formula:
  • compute_MB represents the amount of computation
  • q_σ represents the network quantization bit width corresponding to the activation matrix of each network layer of the quantized graph neural network
  • MAC_l represents the total number of multiply-accumulate operations of the l-th layer of the quantized graph neural network
  • the third calculation subunit is used to calculate the delay amount using the following formula:
  • latency_MB represents the delay
  • Λ_l represents the delay of the l-th network layer of the quantized graph neural network when processing a mini-batch of graph data.
  • the compression and training submodule includes:
  • a compression unit, used to truncate the vertex feature of each graph vertex in the graph data to the range [-c, c] (c>0), and to quantize the truncated vertex feature using the interval quantization bits corresponding, in the action sequence, to the degree of the graph vertex:
  • quantize(X_i,:(j), a′_τ, c) = round(clip(X_i,:(j), c)/s) × s
  • quantize(·) represents the quantization compression function
  • round(·) represents the rounding function
  • clip(x, y) represents the truncation function, used to truncate x to [-y, y] (y>0)
  • X_i,: represents the vertex feature
  • X_i,:(j) (j ∈ [1, f_0]) represents the j-th component of the vertex feature
  • s represents the scaling factor, with s = c/(2^q − 1)
  • q represents the interval quantization bit width corresponding, in the action sequence, to the degree of the graph vertex to which X_i,: belongs.
  • the compression and training submodule further includes:
  • the cutoff value determination unit is used to determine the value of c in the following manner:
  • c = argmin_x D_KL(X_i,: || quantize(X_i,:, a′_τ, x))
  • the argmin(·) function is used to select the value of x that minimizes D_KL(X_i,: || quantize(X_i,:, a′_τ, x)), which denotes the KL divergence between the feature distribution of X_i,: and the feature distribution of quantize(X_i,:, a′_τ, x); the feature distribution is the maximum, minimum, mean, variance, skewness or kurtosis.
  • the actor module includes an online actor network and a target actor network
  • the critic module includes an online critic network and a target critic network
  • the initialization submodule includes:
  • a first initialization unit configured to initialize online actor network parameters of the online actor network, and to set target actor network parameters of the target actor network and online actor network parameters to the same value;
  • the second initialization unit is used to initialize the online critic network parameters of the online critic network, and set the target critic network parameters of the target critic network and the online critic network parameters to the same value.
  • the agent training submodule may include:
  • a training data extraction unit used to add the conversion data to the experience replay pool, and randomly sample a preset number of conversion data from the experience replay pool as training data;
  • a first gradient calculation unit for determining a first gradient of an online critic network parameter using the training data, the target actor network, the target critic network, the online critic network, and the following loss function;
  • loss_Q represents the loss function
  • a_τ represents the continuous action
  • O_τ represents the historical state vector corresponding to the τ-th time step
  • Q represents the online critic network
  • θ_Q represents the online critic network parameters
  • N represents the preset number
  • Q′ represents the target critic network
  • θ_Q′ represents the target critic network parameters
  • μ′ represents the target actor network
  • θ_μ′ represents the target actor network parameters
  • O_τ+1 represents the current state vector corresponding to the τ-th time step
  • a first updating unit configured to update the online critic network parameters according to the first gradient
  • a second gradient calculation unit is used to determine the performance target using the training data, the updated online critic network, the online actor network and the objective function, and to determine the second gradient of the performance target with respect to the online actor network parameters:
  • a second updating unit configured to update the online actor network parameters based on the second gradient
  • the third updating unit is used to update the target critic network parameters and the target actor network parameters using the updated online critic network parameters and the online actor network parameters through a soft (moving-average) update, where α is a preset value.
  • FIG. 5 is a structural block diagram of an electronic device provided in an embodiment of the present application.
  • the embodiment of the present application further provides an electronic device, including:
  • Memory 501 used for storing computer programs
  • Processor 502 is used to implement the steps of the graph neural network compression method as described above when executing a computer program.
  • Figure 6 is a structural block diagram of a non-volatile readable storage medium provided in an embodiment of the present application.
  • the embodiment of the present application also provides a non-volatile readable storage medium, and a computer program is stored on the non-volatile readable storage medium 601.
  • when the computer program is executed by a processor, the steps of the graph neural network compression method of any of the above embodiments are implemented.
  • since the embodiments of the non-volatile readable storage medium part correspond to the embodiments of the graph neural network compression method part, please refer to the description of the method embodiments for the storage medium embodiments; details are not repeated here.
  • the steps of the method or algorithm described in conjunction with the embodiments disclosed herein may be implemented directly using hardware, a software module executed by a processor, or a combination of the two.
  • the software module may be placed in a random access memory (RAM), a memory, a read-only memory (ROM), an electrically programmable ROM, an electrically erasable programmable ROM, a register, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.


Abstract

This application provides a graph neural network compression method, apparatus, electronic device and storage medium, relating to the field of neural networks. The method includes: obtaining a trained graph neural network and the graph data used in its training; determining the degree distribution range corresponding to all graph vertices in the graph data, and dividing the degree distribution range into multiple degree intervals; under the constraint of preset resource limit conditions, using reinforcement learning and a hardware accelerator to determine the optimal interval quantization bit width corresponding to each degree interval and the optimal network quantization bit width corresponding to the graph neural network; quantizing and compressing the vertex features of graph vertices of the corresponding degrees in the graph data using the optimal interval quantization bit widths, and quantizing and compressing the graph neural network using the optimal network quantization bit width, to obtain optimal quantized graph data and an optimal quantized graph neural network. Reinforcement learning is used to determine the optimal quantization bit widths for the graph neural network and the graph vertex features, ensuring that the quantized graph neural network has both high accuracy and a low resource consumption rate.

Description

一种图神经网络压缩方法、装置、电子设备及存储介质
相关申请的交叉引用
本申请要求于2022年10月24日提交中国专利局,申请号为202211299256.8,申请名称为“一种图神经网络压缩方法、装置、电子设备及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及神经网络领域,特别涉及一种图神经网络压缩方法、装置、电子设备及存储介质。
背景技术
近年来,图神经网络(Graph Neural Network,GNN)因其能够对不规则结构数据进行建模而受到大量关注。GNN被广泛用于基于图的顶点分类、分子相互作用、社交网络、推荐系统或程序理解等各个领域。尽管GNN模型通常参数很少,但由于每个应用程序存储和计算需求与输入图数据的大小紧密相关,导致GNN具有高内存占用和高计算量(表现为训练或者推理时间长)的特点。该特点使得GNN无法有效地应用于绝大多数资源受限的设备,例如嵌入式系统和物联网设备。这种尴尬局面的背后有两个主要原因。首先,GNN的输入由两种类型数据组成,图结构(边列表)和顶点特征(嵌入)。当图规模变大时,很容易导致其存储大小急剧增加。这将使得那些具有非常有限内存预算的小型设备面临巨大压力。其次,更大规模的图数据需要更多的数据操作(例如,加法和乘法)和数据移动(例如,内存事务),它们将消耗大量能量并耗尽这些微型设备上有限功耗预算。
为应对上述挑战,量化压缩可以作为资源受限设备的“一石二鸟”解决方案出现,它可以:(1)有效地减少顶点特征的内存大小,从而降低内存使用;(2)最小化操作数大小可以减少功耗。然而,相关的量化方法存在以下两个问题:(1)对所有数据选择简单但激进的统一量化,以最小化内存和功耗成本,从而导致高精度损失;(2)选择一个非常保守的量化来保持准确性,这会导致次优的内存和节能性能;(3)忽略了不同的硬件架构,以统一的方式量化GNN所有层。
发明内容
本申请的目的是提供一种图神经网络压缩方法、装置、电子设备及存储介质,可在预设资源限制条件的约束下,利用强化学习为图神经网络和图数据中的顶点特征自动确定最优量化位宽,以确保得到的量化图神经网络同时具有较高精度及较低资源消耗率。
为解决上述技术问题,本申请提供一种图神经网络压缩方法,包括:
获取已训练的图神经网络及其训练时所使用的图数据;
确定图数据中所有图顶点对应的度数分布范围,并将度数分布范围划分为多个度数区间;
在预设资源限制条件的约束下,利用强化学习及硬件加速器确定各度数区间对应的最优区间量化位宽以及图神经网络对应的最优网络量化位宽;
利用最优区间量化位宽对图数据中对应度数的图顶点的顶点特征进行量化压缩,以及利用最优网络量化位宽对图神经网络进行量化压缩,得到最优量化图数据和最优量化图神经网 络。
在本申请一些实施例中,度数分布范围依照图顶点在该范围内的分布情况进行划分。
在本申请一些实施例中,图神经网络对应的最优网络量化位宽具体指图神经网络的图卷积核矩阵、权重矩阵及激活矩阵对应的最佳网络量化位宽。
在本申请一些实施例中,预设资源限制条件用于限制处理量化图数据及量化图神经网络所要耗费的计算资源。
在本申请一些实施例中,预设资源限制条件包含:计算量阈值、内存占用量阈值及延迟量阈值。
在本申请一些实施例中,确定图数据中所有图顶点对应的度数分布范围,并将度数分布范围划分为多个度数区间,包括:
将图数据中的所有图顶点按度数从小到大排列,得到图顶点序列;
利用图顶点序列对度数分布范围进行划分,得到多个度数区间;各度数区间包含的图顶点数量相同或差值小于预设阈值。
在本申请一些实施例中,在得到最优量化图数据和最优量化图神经网络之后,还包括:
利用最优量化图数据对最优量化图神经网络进行训练,得到微调量化图神经网络,以将微调量化图神经网络部署至对外服务设备中。
在本申请一些实施例中,硬件加速器的时序结构为可重构位串行矩阵乘法叠加,空间结构为BitFusion架构。
在本申请一些实施例中,在预设资源限制条件的约束下,利用强化学习及硬件加速器确定各度数区间对应的最优区间量化位宽以及图神经网络对应的最优网络量化位宽,包括:
获取图神经网络执行指定任务对应的基准准确度,并初始化强化学习所使用的智能体以及历史奖励值;智能体包括演员模块和评论家模块;
将策略次数设置为1,并初始化动作序列以及历史状态向量;动作序列用于保存各度数区间对应的区间量化位宽以及图神经网络对应的网络量化位宽;状态向量用于记录量化图神经网络在处理量化图数据时对应的内存占用量、计算量以及在执行指定任务时对应的准确度;
将时间步设置为1,并在预设资源限制条件的约束下,利用演员模块确定连续动作,利用连续动作对动作序列进行数值更新,并在更新后确定动作序列对应的内存占用量及计算量;
利用动作序列对图数据中的顶点特征和图神经网络进行量化压缩,并将得到的量化图数据和量化图神经网络发送至硬件加速器,以使硬件加速器利用量化图数据对量化图神经网络进行训练,并确定训练后的量化图神经网络执行指定任务对应的当前准确度;
利用动作序列对应的内存占用量、计算量和准确度确定当前状态向量,以及利用基准准确度和当前准确度确定奖励值;
在确定奖励值大于历史奖励值时,利用奖励值更新历史奖励值,并利用更新后的动作序列对最优区间量化位宽及最优网络量化位宽进行更新;
利用历史状态向量、连续动作、奖励值和当前状态向量生成转换数据,并利用转换数据对演员模块和评论家模块进行训练,以使评论家模块对演员模块在进行数值更新时所使用的策略进行更新;
当确定时间步未达到动作序列的长度时,对时间步加1,利用当前状态向量更新历史状态向量,并进入在预设资源限制条件的约束下,利用演员模块确定连续动作的步骤;
当确定时间步达到动作序列的长度且策略次数未达到预设值时,对策略次数加1,并进入初始化动作序列以及历史状态向量的步骤;
当确定策略次数达到预设值时,输出最优区间量化位宽及最优网络量化位宽。
在本申请一些实施例中,在预设资源限制条件的约束下,利用演员模块确定连续动作,利用连续动作对动作序列进行数值更新,并在更新后确定动作序列对应的内存占用量及计算量,包括:
利用演员模块根据Behavior策略选择连续动作,并通过如下方式将连续动作进行离散化,得到离散动作值:
其中,aτ(i)表示第τ个时间步的动作序列中的第i个量化位宽对应的连续动作,a′τ(i)表示与aτ(i)对应的离散动作值,Q包含多个预设量化位宽值,round(·)表示四舍五入函数,qmin和qmax表示预设的最小量化位宽和最大量化位宽,argmin(·)函数用于在Q中选择目标预设量化位宽值q,以使|q-round(qmin-0.5+aτ(i)×(qmax-qmin+1))|最小;
利用动作值对动作序列进行数值更新,确定更新后的动作序列对应的内存占用量、计算量及延迟量,并判断内存占用量、计算量及延迟量是否满足预设资源限制条件的限制;
若内存占用量、计算量及延迟量满足预设资源限制条件的限制,则进入利用动作序列对图数据中的顶点特征和图神经网络进行量化压缩的步骤;
若内存占用量、计算量及延迟量不满足预设资源限制条件的限制,则按照预设顺序依次对动作序列中的量化位宽进行减少,以再次更新动作序列,并在每次减少动作完成时进入确定更新后的动作序列对应的内存占用量、计算量及延迟量的步骤。
在本申请一些实施例中,利用演员模块根据Behavior策略选择连续动作,包括:
利用演员模块根据Behavior策略以如下方式选择连续动作:
aτ=μ(Oτμ)+Nτ
其中,Nτ表示第τ个时间步对应的随机UO噪声,Oτ表示第τ个时间步对应的历史状态向量,μ表示演员模块中的在线演员网络,θμ表示在线演员网络参数。
在本申请一些实施例中,硬件加速器利用量化图数据对量化图神经网络进行训练,包括:
硬件加速器基于小批量随机梯度下降法利用量化图数据对量化图神经网络进行训练。
在本申请一些实施例中,确定更新后的动作序列对应的内存占用量、计算量及延迟量,包括:
利用如下公式计算内存占用量:
其中,storeMB表示内存占用量,nb表示单个小批量内的图顶点个数,fl表示量化图神经网络第l个网络层对应的顶点维度值,L表示量化图神经网络所有网络层的数量,qmax表示单个小批量内的所有图顶点分配到的区间量化位宽中的最大值,S表示卷积核的总数,qW和qF分别表示量化图神经网络各网络层的权重矩阵和卷积核对应的网络量化位宽;
利用如下公式计算计算量:
其中,computeMB表示计算量,qσ表示量化图神经网络各网络层的激活矩阵对应的网络量化位宽,MACl表示量化图神经网络第l层的乘累加操作的总数;
利用如下公式计算延迟量:
其中,latencyMB表示延迟量,Λl表示量化图神经网络第l个网络层处理小批量图数据的延迟。
在本申请一些实施例中,利用动作序列对图数据中的顶点特征进行量化压缩,包括:
通过如下方式对图数据中各图顶点的顶点特征截断至[-c,c](c>0)范围内,并利用动作序列中与图顶点的度数对应的区间量化比特对截断后的顶点特征进行量化压缩:
quantize(Xi,:(j),a′τ,c)=round(clip(Xi,:(j),c)/s)×s;
其中,quantize(·)表示量化压缩函数,round(·)表示四舍五入函数,clip(x,y)表示截断函数,用于将x截断至[-y,y](y>0),Xi,:表示顶点特征,Xi,:(j)(j∈[1,f0])表示顶点特征中的第j个分量,S表示缩放因子,s=c/(2q-1),q表示动作序列中与Xi,:所属图顶点的度数对应的区间量化比特。
在本申请一些实施例中,在利用动作序列对图数据中的顶点特征进行量化压缩之前,还包括:
通过如下方式确定c值:
其中,argmin(·)函数用于在选择x值,以使DKL(Xi,:||quantize(Xi,:a′τ,x))最小,DKL(Xi,:||quantize(Xi,:a′τ,x))表示Xi,:的特征分布与quantize(Xi,:a′τ,x)的特征分布间的KL散度;特征分布为最大值、最小值、均值、方差、尖度或峰度。
在本申请一些实施例中,演员模块包含在线演员网络和目标演员网络,评论家模块包括在线评论家网络和目标评论家网络,初始化强化学习所使用的智能体,包括:
对在线演员网络的在线演员网络参数进行初始化,并将目标演员网络的目标演员网络参数与在线演员网络参数设置为相同值;
对在线评论家网络的在线评论家网络参数进行初始化,并将目标评论家网络的目标评论家网络参数与在线评论家网络参数设置为相同值。
在本申请一些实施例中,利用转换数据对演员模块和评论家模块进行训练,包括:
将转换数据添加至经验回放池,并从经验回放池中随机采样预设数量的转换数据作为训练数据;
利用训练数据、目标演员网络、目标评论家网络、在线评论家网络及如下损失函数,确定在线评论家网络参数的第一梯度;
其中,lossQ表示损失函数,aτ表示连续动作,Oτ表示第τ个时间步对应的历史状态向量,Q表示在线评论家网络,θQ表示在线评论家网络参数,N表示预设数量;表示对目标评论家网络的估计,rτ表示第τ个时间步对应的奖励值,γ表示预设的折扣因子,Q′表示目标评论家网络,θQ′表示目标评论家网络参数,μ′表示目标演员网络,θμ′表示目标演员网络参数,Oτ+1表示第τ个时间步对应的当前状态向量;
根据第一梯度对在线评论家网络参数进行更新;
利用训练数据、更新后的在线评论家网络、在线演员网络及目标函数确定绩效目标,并 确定绩效目标关于确定在线演员网络参数的第二梯度:
其中,表示当环境状态O服从分布函数为ρβ的分布时Q(O,μ(O))的期望值,θμ表示在线演员网络参数,表示第二梯度;
基于第二梯度对在线演员网络参数进行更新;
利用更新后的在线评论家网络参数和在线演员网络参数以如下方式对目标评论家网络参数和目标演员网络参数进行更新:
其中,α为预设值。
本申请还提供一种图神经网络压缩装置,包括:
获取模块,用于获取已训练的图神经网络及其训练时所使用的图数据;
区间确定模块,用于确定图数据中所有图顶点对应的度数分布范围,并将度数分布范围划分为多个度数区间;
量化位宽确定模块,用于在预设资源限制条件的约束下,利用强化学习及硬件加速器确定各度数区间对应的最优区间量化位宽以及图神经网络对应的最优网络量化位宽;
量化压缩模块,用于利用最优区间量化位宽对图数据中对应度数的图顶点的顶点特征进行量化压缩,以及利用最优网络量化位宽对图神经网络进行量化压缩,得到最优量化图数据和最优量化图神经网络。
本申请还提供一种电子设备,包括:
存储器,用于存储计算机程序;
处理器,用于执行计算机程序时实现如上的图神经网络压缩方法。
本申请还提供一种非易失性可读存储介质,非易失性可读存储介质中存储有计算机可执行指令,计算机可执行指令被处理器加载并执行时,实现如上的图神经网络压缩方法。
本申请提供一种图神经网络压缩方法,包括:获取已训练的图神经网络及其训练时所使用的图数据;确定图数据中所有图顶点对应的度数分布范围,并将度数分布范围划分为多个度数区间;在预设资源限制条件的约束下,利用强化学习及硬件加速器确定各度数区间对应的最优区间量化位宽以及图神经网络对应的最优网络量化位宽;利用最优区间量化位宽对图数据中对应度数的图顶点的顶点特征进行量化压缩,以及利用最优网络量化位宽对图神经网络进行量化压缩,得到最优量化图数据和最优量化图神经网络。
可见,本申请在获取到已训练的图神经网络及其训练时所使用的图数据时,首先会统计图数据中所有图顶点对应的度数分布范围,并将这一返回划分为多个度数区间;随后,本申 请将会在预设资源限制条件的约束下,采用强化学习及硬件加速器确定各度数区间对应的最优区间量化位宽及图神经网络对应的最优网络量化位宽,并利用以上两种量化位宽对图数据的顶点特征和图神经网络进行量化压缩,其中,强化学习能够根据硬件加速器的反馈,自动搜索每个度数区间及图神经网络对应的最优量化位宽分配策略,即能够实现对上述最优区间量化位宽和最优网络量化位宽的自动查找;同时,强化学习的自动查找动作受预设资源限制条件的限制,即能够保证最终得到的最优区间量化位宽和最优网络量化位宽能够适应于资源受限设备;最后,由于本申请已将图顶点的度数分布范围划分为多个度数区间,并为每一区间均确定了对应的最优区间量化位宽,即能够对不同度数的图顶点的顶点特征进行不同程度的量化压缩,因此能够有效避免相关方案对所有数据选择简单但激进的统一量化容易导致的高精度损失的问题。简单来讲,由于本申请采用强化学习为图神经网络及其训练时所使用的图数据确定最优的量化位宽,因此不仅能够实现量化位宽的自动化确定,同时还能够有效平衡性能与网络模型精度间的关系,确保最终得到的量化图数据和量化图神经网络不仅具有较高的精度,同时还能够适应于资源受限设备。本申请还提供一种图神经网络压缩装置、电子设备及非易失性可读存储介质,具有上述有益效果。
附图说明
为了更清楚地说明本申请实施例或相关技术中的技术方案,下面将对实施例或相关技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据提供的附图获得其他的附图。
图1为本申请实施例所提供的一种图神经网络压缩方法的流程图;
图2为本申请实施例所提供的一种图神经网络的典型结构图;
图3为本申请实施例所提供的一种图神经网络压缩系统的结构框图;
图4为本申请实施例所提供的一种图神经网络压缩装置的结构框图;
图5为本申请实施例所提供的一种电子设备的结构框图;
图6为本申请实施例所提供的一种非易失性可读存储介质的结构框图。
具体实施方式
为使本申请实施例的目的、技术方案和优点更加清楚,下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
为对图神经网络模型更加有效地进行量化压缩,以确保压缩得到的量化图神经网络同时具有较高精度及较低资源消耗率,本申请实施例可提供一种图神经网络压缩方法,可在预设资源限制条件的约束下,利用强化学习为图神经网络和图数据自动确定最优量化位宽,以确保得到的量化图神经网络同时具有较高精度及较低资源消耗率。请参考图1,图1为本申请实施例所提供的一种图神经网络压缩方法的流程图,该方法可以包括:
S100、获取已训练的图神经网络及其训练时所使用的图数据。
应当指出的是,本步骤中获取的图神经网络为原始的、全精度的图神经网络,而图数据则为该网络的训练数据,其中图神经网络所包含的权重、卷积核等参数以及图数据均属于浮点型数据,且大多用FP32表示。浮点型数据精度高,但相应的,存储它们所需的内存空间 也较大。本申请的目标是在保证图神经网络模型推理精度的前提下,为图神经网络各层的权重、卷积核参数等,以及图数据找到合适的量化位宽,以降低存储空间需求。这里的量化位宽通常为表示精度较低的整型,如int4,int8等。
为便于理解,首先对图数据及图神经网络进行简单介绍。图数据是图神经网络的基本输入内容。考虑一个具有n个顶点和m条边的图G=(V,E),即有|V|=n和|E|=m,图顶点的平均度数d=m/n。图中的连通性由邻接矩阵A∈{0,1}n×n给出,元素aij=1表示顶点vi和vj相邻接,aij=0表示不邻接。度数矩阵D是一个对角阵,主对角线上的n个元素的值分别表示n个顶点的度数,其余元素都为零。每个顶点vi都有长度为f0的特征向量,所有图顶点的特征向量组成特征矩阵在本申请实施例中,图数据中具体要压缩的部分为由所有图顶点的特征向量组成特征矩阵该矩阵属于浮点型数据。
进一步,图神经网络是一种能够处理不规则结构数据的特殊神经网络。尽管图神经网络的结构可遵循不同指导原则进行设计,但几乎所有图神经网络都可以解释为对顶点特征执行消息传递,然后是特征变换和激活。图2展示了一个典型的图神经网络的结构:它由输入层、L层图卷积层和输出层组成。输入层负责读取表征图拓扑结构的邻接矩阵A或邻接表AdjList,以及顶点特征矩阵X0。图卷积层负责顶点特征提取,对于每一层图卷积层l(l∈[1,L]),它读入邻接矩阵A或邻接表AdjList,以及顶点特征矩阵Xl,经由图卷积操作和非线性变换,输出新的顶点特征矩阵Xl+1。输出层根据任务的不同自由设定,比如顶点分类可通过softmax函数实现。典型的,在一个由L层图卷积层组成的图神经网络中,第l(l∈[1,L])层的图卷积操作通常可以写成以下形式:
Xl+1=σ(ΣsFsXlWl,s)
其中,表示定义消息传递算子的第s∈Z+个图卷积核;σ(*)表示非线性激活函数。是第l层的第s个卷积核对应的可学习线性权重矩阵,fl表示第l层图卷积层输入的顶点特征维度。在这个通用框架内,不同图神经网络的主要差异体现在选择不同的图卷积核Fs。无论是顶点特征矩阵X,还是图卷积核F,亦或是权重W,它们通常都是浮点型数据。需要注意的是,只有图卷积层才有卷积核和激活,输入和输出层只有权重。
应当指出的是,本申请实施例并不限定具体的图神经网络及图数据。正如上文,图神经 网络的结构可遵循不同指导原则进行设计;同时,可以理解的是,对于不同的任务,图数据的具体内容,甚至是其复杂度都可能不同,因此具体的图神经网络和图数据可根据实际应用需求进行选择。本申请之所以可对各类图神经网络及图数据进行压缩,是由于本申请实施例采用了强化学习来确定图神经网络及图数据对应的最佳量化位宽,而强化学习对各类环境均有较强的适应性,因此本申请实施例所提供的压缩方法适用于各类图神经网络。
S200、确定图数据中所有图顶点对应的度数分布范围,并将度数分布范围划分为多个度数区间。
在相关方案中,对图数据中各图顶点的顶点特征的量化压缩通常会采用统一的量化位宽进行。这虽然有效降低了图数据的复杂度及存储规模,但是这种不加区分的量化压缩方法却给图神经网络模型带来了显著的精度损失。因此,在本申请实施例中,针对图数据中拥有不同度数的图顶点,可采用不同的量化位宽进行压缩,以此缓解由量化图数据引起的图神经网络模型精度损失。具体的,在图神经网络计算中,度数较高的顶点通常会从相邻顶点获得更丰富的信息,这使得它们对低量化比特的鲁棒性更强,因为量化的随机误差通常可以通过大量的聚合操作平均为0。特别地,给定一个量化位宽q,顶点vi的量化误差Errori是一个随机变量,并且遵循均匀分布。对于度数较大的顶点,可从顶点vi及其相邻顶点vj聚合大量的Errori和Errorj,并且平均结果将按照大数定律收敛到0。因此,度数大的顶点对量化误差的鲁棒性更强,可以对这些度数高的顶点使用较小的量化比特,而对度数低的顶点使用较大的量化比特。
进一步,由于现实世界的图的顶点度数大多服从幂率分布,如果为每个不同度数的图顶点都分配量化位宽,将会导致状态空间爆炸。例如,即便对于一个小规模的图数据com-livejournal,绝大部分的顶点度数分散在1到104之间。如果量化空间为8的话,则状态空间将达到惊人的810000。显然,如此巨大的状态空间无法满足应用需求。因此,为降低状态空间复杂度,本申请实施例可首先统计图数据中各图顶点对应的度数,得到该图数据对应的度数分布范围,然后将这一范围划分为多个度数区间,以为每一区间确定最佳区间量化位宽,这样,便能够大幅缩小状态空间大小,进而提升最佳量化位宽的搜索便捷性。根据如上描述,最终得到最佳区间量化位宽的分布规律应当为:度数区间对应的度数值越大,则对应的最佳区间量化位宽越大。需要说明的是,本申请实施例并不限定度数分布范围的划分方法,例如可对度数分布范围进行等分,也可以根据图顶点在这一范围内的分布情况进行划分,例如可确保每一度数区间对应的图顶点数量相同或是接近。为了进一步减少精度损失,在本申请实施例中,度数分布范围的划分可依照图顶点在该范围内的分布情况进行划分,以确保各区间所包含的图顶点数量相同。
在一种可能的情况中,确定图数据中所有图顶点对应的度数分布范围,并将度数分布范围划分为多个度数区间,可以包括:
步骤S201、将图数据中的所有图顶点按度数从小到大排列,得到图顶点序列;
步骤S202、利用图顶点序列对度数分布范围进行划分,得到多个度数区间;各度数区间包含的图顶点数量相同或差值小于预设阈值。
需要说明的是,本申请实施例并不限定预设阈值的具体数值,可根据实际应用需求进行设定。为降低图度数区间的图顶点数据差异,预设阈值的数值可尽量小。具体的,对于图数 据G=(V,E),可首先统计其顶点度数分布,将图G中所有顶点按度数从小到大排序。在该序列中找到一个顶点度数分割点列表split_point=[d1,d2,...,dk-1],以将所有顶点划分为k个区间:[dj,dj+1](j∈[0,k-1]),使得落在每个区间中的顶点个数相同或者接近。其中,dj<dj+1;d0=dmin和dk=dmax,dmin和dmax分别表示某个图数据中所有顶点度数的最小值和最大值。在此基础上,制定顶点度数-量化位宽分配表{[dj,dj+1):qj(j∈[0,k-1])}同一区间的图顶点指派相同量化位宽:如果顶点度数落在[dj,dj+1)区间,则为其分配qj位宽。
S300、在预设资源限制条件的约束下,利用强化学习及硬件加速器确定各度数区间对应的最优区间量化位宽以及图神经网络对应的最优网络量化位宽。
在完成度数区间的划分之后,本申请实施例将在预设资源限制条件的约束下,利用强化学习及硬件加速器来确定各度数区间对应的最优区间量化位宽及图神经网络模型对应的最优网络量化位宽。此处应当指出的是,图神经网络对应的最优网络量化位宽具体指图神经网络的图卷积核矩阵、权重矩阵及激活矩阵对应的最佳网络量化位宽,这三个矩阵对应的最佳网络量化位宽可相同,也可不同;此外,图神经网络各网络层的图卷积核矩阵、权重矩阵及激活矩阵对应的最佳网络量化位宽可以相同,也可以不同,可根据实际应用需求进行选择,其中输入层和输出层没有图卷积核矩阵和激活矩阵,而卷积层则有图卷积核矩阵和激活矩阵。可以理解的是,不同的最佳网络量化位宽虽然可以带来更高的网络模型精度,但容易增加最佳网络量化位宽的搜索计算量,因此上述三种矩阵的最佳网络量化位宽的设置可在平衡网络模型精度及搜索计算量之后按需设置。当然,还需指出的是,图神经网络中的各网络层并非均具有图卷积核矩阵、权重矩阵及激活矩阵,例如卷积层中设置有这三个矩阵,但输入层和输出层并未设置图卷积核矩阵和激活矩阵。因此,在为图神经网络设置网络量化位宽时,还可进一步依照图神经网络的具体结构进行设置。
进一步,预设资源限制条件用于限制处理量化图数据及量化图神经网络(如训练、执行指定任务等)所要耗费的计算资源,这是由于图神经网络对计算资源的消耗较大,若不考虑具体的硬件框架而随意地进行量化压缩,则可能导致最终得到的量化图数据及量化图神经网络对应有较大的处理计算量、较大的内存占用量及较长的处理时延,不利于部署应用。因此本申请实施例将采用预设资源限制条件对强化学习进行限制。需要说明的是,本申请实施例并不限定具体的预设资源限制条件,例如可以包含计算量阈值、内存占用量阈值及延迟量阈值,且各阈值设置有对应的计算公式,用于计算量化图数据及量化图神经网络对应的计算量、内存占用量及延迟量。可以理解的是,量化图数据及量化图神经网络对应的计算量、内存占用量及延迟量应当小于等于对应的计算量阈值、内存占用量阈值及延迟量阈值。上述阈值及对应的公式均由硬件加速器的直接反馈确定得到,其中硬件加速器用于验证图数据及图神经网络的量化效果,如验证量化压缩网络对计算资源的消耗,以及该网络在执行指定任务时对应的准确度。需要说明的是,本申请实施例并不限定具体的计算量阈值、内存占用量阈 值及延迟量阈值,也不限定上述阈值具体对应的计算公式,可根据实际应用需求进行设定,或可参考后续实施例中的描述。本申请实施例也不限定硬件加速器的具体结构,例如该硬件加速器的时序结构可以为可重构位串行矩阵乘法叠加(BISMO,Bit-Serial Matrix Multiplication Overlay),空间结构可以为BitFusion架构。一种优选的硬件加速器配置可参考下表。
表1硬件加速器的配置情况
进一步,强化学习是机器学习的范式和方法论之一,用于描述和解决智能体(agent)在与环境的交互过程中通过学习策略以达成回报最大化或实现特定目标的问题。强化学习要解决的问题是:让智能体(agent)学习在一个环境中如何执行动作(action),从而获得最大的奖励值总和(total reward)。这个奖励值一般与智能体定义的任务目标关联。智能体主要学习的内容包括:第一是行为策略(action policy),第二是规划(planning)。其中,行为策略的学习目标是最优策略,也就是使用这样的策略,可以让智能体在特定环境中的行为获得最大的奖励值,从而实现其任务目标。动作(action)可以简单分为:(1)连续的,如赛车游戏中的方向盘角度、油门、刹车控制信号,机器人的关节伺服电机控制信号;(2)离散的,如围棋、贪吃蛇游戏等。
本申请实施例具体使用了同时基于价值和策略的强化学习方法,其又可称为Actor(演员,又可称为演员)-Critic(评论家,又可称为评论者)方法。Actor-Critic方法结合了基于价值的方法和基于策略的方法的优点,利用基于价值的方法学习Q值函数或状态价值函数V来提高采样效率(该部分由评论者处理),并利用基于策略的方法学习策略函数(该部分由演员处理),从而适用于连续或高维的动作空间。Actor-Critic方法可以看作是基于价值的方法在连续动作空间中的扩展,也可以看作是基于策略的方法在减少样本方差和提升采样效率方面的改进。
具体的,请参考图3,图3为本申请实施例所提供的一种图神经网络压缩系统的结构框图,该系统由基于Actor-Critic框架的DDPG(Deep Deterministic Policy Gradient)智能体、策略Policy、量化实施以及硬件加速器共四个部分组成。其中,DDPG智能体根据当前的环境状态O,在满足硬件加速器资源(即预设资源限制条件)约束的前提下,按照特定的策略给出动作:为每一个度数区间的顶点的特征和图神经网络所有层的图卷积核(如果有)、权重和激活(如果有)分配合适的量化位宽。上位机根据DDPG智能体提供的量化位宽分配方案,对已经训练好的浮点图神经网络模型和图数据实施量化,得到量化图神经网络模型和量化图数据。随后,量化数据及量化网络将被一起映射或分布到硬件加速器上,而后者将利用量化图数据对量化图神经网络进行训练,并在训练之后利用量化图神经网络执行指定任务,进而将量化前后图神经网络的准确度差值作为奖励,反馈给DDPG智能体。DDPG智能体根据环境反馈的信息调整策略并输出新的动作,直至获得最优策略。当然,该系统还可包 括其他工作流程,为避免描述冗长,关于该系统的具体工作流程,请参考后续实施例中的描述。
S400、利用最优区间量化位宽对图数据中对应度数的图顶点的顶点特征进行量化压缩,以及利用最优网络量化位宽对图神经网络进行量化压缩,得到最优量化图数据和最优量化图神经网络。
在得到最优区间量化位宽及最优网络量化位宽之后,便可对相应的图数据中各图顶点的顶点特征和图神经网络进行量化压缩,以得到最优量化图数据和最优量化图神经网络。本申请实施例并不限定量化压缩的具体步骤,可根据实际应用需求进行设定,或可参考后续实施例中的描述。应当指出的是,尽管本申请实施例已尽力提升最优量化图神经网络的精度,但量化压缩本身还是会对最优量化图神经网络执行指定任务的准确度带来负面影响。对此,可在量化压缩结束后,再次使用最优量化图数据对量化图神经网络进行训练,以恢复最优量化图神经网络执行指定任务的准确度,以便将最终得到的微调量化图神经网络部署至对外服务设备中进行对外服务。
在一种可能的情况中,在得到最优量化图数据和最优量化图神经网络之后,还可以包括:
S500、利用最优量化图数据对最优量化图神经网络进行训练,得到微调量化图神经网络,以将微调量化图神经网络部署至对外服务设备中。
需要说明的是,本申请实施例并不限定最优量化图神经网络的训练过程,可参考图神经网络的相关技术。
基于上述实施例,本申请在获取到已训练的图神经网络及其训练时所使用的图数据时,首先会统计图数据中所有图顶点对应的度数分布范围,并将这一范围划分为多个度数区间;随后,本申请将会在预设资源限制条件的约束下,采用强化学习及硬件加速器确定各度数区间对应的最优区间量化位宽及图神经网络对应的最优网络量化位宽,并利用以上两种量化位宽对图数据中各图顶点的顶点特征和图神经网络进行量化压缩,其中,强化学习能够根据硬件加速器的反馈,自动搜索每个度数区间及图神经网络对应的最优量化位宽分配策略,即能够实现对上述最优区间量化位宽和最优网络量化位宽的自动搜索;同时,强化学习的自动搜索动作受预设资源限制条件的限制,即能够保证最终得到的最优区间量化位宽和最优网络量化位宽能够适应于资源受限设备;最后,由于本申请已将图顶点的度数分布范围划分为多个度数区间,并为每一区间均确定了对应的最优区间量化位宽,即能够对不同度数的图顶点的顶点特征进行不同程度的量化压缩,因此能够有效避免相关方案对所有数据选择简单但激进的统一量化容易导致的高精度损失的问题。简单来讲,由于本申请采用强化学习为图神经网络及其训练时所使用的图数据确定最优的量化位宽,因此不仅能够实现量化位宽的自动化确定,同时还能够有效平衡性能与网络模型精度间的关系,确保最终得到的量化图数据和量化图神经网络不仅具有较高的精度,同时还能够适应于资源受限设备。
基于上述实施例,下面将对图神经网络压缩系统的具体工作流程进行介绍。为便于理解,首先对后文中出现的动作序列、策略、时间步、奖励值及转换数据进行介绍。动作序列用于保存各度数区间对应的区间量化位宽以及图神经网络对应的网络量化位宽,例如,对于给定的图数据G=(V,E),首先统计其顶点度数分布范围,并按照一定策略划分为k个区间。进而对于k个度数区间和图神经网络的三种矩阵来说,动作序列的长度可以为k+3。确 定一个完整动作序列的过程称作一次策略(episode),一次策略包含N个时间步(step),其中N的值与动作序列的长度相等。应当特别指出的是,每执行一个时间步,都会对动作序列进行一次更新,因此一次策略通常可产生N种不同的动作序列。进一步,可以理解的是,动作序列可用于量化压缩,而由于上一动作序列与下一动作序列之间并不相同,因此这两个动作序列所对应的压缩效果也不同,换句话说,采用这两种动作序列所产生的量化图数据和量化图神经网络对应的资源消耗情况(如内存占用率、计算量等)并不相同,执行指定任务时所对应的准确度也不相同。因此在本申请实施例中,可以采用状态向量来记录资源消耗情况及准确度间的变化情况,具体的,对于利用上一动作序列压缩的量化图数据和量化图神经网络,其对应的内存占用率、计算量及执行指定任务对应的准确度可采用历史状态向量记录,而利用下一动作序列压缩的量化图数据和量化图神经网络对应的内存占用率、计算量及执行指定任务对应的准确度则可采用当前状态向量记录。进一步,可利用原始的图神经网络执行指定任务对应的基准准确度和量化图神经网络执行相同任务对应的准确度确定奖励值,其中基准准确度具体指利用原始的图数据训练原始的图神经网络后,图神经网络对应的推理精度,如分类任务中的分类准确度。此后,每一时间步对应的历史状态向量、动作序列、奖励值及当前状态向量构成一个转换数据(transition),显然,该数据中包含本次量化压缩的动作、奖励及状态转移,智能体可通过这一数据感知动作的执行效果。换句话说,可利用转换数据对智能体进行训练,以更新智能体在确定动作时所采用的策略。
基于上述描述,下面对图神经网络压缩系统的具体工作流程进行详细介绍,在一种可能的情况中,在预设资源限制条件的约束下,利用强化学习及硬件加速器确定各度数区间对应的最优区间量化位宽以及图神经网络对应的最优网络量化位宽,可以包括:
S310、获取图神经网络执行指定任务对应的基准准确度,并初始化强化学习所使用的智能体以及历史奖励值;智能体包括演员模块和评论家模块。
需要说明的是,本申请实施例并不限定图神经网络所执行的具体任务,可根据实际应用需求进行设定。本申请实施例将会把原始的图神经网络执行该任务的准确度设置为基准准确度。本申请实施例也不限定准确度的计算方式,可根据实际应用需求进行设定。在一种可能的情况中,对于多分类任务而言,设测试图数据集GT=(VT,ET),每个顶点仅有一个类别标签且所有顶点共有cT个类别标签,类别标签为i(i∈[1,cT])的顶点数占总顶点数的比例为γii∈(0,1)),且将每一类视为“正类(positive)”,其余类视为“负类(negative)”,并借鉴经典二分类问题中相应指标的定义,该多分类问题的分类准确度可定义为:
进一步,为了在智能体的搜索过程中确定最优区间量化位宽和最优网络量化位宽,本申请实施例还专门设置了历史奖励值,用于记录搜索过程中所出现的最高奖励值。当最高奖励值出现时,本申请实施例将会对历史记录值、最优区间量化位宽和最优网络量化位宽进行更新。当然,可以理解的是,历史奖励值也应当具有初值,此处的初始化过程便是为其设置初值。本申请实施例并不限定历史奖励值具体的初值,只要尽量小即可。
进一步,本申请实施例也不限定对智能体进行初始化的具体过程,此处的初始化主要是对智能体中的参数进行初始化,可参考DDPG智能体的相关技术。
S320、将策略次数设置为1,并初始化动作序列以及历史状态向量;动作序列用于保存各度数区间对应的区间量化位宽以及图神经网络对应的网络量化位宽;状态向量用于记录量化图神经网络在处理量化图数据时对应的内存占用量、计算量以及在执行指定任务时对应的准确度。
具体的,动作序列可表示为:
a={dmin,d1):q1,[d1,d2):q2,…,[dk-1,dmax):qk,F:qF,W:qW,∑:qσ},
其中,“[dj,dj+1):qj”表示为属于区间[dj,dj+1)的图顶点分配量化位宽qj(j∈[0,k-1]),“F:qF”、“W:qW”和“Σ:qσ”分别表示为图神经网络所有层的图卷积核(如果有)、权重和激活(如果有)设置的量化位宽为qF,qW和qσ。当然,如果为图神经网络不同层的图卷积核(或权重或激活)指定不同的量化位宽,此时,DDPG智能体的动作序列a的长度将变为k+3L+2,其中L表示图卷积层的数量,即有:
进一步,状态向量可表示为:
O=[acc,store,cpmpt],
其中acc表示准确度,store表示内存占用量,compt表示计算量。关于内存占用量及计算量的确定方式可参考后续实施例中的描述。
S330、将时间步设置为1,并在预设资源限制条件的约束下,利用演员模块确定连续动作,利用连续动作对动作序列进行数值更新,并在更新后确定动作序列对应的内存占用量及计算量。
可以理解的是,演员模块对动作序列的数值更新即相当于演员模块根据当前状态及策略给出了一种动作。值得注意的是,演员模块(actor)首先会确定连续动作,进而利用这一连续动作对动作序列进行数值更新。然而,量化位宽通常为离散值,例如常规的量化位宽为2、4、8、16、32位等,因此在得到连续动作之后,首先需要将其离散化,得到离散动作值,进而利用这一离散动作值对动作序列进行更新。下面对这一过程进行详细介绍。
在一种可能的情况中,在预设资源限制条件的约束下,利用演员模块确定连续动作,利用连续动作对动作序列进行数值更新,并在更新后确定动作序列对应的内存占用量及计算量,包括:
步骤S331、利用演员模块根据Behavior策略选择连续动作,并通过如下方式将连续动作进行离散化,得到离散动作值:
其中,aτ(i)表示第τ个时间步的动作序列中的第i个量化位宽对应的连续动作,a′τ(i)表示与aτ(i)对应的离散动作值,Q包含多个预设量化位宽值,round(·)表示四舍五入函数,qmin和qmax表示预设的最小量化位宽和最大量化位宽,argmin(·)函数用于在Q中选择目标预设量化位宽值q,以使|q-round(qmin-0.5+aτ(i)×(qmax-qmin+1))|最小;
步骤S332、利用动作值对动作序列进行数值更新,确定更新后的动作序列对应的内存占用量、计算量及延迟量,并判断内存占用量、计算量及延迟量是否满足预设资源限制条件的限制;若内存占用量、计算量及延迟量满足预设资源限制条件的限制,则进入步骤S333,若内存占用量、计算量及延迟量不满足预设资源限制条件的限制,则进入步骤S334;
步骤S333、若内存占用量、计算量及延迟量满足预设资源限制条件的限制,则进入利用动作序列对图数据中的顶点特征和图神经网络进行量化压缩的步骤;
步骤S334、若内存占用量、计算量及延迟量不满足预设资源限制条件的限制,则按照预设顺序依次对动作序列中的量化位宽进行减少,以再次更新动作序列,并在每次减少动作完成时进入确定更新后的动作序列对应的内存占用量、计算量及延迟量的步骤。
具体来讲,对于长度为k+3的动作序列,在第τ个时间步,DDPG智能体采取连续动作aτ=[aτ(1),aτ(2),...aτ(k+3)],且满足aτ(i)∈[0,1](i∈[1,k+2]),并采用如上公式将其每一个分量aτ(i)舍入为Q={2,4,8,16,32}中离它最近的位宽值a′τ,即满足|aτ-a′τ|最小,其中qmin=2,qmax=32。例如,当aτ(i)在如上公式中的计算结果表明,在q选择4时,相较于选择其他预设量化位宽能够确保|aτ-a′τ|最小,因此对应的a′τ应当设置为4。
进一步,在实际应用中,由于计算预算有限(即计算量、延迟和内存占用量),因此本申请实施例希望在给定约束的情况下找到具有最佳推理性能的量化位宽分配方案。本申请实施例鼓励智能体通过限制动作空间来满足计算预算。具体的,每当智能体发出一个动作aτ,本申请实施例就需要预估量化后的图神经网络将使用的硬件资源量。如果当前的分配方案超出硬件加速器资源预算,则依次减少每个度数区间的顶点以及图神经网络所有层的图卷积核(如果有)、权重和激活(如果有)的位宽,直到最终满足硬件加速器资源预算约束为止。也可以按照其他顺序,比如按照当前已分配位宽值由大到小的顺序依次减少,本申请实施例不做限定。
进一步,Behavior策略β是一个根据当前演员模块的策略和随机UO(Uhlenbeck-Ornstein,奥恩斯坦-乌伦贝克)噪声Nτ生成的随机过程,其具体过程可以为:
在一种可能的情况中,利用演员模块根据Behavior策略选择连续动作,包括:
步骤S3311、利用演员模块根据Behavior策略以如下方式选择连续动作:
aτ=μ(Oτμ)+Nτ
其中,Nτ表示第τ个时间步对应的随机UO噪声,Oτ表示第τ个时间步对应的历史状态向量,μ表示演员模块中的在线演员网络,θμ表示在线演员网络参数。
此处应当指出的是,演员模块的一种策略具体可由该模块中具体的模型参数表示。换句话说,对演员模块进行策略更新实际便是对该模块进行参数更新。
S340、利用动作序列对图数据中的顶点特征和图神经网络进行量化压缩,并将得到的量化图数据和量化图神经网络发送至硬件加速器,以使硬件加速器利用量化图数据对量化图神经网络进行训练,并确定训练后的量化图神经网络执行指定任务对应的当前准确度。
S350、利用动作序列对应的内存占用量、计算量和准确度确定当前状态向量,以及利用基准准确度和当前准确度确定奖励值;
具体的,奖励值可通过如下方式进行计算:
r=λ(accquant-accorigin)
其中,accorigin是利用原始的训练集训练原始的图神经网络后,原始图神经网络对应的基准准确度,accquant是微调后的量化图神经网络的准确度,λ为比例因子,其数值可优选为0.1。
S360、在确定奖励值大于历史奖励值时,利用奖励值更新历史奖励值,并利用更新后的动作序列对最优区间量化位宽及最优网络量化位宽进行更新;
S370、利用历史状态向量、连续动作、奖励值和当前状态向量生成转换数据,并利用转换数据对演员模块和评论家模块进行训练,以使评论家模块对演员模块在进行数值更新时所使用的策略进行更新;
需要说明的是,本申请实施例并不限定对演员模块和评论家模块进行训练的具体过程,可参考后续实施例中的介绍。训练的意义即在于对演员模块进行模型参数更新,使其可采用新的策略来确定下一动作。
S380、当确定时间步未达到动作序列的长度时,对时间步加1,利用当前状态向量更新历史状态向量,并进入在预设资源限制条件的约束下,利用演员模块确定连续动作的步骤;
S390、当确定时间步达到动作序列的长度且策略次数未达到预设值时,对策略次数加1,并进入初始化动作序列以及历史状态向量的步骤;
S3100、当确定策略次数达到预设值时,输出最优区间量化位宽及最优网络量化位宽。
需要说明的是,本申请实施例并不限定具体的预设值,可根据实际应用需求进行设定。可以理解的是,预设值越大,则智能体对环境的感知程度越强,其生成的最优区间量化位宽及最优网络量化位宽便更加合适,但相应的计算耗时也更长,计算量也更大,因此策略次数对应的预设上限可在平衡精度及计算资源消耗之后按需设定。
基于上述实施例,下面对内存占用量、计算量及延迟量的计算方式进行介绍。当然,考虑到上述三个量的阈值及计算公式由硬件加速器的直接反馈来确定,因此还需对硬件加速器对量化图数据和量化图神经网络的处理方式进行介绍。具体的,硬件加速器对量化图数据和 量化图神经网络的主要处理内容为利用量化图数据对量化图神经网络进行训练,而训练过程可采用多种方式进行优化,例如全批量(full-batch)、小批量(mini-batch)或单元素(one-example)随机梯度下降(Stochastic Gradient Descent,SGD)等策略进行优化。在本申请实施例中,为提高训练效率,硬件加速器可采用小批量随机梯度下降法对量化图神经网络的训练过程进行优化。
在一种可能的情况中,硬件加速器利用量化图数据对量化图神经网络进行训练,可以包括:
S341、硬件加速器基于小批量随机梯度下降法利用量化图数据对量化图神经网络进行训练。
基于上述训练方式,下面对内存占用量、计算量及延迟量的计算方式进行介绍。在一种可能的情况中,确定更新后的动作序列对应的内存占用量、计算量及延迟量,包括:
S3321、利用如下公式计算内存占用量:
其中,storeMB表示内存占用量,nb表示单个小批量内的图顶点个数,fl表示量化图神经网络第l个网络层对应的顶点维度值,L表示量化图神经网络所有网络层的数量,qmax表示单个小批量内的所有图顶点分配到的区间量化位宽中的最大值,S表示卷积核的总数,qW和qF分别表示量化图神经网络各网络层的权重矩阵和卷积核对应的网络量化位宽;
S3322、利用如下公式计算计算量:
其中,computeMB表示计算量,qσ表示量化图神经网络各网络层的激活矩阵对应的网络量化位宽,MACl表示量化图神经网络第l层的乘累加操作的总数;
S3323、利用如下公式计算延迟量:
其中,latencyMB表示延迟量,Λl表示量化图神经网络第l个网络层处理小批量图数据的延迟。
应当指出的是,在得到上述内存占用量、计算量及延迟量之后,可采用对应的阈值来判断上述三个量是否符合要求。可采用Memorylimit、BOPSlimit和Latencylimit表示内存占用量阈值、计算量阈值及延迟量阈值,其中为Memorylimit硬件加速设备可提供的存储容量,BOPSlimit表示硬件加速器每秒钟可提供的比特操作总数上限,而Latencylimit是指硬 件加速器的延迟特性。Memorylimit、BOPSlimit和Latencylimit均由硬件加速器本身特性决定,可直接获取或通过测量得到。
基于上述实施例,下面对量化压缩的具体过程进行具体介绍。本申请实施例将以图数据量化压缩为例进行介绍。在一种可能的情况中,利用动作序列对图数据中的顶点特征进行量化压缩,可以包括:
S341、通过如下方式对图数据中各图顶点的顶点特征截断至[-c,c](c>0)范围内,并利用动作序列中与图顶点的度数对应的区间量化比特对截断后的顶点特征进行量化压缩:
quantize(Xi,:(j),a′τ,c)=round(clip(Xi,:(j),c)/s)×s
其中,quantize(·)表示量化压缩函数,round(·)表示四舍五入函数,clip(x,y)表示截断函数,用于将x截断至[-y,y](y>0),Xi,:表示顶点特征,Xi,:(j)(j∈[1,f0])表示顶点特征中的第j个分量,S表示缩放因子,s=c/(2q-1),q表示动作序列中与Xi,:所属图顶点的度数对应的区间量化比特。
当然,为进一步降低对截断值c的选择为量化图数据的精度损失,本申请实施例还设计了采用基于最小化量化前后数据特征分布距离的方法来确定合适的c值。具体的,在利用动作序列对图数据中的顶点特征进行量化压缩之前,还可以包括:
S342、通过如下方式确定c值:
其中,argmin(·)函数用于在选择x值,以使DKL(Xi,:||quantize(Xi,:a′τ,x))最小,DKL(Xi,:||quantize(Xi,:a′τ,x))表示Xi,:的特征分布与quantize(Xi,:a′τ,x)的特征分布间的KL散度;特征分布为最大值、最小值、均值、方差、尖度或峰度。
需要说明的是,本申请实施例并不限定KL散度(Kullback-Leibler divergence)的计算方式,当然,也可以采用其他方式来确定上述两种特征分布间的距离,例如还可采用JS距离(Jensen-Shannon Divergence)和互信息(Mutual Information)等,可根据实际应用需求进行设定。本申请实施例也不限定上述特征分布数据的具体获取方式,例如最大值、最小值、均值、方差可直接通过目标数据获得;尖度和峰度通过构建目标数据的直方图获得。至于图神经网络不同层的图卷积核(如果有)、权重和激活(如果有),本申请实施例将对它们进行类似的量化。不同之处在于,对激活来说,本申请实施例会将它们截断到[0,c]的范围内,而不是[-c,c],这是因为激活值(即ReLU(线性整流函数)层的输出)是非负的。
基于上述实施例,下面对演员模块和评论家模块的初始化及训练过程进行详细介绍。首先对DDPG智能体的结构进行简单介绍。Actor-Critic框架由Actor(又可称策略网络μ) 和Critic(又可称Q网络或价值网络)组成。其中,Actor负责与环境交互,并在Critic价值函数的指导下用策略梯度方法学习一个更好的策略;Critic的任务是利用搜集到的Actor与环境交互的数据学习一个价值函数Q,该函数的功能是评判当前状态-动作对的好坏,进而辅助Actor进行策略更新。Actor和Critic均包含两个网络,一个叫做online(在线网络),一个叫做target(目标网络)。因而DDPG算法中共有四个网络,分别是online Actor网络(在线演员网络)、target Actor网络(目标演员网络)、online Critic网络(在线评论家网络)和target Critic网络(目标评论家网络)。其中,online Actor网络和target Actor网络结构相同,参数不同;online Critic网络和target Critic网络同样如此。在网络训练过程中,DDPG算法采用冻结target网络的技巧:让online网络参数实时更新,而target网络参数暂时冻结。冻结target网络之时,让online网络去做尝试和探索,target网络则根据online网络产生的样本总结经验,然后再行动,并将online网络的参数赋值给target网络。
此外,DDGP算法还采用经验回放(experience replay)机制来去除数据相关性和提高样本利用效率。具体做法是维护一个经验回放池,将每次从环境中采样得到的转换数据四元组(状态、动作、奖励、下一状态)存储到经验回放池中,训练策略网络和Q网络的时候再从回放缓冲区中随机采样若干数据。这么做可以起到以下两个作用:(1)使样本满足独立假设。采用经验回放可以打破样本之间的相关性,让其满足独立假设;(2)提高样本利用率。
DDPG智能体的四个网络的功能分别如下:
online Actor网络(在线演员网络):负责策略网络参数θμ的迭代更新、根据当前环境状态Oτ选择当前最优动作aτ、以及负责和环境交互生成下一状态Oτ+1和奖励r;
target Actor网络(目标演员网络):负责根据从经验回放池中采样的下一状态Oτ+1选择下一最优动作aτ+1、负责定期通过指数移动平均法将Online Actor的参数θμ更新给Target Actor网络的参数θμ′;
online Critic网络(在线评论家网络):负责价值网络参数θQ的迭代更新、负责计算当前状态-动作对的online Q值Q(Oτ,aτQ)、负责计算Target Critic网络输出的估计
target Critic网络(目标评论家网络):负责计算Target Critic网络输出的估计中的Q′(Oτ+1,aτ+1Q′)、负责定期通过指数移动平均法将Online Critic的参数θQ更新给Target Critic网络的参数θQ′
在一种可能的情况中,演员模块包含在线演员网络和目标演员网络,评论家模块包括在线评论家网络和目标评论家网络,初始化强化学习所使用的智能体,可以包括:
S311、对在线演员网络的在线演员网络参数进行初始化,并将目标演员网络的目标演员 网络参数与在线演员网络参数设置为相同值;
S312、对在线评论家网络的在线评论家网络参数进行初始化,并将目标评论家网络的目标评论家网络参数与在线评论家网络参数设置为相同值。
具体的,可首先初始化在线演员和在线评论家网络的参数θμ和θQ,并将在线网络的参数拷贝给对应的目标网络参数:
θQ′←θQ,θμ′←θμ
在一种可能的情况中,利用转换数据对演员模块和评论家模块进行训练,可以包括:
S371、将转换数据添加至经验回放池,并从经验回放池中随机采样预设数量的转换数据作为训练数据;
S372、利用训练数据、目标演员网络、目标评论家网络、在线评论家网络及如下损失函数,确定在线评论家网络参数的第一梯度;
其中,lossQ表示损失函数,aτ表示连续动作,Oτ表示第τ个时间步对应的历史状态向量,Q表示在线评论家网络,θQ在线评论家网络参数,N表示预设数量;表示对目标评论家网络的估计,rτ表示第τ个时间步对应的奖励值,γ表示预设的折扣因子,Q′表示目标评论家网络,θQ′表示目标评论家网络参数,μ′表示目标演员网络,θμ′表示目标演员网络参数,Oτ+1表示第τ个时间步对应的当前状态向量;
S373、根据第一梯度对在线评论家网络参数进行更新;
S374、利用训练数据、更新后的在线评论家网络、在线演员网络及目标函数确定绩效目标,并确定绩效目标关于确定在线演员网络参数的第二梯度:
其中,表示当环境状态O服从分布函数为ρβ的分布时Q(O,μ(O))的期望值,θμ表示在线演员网络参数,表示第二梯度。
对于第二梯度的计算过程,需要指出的是,本申请实施例的目标是要寻找一个最优的策略网络参数使得DDPG智能体根据这个参数对应的最优策略实施动作,在环境中产生 的累积奖励的期望最大。为评价策略μ的好坏,本申请定义一个叫做绩效目标(performance objective)的目标函数J:
其中,Q(O,μ(O))是指在每个状态O下,如果都按照策略μ来选择动作μ(O),能够产生的Q值。的含义是当环境状态O服从分布函数为ρβ的分布时,Q(O,μ(O))的期望值。目标函数关于策略网络参数θμ的梯度(简称策略梯度)可通过如下公式计算:
策略梯度的计算利用了链式法则,先对动作a求导,再对策略网络参数θμ求导。然后通过梯度上升的方法来最大化函数Q,得到值最大的动作。
可以用Monte-Carlo(蒙特卡洛)方法来估算上述期望值。在经验回放池P中存储状态转换Tτ=(Oτ,aτ,rτ,Oτ+1),其中aτ是基于DDPG智能体按照Behavior策略β产生的,其将会基于上述实施例所提供的方法被转换为离散动作值。当从经验回放池P中随机采样获得N个转换数据以组成单个batch时,根据Monte-Carlo方法,可将单个batch数据代入上述策略梯度公式,可以作为对上述期望值的一个无偏差估计,所以策略梯度可以改写为:
S375、基于第二梯度对在线演员网络参数进行更新;
S376、利用更新后的在线评论家网络参数和在线演员网络参数以如下方式对目标评论家网络参数和目标演员网络参数进行更新:
其中,α为预设值。
下面基于具体的例子详细介绍上述图神经网络压缩方法。
(a)搭建一个由一台主机(即上位机)和一个硬件加速器组成的异构并行计算系统。使用Xilinx Zynq-7020 FPGA或Inspur F37X FPGA作为GNN推理硬件加速器。在时序结构设计方面,利用可重构位串行矩阵乘法叠加(BISMO,Bit-Serial Matrix Multiplication Overlay)。空间结构方面,采用BitFusion架构。获取硬件加速器的计算、存储和延迟特性数据。
(b)图神经网络选择GCN(Graph Convolutional Network,图卷积神经网络),利用 Pumbed(一种文摘型数据库)构造图数据集,并选择图学习任务为顶点分类,随后设计与学习任务匹配的目标函数和评价标准。构建一个包含L层图卷积层的GNN实例,在上位机利用CPU或GPU按照小批量随机梯度下降法训练该GNN模型,得到训练好的浮点GNN模型。图数据和已训练好的浮点GNN模型是本申请要量化对象。
(c)构建DDPG强化学习环境并完成初始化。1)搭建Actor(策略网络)和Critic(价值网络)。每个网络都有一个副本,一个是online网络,另一个是target网络。2)初始化Actor和Critic的online网络参数θμ和θQ;将online网络的参数拷贝给对应的target网络参数:θQ′←θQ,θμ′←θμ。3)初始化环境状态O0=[acc,store,compt]。4)初始化经验回放池(replay memory buffer)P和采样阈值δ。5)初始化最大奖励r_best和最优动作a_best。
(d)利用DDPG算法寻找最优量化位宽分配策略。除非明确说明,所有步骤均在上位机上执行。具体步骤如下:
重复以下训练过程(一个过程对应一个episode)ε次:
①初始化UO随机过程;
②接收一个随机初始状态O0
③重复执行T个时间步,在每个时间步τ依次执行下述操作:
a.Actor根据Behavior策略选择一个动作aτ=μ(Oτμ)+Nτ其中,Nτ是随机UO(Uhlenbeck-Ornstein)噪声。将aτ转换为离散动作a′τ
b.上位机根据a′τ指定好的量化位宽,采用基于最小化量化前后数据特征分布距离的方法量化方法,对所有图顶点的特征、GNN所有层的图卷积核(如果有)、权重和激活(如果有)实施量化。得到量化后的图顶点特征数据和GNN模型,并将后者映射到硬件加速器;
c.硬件加速器从上位机读取量化后的图顶点特征和邻接矩阵,采用小批量随机梯度下降法训练GNN模型,并测试其分类准确度及计算奖励函数rτ的值,并输出Oτ+1;将rτ和Oτ+1返回给上位机;
d.上位机更新r_best和a_best。上位机比较返回的rτ和r_best的大小,如果rτ>r_best,则令rτ←r_best,abest←a′τ
e.上位机将这个状态转换过程Tτ=(Oτ,aτ,rτ,Oτ+1)存入经验回放池P中。
f.当经验回放池P中的转换个数超过阈值δ时,实施采样:上位机从经验回放池P随机采样N个transition数据,作为online Actor和online Critic网络的一个batch训练数据。
g.上位机更新online Actor网络和online Critic网络的梯度。计算lossQ关于 θμ的梯度,并计算策略梯度;采用Adam optimizer更新online Critic网络参数QQ和online Actor网络参数θμ
h.上位机软更新target Actor网络和target Critic网络的参数:使用移动平均的方法,将两者相应的online网络参数,软更新给target网络参数:
④上位机输出r_best和a_best.
(e)硬件加速器根据a_best将量化模型再训练一个epoch以恢复性能,得到最终的定点GNN量化模型和量化后的图顶点特征数据。
下面对本申请实施例提供的图神经网络压缩装置、电子设备及非易失性可读存储介质进行介绍,下文描述的图神经网络压缩装置、电子设备及非易失性可读存储介质与上文描述的图神经网络压缩方法可相互对应参照。
请参考图4,图4为本申请实施例所提供的一种图神经网络压缩装置的结构框图。该装置可以包括:
获取模块401,用于获取已训练的图神经网络及其训练时所使用的图数据;
区间确定模块402,用于确定图数据中所有图顶点对应的度数分布范围,并将度数分布范围划分为多个度数区间;
量化位宽确定模块403,用于在预设资源限制条件的约束下,利用强化学习及硬件加速器确定各度数区间对应的最优区间量化位宽以及图神经网络对应的最优网络量化位宽;
量化压缩模块404,用于利用最优区间量化位宽对图数据中对应度数的图顶点的顶点特征进行量化压缩,以及利用最优网络量化位宽对图神经网络进行量化压缩,得到最优量化图数据和最优量化图神经网络。
在本申请一些实施例中,区间确定模块402,可以包括:
排列子模块,用于将图数据中的所有图顶点按度数从小到大排列,得到图顶点序列;
划分子模块,用于利用图顶点序列对度数分布范围进行划分,得到多个度数区间;各度数区间包含的图顶点数量相同或差值小于预设阈值。
在本申请一些实施例中,该装置还可以包括:
训练模块,用于利用最优量化图数据对最优量化图神经网络进行训练,得到微调量化图神经网络,以将微调量化图神经网络部署至对外服务设备中。
在本申请一些实施例中,硬件加速器的时序结构为可重构位串行矩阵乘法叠加,空间结构为BitFusion架构。
在本申请一些实施例中,量化位宽确定模块403,包括:
初始化子模块,用于获取图神经网络执行指定任务对应的基准准确度,并初始化强化学习所使用的智能体以及历史奖励值;智能体包括演员模块和评论家模块;
第一设置子模块,用于将策略次数设置为1,并初始化动作序列以及历史状态向量;动作序列用于保存各度数区间对应的区间量化位宽以及图神经网络对应的网络量化位宽;状态 向量用于记录量化图神经网络在处理量化图数据时对应的内存占用量、计算量以及在执行指定任务时对应的准确度;
第二设置子模块,用于将时间步设置为1,并在预设资源限制条件的约束下,利用演员模块确定连续动作,利用连续动作对动作序列进行数值更新,并在更新后确定动作序列对应的内存占用量及计算量;
压缩及训练子模块,用于利用动作序列对图数据中的顶点特征和图神经网络进行量化压缩,并将得到的量化图数据和量化图神经网络发送至硬件加速器,以使硬件加速器利用量化图数据对量化图神经网络进行训练,并确定训练后的量化图神经网络执行指定任务对应的当前准确度;
计算子模块,用于利用动作序列对应的内存占用量、计算量和准确度确定当前状态向量,以及利用基准准确度和当前准确度确定奖励值;
更细子模块,用于在确定奖励值大于历史奖励值时,利用奖励值更新历史奖励值,并利用更新后的动作序列对最优区间量化位宽及最优网络量化位宽进行更新;
智能体训练子模块,用于利用历史状态向量、连续动作、奖励值和当前状态向量生成转换数据,并利用转换数据对演员模块和评论家模块进行训练,以使评论家模块对演员模块在进行数值更新时所使用的策略进行更新;
第三设置子模块,用于当确定时间步未达到动作序列的长度时,对时间步加1,利用当前状态向量更新历史状态向量,并进入在预设资源限制条件的约束下,利用演员模块对动作序列进行数值更新时间步的步骤;
第四设置子模块,用于当确定时间步达到动作序列的长度且策略次数未达到预设值时,对策略次数加1,并进入初始化动作序列以及历史状态向量的步骤;
输出子模块,用于当确定策略次数达到预设值时,输出最优区间量化位宽及最优网络量化位宽。
在本申请一些实施例中,第二设置子模块,可以包括:
离散动作确定单元,用于利用演员模块根据Behavior策略选择连续动作,并通过如下方式将连续动作进行离散化,得到离散动作值:
其中,aτ(i)表示第τ个时间步的动作序列中的第i个量化位宽对应的连续动作,a′τ(i)表示与aτ(i)对应的离散动作值,Q包含多个预设量化位宽值,round(·)表示四舍五入函数,qmin和qmax表示预设的最小量化位宽和最大量化位宽,argmin(·)函数用于在Q中选择目标预设量化位宽值q,以使|q-round(qmin-0.5+aτ(i)×(qmax-qmin+1))|最小;
更新单元,用于利用动作值对动作序列进行数值更新,确定更新后的动作序列对应的内存占用量、计算量及延迟量,并判断内存占用量、计算量及延迟量是否满足预设资源限制条件的限制;
第一处理单元,用于若内存占用量、计算量及延迟量满足预设资源限制条件的限制,则 进入利用动作序列对图数据中的顶点特征和图神经网络进行量化压缩的步骤;
第二处理单元,用于若内存占用量、计算量及延迟量不满足预设资源限制条件的限制,则按照预设顺序依次对动作序列中的量化位宽进行减少,以再次更新动作序列,并在每次减少动作完成时进入确定更新后的动作序列对应的内存占用量、计算量及延迟量的步骤。
在本申请一些实施例中,离散动作确定单元,可以包括:
连续动作确定子单元,用于利用演员模块根据Behavior策略以如下方式选择连续动作:
aτ=μ(Oτ+|θμ)+Nτ
其中,Nτ表示第τ个时间步对应的随机UO噪声,Oτ表示第τ个时间步对应的历史状态向量,μ表示演员模块中的在线演员网络,θμ表示在线演员网络参数。
在本申请一些实施例中,压缩及训练子模块,可以包括:
硬件加速器单元,用于硬件加速器基于小批量随机梯度下降法利用量化图数据对量化图神经网络进行训练。
在本申请一些实施例中,更新单元,可以包括:
第一计算子单元,用于利用如下公式计算内存占用量:
其中,storeMB表示内存占用量,nb表示单个小批量内的图顶点个数,fl表示量化图神经网络第l个网络层对应的顶点维度值,L表示量化图神经网络所有网络层的数量,qmax表示单个小批量内的所有图顶点分配到的区间量化位宽中的最大值,S表示卷积核的总数,qW和qF分别表示量化图神经网络各网络层的权重矩阵和卷积核对应的网络量化位宽;
第二计算子单元,用于利用如下公式计算计算量:
其中,computeMB表示计算量,qσ表示量化图神经网络各网络层的激活矩阵对应的网络量化位宽,MACl表示量化图神经网络第l层的乘累加操作的总数;
第三计算子单元,用于利用如下公式计算延迟量:
其中,latencyMB表示延迟量,Λl表示量化图神经网络第l个网络层处理小批量图数据的延迟。
在本申请一些实施例中,压缩及训练子模块,包括:
压缩单元,用于通过如下方式对图数据中各图顶点的顶点特征截断至[-c,c](c>0)范围内,并利用动作序列中与图顶点的度数对应的区间量化比特对截断后的顶点特征进行量化压缩:
quantize(Xi,:(j),a′τ,c)=round(clip(Xi,:(j),c)/s)×s
其中,quantize(·)表示量化压缩函数,round(·)表示四舍五入函数,clip(x,y)表示截断函数,用于将x截断至[-y,y](y>0),Xi,:表示顶点特征,Xi,:(j)(j∈[1,f0])表示顶点特征中的第j个分量,S表示缩放因子,s=c/(2q-1),q表示动作序列中与Xi,:所属图顶点的度数对应的区间量化比特。
在本申请一些实施例中,压缩及训练子模块,还包括:
截断值确定单元,用于通过如下方式确定c值:
其中,argmin(·)函数用于在选择x值,以使DKL(Xi,:||quantize(Xi,:a′τ,x))最小,DKL(Xi,:||quantize(Xi,:a′τ,x))表示Xi,:的特征分布与quantize(Xi,:a′τ,x)的特征分布间的KL散度;特征分布为最大值、最小值、均值、方差、尖度或峰度。
在本申请一些实施例中,演员模块包含在线演员网络和目标演员网络,评论家模块包括在线评论家网络和目标评论家网络,初始化子模块,包括:
第一初始化单元,用于对在线演员网络的在线演员网络参数进行初始化,并将目标演员网络的目标演员网络参数与在线演员网络参数设置为相同值;
第二初始化单元,用于对在线评论家网络的在线评论家网络参数进行初始化,并将目标评论家网络的目标评论家网络参数与在线评论家网络参数设置为相同值。
在本申请一些实施例中,智能体训练子模块,可以包括:
训练数据抽取单元,用于将转换数据添加至经验回放池,并从经验回放池中随机采样预设数量的转换数据作为训练数据;
第一梯度计算单元,用于利用训练数据、目标演员网络、目标评论家网络、在线评论家网络及如下损失函数,确定在线评论家网络参数的第一梯度;
其中,lossQ表示损失函数,aτ表示连续动作,Oτ表示第τ个时间步对应的历史状态向量,Q表示在线评论家网络,θQ在线评论家网络参数,N表示预设数量;表示对目标评论家网络的估计,rτ表示第τ个时间步对应的奖励值,γ表示预设的折扣因子,Q′表示目标评论家网络,θQ′表示目标评论家网络参数,μ′表示目标演员网络,θμ′表示目标演员网络参数,Oτ+1表示第τ个时间步对应的当前状态向量;
第一更新单元,用于根据第一梯度对在线评论家网络参数进行更新;
第二梯度计算单元,用于利用训练数据、更新后的在线评论家网络、在线演员网络及目标函数确定绩效目标,并确定绩效目标关于确定在线演员网络参数的第二梯度:
其中,表示当环境状态O服从分布函数为ρβ的分布时Q(O,μ(O))的期望值,θμ表示在线演员网络参数,表示第二梯度;
第二更新单元,用于基于第二梯度对在线演员网络参数进行更新;
第三更新单元,用于利用更新后的在线评论家网络参数和在线演员网络参数以如下方式对目标评论家网络参数和目标演员网络参数进行更新:
其中,α为预设值。
请参考图5,图5为本申请实施例所提供的一种电子设备的结构框图,本申请实施例还提供一种电子设备,包括:
存储器501,用于存储计算机程序;
处理器502,用于执行计算机程序时实现如上述的图神经网络压缩方法的步骤。
由于电子设备部分的实施例与图神经网络压缩方法部分的实施例相互对应,因此电子设备部 分的实施例请参见图神经网络压缩方法部分的实施例的描述,这里不再赘述。
请参考图6,图6为本申请实施例所提供的一种非易失性可读存储介质的结构框图,本申请实施例还提供一种非易失性可读存储介质,非易失性可读存储介质601上存储有计算机程序,计算机程序被处理器执行时实现上述任意实施例的图神经网络压缩方法的步骤。
由于非易失性可读存储介质部分的实施例与图神经网络压缩方法部分的实施例相互对应,因此存储介质部分的实施例请参见图神经网络压缩方法部分的实施例的描述,这里不再赘述。
说明书中各个实施例采用递进的方式描述,每个实施例重点说明的都是与其他实施例的不同之处,各个实施例之间相同相似部分互相参见即可。对于实施例公开的装置而言,由于其与实施例公开的方法相对应,所以描述的比较简单,相关之处参见方法部分说明即可。
专业人员还可以进一步意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、计算机软件或者二者的结合来实现,为了清楚地说明硬件和软件的可互换性,在上述说明中已经按照功能一般性地描述了各示例的组成及步骤。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
结合本文中所公开的实施例描述的方法或算法的步骤可以直接用硬件、处理器执行的软件模块,或者二者的结合来实施。软件模块可以置于随机存储器(RAM)、内存、只读存储器(ROM)、电可编程ROM、电可擦除可编程ROM、寄存器、硬盘、可移动磁盘、CD-ROM、或技术领域内所公知的任意其它形式的存储介质中。
以上对本申请所提供的一种图神经网络压缩方法、装置、电子设备及存储介质进行了详细介绍。本文中应用了具体个例对本申请的原理及实施方式进行了阐述,以上实施例的说明只是用于帮助理解本申请的方法及其核心思想。应当指出,对于本技术领域的普通技术人员来说,在不脱离本申请原理的前提下,还可以对本申请进行若干改进和修饰,这些改进和修饰也落入本申请权利要求的保护范围内。

Claims (20)

  1. 一种图神经网络压缩方法,其特征在于,包括:
    获取已训练的图神经网络及其训练时所使用的图数据;
    确定所述图数据中所有图顶点对应的度数分布范围,并将所述度数分布范围划分为多个度数区间;
    在预设资源限制条件的约束下,利用强化学习及硬件加速器确定各所述度数区间对应的最优区间量化位宽以及所述图神经网络对应的最优网络量化位宽;
    利用所述最优区间量化位宽对所述图数据中对应度数的图顶点的顶点特征进行量化压缩,以及利用所述最优网络量化位宽对所述图神经网络进行量化压缩,得到最优量化图数据和最优量化图神经网络。
  2. 根据权利要求1的图神经网络压缩方法,其特征在于,所述度数分布范围依照图顶点在该范围内的分布情况进行划分。
  3. 根据权利要求1的图神经网络压缩方法,其特征在于,所述图神经网络对应的最优网络量化位宽具体指图神经网络的图卷积核矩阵、权重矩阵及激活矩阵对应的最佳网络量化位宽。
  4. 根据权利要求1的图神经网络压缩方法,其特征在于,所述预设资源限制条件用于限制处理量化图数据及量化图神经网络所要耗费的计算资源。
  5. 根据权利要求1的图神经网络压缩方法,其特征在于,所述预设资源限制条件包含:计算量阈值、内存占用量阈值及延迟量阈值。
  6. 根据权利要求1所述的图神经网络压缩方法,其特征在于,所述确定所述图数据中所有图顶点对应的度数分布范围,并将所述度数分布范围划分为多个度数区间,包括:
    将所述图数据中的所有图顶点按度数从小到大排列,得到图顶点序列;
    利用所述图顶点序列对所述度数分布范围进行划分,得到多个所述度数区间;各所述度数区间包含的图顶点数量相同或差值小于预设阈值。
  7. 根据权利要求1所述的图神经网络压缩方法,其特征在于,在得到最优量化图数据和最优量化图神经网络之后,还包括:
    利用所述最优量化图数据对所述最优量化图神经网络进行训练,得到微调量化图神经网络,以将所述微调量化图神经网络部署至对外服务设备中。
  8. 根据权利要求1所述的图神经网络压缩方法,其特征在于,所述硬件加速器的时序结构为可重构位串行矩阵乘法叠加,空间结构为BitFusion架构。
  9. The graph neural network compression method according to any one of claims 1 to 8, wherein the determining, under the constraint of the preset resource limitation condition and using reinforcement learning and the hardware accelerator, the optimal interval quantization bitwidth corresponding to each of the degree intervals and the optimal network quantization bitwidth corresponding to the graph neural network comprises:
    acquiring a baseline accuracy of the graph neural network on a specified task, and initializing an agent used by the reinforcement learning and a historical reward value, the agent comprising an actor module and a critic module;
    setting a policy count to 1, and initializing an action sequence and a historical state vector, wherein the action sequence is used to store the interval quantization bitwidth corresponding to each degree interval and the network quantization bitwidth corresponding to the graph neural network, and the state vector is used to record the memory usage and computation amount of the quantized graph neural network when processing the quantized graph data and its accuracy when performing the specified task;
    setting a time step to 1, and, under the constraint of the preset resource limitation condition, determining a continuous action using the actor module, numerically updating the action sequence with the continuous action, and determining the memory usage and computation amount corresponding to the updated action sequence;
    quantizing the vertex features in the graph data and the graph neural network using the action sequence, and sending the resulting quantized graph data and quantized graph neural network to the hardware accelerator, so that the hardware accelerator trains the quantized graph neural network with the quantized graph data and determines the current accuracy of the trained quantized graph neural network on the specified task;
    determining a current state vector using the memory usage and computation amount corresponding to the action sequence and the accuracy, and determining a reward value using the baseline accuracy and the current accuracy;
    when the reward value is determined to be greater than the historical reward value, updating the historical reward value with the reward value, and updating the optimal interval quantization bitwidth and the optimal network quantization bitwidth with the updated action sequence;
    generating transition data from the historical state vector, the continuous action, the reward value and the current state vector, and training the actor module and the critic module with the transition data, so that the critic module updates the policy used by the actor module in performing the numerical update;
    when it is determined that the time step has not reached the length of the action sequence, incrementing the time step by 1, updating the historical state vector with the current state vector, and returning to the step of determining a continuous action using the actor module under the constraint of the preset resource limitation condition;
    when it is determined that the time step has reached the length of the action sequence and the policy count has not reached a preset value, incrementing the policy count by 1 and returning to the step of initializing the action sequence and the historical state vector;
    when it is determined that the policy count has reached the preset value, outputting the optimal interval quantization bitwidth and the optimal network quantization bitwidth.
  10. The graph neural network compression method according to claim 9, wherein the determining a continuous action using the actor module under the constraint of the preset resource limitation condition, numerically updating the action sequence with the continuous action, and determining the memory usage and computation amount corresponding to the updated action sequence comprises:
    selecting the continuous action with the actor module according to a Behavior policy, and discretizing the continuous action as follows to obtain a discrete action value:
    a′_τ(i) = argmin_{q∈Q} |q − round(q_min − 0.5 + a_τ(i)·(q_max − q_min + 1))|
    where a_τ(i) denotes the continuous action corresponding to the i-th quantization bitwidth in the action sequence at the τ-th time step, a′_τ(i) denotes the discrete action value corresponding to a_τ(i), Q contains a plurality of preset quantization bitwidth values, round(·) denotes the rounding function, q_min and q_max denote the preset minimum and maximum quantization bitwidths, and the argmin(·) function selects the target preset quantization bitwidth value q in Q that minimizes |q − round(q_min − 0.5 + a_τ(i)·(q_max − q_min + 1))|;
    numerically updating the action sequence with the discrete action value, determining the memory usage, computation amount and latency corresponding to the updated action sequence, and judging whether the memory usage, the computation amount and the latency satisfy the preset resource limitation condition;
    if the memory usage, computation amount and latency satisfy the preset resource limitation condition, proceeding to the step of quantizing the vertex features in the graph data and the graph neural network using the action sequence;
    if the memory usage, computation amount and latency do not satisfy the preset resource limitation condition, reducing the quantization bitwidths in the action sequence one by one in a preset order to update the action sequence again, and, each time a reduction is completed, proceeding to the step of determining the memory usage, computation amount and latency corresponding to the updated action sequence.
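A small sketch of the discretization step in this claim is given below, mapping a continuous actor output a ∈ [0, 1] onto the nearest allowed bitwidth; the candidate set Q used here is an illustrative assumption.

```python
def discretize_action(a: float, q_min: int = 2, q_max: int = 8,
                      candidates: tuple = (2, 4, 6, 8)) -> int:
    """Map a continuous action a in [0, 1] to the closest allowed bitwidth.

    Follows round(q_min - 0.5 + a * (q_max - q_min + 1)), then snaps to the
    nearest value in the candidate set Q.
    """
    target = round(q_min - 0.5 + a * (q_max - q_min + 1))
    return min(candidates, key=lambda q: abs(q - target))

# Usage: an actor output of 0.73 with bitwidths restricted to {2, 4, 6, 8}.
bitwidth = discretize_action(0.73)   # -> 6
```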
  11. The graph neural network compression method according to claim 10, wherein the selecting the continuous action with the actor module according to the Behavior policy comprises:
    selecting the continuous action with the actor module according to the Behavior policy as follows:
    a_τ = μ(O_τ | θ^μ) + N_τ
    where N_τ denotes the random OU (Ornstein-Uhlenbeck) noise for the τ-th time step, O_τ denotes the historical state vector for the τ-th time step, μ denotes the online actor network in the actor module, and θ^μ denotes the online actor network parameters.
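Assuming the noise term is an Ornstein-Uhlenbeck process, as is customary for this kind of Behavior policy, a brief sketch of the exploration step follows; the noise parameters and the assumption that the actor is a callable returning a NumPy array are illustrative.

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck process, commonly used as exploration noise in DDPG."""
    def __init__(self, dim: int, theta: float = 0.15, sigma: float = 0.2):
        self.theta, self.sigma = theta, sigma
        self.state = np.zeros(dim)

    def sample(self) -> np.ndarray:
        # dx = -theta * x * dt + sigma * dW, with dt = 1
        self.state += -self.theta * self.state + self.sigma * np.random.randn(*self.state.shape)
        return self.state

def select_action(actor, obs: np.ndarray, noise: OUNoise) -> np.ndarray:
    """Behavior policy: deterministic actor output plus exploration noise, clipped to [0, 1]."""
    return np.clip(actor(obs) + noise.sample(), 0.0, 1.0)
```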
  12. The graph neural network compression method according to claim 10, wherein the hardware accelerator training the quantized graph neural network with the quantized graph data comprises:
    the hardware accelerator training the quantized graph neural network with the quantized graph data based on mini-batch stochastic gradient descent.
  13. The graph neural network compression method according to claim 12, wherein the determining the memory usage, computation amount and latency corresponding to the updated action sequence comprises:
    computing the memory usage using the following formula:
    where store_MB denotes the memory usage, n_b denotes the number of graph vertices in a single mini-batch, f_l denotes the vertex dimension of the l-th network layer of the quantized graph neural network, L denotes the number of network layers of the quantized graph neural network, q_max denotes the maximum of the interval quantization bitwidths assigned to all graph vertices in the single mini-batch, S denotes the total number of convolution kernels, and q_W and q_F denote the network quantization bitwidths of the weight matrices and the convolution kernels of each network layer of the quantized graph neural network, respectively;
    computing the computation amount using the following formula:
    where compute_MB denotes the computation amount, q_σ denotes the network quantization bitwidth of the activation matrices of each network layer of the quantized graph neural network, and MAC_l denotes the total number of multiply-accumulate operations of the l-th layer of the quantized graph neural network;
    computing the latency using the following formula:
    latency_MB = Σ_{l=1..L} Λ_l
    where latency_MB denotes the latency and Λ_l denotes the latency of the l-th network layer of the quantized graph neural network in processing the mini-batch of graph data.
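The claim's exact cost formulas are not reproduced above; purely as an illustration of how per-mini-batch memory, compute and latency estimates might be assembled from the quantities defined in this claim, the following sketch uses generic expressions that are assumptions, not the patent's formulas.

```python
from typing import Sequence

def estimate_costs(n_b: int, dims: Sequence[int], q_vertex_max: int,
                   q_w: int, q_act: int, macs: Sequence[int],
                   layer_latency: Sequence[float]):
    """Rough per-mini-batch cost model (illustrative, not the patent's formulas).

    dims: vertex feature dimension per layer (length L + 1, including the input).
    macs: multiply-accumulate count per layer (length L).
    layer_latency: measured or modeled latency per layer (length L).
    """
    # Feature storage: every vertex feature stored at the largest assigned bitwidth.
    feature_bits = sum(n_b * f * q_vertex_max for f in dims)
    # Weight storage: dense layer-to-layer weight matrices at the weight bitwidth.
    weight_bits = sum(f_in * f_out * q_w for f_in, f_out in zip(dims[:-1], dims[1:]))
    memory_bits = feature_bits + weight_bits
    # Compute: MACs weighted by operand bitwidths, as in bit-serial accelerators.
    compute_cost = sum(m * q_w * q_act for m in macs)
    # Latency: sum of per-layer latencies.
    latency = sum(layer_latency)
    return memory_bits, compute_cost, latency
```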
  14. The graph neural network compression method according to claim 9, wherein the quantizing the vertex features in the graph data using the action sequence comprises:
    truncating the vertex feature of each graph vertex in the graph data to the range [-c, c] (c > 0), and quantizing the truncated vertex feature with the interval quantization bitwidth in the action sequence corresponding to the degree of the graph vertex, as follows:
    quantize(X_i,:(j), a′_τ, c) = round(clip(X_i,:(j), c)/s) × s;
    where quantize(·) denotes the quantization function, round(·) denotes the rounding function, clip(x, y) denotes the truncation function that clips x to [-y, y] (y > 0), X_i,: denotes the vertex feature, X_i,:(j) (j ∈ [1, f_0]) denotes the j-th component of the vertex feature, s denotes the scaling factor with s = c/(2^q − 1), and q denotes the interval quantization bitwidth in the action sequence corresponding to the degree of the graph vertex to which X_i,: belongs.
  15. The graph neural network compression method according to claim 14, wherein before quantizing the vertex features in the graph data using the action sequence, the method further comprises:
    determining the value of c as follows:
    c = argmin_x D_KL(X_i,: || quantize(X_i,:, a′_τ, x))
    where the argmin(·) function selects the value of x that minimizes D_KL(X_i,: || quantize(X_i,:, a′_τ, x)), and D_KL(X_i,: || quantize(X_i,:, a′_τ, x)) denotes the KL divergence between the feature distribution of X_i,: and the feature distribution of quantize(X_i,:, a′_τ, x); the feature distribution is the maximum, the minimum, the mean, the variance, the skewness or the kurtosis.
  16. The graph neural network compression method according to claim 9, wherein the actor module comprises an online actor network and a target actor network, the critic module comprises an online critic network and a target critic network, and the initializing the agent used by the reinforcement learning comprises:
    initializing online actor network parameters of the online actor network, and setting target actor network parameters of the target actor network to the same values as the online actor network parameters;
    initializing online critic network parameters of the online critic network, and setting target critic network parameters of the target critic network to the same values as the online critic network parameters.
  17. The graph neural network compression method according to claim 16, wherein the training the actor module and the critic module with the transition data comprises:
    adding the transition data to an experience replay pool, and randomly sampling a preset number of transition data from the experience replay pool as training data;
    determining a first gradient of the online critic network parameters using the training data, the target actor network, the target critic network, the online critic network and the following loss function:
    loss_Q = (1/N) Σ_τ (y_τ − Q(O_τ, a_τ | θ^Q))², with y_τ = r_τ + γ·Q′(O_τ+1, μ′(O_τ+1 | θ^μ′) | θ^Q′)
    where loss_Q denotes the loss function, a_τ denotes the continuous action, O_τ denotes the historical state vector for the τ-th time step, Q denotes the online critic network, θ^Q denotes the online critic network parameters, N denotes the preset number, y_τ denotes the estimate given by the target critic network, r_τ denotes the reward value for the τ-th time step, γ denotes a preset discount factor, Q′ denotes the target critic network, θ^Q′ denotes the target critic network parameters, μ′ denotes the target actor network, θ^μ′ denotes the target actor network parameters, and O_τ+1 denotes the current state vector for the τ-th time step;
    updating the online critic network parameters according to the first gradient;
    determining a performance objective using the training data, the updated online critic network, the online actor network and an objective function, and determining a second gradient of the performance objective with respect to the online actor network parameters:
    J_β(μ) = E_{O∼ρ^β}[Q(O, μ(O))]
    where E_{O∼ρ^β}[Q(O, μ(O))] denotes the expected value of Q(O, μ(O)) when the environment state O follows the distribution whose distribution function is ρ^β, θ^μ denotes the online actor network parameters, and ∇_θ^μ J_β(μ) denotes the second gradient;
    updating the online actor network parameters based on the second gradient;
    updating the target critic network parameters and the target actor network parameters with the updated online critic network parameters and online actor network parameters as follows:
    θ^Q′ ← α·θ^Q + (1 − α)·θ^Q′;  θ^μ′ ← α·θ^μ + (1 − α)·θ^μ′
    where α is a preset value.
  18. A graph neural network compression apparatus, comprising:
    an acquisition module, configured to acquire a trained graph neural network and the graph data used for its training;
    an interval determination module, configured to determine a degree distribution range corresponding to all graph vertices in the graph data, and to divide the degree distribution range into a plurality of degree intervals;
    a quantization bitwidth determination module, configured to determine, under the constraint of a preset resource limitation condition and using reinforcement learning and a hardware accelerator, an optimal interval quantization bitwidth corresponding to each of the degree intervals and an optimal network quantization bitwidth corresponding to the graph neural network;
    a quantization compression module, configured to quantize the vertex features of graph vertices of corresponding degrees in the graph data with the optimal interval quantization bitwidth, and to quantize the graph neural network with the optimal network quantization bitwidth, to obtain optimal quantized graph data and an optimal quantized graph neural network.
  19. An electronic device, comprising:
    a memory, configured to store a computer program;
    a processor, configured to implement the graph neural network compression method according to any one of claims 1 to 17 when executing the computer program.
  20. A non-volatile readable storage medium, wherein computer-executable instructions are stored in the non-volatile readable storage medium, and when the computer-executable instructions are loaded and executed by a processor, the graph neural network compression method according to any one of claims 1 to 17 is implemented.
PCT/CN2023/085970 2022-10-24 2023-04-03 一种图神经网络压缩方法、装置、电子设备及存储介质 WO2024087512A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211299256.8A CN115357554B (zh) 2022-10-24 2022-10-24 一种图神经网络压缩方法、装置、电子设备及存储介质
CN202211299256.8 2022-10-24

Publications (1)

Publication Number Publication Date
WO2024087512A1 true WO2024087512A1 (zh) 2024-05-02

Family

ID=84007819

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/085970 WO2024087512A1 (zh) 2022-10-24 2023-04-03 一种图神经网络压缩方法、装置、电子设备及存储介质

Country Status (2)

Country Link
CN (1) CN115357554B (zh)
WO (1) WO2024087512A1 (zh)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115357554B (zh) * 2022-10-24 2023-02-24 浪潮电子信息产业股份有限公司 一种图神经网络压缩方法、装置、电子设备及存储介质
CN116011551B (zh) * 2022-12-01 2023-08-29 中国科学技术大学 优化数据加载的图采样训练方法、系统、设备及存储介质
CN115934661B (zh) * 2023-03-02 2023-07-14 浪潮电子信息产业股份有限公司 一种图神经网络压缩方法、装置、电子设备及存储介质
CN116341633B (zh) * 2023-05-29 2023-09-01 山东浪潮科学研究院有限公司 一种模型部署方法、装置、设备及存储介质

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10747433B2 (en) * 2018-02-21 2020-08-18 Wisconsin Alumni Research Foundation Computer architecture for high-speed, graph-traversal
CN108962393B (zh) * 2018-05-12 2019-10-15 鲁东大学 基于压缩图神经网络的自动心律失常分析方法
CN112100286A (zh) * 2020-08-14 2020-12-18 华南理工大学 基于多维度数据的计算机辅助决策方法、装置、系统及服务器
CN114781615A (zh) * 2022-04-24 2022-07-22 上海大学 一种基于压缩神经网络的二阶段量化实现方法及装置

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190340492A1 (en) * 2018-05-04 2019-11-07 Microsoft Technology Licensing, Llc Design flow for quantized neural networks
CN110852439A (zh) * 2019-11-20 2020-02-28 字节跳动有限公司 神经网络模型的压缩与加速方法、数据处理方法及装置
CN111563589A (zh) * 2020-04-14 2020-08-21 中科物栖(北京)科技有限责任公司 一种神经网络模型的量化方法及装置
CN113570037A (zh) * 2021-07-13 2021-10-29 清华大学 神经网络压缩方法及装置
CN113762489A (zh) * 2021-08-12 2021-12-07 北京交通大学 一种对深度卷积神经网络进行多位宽量化的方法
CN113902108A (zh) * 2021-11-24 2022-01-07 贵州电网有限责任公司 一种量化位宽动态选择的神经网络加速硬件架构及方法
US20220092391A1 (en) * 2021-12-07 2022-03-24 Santiago Miret System and method of using neuroevolution-enhanced multi-objective optimization for mixed-precision quantization of deep neural networks
CN115357554A (zh) * 2022-10-24 2022-11-18 浪潮电子信息产业股份有限公司 一种图神经网络压缩方法、装置、电子设备及存储介质

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YIN WEN-FENG, LIANG LING-YAN , PENG HUI-MIN , CAO QI-CHUN , ZHAO JIAN , DONG GANG , ZHAO YA-QIAN , ZHAO KUN: "Research Progress on Convolutional Neural Network Compression and Acceleration Technology", COMPUTER SYSTEMS AND APPLICATIONS, ZHONGGUO KEXUEYUAN RUANJIAN YANJIUSUO, CN, vol. 29, no. 9, 15 September 2020 (2020-09-15), CN , pages 16 - 25, XP093028237, ISSN: 1003-3254, DOI: 10.15888/j.cnki.csa.007632 *

Also Published As

Publication number Publication date
CN115357554B (zh) 2023-02-24
CN115357554A (zh) 2022-11-18
