CN114997400A - Neural network acceleration reasoning method - Google Patents
Neural network acceleration reasoning method
- Publication number: CN114997400A (application CN202210598354.5A)
- Authority: CN (China)
- Prior art keywords: terminal, cloud server, representing, neural network, ith
- Prior art date: 2022-05-30
- Legal status: Withdrawn (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0877—Cache access modes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5061—Partitioning or combining of resources
- G06F9/5066—Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Evolutionary Computation (AREA)
- Data Mining & Analysis (AREA)
- Computational Linguistics (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Artificial Intelligence (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computer And Data Communications (AREA)
Abstract
The invention relates to a neural network accelerated inference method, which comprises the following steps: dividing a working model into a plurality of slices; constructing an objective function whose target is to minimize the sum of the processing time and the processing energy consumption of all slices; solving the objective function for an optimal solution; caching the slices to a plurality of computing nodes according to the optimal solution, the computing nodes comprising a terminal, edge servers and a cloud server; recording the caching scheme of the working model; and performing collaborative inference among the computing nodes according to the caching scheme. By caching the working model to a plurality of computing nodes in slices and letting the nodes infer collaboratively, the invention makes full use of the computing capability and location advantage of the edge servers, reduces the computing load on the cloud server as well as the access delay and transmission delay of the inference process, and achieves strong real-time performance.
Description
Technical Field
The invention relates to a neural network accelerated inference method and belongs to the field of edge computing.
Background
A neural network is a computing system formed by many simple processing units interconnected in some fashion; it processes information through the dynamic response of its state to external input. A neural network has a self-learning capability and can infer unknown relations from data, so it can make predictions on unseen data and has been applied in many fields. However, the inference process of a neural network requires a large amount of computing resources, especially for a working model that comprises several neural networks, such as a knowledge graph or an expert knowledge base.
Because a user terminal can hardly provide the computing resources a neural network requires, the traditional approach is to place the working model in a cloud server and rely on the cloud server's strong computing power to run the inference and respond to user demands. However, the cloud server is far away from the terminal and its access is centralized, so both the access delay and the transmission delay are large, and users with high real-time requirements are difficult to serve in time.
Therefore, a neural network inference method with better real-time performance is needed.
Disclosure of Invention
To overcome the problems in the prior art, the invention provides a neural network accelerated inference method: a caching scheme is obtained by solving an objective function, the working model is cached to a plurality of computing nodes in slices, and the computing nodes perform collaborative inference. In this way the computing capability and location advantage of the edge servers are fully utilized and the computing load on the cloud server is reduced; at the same time, the time required for collaborative inference is far shorter than that of a traditional cloud computing scheme, so the real-time performance is strong and the user experience is improved.
To achieve this purpose, the invention adopts the following technical solution:
A neural network accelerated inference method comprises the following steps:
dividing a working model into a plurality of slices;
constructing an objective function whose target is to minimize the sum of the processing time and the processing energy consumption of all slices;
solving the objective function for an optimal solution;
caching the slices to a plurality of computing nodes according to the optimal solution, the computing nodes comprising a terminal, edge servers and a cloud server;
recording the caching scheme of the working model; and performing collaborative inference among the computing nodes according to the caching scheme.
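For concreteness, the caching scheme recorded in the last step can be pictured as a simple per-slice placement record. The following Python sketch is illustrative only; the class name and node encoding are assumptions, not the patent's format:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SlicePlacement:
    """One entry of the recorded caching scheme.

    The `node` field encodes which of the binary decision variables
    a_i (terminal), b_ij (edge server j) or c_i (cloud server) of the
    objective function below equals 1 for this slice."""
    slice_index: int
    node: str  # "terminal", "edge-<j>" or "cloud"

# A caching scheme for a working model divided into three slices:
scheme = [
    SlicePlacement(0, "terminal"),
    SlicePlacement(1, "edge-1"),
    SlicePlacement(2, "cloud"),
]
```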
Further, the objective function is formulated as:

$$\min \sum_{i=1}^{n} \left( A\,T_i + B\,E_i \right),\qquad T_i = a_i t_{ai} + \sum_{j=1}^{m} b_{ij} t_{bji} + c_i t_{ci},\qquad E_i = a_i e_{ai} + \sum_{j=1}^{m} b_{ij} e_{bij} + c_i e_{ci}$$

where n denotes the number of slices; T_i denotes the processing time of the i-th slice; E_i denotes the processing energy consumption of the i-th slice; A and B denote weights; a_i, b_ij and c_i are decision variables; t_ai, t_bji and t_ci denote the processing time of the i-th slice on the terminal, the j-th edge server and the cloud server, respectively; e_ai, e_bij and e_ci denote the corresponding processing energy consumption; and m denotes the number of edge servers.
The constraints of the objective function are formulated as:

$$a_i + \sum_{j=1}^{m} b_{ij} + c_i = 1,\qquad a_i \in \{0,1\},\; b_{ij} \in \{0,1\},\; c_i \in \{0,1\}$$

$$t^{wait}_{ai} \ge T_k,\qquad t^{wait}_{bji} \ge T_k,\qquad t^{wait}_{ci} \ge T_k,\qquad \forall k \in pred(i)$$

where t^wait_ai, t^wait_bji and t^wait_ci denote the waiting delay of the i-th slice on the terminal, the j-th edge server and the cloud server, respectively; T_k denotes the processing time of the k-th slice; and pred(i) denotes the set of predecessor slices associated with the i-th slice.
Further, the processing time includes the waiting delay of a slice, the computation delay, and the transmission delay of the slice input data.
Further, the transmission delay is expressed as:

$$t^{trans}_{ai} = \frac{S_i}{r_a},\qquad t^{trans}_{bji} = \frac{S_i}{r_{bj}},\qquad t^{trans}_{ci} = \frac{S_i}{r_c}$$

where t^trans_ai, t^trans_bji and t^trans_ci denote the transmission delay of the i-th slice on the terminal, the j-th edge server and the cloud server, respectively; S_i denotes the size of the input data of the i-th slice; r_a denotes the uplink data transmission rate of the terminal; r_bj denotes the uplink data transmission rate of the j-th edge server; and r_c denotes the data transmission rate of the cloud server.
Further, the method also includes: providing the terminal, each edge server and the cloud server with a request queue; and taking the average waiting delay of each request queue as the waiting delay of a slice on the terminal, the corresponding edge server or the cloud server.
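A minimal sketch of such a per-node request queue is given below. The class name and sliding-window size are illustrative assumptions, and the mean of recently observed waits stands in for the queue's average waiting delay:

```python
import time
from collections import deque

class RequestQueue:
    """Per-node request queue; its average waiting delay is used as the
    waiting delay of any slice cached on that node."""

    def __init__(self, window=100):
        self._pending = deque()             # enqueue timestamps of waiting requests
        self._waits = deque(maxlen=window)  # sliding window of observed waits

    def enqueue(self):
        self._pending.append(time.monotonic())

    def start_next(self):
        """Called when the node starts processing the oldest request."""
        self._waits.append(time.monotonic() - self._pending.popleft())

    def average_waiting_delay(self):
        return sum(self._waits) / len(self._waits) if self._waits else 0.0
```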
Compared with the prior art, the invention has the following features and beneficial effects:
By solving an objective function for the caching scheme, caching the working model to a plurality of computing nodes in slices, and performing collaborative inference among the computing nodes, the invention makes full use of the computing capability and location advantage of the edge servers (an edge server is closer to the terminal) and reduces the computing load on the cloud server; at the same time, the time required for collaborative inference is far shorter than that of a traditional cloud computing scheme, the real-time performance is strong, and the user experience is improved.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
The present invention will be described in more detail with reference to examples.
Example one
As shown in FIG. 1, a neural network accelerated inference method includes the following steps:
A working model is trained, where the working model comprises at least one neural network model. For example, the intelligent financial processing system of the patent 'An intelligent financial processing system and method' contains several neural network models, including a text extraction model, an image classification model and a document classification model.
The working model is partitioned into n slices.
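As an illustration of the partition step, for a sequential model given as an ordered list of layers the split could look like the sketch below; the equal-count split is an assumption for simplicity, and in practice the cut points would themselves be chosen by the optimization:

```python
def partition(layers, n):
    """Split an ordered list of model layers into n contiguous slices."""
    size, rem = divmod(len(layers), n)
    slices, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < rem else 0)  # spread the remainder
        slices.append(layers[start:end])
        start = end
    return slices

# e.g. partition(["conv1", "conv2", "pool", "fc1", "fc2"], 2)
# -> [['conv1', 'conv2', 'pool'], ['fc1', 'fc2']]
```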
An objective function is constructed as follows:

$$\min \sum_{i=1}^{n} \left( A\,T_i + B\,E_i \right),\qquad T_i = a_i t_{ai} + \sum_{j=1}^{m} b_{ij} t_{bji} + c_i t_{ci},\qquad E_i = a_i e_{ai} + \sum_{j=1}^{m} b_{ij} e_{bij} + c_i e_{ci}$$

where n denotes the number of slices; T_i denotes the processing time of the i-th slice; E_i denotes the processing energy consumption of the i-th slice; A and B are weights with A + B = 1; a_i, b_ij and c_i are decision variables taking values in {0, 1}; t_ai, t_bji and t_ci denote the processing time of the i-th slice on the terminal, the j-th edge server and the cloud server, respectively; e_ai, e_bij and e_ci denote the corresponding processing energy consumption; and m denotes the number of edge servers.

The processing time on each node is the sum of a computation delay, a waiting delay and a transmission delay:

$$t_{ai} = t^{comp}_{ai} + t^{wait}_{ai} + t^{trans}_{ai},\qquad t_{bji} = t^{comp}_{bji} + t^{wait}_{bji} + t^{trans}_{bji},\qquad t_{ci} = t^{comp}_{ci} + t^{wait}_{ci} + t^{trans}_{ci}$$

where t^comp_ai, t^wait_ai and t^trans_ai denote the computation delay, waiting delay and transmission delay of the i-th slice on the terminal (the transmission delay in this embodiment refers to the transmission time of the input data required by the i-th slice); t^comp_bji, t^wait_bji and t^trans_bji denote the corresponding delays on the j-th edge server; and t^comp_ci, t^wait_ci and t^trans_ci denote the corresponding delays on the cloud server.
The terminal, each edge server and the cloud server each maintain a request queue, and the average waiting delay of a request queue is taken as the waiting delay of a slice on the corresponding node.
The computation delay and transmission delay are:

$$t^{comp}_{ai} = \frac{d_i}{f_a},\qquad t^{comp}_{bji} = \frac{d_i}{f_{bj}},\qquad t^{comp}_{ci} = \frac{d_i}{f_c}$$

$$t^{trans}_{ai} = \frac{S_i}{r_a},\qquad t^{trans}_{bji} = \frac{S_i}{r_{bj}},\qquad t^{trans}_{ci} = \frac{S_i}{r_c}$$

where d_i denotes the number of CPU clock cycles required by the i-th slice; f_a denotes the operating frequency of the terminal CPU; f_bj denotes the operating frequency of the j-th edge server CPU (in this embodiment the edge servers are assumed to have the same operating frequency); f_c denotes the operating frequency of the cloud server CPU; S_i denotes the size of the input data required by the i-th slice; r_a denotes the uplink data transmission rate of the terminal; r_bj denotes the uplink data transmission rate of the j-th edge server; and r_c denotes the data transmission rate of the cloud server.
The processing energy consumption is:

$$e_{ai} = d_i C_a,\qquad e_{bij} = d_i C_{bj},\qquad e_{ci} = d_i C_c$$

where d_i denotes the number of CPU clock cycles required by the i-th slice; C_a denotes the energy consumed by the terminal per clock cycle; C_bj denotes the energy consumed by the j-th edge server per clock cycle; and C_c denotes the energy consumed by the cloud server per clock cycle.
s.t.

$$a_i + \sum_{j=1}^{m} b_{ij} + c_i = 1,\qquad a_i \in \{0,1\},\; b_{ij} \in \{0,1\},\; c_i \in \{0,1\}$$

$$t^{wait}_{ai} \ge T_k,\qquad t^{wait}_{bji} \ge T_k,\qquad t^{wait}_{ci} \ge T_k,\qquad \forall k \in pred(i)$$

where pred(i) denotes the set of predecessor slices associated with the i-th slice. The first constraint states that each slice is cached on exactly one node; the second states that the waiting time of the i-th slice is greater than or equal to the completion time of every associated slice that precedes it.
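Putting the formulas of this embodiment together, the objective value of a candidate placement can be evaluated as in the following sketch. All numbers and node parameters are illustrative assumptions, and the precedence constraint on waiting delays is collapsed into a fixed per-node `wait` term for brevity:

```python
def slice_cost(d_i, S_i, node, params, A=0.5, B=0.5):
    """Weighted time-plus-energy cost of running one slice on one node,
    using t = d_i/f + S_i/r + wait and e = d_i * C from the formulas above."""
    f, r, C, wait = params[node]
    t = d_i / f + S_i / r + wait   # computation + transmission + waiting delay
    e = d_i * C                    # cycles times energy per clock cycle
    return A * t + B * e

def total_cost(slices, placement, params, A=0.5, B=0.5):
    """Objective value sum_i (A*T_i + B*E_i) for a candidate placement.
    `placement[i]` names the single node caching slice i, which enforces
    a_i + sum_j b_ij + c_i = 1 by construction."""
    return sum(slice_cost(d, S, placement[i], params, A, B)
               for i, (d, S) in enumerate(slices))

# Toy numbers (assumptions, not from the patent): cycles d_i, input size S_i
slices = [(2e9, 1e6), (4e9, 5e5), (1e9, 2e5)]
params = {                                  # f [Hz], r [B/s], C [J/cycle], wait [s]
    "terminal": (1e9, 1e6, 1e-9, 0.00),
    "edge-1":   (3e9, 5e6, 2e-9, 0.01),
    "cloud":    (1e10, 2e6, 3e-9, 0.05),
}
print(total_cost(slices, ["terminal", "edge-1", "cloud"], params))
```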
The optimal solution of the objective function is solved, and the n slices are cached to the terminal, the edge servers and the cloud server according to the optimal solution.
The terminal records the caching scheme; when the terminal receives an inference request, collaborative inference is performed among the computing nodes according to the caching scheme.
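Collaborative inference then amounts to walking the slices in order and executing each one on the node that caches it. The sketch below is a simulation only: `run_on_node` stands in for local execution or a network call, and all names are hypothetical:

```python
def cooperative_inference(x, slices, placement, run_on_node):
    """Run slices in order, following the recorded caching scheme.

    `placement[i]` is the node caching slice i; `run_on_node(node, fn, x)`
    abstracts local execution versus an RPC to an edge or cloud server
    (here both are just a function call, not a real network stack)."""
    for i, slice_fn in enumerate(slices):
        x = run_on_node(placement[i], slice_fn, x)  # output feeds the next slice
    return x

# Simulated nodes: each slice is a plain function, "transfer" is a call.
slices = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
placement = ["terminal", "edge-1", "cloud"]
result = cooperative_inference(
    5, slices, placement, run_on_node=lambda node, fn, x: fn(x))
print(result)  # ((5 + 1) * 2) - 3 = 9
```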
Example two
Since the objective function poses an NP-hard combinatorial problem, it is solved using a genetic algorithm:
s1, randomly generating M solutions as individuals to form an initial population;
s2, calculating the objective function value of each individual as the fitness according to the objective function formula;
s3, reserving N individuals with the highest fitness as candidate solutions;
s4, generating a new candidate solution according to the candidate solution;
s5, carrying out mutation operation on the newly generated candidate solution;
s6, iteratively circulating steps S2-S5 until the fitness of the optimal individual reaches a threshold value, wherein the optimal individual is the optimal solution of the objective function.
It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Claims (6)
1. A neural network accelerated inference method, characterized by comprising the following steps:
dividing a working model into a plurality of slices;
constructing an objective function whose target is to minimize the sum of the processing time and the processing energy consumption of all slices;
solving the objective function for an optimal solution;
caching the slices to a plurality of computing nodes according to the optimal solution, the computing nodes comprising a terminal, edge servers and a cloud server;
recording the caching scheme of the working model; and performing collaborative inference among the computing nodes according to the caching scheme.
2. The neural network accelerated inference method of claim 1, wherein the objective function is formulated as:

$$\min \sum_{i=1}^{n} \left( A\,T_i + B\,E_i \right),\qquad T_i = a_i t_{ai} + \sum_{j=1}^{m} b_{ij} t_{bji} + c_i t_{ci},\qquad E_i = a_i e_{ai} + \sum_{j=1}^{m} b_{ij} e_{bij} + c_i e_{ci}$$

where n denotes the number of slices; T_i denotes the processing time of the i-th slice; E_i denotes the processing energy consumption of the i-th slice; A and B both denote weights; a_i, b_ij and c_i are decision variables; t_ai, t_bji and t_ci denote the processing time of the i-th slice on the terminal, the j-th edge server and the cloud server, respectively; e_ai, e_bij and e_ci denote the corresponding processing energy consumption; and m denotes the number of edge servers.
3. The neural network accelerated inference method of claim 2, wherein the constraints of the objective function are formulated as:

$$a_i + \sum_{j=1}^{m} b_{ij} + c_i = 1,\qquad a_i \in \{0,1\},\; b_{ij} \in \{0,1\},\; c_i \in \{0,1\}$$

$$t^{wait}_{ai} \ge T_k,\qquad t^{wait}_{bji} \ge T_k,\qquad t^{wait}_{ci} \ge T_k,\qquad \forall k \in pred(i)$$

where t^wait_ai, t^wait_bji and t^wait_ci denote the waiting delay of the i-th slice on the terminal, the j-th edge server and the cloud server, respectively; T_k denotes the processing time of the k-th slice; and pred(i) denotes the set of predecessor slices associated with the i-th slice.
4. The neural network accelerated inference method of claim 2, wherein the processing time comprises the waiting delay of a slice, the computation delay, and the transmission delay of the slice input data.
5. The neural network accelerated inference method of claim 4, wherein the transmission delay is expressed as:

$$t^{trans}_{ai} = \frac{S_i}{r_a},\qquad t^{trans}_{bji} = \frac{S_i}{r_{bj}},\qquad t^{trans}_{ci} = \frac{S_i}{r_c}$$

where t^trans_ai, t^trans_bji and t^trans_ci denote the transmission delay of the i-th slice on the terminal, the j-th edge server and the cloud server, respectively; S_i denotes the size of the input data of the i-th slice; r_a denotes the uplink data transmission rate of the terminal; r_bj denotes the uplink data transmission rate of the j-th edge server; and r_c denotes the data transmission rate of the cloud server.
6. The neural network accelerated inference method of claim 4, further comprising: providing the terminal, each edge server and the cloud server with a request queue; and taking the average waiting delay of each request queue as the waiting delay of a slice on the terminal, the corresponding edge server or the cloud server.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202210598354.5A | 2022-05-30 | 2022-05-30 | Neural network acceleration reasoning method
Publications (1)

Publication Number | Publication Date
---|---
CN114997400A | 2022-09-02
Family
ID=83029859
Family Applications (1)
Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN202210598354.5A (CN114997400A, withdrawn) | Neural network acceleration reasoning method | 2022-05-30 | 2022-05-30

Country Status (1)

Country | Link
---|---
CN | CN114997400A (en)
Cited By (1)

Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN115858131A | 2023-02-22 | 2023-03-28 | | Task execution method, system, device and readable storage medium
Legal Events
Date | Code | Title | Description
---|---|---|---
 | PB01 | Publication | 
 | SE01 | Entry into force of request for substantive examination | 
 | WW01 | Invention patent application withdrawn after publication | Application publication date: 20220902