CN116523045A - Deep learning reasoning simulator oriented to multi-core chip - Google Patents


Info

Publication number
CN116523045A
CN116523045A
Authority
CN
China
Prior art keywords
reasoning
deep learning
chip
simulator
simulation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310235465.4A
Other languages
Chinese (zh)
Other versions
CN116523045B (en)
Inventor
汤昭荣
杨佳宁
毛旷
潘秋红
杨弢
叶茂伟
许慧卿
王颖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab filed Critical Zhejiang Lab
Priority to CN202310235465.4A
Publication of CN116523045A
Application granted
Publication of CN116523045B
Legal status: Active
Anticipated expiration

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/76 Architectures of general purpose stored program computers
    • G06F15/78 Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807 System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a deep learning reasoning simulator oriented to a multi-core chip, comprising: a configuration input layer for acquiring the deep learning model, multi-core chip architecture and mapping strategy required by a simulation; a model analysis layer for analyzing the deep learning model according to the mapping strategy to obtain a model analysis table; a route generation layer for deriving intra-operator routes and inter-operator routes from the operation strategy of each operator in the model analysis table and generating a routing file; a reasoning simulation layer for carrying out the reasoning simulation of the deep learning model on the multi-core chip described by the multi-core chip architecture, splitting the routing file and performing multi-process parallel simulation through a network-on-chip simulator to obtain the number of cycles required by each operator's routes; and a result calculation layer for collating the operator routing cycle counts obtained by parallel simulation in the reasoning simulation layer to obtain the cycle count and average device utilization of the deep learning reasoning simulation on the multi-core chip.

Description

Deep learning reasoning simulator oriented to multi-core chip
Technical Field
The invention belongs to the technical field of artificial intelligence, and relates to a multi-core chip-oriented deep learning reasoning simulator, suitable in particular for deep learning reasoning on multi-core chips.
Background
With the continued spread of deep learning research and applications, deploying deep models on multi-core chips has been proposed. Multi-core chips offer low cost and high yield; to use them for deep learning reasoning, early exploration with a simulator is indispensable.
In the process of realizing the invention, the inventors found that full-system simulators in the prior art are too slow to support effective design iteration, while cycle-accurate simulators lack a way to deploy a deep learning model onto the simulator directly.
To address this model-deployment problem, a framework for deep learning model reasoning on multi-core chips is needed.
Disclosure of Invention
Aiming at the defects of the prior art, the embodiment of the application aims to provide a deep learning reasoning simulator oriented to a multi-core chip.
According to a first aspect of embodiments of the present application, there is provided a deep learning reasoning simulator for a multi-core chip, including:
the configuration input layer is used for acquiring the deep learning model, the multi-core chip architecture and the mapping strategy required by simulation;
the model analysis layer is used for analyzing the deep learning model according to the mapping strategy to obtain a model analysis table, wherein the model analysis table describes the operation strategy of each operator in the deep learning model;
the route generation layer is used for analyzing intra-operator routes and inter-operator routes according to the operation strategy of each operator in the model analysis table and generating a routing file;
the reasoning simulation layer is used for carrying out the reasoning simulation of the deep learning model on the multi-core chip described by the multi-core chip architecture using the routing file, to obtain the number of cycles required by each operator's routes;
and the result calculation layer is used for collating the operator routing cycle counts obtained by parallel simulation in the reasoning simulation layer to obtain the cycle count and average device utilization of the deep learning reasoning simulation on the multi-core chip.
Further, the deep learning model is a deep neural network composed of several operators.
Further, the multi-core chip architecture is used to describe the architecture of the multi-core chip, which is a large chip composed of multiple cores, each core containing a set of neural network processing units.
Further, the mapping strategy is used to describe how operators are mapped onto the multi-core chip and how devices are assigned for computation.
Further, the model analysis table includes, for each operator, the operator type, input and output shapes, data type and operation strategy.
Further, the routing file is the set of routes of all data packets in the multi-core chip, and each data-packet route includes a sending time, a source address, a destination address and a data packet size.
Further, in the reasoning simulation layer, the routing file is divided into several parts, and a corresponding number of processes are simulated simultaneously using a network-on-chip simulator, so as to perform the reasoning simulation of the deep learning model on the multi-core chip.
Further, in the result calculation layer, for single batch reasoning, the calculation process of the cycle number is as follows:
calculating the cycle number required by each stage of reasoning, wherein the cycle number required by one stage is the sum of the cycle numbers required by each operator reasoning in the current stage;
and adding the cycle numbers required by each stage of reasoning to obtain the cycle numbers.
Further, in the result calculation layer, for multi-batch reasoning, the cycle count is calculated as follows:
calculate the number of cycles required by each stage of reasoning, the cycles required by one stage being the sum of the cycles required by each operator in that stage;
take the stage with the largest cycle count as the main body of the pipeline, multiply its cycle count by the number of batches, and add the cycle counts of the remaining stages to obtain the total cycle count of multi-batch reasoning.
Further, in the result calculation layer, the average device utilization is the average, over all operators, of the proportion of devices used within each operator's device subnet during reasoning.
The technical scheme provided by the embodiment of the application can comprise the following beneficial effects:
according to the embodiment, the simulator automatically deploys the deep learning model on the multi-core chip through the mapping strategy, so that simulation reasoning of the deep learning model on specific hardware is realized; by simulating parallel reasoning of the deep learning model on the multi-core chip, a large amount of reference data can be provided for early system structure design, the chip development cost is saved, and the system development speed is accelerated.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
FIG. 1 is a flow diagram of a multi-core chip-oriented deep learning reasoning simulator, according to an exemplary embodiment.
FIG. 2 is a diagram of a multi-core chip device, according to an exemplary embodiment.
FIG. 3 is an operator operation strategy diagram, according to an exemplary embodiment.
FIG. 4 is a diagram of pipelined multi-batch reasoning, according to an exemplary embodiment.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present application.
The terminology used in the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the present application. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first message may also be referred to as a second message, and similarly, a second message may also be referred to as a first message, without departing from the scope of the present application. The word "if" as used herein may be interpreted as "when", "upon" or "in response to determining", depending on the context.
FIG. 1 is a flow diagram of a multi-core chip-oriented deep learning reasoning simulator according to an exemplary embodiment; as shown in FIG. 1, the simulator may include:
the configuration input layer, used for acquiring the deep learning model, the multi-core chip architecture and the mapping strategy required by simulation;
the model analysis layer, used for analyzing the deep learning model according to the mapping strategy to obtain a model analysis table, wherein the model analysis table describes the operation strategy of each operator in the deep learning model;
the route generation layer, used for analyzing intra-operator routes and inter-operator routes according to the operation strategy of each operator in the model analysis table and generating a routing file;
the reasoning simulation layer, used for carrying out the reasoning simulation of the deep learning model on the multi-core chip described by the multi-core chip architecture using the routing file, to obtain the number of cycles required by each operator's routes;
and the result calculation layer, used for collating the operator routing cycle counts obtained by parallel simulation in the reasoning simulation layer to obtain the cycle count and average device utilization of the deep learning reasoning simulation on the multi-core chip.
According to this embodiment, the simulator automatically deploys the deep learning model on the multi-core chip through the mapping strategy, realizing simulated reasoning of the deep learning model on specific hardware; by simulating parallel reasoning of the deep learning model on the multi-core chip, it can provide a large amount of reference data for early architecture design, saving chip development cost and accelerating system development.
The following description is made on the deep learning reasoning simulator for the multi-core chip provided by the application:
1. The configuration input layer provides the deep learning model, the multi-core chip architecture and the mapping strategy required by simulation, wherein:
(1) The deep learning model is a deep neural network composed of a plurality of operators.
(2) The multi-core chip architecture is used to describe the architecture of the multi-core chip, which is a large chip composed of a plurality of cores, each core containing a set of neural network processing units. Fig. 2 shows an example: a multi-core chip with four cores, each core holding 9 neural network processing units.
(3) The mapping strategy is used to describe how operators are mapped onto the multi-core chip and how devices are allocated for computation; an example mapping strategy is shown in Table 1 below. The deep learning model is split at stage granularity: each stage consists of one or more operators, and the model performs reasoning stage by stage. The chip architecture is abstracted into a device graph, which is partitioned into several device subnets; each subnet can run one stage, and each stage occupies one device subnet on the device graph. A device subnet is determined by its starting point, length and width; as shown in fig. 2, stage 1 starts at (0, 0) with a length of 4 and a width of 6. The operator operation policy describes the strategy of each operator when running on its device subnet.
Table 1 mapping policy table
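The stage-to-subnet bookkeeping described above can be sketched with simple data structures. This is an illustrative sketch only; the class and field names (`DeviceSubnet`, `StageMapping`, `start`, `length`, `width`) are assumptions, not the patent's own code.

```python
from dataclasses import dataclass

@dataclass
class DeviceSubnet:
    start: tuple          # top-left coordinate of the subnet on the device graph
    length: int           # extent along the first axis
    width: int            # extent along the second axis

    def num_devices(self) -> int:
        # Total neural network processing units available to the stage.
        return self.length * self.width

@dataclass
class StageMapping:
    stage_id: int
    operators: list       # operator names assigned to this stage (hypothetical)
    subnet: DeviceSubnet  # device subnet the stage occupies

# Stage 1 in the Fig. 2 example: starts at (0, 0), length 4, width 6 -> 24 devices.
stage1 = StageMapping(1, ["matmul_0", "add_0"], DeviceSubnet((0, 0), 4, 6))
print(stage1.subnet.num_devices())  # 24
```

One `StageMapping` per stage, held in stage order, is enough to drive the later layers: the model analysis layer needs the subnet shape, and the result calculation layer needs the subnet size.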
2. The model analysis layer analyzes the input deep learning model according to the mapping strategy table. The resulting model analysis table contains, for each operator, the operator type, input and output shapes, data type and operation strategy, where an operator's operation strategy is the policy for distributing the operator across its device subnet so that the devices are fully utilized.
Table 2 model resolution table
The operator operation strategies are shown in fig. 3. When two tensors are multiplied as matrices, they can be divided into several blocks that are computed separately on several neural network processing units. With a horizontal (row) split, the two resulting tensors are computed on two processing units, and the final result is obtained by concatenating the two partial outputs; with a vertical (reduction-dimension) split, the two partial results must be added. Whether concatenated or added, tensors must be transferred between the neural network processing units.
Thus, different operation strategies use different splitting methods to distribute tensors across the processing units of the device subnet. The data transfers among processing units within one operator are called intra-operator routes; when one operator finishes and execution switches to the next, the resulting data transfers are called inter-operator routes.
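The two splitting strategies can be checked with a tiny worked example. This is an illustrative sketch in plain Python (the helper names `matmul` and `add` are hypothetical), not the patent's implementation; it only verifies the concatenate-vs-add behavior described above for C = A x B.

```python
def matmul(A, B):
    # Naive matrix multiply over lists of lists.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def add(X, Y):
    # Elementwise sum of two equally shaped matrices.
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
full = matmul(A, B)  # [[19, 22], [43, 50]]

# Horizontal (row) split of A: each processing unit computes half the output
# rows, and the two partial outputs are concatenated.
top, bottom = [A[0]], [A[1]]
concat = matmul(top, B) + matmul(bottom, B)
assert concat == full

# Vertical (reduction-dimension) split: each unit holds half the columns of A
# and the matching rows of B; the two partial products must be added.
A_left, A_right = [[1], [3]], [[2], [4]]
B_top, B_bottom = [B[0]], [B[1]]
summed = add(matmul(A_left, B_top), matmul(A_right, B_bottom))
assert summed == full
```

Both splits reproduce the unsplit product, and both require moving partial tensors between units, which is exactly the intra-operator routing traffic the route generation layer must account for.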
3. The route generation layer analyzes intra-operator and inter-operator routes according to the operator operation strategies in the model analysis table and generates a routing file. The routing file is the set of routes of all data packets in the multi-core chip; each packet route consists of a sending time, a source address, a destination address and a packet size.
As shown in Table 3 below, the sending time is the time at which the packet is injected into the network on chip; the source and destination addresses are four-dimensional coordinates, where the first two dimensions give the coordinates of the core and the last two the coordinates of the processing unit within the core; the packet size is measured in flits.
Table 3 routing file
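A routing-file record of the shape just described can be sketched as follows. The field names and the one-record-per-line text serialization are assumptions for illustration; the patent does not specify a concrete file format.

```python
from collections import namedtuple

# One routing-file record: sending time, four-dimensional source and
# destination addresses (core_x, core_y, pe_x, pe_y), and size in flits.
Route = namedtuple("Route", ["send_time", "src", "dst", "size_flits"])

routes = [
    Route(send_time=0,  src=(0, 0, 1, 2), dst=(0, 1, 0, 0), size_flits=16),
    Route(send_time=40, src=(0, 1, 0, 0), dst=(1, 0, 2, 2), size_flits=8),
]

# A routing file is then simply these records serialized one per line.
for r in routes:
    print(r.send_time, *r.src, *r.dst, r.size_flits)
```

Keeping the address as a flat four-tuple makes the records trivially splittable by stage or operator for the parallel simulation described next.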
4. The reasoning simulation layer performs the reasoning simulation of the deep learning model on the multi-core chip. The simulated content is the routing of data packets over the network on chip; that is, the number of cycles required by the reasoning is obtained from the routing file. To speed up the simulation, the routing file is split into several parts that are simulated simultaneously by a corresponding number of processes using a network-on-chip simulator; in a concrete implementation, an open-source network-on-chip simulator such as booksim, popnet or gem5 can be used.
It should be noted that:
(1) Although the stages execute serially, the reasoning simulation does not need to perform the actual numerical computation, so the routes of each stage can be simulated in parallel;
(2) Similarly, operators within the same stage can be simulated in parallel.
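The multi-process strategy above might be sketched as below. This is a hedged sketch: `simulate_shard` is a hypothetical stand-in cost model, not an invocation of a real NoC simulator such as booksim or popnet; in practice each worker would launch the external simulator on its shard of the routing file.

```python
from multiprocessing import Pool

def simulate_shard(shard):
    # Stand-in for one NoC-simulator run over a routing-file shard.
    # Toy cost model: last injection time plus one cycle per flit.
    last_send = max(send_time for send_time, _ in shard)
    return last_send + sum(flits for _, flits in shard)

def simulate_all(shards, workers=4):
    # One shard per stage/operator; shards are independent, so they can
    # run in parallel even though the stages execute serially on chip.
    with Pool(workers) as pool:
        return pool.map(simulate_shard, shards)  # cycle count per shard

if __name__ == "__main__":
    shards = [
        [(0, 16), (40, 8)],   # (send_time, flits) records of shard 0
        [(0, 4), (10, 4)],    # shard 1
    ]
    print(simulate_all(shards, workers=2))  # [64, 18]
```

`Pool.map` preserves shard order, so the per-shard cycle counts line up with the stage/operator list for the result calculation layer.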
5. The result calculation layer collates the data obtained by parallel simulation in the reasoning simulation layer to obtain the cycle count and average device utilization of the deep learning reasoning simulation on the multi-core chip.
5.1 The steps for calculating the number of reasoning simulation cycles are as follows:
(1) Calculate the number of cycles required by each stage of reasoning; the cycles required by one stage are the sum of the cycles required by each operator in that stage, i.e.

t_stage = t_1 + t_2 + ... + t_k

where t_stage is the number of cycles needed by the stage, t_i is the number of cycles needed by the i-th operator's reasoning, and k is the number of operators in the stage.
(2) Add the cycles required by each stage, giving the cycles required by one pass of reasoning:

t = t_stage(1) + t_stage(2) + ... + t_stage(m)

where t is the total number of cycles required for one inference and m is the number of stages.
And (3) for single batch reasoning, obtaining the cycle number through the steps (1) - (2).
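Steps (1) and (2) amount to a nested sum, sketched below; the stage and operator cycle numbers are illustrative, not taken from the patent.

```python
def stage_cycles(op_cycles):
    # Cycles of one stage: sum of the cycles of its operators.
    return sum(op_cycles)

def single_batch_cycles(stages):
    # `stages` maps each stage to the per-operator cycle counts inside it;
    # the total for one pass is the sum over stages.
    return sum(stage_cycles(ops) for ops in stages)

stages = [[100, 250], [400], [120, 80, 60]]  # illustrative operator cycles
print(single_batch_cycles(stages))  # 1010
```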
(3) For multi-batch reasoning, the cycle count can be calculated in pipeline fashion: the stage with the largest cycle count is taken as the main body of the pipeline, its cycle count is multiplied by the number of batches, and the cycle counts of the remaining stages are added to obtain the total cycle count of multi-batch reasoning.
In one embodiment, as shown in fig. 4, five samples A, B, C, D, E need to be inferred and each stage requires a different number of cycles; the total reasoning time is the sum of the cycles required by each stage plus the cycles required by stage 3 (the longest stage) multiplied by 4 (for the four remaining samples B, C, D, E). In general, when the number of batches is B, the total cycle count is:

T = (t_stage(1) + ... + t_stage(m)) + (B - 1) * max_j t_stage(j)

where T is the total number of cycles required for multi-batch reasoning; the first term is the cycles of the first batch flowing through the whole pipeline, and the second term is the cycles added by the remaining batches.
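The pipelined total can be computed directly from the per-stage cycle counts; the numbers below are illustrative, not Fig. 4's actual values.

```python
def multi_batch_cycles(stage_cycles, batches):
    # First batch pays the full sum of stage cycles; each remaining batch
    # adds only the bottleneck (longest) stage.
    bottleneck = max(stage_cycles)
    return sum(stage_cycles) + (batches - 1) * bottleneck

# Five samples, stage 3 is the bottleneck.
cycles = [200, 150, 500, 100]
print(multi_batch_cycles(cycles, 5))  # 950 + 4 * 500 = 2950
```

With a single batch the formula degenerates to the single-batch sum, as expected.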
5.2 A mapping strategy does not necessarily exploit the full computational power of the multi-core chip, so the average device utilization is also required. It is the average, over all operators, of the proportion of devices in the operator's device subnet that the operator actually uses, as in the following equation:

U = (u_1 + u_2 + ... + u_n) / n

where n is the total number of operators and u_i is the ratio of the devices used in the i-th operator's reasoning to the size of its device subnet.
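The utilization average can be sketched directly from that definition; the device counts below are illustrative assumptions.

```python
def average_utilization(used_devices, subnet_sizes):
    # Per-operator ratio of devices used to devices available in its subnet,
    # then the mean over all operators.
    ratios = [u / s for u, s in zip(used_devices, subnet_sizes)]
    return sum(ratios) / len(ratios)

# Three operators on subnets of 24, 24 and 12 devices.
print(average_utilization([24, 12, 6], [24, 24, 12]))  # 0.6666666666666666
```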
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains.
It is to be understood that the present application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof.

Claims (10)

1. A deep learning reasoning simulator oriented to a multi-core chip, characterized by comprising:
the configuration input layer, used for acquiring the deep learning model, the multi-core chip architecture and the mapping strategy required by simulation;
the model analysis layer, used for analyzing the deep learning model according to the mapping strategy to obtain a model analysis table, wherein the model analysis table describes the operation strategy of each operator in the deep learning model;
the route generation layer, used for analyzing intra-operator routes and inter-operator routes according to the operation strategy of each operator in the model analysis table and generating a routing file;
the reasoning simulation layer, used for carrying out the reasoning simulation of the deep learning model on the multi-core chip described by the multi-core chip architecture using the routing file, to obtain the number of cycles required by each operator's routes;
and the result calculation layer, used for collating the operator routing cycle counts obtained by parallel simulation in the reasoning simulation layer to obtain the cycle count and average device utilization of the deep learning reasoning simulation on the multi-core chip.
2. The deep learning reasoning simulator oriented to a multi-core chip according to claim 1, wherein the deep learning model is a deep neural network composed of several operators.
3. The deep learning reasoning simulator oriented to a multi-core chip according to claim 1, wherein the multi-core chip architecture is used to describe the architecture of the multi-core chip, which is a large chip composed of multiple cores, each core containing a set of neural network processing units.
4. The deep learning reasoning simulator oriented to a multi-core chip according to claim 1, wherein the mapping strategy is used to describe how operators are mapped onto the multi-core chip and how devices are assigned for computation.
5. The deep learning reasoning simulator oriented to a multi-core chip according to claim 1, wherein the model analysis table includes, for each operator, the operator type, input and output shapes, data type and operation strategy.
6. The deep learning reasoning simulator oriented to a multi-core chip according to claim 1, wherein the routing file is the set of routes of all data packets in the multi-core chip, each data-packet route including a sending time, a source address, a destination address and a data packet size.
7. The deep learning reasoning simulator oriented to a multi-core chip according to claim 1, wherein, in the reasoning simulation layer, the routing file is divided into several parts and a corresponding number of processes are simulated simultaneously using a network-on-chip simulator, so as to perform the reasoning simulation of the deep learning model on the multi-core chip.
8. The deep learning reasoning simulator oriented to a multi-core chip according to claim 1, wherein, in the result calculation layer, for single-batch reasoning, the cycle count is calculated as follows:
calculate the number of cycles required by each stage of reasoning, the cycles required by one stage being the sum of the cycles required by each operator in that stage;
add the cycles required by each stage to obtain the cycle count.
9. The deep learning reasoning simulator oriented to a multi-core chip according to claim 1, wherein, in the result calculation layer, for multi-batch reasoning, the cycle count is calculated as follows:
calculate the number of cycles required by each stage of reasoning, the cycles required by one stage being the sum of the cycles required by each operator in that stage;
take the stage with the largest cycle count as the main body of the pipeline, multiply its cycle count by the number of batches, and add the cycle counts of the remaining stages to obtain the total cycle count of multi-batch reasoning.
10. The deep learning reasoning simulator oriented to a multi-core chip according to claim 1, wherein, in the result calculation layer, the average device utilization is the average, over all operators, of the proportion of devices used within each operator's device subnet during reasoning.
CN202310235465.4A 2023-03-13 2023-03-13 Deep learning reasoning simulator oriented to multi-core chip Active CN116523045B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310235465.4A CN116523045B (en) 2023-03-13 2023-03-13 Deep learning reasoning simulator oriented to multi-core chip

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310235465.4A CN116523045B (en) 2023-03-13 2023-03-13 Deep learning reasoning simulator oriented to multi-core chip

Publications (2)

Publication Number Publication Date
CN116523045A 2023-08-01
CN116523045B 2023-11-07

Family

ID=87392950

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310235465.4A Active CN116523045B (en) 2023-03-13 2023-03-13 Deep learning reasoning simulator oriented to multi-core chip

Country Status (1)

Country Link
CN (1) CN116523045B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117236263A (en) * 2023-11-15 2023-12-15 之江实验室 Multi-core interconnection simulation method and device, storage medium and electronic equipment

Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2204763A1 (en) * 2008-12-23 2010-07-07 Wolfgang Dipl.-Ing. Schmidt Learning computer
US20190243735A1 (en) * 2018-02-05 2019-08-08 Wuhan University Deep belief network feature extraction-based analogue circuit fault diagnosis method
CN110163233A (en) * 2018-02-11 2019-08-23 陕西爱尚物联科技有限公司 A method of so that machine is competent at more complex works
US20190287208A1 (en) * 2018-03-15 2019-09-19 TMRW Entertainment Europe S.A.R.L. Game engine and artificial intelligence engine on a chip
US20210110089A1 (en) * 2019-10-10 2021-04-15 Nvidia Corporation Generating computer simulations of manipulations of materials based on machine learning from measured statistics of observed manipulations
CN113449856A (en) * 2020-03-27 2021-09-28 华为技术有限公司 Control flow graph processing method and related equipment
CN113986234A (en) * 2021-09-19 2022-01-28 苏州浪潮智能科技有限公司 Cross-platform model reasoning method, system, storage medium and equipment
WO2022083536A1 (en) * 2020-10-21 2022-04-28 华为技术有限公司 Neural network construction method and apparatus
KR20220061827A (en) * 2020-11-06 2022-05-13 한국전자통신연구원 Adaptive deep learning inference apparatus and method in mobile edge computing
WO2022110446A1 (en) * 2020-11-30 2022-06-02 中国科学院深圳先进技术研究院 Simulation method and apparatus for heterogeneous cluster scheduling, computer device, and storage medium
CN114580280A (en) * 2022-03-02 2022-06-03 北京市商汤科技开发有限公司 Model quantization method, device, apparatus, computer program and storage medium
CN115186821A (en) * 2022-09-13 2022-10-14 之江实验室 Core particle-oriented neural network inference overhead estimation method and device and electronic equipment
CN115378881A (en) * 2022-07-08 2022-11-22 南京邮数通信息科技有限公司 Federal learning-based home router data flow identification method and identification framework
CN115460128A (en) * 2022-11-09 2022-12-09 之江实验室 Network-on-chip simulation system for multi-core particle combined chip
CN115600676A (en) * 2022-10-08 2023-01-13 浙江大华技术股份有限公司(Cn) Deep learning model reasoning method, device, equipment and storage medium
CN115658274A (en) * 2022-11-14 2023-01-31 之江实验室 Modular scheduling method and device for neural network reasoning in core grain and computing equipment


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
丁然; 林建文; 朱振华; 刘弋波: "A CPU-like deep learning coprocessor architecture", China Integrated Circuit, no. 4 *
余鹏; 万里红; 霍宏; 方涛: "Object recognition based on a hierarchical feature mapping model", High Technology Letters, no. 4 *
王丽; 郭振华; 曹芳; 高开; 赵雅倩; 赵坤: "Automatic generation of model splitting strategies for model-parallel training", Computer Engineering & Science, no. 9 *
薛峰; 方维维: "EdgeMI: multi-device collaborative deep learning inference under resource constraints", Modern Computer, no. 20 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117236263A (en) * 2023-11-15 2023-12-15 Zhejiang Lab Multi-core interconnection simulation method and device, storage medium and electronic equipment
CN117236263B (en) * 2023-11-15 2024-02-06 Zhejiang Lab Multi-core interconnection simulation method and device, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN116523045B (en) 2023-11-07

Similar Documents

Publication Publication Date Title
CN109993299A (en) Data training method and device, storage medium, electronic device
CN116523045B (en) Deep learning reasoning simulator oriented to multi-core chip
CN110502337B (en) Optimization system for shuffling stage in Hadoop MapReduce
CN112149047A (en) Data processing method and device, storage medium and electronic device
CN110297748A (en) The method, apparatus and computer readable storage medium of error are called in a kind of positioning
Yasudo et al. Performance estimation for exascale reconfigurable dataflow platforms
Lößer et al. Bottlemod: Modeling data flows and tasks for fast bottleneck analysis
CN116974765A (en) Storage management system of heterogeneous computer
Alhazov et al. On the number of nodes in universal networks of evolutionary processors
CN106844024A (en) The GPU/CPU dispatching methods and system of a kind of self study run time forecast model
Spillane et al. Temporal partitioning for partially-reconfigurable-field-programmable gate
CN115292044A (en) Data processing method and device, electronic equipment and storage medium
CN104598917B (en) A kind of support vector machine classifier IP kernel
Wabnig et al. Performance prediction of parallel programs
CN116415667B (en) Data processing method, machine learning framework and related equipment
JPH0769893B2 (en) Neural network simulator
CN116821200B (en) Visual analysis system and visual analysis method for artificial intelligent cloud data
CN117829242B (en) Model processing method and related equipment
WO2024128372A1 (en) Calculation unit, buffer, and data transfer optimization methodology for next-generation high-speed, lightweight object recognition fpga npu system
CN118349514A (en) Multi-chiplet-oriented data transmission method and system
CN111159523A (en) Spark-based parallel ant colony optimization community discovery method
Dussa-Zieger et al. Configuration, mapping and sequencing by genetic algorithms
Houstis et al. The algorithm mapper: a system for modeling and evaluating parallel applications/architecture pairs
Luo et al. A flexible transputer network for numerical applications
CN117744726A (en) Neural network overhead estimation method and system for chiplet fault awareness

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant