CN113778655A - Network precision quantification method and system - Google Patents

Network precision quantification method and system

Info

Publication number
CN113778655A
CN113778655A
Authority
CN
China
Prior art keywords
quantized
network
core
precision
total amount
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010519846.1A
Other languages
Chinese (zh)
Inventor
孟凡辉
胡川
李涵
张爱飞
吴欣洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Lynxi Technology Co Ltd
Original Assignee
Beijing Lynxi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Lynxi Technology Co Ltd filed Critical Beijing Lynxi Technology Co Ltd
Priority to CN202010519846.1A priority Critical patent/CN113778655A/en
Priority to PCT/CN2021/099198 priority patent/WO2021249440A1/en
Priority to US17/760,023 priority patent/US11783168B2/en
Publication of CN113778655A publication Critical patent/CN113778655A/en
Pending legal-status Critical Current


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/50: Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005: Allocation of resources to service a request
    • G06F 9/5011: the resources being hardware resources other than CPUs, servers and terminals
    • G06F 9/5016: the resource being the memory
    • G06F 9/5027: the resource being a machine, e.g. CPUs, servers, terminals
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/0495: Quantised networks; Sparse networks; Compressed networks
    • G06N 3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063: Physical realisation using electronic means
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a network precision quantization method applied to a many-core chip, comprising the following steps: determining a reference precision according to the total amount of core resources of the many-core chip and each network to be quantized, wherein the total amount of core resources required to quantize each network to be quantized at the reference precision is less than or equal to the total amount of core resources of the many-core chip; and determining a target precision corresponding to each network to be quantized according to the reference precision and the total amount of core resources of the many-core chip. The invention also discloses a network precision quantization system. Beneficial effects of the invention: the chip resource utilization rate is improved when multiple networks coexist, while the precision loss caused by excessive quantization is reduced.

Description

Network precision quantification method and system
Technical Field
The invention relates to the technical field of neural networks, in particular to a method and a system for quantizing network precision.
Background
In the related art, when a neural network undergoes precision quantization, a single precision is selected, and only the resource reduction achieved by quantizing a single network is considered. The problem of how to select precisions for multiple networks and reasonably allocate on-chip resources when multiple networks coexist on a many-core chip has not been solved, so chip resources are either under-utilized or precision is lost through excessive quantization.
Disclosure of Invention
In order to solve the above problems, the present invention provides a network precision quantization method and system, which can improve the utilization rate of chip resources when multiple networks coexist and reduce the precision loss caused by network precision quantization.
The invention provides a network precision quantification method, which is applied to a many-core chip and comprises the following steps:
determining reference precision according to the total amount of the core resources of the many-core chip and each network to be quantized, wherein the total amount of the core resources required by each network to be quantized according to the reference precision is less than or equal to the total amount of the core resources of the many-core chip;
and determining the target precision corresponding to each network to be quantized according to the reference precision and the total amount of the core resources of the many-core chip.
As a further improvement of the present invention, determining the reference precision according to the total amount of core resources of the many-core chip and each network to be quantized includes:
determining the total amount of core resources S1 required to quantize each network to be quantized at the 1st precision;
judging whether the total amount of core resources S1 is less than or equal to the total amount of core resources Z of the many-core chip;
if the total amount of core resources S1 is greater than the total amount of core resources Z of the many-core chip, determining the total amount of core resources S2 required to quantize each network to be quantized at the 2nd precision, and judging whether S2 is less than or equal to the total amount of core resources Z of the many-core chip, wherein the 2nd precision is lower than the 1st precision;
and decreasing the quantization precision step by step and repeating the above steps until the total amount of core resources Sj required to quantize each network to be quantized at the jth precision is less than or equal to the total amount of core resources Z of the many-core chip, and determining the jth precision as the reference precision, wherein j is an integer greater than or equal to 2.
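The stepwise search described above can be sketched as follows. This is a minimal illustration, not the patented implementation; the table M and the function name are assumptions for the sketch.

```python
# Hypothetical sketch of the reference-precision search: step down through
# the candidate precisions (index 0 = highest, e.g. fp32) until the combined
# core requirement of all networks fits on the chip.

def find_reference_precision(M, Z):
    """M[i][p] = cores network i needs at precision index p (0 = highest).
    Z = total cores of the many-core chip.
    Returns (j, total) for the first precision index j whose total fits."""
    for j in range(len(M[0])):
        total = sum(row[j] for row in M)
        if total <= Z:
            return j, total
    raise ValueError("networks do not fit even at the lowest precision")
```

For example, with two networks needing (8, 4, 2) and (16, 8, 4) cores at three precision levels and a 14-core chip, the search skips the highest precision (24 cores) and settles on the middle one (12 cores).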
As a further improvement of the present invention, determining the target precision corresponding to each network to be quantized according to the reference precision and the total amount of core resources of the many-core chip includes:
determining the amount of remaining core resources Yj according to the total amount of core resources Sj required to quantize each network to be quantized at the reference precision j and the total amount of core resources Z of the many-core chip, wherein Yj = Z - Sj, and j is an integer greater than or equal to 2;
determining at least one core resource quantity difference W[i] = {M[i][1] - M[i][j], M[i][2] - M[i][j], ...} between quantizing each network to be quantized at each precision and at the jth precision, wherein i represents the number of the network to be quantized, and i is an integer greater than or equal to 1;
and determining the target precision corresponding to each network to be quantized according to the amount of remaining core resources Yj and the core resource quantity differences of each network to be quantized at each precision level.
As a further improvement of the invention, determining the target precision corresponding to each network to be quantized according to the amount of remaining core resources Yj and the core resource quantity differences of each network to be quantized at each precision level comprises:
for each network to be quantized, selecting one core resource quantity difference from the at least one core resource quantity difference W[i] = {M[i][1] - M[i][j], M[i][2] - M[i][j], ...}, such that the sum of the selected core resource quantity differences of all networks to be quantized is less than or equal to the amount of remaining core resources Yj and is maximized;
and determining the target precision corresponding to each network to be quantized from the selection for which that sum is maximized.
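The selection above amounts to picking one precision per network so that the total extra cores spent above the reference stays within Yj while being as large as possible. A brute-force sketch, with an assumed index convention (column j-1 of M holds the reference-precision core counts):

```python
from itertools import product

# Illustrative brute-force version of the selection step: for each network i,
# pick one precision index p (0 = 1st precision, j-1 = reference precision)
# so that the summed extra-core cost M[i][p] - M[i][j-1] fits in the
# remaining cores Y and is maximal. M and the names are assumptions.

def pick_target_precisions(M, Y):
    j = len(M[0])                      # reference precision is the last column
    best_cost, best_choice = -1, None
    for choice in product(range(j), repeat=len(M)):
        cost = sum(M[i][p] - M[i][j - 1] for i, p in enumerate(choice))
        if cost <= Y and cost > best_cost:
            best_cost, best_choice = cost, choice
    return best_cost, best_choice
```

With M = [[8, 4, 2], [16, 8, 4]] and 4 spare cores, the best use of the slack is to keep the first network at the reference precision and raise the second by one level (cost 4). Brute force is exponential in the number of networks; the patent's multi-level dynamic programming avoids that.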
As a further improvement of the invention, the networks to be quantized comprise a first type of network to be quantized and a second type of network to be quantized, the target precision corresponding to the second type of network to be quantized being a kth precision,
and determining the target precision corresponding to each network to be quantized according to the reference precision and the total amount of core resources of the many-core chip comprises:
determining the total amount of core resources Sj' required to quantize the first type of network to be quantized at a reference precision j', and the total amount of core resources Sk required to quantize the second type of network to be quantized at the specified precision k, wherein j' is an integer greater than or equal to 1, and k is an integer greater than or equal to 1;
determining the amount of remaining core resources Yj' according to the total amount of core resources Sj', the total amount of core resources Sk and the total amount of core resources Z of the many-core chip, wherein Yj' = Z - Sj' - Sk;
determining at least one core resource quantity difference W[i'] = {M[i'][1] - M[i'][j'], M[i'][2] - M[i'][j'], ...} between quantizing the first type of network to be quantized at each precision and at the j'th precision, wherein i' represents the number of a network of the first type, and i' is an integer greater than or equal to 1;
and determining the target precision corresponding to the first type of network to be quantized according to the amount of remaining core resources Yj' and the core resource quantity differences of the first type of network at each precision level.
As a further improvement of the invention, determining the target precision corresponding to the first type of network to be quantized according to the amount of remaining core resources Yj' and the core resource quantity differences of the first type of network at each precision level comprises:
for each network in the first type of network to be quantized, selecting one core resource quantity difference from W[i'] = {M[i'][1] - M[i'][j'], M[i'][2] - M[i'][j'], ...}, such that the sum of the selected core resource quantity differences of the first type of network is less than or equal to the amount of remaining core resources Yj' and is maximized;
and determining the target precision corresponding to the first type of network to be quantized from the selection for which that sum is maximized.
As a further improvement of the present invention, the target precisions corresponding to the networks to be quantized are not completely the same.
As a further improvement of the invention, determining the total amount of core resources S1 required to quantize each network to be quantized at the 1st precision comprises:
calculating the number of core resources M[i][1] required when the ith network is quantized at the 1st precision;
and determining the total amount of core resources S1 required to quantize each network to be quantized at the 1st precision, wherein
S1 = M[1][1] + M[2][1] + ... + M[N][1],
the number of networks is N, i represents the number of the network to be quantized, i is an integer greater than or equal to 1, and N is an integer greater than or equal to 1.
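The sum above is direct to compute; a toy check with hypothetical per-network core counts M[i][1]:

```python
# S1 is the total cores needed when every network is quantized at the
# 1st (highest) precision: S1 = M[1][1] + M[2][1] + ... + M[N][1].
# The counts below are illustrative, not from the patent.
M_first_precision = [12, 30, 8]   # M[i][1] for N = 3 networks
S1 = sum(M_first_precision)       # 50 cores in total
```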
As a further improvement of the present invention, determining the reference precision according to the total amount of core resources of the many-core chip and each network to be quantized includes:
determining the total amount of core resources S1 required to quantize each network to be quantized at the 1st precision;
judging whether the total amount of core resources S1 is less than or equal to the total amount of core resources Z of the many-core chip;
and if the total amount of core resources S1 is less than or equal to the total amount of core resources Z of the many-core chip, determining the 1st precision as the reference precision;
determining target precision corresponding to each network to be quantized according to the reference precision and the total amount of core resources of the many-core chip, wherein the determining of the target precision corresponding to each network to be quantized comprises the following steps:
and determining the 1 st precision as the target precision corresponding to each network to be quantized.
As a further improvement of the invention, the quantization precision comprises one or more of fp32, fp16, int8 and int4.
The invention also provides a network precision quantification system, which is applied to a many-core chip and comprises the following components:
the reference precision determining module is used for determining reference precision according to the total amount of the core resources of the many-core chip and the networks to be quantized, wherein the total amount of the core resources required by the networks to be quantized according to the reference precision is less than or equal to the total amount of the core resources of the many-core chip;
and the target precision determining module is used for determining the target precision corresponding to each network to be quantized according to the reference precision and the total amount of the core resources of the many-core chip.
As a further improvement of the present invention, the reference accuracy determination module is configured to:
determining the total amount of core resources S1 required to quantize each network to be quantized at the 1st precision;
judging whether the total amount of core resources S1 is less than or equal to the total amount of core resources Z of the many-core chip;
if the total amount of core resources S1 is greater than the total amount of core resources Z of the many-core chip, determining the total amount of core resources S2 required to quantize each network to be quantized at the 2nd precision, and judging whether S2 is less than or equal to the total amount of core resources Z of the many-core chip, wherein the 2nd precision is lower than the 1st precision;
and decreasing the quantization precision step by step and repeating the above steps until the total amount of core resources Sj required to quantize each network to be quantized at the jth precision is less than or equal to the total amount of core resources Z of the many-core chip, and determining the jth precision as the reference precision, wherein j is an integer greater than or equal to 2.
As a further improvement of the present invention, the target accuracy determination module is configured to:
determining the amount of remaining core resources Yj according to the total amount of core resources Sj required to quantize each network to be quantized at the reference precision j and the total amount of core resources Z of the many-core chip, wherein Yj = Z - Sj, and j is an integer greater than or equal to 2;
determining at least one core resource quantity difference W[i] = {M[i][1] - M[i][j], M[i][2] - M[i][j], ...} between quantizing each network to be quantized at each precision and at the jth precision, wherein i represents the number of the network to be quantized, and i is an integer greater than or equal to 1;
and determining the target precision corresponding to each network to be quantized according to the amount of remaining core resources Yj and the core resource quantity differences of each network to be quantized at each precision level.
As a further improvement of the invention, determining the target precision corresponding to each network to be quantized according to the amount of remaining core resources Yj and the core resource quantity differences of each network to be quantized at each precision level comprises:
for each network to be quantized, selecting one core resource quantity difference from the at least one core resource quantity difference W[i] = {M[i][1] - M[i][j], M[i][2] - M[i][j], ...}, such that the sum of the selected core resource quantity differences of all networks to be quantized is less than or equal to the amount of remaining core resources Yj and is maximized;
and determining the target precision corresponding to each network to be quantized from the selection for which that sum is maximized.
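The "multi-level dynamic programming" the description alludes to can be sketched as a multiple-choice knapsack over reachable spendings, rather than enumerating all combinations. The representation below (per-network lists of upgrade costs, with 0 meaning "stay at the reference precision") is an assumption for the sketch.

```python
# Hedged DP sketch: diffs[i] lists the extra cores network i would need
# above the reference precision for each candidate precision (0 included
# for staying at the reference). We track which totals of extra cores are
# reachable choosing exactly one option per network, capped at `remaining`.

def max_extra_cores(diffs, remaining):
    """Return the largest spendable amount of leftover cores <= remaining."""
    feasible = {0}
    for options in diffs:
        feasible = {y + d for y in feasible for d in options if y + d <= remaining}
    return max(feasible)
```

Because each network's option list contains 0, the feasible set is never empty; the runtime is bounded by the number of networks times the number of reachable totals, instead of being exponential in the number of networks.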
As a further improvement of the present invention, the networks to be quantized comprise a first type of network to be quantized and a second type of network to be quantized, the target precision corresponding to the second type of network to be quantized being a kth precision, and the target precision determining module is configured to perform:
determining the total amount of core resources Sj' required to quantize the first type of network to be quantized at a reference precision j', and the total amount of core resources Sk required to quantize the second type of network to be quantized at the specified precision k, wherein j' is an integer greater than or equal to 1, and k is an integer greater than or equal to 1;
determining the amount of remaining core resources Yj' according to the total amount of core resources Sj', the total amount of core resources Sk and the total amount of core resources Z of the many-core chip, wherein Yj' = Z - Sj' - Sk;
determining at least one core resource quantity difference W[i'] = {M[i'][1] - M[i'][j'], M[i'][2] - M[i'][j'], ...} between quantizing the first type of network to be quantized at each precision and at the j'th precision, wherein i' represents the number of a network of the first type, and i' is an integer greater than or equal to 1;
and determining the target precision corresponding to the first type of network to be quantized according to the amount of remaining core resources Yj' and the core resource quantity differences of the first type of network at each precision level.
As a further improvement of the invention, determining the target precision corresponding to the first type of network to be quantized according to the amount of remaining core resources Yj' and the core resource quantity differences of the first type of network at each precision level comprises:
for each network in the first type of network to be quantized, selecting one core resource quantity difference from W[i'] = {M[i'][1] - M[i'][j'], M[i'][2] - M[i'][j'], ...}, such that the sum of the selected core resource quantity differences of the first type of network is less than or equal to the amount of remaining core resources Yj' and is maximized;
and determining the target precision corresponding to the first type of network to be quantized from the selection for which that sum is maximized.
As a further improvement of the present invention, the target precisions corresponding to the networks to be quantized are not completely the same.
As a further improvement of the invention, determining the total amount of core resources S1 required to quantize each network to be quantized at the 1st precision comprises:
calculating the number of core resources M[i][1] required when the ith network is quantized at the 1st precision;
and determining the total amount of core resources S1 required to quantize each network to be quantized at the 1st precision, wherein
S1 = M[1][1] + M[2][1] + ... + M[N][1],
the number of networks is N, i represents the number of the network to be quantized, i is an integer greater than or equal to 1, and N is an integer greater than or equal to 1.
As a further improvement of the present invention, the reference accuracy determination module is configured to:
determining the total amount of core resources S1 required to quantize each network to be quantized at the 1st precision;
judging whether the total amount of core resources S1 is less than or equal to the total amount of core resources Z of the many-core chip;
and if the total amount of core resources S1 is less than or equal to the total amount of core resources Z of the many-core chip, determining the 1st precision as the reference precision;
determining target precision corresponding to each network to be quantized according to the reference precision and the total amount of core resources of the many-core chip, wherein the determining of the target precision corresponding to each network to be quantized comprises the following steps:
and determining the 1 st precision as the target precision corresponding to each network to be quantized.
As a further improvement of the invention, the quantization precision comprises one or more of fp32, fp16, int8 and int4.
The invention also provides an electronic device comprising a memory and a processor, the memory storing one or more computer instructions, wherein the one or more computer instructions are executed by the processor to implement the method.
The invention also provides a computer-readable storage medium having stored thereon a computer program for execution by a processor to perform the method.
Beneficial effects of the invention: the problem of on-chip resource allocation for multiple neural network models is addressed through quantization; the optimal selection of quantization precisions is solved by a multi-level dynamic programming algorithm; and the allocation and use of on-chip memory are optimized, so that the limited on-chip memory resources are fully utilized when allocating memory for multiple on-chip network models, while the precision loss caused by over-quantization is reduced.
Drawings
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without undue inventive faculty.
Fig. 1 is a schematic flowchart illustrating a network accuracy quantifying method according to an exemplary embodiment of the disclosure;
FIG. 2 is a flow diagram illustrating a plurality of network progressive quantization according to an exemplary embodiment of the present disclosure;
FIG. 3 is a flow chart illustrating progressive quantization of multiple networks according to yet another exemplary embodiment of the present disclosure;
fig. 4 is a schematic diagram of a many-core chip multiple network core resource allocation according to an exemplary embodiment of the disclosure;
In the figure:
1. the 1st network; 2. the 2nd network; 3. the 3rd network; 4. the 4th network; 5. the 5th network; 6. the 6th network; 7. the 7th network.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are only a part of the embodiments of the present disclosure, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.
It should be noted that if directional indications (such as up, down, left, right, front and back) are involved in the disclosed embodiments, they are only used to explain the relative positional relationship, motion, and the like of components in a specific posture (as shown in the drawings); if the specific posture changes, the directional indications change accordingly.
In addition, the terms used in the description of the present disclosure are for illustrative purposes only and are not intended to limit its scope. The terms "comprises" and/or "comprising" specify the presence of stated elements, steps, operations, and/or components, but do not preclude the presence or addition of one or more other elements, steps, operations, and/or components. The terms "first," "second," and the like may describe various elements but do not necessarily imply order or limit those elements; they are only used to distinguish one element from another. Unless otherwise specified, "a plurality" means two or more. These and other aspects will become apparent from the following drawings and description, from which one of ordinary skill in the art will readily recognize that alternative embodiments of the structures and methods illustrated may be employed without departing from the principles described in the present disclosure. The drawings are only for purposes of illustrating the described embodiments.
The method for quantizing network precision according to the embodiment of the present disclosure is applied to a many-core chip, and as shown in fig. 1, the method includes:
in step S1, determining a reference precision according to the total amount of core resources of the many-core chip and each network to be quantized, where the total amount of core resources required for quantization of each network to be quantized according to the reference precision is less than or equal to the total amount of core resources of the many-core chip;
in step S2, a target precision corresponding to each network to be quantized is determined according to the reference precision and the total amount of core resources of the many-core chip.
The networks to be quantized can be various neural network models that need quantization. The total amount of core resources of a many-core chip may refer to the number of cores of the many-core chip or to the core storage resources of the many-core chip. The reference precision and the target precision can be determined from various quantization precisions. For example, the quantization precision may include one or more of fp32 (a 32-bit data type), fp16 (a 16-bit data type), int8 (an 8-bit data type) and int4 (a 4-bit data type); when the quantization precision includes fp32, fp16, int8 and int4, the reference precision and the target precision corresponding to each network to be quantized can be selected from among fp32, fp16, int8 and int4. The target precisions corresponding to the networks to be quantized may all be the same, for example all equal to the reference precision, or they may not be completely the same.
In an artificial intelligence chip, storage resources are limited, and as neural networks consume more of them, precision quantization of neural network models becomes increasingly important. A many-core chip comprises a plurality of cores, which is a great advantage in neural network applications; with many cores available, choosing a quantization precision is no longer a simple single selection. When multiple neural networks coexist, their resources need to be allocated reasonably, so that the utilization rate of core resources is high and the precision loss caused by over-quantization is reduced. For the case where multiple networks coexist on a many-core chip, the method of the present disclosure performs multi-level quantization of the precisions of the multiple networks in a multi-level dynamic programming manner, based on awareness of the number of networks and of core resources, and according to the resources in the many-core chip and the specific resource requirements of the networks; the resources of each network are dynamically planned and allocated, achieving reasonable resource allocation and full utilization of on-chip resources while reducing the precision loss caused by over-quantization.
In an optional implementation manner, determining the reference precision according to the total amount of core resources of the many-core chip and each network to be quantized includes:
determining the total amount S1 of core resources required for quantizing each network to be quantized according to the 1st precision;
judging whether the total amount S1 of core resources is less than or equal to the total amount Z of core resources of the many-core chip;
if the total amount S1 of core resources is less than or equal to the total amount Z of core resources of the many-core chip, determining the 1st precision as the reference precision;
determining target precision corresponding to each network to be quantized according to the reference precision and the total amount of core resources of the many-core chip, wherein the determining of the target precision corresponding to each network to be quantized comprises the following steps:
and determining the 1 st precision as the target precision corresponding to each network to be quantized.
For example, when the 1st precision is the highest precision, each network to be quantized may be quantized according to the highest precision; when the total amount of core resources required for quantizing each network to be quantized according to the highest precision is less than or equal to the total amount of core resources of the many-core chip, core resources may be allocated directly to each network to be quantized according to the highest quantization precision, so that the core resources are utilized most fully and each network to be quantized retains high precision.
In an optional implementation manner, determining the total amount S1 of core resources required for quantizing each network to be quantized according to the 1st precision comprises:
calculating the number M[i][1] of core resources required when the ith network is quantized according to the 1st precision;
determining the total amount S1 of core resources required for quantizing each network to be quantized according to the 1st precision;
wherein S1 = M[1][1] + M[2][1] + ... + M[N][1], i.e., the sum of M[i][1] over all networks to be quantized,
the number of networks to be quantized is N, i represents the serial number of a network to be quantized, i is an integer greater than or equal to 1, and N is an integer greater than or equal to 1.
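The sum above can be computed directly; a minimal sketch (using 0-based indices, whereas the disclosure numbers networks and precisions from 1):

```python
def total_cores(M, j):
    """Total core resources S_j when every network is quantized at
    precision level j. M has one row per network; M[i][j] is the
    number of cores network i needs at precision level j."""
    return sum(row[j] for row in M)
```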
In an optional implementation manner, determining the reference precision according to the total amount of core resources of the many-core chip and each network to be quantized includes:
determining the total amount S1 of core resources required for quantizing each network to be quantized according to the 1st precision;
judging whether the total amount S1 of core resources is less than or equal to the total amount Z of core resources of the many-core chip;
if the total amount S1 of core resources is greater than the total amount Z of core resources of the many-core chip, determining the total amount S2 of core resources required for quantizing each network to be quantized according to the 2nd precision, and judging whether the total amount S2 of core resources is less than or equal to the total amount Z of core resources of the many-core chip, wherein the 2nd precision is lower than the 1st precision;
decreasing the quantization precision step by step in this way until the total amount Sj of core resources required for quantizing each network to be quantized according to the jth precision is less than or equal to the total amount Z of core resources of the many-core chip, and determining the jth precision as the reference precision, wherein j is an integer greater than or equal to 2.
When determining the total amount of core resources required for quantizing each network to be quantized according to a given quantization precision, the calculation Sj = M[1][j] + M[2][j] + ... + M[N][j] (the sum of M[i][j] over i = 1 to N) may be used, wherein the number of networks to be quantized is N, i represents the serial number of a network to be quantized, i is an integer greater than or equal to 1, and N is an integer greater than or equal to 1.
For example, the quantization precision includes fp32, fp16, int8, and int4. The 1st precision may represent fp32, the 2nd precision fp16, the 3rd precision int8, and the 4th precision int4. When the total amount S1 of core resources required for quantizing each network to be quantized according to the 1st precision is greater than the total amount Z of core resources of the many-core chip, the quantization precision is decreased step by step: for example, the total amount S2 of core resources required for quantizing each network to be quantized according to the 2nd precision (j=2) may be determined, the 2nd precision being lower than the 1st precision. If S2 is less than or equal to the total amount Z of core resources of the many-core chip, the 2nd precision (j=2) is determined as the reference precision.
If S2 is greater than the total amount Z of core resources of the many-core chip, the quantization precision continues to decrease step by step: the total amount S3 of core resources required for quantizing each network to be quantized according to the 3rd precision (j=3) is determined; if S3 is less than or equal to Z, the 3rd precision (j=3) is determined as the reference precision. If S3 is greater than Z, the precision is decreased again, and so on; for example, if the total amount S4 of core resources required for quantizing each network to be quantized according to the 4th precision is less than or equal to Z, the 4th precision (j=4) is determined as the reference precision.
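The stepwise search for the reference precision described above can be sketched as follows (a hypothetical helper; precision levels are ordered from highest to lowest along each row of M):

```python
def reference_precision(M, Z):
    """Return the first (highest) precision level whose total core
    demand fits within the chip's Z cores, or None if even the
    lowest precision does not fit."""
    n_levels = len(M[0])
    for j in range(n_levels):  # level 0 is the highest precision
        if sum(row[j] for row in M) <= Z:
            return j
    return None
```

With M = [[8, 4, 2], [8, 4, 2]] and Z = 10, level 0 needs 16 cores and fails, while level 1 needs 8 cores and becomes the reference precision.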
It should be understood that the reference precision may be determined in various manners, as long as the total amount of core resources required for quantizing each network to be quantized according to the reference precision is less than or equal to the total amount of core resources of the many-core chip, and the manner of determining the reference precision is not limited by the present disclosure.
In an optional implementation manner, determining a target precision corresponding to each network to be quantized according to the reference precision and the total amount of core resources of the many-core chip includes:
determining the remaining core resource amount Yj according to the total amount Sj of core resources required for quantizing each network to be quantized according to the reference precision j and the total amount Z of core resources of the many-core chip, wherein Yj = Z - Sj, and j is an integer greater than or equal to 2;
determining at least one core resource quantity difference W[i] = {M[i][1]-M[i][j], M[i][2]-M[i][j], ...} between quantizing each network to be quantized according to each precision and according to the jth precision, wherein i represents the serial number of a network to be quantized, and i is an integer greater than or equal to 1;
determining the target precision corresponding to each network to be quantized according to the remaining core resource amount Yj and the core resource quantity differences of each network to be quantized at each quantization level.
In another alternative embodiment, determining the target precision corresponding to each network to be quantized according to the remaining core resource amount Yj and the core resource quantity differences of each network to be quantized comprises:
for each network to be quantized, selecting one core resource quantity difference from the at least one core resource quantity difference W[i] = {M[i][1]-M[i][j], M[i][2]-M[i][j], ...}, such that the sum of the selected core resource quantity differences of the networks to be quantized is less than or equal to the remaining core resource amount Yj and, subject to this constraint, the sum of the core resource differences of the networks to be quantized is maximized;
and when the sum of the core resource differences of the networks to be quantized is maximized, determining the target precision corresponding to each network to be quantized.
For example, as shown in fig. 2, the multiple networks are quantized step by step from fp32 through int8 to int4, and the quantization precision may be selected from three types, namely fp32, int8, and int4, where the 1st precision is fp32, the 2nd precision is int8, and the 3rd precision is int4.
Step1, determining the total core resource S required by quantizing each network to be quantized according to fp32 precision1And judging the total amount S of the nuclear resources1Whether the total amount of the core resources Z of the many-core chip is less than or equal to;
if the total amount of the core resources S1The total amount Z of the core resources of the many-core chip is less than or equal to that of the many-core chip, the fp32 precision is determined as the target precision corresponding to each network to be quantized, and the core resources are quantitatively distributed to each network to be quantized according to the fp32 precision;
if the total amount of the core resources S1Performing Step2 when the total amount of the core resources Z is larger than the total amount of the core resources Z of the many-core chip;
Step2, determining the total amount S2 of core resources required for quantizing each network to be quantized according to int8 precision, and judging whether the total amount S2 of core resources is less than or equal to the total amount Z of core resources of the many-core chip;
if the total amount S2 of core resources is less than or equal to the total amount Z of core resources of the many-core chip, the target precision of each network to be quantized may be fp32 precision or int8 precision, provided that the total amount of core resources required after each network to be quantized is quantized according to its respective target precision is less than or equal to the total amount Z of core resources of the many-core chip;
if the total amount S2 of core resources is greater than the total amount Z of core resources of the many-core chip, performing Step3;
Step3, determining the total amount S3 of core resources required for quantizing each network to be quantized according to int4 precision, and judging whether the total amount S3 of core resources is less than or equal to the total amount Z of core resources of the many-core chip;
if the total amount S3 of core resources is less than or equal to the total amount Z of core resources of the many-core chip, the target precision of each network to be quantized may be selected from fp32 precision, int8 precision, or int4 precision, provided that the total amount of core resources required after each network to be quantized is quantized according to its respective target precision is less than or equal to the total amount Z of core resources of the many-core chip;
if the total amount S3 of core resources is greater than the total amount Z of core resources of the many-core chip, it is determined that all the networks cannot be deployed on the chip simultaneously.
In some optional embodiments, when it is determined in Step1 that the total amount S1 of core resources is greater than the total amount Z of core resources of the many-core chip, and it is determined in Step2 that the total amount S2 of core resources is less than or equal to the total amount Z of core resources of the many-core chip, int8 precision is determined as the reference precision, and the remaining core resource amount is Y2 = Z - S2. When computing the core resource quantity difference of each network to be quantized from fp32 precision down to int8 precision, the core resource quantity difference of each network to be quantized is W[i] = M[i][1] - M[i][2], and the target precision corresponding to each network to be quantized is determined according to the remaining core resource amount Y2 and the core resource quantity differences of each network to be quantized.
In some optional embodiments, when it is determined in Step1 that S1 is greater than Z, in Step2 that S2 is greater than Z, and in Step3 that S3 is less than or equal to Z, int4 precision is determined as the reference precision, and the remaining core resource amount is Y3 = Z - S3. When computing the core resource quantity differences of each network to be quantized from fp32 precision or int8 precision down to int4 precision, the core resource quantity difference of each network to be quantized may be W[i] = M[i][1] - M[i][3] or W[i] = M[i][2] - M[i][3], and the target precision corresponding to each network to be quantized is determined according to the remaining core resource amount Y3 = Z - S3 and the core resource quantity differences of each network to be quantized at each quantization level.
When determining the target precision of each network to be quantized, each network to be quantized may first be quantized according to the reference precision; the remaining core resource amount after this quantization is taken as the capacity of a knapsack, and the core resource quantity differences of each network to be quantized represent the values of knapsack items. A 0-1 knapsack dynamic programming algorithm maximizes the total value over all networks to be quantized, and the target precision corresponding to each network to be quantized is the one achieving this maximum. In this way, for multiple networks on a chip, maximizing the sum of the core resource differences of the networks serves as the optimal solution for the target precision of each network, so that the core resources of the chip are fully utilized, reasonable allocation of core resources is realized, and the precision loss caused by selecting a single quantization precision for all networks is reduced.
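The whole scheme can be sketched as follows, assuming per-network core counts M[i][j] are given (0-based levels, highest precision first; the patent's own formulation is 1-based). The upgrade step is a grouped 0-1 knapsack in which each network picks at most one higher precision and, since an item's "value" equals the extra cores it consumes, maximizing value maximizes core utilization:

```python
def plan_precisions(M, Z):
    """Return (levels, extra_cores_used): a target precision level per
    network and the leftover capacity consumed by upgrades, or None if
    even the lowest precision does not fit in Z cores."""
    n_levels = len(M[0])
    # 1) Reference precision: first level whose total demand fits.
    j = next((lv for lv in range(n_levels)
              if sum(row[lv] for row in M) <= Z), None)
    if j is None:
        return None
    capacity = Z - sum(row[j] for row in M)  # remaining cores Y_j
    # 2) Grouped 0-1 knapsack: dp maps cores-used -> chosen levels.
    dp = {0: [j] * len(M)}
    for i, row in enumerate(M):
        nxt = dict(dp)
        for used, levels in dp.items():
            for p in range(j):               # candidate upgrades for net i
                extra = row[p] - row[j]      # cores the upgrade costs
                if 0 <= extra and used + extra <= capacity:
                    chosen = levels[:]
                    chosen[i] = p
                    nxt.setdefault(used + extra, chosen)
        dp = nxt
    best = max(dp)                           # maximal core usage
    return dp[best], best
```

With M = [[4, 2], [6, 3]] (fp32 vs int8 core counts) and Z = 8, both networks fit at int8 (5 cores); the 3 leftover cores are exactly enough to upgrade the second network back to fp32.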
In an optional implementation manner, the target accuracies corresponding to the networks to be quantized are not completely the same.
For example, as shown in fig. 4, when the 1st network is quantized at fp32 precision, the 2nd, 4th, and 5th networks at int8 precision, and the 3rd, 6th, and 7th networks at int4 precision, the sum of the values of all the networks is the largest; that is, the target precision of the 1st network is determined to be fp32, the target precision of the 2nd, 4th, and 5th networks is int8, and the target precision of the 3rd, 6th, and 7th networks is int4.
In an optional implementation manner, the networks to be quantized include a first type of network to be quantized and a second type of network to be quantized, and the target precision corresponding to the second type of network to be quantized is a specified kth precision.
Determining the target precision corresponding to each network to be quantized according to the reference precision and the total amount of core resources of the many-core chip comprises:
determining the total amount Sj′ of core resources required for quantizing the first type of network to be quantized according to a reference precision j′, and the total amount Sk of core resources required for quantizing the second type of network to be quantized according to the specified precision k, wherein j′ is an integer greater than or equal to 1, and k is an integer greater than or equal to 1;
determining the remaining core resource amount Yj′ according to the total amount Sj′ of core resources, the total amount Sk of core resources, and the total amount Z of core resources of the many-core chip, wherein Yj′ = Z - Sj′ - Sk;
determining at least one core resource quantity difference W[i′] = {M[i′][1]-M[i′][j′], M[i′][2]-M[i′][j′], ...} between quantizing the first type of network to be quantized according to each precision and according to the j′th precision, wherein i′ represents the serial number of a network in the first type of network to be quantized, and i′ is an integer greater than or equal to 1;
determining the target precision corresponding to the first type of network to be quantized according to the remaining core resource amount Yj′ and the core resource quantity differences of the first type of network to be quantized at each quantization level.
In another alternative embodiment, determining the target precision corresponding to the first type of network to be quantized according to the remaining core resource amount Yj′ and the core resource quantity differences comprises:
for each network in the first type of network to be quantized, selecting one core resource quantity difference from W[i′] = {M[i′][1]-M[i′][j′], M[i′][2]-M[i′][j′], ...}, such that the sum of the selected core resource quantity differences of the first type of network to be quantized is less than or equal to the remaining core resource amount Yj′ and, subject to this constraint, the sum of the core resource differences of the first type of network to be quantized is maximized;
and when the sum of the core resource differences of the first type of network to be quantized is maximized, determining the target precision corresponding to the first type of network to be quantized.
When performing precision quantization on the networks to be quantized, quantization may be carried out as required; for example, one or several networks to be quantized are quantized according to a specified precision, and the corresponding target precision is then determined for the other networks to be quantized.
For example, as shown in fig. 3, the multiple networks are quantized step by step from fp32 through fp16 and int8 to int4, and the quantization precision may be selected from four types, namely fp32, fp16, int8, and int4, where j′=1 denotes fp32 precision, j′=2 denotes fp16 precision, j′=3 denotes int8 precision, and j′=4 denotes int4 precision:
Step1, determining the total amount S1 of core resources required for quantizing each network to be quantized according to fp32 precision, and judging whether the total amount S1 of core resources is less than or equal to the total amount Z of core resources of the many-core chip;
if the total amount S1 of core resources is less than or equal to the total amount Z of core resources of the many-core chip, determining fp32 precision as the target precision corresponding to each network to be quantized, and allocating core resources to each network to be quantized according to fp32 precision;
if the total amount S1 of core resources is greater than the total amount Z of core resources of the many-core chip, performing Step2;
Step2, determining the total amount S2 of core resources required for quantizing each network to be quantized according to fp16 precision, and judging whether the total amount S2 of core resources is less than or equal to the total amount Z of core resources of the many-core chip;
if the total amount S2 of core resources is less than or equal to the total amount Z of core resources of the many-core chip, the target precision of each network to be quantized may be fp32 precision or fp16 precision, provided that the total amount of core resources required after each network to be quantized is quantized according to its respective target precision is less than or equal to the total amount Z of core resources of the many-core chip;
if the total amount S2 of core resources is greater than the total amount Z of core resources of the many-core chip, performing Step3;
Step3, determining the total amount S3 of core resources required for quantizing each network to be quantized according to int8 precision, and judging whether the total amount S3 of core resources is less than or equal to the total amount Z of core resources of the many-core chip;
if the total amount S3 of core resources is less than or equal to the total amount Z of core resources of the many-core chip, the target precision of each network to be quantized may be selected from fp32 precision, fp16 precision, or int8 precision, provided that the total amount of core resources required after each network to be quantized is quantized according to its respective target precision is less than or equal to the total amount Z of core resources of the many-core chip;
if the total amount S3 of core resources is greater than the total amount Z of core resources of the many-core chip, performing Step4;
Step4, determining the total amount S4 of core resources required for quantizing each network to be quantized according to int4 precision, and judging whether the total amount S4 of core resources is less than or equal to the total amount Z of core resources of the many-core chip;
if the total amount S4 of core resources is less than or equal to the total amount Z of core resources of the many-core chip, the target precision of each network to be quantized may be selected from fp32 precision, fp16 precision, int8 precision, or int4 precision, provided that the total amount of core resources required after each network to be quantized is quantized according to its respective target precision is less than or equal to the total amount Z of core resources of the many-core chip;
if the total amount S4 of core resources is greater than the total amount Z of core resources of the many-core chip, it is determined that all the networks cannot be deployed on the chip simultaneously.
For example, when it is determined in Step1 that S1 is greater than Z, in Step2 that S2 is greater than Z, in Step3 that S3 is greater than Z, and in Step4 that S4 is less than or equal to Z, int4 precision is determined as the reference precision of the first type of network to be quantized. The total amount of core resources required for quantizing the first type of network to be quantized according to int4 precision is S4′, and the total amount of core resources required for quantizing the second type of network to be quantized according to the specified precision fp32 is Sk. The remaining core resource amount is then Y4′ = Z - S4′ - Sk. When computing the core resource quantity differences of the first type of network to be quantized from fp32, fp16, or int8 precision down to int4 precision, the core resource quantity difference of each network in the first type may be W[i′] = M[i′][1] - M[i′][4], W[i′] = M[i′][2] - M[i′][4], or W[i′] = M[i′][3] - M[i′][4], and the target precision corresponding to each network in the first type of network to be quantized is determined according to the remaining core resource amount Y4′ = Z - S4′ - Sk and the core resource quantity differences of the first type of network to be quantized at each quantization level.
When determining the target precision of each network to be quantized, the first type of network to be quantized may be quantized according to the reference precision and the second type of network to be quantized according to the specified precision; the remaining core resource amount after both types are quantized is taken as the capacity of a knapsack, and the core resource quantity differences of each network in the first type of network to be quantized represent the values of knapsack items. A 0-1 knapsack dynamic programming algorithm maximizes the total value over the first type of network to be quantized, and the target precision corresponding to the first type of network to be quantized is the one achieving this maximum.
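The capacity bookkeeping for this variant can be sketched as follows — a hypothetical helper computing the remaining capacity Y = Z - S_ref - S_pinned before the knapsack step runs over the free ("first type") networks:

```python
def remaining_after_pinned(M, Z, ref_level, pinned):
    """Remaining core capacity once pinned networks get their specified
    precision and every free network gets the reference precision.
    `pinned` maps network index -> fixed precision level (0-based).
    Returns None when the combination does not fit on the chip."""
    s_pinned = sum(M[i][lvl] for i, lvl in pinned.items())
    s_free = sum(M[i][ref_level] for i in range(len(M)) if i not in pinned)
    leftover = Z - s_free - s_pinned
    return leftover if leftover >= 0 else None
```

For three identical networks with core counts [8, 4, 2] per level, a chip of Z = 14 cores, network 0 pinned at level 0 (fp32), and reference level 2 (int4), the leftover is 14 - 8 - 4 = 2 cores available for upgrading the two free networks.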
The method of the present disclosure solves the problem of on-chip resource allocation for multi-network neural network models by means of quantization, solves the optimal selection of quantization precision with a multi-level dynamic programming algorithm, and optimizes the allocation and use of on-chip memory, so that limited chip memory resources are fully utilized during memory allocation for multi-network models while the precision loss caused by over-quantization is reduced.
The network precision quantization system according to the embodiment of the present disclosure is applied to a many-core chip, and the system includes:
the reference precision determining module is used for determining reference precision according to the total amount of the core resources of the many-core chip and the networks to be quantized, wherein the total amount of the core resources required by the networks to be quantized according to the reference precision is less than or equal to the total amount of the core resources of the many-core chip;
and the target precision determining module is used for determining the target precision corresponding to each network to be quantized according to the reference precision and the total amount of the core resources of the many-core chip.
In an optional embodiment, the reference accuracy determination module is further configured to:
determining the total amount S1 of core resources required for quantizing each network to be quantized according to the 1st precision;
judging whether the total amount S1 of core resources is less than or equal to the total amount Z of core resources of the many-core chip;
if the total amount S1 of core resources is less than or equal to the total amount Z of core resources of the many-core chip, determining the 1st precision as the reference precision;
determining target precision corresponding to each network to be quantized according to the reference precision and the total amount of core resources of the many-core chip, wherein the determining of the target precision corresponding to each network to be quantized comprises the following steps:
and determining the 1 st precision as the target precision corresponding to each network to be quantized.
It can be understood from the above that when each network to be quantized is quantized according to the highest precision and the total amount of required core resources is less than or equal to the total amount of core resources of the many-core chip, core resources may be allocated directly to each network to be quantized according to the highest quantization precision, so that the core resources are most fully utilized.
In an optional implementation manner, determining the total amount S1 of core resources required for quantizing each network to be quantized according to the 1st precision comprises:
calculating the number M[i][1] of core resources required when the ith network is quantized according to the 1st precision;
determining the total amount S1 of core resources required for quantizing each network to be quantized according to the 1st precision;
wherein S1 = M[1][1] + M[2][1] + ... + M[N][1], i.e., the sum of M[i][1] over all networks to be quantized,
the number of the networks is N, i represents the number of the networks to be quantized, i is an integer greater than or equal to 1, and N is an integer greater than or equal to 1.
In an optional implementation manner, determining the reference precision according to the total amount of core resources of the many-core chip and each network to be quantized includes:
determining the total amount S1 of core resources required for quantizing each network to be quantized according to the 1st precision;
judging whether the total amount S1 of core resources is less than or equal to the total amount Z of core resources of the many-core chip;
if the total amount S1 of core resources is greater than the total amount Z of core resources of the many-core chip, determining the total amount S2 of core resources required for quantizing each network to be quantized according to the 2nd precision, and judging whether the total amount S2 of core resources is less than or equal to the total amount Z of core resources of the many-core chip, wherein the 2nd precision is lower than the 1st precision;
decreasing the quantization precision step by step in this way until the total amount Sj of core resources required for quantizing each network to be quantized according to the jth precision is less than or equal to the total amount Z of core resources of the many-core chip, and determining the jth precision as the reference precision, wherein j is an integer greater than or equal to 2.
When determining the total amount of core resources required for quantizing each network to be quantized according to a given quantization precision, the calculation Sj = M[1][j] + M[2][j] + ... + M[N][j] (the sum of M[i][j] over i = 1 to N) may be used, wherein the number of networks to be quantized is N, i represents the serial number of a network to be quantized, i is an integer greater than or equal to 1, and N is an integer greater than or equal to 1.
In an alternative embodiment, the quantization precision includes one or more of fp32 (32-bit data type), fp16 (16-bit data type), int8 (8-bit data type), and int4 (4-bit data type). For example, the quantization precision of each network to be quantized is selected from fp32, fp16, int8, and int4, where j=1 represents fp32 precision, j=2 represents fp16 precision, j=3 represents int8 precision, and j=4 represents int4 precision. As another example, the quantization precision of each network to be quantized is selected from fp32, int8, and int4, where j=1 represents fp32 precision, j=2 represents int8 precision, and j=3 represents int4 precision.
In an optional embodiment, the target precision determining module is further configured to:
determine the amount of remaining core resources Yj according to the total amount of core resources Sj required to quantize each network to be quantized according to the reference precision j and the total amount of core resources Z of the many-core chip, wherein Yj = Z - Sj and j is an integer greater than or equal to 2;
determine at least one core resource quantity difference W[i] = {M[i][1] - M[i][j], M[i][2] - M[i][j], ...} between quantizing each network to be quantized according to each precision and according to the jth precision, wherein i represents the number of a network to be quantized and i is an integer greater than or equal to 1;
and determine the target precision corresponding to each network to be quantized according to the amount of remaining core resources Yj and the core resource quantity differences of each network to be quantized.
In an alternative embodiment, determining the target precision corresponding to each network to be quantized according to the amount of remaining core resources Yj and the core resource quantity differences of each network to be quantized includes:
for each network to be quantized, determining one core resource quantity difference from the at least one core resource quantity difference W[i] = {M[i][1] - M[i][j], M[i][2] - M[i][j], ...}, such that the sum of the selected core resource quantity differences of the networks to be quantized is less than or equal to the amount of remaining core resources Yj and is maximized;
and determining the target precision corresponding to each network to be quantized when the sum of the core resource quantity differences of the networks to be quantized is maximized.
When the target precision of each network to be quantized is determined, the system quantizes each network to be quantized according to the reference precision, treats the amount of core resources remaining after quantization as the capacity of a knapsack, and treats the core resource quantity differences of each network to be quantized as the values of knapsack items; a 0-1 knapsack dynamic programming algorithm then maximizes the total value over the networks to be quantized, and the target precision corresponding to each network to be quantized at the maximum total value can be determined. In this way, for multiple networks on one chip, maximizing the sum of the core resource quantity differences of the networks yields the optimal choice of target precision for each network, so that the core resources of the chip are fully utilized, the core resources are allocated reasonably, and the precision loss caused by selecting a single quantization precision for multiple networks is reduced.
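The knapsack selection described above can be sketched as a grouped 0-1 dynamic program (an illustrative sketch under assumptions, not the patented implementation: M[i][lvl] is the core count of network i at precision level lvl, level 0 is the highest precision, and each item's weight equals its value, since both are the extra cores consumed by a precision upgrade):

```python
def select_target_precisions(M, j_ref, Z):
    """Choose a target precision level for each network via a grouped
    0-1 knapsack dynamic program.

    M[i][lvl]: cores needed to quantize network i at precision level lvl
               (level 0 is the highest precision; j_ref is the reference
               precision found earlier, so sum of M[i][j_ref] <= Z).
    Returns the chosen precision level for each network.
    """
    N = len(M)
    Y = Z - sum(row[j_ref] for row in M)  # leftover cores = knapsack capacity
    dp = [0] * (Y + 1)                    # dp[c]: best total upgrade cost <= c
    pick = [[j_ref] * N for _ in range(Y + 1)]
    for i in range(N):                    # one knapsack "group" per network
        nxt, nxt_pick = dp[:], [p[:] for p in pick]
        for lvl in range(j_ref):          # candidate upgrades to higher precision
            w = M[i][lvl] - M[i][j_ref]   # extra cores; weight equals value
            if w <= 0 or w > Y:
                continue
            for c in range(w, Y + 1):
                # transitions read dp, the state before this network's group,
                # so each network upgrades at most once
                if dp[c - w] + w > nxt[c]:
                    nxt[c] = dp[c - w] + w
                    nxt_pick[c] = pick[c - w][:]
                    nxt_pick[c][i] = lvl
        dp, pick = nxt, nxt_pick
    best_c = max(range(Y + 1), key=dp.__getitem__)
    return pick[best_c]
```

Because weight equals value here, maximizing the total value is equivalent to spending as many of the leftover cores as possible on precision upgrades.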
In an optional implementation manner, the target accuracies corresponding to the networks to be quantized are not completely the same.
In an optional implementation, each network to be quantized belongs to a first type of network to be quantized or a second type of network to be quantized, the target precision corresponding to the second type of network to be quantized is a specified kth precision, and the target precision determining module determines the target precision corresponding to each network to be quantized according to the reference precision and the total amount of core resources of the many-core chip by:
determining the total amount of core resources Sj′ required to quantize the first type of network to be quantized according to a reference precision j′, and the total amount of core resources Sk required to quantize the second type of network to be quantized according to the specified kth precision, wherein j′ is an integer greater than or equal to 1 and k is an integer greater than or equal to 1;
determining the amount of remaining core resources Yj′ according to the total amount of core resources Sj′, the total amount of core resources Sk and the total amount of core resources Z of the many-core chip, wherein Yj′ = Z - Sj′ - Sk;
determining at least one core resource quantity difference W[i′] = {M[i′][1] - M[i′][j′], M[i′][2] - M[i′][j′], ...} between quantizing the first type of network to be quantized according to each precision and according to the j′th precision, wherein i′ represents the number of a network in the first type of network to be quantized and i′ is an integer greater than or equal to 1;
and determining the target precision corresponding to the first type of network to be quantized according to the amount of remaining core resources Yj′ and the core resource quantity differences of the first type of network to be quantized.
In an alternative embodiment, determining the target precision corresponding to the first type of network to be quantized according to the amount of remaining core resources Yj′ and the core resource quantity differences of the first type of network to be quantized includes:
for each network in the first type of network to be quantized, determining one core resource quantity difference from the at least one core resource quantity difference W[i′] = {M[i′][1] - M[i′][j′], M[i′][2] - M[i′][j′], ...}, such that the sum of the selected core resource quantity differences of the first type of network to be quantized is less than or equal to the amount of remaining core resources Yj′ and is maximized;
and determining the target precision corresponding to the first type of network to be quantized when the sum of the core resource quantity differences of the first type of network to be quantized is maximized.
When precision quantization is performed on the networks to be quantized, it can be performed as required; for example, one or more networks to be quantized are quantized according to a specified precision, and corresponding target precisions are then determined for the other networks to be quantized. When the target precision of each network to be quantized is determined, the first type of network to be quantized can be quantized according to the reference precision and the second type of network to be quantized according to the specified precision; the amount of core resources remaining after both types are quantized is treated as the capacity of a knapsack, the core resource quantity difference of each network in the first type of network to be quantized is treated as the value of a knapsack item, and a 0-1 knapsack dynamic programming algorithm maximizes the total value over the first type of network to be quantized, at which point the target precision corresponding to the first type of network to be quantized at the maximum total value can be determined.
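The capacity of the new knapsack in this variant, i.e. the cores left after quantizing the first type at the reference precision j′ and the second type at the specified precision k, can be sketched as follows (a hypothetical helper; M_first, M_second and the precision-level indexing are assumptions, not part of the patent):

```python
def remaining_capacity(M_first, M_second, j_ref, k, Z):
    """Leftover cores Y_j' = Z - S_j' - S_k after quantizing the first
    type of networks at reference level j_ref and the second type at
    the specified level k (per-network, per-level core counts assumed)."""
    S_jref = sum(row[j_ref] for row in M_first)  # S_j' for the first type
    S_k = sum(row[k] for row in M_second)        # S_k for the second type
    return Z - S_jref - S_k                      # knapsack capacity Y_j'
```

The same knapsack selection is then run over the first type of network only, with this value as the knapsack capacity.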
The disclosed system addresses resource allocation for multiple neural network models on one chip through quantization, solves the optimal selection of quantization precision with a multi-level dynamic programming algorithm, and optimizes the allocation and use of on-chip memory, so that limited chip memory resources are fully utilized when allocating memory for multiple network models while the precision loss caused by over-quantization is reduced.
The disclosure also relates to an electronic device comprising a server, a terminal and the like. The electronic device includes: at least one processor; a memory communicatively coupled to the at least one processor; and a communication component communicatively coupled to the storage medium, the communication component receiving and transmitting data under control of the processor; wherein the memory stores instructions executable by the at least one processor to implement the method for quantifying network accuracy in the above embodiments.
In an alternative embodiment, the memory, as a non-volatile computer-readable storage medium, is used to store non-volatile software programs, non-volatile computer-executable programs, and modules. The processor implements the method by running the non-volatile software programs, instructions, and modules stored in the memory, thereby executing the various functional applications and data processing of the device.
The memory may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store a list of options, etc. Further, the memory may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the memory optionally includes memory located remotely from the processor, and such remote memory may be connected to the external device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
One or more modules are stored in the memory and, when executed by the one or more processors, perform the method of quantifying network accuracy in any of the method embodiments described above.
The above-mentioned product can execute the method provided by the embodiment of the present application, and has corresponding functional modules and beneficial effects of the execution method, and the technical details not described in detail in the embodiment of the present application can be referred to the method for quantifying network accuracy provided by the embodiment of the present application.
The present disclosure also relates to a computer-readable storage medium for storing a computer-readable program for causing a computer to perform some or all of the above-described embodiments of a method for quantifying network accuracy.
That is, as can be understood by those skilled in the art, all or part of the steps of the methods in the embodiments described above may be implemented by a program instructing related hardware, where the program is stored in a storage medium and includes several instructions to enable a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the disclosure may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Furthermore, those of ordinary skill in the art will appreciate that although some embodiments described herein include some but not all of the features included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the disclosure and to form different embodiments. For example, in the claims, any of the claimed embodiments may be used in any combination.
It will be understood by those skilled in the art that while the present disclosure has been described with reference to exemplary embodiments, various changes may be made and equivalents may be substituted for elements thereof without departing from the scope of the disclosure. In addition, many modifications may be made to adapt a particular situation or material to the teachings of the disclosure without departing from the essential scope thereof. Therefore, it is intended that the disclosure not be limited to the particular embodiment disclosed, but that the disclosure will include all embodiments falling within the scope of the appended claims.

Claims (10)

1. A quantification method of network precision is applied to a many-core chip, and comprises the following steps:
determining reference precision according to the total amount of the core resources of the many-core chip and each network to be quantized, wherein the total amount of the core resources required by each network to be quantized according to the reference precision is less than or equal to the total amount of the core resources of the many-core chip;
and determining the target precision corresponding to each network to be quantized according to the reference precision and the total amount of the core resources of the many-core chip.
2. The method of claim 1, wherein determining the reference precision according to the total amount of core resources of the many-core chip and each network to be quantized comprises:
determining the total amount of core resources S1 required to quantize each network to be quantized according to the 1st precision;
judging whether the total amount of core resources S1 is less than or equal to the total amount of core resources Z of the many-core chip;
if the total amount of core resources S1 is greater than the total amount of core resources Z of the many-core chip, determining the total amount of core resources S2 required to quantize each network to be quantized according to the 2nd precision, and judging whether the total amount of core resources S2 is less than or equal to the total amount of core resources Z of the many-core chip, wherein the 2nd precision is lower than the 1st precision;
repeating the above steps with gradually decreasing quantization precision until the total amount of core resources Sj required to quantize each network to be quantized according to the jth precision is less than or equal to the total amount of core resources Z of the many-core chip, and determining the jth precision as the reference precision, wherein j is an integer greater than or equal to 2.
3. The method of claim 1, wherein determining the target precision corresponding to each network to be quantized according to the reference precision and the total amount of core resources of the many-core chip comprises:
determining the amount of remaining core resources Yj according to the total amount of core resources Sj required to quantize each network to be quantized according to the reference precision j and the total amount of core resources Z of the many-core chip, wherein Yj = Z - Sj and j is an integer greater than or equal to 2;
determining at least one core resource quantity difference W[i] = {M[i][1] - M[i][j], M[i][2] - M[i][j], ...} between quantizing each network to be quantized according to each precision and according to the jth precision, wherein i represents the number of a network to be quantized and i is an integer greater than or equal to 1;
and determining the target precision corresponding to each network to be quantized according to the amount of remaining core resources Yj and the core resource quantity differences of each network to be quantized.
4. The method of claim 3, wherein determining the target precision corresponding to each network to be quantized according to the amount of remaining core resources Yj and the core resource quantity differences of each network to be quantized comprises:
for each network to be quantized, determining one core resource quantity difference from the at least one core resource quantity difference W[i] = {M[i][1] - M[i][j], M[i][2] - M[i][j], ...}, such that the sum of the selected core resource quantity differences of the networks to be quantized is less than or equal to the amount of remaining core resources Yj and is maximized;
and determining the target precision corresponding to each network to be quantized when the sum of the core resource quantity differences of the networks to be quantized is maximized.
5. The method according to claim 1, wherein each network to be quantized belongs to a first type of network to be quantized or a second type of network to be quantized, and the target precision corresponding to the second type of network to be quantized is a specified kth precision,
determining the target precision corresponding to each network to be quantized according to the reference precision and the total amount of core resources of the many-core chip comprising:
determining the total amount of core resources Sj′ required to quantize the first type of network to be quantized according to a reference precision j′, and the total amount of core resources Sk required to quantize the second type of network to be quantized according to the specified kth precision, wherein j′ is an integer greater than or equal to 1 and k is an integer greater than or equal to 1;
determining the amount of remaining core resources Yj′ according to the total amount of core resources Sj′, the total amount of core resources Sk and the total amount of core resources Z of the many-core chip, wherein Yj′ = Z - Sj′ - Sk;
determining at least one core resource quantity difference W[i′] = {M[i′][1] - M[i′][j′], M[i′][2] - M[i′][j′], ...} between quantizing the first type of network to be quantized according to each precision and according to the j′th precision, wherein i′ represents the number of a network in the first type of network to be quantized and i′ is an integer greater than or equal to 1;
and determining the target precision corresponding to the first type of network to be quantized according to the amount of remaining core resources Yj′ and the core resource quantity differences of the first type of network to be quantized.
6. The method of claim 5, wherein determining the target precision corresponding to the first type of network to be quantized according to the amount of remaining core resources Yj′ and the core resource quantity differences of the first type of network to be quantized comprises:
for each network in the first type of network to be quantized, determining one core resource quantity difference from the at least one core resource quantity difference W[i′] = {M[i′][1] - M[i′][j′], M[i′][2] - M[i′][j′], ...}, such that the sum of the selected core resource quantity differences of the first type of network to be quantized is less than or equal to the amount of remaining core resources Yj′ and is maximized;
and determining the target precision corresponding to the first type of network to be quantized when the sum of the core resource quantity differences of the first type of network to be quantized is maximized.
7. The method of claim 1, wherein the target accuracies corresponding to the networks to be quantized are not identical.
8. A system for quantifying network precision, the system being applied to a many-core chip, the system comprising:
the reference precision determining module is used for determining reference precision according to the total amount of the core resources of the many-core chip and the networks to be quantized, wherein the total amount of the core resources required by the networks to be quantized according to the reference precision is less than or equal to the total amount of the core resources of the many-core chip;
and the target precision determining module is used for determining the target precision corresponding to each network to be quantized according to the reference precision and the total amount of the core resources of the many-core chip.
9. An electronic device comprising a memory and a processor, wherein the memory is configured to store one or more computer instructions, wherein the one or more computer instructions are executed by the processor to implement the method of any one of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored, the computer program being executable by a processor for implementing the method according to any one of claims 1-7.
CN202010519846.1A 2020-06-09 2020-06-09 Network precision quantification method and system Pending CN113778655A (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN202010519846.1A CN113778655A (en) 2020-06-09 2020-06-09 Network precision quantification method and system
PCT/CN2021/099198 WO2021249440A1 (en) 2020-06-09 2021-06-09 Network accuracy quantification method, system, and apparatus, electronic device, and readable medium
US17/760,023 US11783168B2 (en) 2020-06-09 2021-06-09 Network accuracy quantification method and system, device, electronic device and readable medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010519846.1A CN113778655A (en) 2020-06-09 2020-06-09 Network precision quantification method and system

Publications (1)

Publication Number Publication Date
CN113778655A true CN113778655A (en) 2021-12-10

Family

ID=78834459

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010519846.1A Pending CN113778655A (en) 2020-06-09 2020-06-09 Network precision quantification method and system

Country Status (3)

Country Link
US (1) US11783168B2 (en)
CN (1) CN113778655A (en)
WO (1) WO2021249440A1 (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104301259A (en) * 2014-10-13 2015-01-21 东南大学 Resource allocation method applicable to multi-hop wireless mesh network
US20160321776A1 (en) * 2014-06-20 2016-11-03 Tencent Technology (Shenzhen) Company Limited Model Parallel Processing Method and Apparatus Based on Multiple Graphic Processing Units
KR101710087B1 (en) * 2016-04-29 2017-02-24 국방과학연구소 Service resource allocation approach Method and System based on a successive knapsack algorithm with variable profits
US20180046894A1 (en) * 2016-08-12 2018-02-15 DeePhi Technology Co., Ltd. Method for optimizing an artificial neural network (ann)
CN108564168A (en) * 2018-04-03 2018-09-21 中国科学院计算技术研究所 A kind of design method to supporting more precision convolutional neural networks processors
CN108874542A (en) * 2018-06-07 2018-11-23 桂林电子科技大学 Kubernetes method for optimizing scheduling neural network based
CN109902807A (en) * 2019-02-27 2019-06-18 电子科技大学 A kind of hot modeling method of many-core chip distribution formula based on Recognition with Recurrent Neural Network
CN110674936A (en) * 2019-09-24 2020-01-10 上海寒武纪信息科技有限公司 Neural network processing method and device, computer equipment and storage medium
CN111226233A (en) * 2017-10-24 2020-06-02 国际商业机器公司 Facilitating neural network efficiency

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105843679B (en) * 2016-03-18 2018-11-02 西北工业大学 Adaptive many-core resource regulating method
KR20170128703A (en) * 2016-05-13 2017-11-23 한국전자통신연구원 Many-core system and operating method thereof
CN110348562B (en) 2019-06-19 2021-10-15 北京迈格威科技有限公司 Neural network quantization strategy determination method, image identification method and device
US11551054B2 (en) * 2019-08-27 2023-01-10 International Business Machines Corporation System-aware selective quantization for performance optimized distributed deep learning

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160321776A1 (en) * 2014-06-20 2016-11-03 Tencent Technology (Shenzhen) Company Limited Model Parallel Processing Method and Apparatus Based on Multiple Graphic Processing Units
CN104301259A (en) * 2014-10-13 2015-01-21 东南大学 Resource allocation method applicable to multi-hop wireless mesh network
KR101710087B1 (en) * 2016-04-29 2017-02-24 국방과학연구소 Service resource allocation approach Method and System based on a successive knapsack algorithm with variable profits
US20180046894A1 (en) * 2016-08-12 2018-02-15 DeePhi Technology Co., Ltd. Method for optimizing an artificial neural network (ann)
CN111226233A (en) * 2017-10-24 2020-06-02 国际商业机器公司 Facilitating neural network efficiency
CN108564168A (en) * 2018-04-03 2018-09-21 中国科学院计算技术研究所 A kind of design method to supporting more precision convolutional neural networks processors
CN108874542A (en) * 2018-06-07 2018-11-23 桂林电子科技大学 Kubernetes method for optimizing scheduling neural network based
CN109902807A (en) * 2019-02-27 2019-06-18 电子科技大学 A kind of hot modeling method of many-core chip distribution formula based on Recognition with Recurrent Neural Network
CN110674936A (en) * 2019-09-24 2020-01-10 上海寒武纪信息科技有限公司 Neural network processing method and device, computer equipment and storage medium

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
KUAN WANG ET AL: "HAQ: Hardware-Aware Automated Quantization with Mixed Precision", 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9 January 2020 (2020-01-09) *
MUSTAPHA NOURELFATH ET AL: "Quantized Hopfield Networks for Reliability Optimization", Reliability Engineering & System Safety, vol. 81, no. 2, 31 August 2003 (2003-08-31) *
FANG RONGQIANG ET AL: "Modeling Computational Features of Multi-layer Neural Network Algorithms" (多层神经网络算法的计算特征建模方法), Journal of Computer Research and Development (计算机研究与发展), no. 6, 30 June 2019 (2019-06-30) *
CAI RUICHU; ZHONG CHUNRONG; YU YANG; CHEN BINGFENG; LU YE; CHEN YAO: "Quantization and Compression Methods for Convolutional Neural Networks for 'Edge' Applications" (面向"边缘"应用的卷积神经网络量化与压缩方法), Journal of Computer Applications (计算机应用), no. 09, 23 April 2018 (2018-04-23) *

Also Published As

Publication number Publication date
US20230040375A1 (en) 2023-02-09
US11783168B2 (en) 2023-10-10
WO2021249440A1 (en) 2021-12-16

Similar Documents

Publication Publication Date Title
CN111444009B (en) Resource allocation method and device based on deep reinforcement learning
CN113055308B (en) Bandwidth scheduling method, traffic transmission method and related products
US20220083386A1 (en) Method and system for neural network execution distribution
CN103365720B (en) For dynamically adjusting the method and system of global Heap Allocation under multi-thread environment
CN109002358A (en) Mobile terminal software adaptive optimization dispatching method based on deeply study
CN110764885B (en) Method for splitting and unloading DNN tasks of multiple mobile devices
CN110231984B (en) Multi-workflow task allocation method and device, computer equipment and storage medium
CN114253735B (en) Task processing method and device and related equipment
CN110378529B (en) Data generation method and device, readable storage medium and electronic equipment
CN108173905A (en) A kind of resource allocation method, device and electronic equipment
CN115421930B (en) Task processing method, system, device, equipment and computer readable storage medium
CN111176840A (en) Distributed task allocation optimization method and device, storage medium and electronic device
CN113391824A (en) Computing offload method, electronic device, storage medium, and computer program product
CN112231117A (en) Cloud robot service selection method and system based on dynamic vector hybrid genetic algorithm
CN110167031B (en) Resource allocation method, equipment and storage medium for centralized base station
CN113986562A (en) Resource scheduling strategy generation method and device and terminal equipment
CN110826782B (en) Data processing method and device, readable storage medium and electronic equipment
CN109165729A (en) The dispatching method and system of neural network
CN117311998B (en) Large model deployment method and system
CN113778655A (en) Network precision quantification method and system
CN109746918B (en) Optimization method for delay of cloud robot system based on joint optimization
CN113673753A (en) Load regulation and control method and device for electric vehicle charging
CN116848508A (en) Scheduling tasks for computer execution based on reinforcement learning model
CN114584476A (en) Traffic prediction method, network training device and electronic equipment
CN110058941A (en) Task scheduling and managing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination