CN115730646A - Hybrid expert network optimization method based on partial quantization - Google Patents

Hybrid expert network optimization method based on partial quantization

Info

Publication number
CN115730646A
CN115730646A
Authority
CN
China
Prior art keywords
network
quantization
data
subnet
subnets
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211713009.8A
Other languages
Chinese (zh)
Inventor
赵继胜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Fudian Intelligent Technology Co ltd
Original Assignee
Shanghai Fudian Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Fudian Intelligent Technology Co ltd filed Critical Shanghai Fudian Intelligent Technology Co ltd
Priority to CN202211713009.8A priority Critical patent/CN115730646A/en
Publication of CN115730646A publication Critical patent/CN115730646A/en
Pending legal-status Critical Current

Landscapes

  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a hybrid expert network optimization method based on partial quantization, relating to the field of information technology, comprising the following steps: S1, selecting a data sample set and sampling the hybrid expert network; S2, establishing the correspondence between subnets and data sets, and selecting the high-frequency subnets and their corresponding data sets; S3, performing iterative quantization on the selected high-frequency subnets using the corresponding data sets. The method samples the data flow during inference in the hybrid expert network to obtain the correspondence between data sets and the subnets of the network, and then quantizes each subnet on its own corresponding data set, which reduces the computational burden of optimizing the whole network and improves its service throughput. Because each high-frequency subnet is quantized with its own corresponding data set, quantizing the entire network is avoided; the method is simple, efficient, and improves overall network performance.

Description

Hybrid expert network optimization method based on partial quantization
Technical Field
The invention relates to the technical field of information, in particular to a hybrid expert network optimization method based on partial quantization.
Background
A mixture of experts (Mixture-of-Experts, MoE), here called a hybrid expert network, is a technique for organizing a neural network in a sparse manner. It can integrate more network parameters while keeping the growth in computing-power demand limited, and can be regarded as a large number of relatively small-scale neural networks (such as fully connected networks or Transformers) sparsely connected together through an expert-selection switch. It can provide effective support for tasks such as complex object discrimination and can serve as a foundation-model service for city-scale artificial intelligence applications; continuous optimization of the hybrid expert network improves computational efficiency and throughput for high-performance intelligent applications.
Quantization is a way of compressing a network model: it approximates network weights or activation values represented with a high bit width (e.g. 32-bit floating point) using a lower bit width (e.g. 16-bit floating point, 8-bit integer, or even 2 bits). In terms of values, it discretizes a continuous range.
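As an illustration of the discretization described above (the function names and the symmetric per-tensor scheme are our own choices, not taken from the patent), uniform INT8 quantization of a list of floating-point weights can be sketched as:

```python
def quantize_int8(weights):
    """Uniform symmetric quantization of floats onto the INT8 grid [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0  # width of one discrete step
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover a floating-point approximation of the original weights."""
    return [v * scale for v in q]

w = [0.5, -1.27, 0.003]
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
# every reconstructed value lies within half a quantization step of the original
assert all(abs(a - b) <= s / 2 + 1e-9 for a, b in zip(w, w_hat))
```

Each original value is replaced by one of 255 discrete levels; the reconstruction error is bounded by half the step size, which is the trade-off the quantization threshold qt later guards against.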
In the prior art, modern smart-city systems increasingly rely on complex artificial intelligence models for discriminant analysis of spatial objects, and the computational pressure brought by large-scale neural networks has become a bottleneck for intelligent applications. Existing hybrid expert network optimization methods generally quantize the whole network: the computational pressure after deploying a large-scale neural network is high, the computing power consumed by the optimization process is large, and the approach suffers from a large computation amount, high cost, unbalanced load, low network throughput, and insufficient performance support.
Disclosure of Invention
In order to overcome the technical problems of large computation amount, high cost and low network throughput rate in the prior art, the invention provides a hybrid expert network optimization method based on partial quantization.
In order to realize this purpose, the invention adopts the following technical scheme:
a hybrid expert network optimization method based on partial quantization comprises the following steps:
s1, selecting a data sample set, and performing hybrid expert network sampling;
s2, establishing a corresponding relation between the subnets and the data sets, and selecting a high-frequency subnet and a corresponding data set;
and S3, carrying out iterative quantization processing on the selected high-frequency sub-network by using the corresponding data set.
Preferably, in S1, information is sampled for a control gateway of each layer in the hybrid expert network to obtain an execution path for reasoning on the data sample.
Preferably, in S1, the hybrid expert network is sampled to obtain correspondence between the high-frequency-use subnetworks and the execution path information, and the corresponding data sample set information.
Preferably, the step S1 includes the steps of:
s11, for a given hybrid expert network N, a data sample set D = {d0, d1 … dN};
s12, implanting a sampling code into the N;
s13, repeating the following steps for each sample di in the D:
s131, writing the ID number i of the di into a log file;
s132, calling N to carry out reasoning calculation on di, and writing the accessed subnet set EN into a log file through a sampling code.
Preferably, in S2, in the sampled log file the data express the correspondence between sample IDs and execution paths EN, where EN is composed of subnet IDs; the data pairs can therefore be disassembled and summarized to obtain the correspondence between subnet IDs and sample subsets Dk.
Preferably, the step S2 includes the steps of:
s21, for a given hybrid expert network N, obtaining sampling data PD from a data sample set D;
s22, summarizing PD to obtain n relation pairs (ENk, Dk);
s23, screening out the r subnets {EN0, EN1 … ENr} whose Dk is larger than a threshold t among the n relation pairs as candidate subnets for quantization processing.
Preferably, in S3, multiple subnets that have no context-dependent correlation are optimized simultaneously in parallel.
Preferably, the context-dependent processing further splits the original correspondence between subnet IDs and sample sets, using the execution path of data-sample inference, into a correspondence between execution paths EN and sample sets, with the execution path EN serving as context information; on this basis, the subnet is iteratively quantized with that sample set, yielding the context-dependent effect.
Preferably, in S3, through iterative quantization, an optimal quantization bit width configuration is found for a high-frequency usage expert subnet in the hybrid expert network.
Preferably, the step S3 includes the steps of:
s31, initialization: setting the data sample set Dr, selecting the high-frequency subnet ENr, setting the quantization threshold qt, setting the optimized network ENr' = ENr, and setting the current quantization configuration qc to QC1;
s32, judging whether qc is null: if qc is null, returning the current optimized network ENr'; otherwise, entering the next step;
s33, applying Dr to quantize ENr with the bit-width configuration qc, obtaining ENr1;
s34, evaluating with the quantization threshold qt whether the quality of ENr1 is compliant; if so, entering the next step; otherwise, returning the current optimized network ENr';
and s35, letting ENr' = ENr1, reducing the quantization bit width by selecting the next lower bit-width configuration as qc (null if none remains), and returning to s32.
Compared with the prior art, the invention has the advantages that:
the invention samples the data flow in the inference process of the hybrid expert network to obtain the corresponding relation between different data sets and different subnets in the hybrid expert network, and then carries out quantitative optimization on the different subnets on the data sets corresponding to the subnets, thereby reducing the calculation burden required by the overall optimization of the hybrid expert network. The optimized hybrid expert network can be processed in a calculation optimization mode in a high-frequency scene of a user, so that the service throughput rate of the whole network is improved.
The invention quantizes each subnet using the subnet-to-data correspondence obtained after sampling, i.e. using the data set corresponding to each high-frequency subnet; this avoids quantizing the whole network and the high computational overhead that whole-network quantization would bring. The method is simple and efficient and improves the performance of the whole network.
The invention locates the relation between subnets and data samples through network sampling, preparing the data for subsequent quantization processing; it then quantizes the subnets, applying quantization optimization to the high-frequency subnets. Quantization may be performed in a context-free, context-dependent or partially context-dependent manner, and multiple subnets without context-dependent relations may be quantized in parallel. Combining these methods achieves quantization optimization of a given hybrid expert network in its application environment. The advantages of quantization include: a reduced model size, since weight values are represented with low-bit-width data, lowering the memory footprint; reduced computational pressure, since high-bit-width floating-point computation is reduced to low-bit-width floating-point or even integer computation, greatly lowering computational cost; and, through the reduced computational overhead, lower power consumption together with improved throughput.
Drawings
In order to more clearly illustrate the embodiments of the present invention and the technical solutions in the prior art, the drawings used in describing them are briefly introduced below. The drawings in the following description are obviously only some embodiments of the present invention; those skilled in the art can obtain other drawings from the structures shown without creative effort.
FIG. 1 is a diagram of a standard hybrid expert network architecture according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the data flow path of hybrid expert network inference in accordance with an embodiment of the present invention;
FIG. 3 is a diagram illustrating context-free quantization according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating context dependent quantization according to an embodiment of the present invention;
fig. 5 is a flowchart illustrating an operation of an iterative quantization process according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.
In the description of the present invention, it is to be understood that the terms "center", "longitudinal", "lateral", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", etc. indicate orientations or positional relationships based on those shown in the drawings, and are used merely for convenience of description and for simplicity of description, and do not indicate or imply that the device or element so referred to must have a particular orientation, be constructed in a particular orientation, and be operated, and thus should not be considered as limiting the present invention.
Furthermore, the terms "first", "second" and "first" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or to implicitly indicate the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one of the feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
In the present invention, unless otherwise expressly stated or limited, the terms "mounted," "connected," "secured," and the like are to be construed broadly and can, for example, be fixedly connected, detachably connected, or integrally formed; may be mechanically coupled, may be electrically coupled or may be in communication with each other; they may be directly connected or indirectly connected through intervening media, or they may be connected internally or in any other suitable relationship, unless expressly stated otherwise. The specific meanings of the above terms in the present invention can be understood according to specific situations by those of ordinary skill in the art.
Referring to fig. 1-5, an embodiment of a hybrid expert network optimization method based on partial quantization according to the present invention includes the following steps:
s1, selecting a data sample set, and performing hybrid expert network sampling;
s2, establishing a corresponding relation between the subnets and the data sets, and selecting a high-frequency subnet and a corresponding data set;
and S3, carrying out iterative quantization processing on the selected high-frequency sub-network by using the corresponding data set.
In this embodiment, in S1, information sampling is performed on a control gateway of each layer in the hybrid expert network to obtain an execution path for reasoning on a data sample.
In this embodiment, in S1, the hybrid expert network is sampled to obtain a corresponding relationship between the high-frequency usage subnet and the execution path information, and the corresponding data sample set information.
In this embodiment, the step S1 includes the following steps:
s11, for a given hybrid expert network N, a data sample set D = {d0, d1 … dN};
s12, implanting a sampling code into the N;
s13, repeating the following steps for each sample di in the D:
s131, writing the ID number i of the di into a log file;
s132, calling N to carry out reasoning calculation on di, and writing the accessed subnet set EN into a log file through a sampling code.
In this embodiment, in S2, in the sampled log file the data represent the correspondence between sample IDs and execution paths EN, where EN is composed of subnet IDs; the data pairs can therefore be disassembled and summarized to obtain the correspondence between subnet IDs and sample subsets Dk.
In this embodiment, the step S2 includes the following steps:
s21, for a given hybrid expert network N, obtaining sampling data PD from a data sample set D;
s22, summarizing PD to obtain n relation pairs (ENk, Dk);
s23, screening out the r subnets {EN0, EN1 … ENr} whose Dk is larger than a threshold t among the n relation pairs as candidate subnets for quantization processing.
In this embodiment, in S3, multiple subnets that have no context-induced correlation are optimized simultaneously in parallel.
In this embodiment, the context-dependent processing further splits the original correspondence between subnet IDs and sample sets, using the execution path of data-sample inference, into a correspondence between execution paths EN and sample sets, with the execution path EN serving as context information; on this basis, the subnet is iteratively quantized with that sample set, yielding the context-dependent effect.
In this embodiment, in S3, through iterative quantization, an optimal quantization bit width configuration is found for a high-frequency usage expert subnet in a hybrid expert network.
In this embodiment, the step S3 includes the following steps:
s31, initialization: setting the data sample set Dr, selecting the high-frequency subnet ENr, setting the quantization threshold qt, setting the optimized network ENr' = ENr, and setting the current quantization configuration qc to QC1;
s32, judging whether qc is null: if qc is null, returning the current optimized network ENr'; otherwise, entering the next step;
s33, applying Dr to quantize ENr with the bit-width configuration qc, obtaining ENr1;
s34, evaluating with the quantization threshold qt whether the quality of ENr1 is compliant; if so, entering the next step; otherwise, returning the current optimized network ENr';
and s35, letting ENr' = ENr1, reducing the quantization bit width by selecting the next lower bit-width configuration as qc (null if none remains), and returning to s32.
In this embodiment, as shown in fig. 1, the network can be divided into L layers, each layer having N expert networks (i.e. subnets) scheduled by one gateway (gate): the gate routes the data flow to one or more of the expert networks. During inference only part of the network participates in the computation, so the network's learning capacity can be regarded as expanded while the computing-power requirement remains essentially unchanged. A typical hybrid expert network inference execution path is shown in fig. 2.
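The gate's routing decision described above can be sketched as follows; this is a toy softmax top-k gate of our own construction (the patent does not specify the gate's scoring function), shown only to make the "data flows to one or more experts" behavior concrete:

```python
import math

def gate(scores, k=1):
    """Toy MoE gate: softmax over per-expert scores, route to the top-k experts."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    # indices of the k most probable experts — only these subnets participate
    return sorted(range(len(probs)), key=lambda i: -probs[i])[:k]

# with 4 experts in a layer, this input activates only expert 2
chosen = gate([0.1, 0.3, 2.0, -1.0], k=1)
```

Only the chosen experts run, which is why inference touches a sparse execution path rather than the whole network.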
The hybrid expert network sampling is used to locate the execution path of a given data sample in the hybrid expert network, i.e. the set of subnets the network calls while reasoning over that data sample. As shown in fig. 2, a hybrid expert network is composed of L layers of expert networks; in each layer one expert subnet participates in the inference, and the execution path of the given data sample's inference in the figure comprises the subnets EN = {EL1E2, EL2E6, …, ELL-2E5, ELL-1E3, ELLE1} (where ELi denotes the i-th layer and Ej the j-th expert subnet of that layer).
The sampling of the hybrid expert network is obtained by sampling the output of the control gateway (gate) of each layer for each data sample. The structure of the hybrid expert network implies that the expert subnet each sample flows through is determined by the gate, so the expert subnets themselves need not be sampled; it suffices to record the gate's decision. Sampling is implemented by implanting a profiler code segment into the operator code of the gate unit; the sampling code writes the subnet id selected by the gate into a log file, expressed in pseudocode as follows:
moe_gate_i(data) {  // gateway of the i-th hybrid expert network layer; data is the input
    // select the expert subnet that should handle this input
    j = select_expert(data);
    write_log(j);   // sampling code: write the id of the chosen expert subnet into the log file
    // call the j-th expert subnet
    data' = net_inference(experts[j], data);
}
The process of the hybrid expert network sampling comprises the following steps:
s11, for a given hybrid expert network N, a data sample set D = {d0, d1 … dN};
s12, implanting a sampling code into the N;
s13, repeating the following steps for each sample di in the D:
s131, writing the ID number i of the di into a log file;
s132, calling N to carry out reasoning calculation on di, and writing the accessed subnet set EN into a log file through a sampling code.
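The sampling loop s11–s132 above can be sketched as follows. All names here are illustrative: `infer_with_paths` stands in for calling network N on a sample with the implanted sampling code, and the log is modeled as an in-memory list of records rather than a file:

```python
def sample_network(infer_with_paths, samples, log):
    """s13: for each sample, record its ID and the expert subnets it visits.

    infer_with_paths(d) stands in for inference of network N on sample d;
    it returns the set EN of subnet ids touched during inference (the values
    the implanted gate-level sampling code would write).
    """
    for i, d in enumerate(samples):
        log.append(("sample", i))        # s131: write the sample ID into the log
        en = infer_with_paths(d)         # s132: call N to reason over di
        log.append(("path", tuple(en)))  # the sampling code writes the visited set EN

log = []
# toy stand-in network: even-valued samples route to subnet 0, odd to subnet 1
sample_network(lambda d: {d % 2}, samples=[10, 11, 12], log=log)
# log now pairs each sample ID with its execution path
```

The resulting log alternates sample-ID records with path records, which is exactly the data-pair structure that S2 disassembles.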
Selecting the high-frequency subnets and their corresponding data sets: in the sampled log file, the data express the correspondence between sample IDs and execution paths EN, where EN is composed of subnet IDs, so the correspondences (data pairs) between sample IDs and subnet IDs can be disassembled. By summarizing the data pairs, the sample subset Dk corresponding to each subnet ID can be obtained, where Dk = {dk, dk+1 … dm}. Subnets corresponding to large sample subsets obviously belong to the high-frequency subnets and can be screened out for quantization. The steps for screening the high-frequency subnets for quantization processing are as follows:
s21, for a given hybrid expert network N, obtaining sampling data PD from a data sample set D;
s22, summarizing PD to obtain n relation pairs (ENk, Dk);
s23, screening out the r subnets {EN0, EN1 … ENr} whose Dk is larger than a threshold t among the n relation pairs as candidate subnets for quantization processing.
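A minimal sketch of s21–s23 under assumed data shapes (the pair format and threshold value are our own): group the logged (sample ID, execution path) pairs by path, then keep the paths whose sample subsets exceed the threshold t:

```python
from collections import defaultdict

def select_high_frequency(pairs, t):
    """pairs: iterable of (sample_id, execution_path) read from the sampling log.
    Returns {path: sample subset Dk} for the paths with more than t samples."""
    by_path = defaultdict(list)
    for sample_id, path in pairs:
        by_path[path].append(sample_id)   # s22: summarize into (ENk, Dk) relation pairs
    # s23: screen the candidates whose Dk exceeds the threshold t
    return {p: ids for p, ids in by_path.items() if len(ids) > t}

pairs = [(0, "E1"), (1, "E1"), (2, "E2"), (3, "E1")]
hot = select_high_frequency(pairs, t=2)
# only path "E1" (3 samples) survives the threshold t = 2
```

The surviving (path, sample subset) pairs are exactly the inputs the iterative quantization step consumes.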
Iterative quantization processing: after a subnet is selected for quantization, quantization optimization proceeds iteratively. The invention targets the typical precision bit widths adopted by deep neural networks: 32-bit floating point (FP32) is the starting bit width, followed from high to low by 16-bit brain floating point (BF16), 8-bit integer (INT8), 4-bit integer (INT4), 3-bit values and 2-bit values, giving 5 quantization options in total (denoted here as quantization configurations QC1, QC2, QC3, QC4 and QC5). Quantization quality is usually evaluated by setting a quantization threshold qt, i.e. the quantized network accuracy must not fall below qt.
For a given hybrid expert network N, a subnet ENr with corresponding data set Dr, and a given quantization threshold qt, the quantization configurations are applied in turn from high to low and validated against the quantization threshold. The quantized subnet optimized with the lowest quantization (bit-width) configuration whose quality is not below the threshold is taken as the final optimization result. The workflow of the iterative quantization processing is as follows:
s31, initialization: setting the data sample set Dr, selecting the high-frequency subnet ENr, setting the quantization threshold qt, setting the optimized network ENr' = ENr, and setting the current quantization configuration qc to QC1;
s32, judging whether qc is null: if qc is null, returning the current optimized network ENr'; otherwise, entering the next step;
s33, applying Dr to quantize ENr with the bit-width configuration qc, obtaining ENr1;
s34, evaluating with the quantization threshold qt whether the quality of ENr1 is compliant; if so, entering the next step; otherwise, returning the current optimized network ENr';
and s35, letting ENr' = ENr1, reducing the quantization bit width by selecting the next lower bit-width configuration as qc (null if none remains), and returning to s32.
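The s31–s35 loop can be sketched as follows. `quantize` and `accuracy` are placeholders for the real quantization routine and quality evaluation (the patent does not specify them), and the configuration ladder mirrors the QC1–QC5 options, with FP32 as the unquantized starting point:

```python
QCS = ["BF16", "INT8", "INT4", "INT3", "INT2"]  # QC1 … QC5, high to low bit width

def iterative_quantize(quantize, accuracy, dr, qt):
    """Quantize a subnet with progressively lower bit widths while the quantized
    quality, measured on data set Dr, stays at or above threshold qt."""
    best = None                       # s31: ENr' starts as the unmodified subnet
    for qc in QCS:                    # s32/s35: walk configurations from high to low
        candidate = quantize(qc, dr)          # s33: quantize ENr under configuration qc
        if accuracy(candidate, dr) < qt:      # s34: quality check against qt
            break                             # reject; keep the previous ENr'
        best = candidate              # s35: ENr' = ENr1, then try a lower bit width
    return best

# toy stand-in: quality degrades monotonically with lower bit widths
quality = {"BF16": 0.98, "INT8": 0.96, "INT4": 0.90, "INT3": 0.80, "INT2": 0.60}
chosen = iterative_quantize(lambda qc, d: qc, lambda c, d: quality[c], dr=None, qt=0.95)
# the lowest configuration still meeting qt = 0.95 is INT8
```

The loop returns the lowest-bit-width configuration whose accuracy stays at or above qt, matching the "lowest configuration not below the quantization threshold" rule in the text.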
Context-dependent processing in the hybrid expert network: a high-frequency subnet may lie on several execution paths at once. As shown in fig. 3, subnet ELL-1E3 is shared by 2 execution paths, so quantizing it affects the inference quality of both paths; equivalently, the subnet is influenced by the data samples of both paths, and the result of quantization optimization is a compromise between the data samples corresponding to the 2 paths. The context-dependent sampling method instead quantizes subnet ELL-1E3 separately with the samples of each of the 2 execution paths, producing two different optimized subnets ELL-1E31 and ELL-1E32. This avoids the compromise caused by the data samples of multiple execution paths and finds a better quantization scheme for each specific path.
The context-dependent processing further splits the original correspondence between subnet IDs and sample sets, using the execution path of data-sample inference, into a correspondence between execution paths EN and sample sets, with the execution path EN serving as context information (EN is already present in the sampling log, so the sampling method need not be modified). On this basis, iteratively quantizing the subnet with that sample set yields the context-dependent effect.
In parallel quantization of multiple subnets in a hybrid expert network, the context-dependent characteristics described above can be used to determine sets of data samples that do not overlap, each corresponding to a subnet to be optimized (because of context dependence, the same subnet may be optimized into different subnets in different contexts). The invention can therefore optimize, in parallel, multiple subnets that have no context-induced correlation, further improving optimization efficiency.
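Since (subnet, context) pairs with disjoint sample sets cannot interfere, the per-subnet optimizations can run concurrently. A sketch using a thread pool (the task tuple shape and worker are our own assumptions, not from the patent):

```python
from concurrent.futures import ThreadPoolExecutor

def optimize_in_parallel(optimize_one, tasks):
    """tasks: [(subnet_id, context, sample_set), ...] with pairwise-disjoint
    (subnet, context) pairs, so the optimizations cannot interfere."""
    with ThreadPoolExecutor() as pool:
        futures = {pool.submit(optimize_one, *t): (t[0], t[1]) for t in tasks}
        return {key: f.result() for f, key in futures.items()}

results = optimize_in_parallel(
    lambda sid, ctx, ds: f"{sid}@{ctx}",  # stand-in for per-subnet iterative quantization
    [("EN3", "pathA", [1, 2]), ("EN3", "pathB", [3])],  # same subnet, two contexts
)
# the same subnet is optimized separately per execution-path context
```

Keying results by (subnet, context) rather than by subnet alone reflects the point in the text that one subnet may yield different optimized variants under different contexts.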
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is specific and detailed, but not to be understood as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A hybrid expert network optimization method based on partial quantization is characterized by comprising the following steps:
s1, selecting a data sample set, and performing hybrid expert network sampling;
s2, establishing a corresponding relation between the subnets and the data sets, and selecting the high-frequency subnets and the corresponding data sets;
and S3, carrying out iterative quantization processing on the selected high-frequency sub-network by using the corresponding data set.
2. The method of claim 1, wherein in the step S1, information is sampled for a control gateway of each layer in the hybrid expert network to obtain an execution path for reasoning on data samples.
3. The method according to claim 2, wherein in S1, the hybrid expert network is sampled to obtain correspondence between high-frequency usage subnetworks and execution path information and corresponding data sample set information.
4. The method for optimizing a hybrid expert network based on partial quantization according to claim 3, wherein in S1, the following steps are included:
s11, for a given hybrid expert network N, a data sample set D = {d0, d1 … dN};
s12, implanting a sampling code into the N;
s13, repeating the following steps for each sample di in the D:
s131, writing the ID number i of the di into a log file;
s132, calling N to carry out reasoning calculation on di, and writing the accessed subnet set EN into a log file through a sampling code.
5. The method of claim 4, wherein in the S2, in the sampled log file, the data represents a correspondence between the sample ID and the execution path EN, wherein EN is composed of the subnet ID, so that the data pairs can be disassembled and summarized to obtain a correspondence between the subnet ID and the sample subset Dk.
6. The method of claim 5, wherein the step of S2 comprises the steps of:
s21, for a given hybrid expert network N, obtaining sampling data PD from a data sample set D;
s22, summarizing PD to obtain n relation pairs (ENk, Dk);
s23, screening out the r subnets {EN0, EN1 … ENr} whose Dk is larger than a threshold t among the n relation pairs as candidate subnets for quantization processing.
7. The method of claim 6, wherein in the step S3, the optimization is performed simultaneously in a parallel manner for a plurality of subnets without context-dependent correlation.
8. The method of claim 7, wherein the context-dependent processing further splits the original correspondence between subnet IDs and sample sets, using the execution path of data-sample inference, into a correspondence between execution paths EN and sample sets, with the execution path EN serving as context information; iteratively quantizing the subnet with that sample set on this basis yields the context-dependent effect.
9. The method according to claim 8, wherein in S3, through iterative quantization, an optimal quantization bit width configuration is found for a high-frequency usage expert subnet in the hybrid expert network.
10. The partial quantization-based hybrid expert network optimization method according to claim 9, characterized in that S3 comprises the following steps:
s31, initialization: setting the data sample set Dr, selecting the high-frequency subnet ENr, setting the quantization threshold qt, setting the optimized network ENr' = ENr, and setting the current quantization configuration qc to QC1;
s32, judging whether qc is null: if qc is null, returning the current optimized network ENr'; otherwise, entering the next step;
s33, applying Dr to quantize ENr with the bit-width configuration qc, obtaining ENr1;
s34, evaluating with the quantization threshold qt whether the quality of ENr1 is compliant; if so, entering the next step; otherwise, returning the current optimized network ENr';
and s35, letting ENr' = ENr1, reducing the quantization bit width by selecting the next lower bit-width configuration as qc (null if none remains), and returning to s32.
CN202211713009.8A 2022-12-30 2022-12-30 Hybrid expert network optimization method based on partial quantization Pending CN115730646A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211713009.8A CN115730646A (en) 2022-12-30 2022-12-30 Hybrid expert network optimization method based on partial quantization


Publications (1)

Publication Number Publication Date
CN115730646A 2023-03-03

Family

ID=85301901

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211713009.8A Pending CN115730646A (en) 2022-12-30 2022-12-30 Hybrid expert network optimization method based on partial quantization

Country Status (1)

Country Link
CN (1) CN115730646A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117972293A (en) * 2024-03-28 2024-05-03 北京思凌科半导体技术有限公司 Computing method, device, equipment and storage medium based on mixed expert model
CN117972293B (en) * 2024-03-28 2024-06-07 北京思凌科半导体技术有限公司 Computing method, device, equipment and storage medium based on mixed expert model

Similar Documents

Publication Publication Date Title
CN110826692B (en) Automatic model compression method, device, equipment and storage medium
CN112910811B (en) Blind modulation identification method and device under unknown noise level condition based on joint learning
CN109787699B (en) Wireless sensor network routing link state prediction method based on mixed depth model
Yuan et al. Evoq: Mixed precision quantization of dnns via sensitivity guided evolutionary search
CN112287986A (en) Image processing method, device and equipment and readable storage medium
CN114764577A (en) Lightweight modulation recognition model based on deep neural network and method thereof
CN115730646A (en) Hybrid expert network optimization method based on partial quantization
CN116362325A (en) Electric power image recognition model lightweight application method based on model compression
CN112766484A (en) Floating point neural network model quantization system and method
CN113902108A (en) Neural network acceleration hardware architecture and method for quantizing bit width dynamic selection
CN115792677A (en) Lithium ion battery life prediction method based on improved ELM
CN115828143A (en) Node classification method for realizing heterogeneous primitive path aggregation based on graph convolution and self-attention mechanism
CN111832817A (en) Small world echo state network time sequence prediction method based on MCP penalty function
CN112906883A (en) Hybrid precision quantization strategy determination method and system for deep neural network
CN112399177B (en) Video coding method, device, computer equipment and storage medium
CN117162357A (en) Forming optimization control method and system for carbon fiber composite material
Peter et al. Resource-efficient dnns for keyword spotting using neural architecture search and quantization
CN110830939B (en) Positioning method based on improved CPN-WLAN fingerprint positioning database
CN116660756A (en) Battery capacity attenuation curve generation method based on condition generation countermeasure network
CN113157453B (en) Task complexity-based high-energy-efficiency target detection task dynamic scheduling method
CN113033653B (en) Edge-cloud cooperative deep neural network model training method
CN115564987A (en) Training method and application of image classification model based on meta-learning
CN114444654A (en) NAS-oriented training-free neural network performance evaluation method, device and equipment
CN114118151A (en) Intelligent spectrum sensing method with environment adaptive capacity
CN113111308A (en) Symbolic regression method and system based on data-driven genetic programming algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20230707

Address after: 200433 8/F, Building 3, No.3 Bay Plaza, 323 Guoding Road, Yangpu District, Shanghai

Applicant after: SHANGHAI FUDIAN INTELLIGENT TECHNOLOGY Co.,Ltd.

Address before: 200433 8/F, Building 3, No.3 Bay Plaza, 323 Guoding Road, Yangpu District, Shanghai

Applicant before: Zhao Jisheng

Applicant before: SHANGHAI FUDIAN INTELLIGENT TECHNOLOGY Co.,Ltd.