CN116232581A - Method, device and hardware architecture for deploying hybrid precision - Google Patents

Method, device and hardware architecture for deploying hybrid precision

Info

Publication number
CN116232581A
CN116232581A (application number CN202310086701.0A)
Authority
CN
China
Prior art keywords
module
precision
hardware architecture
calculation
resource consumption
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310086701.0A
Other languages
Chinese (zh)
Inventor
赵健
魏琼
宋佩
周国鹏
蔡宗霖
马妍
吴运凯
严晓
赵恩海
冯洲武
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai MS Energy Storage Technology Co Ltd
Original Assignee
Shanghai MS Energy Storage Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai MS Energy Storage Technology Co Ltd
Priority to CN202310086701.0A
Publication of CN116232581A
Legal status: Pending

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 9/00: Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L 9/08: Key distribution or management, e.g. generation, sharing or updating, of cryptographic keys or passwords
    • H04L 9/0861: Generation of secret information including derivation or calculation of cryptographic keys or passwords
    • H04L 9/0869: Generation of secret information including derivation or calculation of cryptographic keys or passwords involving random numbers or seeds
    • H04L 9/0877: Generation of secret information including derivation or calculation of cryptographic keys or passwords using additional device, e.g. trusted platform module [TPM], smartcard, USB or hardware security module [HSM]
    • H04L 9/0894: Escrow, recovery or storing of secret information, e.g. secret key escrow or cryptographic key storage
    • H04L 9/0897: Escrow, recovery or storing of secret information involving additional devices, e.g. trusted platform module [TPM], smartcard or USB
    • H04L 2209/00: Additional information or applications relating to cryptographic mechanisms or cryptographic arrangements for secret or secure communication (H04L 9/00)
    • H04L 2209/12: Details relating to cryptographic hardware or logic circuitry
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Supply And Distribution Of Alternating Current (AREA)

Abstract

The invention provides a method, a device and a hardware architecture for deploying hybrid precision. The method comprises: determining pending computation modules in a hardware architecture in which an electrochemical model is deployed; performing buried-point (instrumentation) processing on the pending computation modules, and determining their resource consumption data while the hardware architecture runs the electrochemical model; taking, according to the resource consumption data, at least one pending computation module as a target computation module whose resource consumption is to be reduced, and determining an optimization scheme for the target computation module; and deploying the corresponding optimization scheme on the target computation module, ending the optimization when the computing performance of the optimized hardware architecture meets the requirement. With the method, device and hardware architecture provided by the embodiments of the invention, resource consumption can be reduced, the computation speed of the target computation module greatly improved, the running time shortened, the hardware deployment of the electrochemical model compressed, and the space occupied by the electrochemical model effectively reduced.

Description

Method, device and hardware architecture for deploying hybrid precision
Technical Field
The invention relates to the technical field of electrochemical models, in particular to a method, a device and a hardware architecture for deploying hybrid precision.
Background
Compared with an equivalent circuit model, a P2D (pseudo-two-dimensional) electrochemical model captures the changes inside the battery: the internal physical and chemical processes are modeled in time and space through a three-layer simplified model, giving an accurate model with high calculation precision, and it is therefore widely applied in the field of lithium battery diagnostics. However, the P2D electrochemical algorithm requires a large number of floating-point (FP) numerical calculations, and floating-point calculation consumes a large amount of resources in an FPGA (Field Programmable Gate Array), so the cost of deploying the P2D electrochemical algorithm in an FPGA is very high, and the calculation time is also long.
At present there is little research on the hardware deployment of P2D electrochemical solvers; algorithm hardware optimization is mainly carried out through model simplification, hardware pipelining and parallel computation. Precision optimization has been studied mostly for neural network models, but existing precision optimization methods cannot be used directly for P2D algorithm hardware deployment.
Disclosure of Invention
In order to solve the existing technical problems, the embodiment of the invention provides a method, a device and a hardware architecture for deploying hybrid precision.
In a first aspect, an embodiment of the present invention provides a method for deploying hybrid precision, including:
determining a pending computation module in a hardware architecture in which an electrochemical model is deployed, wherein the pending computation module is a computation module whose resource consumption exceeds a preset threshold;
performing buried-point (instrumentation) processing on the pending computation module, and determining resource consumption data of the pending computation module while the hardware architecture runs the electrochemical model;
taking, according to the resource consumption data, at least one pending computation module as a target computation module whose resource consumption is to be reduced, and determining an optimization scheme for the target computation module, the optimization scheme being a scheme that reduces the floating point precision of at least a partial region;
deploying a corresponding optimization scheme for the target computing module, and determining the computing performance of the optimized hardware architecture when the optimized hardware architecture runs the electrochemical model;
and ending the optimization under the condition that the calculation performance of the optimized hardware architecture meets the requirement.
In one possible implementation manner, the optimization scheme includes:
the first optimization scheme: reducing the floating point numbers of all regions of the target computation module to BF16 half-precision floating point numbers;
or, the second optimization scheme: reducing the floating point numbers of a partial region of the target computation module to BF16 half-precision floating point numbers, and keeping the remaining region unchanged;
or, the third optimization scheme: reducing the floating point numbers of a partial region of the target computation module to FP16 half-precision floating point numbers, and keeping the remaining region unchanged.
In one possible implementation, the method further includes:
in the case that the computing performance of the optimized hardware architecture does not meet the requirement, re-determining an optimization scheme with higher calculation precision;
deploying the re-determined optimization scheme on the target computation module, and determining the computing performance of the re-optimized hardware architecture while it runs the electrochemical model;
the calculation precision of the first optimization scheme is lower than that of the second optimization scheme, and the calculation precision of the second optimization scheme is lower than that of the third optimization scheme.
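The escalation order above (first scheme < second < third in calculation precision) can be read as a simple search loop. A minimal sketch; the scheme names and the accuracy-loss figures below are purely illustrative, not measurements from the patent:

```python
# Schemes ordered from lowest to highest calculation precision,
# matching the order stated above: first < second < third.
SCHEMES = ["all_bf16", "partial_bf16", "partial_fp16"]

# Hypothetical accuracy loss measured after deploying each scheme and
# re-running the electrochemical model (illustrative numbers only).
MEASURED_LOSS = {"all_bf16": 0.08, "partial_bf16": 0.03, "partial_fp16": 0.01}

def choose_scheme(max_accuracy_loss: float):
    # Try the cheapest (lowest-precision) scheme first; escalate until
    # the optimized architecture's performance meets the requirement.
    for scheme in SCHEMES:
        if MEASURED_LOSS[scheme] < max_accuracy_loss:
            return scheme
    return None  # no low-precision scheme acceptable; keep full precision

chosen = choose_scheme(0.05)  # -> "partial_bf16"
```

In practice the loss figures would come from running the electrochemical model on the re-deployed hardware, not from a lookup table.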
In one possible implementation, the method further includes:
determining an overflow calculation module in which data overflow occurs;
temporarily raising the floating point precision of at least a partial region of the overflow calculation module, and running the electrochemical model on the hardware architecture with temporarily raised precision to determine the overflow magnitude of the overflow calculation module;
and maintaining the floating point precision of the overflow calculation module in the case that its overflow magnitude meets the requirement.
In one possible implementation, temporarily raising the floating point precision of at least a partial region of the overflow calculation module includes:
temporarily raising the floating point numbers of the overflow calculation module to FP64 double-precision floating point numbers.
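One way to read this step: run the module at FP64 so that every intermediate value is representable, record the largest magnitude that occurs, and compare it against the range of the candidate low-precision format. A hypothetical sketch (the range constants are the standard IEEE-754 half-precision and bfloat16 maxima; everything else is illustrative):

```python
FP16_MAX = 65504.0        # largest finite FP16 value
BF16_MAX = 3.3895314e38   # largest finite BF16 value (approx.)

def overflow_magnitude(intermediate_values):
    # With the module temporarily run at FP64 precision, the largest
    # intermediate magnitude bounds the numerical range the module needs.
    return max(abs(v) for v in intermediate_values)

def range_fits(intermediate_values, fmt_max: float) -> bool:
    # True when the low-precision format's range covers the module.
    return overflow_magnitude(intermediate_values) <= fmt_max

samples = [1.5e4, -7.2e4, 3.0e2]         # illustrative FP64 intermediates
fp16_ok = range_fits(samples, FP16_MAX)  # 7.2e4 > 65504 -> False
bf16_ok = range_fits(samples, BF16_MAX)  # -> True
```

This also illustrates why BF16 (8 exponent bits, same range as FP32) tolerates overflow-prone modules better than FP16 (5 exponent bits).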
In one possible implementation, taking, according to the resource consumption data, at least one pending computation module as a target computation module whose resource consumption is to be reduced includes:
sorting the pending computation modules by their space resource consumption data, and taking the top n₁ modules with the largest space resource consumption as target computation modules;
or sorting the pending computation modules by their time resource consumption data, and taking the top n₂ modules with the largest time resource consumption as target computation modules;
or sorting the pending computation modules by a weighted sum of their space resource consumption data and time resource consumption data, and taking the top n₃ modules with the largest weighted sum as target computation modules;
wherein the resource consumption data comprises the space resource consumption data and/or the time resource consumption data.
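The three ranking options above can be sketched as a single selection function. The module names, consumption figures and default weights below are illustrative, not from the patent (space-only and time-only ranking correspond to weights (1, 0) and (0, 1)):

```python
def select_targets(modules, n, w_space=0.5, w_time=0.5):
    # Rank pending computation modules by a weighted sum of their
    # space and time resource consumption; return the top-n names.
    scored = sorted(
        modules,
        key=lambda m: w_space * m["space"] + w_time * m["time"],
        reverse=True,  # largest consumption first
    )
    return [m["name"] for m in scored[:n]]

# Hypothetical instrumentation results (fractions of total resources).
modules = [
    {"name": "matrix_solve", "space": 0.40, "time": 0.35},
    {"name": "newton_iter",  "space": 0.25, "time": 0.45},
    {"name": "param_update", "space": 0.10, "time": 0.05},
]
targets = select_targets(modules, n=2)  # -> ["matrix_solve", "newton_iter"]
```

Setting `w_space=0` or `w_time=0` recovers the pure time-based or space-based rankings of the first two options.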
In one possible implementation, the method further includes:
determining an error compensation model, wherein the error compensation model is used to perform error compensation on the results of the low-precision electrochemical model so as to output results matching the high-precision electrochemical model;
and inputting the result produced by the optimized hardware architecture when running the electrochemical model into the error compensation model, and taking the output of the error compensation model as the final result.
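The patent does not fix the form of the error compensation model. A minimal sketch, assuming a simple linear correction fitted offline from paired low-precision and high-precision results (the calibration data here is invented for illustration):

```python
def fit_linear_compensation(low_prec, high_prec):
    # Least-squares fit of high ≈ a * low + b: one possible error
    # compensation model mapping low-precision outputs toward the
    # high-precision model's outputs.
    n = len(low_prec)
    mx = sum(low_prec) / n
    my = sum(high_prec) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(low_prec, high_prec))
         / sum((x - mx) ** 2 for x in low_prec))
    b = my - a * mx
    return lambda x: a * x + b

# Calibrate on paired results, then correct new low-precision outputs.
compensate = fit_linear_compensation([1.0, 2.0, 3.0], [2.1, 4.1, 6.1])
corrected = compensate(4.0)  # ~= 8.1
```

A richer compensation model (polynomial, lookup table, or a small neural network) could be substituted without changing the surrounding flow.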
In one possible implementation, the determining the resource consumption data of the pending computation module while the hardware architecture runs the electrochemical model includes:
determining resource consumption data of the pending computation module and original computation accuracy of the pending computation module when the hardware architecture runs the electrochemical model;
the method further comprises the steps of:
and under the condition that the difference between the original calculation precision of the target calculation module and the calculation precision of the optimized target calculation module is smaller than a preset difference, determining that the calculation performance of the optimized hardware architecture meets the requirement.
In a second aspect, an embodiment of the present invention further provides an apparatus for deploying hybrid precision, including:
a first determining module, used for determining a pending computation module in a hardware architecture in which an electrochemical model is deployed, wherein the pending computation module is a computation module whose resource consumption exceeds a preset threshold;
a buried-point module, used for performing buried-point processing on the pending computation module and determining resource consumption data of the pending computation module while the hardware architecture runs the electrochemical model;
a second determining module, used for taking, according to the resource consumption data, at least one pending computation module as a target computation module whose resource consumption is to be reduced, and determining an optimization scheme for the target computation module, the optimization scheme being a scheme that reduces the floating point precision of at least a partial region;
the optimization module is used for deploying a corresponding optimization scheme for the target calculation module and determining the calculation performance of the optimized hardware architecture when the optimized hardware architecture runs the electrochemical model;
and the ending module is used for ending the optimization under the condition that the calculation performance of the optimized hardware architecture meets the requirement.
In a third aspect, an embodiment of the present invention provides a hardware architecture, where the hardware architecture is an optimized hardware architecture obtained based on the method for deploying hybrid precision according to the first aspect.
In a fourth aspect, an embodiment of the present invention provides an apparatus for deploying hybrid precision, including a processor and a memory, where the memory stores a computer program executed by the processor; the computer program, when executed by the processor, implements the method for deploying hybrid precision according to the first aspect.
In a fifth aspect, an embodiment of the present invention further provides a computer readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the method for deploying hybrid precision according to the first aspect.
According to the method, device and hardware architecture for deploying hybrid precision disclosed by the embodiments of the invention, the pending computation modules with larger resource consumption are instrumented (buried-point processed) to select the target computation module to be optimized, and the floating point precision of the target computation module is reduced, thereby optimizing the hardware architecture as a whole. In the optimized hardware architecture, at least a partial region of the target computation module uses low-precision floating point numbers while the other computation modules still use high-precision floating point numbers, i.e. a mixed-precision deployment strategy. Reducing the floating point precision of the target computation module reduces resource consumption, greatly improves the computation speed of the target computation module, and shortens the running time; in addition, the low-precision target computation module requires less hardware, so the hardware deployment of the electrochemical model can be compressed and the space it occupies effectively reduced. The optimized hardware architecture has a shorter overall running time and lower resource consumption, which reduces cost, improves the feasibility of deploying the electrochemical model in hardware in practical engineering applications, and offers a better cost-performance ratio.
Drawings
In order to describe the embodiments of the present invention or the technical solutions in the background art more clearly, the drawings required by the embodiments of the present invention or the background art are briefly introduced below.
FIG. 1 illustrates a flow chart of a method of deploying hybrid accuracy provided by an embodiment of the invention;
FIG. 2 is a schematic diagram showing a comparison between before and after optimization in the method for deploying hybrid accuracy according to the embodiment of the present invention;
FIG. 3 is a schematic structural diagram of an apparatus for deploying hybrid accuracy according to an embodiment of the present invention;
fig. 4 shows a schematic structural diagram of an apparatus for deploying hybrid precision according to an embodiment of the present invention.
Detailed Description
A floating point number typically consists of three parts: a sign bit, exponent bits and mantissa bits. The sign bit is generally 1 bit, while the exponent and mantissa may each occupy several bits. For convenience of description, the floating point formats used below are first described.
FP64: a 64-bit, double-precision floating point number; its exponent is 11 bits and its mantissa is 52 bits.
FP32: a 32-bit, single-precision floating point number; its exponent is 8 bits and its mantissa is 23 bits.
FP16: a 16-bit, half-precision floating point number; its exponent is 5 bits and its mantissa is 10 bits.
BF16: a 16-bit, half-precision floating point number; its exponent is 8 bits and its mantissa is 7 bits. BF16 is obtained by truncating an FP32 single-precision floating point number.
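The truncation relationship between FP32 and BF16 described above can be demonstrated directly: BF16 is simply the top 16 bits of the FP32 bit pattern (1 sign + 8 exponent + 7 mantissa bits). A small Python sketch, not part of the patent:

```python
import struct

def fp32_to_bf16_bits(x: float) -> int:
    # Reinterpret the value as its 32-bit IEEE-754 pattern, then keep
    # only the high 16 bits: 1 sign + 8 exponent + 7 mantissa bits.
    bits32 = struct.unpack("<I", struct.pack("<f", x))[0]
    return bits32 >> 16

def bf16_bits_to_fp32(bits16: int) -> float:
    # Widen BF16 back to FP32 by zero-filling the dropped mantissa bits.
    return struct.unpack("<f", struct.pack("<I", bits16 << 16))[0]

# Truncation keeps only 7 mantissa bits, so pi loses its low digits:
approx = bf16_bits_to_fp32(fp32_to_bf16_bits(3.1415927))  # -> 3.140625
```

Because BF16 keeps all 8 FP32 exponent bits, it preserves FP32's numerical range while giving up mantissa precision, which is why the optimization schemes below can often lower precision without overflow.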
Embodiments of the present invention will be described below with reference to the accompanying drawings in the embodiments of the present invention.
Fig. 1 shows a flowchart of a method for deploying hybrid accuracy according to an embodiment of the invention. As shown in fig. 1, the method includes:
step 101: and determining a pending calculation module in the hardware architecture deployed with the electrochemical model, wherein the pending calculation module is a calculation module with consumed resources larger than a preset threshold value.
In the embodiment of the invention, the electrochemical model is deployed in the hardware architecture, and the electrochemical model is run by running the hardware architecture. For example, the hardware architecture may include an FPGA, with a P2D electrochemical model of a lithium battery deployed in the FPGA. The hardware architecture comprises a plurality of computation modules that together implement the electrochemical model; for example, it may include a liquid-phase computation module, a solid-phase computation module, a decoupling computation module, etc., or the modules may be divided according to the functions they implement, for example a matrix computation module, a numerical iteration module, etc.
Each computation module consumes certain resources when implementing its function, and the more complex the computation, the larger the resources consumed; for example, a multiplication module typically consumes more resources than an addition module. In the embodiment of the invention, several computation modules with larger resource consumption, i.e. computation modules whose resource consumption exceeds a preset threshold, are selected in advance and called pending computation modules. The resources consumed by each computation module may be measured directly, and the pending computation modules whose consumption exceeds the preset threshold determined from the measurements; alternatively, the pending computation modules may be selected based on human experience, which corresponds to setting a smaller preset threshold; alternatively, the resources consumed by each computation module may be estimated from official reference materials, such as the per-module resource figures provided by the device vendor.
All computation modules whose resource consumption exceeds the preset threshold may be taken as pending computation modules, or only some of them; for example, the computation modules may be sorted by resource consumption and the top 10% taken as pending computation modules. For example, matrix operations, numerical iteration and repeated parameter updates may cause excessive resource consumption, and the computation modules corresponding to these steps can be taken as pending computation modules.
Step 102: and carrying out buried point processing on the undetermined calculation module, and determining resource consumption data of the undetermined calculation module when the electrochemical model is operated by the hardware architecture.
In the embodiment of the invention, after the resource-heavy pending computation modules are determined, buried-point (instrumentation) processing is performed on them so that the resources they consume while the electrochemical model runs, i.e. the resource consumption data, can be determined. An instrumented pending computation module can be encapsulated as an independent module, so that the resource-analysis and power-analysis functions of the design software can later be fully utilized and the analysis completed quickly, i.e. the resource consumption data of the pending computation module can be determined quickly.
When the hardware architecture runs the deployed electrochemical model, each computation module executes its computation and consumes the corresponding resources. In the embodiment of the invention, the resources consumed by the pending computation module are called resource consumption data. Optionally, the resource consumption data may comprise space resource consumption data, for example the utilization of storage units such as BRAM and LUTRAM; and/or time resource consumption data, for example the running time of the pending computation module.
Step 103: taking at least one undetermined computing module as a target computing module needing to reduce consumed resources according to the resource consumption data, and determining an optimization scheme of the target computing module; the optimization scheme is a scheme for reducing the floating point number precision of at least part of the region.
In the embodiment of the invention, at least one calculation module to be optimized, namely a target calculation module, is selected from the to-be-determined calculation modules; in the embodiment of the invention, if the resources consumed by the undetermined computing module are too large, the undetermined computing module can be used as the target computing module needing to be optimized, so that the consumed resources can be reduced. Specifically, according to the resource consumption data of the pending computation modules, which pending computation modules are selected as target computation modules; the larger the resource consumption data of the undetermined computing module is, the greater the possibility that the undetermined computing module is used as a target computing module is; for example, a pending calculation module with the largest resource consumption data may be regarded as the target calculation module.
In addition, the embodiment of the invention realizes the optimization of the target calculation module by reducing the floating point number precision of at least part of the area of the target calculation module; if a plurality of target computing modules exist, the optimization scheme of each target computing module can be determined respectively. For example, the floating point number precision of the whole area of the target computing module can be reduced, and the floating point number precision of a partial area (for example, an area needing to consume more resources in the computing process) of the target computing module can be reduced, which can be specific based on actual conditions.
Step 104: and deploying a corresponding optimization scheme for the target computing module, and determining the computing performance of the optimized hardware architecture when the optimized hardware architecture runs the electrochemical model.
Step 105: and under the condition that the calculation performance of the optimized hardware architecture meets the requirement, ending the optimization.
In the embodiment of the invention, after the optimization scheme of the target computation module is determined, the corresponding optimization scheme can be deployed on the target computation module, that is, the floating point numbers in the target computation module are optimized, and finally the whole hardware architecture is optimized. After optimization, the electrochemical model is re-run in the (now optimized) hardware architecture.
At this time the electrochemical model in the hardware architecture is also an optimized electrochemical model; because the optimization scheme reduces floating point precision, the optimized electrochemical model is a low-precision electrochemical model, and correspondingly the electrochemical model before optimization (such as that in the hardware architecture of step 102) is a high-precision electrochemical model. After optimization the calculation precision of the hardware architecture is reduced, but the calculation speed can be improved. In the embodiment of the invention, running the electrochemical model on the optimized hardware architecture makes it possible to determine the computing performance of the optimized hardware architecture, which may include, for example, calculation precision and calculation speed.
If the computing performance of the optimized hardware architecture meets the requirement, it indicates that optimizing the original hardware architecture (such as the one in steps 101 and 102) with this optimization scheme yields a relatively better hardware architecture. In general, when floating point precision is reduced, the resources consumed by the corresponding computation modules decrease; moreover, if the computing performance includes calculation precision and the precision difference between the hardware architectures before and after optimization is small (that is, the computing performance meets the requirement), the optimized hardware architecture obtains a result with little loss of precision using fewer resources and is therefore usable; at this point the optimization can end and the optimized hardware architecture can be deployed directly. Conversely, if the precision difference before and after optimization is large, it is undesirable to gain calculation speed at the cost of losing so much calculation precision; the computing performance of the optimized hardware architecture does not meet the requirement, and the optimized hardware architecture must be discarded. For example, another optimization scheme may be used to re-determine the optimized hardware architecture.
According to the method for deploying hybrid precision disclosed by the embodiment of the invention, the pending computation modules with larger resource consumption are instrumented to select the target computation module to be optimized, and the floating point precision of the target computation module is reduced, thereby optimizing the hardware architecture as a whole. In the optimized hardware architecture, at least a partial region of the target computation module uses low-precision floating point numbers while the other computation modules still use high-precision floating point numbers, i.e. a mixed-precision deployment strategy. Reducing the floating point precision of the target computation module reduces resource consumption, greatly improves its computation speed, and shortens the running time; in addition, the low-precision target computation module requires less hardware, so the hardware deployment of the electrochemical model can be compressed and the space it occupies effectively reduced. The optimized hardware architecture has a shorter overall running time and lower resource consumption, which reduces cost, improves the feasibility of deploying the electrochemical model in hardware in practical engineering applications, and offers a better cost-performance ratio.
Optionally, the optimization scheme provided by the embodiment of the invention may include: the first optimization scheme, the second optimization scheme, or the third optimization scheme.
Specifically, the first optimization scheme reduces the floating point numbers of all regions of the target computation module to BF16 half-precision floating point numbers. The second optimization scheme reduces the floating point numbers of a partial region of the target computation module to BF16 half-precision floating point numbers and keeps the remaining region unchanged. The third optimization scheme reduces the floating point numbers of a partial region of the target computation module to FP16 half-precision floating point numbers and keeps the remaining region unchanged.
In the embodiment of the invention, a computing module in the hardware architecture generally adopts 32-bit floating-point numbers, such as FP32 single-precision floating-point numbers. When a target computing module needs to be optimized, the first optimization scheme can be adopted, that is, the floating-point numbers of all areas in the target computing module are reduced from FP32 single precision to BF16 half precision.
Alternatively, when the target computing module needs to be optimized, the second optimization scheme may be tried, that is, the floating-point numbers of a partial area in the target computing module are reduced from FP32 single precision to BF16 half precision, while the remaining area is kept unchanged, still adopting FP32 single-precision floating-point numbers. In general, the partial area whose floating-point precision is reduced consumes more resources than the area whose precision is kept. For example, if the target computing module is a matrix calculation module, its calculation process involves multiplication and addition operations; since multiplication generally requires more resources than addition, the hardware area corresponding to multiplication may be reduced to BF16 half-precision floating-point numbers, while the hardware area corresponding to addition still adopts FP32 single-precision floating-point numbers.
Alternatively, when the target computing module needs to be optimized, the third optimization scheme may be tried, that is, the floating-point numbers of a partial area in the target computing module are reduced from FP32 single precision to FP16 half precision, while the remaining area is kept unchanged, still adopting FP32 single-precision floating-point numbers. Compared with BF16, FP16 has a smaller numerical range but higher accuracy; therefore, the third optimization scheme has higher accuracy than the second optimization scheme described above.
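The range/accuracy trade-off between BF16 and FP16 described above can be illustrated with a small sketch. The `to_bf16` helper below is illustrative only: it simulates BF16 by zeroing the low 16 bits of an FP32 value (truncation rounding; real hardware typically rounds to nearest even).

```python
import numpy as np

def to_bf16(x):
    """Simulate BF16 by zeroing the low 16 bits of an FP32 value,
    keeping the 8-bit exponent but only 7 mantissa bits."""
    bits = np.float32(x).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

# FP16 keeps 10 mantissa bits vs. BF16's 7, so it is more accurate:
print(float(to_bf16(1.234)))       # 1.2265625 (BF16, coarser)
print(float(np.float16(1.234)))    # 1.234375  (FP16, closer to 1.234)

# But FP16's 5-bit exponent overflows where BF16 does not:
print(float(np.float16(1e10)))     # inf (FP16 max is ~65504)
print(float(to_bf16(1e10)))        # finite (BF16 inherits FP32's range)
```

This is why the patent's second scheme (BF16) risks less overflow while the third scheme (FP16) recovers more accuracy for values that stay in range.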
Taking the multiplication module officially provided by Xilinx as an example of the calculation module, the Look-Up Table (LUT) resources and Flip-Flop (FF) resources consumed at different floating-point precisions are shown in Table 1 below.
TABLE 1

Calculation module                          | LUT resources | FF resources
32-bit floating-point multiplication module | 697           | 37
BF16 floating-point multiplication module   | 137           | 21
FP16 floating-point multiplication module   | 195           | 21
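The figures in Table 1 translate into substantial relative savings, which a quick calculation makes concrete (percentages computed from the table, not stated in the patent):

```python
# LUT/FF usage from Table 1 (Xilinx multiplier module at each precision).
modules = {"FP32": (697, 37), "BF16": (137, 21), "FP16": (195, 21)}

fp32_lut, fp32_ff = modules["FP32"]
for name, (lut, ff) in modules.items():
    lut_saving = 100 * (1 - lut / fp32_lut)
    print(f"{name}: LUT={lut}, FF={ff}, LUT saving vs FP32 = {lut_saving:.1f}%")
```

The BF16 multiplier uses about 80% fewer LUTs than the FP32 one, and the FP16 multiplier about 72% fewer, which is why reducing precision on the heaviest modules shrinks the overall deployment so effectively.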
Alternatively, as shown in step 105 above, if the computing performance of the optimized hardware architecture meets the requirement, the optimization may end. If it does not meet the requirement, the embodiment of the invention updates the optimization scheme. Specifically, the method further comprises the following steps A1-A2:
Step A1: under the condition that the computing performance of the optimized hardware architecture does not meet the requirement, redetermine an optimization scheme with higher computation precision. The computation precision of the first optimization scheme is lower than that of the second optimization scheme, and the computation precision of the second optimization scheme is lower than that of the third.
Step A2: deploy the redetermined optimization scheme on the target computing module, and determine the computing performance of the re-optimized hardware architecture when it runs the electrochemical model.
In the embodiment of the invention, the computing performance may include computation precision, and whether the computing performance of the optimized hardware architecture meets the requirement is generally determined from the change in computation precision before and after optimization. For example, the above step 102 of determining the resource consumption data of the pending computing modules when the hardware architecture runs the electrochemical model may include: determining both the resource consumption data of the pending computing modules and their original computation precision; that is, the computation precision of a pending computing module (its original computation precision) is determined at the same time as its resource consumption data. Furthermore, the method comprises: under the condition that the difference between the original computation precision of the target computing module and the computation precision of the optimized target computing module is smaller than a preset difference, determining that the computing performance of the optimized hardware architecture meets the requirement. Since the target computing module is one of the pending computing modules, its original computation precision can be determined in step 102, and the computation precision of the optimized target computing module can likewise be determined.
If the difference between the two computation precisions is smaller than the preset difference, the current optimization scheme is suitable for the target computing module, the post-optimization computation precision is effectively ensured, and the overall computing performance of the optimized hardware architecture meets the requirement. Conversely, if the difference is greater than the preset difference, the above step A1 may be executed, that is, an optimization scheme with higher computation precision is redetermined.
Based on the content of the three optimization schemes, ordering them from low to high computation precision yields: the first optimization scheme, the second optimization scheme, the third optimization scheme. Therefore, when a target computing module needs to be optimized, the first optimization scheme can be adopted first; if the first optimization scheme is not feasible, the second optimization scheme is adopted; if the second optimization scheme is not feasible, the third optimization scheme is adopted.
For example, when the FP32 single-precision floating-point numbers adopted by a certain target computing module need to be optimized, the first optimization scheme is used first, that is, all hardware areas of the target computing module are reduced to BF16 half-precision floating-point numbers, and it is then judged whether the computation precision of the target computing module optimized by the first scheme meets the requirement. If not, the second optimization scheme is adopted, that is, only part of the hardware area of the target computing module is reduced to BF16 half-precision floating-point numbers, while the remaining hardware area keeps FP32 single-precision floating-point numbers. Similarly, if the computation precision of the optimized target computing module still does not meet the requirement, the hardware area whose precision is to be reduced adopts FP16 half-precision floating-point numbers instead of BF16, thereby improving the computation precision of the optimized target computing module.
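The escalation described above can be sketched as a simple loop. This is an illustrative sketch only; `deploy_and_measure` is a hypothetical stand-in for actually synthesizing the design and running the electrochemical model on it, and the scheme names and tolerance are made up.

```python
# Schemes in order of increasing computation precision (steps A1-A2):
# escalate until the accuracy loss relative to the original module is
# within tolerance.
SCHEMES = ["all-BF16", "partial-BF16", "partial-FP16"]

def optimize_module(original_accuracy, deploy_and_measure, max_diff):
    for scheme in SCHEMES:
        optimized_accuracy = deploy_and_measure(scheme)
        if abs(original_accuracy - optimized_accuracy) < max_diff:
            return scheme  # performance requirement met; stop escalating
    return None  # no scheme met the requirement; keep full precision

# Toy measurement: each scheme recovers a bit more accuracy.
accuracies = {"all-BF16": 0.91, "partial-BF16": 0.97, "partial-FP16": 0.995}
chosen = optimize_module(1.0, accuracies.__getitem__, max_diff=0.01)
print(chosen)  # partial-FP16
```

The loop encodes the patent's ordering directly: the cheapest scheme is tried first, and precision is only added back where the acceptance check fails.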
Optionally, there may be a data overflow situation for some computing modules, for which, after determining the resource consumption data of the pending computing module when the hardware architecture runs the electrochemical model in step 102, the embodiment of the present invention performs the following operations shown in steps B1-B3:
step B1: an overflow computation module that determines that there is a data overflow.
Step B2: and temporarily improving the floating point number precision of at least part of the area of the overflow calculation module, and determining the overflow magnitude of the overflow calculation module by operating the electrochemical model through a hardware architecture with temporarily improved precision.
Step B3: and under the condition that the overflow magnitude of the overflow calculation module meets the requirement, the floating point number precision of the overflow calculation module is maintained.
In the embodiment of the invention, an officially provided module can be used that reports exception flags: if a result is erroneous, the interface outputs a signal indicating the error; for example, the computing module may encounter invalid data, data overflow, data underflow, and so on. Computing modules in which data overflow (overflow or underflow) exists are referred to in the embodiment of the invention as overflow computing modules. When the hardware architecture is optimized, the floating-point precision of part or all of the hardware area of an overflow computing module is raised, for example, temporarily raised to FP64 double-precision floating-point numbers. After the precision is raised, the overflow computing module can basically operate normally, so the overflow magnitude of the overflow computing module before the precision was raised can be determined. The overflow magnitude represents the severity of the data overflow: the higher it is, the more serious the overflow, the more the subsequent computation precision is affected, and the overflow computing module needs to be deployed with FP64 double-precision floating-point numbers. Conversely, if the overflow magnitude is lower than a preset magnitude, the overflow magnitude of the overflow computing module can be considered to meet the requirement, that is, the module does not overflow excessively when adopting lower-precision floating-point numbers (such as FP32 single precision), and its floating-point precision can then be kept unchanged, that is, the initial precision such as FP32 single precision is still adopted.
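The idea of measuring overflow at temporarily raised precision can be sketched as follows. This is illustrative only: the function name and the definition of "overflow magnitude" as a count of out-of-range intermediates are assumptions, since the patent leaves the exact metric open.

```python
import numpy as np

def overflow_magnitude(values, low_dtype=np.float32):
    """Count how many intermediate results exceed the low-precision range.
    `values` are the results obtained at temporarily raised precision
    (e.g. FP64); comparing them against the low-precision limits gives
    an overflow magnitude used to decide whether FP64 must be kept."""
    limit = np.finfo(low_dtype).max
    vals = np.asarray(values, dtype=np.float64)
    return int((np.abs(vals) > limit).sum())

# An FP64 run shows one intermediate exceeding the FP32 range (~3.4e38):
results_fp64 = [1.5e30, -2.0e38, 7.0e38]
print(overflow_magnitude(results_fp64))  # 1
```

If the count (or whatever magnitude metric is chosen) stays below the preset threshold, the module keeps its original FP32 precision; otherwise it is deployed with FP64 as described above.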
Alternatively, the resource consumption data may include space resource consumption data and/or time resource consumption data, and the above step 103 of using at least one pending computing module as a target computing module that needs to reduce consumed resources according to the resource consumption data includes the following step C1, step C2, or step C3.
Step C1: ordering according to the space resource consumption data of the undetermined computing module, and leading n with more space resource consumption data 1 The pending computation modules serve as target computation modules.
Step C2: ordering according to the time resource consumption data of the undetermined computing module, and leading n with more time resource consumption data 2 The undetermined computing modules are used as target computing modules;
step C3: ordering according to the weighted sum of the space resource consumption data and the time resource consumption data of the undetermined computing module, and leading n with larger weighted sum 3 The pending computation modules serve as target computation modules.
In the embodiment of the invention, the pending computing modules can be sorted separately by space resource consumption data (such as space occupancy) and by time resource consumption data (such as running duration), and the top-ranked pending computing modules with the largest resource consumption are taken as target computing modules. Alternatively, the space and time resource consumption data may be weighted; for example, after normalization each may be weighted at 50%, giving a weighted sum of the space and time resource consumption data, and the pending computing modules with the larger weighted sums are used as target computing modules. Here n1, n2, and n3 may be the same or different; this embodiment does not limit them.
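The weighted-sum selection of step C3 can be sketched as below. The module names, resource figures, and the max-normalization are illustrative assumptions; the patent only specifies normalizing and weighting (e.g. 50/50) the two consumption measures.

```python
def pick_targets(modules, n, w_space=0.5, w_time=0.5):
    """Rank pending modules by a weighted sum of normalized space and
    time resource consumption; return the top-n as optimization targets."""
    max_s = max(m["space"] for m in modules.values())
    max_t = max(m["time"] for m in modules.values())
    score = {name: w_space * m["space"] / max_s + w_time * m["time"] / max_t
             for name, m in modules.items()}
    return sorted(score, key=score.get, reverse=True)[:n]

# Hypothetical profiling results (LUT count, running time in microseconds):
pending = {
    "matrix_mul": {"space": 697, "time": 120},
    "exp_solver": {"space": 310, "time": 300},
    "interp":     {"space": 150, "time": 40},
}
print(pick_targets(pending, n=2))  # ['exp_solver', 'matrix_mul']
```

Steps C1 and C2 are the degenerate cases of the same ranking with weights (1, 0) and (0, 1) respectively.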
Optionally, since optimizing the hardware architecture lowers the precision, in order to improve accuracy the method further comprises an error compensation process comprising the following steps D1-D2.
Step D1: and determining an error compensation model, wherein the error compensation model is used for carrying out error compensation on the result of the low-precision electrochemical model and outputting the result of the high-precision electrochemical model.
Step D2: and inputting the result of the optimized hardware architecture when the electrochemical model is operated into an error compensation model, and taking the output result of the error compensation model as a final result.
In the embodiment of the invention, an error compensation model is trained in advance, and the error compensation model can be a neural network model, such as a BP neural network. In the training process, the result of the low-precision electrochemical model is taken as input, the result of the high-precision electrochemical model is taken as output, and the error compensation model is trained, so that a model capable of performing error compensation on the result of the low-precision electrochemical model, namely an error compensation model, is obtained. Wherein "low accuracy" and "high accuracy" in step D1 are relative, i.e. the high accuracy electrochemical model is more accurate than the low accuracy electrochemical model, and do not mean that the accuracy only reaches a certain value.
Specifically, the high-precision electrochemical model is an electrochemical model before optimization, such as an electrochemical model deployed in a hardware architecture shown in steps 101 and 102, and the result of the high-precision electrochemical model can be obtained by running the electrochemical model in the hardware architecture. The low-precision electrochemical model is an optimized electrochemical model, such as the electrochemical model deployed in the optimized hardware architecture shown in step 104, and the result of the low-precision electrochemical model can be obtained by running the electrochemical model in the optimized hardware architecture.
For example, as shown in fig. 2, taking the deployment of a P2D electrochemical model (P2D model for short) in a hardware architecture as an example: before optimization, a high-precision P2D model is deployed in the hardware architecture; after optimization based on the method provided by the embodiment of the invention, an optimized electrochemical model, namely a low-precision P2D model, is obtained based on the hybrid-precision strategy. For the same input data, feeding it into the high-precision P2D model yields a high-precision result; feeding it into the low-precision P2D model and performing error compensation through the BP neural network yields an approximate high-precision result close to that of the high-precision P2D model.
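A minimal sketch of such a BP error-compensation network is shown below. Everything here is a toy stand-in: the "low-precision" outputs are simulated by a synthetic scale/offset error rather than an actual hardware run, and the network size and training hyperparameters are arbitrary choices, not values from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: the "low-precision model" output differs from the
# "high-precision model" output by a systematic scale/offset error.
x_high = rng.uniform(-1, 1, size=(256, 1))
x_low = 0.9 * x_high + 0.1                        # simulated precision loss

# One-hidden-layer BP network trained by full-batch gradient descent to
# map low-precision results back toward high-precision results (step D1).
W1 = rng.normal(0.0, 0.5, (1, 16)); b1 = np.zeros(16)
W2 = rng.normal(0.0, 0.5, (16, 1)); b2 = np.zeros(1)
lr = 0.1
for _ in range(3000):
    h = np.tanh(x_low @ W1 + b1)                  # hidden layer
    y = h @ W2 + b2                               # compensated output
    err = y - x_high                              # MSE gradient (up to 2x)
    gW2 = h.T @ err / len(x_low); gb2 = err.mean(axis=0)
    dh = (err @ W2.T) * (1.0 - h ** 2)            # backprop through tanh
    gW1 = x_low.T @ dh / len(x_low); gb1 = dh.mean(axis=0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

compensated = np.tanh(x_low @ W1 + b1) @ W2 + b2  # step D2: final result
print("MSE before compensation:", float(np.mean((x_low - x_high) ** 2)))
print("MSE after compensation: ", float(np.mean((compensated - x_high) ** 2)))
```

After training, the compensated outputs track the high-precision reference much more closely than the raw low-precision outputs, which is the effect steps D1-D2 rely on.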
The embodiment of the invention also provides a hardware architecture, which is an optimized hardware architecture obtained by the method for deploying hybrid precision provided by any of the above embodiments. For example, in step 105, if the computing performance of the optimized hardware architecture meets the requirement, the optimized hardware architecture is deployed to obtain the hardware architecture provided by the embodiment of the invention. This hardware architecture is low-precision, but it has the advantages of a high calculation speed and low resource consumption, can deploy the electrochemical model with less hardware, and is low-cost.
The method for deploying the hybrid precision provided by the embodiment of the invention is described in detail above, the method can also be realized by a corresponding device, and the device for deploying the hybrid precision provided by the embodiment of the invention is described in detail below.
Fig. 3 shows a schematic structural diagram of an apparatus for deploying hybrid precision according to an embodiment of the present invention. As shown in fig. 3, the apparatus for deploying hybrid precision includes:
a first determining module 31, configured to determine a pending calculating module in a hardware architecture deployed with an electrochemical model, where the pending calculating module is a calculating module that consumes resources greater than a preset threshold;
The embedded point module 32 is configured to perform embedded point processing on the pending computation module, and determine resource consumption data of the pending computation module when the hardware architecture runs the electrochemical model;
a second determining module 33, configured to use, according to the resource consumption data, at least one pending computing module as a target computing module that needs to reduce consumed resources, and to determine an optimization scheme for the target computing module; the optimization scheme is a scheme for reducing the floating-point precision of at least part of the area;
an optimization module 34, configured to deploy the corresponding optimization scheme to the target computing module, and determine a computing performance of the optimized hardware architecture when the optimized hardware architecture runs the electrochemical model;
and the ending module 35 is configured to end the optimization when the calculation performance of the optimized hardware architecture meets the requirement.
In one possible implementation manner, the optimization scheme includes:
the first optimization scheme: reducing the floating-point numbers of all areas of the target computing module to BF16 half-precision floating-point numbers;
alternatively, the second optimization scheme: reducing the floating-point numbers of a partial area of the target computing module to BF16 half-precision floating-point numbers, and keeping the remaining area unchanged;
alternatively, the third optimization scheme: reducing the floating-point numbers of a partial area of the target computing module to FP16 half-precision floating-point numbers, and keeping the remaining area unchanged.
In one possible implementation, the apparatus further includes: a re-optimization module for:
under the condition that the calculation performance of the optimized hardware architecture does not meet the requirement, determining an optimization scheme with higher calculation precision again;
deploying a redetermined optimization scheme on the target computing module, and determining computing performance of the re-optimized hardware architecture when the re-optimized hardware architecture runs the electrochemical model;
the calculation precision of the first optimization scheme is lower than that of the second optimization scheme, and the calculation precision of the second optimization scheme is lower than that of the third optimization scheme.
In one possible implementation, the apparatus further includes: an overflow handling module for:
an overflow calculation module that determines that there is a data overflow;
temporarily improving the floating point number precision of at least part of the area of the overflow calculation module, and operating the electrochemical model through a hardware architecture with temporarily improved precision to determine the overflow magnitude of the overflow calculation module;
And under the condition that the overflow magnitude of the overflow calculation module meets the requirement, the floating point number precision of the overflow calculation module is maintained.
In one possible implementation manner, the overflow processing module temporarily improves the floating point number precision of at least a part of the area of the overflow calculation module, including:
and temporarily improving the floating point number of the overflow calculation module to be the FP64 double-precision floating point number.
In one possible implementation manner, the second determining module 33 regards at least one pending computing module as a target computing module that needs to reduce the consumed resources according to the resource consumption data, including:
sorting the pending computing modules according to their space resource consumption data, and using the top n1 pending computing modules with the largest space resource consumption as target computing modules;
or, sorting the pending computing modules according to their time resource consumption data, and using the top n2 pending computing modules with the largest time resource consumption as target computing modules;
or, sorting the pending computing modules according to the weighted sum of their space resource consumption data and time resource consumption data, and using the top n3 pending computing modules with the largest weighted sums as target computing modules;
Wherein the resource consumption data comprises the spatial resource consumption data and/or the temporal resource consumption data.
In one possible implementation, the apparatus further includes: an error compensation module for:
determining an error compensation model, wherein the error compensation model is used for performing error compensation on the result of the low-precision electrochemical model and outputting the result of the high-precision electrochemical model;
and inputting a result of the optimized hardware architecture when the electrochemical model is operated into the error compensation model, and taking an output result of the error compensation model as a final result.
In one possible implementation, the embedded point module 32 determines resource consumption data of the pending computation module when the hardware architecture runs the electrochemical model, including:
determining resource consumption data of the pending computation module and original computation accuracy of the pending computation module when the hardware architecture runs the electrochemical model;
the apparatus further comprises: the judging module is used for: and under the condition that the difference between the original calculation precision of the target calculation module and the calculation precision of the optimized target calculation module is smaller than a preset difference, determining that the calculation performance of the optimized hardware architecture meets the requirement.
It should be noted that, when the device for deploying hybrid precision provided in the above embodiment implements the corresponding function, only the division of the above functional modules is used for illustration, and in practical application, the above functional allocation may be implemented by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules, so as to implement all or part of the functions described above. In addition, the device for deploying the hybrid precision and the method embodiment for deploying the hybrid precision provided in the foregoing embodiments belong to the same concept, and specific implementation processes of the device for deploying the hybrid precision are detailed in the method embodiment, which is not described herein again.
According to one aspect of the present application, the present embodiment also provides a computer program product comprising a computer program comprising program code for performing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through a communication section. When the computer program is executed by a processor, the method for deploying hybrid precision provided by the embodiment of the application is executed.
In addition, the embodiment of the invention also provides a device for deploying the hybrid precision, which comprises a processor and a memory, wherein the memory stores a computer program, the processor can execute the computer program stored in the memory, and when the computer program is executed by the processor, the method for deploying the hybrid precision provided by any embodiment can be realized.
For example, FIG. 4 illustrates a hybrid precision deployment device provided by an embodiment of the present invention, the device comprising a bus 1110, a processor 1120, a transceiver 1130, a bus interface 1140, a memory 1150, and a user interface 1160.
In an embodiment of the present invention, the apparatus further includes: computer programs stored on the memory 1150 and executable on the processor 1120, which when executed by the processor 1120, implement the various processes of the method embodiments for deploying hybrid accuracy described above.
A transceiver 1130 for receiving and transmitting data under the control of the processor 1120.
In the embodiment of the invention, the bus architecture is represented by bus 1110. Bus 1110 may include any number of interconnected buses and bridges, and connects various circuits, including one or more processors represented by processor 1120 and memory represented by memory 1150.
Bus 1110 represents one or more of any of several types of bus structures, including a memory bus and a memory controller, a peripheral bus, an accelerated graphics port (Accelerated Graphics Port, AGP), a processor, or a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include: industry standard architecture (Industry Standard Architecture, ISA) bus, micro channel architecture (Micro Channel Architecture, MCA) bus, enhanced ISA (EISA) bus, video electronics standards association (Video Electronics Standards Association, VESA) bus, and peripheral component interconnect (Peripheral Component Interconnect, PCI) bus.
Processor 1120 may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the above method embodiments may be implemented by instructions in the form of integrated logic circuits in hardware or software in a processor. The processor includes: general purpose processors, central processing units (Central Processing Unit, CPU), network processors (Network Processor, NP), digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field programmable gate arrays (Field Programmable Gate Array, FPGA), complex programmable logic devices (Complex Programmable Logic Device, CPLD), programmable logic arrays (Programmable Logic Array, PLA), micro control units (Microcontroller Unit, MCU) or other programmable logic devices, discrete gates, transistor logic devices, discrete hardware components. The methods, steps and logic blocks disclosed in the embodiments of the present invention may be implemented or performed. For example, the processor may be a single-core processor or a multi-core processor, and the processor may be integrated on a single chip or located on multiple different chips.
The processor 1120 may be a microprocessor or any conventional processor. The steps of the method disclosed in connection with the embodiments of the present invention may be performed directly by a hardware decoding processor, or by a combination of hardware and software modules in the decoding processor. The software modules may be located in a random access Memory (Random Access Memory, RAM), flash Memory (Flash Memory), read-Only Memory (ROM), programmable ROM (PROM), erasable Programmable ROM (EPROM), registers, and so forth, as are known in the art. The readable storage medium is located in a memory, and the processor reads the information in the memory and, in combination with its hardware, performs the steps of the above method.
Bus 1110 may also connect together various other circuits, such as peripheral devices, voltage regulators, or power management circuits, and bus interface 1140 provides an interface between bus 1110 and transceiver 1130; all of this is well known in the art and is therefore not described further in the embodiments of the present invention.
The transceiver 1130 may be one element or a plurality of elements, such as a plurality of receivers and transmitters, providing a means for communicating with various other apparatus over a transmission medium. For example: the transceiver 1130 receives external data from other devices, and the transceiver 1130 is configured to transmit the data processed by the processor 1120 to the other devices. Depending on the nature of the computer system, a user interface 1160 may also be provided, for example: touch screen, physical keyboard, display, mouse, speaker, microphone, trackball, joystick, stylus.
It should be appreciated that in embodiments of the present invention, the memory 1150 may further comprise memory located remotely from the processor 1120, such remotely located memory being connectable to a server through a network. One or more portions of the above-described networks may be an ad hoc network (ad hoc network), an intranet, an extranet (extranet), a Virtual Private Network (VPN), a Local Area Network (LAN), a Wireless Local Area Network (WLAN), a Wide Area Network (WAN), a Wireless Wide Area Network (WWAN), a Metropolitan Area Network (MAN), the Internet (Internet), a Public Switched Telephone Network (PSTN), a plain old telephone service network (POTS), a cellular telephone network, a wireless fidelity (Wi-Fi) network, or a combination of two or more of the above-described networks. For example, the cellular telephone network and wireless network may be a global system for mobile communications (GSM) system, a Code Division Multiple Access (CDMA) system, a Worldwide Interoperability for Microwave Access (WiMAX) system, a General Packet Radio Service (GPRS) system, a Wideband Code Division Multiple Access (WCDMA) system, a Long Term Evolution (LTE) system, an LTE Frequency Division Duplex (FDD) system, an LTE Time Division Duplex (TDD) system, a long term evolution-advanced (LTE-A) system, a Universal Mobile Telecommunications System (UMTS), an enhanced mobile broadband (Enhanced Mobile Broadband, eMBB) system, a massive machine type communication (massive Machine Type Communication, mMTC) system, an ultra-reliable low-latency communication (Ultra Reliable Low Latency Communications, URLLC) system, and the like.
It should be appreciated that the memory 1150 in embodiments of the present invention may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. Wherein the nonvolatile memory includes: read-Only Memory (ROM), programmable ROM (PROM), erasable Programmable EPROM (EPROM), electrically Erasable EPROM (EEPROM), or Flash Memory (Flash Memory).
The volatile memory includes: Random Access Memory (RAM), which acts as an external cache. By way of example, and not limitation, many forms of RAM are available, such as: Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), and Direct Rambus RAM (DR RAM). The memory 1150 described in embodiments of the present invention includes, but is not limited to, these and any other suitable types of memory.
In an embodiment of the invention, the memory 1150 stores the following elements of an operating system 1151 and application programs 1152: executable modules or data structures, or a subset or extended set thereof.
Specifically, the operating system 1151 includes various system programs, such as a framework layer, a core library layer, and a driver layer, which are used for implementing various basic services and processing hardware-based tasks. The application programs 1152 include various applications, such as a Media Player and a Browser, for implementing various application services. A program implementing the method of the embodiment of the present invention may be included in the application programs 1152. The application programs 1152 include: applets, objects, components, logic, data structures, and other computer-system-executable instructions that perform particular tasks or implement particular abstract data types.
In addition, an embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements each process of the above-described method embodiment for deploying hybrid precision and achieves the same technical effects; to avoid repetition, details are not described here again.
The computer-readable storage medium includes persistent and non-persistent, removable and non-removable media, and is a tangible device that can retain and store instructions for use by an instruction execution device. The computer-readable storage medium includes: electronic storage, magnetic storage, optical storage, electromagnetic storage, semiconductor storage, and any suitable combination of the foregoing. Examples include: Phase-change RAM (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read-Only Memory (ROM), Non-Volatile RAM (NVRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Flash Memory or other memory technology, Compact Disc Read-Only Memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, memory sticks, mechanically encoded devices (e.g., punch cards or raised structures in grooves having instructions recorded thereon), or any other non-transmission medium that can be used to store information accessible by a computing device. As defined in the present embodiments, the computer-readable storage medium does not include transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
In the several embodiments provided herein, it should be understood that the disclosed apparatus, devices, and methods may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, e.g., the division of the modules or units is merely a logical functional division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. In addition, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices, or elements, or may be an electrical, mechanical, or other form of connection.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one position or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments of the present invention.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of the embodiments of the present invention essentially, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (such as a personal computer, a server, a data center, or another network device) to perform all or part of the steps of the methods according to the embodiments of the present invention. The storage medium includes the various media exemplified above that can store program code.
In the description of the embodiments of the present invention, those skilled in the art should appreciate that the embodiments of the present invention may be implemented as a method, an apparatus, or a hardware architecture. Thus, embodiments of the present invention may take the form of: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software. Furthermore, in some embodiments, the embodiments of the invention may also take the form of a computer program product embodied in one or more computer-readable storage media having computer program code embodied therein.
Any combination of one or more computer-readable storage media may be employed. A computer-readable storage medium may be, for example, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination thereof. More specific examples of the computer-readable storage medium include: a portable computer diskette, a hard disk, Random Access Memory (RAM), Read-Only Memory (ROM), Erasable Programmable Read-Only Memory (EPROM), Flash Memory, an optical fiber, Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination thereof. In embodiments of the present invention, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The computer program code embodied in the computer readable storage medium may be transmitted using any appropriate medium, including: wireless, wire, fiber optic cable, radio Frequency (RF), or any suitable combination thereof.
Computer program code for carrying out operations of embodiments of the present invention may be written as assembly instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state-setting data, or integrated circuit configuration data, or in one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the C language or similar programming languages. The computer program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the remote-computer scenario, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer.
The embodiments of the present invention describe the provided methods, apparatuses, and devices through flowcharts and/or block diagrams.
It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions. These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instructions which implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
The foregoing is merely a specific implementation of the embodiment of the present invention, but the protection scope of the embodiment of the present invention is not limited thereto, and any person skilled in the art can easily think about changes or substitutions within the technical scope of the embodiment of the present invention, and the changes or substitutions are covered by the protection scope of the embodiment of the present invention. Therefore, the protection scope of the embodiments of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A method of deploying hybrid precision, comprising:
determining a pending computation module in a hardware architecture deployed with an electrochemical model, wherein the pending computation module is a computation module whose resource consumption is greater than a preset threshold;
performing instrumentation (buried-point) processing on the pending computation module, and determining resource consumption data of the pending computation module when the hardware architecture runs the electrochemical model;
taking, according to the resource consumption data, at least one pending computation module as a target computation module whose resource consumption needs to be reduced, and determining an optimization scheme for the target computation module, wherein the optimization scheme is a scheme that reduces the floating-point precision of at least a partial region;
deploying the corresponding optimization scheme for the target computation module, and determining the computing performance of the optimized hardware architecture when it runs the electrochemical model;
and ending the optimization when the computing performance of the optimized hardware architecture meets the requirement.
2. The method of claim 1, wherein the optimization scheme comprises:
a first optimization scheme: reducing the floating-point numbers of all regions of the target computation module to BF16 half-precision floating-point numbers;
or a second optimization scheme: reducing the floating-point numbers of a partial region of the target computation module to BF16 half-precision floating-point numbers while keeping the remaining region unchanged;
or a third optimization scheme: reducing the floating-point numbers of a partial region of the target computation module to FP16 half-precision floating-point numbers while keeping the remaining region unchanged.
3. The method as recited in claim 2, further comprising:
re-determining an optimization scheme with higher calculation precision when the computing performance of the optimized hardware architecture does not meet the requirement;
deploying the re-determined optimization scheme on the target computation module, and determining the computing performance of the re-optimized hardware architecture when it runs the electrochemical model;
wherein the calculation precision of the first optimization scheme is lower than that of the second optimization scheme, and the calculation precision of the second optimization scheme is lower than that of the third optimization scheme.
4. The method as recited in claim 1, further comprising:
determining an overflow computation module in which data overflow exists;
temporarily increasing the floating-point precision of at least a partial region of the overflow computation module, and running the electrochemical model on the hardware architecture with temporarily increased precision to determine the overflow magnitude of the overflow computation module;
and maintaining the floating-point precision of the overflow computation module when its overflow magnitude meets the requirement.
5. The method of claim 4, wherein temporarily increasing the floating-point precision of at least a partial region of the overflow computation module comprises:
temporarily raising the floating-point numbers of the overflow computation module to FP64 double-precision floating-point numbers.
6. The method of claim 1, wherein taking, according to the resource consumption data, at least one pending computation module as a target computation module whose resource consumption needs to be reduced comprises:
sorting the pending computation modules by their space resource consumption data, and taking the top n1 pending computation modules with the largest space resource consumption as target computation modules;
or sorting the pending computation modules by their time resource consumption data, and taking the top n2 pending computation modules with the largest time resource consumption as target computation modules;
or sorting the pending computation modules by a weighted sum of their space resource consumption data and time resource consumption data, and taking the top n3 pending computation modules with the largest weighted sum as target computation modules;
wherein the resource consumption data comprises the space resource consumption data and/or the time resource consumption data.
7. The method as recited in claim 1, further comprising:
determining an error compensation model, wherein the error compensation model is configured to perform error compensation on the result of the low-precision electrochemical model and to output the result of the high-precision electrochemical model;
and inputting the result of running the electrochemical model on the optimized hardware architecture into the error compensation model, and taking the output of the error compensation model as the final result.
8. The method of claim 1, wherein determining the resource consumption data of the pending computation module when the hardware architecture runs the electrochemical model comprises:
determining the resource consumption data of the pending computation module and the original calculation precision of the pending computation module when the hardware architecture runs the electrochemical model;
the method further comprising:
determining that the computing performance of the optimized hardware architecture meets the requirement when the difference between the original calculation precision of the target computation module and the calculation precision of the optimized target computation module is smaller than a preset difference.
9. An apparatus for deploying hybrid precision, comprising:
a first determining module, configured to determine a pending computation module in a hardware architecture deployed with an electrochemical model, wherein the pending computation module is a computation module whose resource consumption is greater than a preset threshold;
an instrumentation (buried-point) module, configured to perform instrumentation processing on the pending computation module and determine resource consumption data of the pending computation module when the hardware architecture runs the electrochemical model;
a second determining module, configured to take, according to the resource consumption data, at least one pending computation module as a target computation module whose resource consumption needs to be reduced, and determine an optimization scheme for the target computation module, the optimization scheme being a scheme that reduces the floating-point precision of at least a partial region;
an optimization module, configured to deploy the corresponding optimization scheme for the target computation module and determine the computing performance of the optimized hardware architecture when it runs the electrochemical model;
and an ending module, configured to end the optimization when the computing performance of the optimized hardware architecture meets the requirement.
10. A hardware architecture, characterized in that it is an optimized hardware architecture obtained based on the method for deploying hybrid precision according to any one of claims 1 to 8.
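The optimization loop of claims 1 through 3 can be illustrated with a small sketch. This is a hypothetical driver, not the patent's implementation: the `Module` class, the string scheme names, and the callbacks are all illustrative assumptions; schemes are ordered from lowest to highest precision and escalated until performance passes, matching the re-determination step of claim 3.

```python
from dataclasses import dataclass

@dataclass
class Module:
    name: str
    estimated_cost: float   # stand-in for profiled resource consumption
    scheme: str = "fp32"    # current precision scheme

def deploy_mixed_precision(modules, cost_threshold, schemes, run_model, meets_requirement):
    """Select pending modules whose cost exceeds the threshold, take the most
    expensive one as the target, then try precision-reduction schemes from
    lowest to highest precision until computing performance is acceptable."""
    pending = [m for m in modules if m.estimated_cost > cost_threshold]
    if not pending:
        return None
    target = max(pending, key=lambda m: m.estimated_cost)  # profiling stand-in
    for scheme in schemes:  # e.g. ["bf16_all", "bf16_partial", "fp16_partial"]
        target.scheme = scheme
        if meets_requirement(run_model(modules)):
            return target.name, scheme
    return None  # no scheme met the requirement
```

A caller would supply `run_model` as the actual electrochemical-model run and `meets_requirement` as the precision-difference check of claim 8.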
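Claims 2 and 3 distinguish the BF16 and FP16 half-precision formats. As an illustrative NumPy sketch (not the patent's hardware implementation), BF16 can be emulated by zeroing the low 16 bits of a float32 value, since BF16 is the upper half of the IEEE 754 single-precision layout; FP16 is a round-trip through IEEE half precision. Note the BF16 variant truncates rather than rounds, a simplification.

```python
import numpy as np

def to_bf16(x):
    """Emulate BF16 by zeroing the low 16 mantissa bits of float32 values
    (truncation; production code would round to nearest even)."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

def to_fp16(x):
    """Round-trip through IEEE 754 half precision (FP16)."""
    return np.asarray(x, dtype=np.float32).astype(np.float16).astype(np.float32)
```

BF16 keeps the float32 exponent range but only 7 mantissa bits, whereas FP16 keeps 10 mantissa bits with a much narrower exponent range, which is why claim 4's overflow handling is most relevant to the FP16 scheme.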
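Claims 4 and 5 temporarily raise an overflowing module to double precision to measure the overflow magnitude. A minimal sketch under stated assumptions: the module is modeled as a plain array function, and non-finite low-precision output is used as the overflow signal.

```python
import numpy as np

def measure_overflow(module_fn, x, low=np.float16):
    """Run module_fn at reduced precision; if the result overflows
    (non-finite values appear), rerun at FP64 to recover the magnitude."""
    y = module_fn(np.asarray(x, dtype=low))
    if np.all(np.isfinite(y)):
        return y, None                      # no overflow at low precision
    y64 = module_fn(np.asarray(x, dtype=np.float64))
    return y64, float(np.max(np.abs(y64)))  # overflow magnitude at FP64
```

Per claim 4, if the measured magnitude still fits the chosen format after, say, rescaling, the module's low floating-point precision can be kept.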
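Claim 6's target selection reduces to ranking modules by resource consumption. The sketch below is an assumption-laden illustration: the dictionary shape, module names, and default weights are invented for the example, and setting one weight to zero recovers the pure space or pure time ranking of the claim's first two branches.

```python
def select_targets(consumption, n, w_space=0.5, w_time=0.5):
    """consumption maps module name -> (space_cost, time_cost).
    Return the top-n modules ranked by the weighted sum of both costs."""
    ranked = sorted(
        consumption,
        key=lambda m: w_space * consumption[m][0] + w_time * consumption[m][1],
        reverse=True,
    )
    return ranked[:n]
```

Claim 7's error compensation could then be layered on top, e.g. by fitting a correction from paired low- and high-precision runs of the selected targets, though the patent does not specify the model form.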
CN202310086701.0A 2023-01-18 2023-01-18 Method, device and hardware architecture for deploying hybrid precision Pending CN116232581A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310086701.0A CN116232581A (en) 2023-01-18 2023-01-18 Method, device and hardware architecture for deploying hybrid precision

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310086701.0A CN116232581A (en) 2023-01-18 2023-01-18 Method, device and hardware architecture for deploying hybrid precision

Publications (1)

Publication Number Publication Date
CN116232581A true CN116232581A (en) 2023-06-06

Family

ID=86590457

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310086701.0A Pending CN116232581A (en) 2023-01-18 2023-01-18 Method, device and hardware architecture for deploying hybrid precision

Country Status (1)

Country Link
CN (1) CN116232581A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116822253A (en) * 2023-08-29 2023-09-29 山东省计算中心(国家超级计算济南中心) Hybrid precision implementation method and system suitable for MANUM sea wave mode

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116822253A (en) * 2023-08-29 2023-09-29 山东省计算中心(国家超级计算济南中心) Hybrid precision implementation method and system suitable for MANUM sea wave mode
CN116822253B (en) * 2023-08-29 2023-12-08 山东省计算中心(国家超级计算济南中心) Hybrid precision implementation method and system suitable for MANUM sea wave mode

Similar Documents

Publication Publication Date Title
CN108701250B (en) Data fixed-point method and device
KR20080089313A (en) Method and apparatus for performing multiplicative functions
CN116232581A (en) Method, device and hardware architecture for deploying hybrid precision
US20190339938A1 (en) Very low precision floating point representation for deep learning acceleration
US11620105B2 (en) Hybrid floating point representation for deep learning acceleration
EP2834731B1 (en) System and method for a floating-point format for digital signal processors
CN111967608A (en) Data processing method, device, equipment and storage medium
Kwon et al. Sparse convolutional neural network acceleration with lossless input feature map compression for resource‐constrained systems
CN113988438A (en) Self-checking method and system based on IC carrier plate production process
CN111552652B (en) Data processing method and device based on artificial intelligence chip and storage medium
US20230214638A1 (en) Apparatus for enabling the conversion and utilization of various formats of neural network models and method thereof
CN112183744A (en) Neural network pruning method and device
US20210357753A1 (en) Method and apparatus for multi-level stepwise quantization for neural network
US20220113943A1 (en) Method for multiply-add operations for neural network
US9171117B2 (en) Method for ranking paths for power optimization of an integrated circuit design and corresponding computer program product
US8713086B2 (en) Three-term predictive adder and/or subtracter
CN114253956A (en) Edge caching method and device and electronic equipment
CN104823153A (en) Leading change anticipator logic
CN116360729A (en) Shift weighting operation method and device and FPGA
CN111860898A (en) Method and device for updating decision of equipment and electronic equipment
CN116047310B (en) Battery model parameter identification method and device and electronic equipment
US20190236354A1 (en) Information processing method and information processing system
US8842784B2 (en) L-value generation in a decoder
WO2024087185A1 (en) Memory access adaptive self-attention mechanism for transformer model
CN116520147A (en) Battery consistency calculating method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination