CN116541075B - Domain-specific architecture processor and acceleration computing method, medium and device thereof

Info

Publication number: CN116541075B
Authority: CN (China)
Application number: CN202310815903.4A
Other versions: CN116541075A (Chinese, zh)
Prior art keywords: acceleration, instruction, programmable logic, code, logic array
Legal status: Active (granted)
Inventors: 孔令军, 庞兆春, 林宁亚, 宋琪, 邹晓峰
Assignee: Suzhou Inspur Intelligent Technology Co Ltd
Application filed by Suzhou Inspur Intelligent Technology Co Ltd; priority to CN202310815903.4A; published as CN116541075A, then granted and published as CN116541075B

Classifications

    • G06F9/30098: Register arrangements (G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06F ELECTRIC DIGITAL DATA PROCESSING > G06F9/00 Arrangements for program control, e.g. control units > G06F9/06 using stored programs > G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode)
    • G06F9/30076: Arrangements for executing specific machine instructions to perform miscellaneous control operations, e.g. NOP (under G06F9/30003 Arrangements for executing specific machine instructions)
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management (Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES)

Abstract

The invention provides a domain-specific architecture processor and an acceleration computing method, medium, and device thereof, relating to the technical field of processors. The acceleration computing method comprises the following steps: analyzing program code to find an acceleratable code segment corresponding to a domain-specific algorithm, generating a corresponding netlist based on the acceleratable code segment, sending the netlist to a programmable logic array, and generating a dynamic acceleration instruction matched with the acceleratable code segment according to a preset instruction table; generating acceleration logic in the programmable logic array based on the received netlist, and performing acceleration logic configuration; in response to the programmable logic array completing configuration, replacing the acceleratable code segment in the program code with the dynamic acceleration instruction; and executing the acceleration logic through the programmable logic array based on the dynamic acceleration instruction in the program code, to perform the computation of the domain-specific algorithm. The method improves the efficiency of accelerated computation for domain-specific algorithms.

Description

Domain-specific architecture processor and acceleration computing method, medium and device thereof
Technical Field
The present invention relates to the field of processor technologies, and in particular, to a domain-specific architecture processor and an acceleration computing method, medium, and apparatus thereof.
Background
With the development of technology and applications, heavy computing demands arise in more and more fields, the requirements on the performance and speed of computer hardware keep rising, and DSA (domain-specific architecture) processors are developing rapidly. For example: to offload the image-processing load of the CPU (central processing unit), the GPU (graphics processing unit) was designed specifically to be mounted on the system bus; to handle acceleration tasks such as cryptography, dedicated cryptographic acceleration chips were designed; similarly, multimedia processing chips, audio processing chips, communication chips, blockchain accelerator chips, and more have been designed for increasingly specialized, finely subdivided fields.
Thus, there are many subdivided fields or algorithms that need acceleration; however, not every field can support a dedicated chip, and most fields still perform their computations on conventional general-purpose processors.
To help general-purpose processors adapt to more algorithms, a technique of embedding a general-purpose processor core in an FPGA (Field Programmable Gate Array, referred to below as a programmable logic array) has been proposed. Fig. 1 shows a schematic diagram of the architecture of an FPGA combined with a general-purpose processor for domain-specific computing according to the prior art. As shown in fig. 1, the general-purpose processor is connected with the FPGA through an on-chip bus. The position occupied by the FPGA would traditionally hold a chip designed for a subdivided field, such as a GPU or an artificial intelligence processor; when used in a specific field, the FPGA can be configured by software design into dedicated hardware suited to that field.
However, before such a dedicated design can be applied, FPGA engineers and hardware engineers must first design new hardware logic for the on-chip FPGA, and once the FPGA has been configured, it can perform no other type of computation.
Disclosure of Invention
In view of the above, the present invention aims to provide a domain-specific architecture processor and an acceleration computing method, medium, and device thereof, so as to solve the prior-art problems of low acceleration efficiency and poor compatibility that arise when accelerated computation of a domain-specific algorithm requires mounting an FPGA on each processor through a bus and manually programming a dedicated acceleration chip or hardware.
Based on the above object, the present invention provides an acceleration computing method for a domain-specific architecture processor, comprising the following steps:
analyzing program code to find an acceleratable code segment corresponding to a domain-specific algorithm, generating a corresponding netlist based on the acceleratable code segment, sending the netlist to a programmable logic array, and generating a dynamic acceleration instruction matched with the acceleratable code segment according to a preset instruction table;
generating acceleration logic based on the received netlist through the programmable logic array, and performing acceleration logic configuration;
in response to the programmable logic array completing configuration, replacing the acceleratable code segment in the program code with the dynamic acceleration instruction;
acceleration logic is executed by the programmable logic array based on dynamic acceleration instructions in the program code to perform computation of the domain-specific algorithm.
In some embodiments, the method further comprises:
and identifying and extracting the dynamic acceleration instruction from the program code, analyzing the extracted dynamic acceleration instruction to obtain a micro instruction, and sending the micro instruction to the programmable logic array.
In some embodiments, executing acceleration logic by the programmable logic array based on dynamic acceleration instructions in the program code includes:
acceleration logic is executed by the programmable logic array based on the received microinstructions.
In some embodiments, the method further comprises:
generating a feature code for the acceleratable code segment, and judging whether the feature code already exists in a feature code library;
in response to the feature code already existing in the feature code library, sending an indication to the programmable logic array;
in response to the programmable logic array receiving the indication that the feature code exists in the feature code library, confirming whether corresponding acceleration logic is retained in the programmable logic array;
in response to the corresponding acceleration logic being retained, directly configuring the programmable logic array using the corresponding acceleration logic.
In some embodiments, the method further comprises:
in response to no corresponding acceleration logic being retained, or the feature code not existing in the feature code library, generating a corresponding netlist based on the acceleratable code segment and generating a dynamic acceleration instruction matched with the acceleratable code segment according to the preset instruction table.
In some embodiments, the method further comprises:
configuring a counter for the acceleration logic, the counter decrementing its count value over time;
in response to the acceleration logic being used an Nth time, increasing the current count value of the counter by a weight value corresponding to the acceleration logic, wherein N is an integer greater than 1;
in response to the count value of the counter decrementing to zero, placing the acceleration logic in a deletable state.
In some embodiments, the method further comprises:
in response to the count value of the counter not decrementing to zero, it is determined that the acceleration logic is currently resident in the programmable logic array.
In some embodiments, the method further comprises:
in response to insufficient remaining space in the programmable logic array, deleting the acceleration logic that entered the deletable state earliest.
In some embodiments, the method further comprises:
in response to the feature code already existing in the feature code library, directly increasing the life value of the feature code in the feature code library by a preset value.
In some embodiments, the method further comprises:
configuring a start value for the counter of the acceleration logic based on the life value of the feature code, the magnitude of the start value being consistent with the magnitude of the life value.
In some embodiments, the method further comprises:
in response to the feature code not existing in the feature code library, storing the feature code into the feature code library and assigning its life value an initial value.
In some embodiments, the method further comprises:
storing the netlist and the dynamic acceleration instruction into a cache queue.
In some embodiments, the method further comprises:
judging whether the remaining space of the programmable logic array reaches a preset threshold;
and in response to the remaining space reaching the preset threshold, obtaining the netlist from the cache queue and transmitting it to the programmable logic array.
In some embodiments, generating a dynamic acceleration instruction matched with the acceleratable code segment according to a preset instruction table includes:
selecting an instruction format matched with the acceleratable code segment from the preset instruction table, and generating the dynamic acceleration instruction based on the instruction format.
In some embodiments, the method further comprises:
in response to finding the acceleratable code segment, judging whether the number of algorithm instructions in the acceleratable code segment exceeds a preset number;
in response to the number not exceeding the preset number, determining that the acceleratable code segment does not need to be accelerated.
In some embodiments, the method further comprises:
in response to the number exceeding the preset number, generating the dynamic acceleration instruction matched with the acceleratable code segment according to the preset instruction table.
In some embodiments, the method further comprises:
the preset number is determined based on a space size of the programmable logic array.
In another aspect of the present invention, there is also provided a domain-specific architecture processor, including:
an instruction buffer configured to store program code;
the ASIC chip is configured to search the accelerating code segment corresponding to the domain-specific algorithm from the program code, generate a corresponding netlist based on the accelerating code segment, and generate a dynamic accelerating instruction matched with the accelerating code segment according to a preset instruction table; and
a programmable logic array configured to receive the netlist and generate and configure acceleration logic based on the netlist,
wherein the ASIC chip is further configured to replace the acceleratable code segment in the instruction buffer with the dynamic acceleration instruction in response to the programmable logic array completing configuration, and the programmable logic array is further configured to execute the acceleration logic based on the dynamic acceleration instruction in the instruction buffer, to perform the computation of the domain-specific algorithm.
In some embodiments, the processor further comprises:
an instruction fetcher configured to identify and extract the dynamic acceleration instruction in the instruction buffer;
a decoder configured to receive the dynamic acceleration instruction output by the instruction fetcher and parse it to obtain microinstructions; and
a FIFO memory configured to receive the microinstructions and send them to the programmable logic array;
the programmable logic array is further configured to execute acceleration logic based on the received microinstructions to perform computations of the domain-specific algorithm.
In yet another aspect of the present invention, there is also provided a computer readable storage medium storing computer program instructions which, when executed by a processor, implement the above-described method.
In yet another aspect of the present invention, there is also provided a computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, performs the above method.
The invention has at least the following beneficial technical effects:
During execution of program code, the acceleration computing method of the domain-specific architecture processor of the invention drives the programmable-logic-array accelerator through dynamic acceleration instructions to complete the acceleration task of acceleratable code segments, and can adapt to the accelerated-computation needs of different domain-specific algorithms; the programmable logic array can change its acceleration logic in real time according to the acceleratable code segments, so that acceleration of various algorithm codes is accomplished inside the processor without manual programming and without a dedicated acceleration chip or hardware, greatly improving the efficiency of accelerated computation for domain-specific algorithms.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are necessary for the description of the embodiments or the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention and that other embodiments may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a combined FPGA and general purpose processor for domain-specific computing according to the prior art;
FIG. 2 is a schematic diagram of an acceleration computing method of a domain-specific architecture processor according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an overall architecture of a domain-specific architecture processor according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a preset instruction table according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a domain-specific architecture processor provided in accordance with an embodiment of the present invention;
FIG. 6 is a schematic diagram of a computer readable storage medium implementing a method of accelerating computation of a domain-specific architecture processor, provided in accordance with an embodiment of the present invention;
FIG. 7 is a schematic diagram of the hardware structure of a computer device for executing an acceleration computing method of a domain-specific architecture processor according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention will be described in further detail with reference to the accompanying drawings.
It should be noted that, in the embodiments of the present invention, the expressions "first" and "second" are used to distinguish two entities or parameters that share a name but are not identical; "first" and "second" are used only for convenience of expression and should not be construed as limiting the embodiments of the present invention. Furthermore, the terms "comprise" and "have," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or device that comprises a list of steps or units is not necessarily limited to those steps or units.
Based on the above objects, a first aspect of the embodiments of the present invention proposes an embodiment of an acceleration computing method for a domain-specific architecture processor. Fig. 2 is a schematic diagram of an embodiment of the acceleration computing method of a domain-specific architecture processor provided by the present invention. As shown in fig. 2, the acceleration computing method of the domain-specific architecture processor according to the embodiment of the present invention includes the following steps:
Step S10, analyzing the program code to find an acceleratable code segment corresponding to the domain-specific algorithm, generating a corresponding netlist based on the acceleratable code segment, sending the netlist to the programmable logic array, and generating a dynamic acceleration instruction matched with the acceleratable code segment according to a preset instruction table;
Step S20, generating acceleration logic based on the received netlist through the programmable logic array, and carrying out acceleration logic configuration;
Step S30, in response to the programmable logic array completing configuration, replacing the acceleratable code segment in the program code with the dynamic acceleration instruction;
Step S40, executing the acceleration logic based on the dynamic acceleration instruction in the program code through the programmable logic array, to perform the computation of the domain-specific algorithm.
Fig. 3 is a schematic diagram of an overall architecture of a domain-specific architecture processor according to an embodiment of the present invention. As shown in fig. 3, the program code is stored in the instruction buffer; the ASIC (application-specific integrated circuit, i.e., an integrated circuit designed for a special purpose) chip searches the program code for the acceleratable code segment corresponding to the domain-specific algorithm, generates the corresponding netlist based on the acceleratable code segment, and generates the dynamic acceleration instruction matched with the acceleratable code segment according to the preset instruction table; after the programmable logic array completes configuration, the ASIC chip replaces the acceleratable code segment in the instruction buffer with the dynamic acceleration instruction, and the programmable logic array executes the acceleration logic based on the dynamic acceleration instruction in the instruction buffer to perform the computation of the domain-specific algorithm.
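As an editorial illustration of the flow just described, the following C-style sketch strings steps S10 through S40 together. The type names and helper functions (find_acceleratable_segment, synthesize_netlist, and so on) are assumptions introduced for this sketch only; the patent defines the steps, not these interfaces.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct CodeSegment CodeSegment;   /* acceleratable code segment              */
typedef struct Netlist Netlist;           /* netlist sent to the logic array         */
typedef uint32_t Instruction;             /* 32-bit dynamic acceleration instruction */

/* Hypothetical helpers standing in for the ASIC analyzer and the FPGA fabric. */
extern CodeSegment *find_acceleratable_segment(const char *program_code);
extern Netlist *synthesize_netlist(const CodeSegment *seg);
extern Instruction pick_dynamic_instruction(const CodeSegment *seg); /* from the preset table */
extern bool fpga_configure(const Netlist *nl);                       /* step S20 */
extern void replace_segment(CodeSegment *seg, Instruction insn);     /* step S30 */
extern void fpga_execute(Instruction insn);                          /* step S40 */

void accelerate(const char *program_code) {
    CodeSegment *seg = find_acceleratable_segment(program_code); /* S10: analyze */
    if (seg == NULL)
        return;                                   /* nothing to accelerate */
    Netlist *nl = synthesize_netlist(seg);        /* S10: generate the netlist */
    Instruction insn = pick_dynamic_instruction(seg);
    if (fpga_configure(nl)) {                     /* S20: build acceleration logic */
        replace_segment(seg, insn);               /* S30: swap in the instruction */
        fpga_execute(insn);                       /* S40: run the acceleration logic */
    }
}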
In this embodiment, selecting acceleratable code segments from the program code covers three cases:
1. Loop-and-load computations can be accelerated into parallel computations.
Condition: the loop count is a fixed length, the load addresses fall within the memory or hard-disk range rather than the IO (input/output) device range, and no additional operation is performed on the addressed data before, after, or in the middle of the code segment.
Example: for (int i = 0; i < 20; i++) { s[i] = a[i] + b[i]; }
In this example, 20 loop iterations are required before acceleration; after acceleration, only a 20-element vector adder is required.
2. Continuous-sequence computations can be accelerated into a single short-cycle direct computation.
Condition: there is no load operation into scalar registers inside the code segment, and the data in the segment is not referenced after the code segment.
Example: ld t1 a1; ld t2 a2; ld t3 a3; t4 = t1 + t2; t5 = t3 × t1; t6 = t4 + t5; sd t6 a0;
In this example, three data items are loaded into scalar registers and a series of computations, namely an add, a multiply, and an add, is completed. The add-multiply-add sequence can be integrated into one circuit so that the whole operation can be completed with a single preset instruction.
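As a rough illustration, the seven-instruction sequence above collapses into one fused expression. The C sketch below assumes the t-registers map to plain 64-bit integers, which is a simplification for illustration only:

#include <stdint.h>

/* Fused add-multiply-add: one circuit, and hence one preset instruction,
   replaces the whole load/compute/store sequence. t1, t2, t3 are the three
   values loaded from a1, a2, a3; the result is what gets stored to a0. */
int64_t fused_add_mul_add(int64_t t1, int64_t t2, int64_t t3) {
    int64_t t4 = t1 + t2;   /* first add  */
    int64_t t5 = t3 * t1;   /* multiply   */
    return t4 + t5;         /* second add */
}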
3. An introduced state analyzer performs learned judgment.
The code stream is loaded into a FIFO (first-in, first-out) memory and fed into a convolutional layer in units of 8 bits, reduced to a 4-bit layer through a fully connected layer and a pooling layer, then expanded back to an 8-bit layer, and finally passed through several compression layers to form a decision. The parameters of each layer can be adjusted by training so that the model can judge whether the code stream can be accelerated.
During execution of program code, the acceleration computing method of the domain-specific architecture processor of the embodiment of the invention drives the programmable-logic-array accelerator through dynamic acceleration instructions to complete the acceleration task of acceleratable code segments, and can adapt to the accelerated-computation needs of different domain-specific algorithms; the programmable logic array can change its acceleration logic in real time according to the acceleratable code segments, so that acceleration of various algorithm codes is accomplished inside the processor without manual programming and without a dedicated acceleration chip or hardware, greatly improving the efficiency of accelerated computation for domain-specific algorithms.
Thus, the programmable logic array (Field Programmable Gate Array, FPGA) in embodiments of the invention can be driven directly by instructions issued by the processor, rather than by bus commands.
In some embodiments, the method further comprises: and identifying and extracting the dynamic acceleration instruction from the program code, analyzing the extracted dynamic acceleration instruction to obtain a micro instruction, and sending the micro instruction to the programmable logic array.
In some embodiments, executing acceleration logic by the programmable logic array based on dynamic acceleration instructions in the program code includes: acceleration logic is executed by the programmable logic array based on the received microinstructions.
As shown in fig. 3, the dynamic acceleration instruction is identified in the instruction buffer by the instruction fetcher, then extracted and sent to the decoder; the decoder parses the dynamic acceleration instruction into a plurality of microinstructions as the concrete execution instructions and sends the microinstructions to the programmable logic array, which executes the acceleration logic to perform the computation of the domain-specific algorithm.
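The fetch-decode-dispatch path can be sketched as follows; the microinstruction word format, the queue depth, and the helper names are assumptions for illustration:

#include <stdint.h>

#define FIFO_DEPTH 64                 /* assumed queue depth */

typedef uint32_t Insn;                /* 32-bit dynamic acceleration instruction */
typedef uint64_t MicroOp;             /* assumed microinstruction word */

extern int is_dynamic_acceleration(Insn insn);        /* instruction fetcher's test */
extern int decode(Insn insn, MicroOp out[], int max); /* decoder: one insn, several micro-ops */
extern void fifo_push(MicroOp op);                    /* FIFO toward the programmable logic array */
extern void fpga_run(void);                           /* the array drains the FIFO */

void dispatch(Insn insn) {
    if (!is_dynamic_acceleration(insn))
        return;                        /* ordinary instructions take the normal path */
    MicroOp ops[FIFO_DEPTH];
    int n = decode(insn, ops, FIFO_DEPTH);
    for (int i = 0; i < n; i++)
        fifo_push(ops[i]);             /* queue the micro-ops for the array */
    fpga_run();                        /* acceleration logic executes them */
}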
In some embodiments, the method further comprises: generating a feature code for the acceleratable code segment, and judging whether the feature code already exists in a feature code library; in response to the feature code already existing in the feature code library, sending an indication to the programmable logic array; in response to the programmable logic array receiving the indication that the feature code exists in the feature code library, confirming whether corresponding acceleration logic is retained in the programmable logic array; and in response to the corresponding acceleration logic being retained, directly configuring the programmable logic array using the corresponding acceleration logic.
In some embodiments, the method further comprises: in response to no corresponding acceleration logic being retained, or the feature code not existing in the feature code library, generating a corresponding netlist based on the acceleratable code segment and generating a dynamic acceleration instruction matched with the acceleratable code segment according to the preset instruction table.
In the above embodiment, if the acceleration logic is still retained in the programmable logic array, it can be used directly; otherwise, the acceleration logic needs to be generated.
In some embodiments, the method further comprises: configuring a counter for the acceleration logic, the counter decrementing its count value over time; in response to the acceleration logic being used an Nth time, increasing the current count value of the counter by a weight value corresponding to the acceleration logic, wherein N is an integer greater than 1; and in response to the count value of the counter decrementing to zero, placing the acceleration logic in a deletable state.
In some embodiments, the method further comprises: in response to the count value of the counter not decrementing to zero, it is determined that the acceleration logic is currently resident in the programmable logic array.
In some embodiments, the method further comprises: in response to insufficient remaining space in the programmable logic array, deleting the acceleration logic that entered the deletable state earliest.
The above embodiments explain how acceleration logic is retained in the programmable logic array.
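The retention policy amounts to a leaky counter per piece of acceleration logic; a minimal C sketch with assumed field names:

#include <stdbool.h>

/* Per-acceleration-logic bookkeeping; the field names are illustrative. */
typedef struct {
    int count;        /* decremented as time passes                */
    int weight;       /* added back each time the logic is reused  */
    bool deletable;   /* set once the count has drained to zero    */
} AccelLogicState;

void on_tick(AccelLogicState *s) {     /* the count value decays over time */
    if (s->count > 0 && --s->count == 0)
        s->deletable = true;           /* zero count: eligible for deletion */
}

void on_reuse(AccelLogicState *s) {    /* the Nth use (N > 1) bumps the count */
    s->count += s->weight;
    s->deletable = false;              /* nonzero count: resident again */
}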
In some embodiments, the method further comprises: in response to the feature code already existing in the feature code library, directly increasing the life value of the feature code in the feature code library by a preset value.
In this embodiment, the life value is increased by a preset value, for example by 1.
In some embodiments, the method further comprises: configuring a start value for the counter of the acceleration logic based on the life value of the feature code, the magnitude of the start value being consistent with the magnitude of the life value.
In this embodiment, a start value may be configured for the counter: if the life value of the feature code is large, the start value is also large; if the life value of the feature code is small, the start value is also small.
In some embodiments, the method further comprises: in response to the feature code not existing in the feature code library, storing the feature code into the feature code library and assigning its life value an initial value.
In this embodiment, the initial value is, for example, 1.
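Taken together, the life-value rules reduce to a lookup-or-insert on the feature code library. A sketch, with the container and constants assumed:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PRESET_INCREMENT 1   /* example preset value from the text */
#define INITIAL_LIFE 1       /* example initial life value         */

typedef struct {
    uint64_t feature_code;   /* feature code generated for the segment */
    int life;                /* life value tracked by the library      */
} LibEntry;

extern LibEntry *library_find(uint64_t code);     /* hypothetical lookup */
extern LibEntry *library_insert(uint64_t code);   /* hypothetical insert */

/* Bump or create the library entry; the returned life value is later used
   to scale the start value of the acceleration logic's counter. */
int touch_feature_code(uint64_t code, bool *already_known) {
    LibEntry *e = library_find(code);
    if (e != NULL) {                   /* known code: raise its life value */
        e->life += PRESET_INCREMENT;
        *already_known = true;
    } else {                           /* new code: store with the initial value */
        e = library_insert(code);
        e->life = INITIAL_LIFE;
        *already_known = false;
    }
    return e->life;
}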
In some embodiments, the method further comprises: and storing the netlist and the dynamic acceleration instructions into a cache queue.
In some embodiments, the method further comprises: judging whether the remaining space of the programmable logic array reaches a preset threshold; and in response to the remaining space reaching the preset threshold, obtaining the netlist from the cache queue and transmitting it to the programmable logic array.
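A small sketch of this deferred-configuration path, with the queue and the space check as assumed interfaces:

#include <stdbool.h>
#include <stddef.h>

typedef struct Netlist Netlist;

extern bool fpga_space_reaches(int preset_threshold);   /* remaining-space check   */
extern Netlist *cache_queue_pop(void);                  /* queued netlists, if any */
extern void fpga_send_netlist(const Netlist *nl);

/* Drain the cache queue only while the programmable logic array has room. */
void try_configure(int preset_threshold) {
    while (fpga_space_reaches(preset_threshold)) {
        Netlist *nl = cache_queue_pop();
        if (nl == NULL)
            break;                     /* queue empty */
        fpga_send_netlist(nl);         /* the array builds the acceleration logic */
    }
}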
In some embodiments, generating a dynamic acceleration instruction matched with the acceleratable code segment according to a preset instruction table includes: selecting an instruction format matched with the acceleratable code segment from the preset instruction table, and generating the dynamic acceleration instruction based on the instruction format.
Fig. 4 is a schematic diagram of a preset instruction table according to an embodiment of the invention. As shown in fig. 4, the preset instruction table is designed as extended instructions: 32 instructions in four supported formats. FI.31-FI.00 are the instruction names; bits 0-31 mean that each instruction consists of a 32-bit binary number. rs3, rs2, and rs1 denote source register addresses, each a 5-bit binary number; the source registers hold the operands, and for each instruction the number of source register addresses can be determined according to the actual situation. imm denotes an immediate, which is also an operand, but a fixed number encoded in the instruction rather than a value taken from a source register. rd denotes the destination register address, also a 5-bit binary number, which stores the computed result. The values 000-111 in the funct3 column are the function code of the instruction, such as a compute function. Opcodes 1-4 in the opcode (operation code) column pair with the function codes; that is, the operation code and the function code of each instruction together indicate the function the instruction realizes. m denotes a mask; n denotes reserved bits whose custom content can be set according to the actual situation. The encodings of the 32 preset instructions are fixed, but their corresponding functions change dynamically according to the acceleratable code segments in the instruction buffer.
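The description fixes the field names and widths but not their bit positions, so the packing below is an assumption patterned on RISC-style encodings; it is meant only to make the field layout concrete:

#include <stdint.h>

/* One possible packing of the named fields into a 32-bit word. Only the
   widths come from the description (5-bit rs1/rs2/rs3 and rd, 3-bit funct3);
   the bit positions are assumed. The remaining high bits would hold m (mask)
   and n (reserved). */
static uint32_t encode_fi(uint32_t opcode, uint32_t funct3, uint32_t rd,
                          uint32_t rs1, uint32_t rs2, uint32_t rs3) {
    return (opcode & 0x7Fu)
         | (rd     & 0x1Fu) << 7
         | (funct3 & 0x07u) << 12
         | (rs1    & 0x1Fu) << 15
         | (rs2    & 0x1Fu) << 20
         | (rs3    & 0x1Fu) << 25;
}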
Specific functions of the preset instruction:
in the instruction buffer, it is assumed that the code segments that can be accelerated are divided into A, B, C.
After analysis, the code segment a generates a corresponding acceleration logic OA for the analysis unit, configures the instruction if.03 for the logic, and replaces the code segment a with the instruction if.03.
After analysis, the code segment B generates a corresponding acceleration logic OB for the analysis unit, configures the instruction if.04 for the logic, and replaces the code segment a with the instruction if.04.
After analysis, the code segment C generates a corresponding acceleration logic OC for the analysis unit, configures the instruction if.05 for the logic, and replaces the code segment a with the instruction if.05.
Therefore, only IF.03-05 acts in the preset instruction, namely, the ASIC chip can schedule corresponding units to calculate according to the preset instruction after receiving the preset instruction. But the dynamic acceleration instruction function at this time is limited.
Assume code segment A adds, subtracts, multiplies, and divides scalar register a0 and stores the result back to a0, while code segment B performs the same add-subtract-multiply-divide operation on scalar register a1 and stores the result back to a1. If only the opcode portion of the preset instructions were used, two preset instructions and two identical pieces of acceleration logic would have to be generated, which is wasteful.
The ASIC chip can generate a feature code for a code segment through code analysis; if the feature codes of two code segments are identical (representing the same algorithm), the segment is reusable. In addition, jump and return instructions can be used to judge whether a code segment is a function, and a function with no additional IO input operations can be judged reusable. An instruction of the corresponding format can then be allocated to the function according to how many input operands it has. If a function is int function(a, b, c), it can be allocated an instruction among FI.31 to FI.16, realizing three data inputs and one data return. With function input and output capability, repeated configuration of code segments is greatly reduced, and algorithms with the same function but different input and output data can be assigned to a single instruction.
Four formats of instructions are preset to accommodate the calculations and functions of different situations.
When the ASIC chip selects acceleratable code segments from the instruction buffer, the code-segment feature-code matching method can also be used, as follows:
during analysis of the code stream, the analyzer extracts the opcode (operation code) part and the rd/rs1/rs2 part as input variables, generates a feature code for the whole code segment through a convolution operation and a compression operation, and, after matching against the feature code library, can judge whether the segment is an acceleratable code segment.
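The patent derives the feature code through convolution and compression; purely as a simplified stand-in, the sketch below hashes the same input fields (opcode and rd/rs1/rs2) so that identical algorithms yield identical codes. The FNV-1a hash and the bit positions (reused from the assumed encoding sketch above) are substitutions, not the patent's method:

#include <stddef.h>
#include <stdint.h>

uint64_t feature_code(const uint32_t *segment, size_t n_insns) {
    uint64_t h = 0xcbf29ce484222325ULL;        /* FNV-1a offset basis */
    for (size_t i = 0; i < n_insns; i++) {
        uint32_t in = segment[i];
        uint32_t opcode = in & 0x7Fu;          /* same fields the analyzer extracts */
        uint32_t rd  = (in >> 7)  & 0x1Fu;
        uint32_t rs1 = (in >> 15) & 0x1Fu;
        uint32_t rs2 = (in >> 20) & 0x1Fu;
        uint32_t fields = opcode | rd << 7 | rs1 << 12 | rs2 << 17;
        for (int b = 0; b < 4; b++) {          /* fold the packed fields into the hash */
            h ^= (fields >> (8 * b)) & 0xFFu;
            h *= 0x100000001b3ULL;             /* FNV-1a prime */
        }
    }
    return h;   /* identical algorithms produce identical feature codes */
}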
In some embodiments, the method further comprises: in response to finding the acceleratable code segment, judging whether the number of algorithm instructions in the acceleratable code segment exceeds a preset number; and in response to the number not exceeding the preset number, determining that the acceleratable code segment does not need to be accelerated.
In some embodiments, the method further comprises: in response to the number exceeding the preset number, generating the dynamic acceleration instruction matched with the acceleratable code segment according to the preset instruction table.
In the above embodiment, if the acceleratable code segment contains only a few algorithm instructions, no acceleration is needed and the segment can be computed directly in the original way; otherwise, acceleration is required, i.e., netlist generation, dynamic acceleration instruction generation, and so on.
In some embodiments, the method further comprises: the preset number is determined based on a space size of the programmable logic array.
In another embodiment, to prevent the ASIC chip from analyzing too slowly, it may be considered to increase the instruction buffer capacity.
In another embodiment, to prevent code jumps from causing conflicts, a trigger event may be set: when the instruction fetcher extracts an optimized code segment, the ASIC chip determines whether the segment has completed code-segment analysis, configuration, and instruction replacement; if configuration has not been completed, instruction replacement is cancelled so that the function of the code segment can still be executed normally by the general-purpose processor.
How the ASIC chip dynamically analyzes code segments, configures the programmable logic array, and finally achieves acceleration according to the function of the code segments is described with the following example:
1) The instruction buffer reads and stores a series of program codes in advance; assume the code segment corresponding to the domain-specific algorithm is as follows:
LOOP:
Load a6 a.addr
Add a9 a6 a7
Store a9 a.addr
Add a.addr, 1
Add a10, 1
BNE 10 a10, LOOP
2) The ASIC chip analyzes the code segment in the instruction buffer, which usually covers the loop body up to the PC jump (because a PC jump would introduce external variables and increase the analysis difficulty). In this example, the code can be understood as uniformly adding the value of the a7 register to the values at addresses addr through addr+10 in the data buffer and writing the results back to addr through addr+10.
3) The ASIC chip generates a netlist according to the analysis results, configures the programmable logic array, and designates one or more of the 32 instructions to realize it. In this case, the ASIC chip configures a 1×10 vector add unit in the programmable logic array and specifies one instruction; the work is done by FI.15.
4) The ASIC chip replaces the code segment to be optimized in the instruction buffer with a dynamic acceleration instruction, e.g. FI.15 a.addr, a10, 10 in this example.
5) When the instruction is extracted by the instruction fetcher, it is sent to the decoder; after decoding completes, the decoder sends the microinstructions to the FIFO memory, and once transmission completes, the microinstructions are executed in the programmable logic array. By this point the programmable logic array has been configured as application-specific acceleration logic according to the netlist file generated by the ASIC chip. In this example, the programmable logic array extracts 10 data items starting from the address a.addr, completes the +a10 computation, and stores the results back at the a.addr position in the data cache.
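In C terms, the transformation in this worked example replaces a ten-iteration scalar loop with a single vector operation. The sketch below shows both sides; modeling the vector unit as a plain function is of course an assumption, since the real unit is logic configured in the FPGA:

#include <stdint.h>

/* Before: the scalar loop from step 1), executed element by element. */
void scalar_loop(int64_t *addr, int64_t increment) {
    for (int a10 = 0; a10 != 10; a10++)   /* BNE 10 a10, LOOP */
        addr[a10] += increment;           /* Load / Add / Store per element */
}

/* After: the 1x10 vector add unit configured in the programmable logic
   array, driven by one dynamic acceleration instruction
   (FI.15 a.addr, a10, 10 in the example). */
void vector_add_unit(int64_t *addr, int64_t operand, int n) {
    for (int i = 0; i < n; i++)           /* conceptually one parallel step */
        addr[i] += operand;
}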
In a second aspect of the embodiments of the present invention, a domain-specific architecture processor is also provided. Fig. 5 is a schematic diagram of an embodiment of a domain-specific architecture processor provided by the present invention. As shown in fig. 5, the domain-specific architecture processor includes:
an instruction buffer 10 configured to store program code;
an ASIC chip 20 configured to find the acceleratable code segment corresponding to the domain-specific algorithm in the program code, generate a corresponding netlist based on the acceleratable code segment, and generate a dynamic acceleration instruction matched with the acceleratable code segment according to a preset instruction table; and
a programmable logic array 30 configured to receive the netlist and generate and configure acceleration logic based on the netlist,
wherein the ASIC chip 20 is further configured to replace the acceleratable code segment in the instruction buffer 10 with the dynamic acceleration instruction in response to the programmable logic array 30 completing configuration, and the programmable logic array 30 is further configured to execute the acceleration logic based on the dynamic acceleration instruction in the instruction buffer 10 to perform the computation of the domain-specific algorithm.
In some embodiments, the processor further comprises:
an instruction fetcher configured to identify and extract the dynamic acceleration instruction in the instruction buffer 10;
a decoder configured to receive the dynamic acceleration instruction output by the instruction fetcher and parse it to obtain microinstructions; and
a FIFO memory configured to receive the microinstructions and send them to the programmable logic array 30;
the programmable logic array 30 is further configured to execute the acceleration logic based on the received microinstructions to perform the computation of the domain-specific algorithm.
In some embodiments, ASIC chip 20 is further configured to: generate a feature code for the acceleratable code segment and judge whether the feature code already exists in a feature code library; and, in response to the feature code already existing in the feature code library, send an indication to the programmable logic array 30. The programmable logic array 30 is further configured to: in response to receiving the indication that the feature code exists in the feature code library, confirm whether corresponding acceleration logic is retained in the programmable logic array 30; in response to the corresponding acceleration logic being retained, be directly configured using the corresponding acceleration logic; and, in response to the corresponding acceleration logic not being retained, send an indication to ASIC chip 20.
In some embodiments, ASIC chip 20 is further configured to: in response to receiving an indication that no corresponding acceleration logic is retained in the programmable logic array 30 or that the feature code does not exist in the feature code library, generate a corresponding netlist based on the acceleratable code segment and generate a dynamic acceleration instruction matched with the acceleratable code segment according to a preset instruction table.
In some embodiments, programmable logic array 30 is further configured to: configure a counter for the acceleration logic, the counter decrementing its count value over time; in response to the acceleration logic being used an Nth time, increase the current count value of the counter by a weight value corresponding to the acceleration logic, wherein N is an integer greater than 1; and in response to the count value of the counter decrementing to zero, place the acceleration logic in a deletable state.
In some embodiments, programmable logic array 30 is further configured to: in response to the count value of the counter not decrementing to zero, it is determined that acceleration logic is currently resident in the programmable logic array 30.
In some embodiments, programmable logic array 30 is further configured to: in response to insufficient remaining space in programmable logic array 30, delete the acceleration logic that entered the deletable state earliest.
In some embodiments, ASIC chip 20 is further configured to: and in response to the feature codes existing in the feature code library, directly increasing the life values of the feature codes in the feature code library by preset values.
In some embodiments, programmable logic array 30 is further configured to: configure a start value for the counter of the acceleration logic based on the life value of the feature code, the magnitude of the start value being consistent with the magnitude of the life value.
In some embodiments, ASIC chip 20 is further configured to: and storing the feature codes into the feature code library in response to the feature codes not existing in the feature code library, and assigning the life values of the feature codes as initial values.
In some embodiments, ASIC chip 20 is further configured to: the netlist and the dynamic acceleration instructions are deposited into a cache queue of the ASIC chip 20.
In some embodiments, programmable logic array 30 is further configured to: judge whether the remaining space of programmable logic array 30 reaches a preset threshold; and in response to the remaining space reaching the preset threshold, obtain the netlist from the cache queue.
In some embodiments, ASIC chip 20 is further configured to: select an instruction format matched with the acceleratable code segment from the preset instruction table, and generate the dynamic acceleration instruction based on the instruction format.
In some embodiments, ASIC chip 20 is further configured to: in response to finding the acceleratable code segment, judge whether the number of algorithm instructions in the acceleratable code segment exceeds a preset number; and in response to the number not exceeding the preset number, determine that the acceleratable code segment does not need to be accelerated.
In some embodiments, ASIC chip 20 is further configured to: in response to the number exceeding the preset number, generate the dynamic acceleration instruction matched with the acceleratable code segment according to the preset instruction table.
In some embodiments, ASIC chip 20 is further configured to: the preset number is determined based on the space size of the programmable logic array 30.
During execution of program code, the domain-specific architecture processor of the embodiment of the invention drives the programmable-logic-array accelerator through dynamic acceleration instructions to complete the acceleration task of acceleratable code segments, and can adapt to the accelerated-computation needs of different domain-specific algorithms; the programmable logic array can change its acceleration logic in real time according to the acceleratable code segments, so that acceleration of various algorithm codes is accomplished inside the processor without manual programming and without a dedicated acceleration chip or hardware, greatly improving the efficiency of accelerated computation for domain-specific algorithms.
In a third aspect of the embodiment of the present invention, there is further provided a computer readable storage medium, and fig. 6 is a schematic diagram of the computer readable storage medium for implementing the method for accelerating computing of the domain-specific architecture processor according to the embodiment of the present invention. As shown in fig. 6, the computer-readable storage medium 3 stores computer program instructions 31. The computer program instructions 31 when executed by a processor implement the method of the above-described embodiments.
It should be appreciated that all of the embodiments, features and advantages set forth above with respect to a domain-specific architecture processor according to the present invention equally apply, without conflict, to the accelerated computing method and storage medium of a domain-specific architecture processor according to the present invention.
In a fourth aspect of the embodiment of the present invention, there is also provided a computer device, including a memory 402 and a processor 401 as shown in fig. 7, where the memory 402 stores a computer program, and the computer program is executed by the processor 401 to implement the method of any one of the embodiments above.
Referring to fig. 7, a schematic diagram of the hardware structure of an embodiment of a computer device for executing an acceleration computing method of a domain-specific architecture processor according to the present invention is shown. Taking the computer device shown in fig. 7 as an example, the computer device includes a processor 401 and a memory 402, and may further include an input device 403 and an output device 404. The processor 401, the memory 402, the input device 403, and the output device 404 may be connected by a bus or in other ways; connection by a bus is taken as the example in fig. 7. The input device 403 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the domain-specific architecture processor. The output device 404 may include a display device such as a display screen.
The memory 402 is used as a non-volatile computer readable storage medium for storing non-volatile software programs, non-volatile computer executable programs, and modules, such as program instructions/modules corresponding to the accelerated computing method of a domain-specific architecture processor in an embodiment of the present application. Memory 402 may include a storage program area that may store an operating system, at least one application program required for functionality, and a storage data area; the storage data area may store data created by the use of an accelerated computing method of a domain-specific architecture processor, and the like. In addition, memory 402 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, memory 402 may optionally include memory located remotely from processor 401, which may be connected to the local module via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The processor 401 executes various functional applications of the server and data processing, i.e., an accelerated computing method for implementing the domain-specific architecture processor of the above-described method embodiment, by running non-volatile software programs, instructions and modules stored in the memory 402.
Finally, it should be noted that the computer-readable storage media (e.g., memory) herein can be either volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as software or hardware depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The foregoing is an exemplary embodiment of the present disclosure, but it should be noted that various changes and modifications could be made herein without departing from the scope of the disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements of the disclosed embodiments may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.
It should be understood that, as used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that "and/or" as used herein includes any and all possible combinations of one or more of the associated listed items. The serial numbers of the foregoing embodiments of the present invention are for description only and do not represent the merits of the embodiments.
Those of ordinary skill in the art will appreciate that: the above discussion of any embodiment is merely exemplary and is not intended to imply that the scope of the disclosure of embodiments of the invention, including the claims, is limited to such examples; combinations of features of the above embodiments or in different embodiments are also possible within the idea of an embodiment of the invention, and many other variations of the different aspects of the embodiments of the invention as described above exist, which are not provided in detail for the sake of brevity. Therefore, any omission, modification, equivalent replacement, improvement, etc. of the embodiments should be included in the protection scope of the embodiments of the present invention.

Claims (21)

1. An acceleration computing method of a domain-specific architecture processor is characterized by comprising the following steps:
analyzing the program code to find an acceleratable code segment corresponding to a domain-specific algorithm, generating a corresponding netlist based on the acceleratable code segment, sending the netlist to a programmable logic array, and generating a dynamic acceleration instruction matched with the acceleratable code segment according to a preset instruction table;
generating acceleration logic based on the received netlist through the programmable logic array, and performing acceleration logic configuration;
in response to the programmable logic array completing configuration, replacing the acceleratable code segment in the program code with the dynamic acceleration instruction;
executing, by the programmable logic array, the acceleration logic based on the dynamic acceleration instructions in the program code to perform computation of the domain-specific algorithm.
2. The method as recited in claim 1, further comprising:
and identifying and extracting the dynamic acceleration instruction from the program code, analyzing the extracted dynamic acceleration instruction to obtain a micro instruction, and sending the micro instruction to the programmable logic array.
3. The method of claim 2, wherein executing, by the programmable logic array, the acceleration logic based on the dynamic acceleration instructions in the program code comprises:
Executing, by the programmable logic array, the acceleration logic based on the received microinstructions.
4. The method as recited in claim 1, further comprising:
generating a feature code for the acceleratable code segment, and judging whether the feature code exists in a feature code library;
transmitting an indication to the programmable logic array in response to the feature code already existing in a feature code library;
responsive to the programmable logic array receiving an indication that the feature code exists in the feature code library, determining whether corresponding acceleration logic remains in the programmable logic array;
and in response to the corresponding acceleration logic being reserved, configuring the programmable logic array directly by using the corresponding acceleration logic.
5. The method as recited in claim 4, further comprising:
and generating a corresponding netlist based on the acceleratable code segment and generating a dynamic acceleration instruction matched with the acceleratable code segment according to a preset instruction table, in response to no corresponding acceleration logic being retained or the feature code not existing in the feature code library.
6. The method as recited in claim 1, further comprising:
Configuring a counter for said acceleration logic, said counter decrementing a count value over time;
in response to said acceleration logic being used an Nth time, increasing the current count value of said counter by a weight value corresponding to said acceleration logic, wherein N is an integer greater than 1;
in response to the count value of the counter decrementing to zero, the acceleration logic is placed in a deletable state.
7. The method as recited in claim 6, further comprising:
in response to the count value of the counter not decrementing to zero, it is determined that the acceleration logic is currently resident in the programmable logic array.
8. The method as recited in claim 6, further comprising:
in response to insufficient remaining space of the programmable logic array, deleting the acceleration logic that entered the deletable state earliest.
9. The method as recited in claim 4, further comprising:
and in response to the feature code existing in the feature code library, directly increasing the life value of the feature code in the feature code library by a preset value.
10. The method as recited in claim 9, further comprising:
and configuring a start value for the counter of the acceleration logic based on the life value of the feature code, wherein the magnitude of the start value is consistent with the magnitude of the life value.
11. The method as recited in claim 9, further comprising:
in response to the feature code not existing in the feature code library, storing the feature code into the feature code library and assigning the life value of the feature code an initial value.
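Claims 9 through 11 layer a longer-lived "life value" over the per-logic counter. A minimal sketch, with the initial value, increment, and tier mapping all chosen arbitrarily for illustration:

```python
life_values = {}                        # feature code -> life value (claims 9-11)
INITIAL_LIFE, LIFE_INCREMENT = 10, 10   # illustrative assumptions

def touch(code):
    if code in life_values:
        life_values[code] += LIFE_INCREMENT   # claim 9: bump an existing entry
    else:
        life_values[code] = INITIAL_LIFE      # claim 11: store with an initial value

def counter_start(code):
    # Claim 10: the counter's starting-value tier tracks the life-value tier.
    life = life_values.get(code, INITIAL_LIFE)
    return 10 if life < 50 else 20 if life < 100 else 40
```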
12. The method as recited in claim 1, further comprising:
storing the netlist and the dynamic acceleration instruction into a cache queue.
13. The method as recited in claim 12, further comprising:
determining whether the remaining space of the programmable logic array reaches a preset threshold;
in response to the remaining space reaching the preset threshold, acquiring the netlist from the cache queue and transmitting the netlist to the programmable logic array.
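Claims 12 and 13 buffer pending work until the array has room. The sketch below models the cache queue with a deque; the free-space threshold and its unit are assumptions.

```python
from collections import deque

pending = deque()          # cache queue of (netlist, dynamic instruction) pairs
SPACE_THRESHOLD = 1000     # assumed free-resource threshold for the array

def enqueue(netlist, dyn_instr):
    pending.append((netlist, dyn_instr))     # claim 12: park both in the queue

def drain(free_space, send_to_pla):
    # Claim 13: forward netlists only once remaining space reaches the threshold.
    while pending and free_space() >= SPACE_THRESHOLD:
        netlist, _ = pending.popleft()
        send_to_pla(netlist)

enqueue("netlist(dot_product_loop)", "ACC.DOT")
drain(lambda: 4096, lambda n: print("to PLA:", n))
```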
14. The method of claim 1, wherein generating dynamic acceleration instructions matching the acceleration code segments according to a preset instruction table comprises:
selecting, from the preset instruction table, an instruction format matching the acceleratable code segment, and generating the dynamic acceleration instruction based on the instruction format.
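Claim 14 selects one of several predefined instruction formats and fills it in. A toy sketch; the table contents, segment kinds, and field names are invented for illustration.

```python
# Hypothetical preset instruction table keyed by segment kind (claim 14).
INSTRUCTION_TABLE = {
    "vector": "ACC.V {dst}, {src0}, {src1}",
    "matrix": "ACC.M {dst}, {src0}, {src1}",
    "scalar": "ACC.S {dst}, {src0}",
}

def dynamic_instruction(segment_kind, dst, src0, src1=""):
    fmt = INSTRUCTION_TABLE[segment_kind]      # pick the matching format
    return fmt.format(dst=dst, src0=src0, src1=src1).rstrip(", ")

print(dynamic_instruction("vector", "r1", "r2", "r3"))   # -> ACC.V r1, r2, r3
```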
15. The method as recited in claim 1, further comprising:
in response to finding the acceleratable code segment, determining whether the number of algorithm instructions in the acceleratable code segment exceeds a preset number;
in response to the preset number not being exceeded, determining that the acceleratable code segment does not need to be accelerated.
16. The method as recited in claim 15, further comprising:
in response to the preset number being exceeded, generating the dynamic acceleration instruction matching the acceleratable code segment according to the preset instruction table.
17. The method as recited in claim 15, further comprising:
determining the preset number based on the space size of the programmable logic array.
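Claims 15 through 17 gate acceleration on segment size: segments with too few instructions are not worth configuring. The sketch derives the preset number from the array's capacity using an assumed cells-per-instruction ratio.

```python
def preset_number(pla_cells, cells_per_instruction=100):
    # Claim 17: the threshold follows the array's space size (ratio assumed).
    return pla_cells // cells_per_instruction

def should_accelerate(segment_instructions, pla_cells=2000):
    # Claims 15-16: accelerate only when the instruction count exceeds the threshold.
    return len(segment_instructions) > preset_number(pla_cells)

print(should_accelerate(["mul"] * 5))    # False: too small to repay configuration
print(should_accelerate(["mul"] * 50))   # True: exceeds the preset number
```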
18. A domain-specific architecture processor, comprising:
an instruction buffer configured to store program code;
an ASIC chip configured to search the program code for an acceleratable code segment corresponding to a domain-specific algorithm, generate a corresponding netlist based on the acceleratable code segment, and generate a dynamic acceleration instruction matching the acceleratable code segment according to a preset instruction table; and
a programmable logic array configured to receive the netlist and generate and configure acceleration logic based on the netlist,
wherein the ASIC chip is further configured to replace the acceleratable code segment in the instruction buffer with the dynamic acceleration instruction in response to the programmable logic array completing the configuration, and the programmable logic array is further configured to execute the acceleration logic based on the dynamic acceleration instruction in the instruction buffer to perform computation of the domain-specific algorithm.
19. The processor of claim 18, further comprising:
an instruction fetcher configured to identify the dynamic acceleration instruction in the instruction buffer and extract the dynamic acceleration instruction;
a decoder configured to receive the dynamic acceleration instruction output by the instruction fetcher and decode it into microinstructions; and
a FIFO memory configured to receive the microinstructions and send the microinstructions to the programmable logic array,
wherein the programmable logic array is further configured to execute the acceleration logic based on the received microinstructions to perform computation of the domain-specific algorithm.
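Claim 19's fetcher-decoder-FIFO chain is, in dataflow terms, a three-stage pipeline in front of the array. A behavioral sketch with invented mnemonics:

```python
from queue import Queue

def fetch(instruction_buffer, prefix="ACC."):
    # Instruction fetcher: identify and extract dynamic acceleration instructions.
    return [i for i in instruction_buffer if i.startswith(prefix)]

def decode(dyn_instr, width=2):
    # Decoder: expand one dynamic instruction into microinstructions.
    return [f"{dyn_instr}.u{k}" for k in range(width)]

fifo: Queue = Queue()
for instr in fetch(["load", "ACC.DOT", "store"]):
    for mop in decode(instr):
        fifo.put(mop)                       # FIFO memory buffers the microinstructions

while not fifo.empty():
    print("PLA executes", fifo.get())       # the array consumes from the FIFO
```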
20. A computer-readable storage medium storing computer program instructions which, when executed by a processor, implement the method of any one of claims 1-17.
21. A computer device comprising a memory and a processor, wherein the memory stores a computer program which, when executed by the processor, performs the method of any one of claims 1-17.
CN202310815903.4A 2023-07-05 2023-07-05 Domain-specific architecture processor and acceleration computing method, medium and device thereof Active CN116541075B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310815903.4A CN116541075B (en) 2023-07-05 2023-07-05 Domain-specific architecture processor and acceleration computing method, medium and device thereof

Publications (2)

Publication Number Publication Date
CN116541075A (en) 2023-08-04
CN116541075B (en) 2023-09-01

Family

ID=87451019

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310815903.4A Active CN116541075B (en) 2023-07-05 2023-07-05 Domain-specific architecture processor and acceleration computing method, medium and device thereof

Country Status (1)

Country Link
CN (1) CN116541075B (en)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7278122B2 (en) * 2004-06-24 2007-10-02 Ftl Systems, Inc. Hardware/software design tool and language specification mechanism enabling efficient technology retargeting and optimization

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN2582056Y (en) * 2002-10-15 2003-10-22 无锡市富华科技有限责任公司 Programmable CPU module with contactless card R/W function
CN109932953A (en) * 2017-12-19 2019-06-25 陈新 Intelligent supercomputer programmable controller
CN110795754A (en) * 2019-11-12 2020-02-14 中核控制系统工程有限公司 Information security maintenance method based on FPGA
CN112734011A (en) * 2021-01-04 2021-04-30 北京大学 Deep neural network accelerator collaborative design method based on incremental synthesis

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on a co-design method for embedded hard-core processors and FPGA on the Cyclone V FPSoCs platform; Li Baoping (李宝平); Shuma Shijie (数码世界), Issue 09; full text *

Similar Documents

Publication Publication Date Title
US9383999B2 (en) Conditional compare instruction
JP2005531848A (en) Reconfigurable streaming vector processor
US20030023830A1 (en) Method and system for encoding instructions for a VLIW that reduces instruction memory requirements
CN108121688B (en) Calculation method and related product
WO2017116926A1 (en) Loop code processor optimizations
CN111158756B (en) Method and apparatus for processing information
US6934938B2 (en) Method of programming linear graphs for streaming vector computation
JP2020027616A (en) Command execution method and device
CN111124495B (en) Data processing method, decoding circuit and processor
CN111694643B (en) Task scheduling execution system and method for graph neural network application
US10990073B2 (en) Program editing device, program editing method, and computer readable medium
CN116541075B (en) Domain-specific architecture processor and acceleration computing method, medium and device thereof
KR20130045276A (en) System and method to evaluate a data value as an instruction
US10740099B2 (en) Instruction to perform a logical operation on conditions and to quantize the boolean result of that operation
US7647368B2 (en) Data processing apparatus and method for performing data processing operations on floating point data elements
CN113296788B (en) Instruction scheduling method, device, equipment and storage medium
CN114237878A (en) Instruction control method, circuit, device and related equipment
US9934035B2 (en) Device and method for tracing updated predicate values
CN113779311A (en) Data processing method, device and storage medium
CN114840256A (en) Program data level parallel analysis method and device and related equipment
US11620132B2 (en) Reusing an operand received from a first-in-first-out (FIFO) buffer according to an operand specifier value specified in a predefined field of an instruction
CN115951936B (en) Chip adaptation method, device, equipment and medium of vectorization compiler
CN116610362B (en) Method, system, equipment and storage medium for decoding instruction set of processor
US9582619B1 (en) Simulation of a circuit design block using pattern matching
CN117591184A (en) RISC-V vector compression out-of-order execution realization method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant