CN115617740B - Processor architecture for single-shot multithreading dynamic loop parallel technology implementation

Info

Publication number
CN115617740B
CN115617740B (application CN202211288569.3A)
Authority
CN
China
Prior art keywords: stack, functional, thread, pipeline, instruction
Prior art date: 2022-10-20
Legal status: Active
Application number
CN202211288569.3A
Other languages: Chinese (zh)
Other versions: CN115617740A (en)
Inventor
王杜 (Wang Du)
Current Assignee
Changsha Fangwei Technology Co., Ltd.
Original Assignee
Changsha Fangwei Technology Co., Ltd.
Priority date: 2022-10-20
Filing date: 2022-10-20
Publication date: 2023-10-27
Application filed by Changsha Fangwei Technology Co., Ltd.
Priority to CN202211288569.3A
Publication of CN115617740A
Application granted
Publication of CN115617740B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00: Digital computers in general; Data processing equipment in general
    • G06F 15/76: Architectures of general purpose stored program computers
    • G06F 15/78: Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F 15/7807: System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30: Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/38: Concurrent instruction execution, e.g. pipeline, look ahead

Abstract

The invention relates to the field of integrated circuits, and in particular to the design and implementation of a processor-core application architecture in the field of computer architecture. The overall technical system comprises: a differential-pipeline normalized-arbitration rotation-blocking stack-transfer mechanism, a system-clock-level time-division multiplexing (SCTDM) technique, a multithread start and execution mechanism, a simplified feasibility design, and an extensibility and customization design. The differential-pipeline mechanism divides the pipeline into functional stacks (instruction fetch, decode, and so on) according to the principle of mutually exclusive functions and normalizes the stack-transfer information, finally forming a closed-loop differential pipeline with an arbitration and rotation-blocking function. The SCTDM technique, together with the multithread start and execution mechanism, schedules the functional stacks so that at any moment each functional stack serves a different thread (program); this reduces inter-stack execution dependence, improves parallelism, and maximizes resource utilization.

Description

Processor architecture for single-shot multithreading dynamic loop parallel technology implementation
Technical Field
The invention relates to the field of integrated circuits, in particular to a processor core application architecture implementation method in the field of computer architecture.
Background
The integrated circuit industry receives strong support. A processor's architecture is the foundation on which the processor is built, and the most common architectures are foreign ones, such as the x86 and ARM architectures. Applying these architectures involves many restrictions, so new architectures are needed. Novel architecture technology is the core of autonomous, controllable, safe, and reliable domestic processors, and it is also a difficult point: at present, innovative results in architecture technology are few, while the demand is urgent, the strategic importance is great, and the market space is broad.
Existing processors use Hyper-Threading technology, originally developed by Intel and released in 2002. Hyper-Threading was first applied only to Xeon processors, where it was called "Super-Threading". With this technique, Intel provides two logical threads within one physical CPU. Although hyper-threading lets two threads execute simultaneously, when both threads need the same resource at the same time, one of them must yield the resource and be temporarily suspended until the resource is free. The performance of a hyper-threaded core is therefore not equal to that of two CPUs. In other words, existing hyper-threading merely simulates two processors: it can run multithreaded applications, but its performance does not scale accordingly.
Disclosure of Invention
The purpose of the invention is to provide a new architecture technology, creating a method by which a processor runs multiple threads (programs) in parallel on a single core and improving multithreaded processing performance.
The technical scheme of the invention is as follows. A processor architecture implemented with the single-issue multithreading dynamic loop parallel technology comprises one or more cores and a pipeline corresponding to each core, where a core uses only one instruction emitter and the pipeline comprises one sub-pipeline or several parallel sub-pipelines. Each sub-pipeline comprises several functional stacks; the functional stacks within a sub-pipeline implement different functions and execute in sequence to complete one instruction; the pipeline includes the functional stacks that form the instruction emitter, and parallel sub-pipelines share the instruction emitter. The pipeline executes threads, and its functional stacks carry out the threads' work; at any moment, different functional stacks in the same pipeline serve different threads. The instruction emitter emits thread instructions to the next functional stack. After a functional stack completes its specific stack function within the sub-pipeline, operation moves on to the next functional stack, and the stack then performs the same stack function for the next thread. A functional stack interacts with the next functional stack through a stack-transfer interface: once its own function is complete, it passes the stack-transfer information to the next functional stack through that interface.
Further, the content of the stack-transfer interface comprises a thread number and arbitration information, the arbitration information comprising busy information, state information, and priority information. When the busy information of the stack-transfer interface indicates busy, the functional stack is blocked; when it indicates not busy but the state information does not equal the thread number, the functional stack is also blocked. When the state information equals the thread number, the functional stack is blocked if its priority is low and completes the stack function if its priority is high.
Further, the arbitration information specifically includes:
the state information changes during blocking; the blocking is of fixed length, the blocking time being an integer multiple of the machine cycle; execution is in order, and the instruction execution period is an integer multiple of the machine cycle, the machine cycle being the period in which all the functional stacks complete once;
the state information remains unchanged during blocking; the blocking is of variable length, the blocking time being an integer multiple of the system clock; execution is out of order, and the instruction execution period is not an integer multiple of the machine cycle.
Further, the content of the stack-transfer interface comprises an instruction cache address and instruction code, an interrupt flag, a specific-register program state, or a special-function debug interrupt. The instruction cache address and instruction code are those of the thread instruction executed by the functional stack; the interrupt flag indicates whether to interrupt; the specific-register program state indicates the state of a specific register; and the special-function debug interrupt indicates whether to interrupt for special-function debugging.
Further, the number of the functional stacks is the same as the maximum number of parallel threads.
Further, the processor is a single-core, homogeneous multi-core, or heterogeneous multi-core MCU, DSP, CPU, DPU, GPU, or NPU.
Further, the main program thread runs in parallel with auxiliary threads, the auxiliary threads comprising: a simulated hardware communication protocol thread and an online debugging thread, where the simulated hardware communication protocols include the IIC, SPI, and UART protocols.
Further, the stack execution time of the functional stack is at least one system clock period, and the single-core multithreading parallel granularity is at a system clock level.
Further, the processor is a RISC instruction set processor, and the functional stack includes a fetch functional stack, a decode functional stack, an execute functional stack, a memory access functional stack, and a write-back functional stack.
Further, the processor is a CISC instruction set processor, and the functional stacks include an instruction-fetch function stack, a decode/read-memory function stack, and a write-memory/execute/write-back function stack.
Further, threads of the same operation share a pipeline.
Further, the starting time interval of the pipeline executing different threads is smaller than one instruction completion period.
In summary, the beneficial effects of the implementation of the invention are as follows:
1. a method for a processor to run multiple threads (programs) in parallel on a single core is created; by dividing one core into different functional stacks, programs of different threads run simultaneously across the functional stacks, so the threads truly run at the same time and multithreaded processing performance improves;
2. the invention provides a brand-new technology in the hyper-threading category, raising the number of parallel threads (programs) per core from 1-2 (traditional hyper-threading provides 2 hardware threads per core) to many, under limited hardware resources;
3. the single-core multithread (program) parallel granularity of a common processor is raised from the instruction level (or even function-level concurrency) to the system clock level. The technique can be widely used in single-core, homogeneous multi-core, and heterogeneous multi-core MCUs, DSPs, CPUs, DPUs, GPUs, NPUs, and the like.
Drawings
FIG. 1 is a block diagram of a single-shot multithreading (multiprogram) dynamic circular parallelism architecture.
FIG. 2 is a timing analysis diagram of a single pipeline parallel five threads (five programs) according to the present invention.
FIG. 3 is a diagram illustrating a three-thread (three-program) timing analysis for superscalar parallel, in accordance with the present technique.
FIG. 4 is a diagram illustrating a general layout of a differential pipeline for RISC and CISC instruction sets according to the present technique.
Fig. 5 is a diagram showing normalized stack information content in the present technology.
FIG. 6 is a schematic diagram of the rotation blocking mechanism with busy + state + priority arbitration in accordance with the present invention.
FIG. 7 is a timing analysis diagram of sequential blocking and out-of-order blocking in accordance with the present technique.
FIG. 8 is a diagram illustrating a detailed analysis of the timing of a multi-threaded (multi-program) boot process in accordance with the present invention.
FIG. 9 is an exemplary analysis of extensibility and customization design in accordance with the teachings of the present invention.
FIG. 10 is a diagram illustrating the multithread (multiprogram) start and execution mechanism in accordance with the present invention.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings.
The invention relates to the field of integrated circuits, and in particular to a processor-core application architecture implementation method in the field of computer architecture. It creates a method for a processor to run multiple threads (programs) in parallel on a single core and provides a brand-new technology in the hyper-threading category. Under limited hardware resources, the number of parallel threads (programs) per core is raised from 1-2 (traditional hyper-threading provides 2 hardware threads per core) to many: compared with traditional hyper-threading, the number of parallel threads grows by more than 300 percent, the hardware-resource increase is less than 10 percent, and the performance gain exceeds 40 percent. The single-core multithread (program) parallel granularity of a common processor is raised from the instruction level (or even function-level concurrency) to the system clock level. The technology is low-cost, high-performance, safe, and reliable, and can be popularized continuously and widely.
As shown in fig. 1, the block diagram of the single-issue multithreading (multiprogram) dynamic loop parallel architecture. The method comprises five technical points: the differential-pipeline normalized-arbitration rotation-blocking stack-transfer mechanism, the system-clock-level time-division multiplexing technique (SCTDM), the multithread (multiprogram) start and execution mechanism, the simplified feasibility design, and the extensibility and customization design.
As shown in fig. 2, the timing analysis of a single pipeline (i.e., a pipeline containing only one sub-pipeline) running five threads (five programs) in parallel, the sub-pipeline containing five functional stacks. The operation of a single five-stack differential pipeline is depicted and its timing relationships are listed. A pipeline may run one thread or multiple threads of the same operation. The pipeline is divided into five functional stacks: fetch, decode, execute, memory access, and write-back, corresponding to five working states S0, S1, S2, S3, and S4; the stack execution time is one system clock cycle (clk). Other divisions are possible in other embodiments, such as combining decode and execute into one functional stack. White stacks represent the active state and gray stacks the idle state. Five threads (five programs) run in parallel; a relatively ideal start state is analyzed in which no blocking occurs, and the threads start one system clock (clk) apart. After receiving the thread-1 instruction from the instruction emitter, the first functional stack of the pipeline (the fetch stack) completes the fetch for thread 1 at time t0. A functional stack moves on to the next functional stack of the sub-pipeline once its specific stack function is complete, so at t0+clk the fetch stack passes the completed fetch information of the current thread to the next functional stack (the decode stack) for processing. At that moment the fetch stack can already complete the fetch work of the next thread, thread 2. This proceeds in turn until, at t0+4clk, the fetch stack fetches for thread 5, the decode stack decodes thread 4, the execute stack executes thread 3, the memory-access stack accesses memory for thread 2, and the write-back stack writes back thread 1. That is, one pipeline handles 5 threads simultaneously. The functional core is this: there is only one set of hardware resources (functional-stack resources, i.e., the functional stacks), that is, one hardware core, yet at any moment each functional stack serves a different thread (program), which reduces inter-stack execution dependence; staggering the threads' execution by one system clock (clk) raises the single-core multithread (program) parallel granularity from the instruction level (or even function-level concurrency) to the system clock level (clk). Parallelism is thereby improved, multithreaded parallelism is supported, and multithreaded performance rises.
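Under the ideal no-blocking start described above, the schedule can be made concrete with a minimal Python sketch; the stack names, the helper function, and the closed-loop wrap-around are illustrative assumptions, not taken from the patent.

```python
# Minimal sketch of the Fig. 2 schedule: one five-stack differential
# pipeline, five threads started one system clock (clk) apart.
STACKS = ["fetch", "decode", "execute", "mem_access", "write_back"]
THREADS = 5  # matches the number of functional stacks

def occupancy(clk: int) -> dict:
    """Which thread each functional stack serves at tick t0 + clk."""
    row = {}
    for s, stack in enumerate(STACKS):
        if clk < s:
            row[stack] = "idle"  # gray stack: no thread has arrived yet
        else:
            # closed loop: after write-back a thread loops back to fetch,
            # so the stacks rotate through the five threads continuously
            row[stack] = f"thread{(clk - s) % THREADS + 1}"
    return row

for clk in range(8):
    print(f"t0+{clk}clk:", occupancy(clk))
```

At t0+4clk the printout shows every stack busy with a different thread (fetch: thread 5, decode: thread 4, execute: thread 3, memory access: thread 2, write-back: thread 1), which is exactly the steady state the figure illustrates.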
As shown in fig. 3, the timing analysis of superscalar execution (i.e., a pipeline comprising multiple sub-pipelines) running three threads (three programs) in parallel. Four different sub-pipelines (superscalar) are described in the figure and their timing relations are listed: the pipeline comprises four sub-pipelines, where the first contains an integer-calculation function stack, a memory-access function stack, and a write-back function stack; the second and third contain a floating-point exponent-alignment function stack, a mantissa-summation function stack, a normalization function stack, a memory-access function stack, and a write-back function stack; and the fourth contains a condition-judgment function stack and a branch-jump function stack. The sub-pipelines share one fetch function stack and one decode function stack, i.e., single issue. The longest sub-pipeline has seven stacks, corresponding to seven working states: S0, S1, S2, S3, S4, S5, S6; the stack execution time is one system clock cycle (clk). The three-thread (three-program) parallel case is illustrated with a relatively ideal start state: no blocking occurs, the threads start in sequence, and their start times differ by one system clock (clk). The functional core is the same: only one set of hardware resources (functional-stack resources), with each functional stack serving a different thread (program) at any moment, reduces inter-stack execution dependence; staggering the threads by one system clock (clk) raises the single-core multithread (program) parallel granularity from the instruction level (or even function-level concurrency) to the system clock level (clk), improving parallelism and supporting multithreaded parallelism. FIG. 3 illustrates superscalar out-of-order execution; the invention can also support superscalar in-order execution.
As shown in fig. 4, the common differential-pipeline layout of the present technique for RISC- and CISC-family instruction sets. Common three-stack and five-stack differential pipelines are illustrated. A CISC instruction set such as CISC-51 is commonly used in 8-bit processors; a RISC instruction set such as RISC-V is commonly used in 32/64-bit processors. Both instruction sets are open, and can therefore also be used to develop domestic processors. The three-stack differential pipeline is mainly suitable for 8-bit MCUs, while the five-stack differential pipeline is suitable for high-end processors. For a RISC instruction set processor, the functional stacks include an instruction-fetch function stack, a decode function stack, an execute function stack, a memory-access function stack, and a write-back function stack; for a CISC instruction set processor, the functional stacks include an instruction-fetch function stack, a decode/read-memory function stack, and a write-memory/execute/write-back function stack.
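As a compact summary of this common planning, the following illustrative table in Python mirrors the two layouts; the stack labels are assumptions chosen to match the lists above, not terminology fixed by the patent.

```python
# Illustrative summary of the Fig. 4 layouts: a five-stack differential
# pipeline for RISC-family and a three-stack one for CISC-family ISAs.
PIPELINE_PLANS = {
    "RISC (e.g. RISC-V)": [
        "fetch", "decode", "execute", "mem_access", "write_back",
    ],
    "CISC (e.g. CISC-51)": [
        "fetch", "decode+read_mem", "write_mem+execute+write_back",
    ],
}

for isa, stacks in PIPELINE_PLANS.items():
    print(f"{isa}: {len(stacks)}-stack differential pipeline: {stacks}")
```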
In addition, the figure also shows the interaction of the ALU (arithmetic logic unit) and the memory (program + data) with the pipeline.
All functional stacks share unified input and output interfaces and have a built-in rotation blocking function.
As shown in fig. 5, the normalized stack-transfer information content of the invention.
Once a thread (program) starts, the information transferred between the upstream stack (upper stage) and the downstream stack (lower stage) remains consistent until the thread ends, forming a closed-loop normalized pipeline.
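A minimal sketch of such a normalized stack-transfer record follows, with field names assumed from the contents listed earlier (thread number, busy/state/priority arbitration information, instruction cache address, instruction code, interrupt flag, specific-register program state, and special-function debug interrupt); none of these identifiers come from the patent itself.

```python
# Illustrative record (field names assumed) of the normalized information
# passed between an upstream and a downstream functional stack.
from dataclasses import dataclass

@dataclass
class StackTransferInfo:
    thread_number: int            # which thread (program) this transfer belongs to
    # arbitration information
    busy: bool                    # is the downstream stack occupied?
    state: int                    # thread number the downstream stack expects next
    priority: int                 # settles ties between threads with the same number
    # payload
    icache_address: int           # instruction cache address of the thread's instruction
    instruction_code: int         # the instruction code itself
    interrupt_flag: bool          # whether an interrupt is pending
    register_program_state: int   # specific-register program state
    debug_interrupt: bool         # special-function debug interrupt request
```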
As shown in fig. 6, the rotation blocking mechanism of the invention with the busy + state + priority arbitration function.
Three logical relationships of arbitration decisions are illustrated:
(1) if busy = 1, rotate and block; otherwise transfer;
(2) if state = thread number, transfer; otherwise rotate and block;
(3) if priority is high, transfer; otherwise rotate and block.
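The three rules can be read as a single decision function. The sketch below is an assumed rendering of Fig. 6, with the priority comparison reduced to a boolean for brevity.

```python
# Assumed rendering of the Fig. 6 arbitration: 'transfer' lets the thread
# complete the stack function and move on; 'rotate' sends it back around
# the closed loop (rotation blocking).
def arbitrate(tn: int, priority_is_highest: bool, busy: bool, state: int) -> str:
    if busy:                      # (1) downstream stack is occupied
        return "rotate"
    if state != tn:               # (2) not this thread's turn in the state order
        return "rotate"
    if not priority_is_highest:   # (3) a competing thread with the same TN wins
        return "rotate"
    return "transfer"

# examples: busy blocks; a wrong turn blocks; a tie on thread number is
# settled by priority
print(arbitrate(tn=3, priority_is_highest=True,  busy=True,  state=3))   # rotate
print(arbitrate(tn=3, priority_is_highest=True,  busy=False, state=2))   # rotate
print(arbitrate(tn=3, priority_is_highest=False, busy=False, state=3))   # rotate
print(arbitrate(tn=3, priority_is_highest=True,  busy=False, state=3))   # transfer
```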
As shown in fig. 7, the sequential-blocking and out-of-order-blocking timing analysis of the present technique. As the figure shows, there are two modes when rotation blocking occurs:
(1) the state changes: the blocking is of fixed length, an integer multiple of the machine cycle, and execution follows the blocking order. The state is recorded when blocking occurs; the thread then enters the blocked region (shown in black) until the rotation returns to the recorded position, at which point the arbitration result is updated, and if the block can be released, normal operation resumes. The machine cycle (S0-S4) shown spans 5 system clock cycles (clk); when blocked, the rotation blocking time is 1 machine cycle (S0-S4). The instruction execution period is always an integer multiple of the machine cycle (S0-S4).
(2) the state is unchanged: the blocking is of variable length, an integer multiple of the system clock period, and execution is out of order. The state is not recorded when blocking occurs; the thread enters the blocked region (shown in black), and the arbitration result is updated every system clock cycle (clk); if the block can be released, normal operation resumes. The machine cycle (S0-S4) shown spans 5 system clock cycles (clk); when blocked, the rotation blocking time here is 3-4 system clock cycles (clk). The instruction execution period may not be an integer multiple of the machine cycle (S0-S4).
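A small sketch contrasting the two release policies, under the assumption of a 5-clock machine cycle (the S0-S4 example above); the arbiter callback is a stand-in for the real busy + state + priority check.

```python
# Sketch of the two release policies for rotation blocking (Fig. 7).
from typing import Callable

MACHINE_CYCLE = 5  # system clocks per machine cycle (five functional stacks)

def blocked_clocks(can_release_at: Callable[[int], bool], mode: str) -> int:
    """Return how many system clocks a thread spends rotation-blocked."""
    clk = 0
    while True:
        clk += 1
        if mode == "sequential":
            # state recorded: re-arbitrate only when the rotation returns to
            # the recorded position, so blocking lasts whole machine cycles
            if clk % MACHINE_CYCLE == 0 and can_release_at(clk):
                return clk
        else:  # "out_of_order"
            # state not recorded: re-arbitrate every system clock cycle
            if can_release_at(clk):
                return clk

# an arbiter that frees the contested resource after 3 clocks:
free_after_3 = lambda clk: clk >= 3
print(blocked_clocks(free_after_3, "sequential"))    # 5: one full machine cycle
print(blocked_clocks(free_after_3, "out_of_order"))  # 3: released right away
```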
As shown in fig. 8, the detailed timing analysis of multithread (multiprogram) start in the present technique. In the figure, five threads (programs) with thread numbers TN0, TN1, TN2, TN3, and TN4 start in parallel. The start process follows the state arbitration principle: the threads start in sequence, 1 system clock period (clk) apart. Two threads with TN1 start simultaneously; following the priority arbitration principle, the lower TN1 thread (program) has high priority and the upper TN1x thread (program) is permanently blocked.
The idle resources in the black region at the lower left can be seen to shrink gradually as the threads start one by one; once all 5 threads (programs) have started, every functional-stack resource is busy and no new thread can be added.
If a thread (program) with thread number TN0 is started while the earlier TN0 thread (program) has not yet ended, the later-started thread (program) is blocked and rotates according to the busy arbitration principle.
As shown in fig. 9, an example analysis of the extensibility and customization design of the invention. In the figure, the thread numbers are TN0, TN1, TN2, TN3, and TN4, where:
TN0 and TN1 are main functions main1 and main2;
the TN1x thread (program), main2x, is permanently blocked;
TN2 and TN3 are, respectively, software-simulated-hardware function customization 1 and customization 2; the specific functions are very flexible, common examples being simulated communication protocols such as IIC and SPI, which run in parallel with the main program and do not affect its execution;
TN4 is dedicated to debugging. The online debugging function is implemented through software interaction; this independent online-debugging thread runs in parallel with the main program and does not affect its execution.
As shown in fig. 10, the multithread (multiprogram) start and execution mechanism of the invention. Three threads (programs) are illustrated; their instructions are integer computation (thread 1), branch jump (thread 2), and integer computation (thread 3). The flow is as follows (a toy simulation follows the list):
s01: threads (programs) 1, 2, and 3 start simultaneously; thread (program) 1, following the busy + state + priority arbitration conclusion, is the first to end rotation blocking and enters the next stage; the other threads (programs) remain rotation-blocked;
s02: thread (program) 1 accesses the Icache and completes the instruction fetch; thread (program) 2 follows the busy + state + priority arbitration conclusion, ends rotation blocking, and enters the next stage; the other threads (programs) remain rotation-blocked;
s03: thread (program) 1 follows the arbitration conclusion, ends rotation blocking, and enters the next stage; thread (program) 2 accesses the Icache and completes the instruction fetch; thread (program) 3 follows the arbitration conclusion, ends rotation blocking, and enters the next stage;
s04: thread (program) 1 enters the decode stage and passes the result to the integer-calculation link; thread (program) 2 follows the arbitration conclusion, ends rotation blocking, and enters the next stage; thread (program) 3 accesses the Icache and completes the instruction fetch;
s05: thread (program) 1 follows the arbitration conclusion, ends rotation blocking, and enters the next stage; thread (program) 2 enters the decode stage and passes the result to the condition-judgment link; thread (program) 3 follows the arbitration conclusion, ends rotation blocking, and enters the next stage;
s06: thread (program) 1 enters the integer-calculation stage and starts the ALU to obtain the operation result for later use; thread (program) 2 follows the arbitration conclusion, ends rotation blocking, and enters the next stage; thread (program) 3 enters the decode stage and passes the result to the integer-calculation link;
s07: thread (program) 1 follows the arbitration conclusion, ends rotation blocking, and enters the next stage; thread (program) 2 enters the condition-judgment stage and obtains the jump destination address; thread (program) 3 follows the arbitration conclusion, ends rotation blocking, and enters the next stage;
s08: thread (program) 1 enters the data-memory-access stage, computes the Dcache address, and completes the Dcache access; thread (program) 2 follows the arbitration conclusion, ends rotation blocking, and enters the next stage; thread (program) 3 enters the integer-calculation stage and starts the ALU to obtain the operation result for later use;
s09: thread (program) 1 follows the arbitration conclusion, ends rotation blocking, and enters the next stage; thread (program) 2 enters the jump-execution stage, completing execution of the whole jump instruction; thread (program) 3 follows the arbitration conclusion, ends rotation blocking, and enters the next stage;
s10: thread (program) 1 enters the write-back stage, completing execution of the whole computation instruction, and computes the program-memory address of the next access (the Icache address); thread (program) 2, having the shortest instruction period, is the first to enter the next instruction's operation interval; it follows the arbitration conclusion, ends rotation blocking, and enters the next stage; thread (program) 3 enters the data-memory-access stage, computes the Dcache address, and completes the Dcache access;
s11: thread (program) 1 enters the next instruction's operation interval, follows the arbitration conclusion, ends rotation blocking, and enters the next stage; thread (program) 2 accesses the Icache again and fetches the next instruction code; thread (program) 3 follows the arbitration conclusion, ends rotation blocking, and enters the next stage;
s12: thread (program) 1 accesses the Icache again and fetches the next instruction code; thread (program) 2 follows the arbitration conclusion, ends rotation blocking, and enters the next stage; thread (program) 3 enters the write-back stage, completing execution of the whole computation instruction, and computes the program-memory address of the next access (the Icache address).
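A toy Python re-creation of this walk-through; the step names and start offsets are assumptions that mirror s01-s12 above, with odd and even states alternating between stack work and arbitration for each thread.

```python
# Toy re-creation of the Fig. 10 walk-through: three threads stagger by one
# state, so in each state one thread does stack work while the others sit
# in busy + state + priority arbitration (rotation blocking).
WORK = {
    1: ["fetch Icache", "decode", "integer ALU", "Dcache access",
        "write back", "fetch next Icache"],
    2: ["fetch Icache", "decode", "condition judge", "jump execute",
        "fetch next Icache"],
    3: ["fetch Icache", "decode", "integer ALU", "Dcache access",
        "write back"],
}
START = {1: 2, 2: 3, 3: 4}  # first working state of each thread (s02..s04)

def run(states: int = 12) -> None:
    progress = {tn: 0 for tn in WORK}  # next work step per thread
    for s in range(1, states + 1):
        cells = []
        for tn in WORK:
            working = (s >= START[tn] and (s - START[tn]) % 2 == 0
                       and progress[tn] < len(WORK[tn]))
            if working:
                cells.append(f"T{tn}: {WORK[tn][progress[tn]]}")
                progress[tn] += 1
            else:
                cells.append(f"T{tn}: rotation-blocked")
        print(f"s{s:02d}: " + " | ".join(cells))

run()
```

Running the sketch prints twelve lines whose working cells match s01-s12: thread 1 works in even states (fetch at s02 through its next fetch at s12), thread 2 in odd states from s03, and thread 3 in even states from s04.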
It should be noted that although the foregoing embodiments have been described herein, the scope of the invention is not limited by them. Any alterations and modifications of the embodiments described herein that are based on the innovative concepts of the invention, or equivalent structures or equivalent flow transformations made using the contents of the description and drawings, whether applied directly or indirectly to other relevant technical fields, fall within the scope of protection of this patent.

Claims (11)

1. A processor architecture implemented with the single-issue multithreading dynamic loop parallel technique, characterized by comprising one or more cores and a pipeline corresponding to each core, wherein a core uses only one instruction emitter and the pipeline comprises one sub-pipeline or a plurality of parallel sub-pipelines; each sub-pipeline comprises a plurality of functional stacks, the functional stacks in a sub-pipeline implement different functions and execute in sequence to complete execution of one instruction, the pipeline comprises the functional stacks forming the instruction emitter, and the parallel sub-pipelines share the instruction emitter; the pipeline is used to execute threads and the functional stacks to carry out the threads of the pipeline, different functional stacks in the same pipeline serving different threads at the same time; the instruction emitter emits thread instructions to the next functional stack; operation moves to the next functional stack of the sub-pipeline after a functional stack completes its specific stack function, and the functional stack then completes the same stack function for the next thread; the functional stack interacts with the next functional stack through a stack-transfer interface, passing the stack-transfer information to the next functional stack through that interface once its own stack function is complete;
the content of the stack-transfer interface comprises a thread number and arbitration information, the arbitration information comprising busy information, state information, and priority information; when the busy information of the stack-transfer interface indicates busy, the functional stack is blocked; when the busy information indicates not busy and the state information in the arbitration information does not equal the thread number of the stack-transfer interface, the functional stack is blocked; and when the state information in the arbitration information equals the thread number of the stack-transfer interface, the functional stack is blocked if the priority information in the arbitration information is low, and completes the stack function if the priority information is high.
2. The processor architecture implemented by the single-issue multithreading dynamic loop parallelism technique of claim 1, wherein the arbitration information is specifically:
the state information changes during blocking; the blocking is of fixed length, the blocking time being an integer multiple of the machine cycle; execution is in order, and the instruction execution period is an integer multiple of the machine cycle, the machine cycle being the period in which all the functional stacks complete once;
the state information remains unchanged during blocking; the blocking is of variable length, the blocking time being an integer multiple of the system clock; execution is out of order, and the instruction execution period is not an integer multiple of the machine cycle.
3. The processor architecture implemented by the single-issue multithreading dynamic loop parallel technique of claim 1, wherein the contents of the stack-transfer interface include an instruction cache address and instruction code, an interrupt flag, a specific-register program state, or a special-function debug interrupt; the instruction cache address and instruction code are those of the thread instruction executed by the functional stack, the interrupt flag indicates whether to interrupt, the specific-register program state indicates the state of a specific register, and the special-function debug interrupt indicates whether to interrupt for special-function debugging.
4. The processor architecture implemented by the single-issue multithreading dynamic loop parallelism technique of claim 1, wherein the number of functional stacks is the same as the maximum number of parallel threads.
5. The processor architecture implemented by the single-issue multithreading dynamic loop parallelism technique of claim 1, wherein the processor is a single-core processor, a homogeneous multi-core processor, a heterogeneous multi-core MCU, DSP, CPU, DPU, GPU, or an NPU.
6. The processor architecture implemented by the single-issue multithreading dynamic loop parallel technique of claim 1, wherein the main program thread runs in parallel with auxiliary threads, the auxiliary threads comprising: a simulated hardware communication protocol thread and an online debugging thread, wherein the simulated hardware communication protocols comprise: the IIC protocol, the SPI protocol, and the UART protocol.
7. The processor architecture implemented by the single-issue multithreading dynamic loop parallel technique of claim 1, wherein the stack execution time of the functional stack is at least one system clock cycle, and the single-core multithreading parallel granularity is at the system clock level.
8. The processor architecture of claim 1, wherein the processor is a RISC instruction set processor, the functional stacks including a fetch functional stack, a decode functional stack, an execute functional stack, a memory functional stack, and a write back functional stack.
9. The processor architecture implemented by the single-issue multithreading dynamic loop parallel technique of claim 1, wherein the processor is a CISC instruction set processor, and the functional stacks include an instruction-fetch function stack, a decode/read-memory function stack, and a write-memory/execute/write-back function stack.
10. The processor architecture implemented by the single-issue multithreading dynamic loop parallel technique of claim 1, wherein threads of the same operation share a pipeline.
11. The processor architecture implemented by the single-issue multithreading dynamic loop parallelism technique of claim 1, wherein the pipeline execution of different threads has a startup interval of less than one instruction completion cycle.
CN202211288569.3A 2022-10-20 2022-10-20 Processor architecture for single-shot multithreading dynamic loop parallel technology implementation Active CN115617740B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211288569.3A CN115617740B (en) 2022-10-20 2022-10-20 Processor architecture for single-shot multithreading dynamic loop parallel technology implementation

Publications (2)

Publication Number Publication Date
CN115617740A (en) 2023-01-17
CN115617740B (en) 2023-10-27

Family

ID=84865263

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211288569.3A Active CN115617740B (en) 2022-10-20 2022-10-20 Processor architecture for single-shot multithreading dynamic loop parallel technology implementation

Country Status (1)

Country Link
CN (1) CN115617740B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1758213A (en) * 2004-02-27 2006-04-12 印芬龙科技股份有限公司 Heterogeneous parallel multithread processor (HPMT) with shared contents
CN101763285A (en) * 2010-01-15 2010-06-30 西安电子科技大学 Zero-overhead switching multithread processor and thread switching method thereof
CN114721724A (en) * 2022-03-07 2022-07-08 电子科技大学 RISC-V instruction set-based six-stage pipeline processor

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7664936B2 (en) * 2005-02-04 2010-02-16 Mips Technologies, Inc. Prioritizing thread selection partly based on stall likelihood providing status information of instruction operand register usage at pipeline stages
US20080263325A1 (en) * 2007-04-19 2008-10-23 International Business Machines Corporation System and structure for synchronized thread priority selection in a deeply pipelined multithreaded microprocessor
US8874878B2 (en) * 2010-05-18 2014-10-28 Lsi Corporation Thread synchronization in a multi-thread, multi-flow network communications processor architecture

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
蒋本珊 (Jiang Benshan). 《计算机组成原理与系统结构》 [Principles of Computer Organization and System Structure]. Beijing: Beijing University of Aeronautics and Astronautics Press, 2007: 265-267. *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant