CN115617740A - Processor architecture realized by single-issue multithreaded dynamic loop parallel technology - Google Patents

Processor architecture realized by single-issue multithreaded dynamic loop parallel technology

Info

Publication number
CN115617740A
Authority
CN
China
Prior art keywords
stack
function
thread
instruction
stacks
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211288569.3A
Other languages
Chinese (zh)
Other versions
CN115617740B (en)
Inventor
王杜 (Wang Du)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changsha Fangwei Technology Co., Ltd.
Original Assignee
Changsha Fangwei Technology Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changsha Fangwei Technology Co., Ltd.
Priority to CN202211288569.3A
Publication of CN115617740A
Application granted
Publication of CN115617740B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00: Digital computers in general; Data processing equipment in general
    • G06F15/76: Architectures of general purpose stored program computers
    • G06F15/78: Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807: System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30: Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38: Concurrent instruction execution, e.g. pipeline, look ahead

Abstract

The invention relates to the field of integrated circuits, and in particular to a design and implementation method for a processor-core application architecture in the field of computer architecture. The technical system as a whole comprises: a differential-pipeline normalized-arbitration rotation-blocking stack-transfer mechanism, a system clock-level time-division multiplexing technology, a multi-thread start-up and execution mechanism, a simplified feasible design, and an extensible and customizable design. The normalized-arbitration rotation-blocking stack-transfer mechanism of the differential pipeline divides the pipeline into function stacks (such as instruction fetch, decode, and so on) according to a principle of mutually exclusive functions, normalizes and standardizes the stack-transfer information, and finally forms a closed-loop differential pipeline with an arbitrated rotation-blocking capability. The system clock-level time-division multiplexing technology, together with the multi-thread start-up and execution mechanism, schedules the function stacks rationally so that at any moment each function stack serves a different thread (program); this reduces the execution dependence between function stacks, improves parallelism, and maximizes resource utilization.

Description

Processor architecture realized by single-issue multithreaded dynamic loop parallel technology
Technical Field
The invention relates to the field of integrated circuits, and in particular to a method for realizing a processor-core application architecture in the field of computer architecture.
Background
The integrated circuit industry enjoys strong support. A processor's architecture is the foundation on which the processor is built, and most widely used architectures are foreign ones, such as the x86 and ARM architectures. The application of these architectures is subject to many restrictions, so new architectures are needed. Novel architecture technology is the core of an autonomous, controllable, safe, and reliable domestic processor, and also its chief difficulty. At present, technological innovation in computer architecture is scarce, while the need is urgent, the strategic importance great, and the market space vast.
Existing processors offer Hyper-Threading technology. Hyper-Threading (HT) was originally developed by Intel and released in 2002. It was first applied only to Xeon processors, where it was initially called "Super-Threading". With this technique, Intel presents two logical threads within one physical CPU. Although hyper-threading allows two threads to execute simultaneously, when both threads require the same resource at the same time, one of them must temporarily relinquish the resource and wait until it becomes free. The performance of a hyper-threaded core is therefore not equal to that of two CPUs. In other words, the existing hyper-threading technology merely simulates two processors so that multi-threaded applications can be handled; it does not deliver a proportional performance gain.
Disclosure of Invention
The purpose of the invention is to provide a new architecture technology that creates a method for running multiple threads (programs) in parallel on a single processor core, thereby improving multithreaded processing performance.
The technical scheme of the invention is as follows: a processor architecture realized by the single-issue multithreaded dynamic loop parallel technology comprises one or more cores and a pipeline corresponding to each core, wherein each core uses only one instruction issuer and the pipeline comprises one sub-pipeline or a plurality of parallel sub-pipelines. Each sub-pipeline comprises a plurality of function stacks; the function stacks within a sub-pipeline realize different functions, and executing them in sequence completes the execution of one instruction. The pipeline comprises the function stacks that form the instruction issuer, and a plurality of parallel sub-pipelines share one instruction issuer. The pipeline is used for executing threads, and the function stacks execute the pipeline's threads, with different function stacks of the same pipeline serving different threads at the same moment. The instruction issuer sends a thread's instruction to the next function stack; after a function stack completes its specific stack function within the sub-pipeline, operation moves on to the next function stack of the sub-pipeline, and the stack then performs its stack function for the next thread. A function stack interacts with the next function stack through a stack-transfer interface, and after completing its stack function it passes the stack-transfer information to the next function stack through that interface.
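For illustration only (this sketch is not part of the claimed subject matter), the hierarchy just described can be modeled in C as nested structures sharing a single instruction issuer; all type names, field names, and array bounds below are assumptions introduced for the sketch.

```c
#include <stddef.h>

#define MAX_STACKS 8  /* assumed bound on function stacks per sub-pipeline */
#define MAX_SUBS   4  /* assumed bound on parallel sub-pipelines */

/* One function stack: a mutually exclusive pipeline function (fetch, decode, ...). */
typedef struct {
    const char *name;    /* e.g. "fetch", "decode", "execute" */
    int         busy;    /* part of the arbitration information */
    int         thread;  /* thread (program) currently served, -1 if idle */
} FunctionStack;

/* A sub-pipeline: its function stacks, executed in order, complete one instruction. */
typedef struct {
    FunctionStack stacks[MAX_STACKS];
    size_t        n_stacks;
} SubPipeline;

/* A pipeline: one or more parallel sub-pipelines sharing one instruction
 * issuer, i.e. single issue. */
typedef struct {
    FunctionStack issuer;  /* the single shared instruction issuer */
    SubPipeline   subs[MAX_SUBS];
    size_t        n_subs;
} Pipeline;
```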
Furthermore, the contents of the stack-transfer interface include a thread number and arbitration information, the arbitration information comprising busy information, state information, and priority information. When the busy information of the stack-transfer interface indicates busy, the function stack is blocked; when the busy information indicates not busy but the state information is not equal to the thread number, the function stack is blocked; when the state information equals the thread number but the priority is low, the stack is blocked, and when the priority is high, the stack performs its own stack function.
Further, the arbitration information specifically behaves as follows:
when the state information changes during blocking, the block has a fixed length, the blocking time is an integral multiple of the machine cycle, execution is in order, and the instruction execution period is an integral multiple of the machine cycle, the machine cycle being the period in which all function stacks complete one execution;
when the state information is unchanged during blocking, the block has no fixed length, the blocking time is an arbitrary integral multiple of the system clock, execution is out of order, and the instruction execution period is not necessarily an integral multiple of the machine cycle.
Further, the contents of the stack-transfer interface include an instruction cache address and instruction code, an interrupt flag, a specific-register program state, and a special-function debug interrupt. The instruction cache address and instruction code are the cache address and instruction code of the thread instruction being executed by the function stack; the interrupt flag indicates whether to interrupt; the specific-register program state indicates the state of the specific register; and the special-function debug interrupt indicates whether to perform a special-function debug interrupt.
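For illustration, the stack-transfer contents enumerated in the two preceding paragraphs can be collected into a single normalized record; the field names and widths below are assumptions of this sketch, not prescribed by the invention.

```c
#include <stdint.h>

/* Arbitration portion of the stack-transfer interface. */
typedef struct {
    uint8_t busy;      /* downstream stack occupied: rotate and block */
    uint8_t state;     /* thread number expected next, for in-order hand-off */
    uint8_t priority;  /* tie-breaker when two starts carry the same thread number */
} ArbitrationInfo;

/* Normalized record passed from an upstream to a downstream function stack. */
typedef struct {
    uint8_t         thread_number;
    ArbitrationInfo arb;
    uint32_t        icache_addr;  /* instruction cache address of the thread instruction */
    uint32_t        opcode;       /* the instruction code itself */
    uint8_t         irq_flag;     /* whether to interrupt */
    uint32_t        sfr_state;    /* specific-register program state */
    uint8_t         debug_irq;    /* whether to take a special-function debug interrupt */
} StackTransfer;
```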
Further, the number of function stacks is the same as the maximum number of parallel threads.
Further, the processor is a single-core, homogeneous multi-core, or heterogeneous multi-core MCU, DSP, CPU, DPU, GPU, or NPU.
Further, a main-program thread runs in parallel with assistant threads, the assistant threads comprising a simulated-hardware-communication-protocol thread and an online-debugging thread, wherein the simulated hardware communication protocols include the IIC protocol, the SPI protocol, and the UART protocol.
Furthermore, the stack execution time of a function stack is at least one system clock cycle, and the single-core multithreaded parallel granularity is at the system-clock level.
Further, the processor is a RISC instruction-set processor, and the function stacks comprise an instruction-fetch function stack, a decode function stack, an execute function stack, a memory-access function stack, and a write-back function stack.
Further, the processor is a CISC instruction-set processor, and the function stacks comprise an instruction-fetch stack, a decode-and-memory-read stack, and an execute-memory-write-and-write-back stack.
Further, threads of the same operation share a pipeline.
Further, the start times at which the pipeline begins executing different threads differ by less than one instruction completion cycle.
In summary, the beneficial effects of the invention are as follows:
1. A method for running multiple threads (programs) in parallel on a single processor core is created; by dividing one core into different function stacks, programs of different threads can run simultaneously across the function stacks, so that multiple threads run at once and multithreaded processing performance improves;
2. the invention discloses a brand-new technology in the hyper-threading category, raising the number of parallel threads (programs) per core from 1-2 (a single core offering 2 hardware threads) to many under limited hardware resources;
3. the method raises the parallel granularity of ordinary single-core multithreading (multi-programming) from the instruction level (or even function-level concurrency) to the system-clock level; it can be widely applied to single-core, homogeneous multi-core, and heterogeneous multi-core MCUs, DSPs, CPUs, DPUs, GPUs, NPUs, and the like.
Drawings
FIG. 1 is a block diagram of the single-issue multithreaded (multi-program) dynamic loop parallel technology architecture.
FIG. 2 is a diagram illustrating parallel five-thread (five-program) timing analysis for a single pipeline according to the present invention.
FIG. 3 is a timing analysis diagram of superscalar parallel three-thread (three-program) according to the present invention.
FIG. 4 is a block diagram showing common layouts of the differential pipeline for the RISC and CISC instruction sets.
FIG. 5 is a diagram illustrating normalized stack information content according to the present invention.
FIG. 6 is a diagram of the rotation-blocking mechanism with busy + state + priority arbitration in the present invention.
FIG. 7 is a timing diagram illustrating sequential and out-of-order blocking analysis in accordance with the present technology.
FIG. 8 is a diagram illustrating a detailed analysis of multi-threaded (multi-program) start-up timing in the present technology.
FIG. 9 is a diagram illustrating an exemplary analysis of an extensible and customizable design in the present technology.
FIG. 10 is a diagram illustrating the multi-threaded (multi-program) program start-up and execution mechanism of the present technology.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings.
The invention relates to the field of integrated circuits, and in particular to a method for realizing a processor-core application architecture in the field of computer architecture. It creates a method for running multiple threads (programs) in parallel on a single processor core and discloses a brand-new technology in the hyper-threading category. Under limited hardware resources, it raises the number of parallel threads (programs) per core from 1-2 (the traditional hyper-threading technology offers 2 hardware threads per core) to many, and it raises the parallel granularity of ordinary single-core multithreading from the instruction level (or even function-level concurrency) to the system-clock level. It is a low-cost, high-performance, safe, reliable, and sustainably and widely applicable technology: compared with a traditional single-threaded core, the hardware cost grows by less than 10% and is more than 40% lower than that of the traditional hyper-threading technology, while single-core performance improves by more than 300% over a traditional single thread and by more than 150% over the traditional hyper-threading technology.
As shown in fig. 1: the block diagram of the single-issue multithreaded (multi-program) dynamic loop parallel technology system. It comprises five technical points: the differential-pipeline normalized-arbitration rotation-blocking stack-transfer mechanism, the system clock-level time-division multiplexing technology (SCTDM for short), the multi-thread (multi-program) start-up and execution mechanism, the simplified feasible design, and the extensible and customizable design.
As shown in fig. 2: timing analysis of five threads (five programs) in parallel on a single pipeline (i.e., a pipeline containing only one sub-pipeline) under the present technology, the sub-pipeline comprising five function stacks. The figure describes the operation of a single five-stack differential pipeline and lists the timing relationships. A pipeline can run one thread or several threads of the same operation, and the pipeline is divided into five function stacks: instruction fetch, decode, execute, memory access, and write-back, corresponding to five working states S0, S1, S2, S3, and S4; the stack execution time is one system clock cycle (clk). In other embodiments the division may differ; for example, decode and execute may be merged into one function stack. A white stack represents the working state and a grey stack the idle state. The case of five threads (five programs) in parallel is illustrated under a fairly ideal start-up state: no blocking appears in the timing diagram, the threads start in sequence, and consecutive thread start times differ by one system clock (clk). After receiving the thread 1 instruction from the instruction issuer, the pipeline's first function stack, the instruction-fetch stack, completes the fetch of thread 1 at time t0. At the next moment, t0 + clk, the fetch stack passes the information of the current thread, whose fetch has completed, to the next function stack, the decode stack, for processing; at the same time the fetch stack can complete the fetch work of the next thread, thread 2. This proceeds in turn until, at time t0 + 4 clk, the fetch stack is fetching thread 5, the decode stack is decoding thread 4, the execute stack is executing thread 3, the memory-access stack is accessing memory for thread 2, and the write-back stack is writing back thread 1. That is, one pipeline can process 5 threads simultaneously. The functional core is this: there is only one set of hardware resources (function-stack resources, i.e., the function stacks), that is, one hardware core, and at any moment each function stack serves a different thread (program), which reduces the execution dependence between function stacks; the execution sequences of the threads are staggered by one system clock (clk), so the parallel granularity of single-core multithreading rises from the instruction level (or even function-level concurrency) to the system-clock level (clk). Parallelism is thereby improved, multithreaded parallelism is supported, and multithreaded performance is improved.
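For illustration, the ideal timing of fig. 2 can be reproduced with a short block-free C model: five threads issued one clk apart walk through the five stacks, so from t0 + 4 clk onward every stack serves a different thread in every cycle. The stage names follow the figure; everything else is an assumption of the sketch.

```c
#include <stdio.h>

int main(void) {
    const char *stage[5] = {"fetch", "decode", "execute", "mem", "writeback"};
    /* Thread k (k = 0..4) is issued at clock k: starts one clk apart, no blocking. */
    for (int t = 0; t < 9; t++) {          /* clocks t0 .. t0+8 */
        printf("t0+%dclk:", t);
        for (int s = 0; s < 5; s++) {
            int thread = t - s;            /* thread occupying stage s at clock t */
            if (thread >= 0 && thread < 5)
                printf("  %s:T%d", stage[s], thread + 1);
            else
                printf("  %s:idle", stage[s]);
        }
        printf("\n");
    }
    return 0;
}
```

The row for t0+4clk prints fetch:T5, decode:T4, execute:T3, mem:T2, writeback:T1, matching the fully loaded state described above.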
As shown in fig. 3: timing analysis of three threads (three programs) in parallel under superscalar operation (i.e., one pipeline containing multiple sub-pipelines) under the present technology. The figure describes the operation of a differential pipeline (superscalar) with four sub-pipelines and lists the timing relationships. The first sub-pipeline comprises an integer-calculation function stack, a memory-access function stack, and a write-back function stack; the second and third sub-pipelines each comprise a floating-point exponent-alignment function stack, a mantissa-summation function stack, a normalization function stack, a memory-access function stack, and a write-back function stack; and the fourth sub-pipeline comprises a condition-judgment function stack and a branch-jump function stack. The sub-pipelines share one instruction-fetch function stack and one decode function stack, i.e., single issue is realized. The longest sub-pipeline has seven stacks, corresponding to seven working states: S0, S1, S2, S3, S4, S5, and S6; the stack execution time is one system clock cycle (clk). The case of three threads (three programs) in parallel is illustrated under an ideal start-up state: no blocking appears in the timing, the threads all start in sequence, and start times differ by one system clock (clk). The functional core is the same: there is only one set of hardware resources (function-stack resources), and at any moment each function stack serves a different thread (program), reducing the execution dependence between function stacks; thread execution sequences are staggered by one system clock (clk), raising the parallel granularity of single-core multithreading from the instruction level (or even function-level concurrency) to the system-clock level (clk). Parallelism is thereby improved and multithreaded parallelism is supported. FIG. 3 illustrates superscalar out-of-order execution; the invention can also support superscalar in-order execution.
As shown in fig. 4: common layouts of the differential pipeline under the present technology for the RISC and CISC instruction-set families. Common three-stack and five-stack differential pipelines are illustrated. CISC instruction sets include, for example, CISC-51, commonly used in 8-bit processors; RISC instruction sets include, for example, RISC-V, commonly used in 32/64-bit processors. These instruction sets are open source and can therefore also be used to develop domestic processors. The three-stack differential pipeline is mainly suitable for 8-bit MCUs, while the five-stack differential pipeline suits high-end processors. For a RISC instruction-set processor, the function stacks comprise an instruction-fetch function stack, a decode function stack, an execute function stack, a memory-access function stack, and a write-back function stack; for a CISC instruction-set processor, the function stacks comprise an instruction-fetch function stack, a decode-and-memory-read function stack, and an execute-memory-write-and-write-back function stack.
In addition, the interaction between the ALU (arithmetic logic unit), the memory (program + data), and the pipeline is also depicted.
All function stacks have unified input and output interfaces and a built-in rotation-blocking capability.
As shown in fig. 5: the invention normalizes the content of the stack-transfer information, specifically as given in the following table:
[Table images not reproduced; the normalized fields are those of the stack-transfer interface described above: thread number, arbitration information (busy, state, priority), instruction cache address and instruction code, interrupt flag, specific-register program state, and special-function debug interrupt.]
Once a thread (program) starts, the information passed between the upstream stack (upper stage) and the downstream stack (lower stage) stays consistent until the thread ends, forming a closed-loop normalized pipeline.
As shown in fig. 6: the invention provides a rotation-blocking mechanism with busy + state + priority arbitration.
The logical relationship of the three arbitration decisions is illustrated (a minimal C sketch follows the list):
(1) if busy = 1, rotate and block; otherwise transfer;
(2) if state = thread number, pass; otherwise rotate and block;
(3) if the priority is high, pass; otherwise rotate and block.
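A minimal C sketch of the three checks in the order listed; the ArbitrationInfo type repeats the assumed record sketched for fig. 5, and comparing the requester's priority against the stored priority field is likewise an assumption of the sketch.

```c
#include <stdint.h>

typedef struct { uint8_t busy, state, priority; } ArbitrationInfo;  /* as assumed earlier */
typedef enum { PASS, ROTATE_BLOCK } Verdict;

/* Busy + state + priority arbitration for one hand-off attempt. */
Verdict arbitrate(const ArbitrationInfo *arb, uint8_t my_thread, uint8_t my_priority) {
    if (arb->busy)                    /* (1) busy = 1: rotate and block */
        return ROTATE_BLOCK;
    if (arb->state != my_thread)      /* (2) state != thread number: rotate and block */
        return ROTATE_BLOCK;
    if (my_priority < arb->priority)  /* (3) lower priority loses: rotate and block */
        return ROTATE_BLOCK;
    return PASS;                      /* all three checks passed: transfer proceeds */
}
```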
As shown in fig. 7: sequential-blocking and out-of-order-blocking timing analysis in the present technology. The figure shows two modes when rotation blocking occurs (a small numeric sketch follows the two modes):
(1) The state changes: the block has a fixed length, an integral multiple of the machine cycle, and execution follows the blocking order. The state is recorded when the block occurs; the thread then enters the black-background blocking region, and the arbitration result is updated each time the recorded position comes around again, operation returning to normal once the block can be released. The illustrated machine cycle (S0-S4) has 5 system clock cycles (clk), and when blocking occurs the rotation-blocking time is 1 machine cycle (S0-S4). The instruction execution period is always an integral multiple of the machine cycle (S0-S4).
(2) The state is unchanged: the block has no fixed length, lasting an arbitrary integral multiple of the system clock period, and execution is out of order. The state is not recorded when the block occurs; the thread enters the black-background blocking region, the arbitration result is updated every system clock cycle (clk), and operation returns to normal once the block can be released. The illustrated machine cycle (S0-S4) has 5 system clock cycles (clk), and when blocking occurs the rotation-blocking time is 3-4 system clock cycles (clk). The instruction execution period may then fail to be an integral multiple of the machine cycle (S0-S4).
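The numeric consequence of the two modes can be shown with two small helpers, assuming the 5-clk machine cycle (S0-S4) of the figure; the function names and the boundary-rounding rule of the in-order case are assumptions of this sketch.

```c
/* Machine cycle length of the illustrated five-stack pipeline: 5 clk (S0-S4). */
#define MACHINE_CYCLE_CLK 5

/* Mode (1), in order: a block always ends on a machine-cycle boundary, so the
 * instruction execution period stays an integral multiple of MACHINE_CYCLE_CLK. */
int inorder_release_clk(int block_start_clk) {
    return ((block_start_clk / MACHINE_CYCLE_CLK) + 1) * MACHINE_CYCLE_CLK;
}

/* Mode (2), out of order: arbitration re-runs every clk, so a block may end on
 * any clock, e.g. after the 3-4 clk shown in fig. 7. */
int ooo_release_clk(int block_start_clk, int clks_blocked) {
    return block_start_clk + clks_blocked;
}
```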
As shown in fig. 8: detailed analysis of multi-thread (multi-program) start-up timing in the present technology. In the figure, five threads (programs) with thread numbers TN0, TN1, TN2, TN3, and TN4 start in parallel. The start-up process follows the state arbitration principle: the threads start in sequence, with consecutive start times differing by 1 system clock cycle (clk). Two TN1 threads start simultaneously, so the priority arbitration principle applies: the lower TN1 thread (program) has the higher priority, and the upper TN1x thread (program) remains blocked throughout.
The idle resources in the black-background region at the lower left can be seen to shrink as the threads start one by one; once all 5 threads (programs) have started, every function-stack resource is busy and no new thread can be added.
When the thread (program) with thread number TN0 has started but not yet finished and a thread (program) with the same thread number TN0 needs to start again, the busy arbitration principle applies, and the later-started thread (program) is rotation-blocked.
As shown in fig. 9: analysis of an extensible and customizable design example in the present technology. In the figure the thread numbers are TN0, TN1, TN2, TN3, and TN4, where:
TN0 and TN1 are the main functions main1 and main2;
the TN1x thread (program), main2x, remains blocked throughout;
TN2 and TN3 are, respectively, software-simulated hardware-function customization 1 and customization 2; the specific functions are very flexible, commonly simulated communication protocols such as IIC and SPI, which run in parallel with the main program without affecting its execution;
TN4 is dedicated to debugging: the online-debugging function is realized through software interaction, and this independent online-debugging thread (program) runs in parallel outside the main program without affecting its execution.
As shown in fig. 10: the multi-thread (multi-program) start-up and execution mechanism of the present technology. Three threads (programs) are illustrated, whose instructions are an integer calculation (thread 1), a branch jump (thread 2), and an integer calculation (thread 3), as follows (a compact software model of this schedule is given after step S12):
S01: threads (programs) 1, 2, and 3 start simultaneously; thread (program) 1, following the busy + state + priority arbitration conclusion, is the first to end its rotation block and enters the next stage; the other threads (programs) remain in the blocking loop;
S02: thread (program) 1 accesses the Icache and completes the instruction-fetch operation; thread (program) 2 follows the busy + state + priority arbitration conclusion, ends its rotation block, and enters the next stage; the remaining thread (program) stays in the blocking loop;
S03: thread (program) 1 follows the arbitration conclusion, ends its rotation block, and enters the next stage; thread (program) 2 accesses the Icache and completes the instruction-fetch operation; thread (program) 3 follows the arbitration conclusion, ends its rotation block, and enters the next stage;
S04: thread (program) 1 enters the decode stage and passes the result to the integer-calculation link; thread (program) 2 follows the arbitration conclusion, ends its rotation block, and enters the next stage; thread (program) 3 accesses the Icache and completes the instruction-fetch operation;
S05: thread (program) 1 follows the arbitration conclusion, ends its rotation block, and enters the next stage; thread (program) 2 enters the decode stage and passes the result to the condition-judgment link; thread (program) 3 follows the arbitration conclusion, ends its rotation block, and enters the next stage;
S06: thread (program) 1 enters the integer-calculation stage, starts the ALU, and obtains the operation result for later use; thread (program) 2 follows the arbitration conclusion, ends its rotation block, and enters the next stage; thread (program) 3 enters the decode stage and passes the result to the integer-calculation link;
S07: thread (program) 1 follows the arbitration conclusion, ends its rotation block, and enters the next stage; thread (program) 2 enters the condition-judgment stage and obtains the jump destination address; thread (program) 3 follows the arbitration conclusion, ends its rotation block, and enters the next stage;
S08: thread (program) 1 enters the data-memory access stage, calculates the Dcache address, and completes the Dcache access operation; thread (program) 2 follows the arbitration conclusion, ends its rotation block, and enters the next stage; thread (program) 3 enters the integer-calculation stage, starts the ALU, and obtains the operation result for later use;
S09: thread (program) 1 follows the arbitration conclusion, ends its rotation block, and enters the next stage; thread (program) 2 enters the jump-execution stage, and the whole jump instruction finishes executing; thread (program) 3 follows the arbitration conclusion, ends its rotation block, and enters the next stage;
S10: thread (program) 1 enters the write-back stage, the whole calculation instruction finishes executing, and the address of the next program-memory access is calculated at the same time (the Icache address is calculated); thread (program) 2, whose instruction cycle is the shortest, is the first to enter the next instruction's operation interval, follows the arbitration conclusion, ends its rotation block, and enters the next stage; thread (program) 3 enters the data-memory access stage, calculates the Dcache address, and completes the Dcache access operation;
S11: thread (program) 1 enters the next instruction's operation interval, follows the arbitration conclusion, ends its rotation block, and enters the next stage; thread (program) 2 accesses the Icache again and fetches the next instruction code; thread (program) 3 follows the arbitration conclusion, ends its rotation block, and enters the next stage;
S12: thread (program) 1 accesses the Icache again and fetches the next instruction code; thread (program) 2 follows the arbitration conclusion, ends its rotation block, and enters the next stage; thread (program) 3 enters the write-back stage, and the whole calculation instruction finishes executing while the next program-memory address to access is calculated (the Icache address is calculated).
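For illustration, the S01-S12 schedule above can be reproduced by a small C model in which each thread alternates one arbitration step with one work step, and the threads enter their first work step one step apart. Only each thread's first instruction is modeled (the re-fetches of S11-S12 therefore print as arbitration steps); the stage names and offsets are taken from the walkthrough, and everything else is an assumption of the sketch.

```c
#include <stdio.h>

int main(void) {
    /* Per-thread stage sequences taken from the walkthrough of fig. 10. */
    const char *seq[3][5] = {
        {"fetch", "decode", "int-exec", "mem",  "writeback"},  /* thread 1 */
        {"fetch", "decode", "cond",     "jump", NULL},         /* thread 2 */
        {"fetch", "decode", "int-exec", "mem",  "writeback"},  /* thread 3 */
    };
    const int first_work[3] = {2, 3, 4};  /* first work steps: S02, S03, S04 */
    for (int s = 1; s <= 12; s++) {
        printf("S%02d:", s);
        for (int t = 0; t < 3; t++) {
            int k = s - first_work[t];    /* steps elapsed since the first work step */
            if (k >= 0 && k % 2 == 0 && k / 2 < 5 && seq[t][k / 2])
                printf("  T%d:%-9s", t + 1, seq[t][k / 2]);
            else
                printf("  T%d:%-9s", t + 1, "arb/block");  /* arbitration or rotation block */
        }
        printf("\n");
    }
    return 0;
}
```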
It should be noted that although the above embodiments have been described herein, the scope of the invention is not limited thereby. Any technical solution obtained, on the basis of the innovative concepts of the invention, by changing or modifying the embodiments described herein, or by applying equivalent structures or equivalent processes based on the contents of this specification and the accompanying drawings, directly or indirectly, to other related technical fields, falls within the scope of this patent.

Claims (12)

1. A processor architecture realized by a single-issue multithreaded dynamic loop parallel technology, characterized by comprising one or more cores and a pipeline corresponding to each core, wherein each core uses only one instruction issuer and the pipeline comprises one sub-pipeline or a plurality of parallel sub-pipelines; each sub-pipeline comprises a plurality of function stacks, the function stacks within a sub-pipeline realize different functions, and the function stacks of a sub-pipeline are executed in sequence to complete the execution of one instruction; the pipeline comprises the function stacks forming the instruction issuer, and a plurality of parallel sub-pipelines share one instruction issuer; the pipeline is used for executing threads and the function stacks are used for executing the pipeline's threads, different function stacks of the same pipeline serving different threads at the same moment; the instruction issuer is used for sending a thread's instruction to the next function stack; after a function stack completes its specific stack function within the sub-pipeline, operation moves to the next function stack of the sub-pipeline, and the function stack completes the stack function of the next thread after completing that of the current thread; a function stack interacts with the next function stack through a stack-transfer interface, and after completing its stack function it transfers the stack-transfer information to the next function stack through the stack-transfer interface.
2. The processor architecture realized by the single-issue multithreaded dynamic loop parallel technology according to claim 1, wherein the contents of the stack-transfer interface include a thread number and arbitration information, the arbitration information comprising busy information, state information, and priority information; when the busy information of the stack-transfer interface indicates busy, the function stack is blocked; when the busy information indicates not busy and the state information is not equal to the thread number, the function stack is blocked; when the state information equals the thread number and the priority is low, the stack is blocked, and when the priority is high, the stack performs its own stack function.
3. The processor architecture according to claim 2, wherein the arbitration information specifically behaves as follows:
when the state information changes during blocking, the block has a fixed length, the blocking time is an integral multiple of the machine cycle, execution is in order, and the instruction execution period is an integral multiple of the machine cycle, the machine cycle being the period in which all function stacks complete one execution;
when the state information is unchanged during blocking, the block has no fixed length, the blocking time is an arbitrary integral multiple of the system clock, execution is out of order, and the instruction execution period is not necessarily an integral multiple of the machine cycle.
4. The processor architecture according to claim 1, wherein the contents of the stack-transfer interface comprise an instruction cache address and instruction code, an interrupt flag, a specific-register program state, or a special-function debug interrupt; the instruction cache address and instruction code are the cache address and instruction code of the thread instruction executed by the function stack, the interrupt flag indicates whether to interrupt, the specific-register program state indicates the state of the specific register, and the special-function debug interrupt indicates whether to perform a special-function debug interrupt.
5. The processor architecture according to claim 1, wherein the number of function stacks is the same as the maximum number of parallel threads.
6. The processor architecture according to claim 1, wherein the processor is a single-core, homogeneous multi-core, or heterogeneous multi-core MCU, DSP, CPU, DPU, GPU, or NPU.
7. The processor architecture according to claim 1, wherein a main-program thread and assistant threads run in parallel, the assistant threads comprising a simulated-hardware-communication-protocol thread and an online-debugging thread, and the simulated hardware communication protocols comprising the IIC protocol, the SPI protocol, and the UART protocol.
8. The processor architecture according to claim 1, wherein the stack execution time of a function stack is at least one system clock cycle, and the single-core multithreaded parallel granularity is at the system-clock level.
9. The processor architecture according to claim 1, wherein the processor is a RISC instruction-set processor, and the function stacks include an instruction-fetch function stack, a decode function stack, an execute function stack, a memory-access function stack, and a write-back function stack.
10. The processor architecture according to claim 1, wherein the processor is a CISC instruction-set processor, and the function stacks include an instruction-fetch stack, a decode-and-memory-read stack, and an execute-memory-write-and-write-back stack.
11. The processor architecture implemented in accordance with claim 1, wherein threads of identical operations share a pipeline.
12. The processor architecture according to claim 1, wherein the start times at which the pipeline begins executing different threads differ by less than one instruction completion cycle.
CN202211288569.3A 2022-10-20 2022-10-20 Processor architecture realized by single-issue multithreaded dynamic loop parallel technology Active CN115617740B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211288569.3A CN115617740B (en) 2022-10-20 2022-10-20 Processor architecture realized by single-issue multithreaded dynamic loop parallel technology

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211288569.3A CN115617740B (en) 2022-10-20 2022-10-20 Processor architecture realized by single-issue multithreaded dynamic loop parallel technology

Publications (2)

Publication Number Publication Date
CN115617740A 2023-01-17
CN115617740B CN115617740B (en) 2023-10-27

Family

ID=84865263

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211288569.3A Active CN115617740B (en) 2022-10-20 2022-10-20 Processor architecture realized by single-issue multithreaded dynamic loop parallel technology

Country Status (1)

Country Link
CN (1) CN115617740B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1758213A (en) * 2004-02-27 2006-04-12 印芬龙科技股份有限公司 Heterogeneous parallel multithread processor (HPMT) with shared contents
US20060179280A1 (en) * 2005-02-04 2006-08-10 Mips Technologies, Inc. Multithreading processor including thread scheduler based on instruction stall likelihood prediction
US20080263325A1 (en) * 2007-04-19 2008-10-23 International Business Machines Corporation System and structure for synchronized thread priority selection in a deeply pipelined multithreaded microprocessor
CN101763285A (en) * 2010-01-15 2010-06-30 西安电子科技大学 Zero-overhead switching multithread processor and thread switching method thereof
US20130089109A1 (en) * 2010-05-18 2013-04-11 Lsi Corporation Thread Synchronization in a Multi-Thread, Multi-Flow Network Communications Processor Architecture
CN114721724A (en) * 2022-03-07 2022-07-08 电子科技大学 RISC-V instruction set-based six-stage pipeline processor


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
RXMIND: "Superscalar and VLIW CPU throughput improvement", pages 265-267, retrieved from the Internet <URL: https://www.cnblogs.com/rxmind/p/14537264.html> *

Also Published As

Publication number Publication date
CN115617740B (en) 2023-10-27


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant