CN102156637A - Vector crossing multithread processing method and vector crossing multithread microprocessor - Google Patents

Vector crossing multithread processing method and vector crossing multithread microprocessor

Info

Publication number
CN102156637A
CN102156637A CN2011101138829A CN201110113882A
Authority
CN
China
Prior art keywords
vector
instruction
scalar
vectorial
unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2011101138829A
Other languages
Chinese (zh)
Inventor
杨学军
徐炜遐
窦强
王永文
高军
邓让钰
衣晓飞
郭御风
唐遇星
黎铁军
吴俊杰
曾坤
晏小波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN2011101138829A priority Critical patent/CN102156637A/en
Publication of CN102156637A publication Critical patent/CN102156637A/en
Pending legal-status Critical Current

Landscapes

  • Advance Control (AREA)

Abstract

The invention discloses a vector interleaved multithreading processing method and a vector interleaved multithreaded microprocessor. The processing method comprises the following steps: a multithreaded instruction fetch unit selects one vector thread out of N vector threads, reads an instruction, and stores the fetched instruction into the instruction buffer queue corresponding to that vector thread; a thread scheduling unit selects one of the N instruction buffer queues, takes an instruction out of it, and decodes the instruction; and the decoded instruction is sent to a vector execution pipeline or a scalar execution pipeline for execution. The method can be realized in hardware by the vector interleaved multithreaded microprocessor. The method and the microprocessor combine vector processing with multithreading, and have the advantages of a simple hardware structure, strong arithmetic capability, and excellent compatibility and extensibility.

Description

Vector interleaved multithreading processing method and vector interleaved multithreaded microprocessor
Technical field
The present invention relates to the field of computer microprocessors, and in particular to a multithreaded microprocessor.
Background technology
The rapid development of the computer field places ever-increasing demands on the processing power of microprocessors. There are two main approaches to improving a microprocessor's arithmetic capability: the first is to improve the arithmetic capability of a single processor core; the second is to integrate multiple processor cores into one microprocessor chip, commonly known as multi-core technology.
1. Improving the arithmetic capability of a processor core. Traditional methods rely mainly on raising the core's clock frequency and on superscalar techniques with wider instruction issue. Constrained by process technology, power consumption, reliability, and other factors, frequency scaling has hit a bottleneck, and the instruction issue width is also difficult to widen further. Attention has therefore gradually shifted to other new microprocessor architecture techniques that make full use of the ever-growing on-chip hardware resources, so that the performance of the microprocessor core can continue to improve.
Fig. 1 shows the typical structure of a traditional scalar microprocessor core. It mainly comprises a program counter, an instruction fetch unit, an instruction cache, a decoding unit, and a scalar execution pipeline. The scalar execution pipeline mainly comprises a register file unit, a data cache, scalar execution units (a Load/Store unit, a scalar floating-point unit, and a scalar arithmetic logic unit), and a data write-back unit. The typical execution flow of a scalar application program on a conventional microprocessor is shown in Fig. 2: the instruction fetch unit sends a memory access request to the instruction cache according to the program counter and obtains an instruction; the instruction fetch unit then sends the ready instruction to the decoding unit for decoding; according to the decode result, the instruction enters the scalar execution pipeline and begins execution. During execution, the scalar execution pipeline accesses the register file unit according to the decode result to obtain the instruction's source operands and sends them to the appropriate functional unit, which carries out the instruction's function; finally, the data write-back unit writes the instruction's final result back to the register file.
In the evolution of microprocessor architecture, interleaved multithreading and vector techniques have both appeared. Interleaved multithreading means that a microprocessor maintains the status registers and related state of several scalar threads simultaneously, and the execution pipeline alternately executes scalar instructions from different threads. The defining feature of vector technology is that one instruction can process multiple scalar data items, so vector processors usually have a high peak performance. Interleaved multithreading lets a microprocessor hide long-latency operations effectively and keep the execution pipeline full; however, because multiple threads share one set of arithmetic units, single-thread performance suffers, and the technique cannot raise the microprocessor's peak performance. Vector technology can greatly raise peak performance, but it is rather sensitive to the memory access latency caused by cache misses and has difficulty turning peak performance into sustained performance. If the two could be combined organically, it might be possible to design a high-performance microprocessor that possesses both high peak performance and high sustained performance.
2. Multi-core technology integrates multiple lower-complexity processor cores onto one chip, making better use of on-chip hardware resources to improve microprocessor performance. In recent years, as multi-core technology has developed, the number of cores integrated on a single chip has grown, while the arithmetic capability of an individual core has not improved significantly and in some designs has even declined noticeably. To exploit the performance of multi-core processors more fully, attention has increasingly focused on the development of process-level and thread-level parallelism, while research on instruction-level and data-level parallelism has been neglected. How to balance the number of integrated cores against the arithmetic capability of each core, and how to combine process-level, thread-level, instruction-level, and data-level parallelism to further improve microprocessor performance, are major issues in microprocessor architecture design.
Summary of the invention
The technical problem to be solved by this invention is as follows: in view of the problems of the prior art, the invention provides a vector interleaved multithreading processing method that combines vector processing with multithreading and can improve processor performance, and a vector interleaved multithreaded microprocessor with a simple hardware structure, strong arithmetic capability, and good compatibility and extensibility.
To solve the above technical problems, the present invention adopts the following technical solution:
A vector interleaved multithreading processing method, characterized by comprising the following steps:
1) Instruction fetch: a multithreaded instruction fetch unit selects, in round-robin fashion, one vector thread out of N vector threads for instruction fetching, and stores the fetched instruction into the instruction buffer queue corresponding to that vector thread;
2) Thread selection: a thread scheduling unit selects one instruction buffer queue out of the N instruction buffer queues, takes an instruction out of it, and decodes the instruction;
3) Instruction execution: the decoded instruction is sent to the vector execution pipeline or the scalar execution pipeline for execution.
As a further improvement of the method of the present invention:
N = 2^n, where n = 1, 2, 3, …
Step 3) specifically comprises the following steps:
3.1) Operand selection: according to the content of the decoded instruction, the vector register file unit or the scalar register file unit is accessed to obtain the source operands, which are delivered to the corresponding vector execution unit or scalar execution unit;
3.2) Instruction execution: the vector execution unit or scalar execution unit performs the operation on the source operands, and the operation result is written back to the vector register file unit or scalar register file unit respectively.
The instruction is a scalar instruction or a vector instruction, and the vector instructions comprise the following classes:
ⅰ. Vector memory access instructions, comprising:
A. Vector load instructions:
vload vA rB: using the value in scalar register rB as the address, read data into vector register vA;
vload vA rB imm: using the value in scalar register rB plus the immediate imm as the address, read data into vector register vA;
vload vA imm: using the immediate imm as the address, read data into vector register vA;
B. Vector store instructions:
vstore vA rB imm: using the value in scalar register rB plus the immediate imm as the address, write data from vector register vA to main memory;
vstore vA imm: using the immediate imm as the address, write data from vector register vA to main memory;
vstore vA rB: using the value in scalar register rB as the address, write data from vector register vA to main memory;
ⅱ. Vector/scalar register data transfer instructions, comprising:
C. vtos vA rB idx: send element idx of vector register vA into scalar register rB;
D. stov vA rB: make four copies of the value in scalar register rB and send them into vector register vA;
ⅲ. Vector arithmetic logic instructions, comprising:
E. vvvop vD vA vB: perform arithmetic logic operation op on the corresponding elements of vector registers vA and vB, and write the result to vector register vD;
F. vvsop rD vA vB: perform arithmetic logic operation op across all elements of vector registers vA and vB, and write the result to scalar register rD;
G. vsvop vD vA rB: perform arithmetic logic operation op on all elements of vector register vA and scalar register rB, and write the result to vector register vD;
H. vssop rD vA rB: perform arithmetic logic operation op on all elements of vector register vA and scalar register rB, and write the result to scalar register rD;
ⅳ. Vector floating-point instructions, comprising two classes:
E. vvvfop vD vA vB: perform floating-point operation fop on the corresponding elements of vector registers vA and vB, and write the result to vector register vD;
G. vsvfop vD vA rB: perform floating-point operation fop on all elements of vector register vA and scalar register rB, and write the result to vector register vD;
ⅴ. Vector shuffle instruction: vshuffle vC vB rA: according to the value of scalar register rA, reorder the elements of vector register vB and write them to vector register vC.
The invention further provides a vector interleaved multithreaded microprocessor, characterized by comprising one or more vector interleaved multithreaded microprocessor cores. Each vector interleaved multithreaded microprocessor core comprises: N program counters, a multithreaded instruction fetch unit, an instruction cache, N instruction buffer queues, a thread scheduling unit, a scalar/vector decoding unit, a scalar execution pipeline, a vector execution pipeline, and a data cache. According to the current instruction addresses stored in the N program counters, the multithreaded instruction fetch unit reads the instructions of the N threads from the instruction cache in round-robin fashion and sends the fetched instructions into the N instruction buffer queues, which correspond one-to-one to the N threads. The thread scheduling unit selects one of the N instruction buffer queues, takes an instruction out of it, and sends it to the scalar/vector decoding unit for decoding. The decoded instruction is sent to the corresponding scalar execution pipeline or vector execution pipeline for execution; during execution, the scalar execution pipeline or vector execution pipeline accesses data in the data cache.
As a further improvement of the vector interleaved multithreaded microprocessor of the present invention:
The data cache comprises an interconnected level-1 cache and level-2 cache; the scalar execution pipeline is connected to the level-1 cache and accesses data from the level-1 cache, while the vector execution pipeline is connected to the level-2 cache through a vector memory access interface and accesses data directly from the level-2 cache.
The scalar execution pipeline comprises: a scalar register file unit for storing the source operands of scalar instructions, a scalar operand selection unit for selecting source operands from the scalar register file unit, a scalar execution unit for performing operations on the source operands, and a scalar data write-back unit for writing the result back to the scalar register file unit after the operation completes;
The vector execution pipeline comprises: a vector register file unit for storing the source operands of vector instructions; a vector operand selection unit for selecting source operands from the scalar register file unit and/or the vector register file unit, and for transferring operands between the vector execution unit and the scalar execution unit; a vector execution unit for performing operations on the source operands; and a vector data write-back unit for writing the result back to the vector register file unit after the operation completes. The scalar execution unit is connected to the level-1 cache, and the vector execution unit is connected to the level-2 cache.
The vector execution unit comprises: a vector Load/Store unit for executing vector Load/Store instructions or scalar Load/Store instructions, a vector floating-point unit for executing vector floating-point instructions, a vector arithmetic logic unit for executing vector arithmetic logic instructions, and a vector shuffle unit for executing vector shuffle instructions.
The size of a data block in the data cache is identical to the width of the vector register file unit.
The vector interleaved multithreaded microprocessor comprises two or more vector interleaved multithreaded microprocessor cores, each of which is provided with its own interconnected level-1 and level-2 caches. The two or more level-2 caches are interconnected through a crossbar switch with an off-core level-3 cache, an on-chip network interface, and a peripheral interface, and each level-3 cache is connected to a memory controller used to access external memory.
Compared with the prior art, the advantages of the present invention are:
1. The vector interleaved multithreading processing method of the present invention combines interleaved multithreading with vector technology to process the vector instructions or scalar instructions of multiple threads. It fetches instructions from multiple vector threads in round-robin fashion and executes their instructions in an interleaved manner: when one thread encounters a long-latency operation, the instructions of the other threads can continue to execute, so the long-latency operations encountered during a thread's execution are well hidden. The method thus raises the microprocessor's peak performance while keeping the vector execution pipeline full, bringing the microprocessor's arithmetic capability into fuller play.
2. The vector interleaved multithreaded microprocessor of the present invention adds a multithreaded instruction fetch unit and a vector execution pipeline on the basis of a traditional scalar microprocessor core, thereby combining interleaved multithreading with vector technology. It carries out the interleaved parallel processing of multiple vector threads and exploits the data-level parallelism in programs developed with vector technology, giving it strong arithmetic capability.
3. The vector interleaved multithreaded microprocessor of the present invention retains the original scalar processor structure while adding the vector execution pipeline, so it remains fully compatible with traditional scalar application programs; its compatibility is therefore good.
4. The vector interleaved multithreaded microprocessor of the present invention replaces the currently popular superscalar technique with interleaved multithreading, reducing the complexity of the hardware design while maintaining performance; the hardware structure is simple, the cost is low, and the extensibility is good.
Description of drawings
Fig. 1 is a structural diagram of an existing typical conventional microprocessor;
Fig. 2 is an execution flow diagram of an existing typical scalar microprocessor;
Fig. 3 is a structural diagram of the vector interleaved multithreaded microprocessor of the present invention;
Fig. 4 is an execution flow diagram of the vector interleaved multithreading processing method of the present invention;
Fig. 5 is a schematic diagram of how interleaved multithreaded execution hides long-latency operations in the vector interleaved multithreading processing method of the present invention;
Fig. 6 is a schematic diagram of the implementation of the vector shuffle instruction vshuffle vA vB rA of the present invention;
Fig. 7 is a diagram of the connection structure between the scalar execution unit, the vector execution unit, and the data cache of the present invention;
Fig. 8 is a structural diagram of a multi-core processor composed of the vector interleaved multithreaded microprocessor cores of the present invention.
Embodiment
The present invention is described in further detail below with reference to the accompanying drawings and a specific embodiment.
As shown in Fig. 4, the vector interleaved multithreading processing method of the present invention comprises the following steps:
1) Instruction fetch: the multithreaded instruction fetch unit selects, in round-robin fashion, one vector thread out of 8 vector threads for instruction fetching, and stores the fetched instructions into the instruction buffer queue corresponding to that vector thread. The instruction fetch unit can read one cache block's worth of instructions into the instruction buffer queue at a time. The length of an instruction is generally one machine word, so if the size of each level-1 cache block is 4 machine words, the instruction fetch unit can read 4 instructions into the instruction buffer queue per clock cycle. In practice, the number of vector threads N can be any natural number greater than 1, limited only by hardware resources; generally N is taken to be a power of 2, i.e., N = 2^n, n = 1, 2, 3, …, which helps simplify the hardware design.
2) Thread selection: the thread scheduling unit selects one instruction buffer queue out of the 8 instruction buffer queues, takes an instruction out of it, and decodes the instruction;
3) Instruction execution: the decoded instruction is sent to the vector execution pipeline or the scalar execution pipeline for execution. The specific execution steps are as follows:
3.1) Operand selection: according to the content of the decoded instruction, the vector register file unit or the scalar register file unit is accessed to obtain the source operands, which are delivered to the corresponding vector execution unit or scalar execution unit;
3.2) Instruction execution: the vector execution unit or scalar execution unit performs the operation on the source operands, and the operation result is written back to the vector register file unit or scalar register file unit respectively.
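Steps 1) through 3.2) can be sketched as a cycle-by-cycle software model. This is only an illustrative sketch: the function names, queue sizes, and the "mnemonic starts with v" dispatch rule are assumptions for the example, not details taken from the patent.

```python
from collections import deque

def run_cycles(programs, cycles):
    """Toy model of round-robin fetch plus per-thread instruction queues.

    programs: one list of instruction mnemonics per vector thread.
    Returns the order in which instructions were decoded and dispatched.
    """
    pcs = [0] * len(programs)                # one program counter per thread
    queues = [deque() for _ in programs]     # N instruction buffer queues
    executed = []
    fetch_t = sched_t = 0
    for _ in range(cycles):
        # 1) round-robin fetch: one thread per cycle reads its next instruction
        t = fetch_t % len(programs)
        if pcs[t] < len(programs[t]):
            queues[t].append(programs[t][pcs[t]])
            pcs[t] += 1
        fetch_t += 1
        # 2)+3) thread selection: pick a non-empty queue, decode, dispatch
        for _ in range(len(programs)):
            s = sched_t % len(programs)
            sched_t += 1
            if queues[s]:
                insn = queues[s].popleft()
                # vector mnemonics go to the vector pipeline, others to scalar
                pipe = "vector" if insn.startswith("v") else "scalar"
                executed.append((s, insn, pipe))
                break
    return executed

trace = run_cycles([["vload", "vvvop"], ["add", "vstore"]], cycles=4)
```

With two threads, the trace alternates between them, and each instruction lands in the pipeline matching its type.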
In the above method, an instruction is a scalar instruction or a vector instruction. Scalar instructions are the existing general-purpose scalar instructions, and the vector instructions can comprise the following five classes:
ⅰ. Vector memory access instructions, comprising:
A. Vector load instructions, which read a vector from main memory into a vector register:
vload vA rB: using the value in scalar register rB as the address, read data into vector register vA;
vload vA rB imm: using the value in scalar register rB plus the immediate imm as the address, read data into vector register vA;
vload vA imm: using the immediate imm as the address, read data into vector register vA;
B. Vector store instructions, which write the contents of a vector register back to main memory:
vstore vA rB imm: using the value in scalar register rB plus the immediate imm as the address, write data from vector register vA to main memory;
vstore vA imm: using the immediate imm as the address, write data from vector register vA to main memory;
vstore vA rB: using the value in scalar register rB as the address, write data from vector register vA to main memory;
The vector load and vector store instructions can support multiple different addressing modes, including register addressing, immediate addressing, and base-plus-offset addressing; a specific implementation may realize all of these addressing modes, or only one or several of them;
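The three addressing modes can be made concrete with a small sketch. The 4-element vector length, the register/memory model, and the keyword-argument encoding are assumptions for the example, not taken from the patent.

```python
VLEN = 4  # assumed vector length, matching the 4-element registers elsewhere

def vload(vreg, sreg, mem, vA, rB=None, imm=None):
    """vload vA rB / vload vA rB imm / vload vA imm."""
    if rB is not None and imm is not None:
        addr = sreg[rB] + imm          # base-plus-offset addressing
    elif rB is not None:
        addr = sreg[rB]                # register addressing
    else:
        addr = imm                     # immediate addressing
    vreg[vA] = mem[addr:addr + VLEN]

def vstore(vreg, sreg, mem, vA, rB=None, imm=None):
    """vstore mirrors vload but writes the vector back to main memory."""
    if rB is not None and imm is not None:
        addr = sreg[rB] + imm
    elif rB is not None:
        addr = sreg[rB]
    else:
        addr = imm
    mem[addr:addr + VLEN] = vreg[vA]

mem = list(range(16))
sreg = {"r1": 4}
vreg = {}
vload(vreg, sreg, mem, "v0", rB="r1")          # register addressing
vload(vreg, sreg, mem, "v1", rB="r1", imm=4)   # base plus immediate
vstore(vreg, sreg, mem, "v0", imm=0)           # immediate addressing
```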
ⅱ. Vector/scalar register data transfer instructions, comprising:
C. vtos vA rB idx: send element idx of vector register vA into scalar register rB; this is the instruction for transferring data from a vector register to a scalar register;
D. stov vA rB: make four copies of the value in scalar register rB and send them into vector register vA; this is the instruction for transferring data from a scalar register to a vector register;
An instruction that transfers data from a scalar register to a vector register places the value of the scalar register into a particular element of the vector register, or, according to a mask, makes several copies of the scalar register's value and assigns them simultaneously to multiple elements of the vector register; an instruction that transfers data from a vector register to a scalar register assigns the value of a particular element of the vector register to a scalar register;
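The semantics of the two transfer instructions can be sketched as follows. The optional per-element mask parameter generalizes the patent's masked-broadcast description and is an assumption of this sketch, as are all function names.

```python
VLEN = 4  # stov makes four copies, suggesting 4-element vector registers

def vtos(vreg, sreg, vA, rB, idx):
    """vtos vA rB idx: element idx of vector register vA -> scalar rB."""
    sreg[rB] = vreg[vA][idx]

def stov(vreg, sreg, vA, rB, mask=0b1111):
    """stov vA rB: replicate scalar rB into vA.

    The mask (an assumption generalizing the patent's description) selects
    which elements are overwritten; unmasked elements are left unchanged.
    """
    for i in range(VLEN):
        if mask & (1 << i):
            vreg[vA][i] = sreg[rB]

vreg = {"v0": [10, 20, 30, 40], "v1": [0, 0, 0, 0]}
sreg = {"r1": 7, "r2": 0}
vtos(vreg, sreg, "v0", "r2", 2)       # r2 receives element 2 of v0
stov(vreg, sreg, "v1", "r1")          # full broadcast: v1 = [7, 7, 7, 7]
stov(vreg, sreg, "v0", "r1", 0b0101)  # masked: only elements 0 and 2 written
```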
ⅲ. Vector arithmetic logic instructions, comprising four classes:
E. vvvop vD vA vB: perform arithmetic logic operation op on the corresponding elements of vector registers vA and vB, and write the result to vector register vD. This is a vector-vector-vector instruction: both source operands and the destination operand are vector registers, and this class of instruction mainly performs element-wise arithmetic logic operations between vectors;
F. vvsop rD vA vB: perform arithmetic logic operation op across all elements of vector registers vA and vB, and write the result to scalar register rD. This is a vector-vector-scalar instruction: both source operands are vector registers and the destination operand is a scalar register, and this class of instruction mainly performs reduction operations such as summing, AND-ing, or OR-ing all the elements of the vector registers;
G. vsvop vD vA rB: perform arithmetic logic operation op on all elements of vector register vA and scalar register rB, and write the result to vector register vD;
H. vssop rD vA rB: perform arithmetic logic operation op on all elements of vector register vA and scalar register rB, and write the result to scalar register rD;
vsvop vD vA rB and vssop rD vA rB are vector-scalar instructions: the source operands are a vector register and a scalar register, the destination operand is a vector register for vsvop and a scalar register for vssop, and this class of instruction performs arithmetic logic operations between each element of the vector and the scalar;
The operations supported by the vector arithmetic logic instructions comprise fixed-point arithmetic instructions such as add, subtract, multiply, and divide; logic instructions such as AND, OR, NOT, XOR, negation, and comparison; and bitwise instructions, including bitwise AND, bitwise OR, bitwise NOT, bitwise XOR, and bitwise negation.
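The four operand shapes above can be sketched directly. The reduction order in vssop is an assumption (the patent only states that the result is written to a scalar register), and all function names are illustrative.

```python
from functools import reduce
import operator

def vvvop(op, vA, vB):
    """vvvop: element-wise op between two vectors -> vector."""
    return [op(a, b) for a, b in zip(vA, vB)]

def vvsop(op, vA, vB):
    """vvsop: reduce op across all elements of both vectors -> scalar."""
    return reduce(op, vA + vB)

def vsvop(op, vA, rB):
    """vsvop: op between each vector element and the scalar -> vector."""
    return [op(a, rB) for a in vA]

def vssop(op, vA, rB):
    """vssop: reduce the vector with op, then apply op with the scalar
    -> scalar (this ordering is an assumption of the sketch)."""
    return op(reduce(op, vA), rB)

v_add = vvvop(operator.add, [1, 2, 3, 4], [10, 20, 30, 40])
v_sum = vvsop(operator.add, [1, 2, 3, 4], [10, 20, 30, 40])
v_scl = vsvop(operator.mul, [1, 2, 3, 4], 3)
s_all = vssop(operator.add, [1, 2, 3, 4], 100)
```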
ⅳ. Vector floating-point instructions, comprising two classes:
E. vvvfop vD vA vB: perform floating-point operation fop on the corresponding elements of vector registers vA and vB, and write the result to vector register vD. This is a vector-vector-vector instruction: both source operands and the destination operand are vector registers, and this class of instruction mainly performs element-wise floating-point operations between vectors;
G. vsvfop vD vA rB: perform floating-point operation fop on all elements of vector register vA and scalar register rB, and write the result to vector register vD;
The floating-point operations supported by the vector floating-point instructions mainly comprise floating-point addition, subtraction, multiplication, division, and comparison, as well as floating-point/fixed-point conversion.
ⅴ. Vector shuffle instruction: vshuffle vC vB rA: according to the value of scalar register rA, reorder the elements of vector register vB and write them to vector register vC.
The implementation principle of the vector shuffle instruction is shown in Fig. 6. The values of the four elements of source vector register vB are sent into the destination vector register vA through two layers of multiplexers. The four multiplexers of the first layer are all 4-to-1 multiplexers; their control signals come from the low 8 bits of scalar source register rA (s0 to s3 in the figure), two bits per multiplexer. The four multiplexers of the second layer are 2-to-1 multiplexers that determine whether the value from the source vector register is sent into the destination vector register; their control signals come from bits 8 to 11 of scalar source register rA (mask in the figure). If the corresponding control signal is 1, the value of the corresponding element of the source vector register is sent into the corresponding element of the destination vector register; otherwise, the value of that element of the destination vector register remains unchanged.
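The two-layer multiplexer scheme can be modeled in a few lines. The exact bit layout of rA (four 2-bit selectors in bits 0..7, a 4-bit write mask in bits 8..11) is our reading of the figure description and should be treated as an assumption.

```python
VLEN = 4

def vshuffle(vB, vC, rA):
    """vshuffle vC vB rA, following the two-layer multiplexers of Fig. 6.

    First layer: for element i, bits (2i, 2i+1) of rA form a 2-bit
    selector choosing which element of vB to route (a 4-to-1 mux).
    Second layer: bit (8 + i) of rA is a write-enable (a 2-to-1 mux);
    when 0, the destination element keeps its old value.
    """
    result = list(vC)                    # unmasked elements keep old values
    for i in range(VLEN):
        sel = (rA >> (2 * i)) & 0b11     # first layer: 4-to-1 select
        if (rA >> (8 + i)) & 1:          # second layer: write-enable mask
            result[i] = vB[sel]
    return result

vB = [100, 200, 300, 400]
# selectors s0..s3 = 3, 2, 1, 0 (reverse the vector), all mask bits set
rA = (0b1111 << 8) | (0b00 << 6) | (0b01 << 4) | (0b10 << 2) | 0b11
out = vshuffle(vB, [0, 0, 0, 0], rA)
# mask with only bit 8 set: only element 0 is written (selector 0)
out2 = vshuffle(vB, [7, 7, 7, 7], 0b0001 << 8)
```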
Interleaved execution of multiple threads hides the long-latency operations (such as memory accesses) in the execution pipeline very well, keeping the pipeline as close to fully loaded as possible at all times. Fig. 5 illustrates the principle by which the interleaved multithreaded execution of the vector interleaved multithreaded microprocessor of the present invention hides long-latency operations, taking the interleaved execution of 4 threads as an example to describe how several interleaved vector threads hide memory access latency. In the figure, C denotes a vector arithmetic instruction, M denotes a vector memory access instruction, L_hit is the cache hit latency, and L_miss is the cache miss latency; the four threads execute vector arithmetic instructions and vector memory access instructions in different orders. A vector memory access instruction of thread 0 always introduces a two-cycle latency when executed; without other interleaved threads, this would introduce a two-cycle stall into the execution pipeline. After the interleaved multithreading mechanism is introduced, the vector arithmetic instructions of thread 1 and thread 2 fill the two idle cycles caused by the memory access latency, keeping the pipeline full. The execution of a vector memory access instruction of thread 2 encounters a cache miss, which could cause a very long pipeline stall; while the cache miss is being handled, the vector arithmetic instructions of the other threads fill the idle time of the execution pipeline. Although the pipeline stall is not avoided entirely, its duration is significantly reduced and the arithmetic performance is improved. As the above analysis shows, in general, the more threads the microprocessor supports, the stronger its ability to hide long-latency operations, and the better its performance can be exploited.
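The latency-hiding effect can be demonstrated with a toy cycle-count model in the spirit of Fig. 5. The issue policy and latency values are assumptions for the sketch; the real pipeline is more detailed.

```python
def stall_cycles(threads, mem_latency=2):
    """Count pipeline bubbles when issuing from several threads.

    Each thread is a list of 'C' (single-cycle compute) or 'M' (memory
    access whose result is needed mem_latency cycles later). A thread
    whose last 'M' is still outstanding cannot issue; each cycle the
    first ready thread issues one instruction.
    """
    pcs = [0] * len(threads)
    ready_at = [0] * len(threads)   # cycle from which each thread may issue
    cycle = bubbles = 0
    while any(pc < len(t) for pc, t in zip(pcs, threads)):
        issued = False
        for t in range(len(threads)):
            if pcs[t] < len(threads[t]) and ready_at[t] <= cycle:
                insn = threads[t][pcs[t]]
                pcs[t] += 1
                if insn == "M":
                    ready_at[t] = cycle + 1 + mem_latency
                issued = True
                break
        if not issued:
            bubbles += 1            # no thread ready: a pipeline bubble
        cycle += 1
    return bubbles

# one thread alone stalls after every memory access...
single = stall_cycles([["M", "C", "M", "C"]])
# ...while four interleaved threads fill most of those idle cycles
interleaved = stall_cycles([["M", "C"], ["C", "M"], ["C", "C"], ["M", "C"]])
```

Running the model, the single-thread case spends four cycles stalled while the interleaved case, executing the same mix of instructions, stalls for only two.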
The above vector interleaved multithreading processing method can be realized by the vector interleaved multithreaded microprocessor of the present invention, which comprises one or more vector interleaved multithreaded microprocessor cores. As shown in Fig. 3, a vector interleaved multithreaded microprocessor core comprises: 8 program counters (PC), a multithreaded instruction fetch unit, an instruction cache, 8 instruction buffer queues, a thread scheduling unit, a scalar/vector decoding unit, a scalar execution pipeline, a vector execution pipeline, and a data cache. According to the current instruction addresses stored in the 8 program counters, the multithreaded instruction fetch unit reads the instructions of the 8 threads from the instruction cache in round-robin fashion and sends the fetched instructions into the 8 instruction buffer queues, which correspond one-to-one to the 8 threads. The thread scheduling unit selects one of the 8 instruction buffer queues, takes an instruction out of it, and sends it to the scalar/vector decoding unit for decoding. The decoded instruction is sent to the corresponding scalar execution pipeline or vector execution pipeline for execution (a scalar instruction is sent to the scalar execution pipeline, and a vector instruction is sent to the vector execution pipeline); during execution, the scalar execution pipeline or vector execution pipeline accesses data in the data cache.
In this embodiment the scalar execution pipeline retains the structure of a typical existing scalar pipeline. It comprises: a scalar register file unit for storing the source operands of scalar instructions; a scalar operand selection unit for selecting source operands from the scalar register file unit; a scalar execution unit for performing operations on the source operands; and a scalar data write-back unit for writing results back to the scalar register file unit after an operation completes. The present invention adds a vector execution pipeline, which comprises: a vector register file unit for storing the source operands of vector instructions; a vector operand selection unit for selecting source operands from the scalar register file unit and/or the vector register file unit and for transferring operands between the vector and scalar execution units; a vector execution unit for performing operations on the source operands; and a vector data write-back unit for writing results back to the vector register file unit after an operation completes. The operand selection and transfer step is needed because some instructions (such as vector-scalar-vector instructions) require both vector and scalar operands at the same time.
Because the vector execution pipeline can operate on an entire vector in each clock cycle, its data demand is far greater than that of the scalar execution pipeline, and the relatively small L1 cache usually cannot satisfy the data demands of the scalar and vector execution units at the same time. Therefore, as shown in Figure 7, in this embodiment the data cache comprises an interconnected L1 cache and L2 cache. The scalar execution unit in the scalar execution pipeline is connected to the L1 cache and accesses data through it, while the vector execution unit in the vector execution pipeline is connected to the L2 cache through a vector access interface and accesses data from the L2 cache directly; that is, the vector Load/Store unit bypasses the L1 cache and goes straight to the L2 cache.
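The two data paths can be summarized in a few lines (illustrative only; the port and block widths below are assumed example values, not figures from the patent):

```python
L1_PORT_BITS = 64    # assumed: L1 serves one machine word per access
L2_BLOCK_BITS = 256  # assumed: L2 block matches the vector register width

def access_path(kind):
    """Cache levels traversed by one memory access."""
    if kind == 'scalar':
        return ['L1', 'L2']   # L1 first, L2 only on a miss
    if kind == 'vector':
        return ['L2']         # vector access interface bypasses L1
    raise ValueError(kind)

# A full 256-bit vector would need 4 narrow L1 accesses, but only one L2 access:
assert L2_BLOCK_BITS // L1_PORT_BITS == 4
assert access_path('vector') == ['L2']
```

Under these assumed widths, routing vector traffic straight to the wider L2 both avoids serializing each vector access into several narrow L1 accesses and keeps vector streams from evicting the scalar working set from L1.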
In this embodiment the vector execution unit comprises: a vector Load/Store unit for executing vector or scalar Load/Store instructions; a vector floating-point unit for executing vector floating-point instructions; a vector arithmetic-logic unit for executing vector arithmetic-logic instructions; and a vector shuffle unit for executing vector shuffle instructions. The scalar execution unit adopts the typical existing structure. When the execution pipelines operate, each instruction is sent to the corresponding functional unit: for example, the source operands of a scalar or vector memory-access instruction are sent to the scalar or vector Load/Store unit, and a vector arithmetic instruction is sent to the vector arithmetic-logic unit.
In this embodiment the size of a data block in the data cache equals the width of the vector register file unit. For example, the vector register file unit comprises 8 vector register groups, one per thread. Each group comprises 32 vector registers v0~v31, and each vector register is 4 machine words long; depending on the machine word length, a vector register is therefore 128 bits wide (for 32-bit machine words) or 256 bits wide (for 64-bit machine words).
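The width arithmetic implied by the paragraph above can be checked directly (constant names are illustrative, the numbers come from the text):

```python
WORDS_PER_VREG = 4       # each of v0..v31 is 4 machine words long
VREGS_PER_GROUP = 32
THREAD_GROUPS = 8        # one vector register group per thread

def vector_register_bits(word_bits):
    """Vector register width = 4 machine words; this also sizes the cache block."""
    return WORDS_PER_VREG * word_bits

assert vector_register_bits(32) == 128   # 32-bit machine word
assert vector_register_bits(64) == 256   # 64-bit machine word

# Total vector register state per core, assuming 64-bit machine words:
total_bits = THREAD_GROUPS * VREGS_PER_GROUP * vector_register_bits(64)
assert total_bits == 65536               # i.e. 8 KiB of vector registers
```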
When the vector interleaved multithreaded microprocessor of the present invention contains two or more vector interleaved multithreaded microprocessor cores, as shown in Figure 8, each core is provided with its own interconnected L1 and L2 caches; the L2 caches are interconnected through a crossbar switch and connected to the off-core L3 caches, the inter-chip network interface, and the peripheral interface, and each L3 cache is connected to a memory controller used to access external memory. The inter-chip network interface provides high-speed serial interfaces for interconnecting multiple vector interleaved microprocessor chips; the peripheral interface supports multiple peripheral buses, including PCI-E and Gigabit Ethernet. The vector interleaved multithreaded multi-core microprocessor delivers high sustained performance with a comparatively simple hardware structure and can fully exploit the process-level, thread-level, instruction-level, and data-level parallelism in applications; at the same time the processor is compatible with existing scalar applications, giving it good compatibility and extensibility.
The above is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above embodiment; all technical solutions within the spirit of the present invention belong to its protection scope. It should be pointed out that, for those skilled in the art, improvements and modifications made without departing from the principle of the present invention shall also be regarded as falling within the protection scope of the present invention.

Claims (10)

1. A vector interleaved multithread processing method, characterized by comprising the following steps:
1) instruction fetch: a multithread instruction fetch unit selects one vector thread from N vector threads in round-robin order, reads an instruction, and stores the fetched instruction in the instruction queue buffer corresponding to said vector thread;
2) thread selection: a thread scheduling unit selects one instruction queue buffer from the N instruction queue buffers, takes one instruction from said instruction queue buffer, and decodes it;
3) instruction execution: the decoded instruction is sent to a vector execution pipeline or a scalar execution pipeline for execution.
2. The vector interleaved multithread processing method according to claim 1, characterized in that N = 2^n, where n = 1, 2, 3, ...
3. The vector interleaved multithread processing method according to claim 1, characterized in that said step 3) comprises the following steps:
3.1) operand selection: according to the content of the decoded instruction, the vector register file unit or the scalar register file unit is accessed to obtain the source operands, and the obtained source operands are delivered to the corresponding vector execution unit or scalar execution unit;
3.2) instruction execution: said vector execution unit or scalar execution unit performs the operation on said source operands, and the operation result is written back to the vector register file unit or the scalar register file unit, respectively.
4. The vector interleaved multithread processing method according to claim 1, 2 or 3, characterized in that said instruction is a scalar instruction or a vector instruction, said vector instruction comprising the following categories:
ⅰ. vector memory-access instructions, comprising:
A. vector load instructions:
vload vA rB: using the value in scalar register rB as the address, load data into vector register vA;
vload vA rB imm: using the value in scalar register rB plus the immediate imm as the address, load data into vector register vA;
vload vA imm: using the immediate imm as the address, load data into vector register vA;
B. vector store instructions:
vstore vA rB imm: using the value in scalar register rB plus the immediate imm as the address, write data from vector register vA to main memory;
vstore vA imm: using the immediate imm as the address, write data from vector register vA to main memory;
vstore vA rB: using the value in scalar register rB as the address, write data from vector register vA to main memory;
ⅱ. vector/scalar data transfer instructions, comprising:
C. vtos vA rB idx: send element idx of vector register vA into scalar register rB;
D. stov vA rB: replicate the value in scalar register rB four times and send the copies into vector register vA;
ⅲ. vector arithmetic-logic instructions, comprising:
E. vvvop vD vA vB: perform an arithmetic-logic operation op on the corresponding elements of vector registers vA and vB and write the result to vector register vD;
F. vvsop rD vA vB: perform an arithmetic-logic operation op on all elements of vector registers vA and vB and write the result to scalar register rD;
G. vsvop vD vA rB: perform an arithmetic-logic operation op on all elements of vector register vA and scalar register rB and write the result to vector register vD;
H. vssop rD vA rB: perform an arithmetic-logic operation op on all elements of vector register vA and scalar register rB and write the result to scalar register rD;
ⅳ. vector floating-point instructions, comprising two classes:
E. vvvfop vD vA vB: perform a floating-point operation fop on the corresponding elements of vector registers vA and vB and write the result to vector register vD;
G. vsvfop vD vA rB: perform a floating-point operation fop on all elements of vector register vA and scalar register rB and write the result to vector register vD;
ⅴ. vector shuffle instruction: vshuffle vC vB rA: according to the value of scalar register rA, reorder the elements of vector register vB and write them to vector register vC.
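As an illustration only (not part of the claims), the listed semantics can be modeled in a few lines of Python. The flat word-addressed dict memory, the fixed 4-lane vector length, and the 2-bit-per-lane vshuffle selector encoding are assumptions of this sketch; the claim does not fix an encoding.

```python
import operator

VLEN = 4  # lanes per vector register (the 4-machine-word registers)

class VectorMachine:
    """Executable model of the vector instruction semantics listed above."""
    def __init__(self):
        self.r = [0] * 32                         # scalar registers
        self.v = [[0] * VLEN for _ in range(32)]  # vector registers v0..v31
        self.mem = {}                             # flat word-addressed memory

    # i. vector memory access (address already computed from rB and/or imm)
    def vload(self, vA, addr):
        self.v[vA] = [self.mem.get(addr + k, 0) for k in range(VLEN)]

    def vstore(self, vA, addr):
        for k in range(VLEN):
            self.mem[addr + k] = self.v[vA][k]

    # ii. vector/scalar data transfer
    def vtos(self, vA, rB, idx):   # element idx of vA -> scalar rB
        self.r[rB] = self.v[vA][idx]

    def stov(self, vA, rB):        # replicate scalar rB into every lane of vA
        self.v[vA] = [self.r[rB]] * VLEN

    # iii. vector arithmetic/logic (op is any two-argument function)
    def vvvop(self, op, vD, vA, vB):   # lane-wise vA op vB -> vD
        self.v[vD] = [op(a, b) for a, b in zip(self.v[vA], self.v[vB])]

    def vsvop(self, op, vD, vA, rB):   # every lane of vA op scalar rB -> vD
        self.v[vD] = [op(a, self.r[rB]) for a in self.v[vA]]

    # v. vector shuffle: reorder the lanes of vB as directed by rA
    def vshuffle(self, vC, vB, rA):
        sel = [(self.r[rA] >> (2 * k)) & 3 for k in range(VLEN)]  # assumed encoding
        self.v[vC] = [self.v[vB][s] for s in sel]

m = VectorMachine()
m.mem.update({100: 1, 101: 2, 102: 3, 103: 4})
m.r[1] = 100
m.vload(0, m.r[1])                   # vload v0 rB: v0 = [1, 2, 3, 4]
m.r[2] = 10
m.stov(1, 2)                         # stov v1 rB: v1 = [10, 10, 10, 10]
m.vvvop(operator.add, 2, 0, 1)       # vvvop v2 v0 v1
assert m.v[2] == [11, 12, 13, 14]
m.r[4] = 0b00011011                  # lane selectors 3, 2, 1, 0 -> reverse
m.vshuffle(5, 0, 4)
assert m.v[5] == [4, 3, 2, 1]
```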
5. A vector interleaved multithreaded microprocessor, characterized by comprising one or more vector interleaved multithreaded microprocessor cores, each said core comprising: N program counters, a multithread instruction fetch unit, an instruction cache, N instruction queue buffers, a thread scheduling unit, a scalar/vector decoding unit, a scalar execution pipeline, a vector execution pipeline, and a data cache; according to the current instruction addresses stored in said N program counters, said multithread instruction fetch unit reads the instructions of said N threads from said instruction cache in round-robin order and places the fetched instructions into the N instruction queue buffers corresponding one-to-one to said N threads; said thread scheduling unit selects one of said N instruction queue buffers, takes one instruction from it, and sends it to said scalar/vector decoding unit for decoding; the decoded instruction is sent to the corresponding scalar execution pipeline or vector execution pipeline for execution; during execution, said scalar execution pipeline and vector execution pipeline access data through said data cache.
6. The vector interleaved multithreaded microprocessor according to claim 5, characterized in that said data cache comprises an interconnected L1 cache and L2 cache; said scalar execution pipeline is connected to said L1 cache and accesses data through it, and said vector execution pipeline is connected to said L2 cache through a vector access interface and accesses data from the L2 cache directly.
7. The vector interleaved multithreaded microprocessor according to claim 6, characterized in that said scalar execution pipeline comprises: a scalar register file unit for storing the source operands of scalar instructions, a scalar operand selection unit for selecting source operands from said scalar register file unit, a scalar execution unit for performing operations on said source operands, and a scalar data write-back unit for writing the results back to the scalar register file unit after an operation completes;
said vector execution pipeline comprises: a vector register file unit for storing the source operands of vector instructions, a vector operand selection unit for selecting source operands from said scalar register file unit and/or said vector register file unit and for transferring operands between said vector execution unit and scalar execution unit, a vector execution unit for performing operations on said source operands, and a vector data write-back unit for writing the results back to the vector register file unit after an operation completes; said scalar execution unit is connected to said L1 cache, and said vector execution unit is connected to said L2 cache.
8. The vector interleaved multithreaded microprocessor according to claim 7, characterized in that said vector execution unit comprises: a vector Load/Store unit for executing vector or scalar Load/Store instructions, a vector floating-point unit for executing vector floating-point instructions, a vector arithmetic-logic unit for executing vector arithmetic-logic instructions, and a vector shuffle unit for executing vector shuffle instructions.
9. The vector interleaved multithreaded microprocessor according to claim 5, 6, 7 or 8, characterized in that the size of a data block in said data cache equals the width of said vector register file unit.
10. The vector interleaved multithreaded microprocessor according to claim 5, 6, 7 or 8, characterized in that said vector interleaved multithreaded microprocessor comprises two or more vector interleaved multithreaded microprocessor cores; each of said two or more cores is provided with its own interconnected L1 cache and L2 cache; said two or more L2 caches are interconnected through a crossbar switch and connected to the off-core L3 caches, the inter-chip network interface, and the peripheral interface, and each said L3 cache is connected to a memory controller used to access external memory.
CN2011101138829A 2011-05-04 2011-05-04 Vector crossing multithread processing method and vector crossing multithread microprocessor Pending CN102156637A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011101138829A CN102156637A (en) 2011-05-04 2011-05-04 Vector crossing multithread processing method and vector crossing multithread microprocessor


Publications (1)

Publication Number Publication Date
CN102156637A true CN102156637A (en) 2011-08-17

Family

ID=44438145

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011101138829A Pending CN102156637A (en) 2011-05-04 2011-05-04 Vector crossing multithread processing method and vector crossing multithread microprocessor

Country Status (1)

Country Link
CN (1) CN102156637A (en)




Similar Documents

Publication Publication Date Title
CN102156637A (en) Vector crossing multithread processing method and vector crossing multithread microprocessor
Ipek et al. Core fusion: accommodating software diversity in chip multiprocessors
CN106648554B (en) For improving system, the method and apparatus of the handling capacity in continuous transactional memory area
CN102004719B (en) Very long instruction word processor structure supporting simultaneous multithreading
CN102750133B (en) 32-Bit triple-emission digital signal processor supporting SIMD
CN104050023B (en) System and method for implementing transactional memory
EP3716056B1 (en) Apparatus and method for program order queue (poq) to manage data dependencies in processor having multiple instruction queues
US11275637B2 (en) Aggregated page fault signaling and handling
US10095623B2 (en) Hardware apparatuses and methods to control access to a multiple bank data cache
US9904553B2 (en) Method and apparatus for implementing dynamic port binding within a reservation station
CN105453030B (en) Processors, methods and systems for mode-dependent partial-width loads to wider registers
CN103365627A (en) System and method of data forwarding within an execution unit
US10275242B2 (en) System and method for real time instruction tracing
KR20220151134A (en) Apparatus and method for adaptively scheduling work on heterogeneous processing resources
CN104216681B (en) A CPU instruction processing method and processor
CN100451951C (en) 5+3-stage pipeline structure and method in a RISC CPU
US20140129805A1 (en) Execution pipeline power reduction
CN101266559A (en) Configurable microprocessor and method for dividing a single microprocessor core into multiple cores
CN108351780A (en) Processors, methods, systems and instructions for pairwise swapping of contiguous data elements
CN105183697B (en) Embedded RISC DSP processor system and construction method
EP3757772A1 (en) System, apparatus and method for a hybrid reservation station for a processor
Omondi The microarchitecture of pipelined and superscalar computers
CN103218207A (en) Microprocessor instruction processing method and system based on a single/dual-issue instruction set
CN202720631U (en) Microprocessor instruction processing system based on a single/dual-issue instruction set
CN105843589B (en) A memory device for VLIW-type processors

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication (Application publication date: 20110817)