CN102156637A - Vector crossing multithread processing method and vector crossing multithread microprocessor - Google Patents

Vector crossing multithread processing method and vector crossing multithread microprocessor

Info

Publication number
CN102156637A
CN102156637A CN2011101138829A CN201110113882A
Authority
CN
China
Prior art keywords
vector
instruction
scalar
vectorial
unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2011101138829A
Other languages
Chinese (zh)
Inventor
杨学军
徐炜遐
窦强
王永文
高军
邓让钰
衣晓飞
郭御风
唐遇星
黎铁军
吴俊杰
曾坤
晏小波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN2011101138829A priority Critical patent/CN102156637A/en
Publication of CN102156637A publication Critical patent/CN102156637A/en
Pending legal-status Critical Current

Landscapes

  • Advance Control (AREA)

Abstract

The invention discloses a vector interleaved multithreading processing method and a vector interleaved multithreaded microprocessor. The processing method comprises the following steps: a multithreaded instruction fetch unit selects one vector thread out of N vector threads, reads an instruction, and stores the fetched instruction into the instruction buffer queue corresponding to that vector thread; a thread scheduling unit selects one of the N instruction buffer queues, takes an instruction out of it, and decodes the instruction; and the decoded instruction is sent to a vector execution pipeline or a scalar execution pipeline for execution. The method can be realized in hardware by the vector interleaved multithreaded microprocessor. The method and the microprocessor combine vector processing with multithreading, and have the advantages of a simple hardware structure, strong arithmetic capability, and excellent compatibility and extensibility.

Description

Vector interleaved multithreading processing method and vector interleaved multithreaded microprocessor
Technical field
The present invention relates to the field of computer microprocessors, and in particular to a multithreaded microprocessor.
Background technology
The rapid development of the computer field places ever-increasing demands on the processing power of microprocessors. There are two main approaches to improving a microprocessor's arithmetic capability: the first is to improve the arithmetic capability of a single processor core; the second is to integrate multiple processor cores into one microprocessor chip, commonly known as multi-core technology.
1. Improving the arithmetic capability of a processor core. Traditional methods rely mainly on raising the core's clock frequency and on superscalar techniques with wider instruction issue. Constrained by process technology, power consumption, reliability, and other factors, frequency scaling has hit a bottleneck, and the instruction issue width is also difficult to widen further. Attention has therefore gradually shifted to other new microprocessor architecture techniques that make full use of the ever-growing on-chip hardware resources, so that the performance of the microprocessor core can continue to improve.
Fig. 1 shows the typical structure of a traditional scalar microprocessor core. It mainly comprises a program counter, an instruction fetch unit, an instruction cache, a decoding unit, and a scalar execution pipeline. The scalar execution pipeline mainly comprises a register file unit, a data cache, scalar execution units (a Load/Store unit, a scalar floating-point unit, and a scalar arithmetic logic unit), and a data write-back unit. The typical execution flow of a scalar application program on a conventional microprocessor is shown in Fig. 2: the instruction fetch unit sends a memory access request to the instruction cache according to the program counter and obtains an instruction; the instruction fetch unit then sends the ready instruction to the decoding unit for decoding; according to the decode result, the instruction enters the scalar execution pipeline and begins execution. During execution, the scalar execution pipeline accesses the register file unit according to the decode result to obtain the instruction's source operands and sends them to the appropriate functional unit, which carries out the instruction's function; finally, the data write-back unit writes the instruction's final result back to the register file.
In the evolution of microprocessor architecture, interleaved multithreading and vector techniques have both appeared. Interleaved multithreading means that a microprocessor maintains the status registers and related state of several scalar threads simultaneously, and the execution pipeline alternately executes scalar instructions from different threads. The defining feature of vector technology is that one instruction can process multiple scalar data items, so vector processors usually have a high peak performance. Interleaved multithreading lets a microprocessor hide long-latency operations effectively and keep the execution pipeline full; however, because multiple threads share one set of arithmetic units, single-thread performance suffers, and the technique cannot raise the microprocessor's peak performance. Vector technology can greatly raise peak performance, but it is rather sensitive to the memory access latency caused by cache misses and has difficulty turning peak performance into sustained performance. If the two could be combined organically, it might be possible to design a high-performance microprocessor that possesses both high peak performance and high sustained performance.
2. Multi-core technology integrates multiple lower-complexity processor cores onto one chip, making better use of on-chip hardware resources to improve microprocessor performance. In recent years, as multi-core technology has developed, the number of cores integrated on a single chip has grown, while the arithmetic capability of an individual core has not improved significantly and in some designs has even declined noticeably. To exploit the performance of multi-core processors more fully, attention has increasingly focused on the development of process-level and thread-level parallelism, while research on instruction-level and data-level parallelism has been neglected. How to balance the number of integrated cores against the arithmetic capability of each core, and how to combine process-level, thread-level, instruction-level, and data-level parallelism to further improve microprocessor performance, are major issues in microprocessor architecture design.
Summary of the invention
The technical problem to be solved by this invention is as follows: in view of the problems of the prior art, the invention provides a vector interleaved multithreading processing method that combines vector processing with multithreading and can improve processor performance, and a vector interleaved multithreaded microprocessor with a simple hardware structure, strong arithmetic capability, and good compatibility and extensibility.
To solve the above technical problems, the present invention adopts the following technical solution:
A vector interleaved multithreading processing method, characterized by comprising the following steps:
1) Instruction fetch: a multithreaded instruction fetch unit selects, in round-robin fashion, one vector thread out of N vector threads for instruction fetching, and stores the fetched instruction into the instruction buffer queue corresponding to that vector thread;
2) Thread selection: a thread scheduling unit selects one instruction buffer queue out of the N instruction buffer queues, takes an instruction out of it, and decodes the instruction;
3) Instruction execution: the decoded instruction is sent to the vector execution pipeline or the scalar execution pipeline for execution.
As a further improvement of the method of the present invention:
N = 2^n, where n = 1, 2, 3, …
Step 3) specifically comprises the following steps:
3.1) Operand selection: according to the content of the decoded instruction, the vector register file unit or the scalar register file unit is accessed to obtain the source operands, which are delivered to the corresponding vector execution unit or scalar execution unit;
3.2) Instruction execution: the vector execution unit or scalar execution unit performs the operation on the source operands, and the operation result is written back to the vector register file unit or scalar register file unit respectively.
The instruction is a scalar instruction or a vector instruction, and the vector instructions comprise the following classes:
ⅰ. Vector memory access instructions, comprising:
A. Vector load instructions:
vload vA rB: using the value in scalar register rB as the address, read data into vector register vA;
vload vA rB imm: using the value in scalar register rB plus the immediate imm as the address, read data into vector register vA;
vload vA imm: using the immediate imm as the address, read data into vector register vA;
B. Vector store instructions:
vstore vA rB imm: using the value in scalar register rB plus the immediate imm as the address, write data from vector register vA to main memory;
vstore vA imm: using the immediate imm as the address, write data from vector register vA to main memory;
vstore vA rB: using the value in scalar register rB as the address, write data from vector register vA to main memory;
ⅱ. Vector/scalar register data transfer instructions, comprising:
C. vtos vA rB idx: send element idx of vector register vA into scalar register rB;
D. stov vA rB: make four copies of the value in scalar register rB and send them into vector register vA;
ⅲ. Vector arithmetic logic instructions, comprising:
E. vvvop vD vA vB: perform arithmetic logic operation op on the corresponding elements of vector registers vA and vB, and write the result to vector register vD;
F. vvsop rD vA vB: perform arithmetic logic operation op across all elements of vector registers vA and vB, and write the result to scalar register rD;
G. vsvop vD vA rB: perform arithmetic logic operation op on all elements of vector register vA and scalar register rB, and write the result to vector register vD;
H. vssop rD vA rB: perform arithmetic logic operation op on all elements of vector register vA and scalar register rB, and write the result to scalar register rD;
ⅳ. Vector floating-point instructions, comprising two classes:
E. vvvfop vD vA vB: perform floating-point operation fop on the corresponding elements of vector registers vA and vB, and write the result to vector register vD;
G. vsvfop vD vA rB: perform floating-point operation fop on all elements of vector register vA and scalar register rB, and write the result to vector register vD;
ⅴ. Vector shuffle instruction: vshuffle vC vB rA: according to the value of scalar register rA, reorder the elements of vector register vB and write them to vector register vC.
The invention further provides a vector interleaved multithreaded microprocessor, characterized by comprising one or more vector interleaved multithreaded microprocessor cores. Each vector interleaved multithreaded microprocessor core comprises: N program counters, a multithreaded instruction fetch unit, an instruction cache, N instruction buffer queues, a thread scheduling unit, a scalar/vector decoding unit, a scalar execution pipeline, a vector execution pipeline, and a data cache. According to the current instruction addresses stored in the N program counters, the multithreaded instruction fetch unit reads the instructions of the N threads from the instruction cache in round-robin fashion and sends the fetched instructions into the N instruction buffer queues, which correspond one-to-one to the N threads. The thread scheduling unit selects one of the N instruction buffer queues, takes an instruction out of it, and sends it to the scalar/vector decoding unit for decoding. The decoded instruction is sent to the corresponding scalar execution pipeline or vector execution pipeline for execution; during execution, the scalar execution pipeline or vector execution pipeline accesses data in the data cache.
As a further improvement of the vector interleaved multithreaded microprocessor of the present invention:
The data cache comprises an interconnected level-1 cache and level-2 cache; the scalar execution pipeline is connected to the level-1 cache and accesses data from the level-1 cache, while the vector execution pipeline is connected to the level-2 cache through a vector memory access interface and accesses data directly from the level-2 cache.
The scalar execution pipeline comprises: a scalar register file unit for storing the source operands of scalar instructions, a scalar operand selection unit for selecting source operands from the scalar register file unit, a scalar execution unit for performing operations on the source operands, and a scalar data write-back unit for writing the result back to the scalar register file unit after the operation completes;
The vector execution pipeline comprises: a vector register file unit for storing the source operands of vector instructions; a vector operand selection unit for selecting source operands from the scalar register file unit and/or the vector register file unit, and for transferring operands between the vector execution unit and the scalar execution unit; a vector execution unit for performing operations on the source operands; and a vector data write-back unit for writing the result back to the vector register file unit after the operation completes. The scalar execution unit is connected to the level-1 cache, and the vector execution unit is connected to the level-2 cache.
The vector execution unit comprises: a vector Load/Store unit for executing vector Load/Store instructions or scalar Load/Store instructions, a vector floating-point unit for executing vector floating-point instructions, a vector arithmetic logic unit for executing vector arithmetic logic instructions, and a vector shuffle unit for executing vector shuffle instructions.
The size of a data block in the data cache is identical to the width of the vector register file unit.
The vector interleaved multithreaded microprocessor comprises two or more vector interleaved multithreaded microprocessor cores, each of which is provided with its own interconnected level-1 and level-2 caches. The two or more level-2 caches are interconnected through a crossbar switch with an off-core level-3 cache, an on-chip network interface, and a peripheral interface, and each level-3 cache is connected to a memory controller used to access external memory.
Compared with the prior art, the advantages of the present invention are:
1. The vector interleaved multithreading processing method of the present invention combines interleaved multithreading with vector technology to process the vector instructions or scalar instructions of multiple threads. It fetches instructions from multiple vector threads in round-robin fashion and executes their instructions in an interleaved manner: when one thread encounters a long-latency operation, the instructions of the other threads can continue to execute, so the long-latency operations encountered during a thread's execution are well hidden. The method thus raises the microprocessor's peak performance while keeping the vector execution pipeline full, bringing the microprocessor's arithmetic capability into fuller play.
2. The vector interleaved multithreaded microprocessor of the present invention adds a multithreaded instruction fetch unit and a vector execution pipeline on the basis of a traditional scalar microprocessor core, thereby combining interleaved multithreading with vector technology. It carries out the interleaved parallel processing of multiple vector threads and exploits the data-level parallelism in programs developed with vector technology, giving it strong arithmetic capability.
3. The vector interleaved multithreaded microprocessor of the present invention retains the original scalar processor structure while adding the vector execution pipeline, so it remains fully compatible with traditional scalar application programs; its compatibility is therefore good.
4. The vector interleaved multithreaded microprocessor of the present invention replaces the currently popular superscalar technique with interleaved multithreading, reducing the complexity of the hardware design while maintaining performance; the hardware structure is simple, the cost is low, and the extensibility is good.
Description of drawings
Fig. 1 is a structural diagram of an existing typical conventional microprocessor;
Fig. 2 is an execution flow diagram of an existing typical scalar microprocessor;
Fig. 3 is a structural diagram of the vector interleaved multithreaded microprocessor of the present invention;
Fig. 4 is an execution flow diagram of the vector interleaved multithreading processing method of the present invention;
Fig. 5 is a schematic diagram of how interleaved multithreaded execution hides long-latency operations in the vector interleaved multithreading processing method of the present invention;
Fig. 6 is a schematic diagram of the implementation of the vector shuffle instruction vshuffle vA vB rA of the present invention;
Fig. 7 is a diagram of the connection structure between the scalar execution unit, the vector execution unit, and the data cache of the present invention;
Fig. 8 is a structural diagram of a multi-core processor composed of the vector interleaved multithreaded microprocessor cores of the present invention.
Embodiment
The present invention is described in further detail below with reference to the accompanying drawings and a specific embodiment.
As shown in Fig. 4, the vector interleaved multithreading processing method of the present invention comprises the following steps:
1) Instruction fetch: the multithreaded instruction fetch unit selects, in round-robin fashion, one vector thread out of 8 vector threads for instruction fetching, and stores the fetched instructions into the instruction buffer queue corresponding to that vector thread. The instruction fetch unit can read one cache block's worth of instructions into the instruction buffer queue at a time. The length of an instruction is generally one machine word, so if the size of each level-1 cache block is 4 machine words, the instruction fetch unit can read 4 instructions into the instruction buffer queue per clock cycle. In practice, the number of vector threads N can be any natural number greater than 1, limited only by hardware resources; generally N is taken to be a power of 2, i.e., N = 2^n, n = 1, 2, 3, …, which helps simplify the hardware design.
2) Thread selection: the thread scheduling unit selects one instruction buffer queue out of the 8 instruction buffer queues, takes an instruction out of it, and decodes the instruction;
3) Instruction execution: the decoded instruction is sent to the vector execution pipeline or the scalar execution pipeline for execution. The specific execution steps are as follows:
3.1) Operand selection: according to the content of the decoded instruction, the vector register file unit or the scalar register file unit is accessed to obtain the source operands, which are delivered to the corresponding vector execution unit or scalar execution unit;
3.2) Instruction execution: the vector execution unit or scalar execution unit performs the operation on the source operands, and the operation result is written back to the vector register file unit or scalar register file unit respectively.
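Steps 1) through 3.2) can be sketched as a cycle-by-cycle software model. This is only an illustrative sketch: the function names, queue sizes, and the "mnemonic starts with v" dispatch rule are assumptions for the example, not details taken from the patent.

```python
from collections import deque

def run_cycles(programs, cycles):
    """Toy model of round-robin fetch plus per-thread instruction queues.

    programs: one list of instruction mnemonics per vector thread.
    Returns the order in which instructions were decoded and dispatched.
    """
    pcs = [0] * len(programs)                # one program counter per thread
    queues = [deque() for _ in programs]     # N instruction buffer queues
    executed = []
    fetch_t = sched_t = 0
    for _ in range(cycles):
        # 1) round-robin fetch: one thread per cycle reads its next instruction
        t = fetch_t % len(programs)
        if pcs[t] < len(programs[t]):
            queues[t].append(programs[t][pcs[t]])
            pcs[t] += 1
        fetch_t += 1
        # 2)+3) thread selection: pick a non-empty queue, decode, dispatch
        for _ in range(len(programs)):
            s = sched_t % len(programs)
            sched_t += 1
            if queues[s]:
                insn = queues[s].popleft()
                # vector mnemonics go to the vector pipeline, others to scalar
                pipe = "vector" if insn.startswith("v") else "scalar"
                executed.append((s, insn, pipe))
                break
    return executed

trace = run_cycles([["vload", "vvvop"], ["add", "vstore"]], cycles=4)
```

With two threads, the trace alternates between them, and each instruction lands in the pipeline matching its type.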
In the above method, an instruction is a scalar instruction or a vector instruction. Scalar instructions are the existing general-purpose scalar instructions, and the vector instructions can comprise the following five classes:
ⅰ. Vector memory access instructions, comprising:
A. Vector load instructions, which read a vector from main memory into a vector register:
vload vA rB: using the value in scalar register rB as the address, read data into vector register vA;
vload vA rB imm: using the value in scalar register rB plus the immediate imm as the address, read data into vector register vA;
vload vA imm: using the immediate imm as the address, read data into vector register vA;
B. Vector store instructions, which write the contents of a vector register back to main memory:
vstore vA rB imm: using the value in scalar register rB plus the immediate imm as the address, write data from vector register vA to main memory;
vstore vA imm: using the immediate imm as the address, write data from vector register vA to main memory;
vstore vA rB: using the value in scalar register rB as the address, write data from vector register vA to main memory;
The vector load and vector store instructions can support multiple different addressing modes, including register addressing, immediate addressing, and base-plus-offset addressing; a specific implementation may realize all of these addressing modes, or only one or several of them;
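The three addressing modes can be made concrete with a small sketch. The 4-element vector length, the register/memory model, and the keyword-argument encoding are assumptions for the example, not taken from the patent.

```python
VLEN = 4  # assumed vector length, matching the 4-element registers elsewhere

def vload(vreg, sreg, mem, vA, rB=None, imm=None):
    """vload vA rB / vload vA rB imm / vload vA imm."""
    if rB is not None and imm is not None:
        addr = sreg[rB] + imm          # base-plus-offset addressing
    elif rB is not None:
        addr = sreg[rB]                # register addressing
    else:
        addr = imm                     # immediate addressing
    vreg[vA] = mem[addr:addr + VLEN]

def vstore(vreg, sreg, mem, vA, rB=None, imm=None):
    """vstore mirrors vload but writes the vector back to main memory."""
    if rB is not None and imm is not None:
        addr = sreg[rB] + imm
    elif rB is not None:
        addr = sreg[rB]
    else:
        addr = imm
    mem[addr:addr + VLEN] = vreg[vA]

mem = list(range(16))
sreg = {"r1": 4}
vreg = {}
vload(vreg, sreg, mem, "v0", rB="r1")          # register addressing
vload(vreg, sreg, mem, "v1", rB="r1", imm=4)   # base plus immediate
vstore(vreg, sreg, mem, "v0", imm=0)           # immediate addressing
```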
ⅱ. Vector/scalar register data transfer instructions, comprising:
C. vtos vA rB idx: send element idx of vector register vA into scalar register rB; this is the instruction for transferring data from a vector register to a scalar register;
D. stov vA rB: make four copies of the value in scalar register rB and send them into vector register vA; this is the instruction for transferring data from a scalar register to a vector register;
An instruction that transfers data from a scalar register to a vector register places the value of the scalar register into a particular element of the vector register, or, according to a mask, makes several copies of the scalar register's value and assigns them simultaneously to multiple elements of the vector register; an instruction that transfers data from a vector register to a scalar register assigns the value of a particular element of the vector register to a scalar register;
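The semantics of the two transfer instructions can be sketched as follows. The optional per-element mask parameter generalizes the patent's masked-broadcast description and is an assumption of this sketch, as are all function names.

```python
VLEN = 4  # stov makes four copies, suggesting 4-element vector registers

def vtos(vreg, sreg, vA, rB, idx):
    """vtos vA rB idx: element idx of vector register vA -> scalar rB."""
    sreg[rB] = vreg[vA][idx]

def stov(vreg, sreg, vA, rB, mask=0b1111):
    """stov vA rB: replicate scalar rB into vA.

    The mask (an assumption generalizing the patent's description) selects
    which elements are overwritten; unmasked elements are left unchanged.
    """
    for i in range(VLEN):
        if mask & (1 << i):
            vreg[vA][i] = sreg[rB]

vreg = {"v0": [10, 20, 30, 40], "v1": [0, 0, 0, 0]}
sreg = {"r1": 7, "r2": 0}
vtos(vreg, sreg, "v0", "r2", 2)       # r2 receives element 2 of v0
stov(vreg, sreg, "v1", "r1")          # full broadcast: v1 = [7, 7, 7, 7]
stov(vreg, sreg, "v0", "r1", 0b0101)  # masked: only elements 0 and 2 written
```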
ⅲ. Vector arithmetic logic instructions, comprising four classes:
E. vvvop vD vA vB: perform arithmetic logic operation op on the corresponding elements of vector registers vA and vB, and write the result to vector register vD. This is a vector-vector-vector instruction: both source operands and the destination operand are vector registers, and this class of instruction mainly performs element-wise arithmetic logic operations between vectors;
F. vvsop rD vA vB: perform arithmetic logic operation op across all elements of vector registers vA and vB, and write the result to scalar register rD. This is a vector-vector-scalar instruction: both source operands are vector registers and the destination operand is a scalar register, and this class of instruction mainly performs reduction operations such as summing, AND-ing, or OR-ing all the elements of the vector registers;
G. vsvop vD vA rB: perform arithmetic logic operation op on all elements of vector register vA and scalar register rB, and write the result to vector register vD;
H. vssop rD vA rB: perform arithmetic logic operation op on all elements of vector register vA and scalar register rB, and write the result to scalar register rD;
vsvop vD vA rB and vssop rD vA rB are vector-scalar instructions: the source operands are a vector register and a scalar register, the destination operand is a vector register for vsvop and a scalar register for vssop, and this class of instruction performs arithmetic logic operations between each element of the vector and the scalar;
The operations supported by the vector arithmetic logic instructions comprise fixed-point arithmetic instructions such as add, subtract, multiply, and divide; logic instructions such as AND, OR, NOT, XOR, negation, and comparison; and bitwise instructions, including bitwise AND, bitwise OR, bitwise NOT, bitwise XOR, and bitwise negation.
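The four operand shapes above can be sketched directly. The reduction order in vssop is an assumption (the patent only states that the result is written to a scalar register), and all function names are illustrative.

```python
from functools import reduce
import operator

def vvvop(op, vA, vB):
    """vvvop: element-wise op between two vectors -> vector."""
    return [op(a, b) for a, b in zip(vA, vB)]

def vvsop(op, vA, vB):
    """vvsop: reduce op across all elements of both vectors -> scalar."""
    return reduce(op, vA + vB)

def vsvop(op, vA, rB):
    """vsvop: op between each vector element and the scalar -> vector."""
    return [op(a, rB) for a in vA]

def vssop(op, vA, rB):
    """vssop: reduce the vector with op, then apply op with the scalar
    -> scalar (this ordering is an assumption of the sketch)."""
    return op(reduce(op, vA), rB)

v_add = vvvop(operator.add, [1, 2, 3, 4], [10, 20, 30, 40])
v_sum = vvsop(operator.add, [1, 2, 3, 4], [10, 20, 30, 40])
v_scl = vsvop(operator.mul, [1, 2, 3, 4], 3)
s_all = vssop(operator.add, [1, 2, 3, 4], 100)
```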
ⅳ. Vector floating-point instructions, comprising two classes:
E. vvvfop vD vA vB: perform floating-point operation fop on the corresponding elements of vector registers vA and vB, and write the result to vector register vD. This is a vector-vector-vector instruction: both source operands and the destination operand are vector registers, and this class of instruction mainly performs element-wise floating-point operations between vectors;
G. vsvfop vD vA rB: perform floating-point operation fop on all elements of vector register vA and scalar register rB, and write the result to vector register vD;
The floating-point operations supported by the vector floating-point instructions mainly comprise floating-point addition, subtraction, multiplication, division, and comparison, as well as floating-point/fixed-point conversion.
ⅴ. Vector shuffle instruction: vshuffle vC vB rA: according to the value of scalar register rA, reorder the elements of vector register vB and write them to vector register vC.
The implementation principle of the vector shuffle instruction is shown in Fig. 6. The values of the four elements of source vector register vB are sent into the destination vector register vA through two layers of multiplexers. The four multiplexers of the first layer are all 4-to-1 multiplexers; their control signals come from the low 8 bits of scalar source register rA (s0 to s3 in the figure), two bits per multiplexer. The four multiplexers of the second layer are 2-to-1 multiplexers that determine whether the value from the source vector register is sent into the destination vector register; their control signals come from bits 8 to 11 of scalar source register rA (mask in the figure). If the corresponding control signal is 1, the value of the corresponding element of the source vector register is sent into the corresponding element of the destination vector register; otherwise, the value of that element of the destination vector register remains unchanged.
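The two-layer multiplexer scheme can be modeled in a few lines. The exact bit layout of rA (four 2-bit selectors in bits 0..7, a 4-bit write mask in bits 8..11) is our reading of the figure description and should be treated as an assumption.

```python
VLEN = 4

def vshuffle(vB, vC, rA):
    """vshuffle vC vB rA, following the two-layer multiplexers of Fig. 6.

    First layer: for element i, bits (2i, 2i+1) of rA form a 2-bit
    selector choosing which element of vB to route (a 4-to-1 mux).
    Second layer: bit (8 + i) of rA is a write-enable (a 2-to-1 mux);
    when 0, the destination element keeps its old value.
    """
    result = list(vC)                    # unmasked elements keep old values
    for i in range(VLEN):
        sel = (rA >> (2 * i)) & 0b11     # first layer: 4-to-1 select
        if (rA >> (8 + i)) & 1:          # second layer: write-enable mask
            result[i] = vB[sel]
    return result

vB = [100, 200, 300, 400]
# selectors s0..s3 = 3, 2, 1, 0 (reverse the vector), all mask bits set
rA = (0b1111 << 8) | (0b00 << 6) | (0b01 << 4) | (0b10 << 2) | 0b11
out = vshuffle(vB, [0, 0, 0, 0], rA)
# mask with only bit 8 set: only element 0 is written (selector 0)
out2 = vshuffle(vB, [7, 7, 7, 7], 0b0001 << 8)
```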
Interleaved execution of multiple threads hides the long-latency operations (such as memory accesses) in the execution pipeline very well, keeping the pipeline as close to fully loaded as possible at all times. Fig. 5 illustrates the principle by which the interleaved multithreaded execution of the vector interleaved multithreaded microprocessor of the present invention hides long-latency operations, taking the interleaved execution of 4 threads as an example to describe how several interleaved vector threads hide memory access latency. In the figure, C denotes a vector arithmetic instruction, M denotes a vector memory access instruction, L_hit is the cache hit latency, and L_miss is the cache miss latency; the four threads execute vector arithmetic instructions and vector memory access instructions in different orders. A vector memory access instruction of thread 0 always introduces a two-cycle latency when executed; without other interleaved threads, this would introduce a two-cycle stall into the execution pipeline. After the interleaved multithreading mechanism is introduced, the vector arithmetic instructions of thread 1 and thread 2 fill the two idle cycles caused by the memory access latency, keeping the pipeline full. The execution of a vector memory access instruction of thread 2 encounters a cache miss, which could cause a very long pipeline stall; while the cache miss is being handled, the vector arithmetic instructions of the other threads fill the idle time of the execution pipeline. Although the pipeline stall is not avoided entirely, its duration is significantly reduced and the arithmetic performance is improved. As the above analysis shows, in general, the more threads the microprocessor supports, the stronger its ability to hide long-latency operations, and the better its performance can be exploited.
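The latency-hiding effect can be demonstrated with a toy cycle-count model in the spirit of Fig. 5. The issue policy and latency values are assumptions for the sketch; the real pipeline is more detailed.

```python
def stall_cycles(threads, mem_latency=2):
    """Count pipeline bubbles when issuing from several threads.

    Each thread is a list of 'C' (single-cycle compute) or 'M' (memory
    access whose result is needed mem_latency cycles later). A thread
    whose last 'M' is still outstanding cannot issue; each cycle the
    first ready thread issues one instruction.
    """
    pcs = [0] * len(threads)
    ready_at = [0] * len(threads)   # cycle from which each thread may issue
    cycle = bubbles = 0
    while any(pc < len(t) for pc, t in zip(pcs, threads)):
        issued = False
        for t in range(len(threads)):
            if pcs[t] < len(threads[t]) and ready_at[t] <= cycle:
                insn = threads[t][pcs[t]]
                pcs[t] += 1
                if insn == "M":
                    ready_at[t] = cycle + 1 + mem_latency
                issued = True
                break
        if not issued:
            bubbles += 1            # no thread ready: a pipeline bubble
        cycle += 1
    return bubbles

# one thread alone stalls after every memory access...
single = stall_cycles([["M", "C", "M", "C"]])
# ...while four interleaved threads fill most of those idle cycles
interleaved = stall_cycles([["M", "C"], ["C", "M"], ["C", "C"], ["M", "C"]])
```

Running the model, the single-thread case spends four cycles stalled while the interleaved case, executing the same mix of instructions, stalls for only two.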
The above vector interleaved multithreading processing method can be realized by the vector interleaved multithreaded microprocessor of the present invention, which comprises one or more vector interleaved multithreaded microprocessor cores. As shown in Fig. 3, a vector interleaved multithreaded microprocessor core comprises: 8 program counters (PC), a multithreaded instruction fetch unit, an instruction cache, 8 instruction buffer queues, a thread scheduling unit, a scalar/vector decoding unit, a scalar execution pipeline, a vector execution pipeline, and a data cache. According to the current instruction addresses stored in the 8 program counters, the multithreaded instruction fetch unit reads the instructions of the 8 threads from the instruction cache in round-robin fashion and sends the fetched instructions into the 8 instruction buffer queues, which correspond one-to-one to the 8 threads. The thread scheduling unit selects one of the 8 instruction buffer queues, takes an instruction out of it, and sends it to the scalar/vector decoding unit for decoding. The decoded instruction is sent to the corresponding scalar execution pipeline or vector execution pipeline for execution (a scalar instruction is sent to the scalar execution pipeline, and a vector instruction is sent to the vector execution pipeline); during execution, the scalar execution pipeline or vector execution pipeline accesses data in the data cache.
In this embodiment the scalar execution pipeline retains the structure of a typical existing scalar pipeline. It comprises: a scalar register file unit for storing the source operands of scalar instructions; a scalar operand selection unit for selecting source operands from the scalar register file unit; a scalar execution unit for performing operations on the source operands; and a scalar data write-back unit for writing results back to the scalar register file unit after an operation completes. The present invention adds a vector execution pipeline, which comprises: a vector register file unit for storing the source operands of vector instructions; a vector operand selection unit for selecting source operands from the scalar register file unit and/or the vector register file unit and for transferring operands between the vector and scalar execution units; a vector execution unit for performing operations on the source operands; and a vector data write-back unit for writing results back to the vector register file unit after an operation completes. The operand selection and transfer step is needed because some instructions (such as vector-scalar-vector instructions) require both vector and scalar operands at the same time.
Because the vector execution pipeline can operate on an entire vector in each clock cycle, its data demand is far greater than that of the scalar execution pipeline, and the relatively small L1 cache usually cannot satisfy the data demands of the scalar and vector execution units at the same time. Therefore, as shown in Figure 7, in this embodiment the data cache comprises an interconnected L1 cache and L2 cache. The scalar execution unit in the scalar execution pipeline is connected to the L1 cache and accesses data through it, while the vector execution unit in the vector execution pipeline is connected to the L2 cache through a vector access interface and accesses data from the L2 cache directly; that is, the vector Load/Store unit bypasses the L1 cache and goes straight to the L2 cache.
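The two data paths can be summarized in a few lines (illustrative only; the port and block widths below are assumed example values, not figures from the patent):

```python
L1_PORT_BITS = 64    # assumed: L1 serves one machine word per access
L2_BLOCK_BITS = 256  # assumed: L2 block matches the vector register width

def access_path(kind):
    """Cache levels traversed by one memory access."""
    if kind == 'scalar':
        return ['L1', 'L2']   # L1 first, L2 only on a miss
    if kind == 'vector':
        return ['L2']         # vector access interface bypasses L1
    raise ValueError(kind)

# A full 256-bit vector would need 4 narrow L1 accesses, but only one L2 access:
assert L2_BLOCK_BITS // L1_PORT_BITS == 4
assert access_path('vector') == ['L2']
```

Under these assumed widths, routing vector traffic straight to the wider L2 both avoids serializing each vector access into several narrow L1 accesses and keeps vector streams from evicting the scalar working set from L1.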
In this embodiment the vector execution unit comprises: a vector Load/Store unit for executing vector or scalar Load/Store instructions; a vector floating-point unit for executing vector floating-point instructions; a vector arithmetic-logic unit for executing vector arithmetic-logic instructions; and a vector shuffle unit for executing vector shuffle instructions. The scalar execution unit adopts the typical existing structure. When the execution pipelines operate, each instruction is sent to the corresponding functional unit: for example, the source operands of a scalar or vector memory-access instruction are sent to the scalar or vector Load/Store unit, and a vector arithmetic instruction is sent to the vector arithmetic-logic unit.
In this embodiment the size of a data block in the data cache equals the width of the vector register file unit. For example, the vector register file unit comprises 8 vector register groups, one per thread. Each group comprises 32 vector registers v0~v31, and each vector register is 4 machine words long; depending on the machine word length, a vector register is therefore 128 bits wide (for 32-bit machine words) or 256 bits wide (for 64-bit machine words).
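The width arithmetic implied by the paragraph above can be checked directly (constant names are illustrative, the numbers come from the text):

```python
WORDS_PER_VREG = 4       # each of v0..v31 is 4 machine words long
VREGS_PER_GROUP = 32
THREAD_GROUPS = 8        # one vector register group per thread

def vector_register_bits(word_bits):
    """Vector register width = 4 machine words; this also sizes the cache block."""
    return WORDS_PER_VREG * word_bits

assert vector_register_bits(32) == 128   # 32-bit machine word
assert vector_register_bits(64) == 256   # 64-bit machine word

# Total vector register state per core, assuming 64-bit machine words:
total_bits = THREAD_GROUPS * VREGS_PER_GROUP * vector_register_bits(64)
assert total_bits == 65536               # i.e. 8 KiB of vector registers
```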
When the vector interleaved multithreaded microprocessor of the present invention contains two or more vector interleaved multithreaded microprocessor cores, as shown in Figure 8, each core is provided with its own interconnected L1 and L2 caches; the L2 caches are interconnected through a crossbar switch and connected to the off-core L3 caches, the inter-chip network interface, and the peripheral interface, and each L3 cache is connected to a memory controller used to access external memory. The inter-chip network interface provides high-speed serial interfaces for interconnecting multiple vector interleaved microprocessor chips; the peripheral interface supports multiple peripheral buses, including PCI-E and Gigabit Ethernet. The vector interleaved multithreaded multi-core microprocessor delivers high sustained performance with a comparatively simple hardware structure and can fully exploit the process-level, thread-level, instruction-level, and data-level parallelism in applications; at the same time the processor is compatible with existing scalar applications, giving it good compatibility and extensibility.
The above is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above embodiment; all technical solutions within the spirit of the present invention belong to its protection scope. It should be pointed out that, for those skilled in the art, improvements and modifications made without departing from the principle of the present invention shall also be regarded as falling within the protection scope of the present invention.

Claims (10)

1. A vector interleaved multithread processing method, characterized by comprising the following steps:
1) instruction fetch: a multithread instruction fetch unit selects one vector thread from N vector threads in round-robin order, reads an instruction, and stores the fetched instruction in the instruction queue buffer corresponding to said vector thread;
2) thread selection: a thread scheduling unit selects one instruction queue buffer from the N instruction queue buffers, takes one instruction from said instruction queue buffer, and decodes it;
3) instruction execution: the decoded instruction is sent to a vector execution pipeline or a scalar execution pipeline for execution.
2. The vector interleaved multithread processing method according to claim 1, characterized in that N = 2^n, where n = 1, 2, 3, ...
3. The vector interleaved multithread processing method according to claim 1, characterized in that said step 3) comprises the following steps:
3.1) operand selection: according to the content of the decoded instruction, the vector register file unit or the scalar register file unit is accessed to obtain the source operands, and the obtained source operands are delivered to the corresponding vector execution unit or scalar execution unit;
3.2) instruction execution: said vector execution unit or scalar execution unit performs the operation on said source operands, and the operation result is written back to the vector register file unit or the scalar register file unit, respectively.
4. The vector interleaved multithread processing method according to claim 1, 2 or 3, characterized in that said instruction is a scalar instruction or a vector instruction, said vector instruction comprising the following categories:
ⅰ. vector memory-access instructions, comprising:
A. vector load instructions:
vload vA rB: using the value in scalar register rB as the address, load data into vector register vA;
vload vA rB imm: using the value in scalar register rB plus the immediate imm as the address, load data into vector register vA;
vload vA imm: using the immediate imm as the address, load data into vector register vA;
B. vector store instructions:
vstore vA rB imm: using the value in scalar register rB plus the immediate imm as the address, write data from vector register vA to main memory;
vstore vA imm: using the immediate imm as the address, write data from vector register vA to main memory;
vstore vA rB: using the value in scalar register rB as the address, write data from vector register vA to main memory;
ⅱ. vector/scalar data transfer instructions, comprising:
C. vtos vA rB idx: send element idx of vector register vA into scalar register rB;
D. stov vA rB: replicate the value in scalar register rB four times and send the copies into vector register vA;
ⅲ. vector arithmetic-logic instructions, comprising:
E. vvvop vD vA vB: perform an arithmetic-logic operation op on the corresponding elements of vector registers vA and vB and write the result to vector register vD;
F. vvsop rD vA vB: perform an arithmetic-logic operation op on all elements of vector registers vA and vB and write the result to scalar register rD;
G. vsvop vD vA rB: perform an arithmetic-logic operation op on all elements of vector register vA and scalar register rB and write the result to vector register vD;
H. vssop rD vA rB: perform an arithmetic-logic operation op on all elements of vector register vA and scalar register rB and write the result to scalar register rD;
ⅳ. vector floating-point instructions, comprising two classes:
E. vvvfop vD vA vB: perform a floating-point operation fop on the corresponding elements of vector registers vA and vB and write the result to vector register vD;
G. vsvfop vD vA rB: perform a floating-point operation fop on all elements of vector register vA and scalar register rB and write the result to vector register vD;
ⅴ. vector shuffle instruction: vshuffle vC vB rA: according to the value of scalar register rA, reorder the elements of vector register vB and write them to vector register vC.
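As an illustration only (not part of the claims), the listed semantics can be modeled in a few lines of Python. The flat word-addressed dict memory, the fixed 4-lane vector length, and the 2-bit-per-lane vshuffle selector encoding are assumptions of this sketch; the claim does not fix an encoding.

```python
import operator

VLEN = 4  # lanes per vector register (the 4-machine-word registers)

class VectorMachine:
    """Executable model of the vector instruction semantics listed above."""
    def __init__(self):
        self.r = [0] * 32                         # scalar registers
        self.v = [[0] * VLEN for _ in range(32)]  # vector registers v0..v31
        self.mem = {}                             # flat word-addressed memory

    # i. vector memory access (address already computed from rB and/or imm)
    def vload(self, vA, addr):
        self.v[vA] = [self.mem.get(addr + k, 0) for k in range(VLEN)]

    def vstore(self, vA, addr):
        for k in range(VLEN):
            self.mem[addr + k] = self.v[vA][k]

    # ii. vector/scalar data transfer
    def vtos(self, vA, rB, idx):   # element idx of vA -> scalar rB
        self.r[rB] = self.v[vA][idx]

    def stov(self, vA, rB):        # replicate scalar rB into every lane of vA
        self.v[vA] = [self.r[rB]] * VLEN

    # iii. vector arithmetic/logic (op is any two-argument function)
    def vvvop(self, op, vD, vA, vB):   # lane-wise vA op vB -> vD
        self.v[vD] = [op(a, b) for a, b in zip(self.v[vA], self.v[vB])]

    def vsvop(self, op, vD, vA, rB):   # every lane of vA op scalar rB -> vD
        self.v[vD] = [op(a, self.r[rB]) for a in self.v[vA]]

    # v. vector shuffle: reorder the lanes of vB as directed by rA
    def vshuffle(self, vC, vB, rA):
        sel = [(self.r[rA] >> (2 * k)) & 3 for k in range(VLEN)]  # assumed encoding
        self.v[vC] = [self.v[vB][s] for s in sel]

m = VectorMachine()
m.mem.update({100: 1, 101: 2, 102: 3, 103: 4})
m.r[1] = 100
m.vload(0, m.r[1])                   # vload v0 rB: v0 = [1, 2, 3, 4]
m.r[2] = 10
m.stov(1, 2)                         # stov v1 rB: v1 = [10, 10, 10, 10]
m.vvvop(operator.add, 2, 0, 1)       # vvvop v2 v0 v1
assert m.v[2] == [11, 12, 13, 14]
m.r[4] = 0b00011011                  # lane selectors 3, 2, 1, 0 -> reverse
m.vshuffle(5, 0, 4)
assert m.v[5] == [4, 3, 2, 1]
```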
5. A vector interleaved multithreaded microprocessor, characterized by comprising one or more vector interleaved multithreaded microprocessor cores, each said core comprising: N program counters, a multithread instruction fetch unit, an instruction cache, N instruction queue buffers, a thread scheduling unit, a scalar/vector decoding unit, a scalar execution pipeline, a vector execution pipeline, and a data cache; according to the current instruction addresses stored in said N program counters, said multithread instruction fetch unit reads the instructions of said N threads from said instruction cache in round-robin order and places the fetched instructions into the N instruction queue buffers corresponding one-to-one to said N threads; said thread scheduling unit selects one of said N instruction queue buffers, takes one instruction from it, and sends it to said scalar/vector decoding unit for decoding; the decoded instruction is sent to the corresponding scalar execution pipeline or vector execution pipeline for execution; during execution, said scalar execution pipeline and vector execution pipeline access data through said data cache.
6. The vector interleaved multithreaded microprocessor according to claim 5, characterized in that said data cache comprises an interconnected L1 cache and L2 cache; said scalar execution pipeline is connected to said L1 cache and accesses data through it, and said vector execution pipeline is connected to said L2 cache through a vector access interface and accesses data from the L2 cache directly.
7. The vector interleaved multithreaded microprocessor according to claim 6, characterized in that said scalar execution pipeline comprises: a scalar register file unit for storing the source operands of scalar instructions, a scalar operand selection unit for selecting source operands from said scalar register file unit, a scalar execution unit for performing operations on said source operands, and a scalar data write-back unit for writing the results back to the scalar register file unit after an operation completes;
said vector execution pipeline comprises: a vector register file unit for storing the source operands of vector instructions, a vector operand selection unit for selecting source operands from said scalar register file unit and/or said vector register file unit and for transferring operands between said vector execution unit and scalar execution unit, a vector execution unit for performing operations on said source operands, and a vector data write-back unit for writing the results back to the vector register file unit after an operation completes; said scalar execution unit is connected to said L1 cache, and said vector execution unit is connected to said L2 cache.
8. The vector interleaved multithreaded microprocessor according to claim 7, characterized in that said vector execution unit comprises: a vector Load/Store unit for executing vector or scalar Load/Store instructions, a vector floating-point unit for executing vector floating-point instructions, a vector arithmetic-logic unit for executing vector arithmetic-logic instructions, and a vector shuffle unit for executing vector shuffle instructions.
9. The vector interleaved multithreaded microprocessor according to claim 5, 6, 7 or 8, characterized in that the size of a data block in said data cache equals the width of said vector register file unit.
10. The vector interleaved multithreaded microprocessor according to claim 5, 6, 7 or 8, characterized in that said vector interleaved multithreaded microprocessor comprises two or more vector interleaved multithreaded microprocessor cores; each of said two or more cores is provided with its own interconnected L1 cache and L2 cache; said two or more L2 caches are interconnected through a crossbar switch and connected to the off-core L3 caches, the inter-chip network interface, and the peripheral interface, and each said L3 cache is connected to a memory controller used to access external memory.
CN2011101138829A 2011-05-04 2011-05-04 Vector crossing multithread processing method and vector crossing multithread microprocessor Pending CN102156637A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011101138829A CN102156637A (en) 2011-05-04 2011-05-04 Vector crossing multithread processing method and vector crossing multithread microprocessor


Publications (1)

Publication Number Publication Date
CN102156637A true CN102156637A (en) 2011-08-17

Family

ID=44438145

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011101138829A Pending CN102156637A (en) 2011-05-04 2011-05-04 Vector crossing multithread processing method and vector crossing multithread microprocessor

Country Status (1)

Country Link
CN (1) CN102156637A (en)




Similar Documents

Publication Publication Date Title
CN102156637A (en) Vector crossing multithread processing method and vector crossing multithread microprocessor
Ipek et al. Core fusion: accommodating software diversity in chip multiprocessors
CN106648554B (en) For improving system, the method and apparatus of the handling capacity in continuous transactional memory area
CN102004719B (en) Very long instruction word processor structure supporting simultaneous multithreading
CN102750133B (en) 32-Bit triple-emission digital signal processor supporting SIMD
CN104050023B (en) System and method for implementing transactional memory
EP3716056B1 (en) Apparatus and method for program order queue (poq) to manage data dependencies in processor having multiple instruction queues
US11275637B2 (en) Aggregated page fault signaling and handling
US10095623B2 (en) Hardware apparatuses and methods to control access to a multiple bank data cache
US9904553B2 (en) Method and apparatus for implementing dynamic port binding within a reservation station
CN105453030B (en) Processors, methods and systems for mode-dependent partial-width loads to wider registers
CN103365627A (en) System and method of data forwarding within an execution unit
US10275242B2 (en) System and method for real time instruction tracing
KR20220151134A (en) Apparatus and method for adaptively scheduling work on heterogeneous processing resources
CN104216681B (en) A CPU instruction processing method and processor
CN100451951C (en) 5+3-stage pipeline structure and method in a RISC CPU
US20140129805A1 (en) Execution pipeline power reduction
CN101266559A (en) Configurable microprocessor and method for dividing a single microprocessor core into multiple cores
CN108351780A (en) Processors, methods, systems and instructions for pairwise swapping of contiguous data elements
CN105183697B (en) Embedded RISC DSP processor system and construction method
EP3757772A1 (en) System, apparatus and method for a hybrid reservation station for a processor
Omondi The microarchitecture of pipelined and superscalar computers
CN103218207A (en) Microprocessor instruction processing method and system based on a single/dual-issue instruction set
CN202720631U (en) Microprocessor instruction processing system based on a single/dual-issue instruction set
CN105843589B (en) A memory device for VLIW-type processors

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication (Application publication date: 20110817)