CN102156637A - Vector crossing multithread processing method and vector crossing multithread microprocessor - Google Patents
- Publication number
- CN102156637A CN102156637A CN2011101138829A CN201110113882A CN102156637A CN 102156637 A CN102156637 A CN 102156637A CN 2011101138829 A CN2011101138829 A CN 2011101138829A CN 201110113882 A CN201110113882 A CN 201110113882A CN 102156637 A CN102156637 A CN 102156637A
- Authority
- CN
- China
- Prior art keywords
- vector
- instruction
- scalar
- vectorial
- unit
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Advance Control (AREA)
Abstract
The invention discloses a vector interleaved-multithreading processing method and a vector interleaved-multithreading microprocessor. The processing method comprises the following steps: a multithreaded instruction-fetch unit selects one vector thread from among N vector threads, reads an instruction, and stores it in the instruction queue buffer corresponding to that thread; a thread-scheduling unit selects one of the N instruction queue buffers, takes an instruction from it, and decodes it; the decoded instruction is then sent to the vector execution pipeline or the scalar execution pipeline for execution. The method can be realized in hardware by the vector interleaved-multithreading microprocessor. The method and microprocessor combine vector processing with multithreading and have the advantages of a simple hardware structure, strong arithmetic capability, and good compatibility and extensibility.
Description
Technical field
The present invention relates to the field of computer microprocessors, and in particular to a multithreaded microprocessor.
Background technology
The rapid development of computing places ever-higher demands on microprocessor performance. There are two main ways to improve a microprocessor's arithmetic capability: first, improving the arithmetic capability of a single processor core; second, integrating multiple processor cores on one chip, commonly known as multi-core technology.
1. Improving the arithmetic capability of a processor core. Traditional approaches rely mainly on raising the core's clock frequency and on superscalar techniques with wider instruction issue. Constrained by process technology, power consumption, and reliability, frequency scaling has run into a bottleneck, and issue width is also difficult to enlarge further. Attention has therefore gradually shifted to new microarchitecture techniques that exploit the ever-growing on-chip hardware resources to improve the performance of the microprocessor core.
Fig. 1 shows the typical structure of a traditional scalar microprocessor core. It mainly comprises a program counter, an instruction-fetch unit, an instruction cache, a decode unit, and a scalar execution pipeline. The scalar execution pipeline mainly comprises a register file unit, a data cache, scalar execution units (a load/store unit, a scalar floating-point unit, and a scalar arithmetic-logic unit), and a data write-back unit. A scalar application typically executes on such a processor as follows (Fig. 2): the instruction-fetch unit accesses the instruction cache at the address given by the program counter and obtains an instruction; the fetched instruction is sent to the decode unit; according to the decode result, the instruction enters the scalar execution pipeline, which reads the instruction's source operands from the register file and sends them to the appropriate functional unit; finally, the data write-back unit writes the instruction's result back to the register file.
Interleaved multithreading and vector processing have both appeared in the evolution of microprocessor architecture. With interleaved multithreading, a microprocessor maintains the status registers and related state of several scalar threads simultaneously, and the execution pipeline alternates among scalar instructions from different threads. The feature of vector processing is that one instruction can operate on multiple scalar data elements, so vector processors usually have high peak performance. Interleaved multithreading hides long-latency operations effectively and keeps the execution pipeline full; however, because the threads share one set of functional units, single-thread performance suffers, and the technique cannot raise the microprocessor's peak performance. Vector processing can greatly raise peak performance, but it is sensitive to the memory latency caused by cache misses, making it hard to convert peak performance into sustained performance. If the two could be combined organically, a high-performance microprocessor with both high peak and high sustained performance might be designed.
2. Multi-core technology integrates multiple relatively simple processor cores on one chip, making better use of on-chip hardware resources to improve microprocessor performance. In recent years, as multi-core technology has developed, the number of cores per chip has kept growing, while the arithmetic capability of each core has not improved significantly and in some designs has even declined noticeably. To exploit multi-core processors more fully, attention has concentrated on process-level and thread-level parallelism, while research on instruction-level and data-level parallelism has been neglected. Balancing the number of cores against per-core capability, and combining process-, thread-, instruction-, and data-level parallelism to further improve microprocessor performance, is a major issue in microarchitecture design.
Summary of the invention
The technical problem to be solved by this invention is as follows: in view of the problems of the prior art, the invention provides a vector interleaved-multithreading processing method that combines vector processing with multithreading and improves processor performance, and a vector interleaved-multithreading microprocessor with a simple hardware structure, strong arithmetic capability, and good compatibility and extensibility.
To solve the above technical problems, the present invention adopts the following technical solution:
A vector interleaved-multithreading processing method, characterized by comprising the following steps:
1) Instruction read: a multithreaded instruction-fetch unit selects, in round-robin order, one vector thread from among N vector threads, reads an instruction, and stores the fetched instruction in the instruction queue buffer corresponding to that vector thread;
2) Thread select: a thread-scheduling unit selects one of the N instruction queue buffers, takes an instruction from it, and decodes it;
3) Instruction execute: the decoded instruction is sent to the vector execution pipeline or the scalar execution pipeline for execution.
As a further refinement of the method of the present invention: N = 2^n, where n = 1, 2, 3, …
Step 3) specifically comprises the following steps:
3.1) Operand select: according to the content of the decoded instruction, the vector register file unit or scalar register file unit is accessed to obtain the source operands, which are delivered to the corresponding vector or scalar execution unit;
3.2) Instruction execute: the vector or scalar execution unit performs the operation on the source operands, and the result is written back to the vector or scalar register file unit, respectively.
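The three claimed steps can be sketched as a toy Python model. This is a reading aid, not part of the claim: the function names, the round-robin policy in the thread-select step, and the "opcode starts with v" routing rule are illustrative assumptions.

```python
from collections import deque

def simulate(threads):
    """Toy model of the 3-step method. `threads` is a list of per-thread
    opcode lists; returns the dispatch trace as (thread_id, opcode,
    pipeline) tuples."""
    N = len(threads)
    streams = [deque(t) for t in threads]   # instructions still to be fetched
    queues = [deque() for _ in range(N)]    # per-thread instruction queue buffers
    trace, fetch, sched = [], 0, 0
    while any(streams) or any(queues):
        # 1) instruction read: round-robin fetch into the thread's queue buffer
        for _ in range(N):
            t = fetch % N
            fetch += 1
            if streams[t]:
                queues[t].append(streams[t].popleft())
                break
        # 2) thread select: pick one non-empty instruction queue buffer
        for i in range(N):
            s = (sched + i) % N
            if queues[s]:
                op = queues[s].popleft()
                sched = s + 1
                # 3) instruction execute: route to the vector or scalar pipeline
                trace.append((s, op, "vector" if op.startswith("v") else "scalar"))
                break
    return trace
```

Running it on two tiny threads shows vector and scalar instructions of different threads interleaving cycle by cycle.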
The instruction is a scalar instruction or a vector instruction; vector instructions comprise the following categories:
ⅰ. Vector memory-access instructions, comprising:
A. Vector load instructions:
vload vA rB — uses the value in scalar register rB as the address and reads data into vector register vA;
vload vA rB imm — uses the value in scalar register rB plus the immediate imm as the address and reads data into vector register vA;
vload vA imm — uses the immediate imm as the address and reads data into vector register vA;
B. Vector store instructions:
vstore vA rB imm — uses the value in scalar register rB plus the immediate imm as the address and writes data from vector register vA to main memory;
vstore vA imm — uses the immediate imm as the address and writes data from vector register vA to main memory;
vstore vA rB — uses the value in scalar register rB as the address and writes data from vector register vA to main memory;
ⅱ. Vector/scalar register data-transfer instructions, comprising:
C. vtos vA rB idx — moves element idx of vector register vA into scalar register rB;
D. stov vA rB — replicates the value in scalar register rB four times and writes the copies into vector register vA;
ⅲ. Vector arithmetic-logic instructions, comprising:
E. vvvop vD vA vB — applies an arithmetic-logic operation op element-wise to vector registers vA and vB and writes the result to vector register vD;
F. vvsop rD vA vB — applies an arithmetic-logic operation op across all elements of vector registers vA and vB and writes the result to scalar register rD;
G. vsvop vD vA rB — applies an arithmetic-logic operation op to every element of vector register vA and scalar register rB and writes the result to vector register vD;
H. vssop rD vA rB — applies an arithmetic-logic operation op to every element of vector register vA and scalar register rB and writes the result to scalar register rD;
ⅳ. Vector floating-point instructions, comprising two classes:
E. vvvfop vD vA vB — applies a floating-point operation fop element-wise to vector registers vA and vB and writes the result to vector register vD;
G. vsvfop vD vA rB — applies a floating-point operation fop to every element of vector register vA and scalar register rB and writes the result to vector register vD;
ⅴ. Vector shuffle instruction: vshuffle vC vB rA — reorders the elements of vector register vB according to the value of scalar register rA and writes them to vector register vC.
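As a reading aid, the semantics of these instruction classes can be modelled in a few lines of Python, assuming the 4-element vector length used in the embodiment; the helper names and the dictionary-based memory are illustrative assumptions, not part of the claims.

```python
from functools import reduce

VLEN = 4  # vector length assumed throughout the embodiment

def vload(mem, addr):
    """vload: read VLEN consecutive words starting at addr into a vector."""
    return [mem[addr + i] for i in range(VLEN)]

def vstore(mem, addr, v):
    """vstore: write the VLEN elements of v back to memory at addr."""
    for i, x in enumerate(v):
        mem[addr + i] = x

def vtos(v, idx):
    """vtos: move element idx of a vector register into a scalar register."""
    return v[idx]

def stov(r):
    """stov: replicate a scalar value into all VLEN vector elements."""
    return [r] * VLEN

def vvvop(va, vb, op):
    """vvvop: element-wise op on two vectors, vector result."""
    return [op(a, b) for a, b in zip(va, vb)]

def vsvop(va, rb, op):
    """vsvop: op between every element of va and scalar rb, vector result."""
    return [op(a, rb) for a in va]

def vvsop(va, vb, op, reduce_op):
    """vvsop: element-wise op, then reduce all elements to one scalar."""
    return reduce(reduce_op, (op(a, b) for a, b in zip(va, vb)))
```

For example, vvsop with addition as both the element operation and the reduction realizes the "sum of all elements" case mentioned for the vector-vector-scalar class.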
The invention also provides a vector interleaved-multithreading microprocessor, characterized in that it comprises one or more vector interleaved-multithreading microprocessor cores. Each core comprises: N program counters, a multithreaded instruction-fetch unit, an instruction cache, N instruction queue buffers, a thread-scheduling unit, a scalar/vector decode unit, a scalar execution pipeline, a vector execution pipeline, and a data cache. The multithreaded instruction-fetch unit, guided by the current instruction addresses stored in the N program counters, reads instructions of the N threads from the instruction cache in round-robin order and places each fetched instruction into the one of the N instruction queue buffers that corresponds one-to-one with its thread. The thread-scheduling unit selects one of the N instruction queue buffers, takes an instruction from it, and sends it to the scalar/vector decode unit for decoding. The decoded instruction is sent to the corresponding scalar or vector execution pipeline for execution; during execution, the scalar and vector execution pipelines access data in the data cache.
As further refinements of the vector interleaved-multithreading microprocessor of the present invention:
The data cache comprises interconnected level-1 and level-2 caches. The scalar execution pipeline is connected to the level-1 cache and accesses data through it, while the vector execution pipeline is connected to the level-2 cache through a vector memory-access interface and accesses data from the level-2 cache directly.
The scalar execution pipeline comprises: a scalar register file unit for storing the source operands of scalar instructions, a scalar operand-select unit for selecting source operands from the scalar register file unit, a scalar execution unit for operating on the source operands, and a scalar data write-back unit for writing results back to the scalar register file unit after execution completes;
The vector execution pipeline comprises: a vector register file unit for storing the source operands of vector instructions; a vector operand-select unit for selecting source operands from the scalar register file unit and/or the vector register file unit and for handling operand transfer between the vector and scalar execution units; a vector execution unit for operating on the source operands; and a vector data write-back unit for writing results back to the vector register file unit after execution completes. The scalar execution unit is connected to the level-1 cache, and the vector execution unit is connected to the level-2 cache.
The vector execution unit comprises: a vector load/store unit for completing vector or scalar load/store instructions, a vector floating-point unit for completing vector floating-point instructions, a vector arithmetic-logic unit for completing vector arithmetic-logic instructions, and a vector shuffle unit for completing the vector shuffle instruction.
The size of a data block in the data cache equals the width of the vector register file unit.
The vector interleaved-multithreading microprocessor may comprise two or more vector interleaved-multithreading microprocessor cores. Each core is equipped with its own interconnected level-1 and level-2 caches; the level-2 caches are interconnected through a crossbar switch with the level-3 caches located outside the cores, with the on-chip network interface, and with the peripheral interface, and each level-3 cache is connected to a memory controller used for accessing external memory.
Compared with the prior art, the advantages of the present invention are:
1. The vector interleaved-multithreading processing method of the present invention combines interleaved multithreading with vector processing to handle the vector and scalar instructions of multiple threads. Instruction fetch rotates among the threads so that the instructions of multiple vector threads execute in an interleaved fashion: when one thread encounters a long-latency operation, the instructions of the other threads continue to execute, effectively hiding the long-latency operations met during a thread's execution. This raises the microprocessor's peak performance while keeping the vector execution pipeline full, so the microprocessor's arithmetic capability is exploited more fully.
2. The vector interleaved-multithreading microprocessor of the present invention adds a multithreaded instruction-fetch unit and a vector execution pipeline to a traditional scalar microprocessor core, combining interleaved multithreading with vector processing to execute multiple vector threads in an interleaved, parallel fashion and to exploit the data-level parallelism exposed by vectorized programs; its arithmetic capability is strong.
3. While adding the vector execution pipeline, the vector interleaved-multithreading microprocessor retains the original scalar processor structure and therefore remains fully compatible with traditional scalar application programs; its compatibility is good.
4. The vector interleaved-multithreading microprocessor replaces the currently popular superscalar approach with interleaved multithreading, reducing hardware design complexity while maintaining performance; the hardware structure is simple, the cost is low, and the extensibility is good.
Description of drawings
Fig. 1 is a structural schematic of an existing typical scalar microprocessor;
Fig. 2 is an execution flow diagram of an existing typical scalar microprocessor;
Fig. 3 is a structural schematic of the vector interleaved-multithreading microprocessor of the present invention;
Fig. 4 is an execution flow diagram of the vector interleaved-multithreading processing method of the present invention;
Fig. 5 is a schematic showing how interleaved multithreaded execution hides long-latency operations in the vector interleaved-multithreading processing method of the present invention;
Fig. 6 is a schematic of the implementation of the vector shuffle instruction vshuffle vA vB rA of the present invention;
Fig. 7 is a schematic of the connections among the scalar execution unit, the vector execution unit, and the data cache of the present invention;
Fig. 8 is a structural schematic of a multi-core processor built from vector interleaved-multithreading microprocessor cores of the present invention.
Embodiment
The present invention is described in further detail below with reference to the drawings and specific embodiments.
As shown in Fig. 4, the vector interleaved-multithreading processing method of the present invention comprises the following steps:
1) Instruction read: the multithreaded instruction-fetch unit selects, in round-robin order, one vector thread from among 8 vector threads, reads an instruction, and stores the fetched instruction in that thread's instruction queue buffer. The instruction-fetch unit can read one cache block of instructions into the queue buffer at a time; each instruction is typically one machine word long, so if each level-1 cache block is 4 machine words, the fetch unit can read 4 instructions into the instruction queue per clock cycle. In practice, the number of vector threads N can be any natural number greater than 1, limited only by hardware resources; N is usually chosen as a power of two, i.e. N = 2^n with n = 1, 2, 3, …, which helps simplify the hardware design.
2) Thread select: the thread-scheduling unit selects one of the 8 instruction queue buffers, takes an instruction from it, and decodes it;
3) Instruction execute: the decoded instruction is sent to the vector execution pipeline or the scalar execution pipeline; the specific steps are as follows:
3.1) Operand select: according to the content of the decoded instruction, the vector register file unit or scalar register file unit is accessed to obtain the source operands, which are delivered to the corresponding vector or scalar execution unit;
3.2) Instruction execute: the vector or scalar execution unit performs the operation on the source operands, and the result is written back to the vector or scalar register file unit, respectively.
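The preference for N = 2^n can be illustrated with the round-robin thread counter: when the thread count is a power of two, the wrap-around is a single bitwise AND rather than a comparator or divider. This is a sketch of the rationale; the patent does not prescribe this particular circuit.

```python
def next_thread(current, n_threads):
    """Advance a round-robin thread pointer. When n_threads = 2^n, the
    modulo wrap reduces to a bitwise AND with (N - 1) -- in hardware,
    an n-bit counter that simply overflows, with no comparison logic."""
    assert n_threads > 0 and n_threads & (n_threads - 1) == 0, "N must be 2^n"
    return (current + 1) & (n_threads - 1)
```

With N = 8 the pointer cycles 0, 1, …, 7, 0 using only the low three bits of the counter.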
In the above method, an instruction is either a scalar instruction or a vector instruction. Scalar instructions are ordinary existing scalar instructions; vector instructions may comprise the following five kinds:
ⅰ. Vector memory-access instructions, comprising:
A. Vector load instructions, which read a vector from main memory into a vector register:
vload vA rB — uses the value in scalar register rB as the address and reads data into vector register vA;
vload vA rB imm — uses the value in scalar register rB plus the immediate imm as the address and reads data into vector register vA;
vload vA imm — uses the immediate imm as the address and reads data into vector register vA;
B. Vector store instructions, which write the contents of a vector register back to main memory:
vstore vA rB imm — uses the value in scalar register rB plus the immediate imm as the address and writes data from vector register vA to main memory;
vstore vA imm — uses the immediate imm as the address and writes data from vector register vA to main memory;
vstore vA rB — uses the value in scalar register rB as the address and writes data from vector register vA to main memory;
The vector load and store instructions can support several addressing modes, including register addressing, immediate addressing, and base-plus-offset addressing; a specific implementation may realize all of these addressing modes or only one or more of them;
ⅱ. Vector/scalar register data-transfer instructions, comprising:
C. vtos vA rB idx — moves element idx of vector register vA into scalar register rB; this is the instruction for transferring data from a vector register to a scalar register;
D. stov vA rB — replicates the value in scalar register rB four times and writes the copies into vector register vA; this is the instruction for transferring data from a scalar register to a vector register;
A scalar-to-vector transfer instruction writes the value of a scalar register into a particular element of a vector register, or, under a mask, replicates the scalar value and writes it into several elements of the vector register at once; a vector-to-scalar transfer instruction assigns the value of a particular element of a vector register to a scalar register;
ⅲ. Vector arithmetic-logic instructions, comprising four classes:
E. vvvop vD vA vB — applies an arithmetic-logic operation op element-wise to vector registers vA and vB and writes the result to vector register vD; this is a vector-vector-vector instruction: both source operands and the destination operand are vector registers, and this class mainly performs arithmetic-logic operations between corresponding vector elements;
F. vvsop rD vA vB — applies an arithmetic-logic operation op across all elements of vector registers vA and vB and writes the result to scalar register rD; this is a vector-vector-scalar instruction: both source operands are vector registers and the destination is a scalar register, and this class mainly performs reductions over all elements of a vector register, such as sum, AND, and OR;
G. vsvop vD vA rB — applies an arithmetic-logic operation op to every element of vector register vA and scalar register rB and writes the result to vector register vD;
H. vssop rD vA rB — applies an arithmetic-logic operation op to every element of vector register vA and scalar register rB and writes the result to scalar register rD;
vsvop vD vA rB and vssop rD vA rB are vector-scalar instructions: the source operands are a vector register and a scalar register, and the destination is a vector register (vsvop) or a scalar register (vssop); this class performs arithmetic-logic operations between each vector element and the scalar;
The operations supported by the vector arithmetic-logic instructions include fixed-point arithmetic instructions such as add, subtract, multiply, and divide; logic instructions such as AND, OR, NOT, XOR, negate, and compare; and bitwise instructions, including bitwise AND, bitwise OR, bitwise NOT, bitwise XOR, and bitwise negate.
ⅳ. Vector floating-point instructions, comprising two classes:
E. vvvfop vD vA vB — applies a floating-point operation fop element-wise to vector registers vA and vB and writes the result to vector register vD; this is a vector-vector-vector instruction: both source operands and the destination are vector registers, and this class mainly performs floating-point operations between corresponding vector elements;
G. vsvfop vD vA rB — applies a floating-point operation fop to every element of vector register vA and scalar register rB and writes the result to vector register vD;
The floating-point operations supported by the vector floating-point instructions mainly include floating-point addition, subtraction, multiplication, division, and comparison, and floating-point/fixed-point conversion.
ⅴ. Vector shuffle instruction: vshuffle vC vB rA — reorders the elements of vector register vB according to the value of scalar register rA and writes them to vector register vC.
The implementation of the vector shuffle instruction is shown in Fig. 6. The values of the four elements of source vector register vB pass through two layers of multiplexers into destination vector register vA. The four first-layer multiplexers are 4-to-1 muxes; their control signals come from the low 8 bits of scalar source register rA (s0–s3 in the figure), two bits per mux. The four second-layer multiplexers are 2-to-1 muxes used to decide whether the selected source value is written into the destination vector register; their control signals come from bits 8 to 11 of scalar source register rA (mask in the figure). If the corresponding control bit is 1, the selected element of the source vector register is written into the corresponding element of the destination vector register; otherwise the corresponding element of the destination vector register is left unchanged.
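The two-layer multiplexer network of Fig. 6 can be modelled directly in Python for 4-element vectors. The bit layout of rA (two select bits per lane in the low byte, mask in bits 8–11) follows the description above; the function name and argument order are illustrative.

```python
def vshuffle(vc_old, vb, ra):
    """Model of the two-layer multiplexer network of Fig. 6.
    Layer 1: four 4-to-1 muxes select a source element of vb using the
             2-bit fields s0..s3 in the low 8 bits of scalar ra.
    Layer 2: four 2-to-1 muxes use mask bits 8..11 of ra to decide
             whether to overwrite each destination element or keep it."""
    vc = list(vc_old)
    for i in range(4):
        sel = (ra >> (2 * i)) & 0b11   # layer 1: source index for lane i
        if (ra >> (8 + i)) & 1:        # layer 2: mask bit for lane i
            vc[i] = vb[sel]
    return vc
```

For example, ra = 0xF1B (selects 3, 2, 1, 0 with all mask bits set) reverses the source vector, while a partial mask leaves the unselected destination elements untouched.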
Interleaved execution of multiple threads hides the long-latency operations (such as memory accesses) in the execution pipeline well, keeping the pipeline as close to fully loaded as possible. Fig. 5 illustrates how the interleaved multithreaded execution of the vector interleaved-multithreading microprocessor of the present invention hides long-latency operations, taking the interleaved execution of 4 threads as an example. In the figure, C denotes a vector arithmetic instruction, M a vector memory-access instruction, L_hit the cache-hit latency, and L_miss the cache-miss latency; the four threads execute vector arithmetic and vector memory-access instructions in different orders. A vector memory-access instruction of thread 0 always introduces a two-cycle delay when it executes; without other threads interleaved, this would stall the execution pipeline for two cycles. With the interleaved multithreading mechanism, the vector arithmetic instructions of threads 1 and 2 fill the two idle cycles caused by the memory latency, keeping the pipeline full. The execution of a vector memory-access instruction of thread 2 suffers a cache miss, which could stall the pipeline for a long time; while the miss is being handled, the vector arithmetic instructions of threads 0, 1, and 3 fill the pipeline's idle slots. Although the stall is not avoided entirely, the pipeline's stall time is significantly reduced and performance improves. As this analysis shows, in general the more threads the microprocessor supports, the stronger its ability to hide long-latency operations and the better its performance can be exploited.
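The effect described above can be reproduced with a toy cycle model. The latencies and instruction mixes below are illustrative assumptions, not taken from Fig. 5: a memory access (M) occupies its thread for 3 cycles (issue plus a 2-cycle hit delay) and an arithmetic instruction (C) for 1 cycle.

```python
def idle_cycles(threads, latencies, horizon=256):
    """Toy cycle model of interleaved multithreading. Each cycle, one
    instruction issues from some ready thread (round-robin); an issued
    instruction with latency L blocks only its own thread for L cycles.
    Returns (total_cycles, idle_cycles)."""
    n = len(threads)
    pc = [0] * n            # next instruction index per thread
    ready_at = [0] * n      # cycle at which each thread may issue again
    cycle = idle = rr = 0
    while any(pc[t] < len(threads[t]) for t in range(n)) and cycle < horizon:
        issued = False
        for i in range(n):
            t = (rr + i) % n
            if pc[t] < len(threads[t]) and ready_at[t] <= cycle:
                ready_at[t] = cycle + latencies[threads[t][pc[t]]]
                pc[t] += 1
                rr = t + 1
                issued = True
                break
        if not issued:
            idle += 1       # pipeline bubble: no thread was ready this cycle
        cycle += 1
    return cycle, idle
```

A single thread running "M, C" stalls for 2 of its 4 cycles, while four such threads interleaved leave no idle cycles at all: the other threads' instructions fill every slot of the memory delay.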
The above vector interleaved-multithreading processing method can be realized by the vector interleaved-multithreading microprocessor of the present invention, which comprises one or more vector interleaved-multithreading microprocessor cores. As shown in Fig. 3, each core comprises: 8 program counters (PCs), a multithreaded instruction-fetch unit, an instruction cache, 8 instruction queue buffers, a thread-scheduling unit, a scalar/vector decode unit, a scalar execution pipeline, a vector execution pipeline, and a data cache. The multithreaded instruction-fetch unit, guided by the current instruction addresses stored in the 8 program counters, reads instructions of the 8 threads from the instruction cache in round-robin order and places each fetched instruction into the one of the 8 instruction queue buffers corresponding to its thread. The thread-scheduling unit selects one of the 8 instruction queue buffers, takes an instruction from it, and sends it to the scalar/vector decode unit for decoding. The decoded instruction is sent to the corresponding pipeline for execution (a scalar instruction to the scalar execution pipeline, a vector instruction to the vector execution pipeline); during execution, the scalar and vector execution pipelines access data in the data cache.
In this embodiment, the scalar execution pipeline retains the structure of an existing typical scalar execution pipeline, comprising: a scalar register file unit for storing the source operands of scalar instructions, a scalar operand-select unit for selecting source operands from the scalar register file unit, a scalar execution unit for operating on the source operands, and a scalar data write-back unit for writing results back to the scalar register file unit after execution completes. The present invention adds a vector execution pipeline comprising: a vector register file unit for storing the source operands of vector instructions; a vector operand-select unit for selecting source operands from the scalar register file unit and/or the vector register file unit and for handling operand transfer between the vector and scalar execution units; a vector execution unit for operating on the source operands; and a vector data write-back unit for writing results back to the vector register file unit after execution completes. The operand-selection and transfer process is added because some instructions (such as the vector-scalar-vector instructions) need both vector and scalar operands at the same time.
Because the vector execution pipeline can complete an operation on a whole vector in each clock cycle, its data demand is far greater than that of the scalar execution pipeline, and a small-capacity first-level cache usually cannot satisfy the data demands of the scalar and vector execution units at the same time. Therefore, as shown in Figure 7, in the present embodiment the data cache comprises an interconnected first-level cache and second-level cache. The scalar execution unit in the scalar execution pipeline is connected to the first-level cache and accesses data from it, while the vector execution unit in the vector execution pipeline is connected to the second-level cache through a vector access interface and accesses data from it directly; that is, the vector Load/Store unit bypasses the first-level cache and goes straight to the second-level cache.
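The split memory path can be modelled in a few lines. This is a hypothetical sketch under simplifying assumptions: caches are plain dictionaries with fill-on-miss and no eviction, and the dummy memory contents (`addr * 2`) exist only so the access paths can be observed.

```python
# Hypothetical sketch: scalar accesses go through the L1 cache, while
# vector Load/Store bypasses L1 and hits the L2 cache directly.
class Cache:
    def __init__(self, name, lower):
        self.name, self.lower, self.data = name, lower, {}
    def read(self, addr):
        if addr in self.data:                 # hit
            return self.data[addr], [self.name]
        value, path = self.lower.read(addr)   # miss: fetch from below
        self.data[addr] = value               # fill on miss
        return value, [self.name] + path

class Memory:
    def read(self, addr):
        return addr * 2, ["DRAM"]             # dummy contents

mem = Memory()
l2 = Cache("L2", mem)
l1 = Cache("L1", l2)

def scalar_load(addr):
    return l1.read(addr)                      # scalar unit -> L1 -> L2 -> DRAM

def vector_load(addr, vlen=4):
    # Vector access interface: straight to L2, one element per lane.
    return [l2.read(addr + i)[0] for i in range(vlen)]
```

The design point the embodiment makes is visible here: a whole-vector access would pollute the small L1 with `vlen` lines per instruction, so routing it to the larger L2 keeps L1 effective for scalar traffic.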
In the present embodiment, the vector execution unit comprises: a vector Load/Store unit for completing vector Load/Store instructions or scalar Load/Store instructions, a vector floating-point unit for completing vector floating-point instructions, a vector arithmetic-logic unit for completing vector arithmetic-logic instructions, and a vector shuffle unit for completing vector shuffle instructions. The scalar execution unit adopts an existing typical structure. When the execution pipelines operate, each instruction is sent to its corresponding functional unit: the source operands of a scalar or vector memory-access instruction are sent to the scalar or vector Load/Store unit, a vector arithmetic instruction is sent to the vector arithmetic-logic unit, and so on.
In the present embodiment, the size of a data block in the data cache is the same as the width of the vector register file unit. For example, the vector register file unit comprises 8 vector register groups, each group corresponding to one thread. Each vector register group comprises 32 vector registers, v0 to v31, and the length of each vector register is 4 machine words; the register bit width therefore depends on the machine word length: 128 bits for 32-bit machine words, or 256 bits for 64-bit machine words.
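The sizing stated above can be checked with a few lines; the 8 × 32 × 4-word layout is taken directly from this embodiment, and the helper names are illustrative.

```python
# Register-file sizing from the embodiment: 8 register groups (one per
# thread), 32 vector registers per group, 4 machine words per register.
GROUPS, REGS_PER_GROUP, WORDS_PER_REG = 8, 32, 4

def reg_width_bits(word_bits):
    # Width of one vector register: 4 machine words.
    return WORDS_PER_REG * word_bits

def file_size_bits(word_bits):
    # Total capacity of the vector register file unit.
    return GROUPS * REGS_PER_GROUP * reg_width_bits(word_bits)
```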
When the vector crossing multithreaded microprocessor of the present invention comprises two or more vector crossing multithreaded microprocessor cores, as shown in Figure 8, each core is provided with its own interconnected first-level and second-level caches, and the second-level caches are interconnected by a crossbar switch with the off-core third-level caches, the inter-chip network interface, and the peripheral interface; each third-level cache is connected to a memory controller used to access external memory. The inter-chip network interface provides high-speed serial interfaces for interconnecting multiple vector crossing microprocessor chips; the peripheral interface supports multiple peripheral buses, including the PCI-E bus, Gigabit Ethernet, and so on. The vector crossing multithreaded multi-core microprocessor provides high practical performance with a comparatively simple hardware structure and can fully exploit the process-level, thread-level, instruction-level, and data-level parallelism in application programs; at the same time, the processor is compatible with existing scalar application programs, and thus has good compatibility and extensibility.
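The multi-core topology of Figure 8 can be written down as a small data structure. This is a hypothetical sketch: the core and L3-slice counts are illustrative parameters, and the component names are invented for readability.

```python
# Hypothetical sketch of the Figure 8 topology: per-core L1/L2 caches,
# a crossbar joining all L2s to the shared L3 slices plus the inter-chip
# network interface and peripheral interface, and one memory controller
# per L3 slice.
def build_chip(n_cores=4, n_l3=2):
    cores = [{"id": i, "L1": f"L1_{i}", "L2": f"L2_{i}"}
             for i in range(n_cores)]
    crossbar = {
        "inputs": [c["L2"] for c in cores],
        "outputs": [f"L3_{j}" for j in range(n_l3)]
                   + ["inter_chip_network_if", "peripheral_if"],
    }
    mem_ctrls = {f"L3_{j}": f"MC_{j}" for j in range(n_l3)}  # external memory access
    return {"cores": cores, "crossbar": crossbar, "mem_ctrls": mem_ctrls}
```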
The above is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above embodiment; all technical schemes falling under the concept of the present invention belong to the protection scope of the present invention. It should be pointed out that, for those skilled in the art, improvements and modifications made without departing from the principle of the present invention should also be regarded as falling within the protection scope of the present invention.
Claims (10)
1. A vector crossing multithread processing method, characterized by comprising the following steps:
1) instruction reading: a multithreaded instruction-fetch unit selects, in round-robin order, one of N vector threads from which to read an instruction, and stores the instruction read into the instruction queue buffer corresponding to said vector thread;
2) thread selection: a thread scheduling unit selects one instruction queue buffer from the N instruction queue buffers, and takes one instruction out of said instruction queue buffer for decoding;
3) instruction execution: the decoded instruction is sent to a vector execution pipeline or a scalar execution pipeline for execution.
2. The vector crossing multithread processing method according to claim 1, characterized in that N = 2^n, where n = 1, 2, 3, ...
3. The vector crossing multithread processing method according to claim 1, characterized in that said step 3) specifically comprises the following steps:
3.1) operand selection: according to the content of the decoded instruction, the vector register file unit or the scalar register file unit is accessed to obtain the source operands, and the source operands obtained are delivered to the corresponding vector execution unit or scalar execution unit;
3.2) instruction execution: said vector execution unit or scalar execution unit performs an operation according to said source operands, and the result of the operation is written back to the vector register file unit or the scalar register file unit, respectively.
4. The vector crossing multithread processing method according to claim 1, 2, or 3, characterized in that said instruction is a scalar instruction or a vector instruction, and said vector instruction comprises the following categories:
i. vector memory-access instructions, comprising:
A. vector load instructions:
vload vA rB: using the value in scalar register rB as the address, reads data into vector register vA;
vload vA rB imm: using the value in scalar register rB plus the immediate imm as the address, reads data into vector register vA;
vload vA imm: using the immediate imm as the address, reads data into vector register vA;
B. vector store instructions:
vstore vA rB imm: using the value in scalar register rB plus the immediate imm as the address, writes data from vector register vA to main memory;
vstore vA imm: using the immediate imm as the address, writes data from vector register vA to main memory;
vstore vA rB: using the value in scalar register rB as the address, writes data from vector register vA to main memory;
ii. vector/scalar data transfer instructions, comprising:
C. vtos vA rB idx: sends element idx of vector register vA into scalar register rB;
D. stov vA rB: replicates the value in scalar register rB four times and sends the copies into vector register vA;
iii. vector arithmetic-logic instructions, comprising:
E. vvvop vD vA vB: performs an arithmetic-logic operation op on the corresponding elements of vector registers vA and vB, and writes the result into vector register vD;
F. vvsop rD vA vB: performs an arithmetic-logic operation op on all the elements of vector registers vA and vB, and writes the result into scalar register rD;
G. vsvop vD vA rB: performs an arithmetic-logic operation op on all the elements of vector register vA and scalar register rB, and writes the result into vector register vD;
H. vssop rD vA rB: performs an arithmetic-logic operation op on all the elements of vector register vA and scalar register rB, and writes the result into scalar register rD;
iv. vector floating-point instructions, comprising two classes:
E. vvvfop vD vA vB: performs a floating-point operation fop on the corresponding elements of vector registers vA and vB, and writes the result into vector register vD;
G. vsvfop vD vA rB: performs a floating-point operation fop on all the elements of vector register vA and scalar register rB, and writes the result into vector register vD;
v. vector shuffle instruction: vshuffle vC vB rA, which reorders the elements of vector register vB according to the value of scalar register rA and writes the result into vector register vC.
5. A vector crossing multithreaded microprocessor, characterized by comprising one or more vector crossing multithreaded microprocessor cores, said vector crossing multithreaded microprocessor core comprising: N program counters, a multithreaded instruction-fetch unit, an instruction cache, N instruction queue buffers, a thread scheduling unit, a scalar/vector decoding unit, a scalar execution pipeline, a vector execution pipeline, and a data cache; according to the current instruction addresses stored in said N program counters, said multithreaded instruction-fetch unit reads the instructions of said N threads from said instruction cache in round-robin order and places the instructions read into the N instruction queue buffers corresponding one-to-one to said N threads; said thread scheduling unit selects one of said N instruction queue buffers, takes one instruction from it, and sends the instruction to said scalar/vector decoding unit for decoding; the decoded instruction is sent to the corresponding scalar execution pipeline or vector execution pipeline for execution; and during execution, said scalar execution pipeline or vector execution pipeline accesses data from said data cache.
6. The vector crossing multithreaded microprocessor according to claim 5, characterized in that said data cache comprises an interconnected first-level cache and second-level cache, said scalar execution pipeline is connected to said first-level cache and accesses data from said first-level cache, and said vector execution pipeline is connected to said second-level cache through a vector access interface and accesses data directly from the second-level cache.
7. The vector crossing multithreaded microprocessor according to claim 6, characterized in that said scalar execution pipeline comprises: a scalar register file unit for storing the source operands of scalar instructions, a scalar operand selection unit for selecting source operands from said scalar register file unit, a scalar execution unit for performing operations on said source operands, and a scalar data write-back unit for writing results back to the scalar register file unit after an operation completes;
said vector execution pipeline comprises: a vector register file unit for storing the source operands of vector instructions; a vector operand selection unit for selecting source operands from said scalar register file unit and/or said vector register file unit, and for completing operand transfers between said vector execution unit and the scalar execution unit; a vector execution unit for performing operations on said source operands; and a vector data write-back unit for writing results back to the vector register file unit after an operation completes; said scalar execution unit is connected to said first-level cache, and said vector execution unit is connected to said second-level cache.
8. The vector crossing multithreaded microprocessor according to claim 7, characterized in that said vector execution unit comprises: a vector Load/Store unit for completing vector Load/Store instructions or scalar Load/Store instructions, a vector floating-point unit for completing vector floating-point instructions, a vector arithmetic-logic unit for completing vector arithmetic-logic instructions, and a vector shuffle unit for completing vector shuffle instructions.
9. The vector crossing multithreaded microprocessor according to claim 5, 6, 7, or 8, characterized in that the size of a data block in said data cache is the same as the width of said vector register file unit.
10. The vector crossing multithreaded microprocessor according to claim 5, 6, 7, or 8, characterized in that said vector crossing multithreaded microprocessor comprises two or more vector crossing multithreaded microprocessor cores, each of said two or more vector crossing multithreaded microprocessor cores is provided with its own interconnected first-level cache and second-level cache, said two or more second-level caches are interconnected by a crossbar switch with the off-core third-level caches, the inter-chip network interface, and the peripheral interface, and each of said third-level caches is connected to a memory controller used to access external memory.
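The vector instruction categories enumerated in claim 4 can be exercised with a small interpreter sketch. This is a hypothetical Python model, not the patented hardware: vectors are 4 elements wide, memory is a dictionary, only representative instructions are modelled, and encoding the shuffle pattern as a permutation list held in register rA is an assumption about the shuffle semantics.

```python
# Hypothetical interpreter for a few claim-4 instructions, with
# 4-element vector registers and a dict-based main memory.
VLEN = 4
V, R, MEM = {}, {}, {}   # vector registers, scalar registers, main memory

def vload(vA, addr):                 # vload vA rB / vload vA imm
    V[vA] = [MEM.get(addr + i, 0) for i in range(VLEN)]

def vstore(vA, addr):                # vstore vA rB / vstore vA imm
    for i, x in enumerate(V[vA]):
        MEM[addr + i] = x

def vtos(vA, rB, idx):               # element idx of vA -> scalar rB
    R[rB] = V[vA][idx]

def stov(vA, rB):                    # replicate rB into all lanes of vA
    V[vA] = [R[rB]] * VLEN

def vvvop(op, vD, vA, vB):           # element-wise op on vA, vB -> vD
    V[vD] = [op(a, b) for a, b in zip(V[vA], V[vB])]

def vshuffle(vC, vB, rA):            # reorder vB by the pattern in rA
    # Assumption: rA holds a permutation of lane indices.
    V[vC] = [V[vB][i] for i in R[rA]]
```

A short run shows the vector/scalar transfer instructions cooperating with the arithmetic ones, which is exactly the mixed-operand case that motivates the vector operand selection unit in the description.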
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2011101138829A CN102156637A (en) | 2011-05-04 | 2011-05-04 | Vector crossing multithread processing method and vector crossing multithread microprocessor |
Publications (1)
Publication Number | Publication Date |
---|---|
CN102156637A true CN102156637A (en) | 2011-08-17 |
Family
ID=44438145
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2011101138829A Pending CN102156637A (en) | 2011-05-04 | 2011-05-04 | Vector crossing multithread processing method and vector crossing multithread microprocessor |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102156637A (en) |
Cited By (32)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102508776A (en) * | 2011-11-03 | 2012-06-20 | 中国人民解放军国防科学技术大学 | Automatic construction method for evaluation stimulus of multi-thread cross double-precision short-vector structure |
CN103699360A (en) * | 2012-09-27 | 2014-04-02 | 北京中科晶上科技有限公司 | Vector processor and vector data access and interaction method thereof |
CN103890719A (en) * | 2011-10-18 | 2014-06-25 | 联发科技瑞典有限公司 | Digital signal processor and baseband communication device |
CN103930883A (en) * | 2011-09-28 | 2014-07-16 | Arm有限公司 | Interleaving data accesses issued in response to vector access instructions |
CN104040489A (en) * | 2011-12-23 | 2014-09-10 | 英特尔公司 | Multi-register gather instruction |
WO2017016486A1 (en) * | 2015-07-30 | 2017-02-02 | Huawei Technologies Co., Ltd. | System and method for variable lane architecture |
WO2017185405A1 (en) * | 2016-04-26 | 2017-11-02 | 北京中科寒武纪科技有限公司 | Apparatus and method for performing vector outer product arithmetic |
WO2017185411A1 (en) * | 2016-04-29 | 2017-11-02 | 北京中科寒武纪科技有限公司 | Apparatus and method for executing adagrad gradient descent training algorithm |
WO2017185404A1 (en) * | 2016-04-26 | 2017-11-02 | 北京中科寒武纪科技有限公司 | Apparatus and method for performing vector logical operation |
CN107315569A (en) * | 2016-04-27 | 2017-11-03 | 北京中科寒武纪科技有限公司 | A kind of device and method for being used to perform RMSprop gradient descent algorithms |
CN107315570A (en) * | 2016-04-27 | 2017-11-03 | 北京中科寒武纪科技有限公司 | It is a kind of to be used to perform the device and method that Adam gradients decline training algorithm |
CN107315717A (en) * | 2016-04-26 | 2017-11-03 | 北京中科寒武纪科技有限公司 | A kind of apparatus and method for performing vectorial arithmetic |
CN107315565A (en) * | 2016-04-26 | 2017-11-03 | 北京中科寒武纪科技有限公司 | It is a kind of to be used to generate the random vector apparatus and method obeyed and be necessarily distributed |
CN107315575A (en) * | 2016-04-26 | 2017-11-03 | 北京中科寒武纪科技有限公司 | A kind of apparatus and method for performing vectorial union operation |
CN107341540A (en) * | 2016-04-29 | 2017-11-10 | 北京中科寒武纪科技有限公司 | A kind of apparatus and method for performing Hessian-Free training algorithms |
CN107408063A (en) * | 2015-02-02 | 2017-11-28 | 优创半导体科技有限公司 | It is configured with the vector processor that asymmetric multithreading is operated to variable-length vector |
US9870340B2 (en) | 2015-03-30 | 2018-01-16 | International Business Machines Corporation | Multithreading in vector processors |
WO2018082229A1 (en) * | 2016-11-03 | 2018-05-11 | 北京中科寒武纪科技有限公司 | Slam operation apparatus and method |
CN108255519A (en) * | 2016-12-29 | 2018-07-06 | 展讯通信(上海)有限公司 | The floating point instruction processing method and processing device of synchronous multiline procedure processor |
CN109032666A (en) * | 2018-07-03 | 2018-12-18 | 中国人民解放军国防科技大学 | Method and device for determining number of assertion active elements for vector processing |
CN109062604A (en) * | 2018-06-26 | 2018-12-21 | 天津飞腾信息技术有限公司 | A kind of launching technique and device towards the mixing execution of scalar sum vector instruction |
CN111291880A (en) * | 2017-10-30 | 2020-06-16 | 上海寒武纪信息科技有限公司 | Computing device and computing method |
CN111464316A (en) * | 2012-03-30 | 2020-07-28 | 英特尔公司 | Method and apparatus for processing SHA-2 secure hash algorithms |
CN111580864A (en) * | 2016-01-20 | 2020-08-25 | 中科寒武纪科技股份有限公司 | Vector operation device and operation method |
CN111651204A (en) * | 2016-04-26 | 2020-09-11 | 中科寒武纪科技股份有限公司 | Device and method for executing vector maximum and minimum operation |
WO2022127441A1 (en) * | 2020-12-16 | 2022-06-23 | 广东赛昉科技有限公司 | Method for extracting instructions in parallel and readable storage medium |
CN115129480A (en) * | 2022-08-26 | 2022-09-30 | 上海登临科技有限公司 | Scalar processing unit and access control method thereof |
WO2023123453A1 (en) * | 2021-12-31 | 2023-07-06 | 华为技术有限公司 | Operation acceleration processing method, operation accelerator use method, and operation accelerator |
CN116450216A (en) * | 2023-06-12 | 2023-07-18 | 上海灵动微电子股份有限公司 | Local caching method for shared hardware operation unit |
CN116483441A (en) * | 2023-06-21 | 2023-07-25 | 睿思芯科(深圳)技术有限公司 | Output time sequence optimizing system, method and related equipment based on shift buffering |
CN117348933A (en) * | 2023-12-05 | 2024-01-05 | 睿思芯科(深圳)技术有限公司 | Processor and computer system |
CN117931729A (en) * | 2024-03-22 | 2024-04-26 | 芯来智融半导体科技(上海)有限公司 | Vector processor memory access instruction processing method and system |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4633389A (en) * | 1982-02-03 | 1986-12-30 | Hitachi, Ltd. | Vector processor system comprised of plural vector processors |
CN1349159A (en) * | 2001-11-28 | 2002-05-15 | 中国人民解放军国防科学技术大学 | Vector processing method of microprocessor |
CN1478228A (en) * | 2000-11-02 | 2004-02-25 | Intel | Breaking replay dependency loops in processor using rescheduled replay queue
CN1781088A (en) * | 2001-12-20 | 2006-05-31 | 杉桥技术公司 | Multithreaded processor with efficient processing for convergence device applications |
CN1834956A (en) * | 2005-03-18 | 2006-09-20 | 联想(北京)有限公司 | Processing of multiroute processing element data |
CN101978350A (en) * | 2008-03-28 | 2011-02-16 | 英特尔公司 | Vector instructions to enable efficient synchronization and parallel reduction operations |
CN101986264A (en) * | 2010-11-25 | 2011-03-16 | 中国人民解放军国防科学技术大学 | Multifunctional floating-point multiply and add calculation device for single instruction multiple data (SIMD) vector microprocessor |
2011-05-04: application CN2011101138829A filed in China; published as CN102156637A (en); status: active, Pending
Cited By (72)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103930883A (en) * | 2011-09-28 | 2014-07-16 | Arm有限公司 | Interleaving data accesses issued in response to vector access instructions |
CN103930883B (en) * | 2011-09-28 | 2017-02-15 | Arm 有限公司 | Interleaving data accesses method and device in response to vector access instructions |
CN103890719B (en) * | 2011-10-18 | 2016-11-16 | 联发科技瑞典有限公司 | Digital signal processor and baseband communication equipment |
CN103890719A (en) * | 2011-10-18 | 2014-06-25 | 联发科技瑞典有限公司 | Digital signal processor and baseband communication device |
CN102508776A (en) * | 2011-11-03 | 2012-06-20 | 中国人民解放军国防科学技术大学 | Automatic construction method for evaluation stimulus of multi-thread cross double-precision short-vector structure |
CN104040489A (en) * | 2011-12-23 | 2014-09-10 | 英特尔公司 | Multi-register gather instruction |
US9766887B2 (en) | 2011-12-23 | 2017-09-19 | Intel Corporation | Multi-register gather instruction |
US10180838B2 (en) | 2011-12-23 | 2019-01-15 | Intel Corporation | Multi-register gather instruction |
CN111464316A (en) * | 2012-03-30 | 2020-07-28 | 英特尔公司 | Method and apparatus for processing SHA-2 secure hash algorithms |
CN111464316B (en) * | 2012-03-30 | 2023-10-27 | 英特尔公司 | Method and apparatus for processing SHA-2 secure hash algorithm |
CN103699360B (en) * | 2012-09-27 | 2016-09-21 | 北京中科晶上科技有限公司 | A kind of vector processor and carry out vector data access, mutual method |
CN103699360A (en) * | 2012-09-27 | 2014-04-02 | 北京中科晶上科技有限公司 | Vector processor and vector data access and interaction method thereof |
CN107408063A (en) * | 2015-02-02 | 2017-11-28 | 优创半导体科技有限公司 | It is configured with the vector processor that asymmetric multithreading is operated to variable-length vector |
US9870340B2 (en) | 2015-03-30 | 2018-01-16 | International Business Machines Corporation | Multithreading in vector processors |
WO2017016486A1 (en) * | 2015-07-30 | 2017-02-02 | Huawei Technologies Co., Ltd. | System and method for variable lane architecture |
US10884756B2 (en) | 2015-07-30 | 2021-01-05 | Futurewei Technologies, Inc. | System and method for variable lane architecture |
US10691463B2 (en) | 2015-07-30 | 2020-06-23 | Futurewei Technologies, Inc. | System and method for variable lane architecture |
CN111580864B (en) * | 2016-01-20 | 2024-05-07 | 中科寒武纪科技股份有限公司 | Vector operation device and operation method |
CN111580864A (en) * | 2016-01-20 | 2020-08-25 | 中科寒武纪科技股份有限公司 | Vector operation device and operation method |
CN111580866A (en) * | 2016-01-20 | 2020-08-25 | 中科寒武纪科技股份有限公司 | Vector operation device and operation method |
CN111580866B (en) * | 2016-01-20 | 2024-05-07 | 中科寒武纪科技股份有限公司 | Vector operation device and operation method |
CN111651203A (en) * | 2016-04-26 | 2020-09-11 | 中科寒武纪科技股份有限公司 | Device and method for executing vector four-rule operation |
US11100192B2 (en) | 2016-04-26 | 2021-08-24 | Cambricon Technologies Corporation Limited | Apparatus and methods for vector operations |
CN111651206B (en) * | 2016-04-26 | 2024-05-07 | 中科寒武纪科技股份有限公司 | Apparatus and method for performing vector outer product operation |
CN111651203B (en) * | 2016-04-26 | 2024-05-07 | 中科寒武纪科技股份有限公司 | Device and method for executing vector four-rule operation |
WO2017185405A1 (en) * | 2016-04-26 | 2017-11-02 | 北京中科寒武纪科技有限公司 | Apparatus and method for performing vector outer product arithmetic |
CN111651204B (en) * | 2016-04-26 | 2024-04-05 | 中科寒武纪科技股份有限公司 | Apparatus and method for performing vector maximum-minimum operation |
WO2017185404A1 (en) * | 2016-04-26 | 2017-11-02 | 北京中科寒武纪科技有限公司 | Apparatus and method for performing vector logical operation |
CN107315575A (en) * | 2016-04-26 | 2017-11-03 | 北京中科寒武纪科技有限公司 | A kind of apparatus and method for performing vectorial union operation |
US11507640B2 (en) | 2016-04-26 | 2022-11-22 | Cambricon Technologies Corporation Limited | Apparatus and methods for vector operations |
CN107315565A (en) * | 2016-04-26 | 2017-11-03 | 北京中科寒武纪科技有限公司 | It is a kind of to be used to generate the random vector apparatus and method obeyed and be necessarily distributed |
CN107315717A (en) * | 2016-04-26 | 2017-11-03 | 北京中科寒武纪科技有限公司 | A kind of apparatus and method for performing vectorial arithmetic |
CN107315568B (en) * | 2016-04-26 | 2020-08-07 | 中科寒武纪科技股份有限公司 | Device for executing vector logic operation |
US11501158B2 (en) | 2016-04-26 | 2022-11-15 | Cambricon (Xi'an) Semiconductor Co., Ltd. | Apparatus and methods for generating random vectors |
CN107315716A (en) * | 2016-04-26 | 2017-11-03 | 北京中科寒武纪科技有限公司 | A kind of apparatus and method for performing Outer Product of Vectors computing |
CN107315568A (en) * | 2016-04-26 | 2017-11-03 | 北京中科寒武纪科技有限公司 | A kind of device for being used to perform vector logic computing |
US11436301B2 (en) | 2016-04-26 | 2022-09-06 | Cambricon Technologies Corporation Limited | Apparatus and methods for vector operations |
CN111651206A (en) * | 2016-04-26 | 2020-09-11 | 中科寒武纪科技股份有限公司 | Device and method for executing vector outer product operation |
CN111651204A (en) * | 2016-04-26 | 2020-09-11 | 中科寒武纪科技股份有限公司 | Device and method for executing vector maximum and minimum operation |
US10831861B2 (en) | 2016-04-26 | 2020-11-10 | Cambricon Technologies Corporation Limited | Apparatus and methods for vector operations |
US11341211B2 (en) | 2016-04-26 | 2022-05-24 | Cambricon Technologies Corporation Limited | Apparatus and methods for vector operations |
US11157593B2 (en) | 2016-04-26 | 2021-10-26 | Cambricon Technologies Corporation Limited | Apparatus and methods for combining vectors |
US10997276B2 (en) | 2016-04-26 | 2021-05-04 | Cambricon Technologies Corporation Limited | Apparatus and methods for vector operations |
US11126429B2 (en) | 2016-04-26 | 2021-09-21 | Cambricon Technologies Corporation Limited | Apparatus and methods for bitwise vector operations |
CN107315570B (en) * | 2016-04-27 | 2021-06-18 | 中科寒武纪科技股份有限公司 | Device and method for executing Adam gradient descent training algorithm |
CN107315569B (en) * | 2016-04-27 | 2021-06-18 | 中科寒武纪科技股份有限公司 | Device and method for executing RMSprop gradient descent algorithm |
CN107315569A (en) * | 2016-04-27 | 2017-11-03 | 北京中科寒武纪科技有限公司 | A kind of device and method for being used to perform RMSprop gradient descent algorithms |
CN107315570A (en) * | 2016-04-27 | 2017-11-03 | 北京中科寒武纪科技有限公司 | It is a kind of to be used to perform the device and method that Adam gradients decline training algorithm |
CN107341540A (en) * | 2016-04-29 | 2017-11-10 | 北京中科寒武纪科技有限公司 | A kind of apparatus and method for performing Hessian-Free training algorithms |
CN107341540B (en) * | 2016-04-29 | 2021-07-20 | 中科寒武纪科技股份有限公司 | Device and method for executing Hessian-Free training algorithm |
CN107341132B (en) * | 2016-04-29 | 2021-06-11 | 中科寒武纪科技股份有限公司 | Device and method for executing AdaGrad gradient descent training algorithm |
WO2017185411A1 (en) * | 2016-04-29 | 2017-11-02 | 北京中科寒武纪科技有限公司 | Apparatus and method for executing adagrad gradient descent training algorithm |
CN107341132A (en) * | 2016-04-29 | 2017-11-10 | 北京中科寒武纪科技有限公司 | It is a kind of to be used to perform the apparatus and method that AdaGrad gradients decline training algorithm |
WO2018082229A1 (en) * | 2016-11-03 | 2018-05-11 | 北京中科寒武纪科技有限公司 | Slam operation apparatus and method |
CN108255519A (en) * | 2016-12-29 | 2018-07-06 | 展讯通信(上海)有限公司 | The floating point instruction processing method and processing device of synchronous multiline procedure processor |
CN108255519B (en) * | 2016-12-29 | 2020-08-14 | 展讯通信(上海)有限公司 | Floating point instruction processing method and device of synchronous multi-thread processor |
CN111291880B (en) * | 2017-10-30 | 2024-05-14 | 上海寒武纪信息科技有限公司 | Computing device and computing method |
CN111291880A (en) * | 2017-10-30 | 2020-06-16 | 上海寒武纪信息科技有限公司 | Computing device and computing method |
CN109062604A (en) * | 2018-06-26 | 2018-12-21 | 天津飞腾信息技术有限公司 | A kind of launching technique and device towards the mixing execution of scalar sum vector instruction |
CN109032666B (en) * | 2018-07-03 | 2021-03-23 | 中国人民解放军国防科技大学 | Method and device for determining number of assertion active elements for vector processing |
CN109032666A (en) * | 2018-07-03 | 2018-12-18 | 中国人民解放军国防科技大学 | Method and device for determining number of assertion active elements for vector processing |
WO2022127441A1 (en) * | 2020-12-16 | 2022-06-23 | 广东赛昉科技有限公司 | Method for extracting instructions in parallel and readable storage medium |
WO2023123453A1 (en) * | 2021-12-31 | 2023-07-06 | 华为技术有限公司 | Operation acceleration processing method, operation accelerator use method, and operation accelerator |
CN115129480B (en) * | 2022-08-26 | 2022-11-08 | 上海登临科技有限公司 | Scalar processing unit and access control method thereof |
CN115129480A (en) * | 2022-08-26 | 2022-09-30 | 上海登临科技有限公司 | Scalar processing unit and access control method thereof |
CN116450216B (en) * | 2023-06-12 | 2023-08-29 | 上海灵动微电子股份有限公司 | Local caching method for shared hardware operation unit |
CN116450216A (en) * | 2023-06-12 | 2023-07-18 | 上海灵动微电子股份有限公司 | Local caching method for shared hardware operation unit |
CN116483441B (en) * | 2023-06-21 | 2023-09-12 | 睿思芯科(深圳)技术有限公司 | Output time sequence optimizing system, method and related equipment based on shift buffering |
CN116483441A (en) * | 2023-06-21 | 2023-07-25 | 睿思芯科(深圳)技术有限公司 | Output time sequence optimizing system, method and related equipment based on shift buffering |
CN117348933A (en) * | 2023-12-05 | 2024-01-05 | 睿思芯科(深圳)技术有限公司 | Processor and computer system |
CN117348933B (en) * | 2023-12-05 | 2024-02-06 | 睿思芯科(深圳)技术有限公司 | Processor and computer system |
CN117931729A (en) * | 2024-03-22 | 2024-04-26 | 芯来智融半导体科技(上海)有限公司 | Vector processor memory access instruction processing method and system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN102156637A (en) | Vector crossing multithread processing method and vector crossing multithread microprocessor | |
Ipek et al. | Core fusion: accommodating software diversity in chip multiprocessors | |
CN106648554B (en) | For improving system, the method and apparatus of the handling capacity in continuous transactional memory area | |
CN102004719B (en) | Very long instruction word processor structure supporting simultaneous multithreading | |
CN102750133B (en) | 32-Bit triple-emission digital signal processor supporting SIMD | |
CN104050023B (en) | System and method for realizing transaction memory | |
EP3716056B1 (en) | Apparatus and method for program order queue (poq) to manage data dependencies in processor having multiple instruction queues | |
US11275637B2 (en) | Aggregated page fault signaling and handling | |
US10095623B2 (en) | Hardware apparatuses and methods to control access to a multiple bank data cache | |
US9904553B2 (en) | Method and apparatus for implementing dynamic portbinding within a reservation station | |
CN105453030B (en) | Mode-dependent partial-width load to wider registers: processors, methods, and systems | |
CN103365627A (en) | System and method of data forwarding within an execution unit | |
US10275242B2 (en) | System and method for real time instruction tracing | |
KR20220151134A (en) | Apparatus and method for adaptively scheduling work on heterogeneous processing resources | |
CN104216681B (en) | CPU instruction processing method and processor | |
CN100451951C (en) | 5+3-stage pipeline structure and method in a RISC CPU | |
US20140129805A1 (en) | Execution pipeline power reduction | |
CN101266559A (en) | Configurable microprocessor and method for dividing a single microprocessor core into multiple cores | |
CN108351780A (en) | Processors, methods, systems, and instructions for pairwise swap of contiguous data elements | |
CN105183697B (en) | Embedded RISC DSP processor system and construction method | |
EP3757772A1 (en) | System, apparatus and method for a hybrid reservation station for a processor | |
Omondi | The microarchitecture of pipelined and superscalar computers | |
CN103218207A (en) | Microprocessor instruction processing method and system based on a single/dual-issue instruction set | |
CN202720631U (en) | Microprocessor instruction processing system based on a single/dual-issue instruction set | |
CN105843589B (en) | Memory device applied to VLIW-type processors | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C12 | Rejection of a patent application after its publication | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20110817 |