CN100489830C - 64 bit stream processor chip system structure oriented to scientific computing - Google Patents

64 bit stream processor chip system structure oriented to scientific computing

Info

Publication number
CN100489830C
CN100489830C CNB2007100345666A CN200710034566A
Authority
CN
China
Prior art keywords
stream
group
instruction
controller
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CNB2007100345666A
Other languages
Chinese (zh)
Other versions
CN101021831A (en)
Inventor
杨学军
张民选
邢座程
蒋江
马驰远
李勇
陈海燕
高军
李晋文
衣晓飞
张明
张承义
穆长富
阳柳
曾献君
倪晓强
唐遇星
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CNB2007100345666A priority Critical patent/CN100489830C/en
Publication of CN101021831A publication Critical patent/CN101021831A/en
Application granted granted Critical
Publication of CN100489830C publication Critical patent/CN100489830C/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention discloses a 64-bit stream processor chip oriented to scientific computing. The chip includes a 64-bit scalar processing core that serves as the main processor and is responsible for executing scalar programs and scheduling the stream processing core. An on-chip memory and its controller form the interface between the 64-bit scalar processing core, the 64-bit stream processing core, and the outside world, and store data and instructions; a network interface is the communication interface through which the 64-bit scalar processing core and the 64-bit stream processing core reach other processors. The 64-bit stream processing core includes a stream controller, compute clusters, an instruction buffer, a cluster controller, a data buffer, and a data buffer controller.

Description

64-bit stream processor chip oriented to scientific computing
Technical field
The present invention relates generally to the field of microprocessor design, and in particular to a 64-bit stream processor chip oriented to scientific computing.
Background technology
For a long time, improvements in microprocessor performance (the number of operations handled per unit time, usually measured in MIPS or FLOPS) have depended primarily on two factors: clock frequency and parallelism. In recent years, under the influence of power consumption, signal integrity, and related factors, the steady rise of clock frequency has slowed to only about 6% per year, and the annual performance improvement has accordingly dropped from roughly 52% in the 1990s to about 20%. Exploiting parallelism has therefore again become the main path to raising microprocessor performance. Parallelism here refers to the average number of operations that can be carried out simultaneously in each clock cycle. Instruction-level parallelism (Instruction Level Parallelism, ILP), data-level parallelism (Data Level Parallelism, DLP), and thread-level parallelism (Thread Level Parallelism, TLP) are the principal forms of parallelism exploited in microprocessors.
ILP refers to the overlapped or simultaneous execution of instructions. Pipelining is the simplest way to exploit ILP: by dedicating function units and dividing instruction execution into stages, parts of different instructions can be overlapped. Hazards are the main obstacle to parallelism exploitation; they can be resolved either by dynamic hardware techniques or by static compiler scheduling, with superscalar and very long instruction word (Very Long Instruction Word, VLIW) designs being the respective representatives of the two approaches.
DLP refers to the overlapped or simultaneous processing of data. Different data items are often processed by the same instruction or instruction sequence, for example in the vector computations and loop structures of scientific computing and media applications. SIMD (Single Instruction stream, Multiple Data streams) applies the same operation in parallel across different processing elements, each handling different data, and can therefore exploit DLP effectively. Each processing element has its own data space but shares the same instruction space and control logic. Vector processors and stream processors are both examples of DLP exploitation: vector processing applies one operation to many data items in parallel, and consecutive vector operations can be connected in series through "chaining", whereas stream processing applies an entire operation sequence or program segment to many data items in parallel.
TLP refers to the overlapped or simultaneous execution of threads. A thread is a process with an independent instruction space and data space; it may represent a process or be an independent program. When multiple threads execute in parallel their data spaces differ, so data hazards are less likely, and as long as there are enough execution resources, instruction sequences from different threads can be processed at the same time. This is a MIMD (Multiple Instruction streams, Multiple Data streams) execution model and exploits parallelism more flexibly. Multiprocessor systems and chip multiprocessors support the exploitation of TLP, which is a coarser-grained, higher-level form of parallelism.
Although parallelism can be exploited through many channels, all of them are ultimately limited by data communication bandwidth, particularly for scientific computing applications with huge data sets; this is the so-called "memory wall" problem. Modern integrated-circuit technology makes it easy to integrate thousands of functional units on a chip, but supplying operands to this mass of functional units has become the main factor limiting microprocessor performance. Loading and storing data and swapping it in and out occupy the greater part of CPU time and lie on the critical path of program execution; improving on-chip and off-chip memory bandwidth, or avoiding heavy memory operations as far as possible, is therefore one of the main goals of modern microprocessor architecture design.
Stream programming (streamization) of an application helps expose the parallelism in the program and the locality of its data. A stream is an ordered set of mutually independent data items of the same structure, called records. The basic idea of stream programming is to separate memory operations from computation, taking computations as nodes and organizing the program with data streams as the links that connect those nodes. A stream processor is a microprocessor architecture that exploits the parallelism and locality exposed by stream programs to improve processing performance. Existing stream processors, chiefly the Imagine, Cell, and Raw processor models, are 32-bit processors and are therefore unsuitable for scientific computing applications that require at least 64-bit arithmetic.
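As a minimal illustration of the stream programming model described above (the record type, kernel, and function names here are hypothetical and not taken from the patent), a stream program separates memory access from computation: records of identical structure are read as a stream, the same kernel is applied independently to every record, and the result stream is written back.

    #include <stddef.h>

    /* Hypothetical 64-bit record type: all records in a stream share one structure. */
    typedef struct { double a, b; } record_t;

    /* Kernel: the same operation is applied independently to every record,
       which exposes data-level parallelism to the hardware. */
    static double kernel(record_t r) { return r.a * r.b + r.a; }

    /* Stream-style loop: memory access (in[]/out[]) is separated from the
       computation in kernel(); a stream processor would overlap the two. */
    void run_stream(const record_t *in, double *out, size_t n) {
        for (size_t i = 0; i < n; i++)
            out[i] = kernel(in[i]);
    }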
Scientific computing applications inherently contain large amounts of instruction-level and data-level parallelism and can be stream-programmed. A microprocessor architecture oriented to scientific computing should therefore provide abundant computational resources to support parallelism exploitation, be able to hide memory latency effectively when memory bandwidth is limited, and improve the compute-to-memory-access ratio by exploiting the locality of data operations.
Summary of the invention
The technical problem to be solved by the present invention is as follows: in view of the technical problems existing in the prior art, the present invention proposes a heterogeneous multi-core stream processor architecture oriented to 64-bit scientific computing applications. It realizes the mixed execution of scalar programs and stream programs and, by providing a large number of on-chip functional units and a hierarchical register file, can effectively improve the compute-to-memory-access ratio, alleviate the memory bandwidth bottleneck, support the simultaneous exploitation of instruction-level and data-level parallelism, and improve the execution performance of application programs in fields such as scientific computing.
To solve the above technical problem, the solution proposed by the present invention is a 64-bit stream processor chip oriented to scientific computing, characterized in that it comprises a 64-bit scalar processing core, a 64-bit stream processing core, an on-chip memory and its controller, and a network interface. The 64-bit scalar processing core serves as the main processor and is responsible for executing scalar programs and scheduling the stream processing core; the on-chip memory and controller are the communication interface between the two cores and the outside world and store data and instructions; the network interface is the interface through which the two cores communicate with other processors. The 64-bit stream processing core comprises a stream controller, compute clusters, an instruction buffer, a cluster controller, a data buffer, and a data buffer controller. The stream controller is the stream-level instruction control unit and is responsible for controlling and issuing stream scheduling instructions; the compute clusters are responsible for executing the cluster operation instructions of the stream program; the instruction buffer stores the cluster operation instructions of the stream program; the cluster controller receives control signals from the stream controller, starts the compute clusters, and loads cluster operation instructions; the data buffer holds the operands required by cluster computation and the final results; and the data buffer controller receives control signals from the stream controller, reads stream data and stream instructions from memory, and writes the final results back to memory.
Each compute cluster is a group of processing units (PEs) of identical structure that simultaneously execute, in single-instruction multiple-data fashion, the same instruction or instruction sequence issued by the cluster controller. Each PE comprises several arithmetic/logic function units and several local register files; the function units support very long instruction word execution and process several operations of different types in parallel, and each function unit has its own local register file, which directly supplies its operands and holds its results.
The data buffer is composed of a register file, m stream buffers, an arbitration controller, and m stream buffer control units. The register file uses a single-port static random-access memory structure and is divided into N parallel banks, one bank per compute cluster; the records of a stream are interleaved across the banks. The compute cluster module, the cluster controller module, the data buffer controller module, and the network interface module in the 64-bit stream processing core all access the data buffer through the stream buffers.
Compared with the prior art, the advantages of the present invention are as follows. With the 64-bit stream processor architecture design method proposed by the invention, a substantial performance improvement can be achieved for streamized scientific computing applications. The heterogeneous multi-core integration adopts a design philosophy that combines general-purpose and special-purpose processing, retaining compatibility with traditional applications while accelerating scientific programs. The multi-level memory system captures the locality in streamized scientific programs; the compute cluster design, which fuses SIMD and VLIW, fully exploits the abundant instruction-level and data-level parallelism in scientific programs, improves the compute-to-memory-access ratio, effectively hides memory latency when memory bandwidth is limited, alleviates the memory bandwidth bottleneck, and greatly improves the execution performance of scientific computing applications.
Description of drawings
Fig. 1 is a schematic block diagram of the stream processor architecture oriented to scientific computing;
Fig. 2 is a schematic diagram of the scalar processing core pipeline;
Fig. 3 is a schematic diagram of the data buffer structure;
Fig. 4 is a schematic diagram of the compute cluster structure.
Embodiment
The present invention is described in further detail below with reference to the drawings and specific embodiments.
A 64-bit stream processor chip oriented to scientific computing according to the present invention comprises a 64-bit scalar processing core, a 64-bit stream processing core, an on-chip memory and its controller, and a network interface. The 64-bit scalar processing core serves as the main processor and is responsible for executing scalar programs and scheduling the stream processing core; the on-chip memory and controller are the communication interface between the two cores and the outside world and store data and instructions; the network interface is the interface through which the two cores communicate with other processors. The 64-bit stream processing core comprises a stream controller, compute clusters, an instruction buffer, a cluster controller, a data buffer, and a data buffer controller. The stream controller is the stream-level instruction control unit and is responsible for controlling and issuing stream scheduling instructions; the compute clusters execute the cluster operation instructions of the stream program; the instruction buffer stores the cluster operation instructions of the stream program; the cluster controller receives control signals from the stream controller, starts the compute clusters, and loads cluster operation instructions; the data buffer holds the operands required by cluster computation and the final results; and the data buffer controller receives control signals from the stream controller, reads stream data and stream instructions from memory, and writes the final results back to memory. Each compute cluster is a group of processing units PE0, PE1, ..., PEN of identical structure that simultaneously execute, in SIMD fashion, the same instruction or instruction sequence issued by the cluster controller. Each PE comprises several arithmetic/logic function units F1, F2, ..., Fn and local register files; the function units support VLIW execution and process several operations of different types in parallel, and each function unit has its own local register file, which directly supplies operands and holds results. The data buffer is composed of a register file, m stream buffers, an arbitration controller, and m stream buffer control units. The register file uses single-port static random-access memory (SRAM) and is divided into N parallel banks, one bank per compute cluster; the records of a stream are interleaved across the banks, and the other modules (the compute clusters, cluster controller, data buffer controller, and network interface) access the data buffer through the stream buffers.
The function of each component in the architecture of the present invention is as follows:
1. Instruction set extension:
The instruction set architecture (Instruction Set Architecture, ISA) on which the present invention is based can be an extension of any general-purpose 64-bit instruction set (such as x86-64, PowerPC, or MIPS). The extension mainly provides support for stream program development and consists of the following classes:
● Stream scheduling instructions: control the movement of stream data between memory and the data buffer, and the movement of cluster operation code from the data buffer to the instruction buffer; executed by the data buffer controller;
● Cluster operation instructions: the core computation instructions of the stream program, in very long instruction word (VLIW) format with explicit instruction-level parallelism; executed by the compute clusters;
Through this instruction set extension, scalar programs and stream programs can be merged. Scalar programs are written with the general-purpose instruction set and stream programs with the extended instruction set; the stream scheduling instructions are the interface through which the scalar program communicates with the stream program, and the scalar program controls the start and stop of the stream acceleration components through stream scheduling instructions.
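The sketch below illustrates, in C-like pseudocode, how a scalar program might drive the stream processing core through stream scheduling operations. The function names (load_kernel, load_stream, run_kernel, store_stream, wait_stream) and their arguments are hypothetical stand-ins for the instruction classes above, not the actual mnemonics of the extended instruction set.

    /* Hypothetical wrappers around stream scheduling instructions; each would be
       issued by the scalar core and carried out by the stream processing core. */
    void load_stream(int buf_id, const void *mem_addr, int n_records);  /* memory -> data buffer */
    void load_kernel(int kern_id, const void *code_addr);               /* cluster code -> instruction buffer */
    void run_kernel(int kern_id, int in_buf, int out_buf);              /* start the compute clusters */
    void store_stream(int buf_id, void *mem_addr, int n_records);       /* data buffer -> memory */
    void wait_stream(void);                                             /* block until the stream core is idle */

    extern const void *kernel0_code;   /* address of the compiled cluster code (hypothetical) */

    void scalar_program(const double *a, double *c, int n) {
        load_kernel(0, kernel0_code);  /* move cluster operation code into the instruction buffer */
        load_stream(0, a, n);          /* gather the input stream into the data buffer */
        run_kernel(0, 0, 1);           /* clusters apply the VLIW kernel to every record */
        store_stream(1, c, n);         /* scatter the result stream back to memory */
        wait_stream();                 /* synchronize before the scalar program continues */
    }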
2. 64-bit scalar processing core:
The 64-bit scalar processing core handles traditional scalar application programs and issues the stream scheduling instructions that control the compute clusters. The scalar processing core may be the simplest single-issue, in-order pipelined core, or a complex multi-issue, out-of-order superscalar or VLIW core, and may include an instruction cache or data cache. The most basic pipeline has five stages: fetch, decode, issue, execute, and write-back. The fetch stage reads application code from memory; the decode stage analyzes the function of each instruction; the issue stage, usually an instruction queue, detects data and resource hazards, reads operands, and allocates function units to ready instructions; the execute stage performs the instruction's function in a function unit; and the write-back stage updates the processor state with the result. If the decode stage finds a stream scheduling instruction to be executed, the instruction is sent to the stream controller, which starts stream program execution.
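A minimal sketch of the decode-stage behavior just described, under assumed interfaces (the insn_t layout and the is_stream_op() predicate are illustrative, not part of the patent): stream scheduling instructions are forwarded to the stream controller instead of continuing down the scalar pipeline.

    #include <stdint.h>
    #include <stdbool.h>

    typedef struct { uint64_t word; } insn_t;

    /* Hypothetical predicate: true for instructions in the stream-scheduling
       extension class, false for ordinary scalar instructions. */
    bool is_stream_op(insn_t i);

    void send_to_stream_controller(insn_t i);  /* enqueue in the stream controller */
    void issue_scalar(insn_t i);               /* continue in the scalar pipeline */

    /* Decode stage: stream scheduling instructions leave the scalar pipeline here. */
    void decode_stage(insn_t i) {
        if (is_stream_op(i))
            send_to_stream_controller(i);   /* stream controller starts stream execution */
        else
            issue_scalar(i);                /* normal scalar issue/execute/write-back */
    }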
3. Stream controller:
The stream controller contains an instruction queue that receives the stream scheduling instructions sent by the scalar processing core and issues them for execution once their dependence constraints are satisfied. Through the execution of stream scheduling instructions, the data buffer controller controls the loading and storing of stream data (including instructions and data) and also provides a data path for the scalar processing core to read individual scalar data; the cluster controller controls the loading of instructions; and the network interface can receive stream data transmitted from other processors. From the status signals fed back by these stream processing components, the stream controller determines when an operation has completed and when the dependences of an instruction are satisfied. The stream controller also contains a general-purpose register file, SCTRF, which can be accessed by instructions such as read, write, and move, and which transfers data between the internal control registers of components such as the stream controller and the scalar processing core.
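The issue logic of the stream controller can be pictured as in the following sketch (a simplification under assumed data structures; the per-buffer busy bits and the dispatch interface are illustrative only): instructions leave the queue only when the resources and streams they depend on are no longer busy.

    #include <stdbool.h>

    #define QLEN 16

    typedef struct {
        unsigned opcode;
        unsigned src_buf, dst_buf;   /* stream buffers read and written */
        bool     issued;
    } stream_insn_t;

    typedef struct {
        stream_insn_t q[QLEN];
        int           head, tail;
        bool          buf_busy[8];   /* hypothetical busy bits driven by status feedback */
    } stream_ctrl_t;

    static bool deps_satisfied(const stream_ctrl_t *sc, const stream_insn_t *i) {
        /* An instruction may issue only when the buffers it uses are idle. */
        return !sc->buf_busy[i->src_buf] && !sc->buf_busy[i->dst_buf];
    }

    void dispatch_to_unit(stream_insn_t *i);   /* data buffer controller, cluster controller, or network interface */

    void issue_cycle(stream_ctrl_t *sc) {
        for (int k = sc->head; k != sc->tail; k = (k + 1) % QLEN) {
            stream_insn_t *i = &sc->q[k];
            if (!i->issued && deps_satisfied(sc, i)) {
                sc->buf_busy[i->src_buf] = sc->buf_busy[i->dst_buf] = true;
                dispatch_to_unit(i);          /* completion feedback later clears the busy bits */
                i->issued = true;
            }
        }
    }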
4. Multi-level memory organization:
The on-chip memory of the present invention is an on-chip backup of the off-chip main memory; the data it stores are a subset of the main memory contents, and it supplies data and instructions to the scalar processing core and the data buffer. The data buffer is essentially used as an extended register file of the scalar processing core; it holds the instruction sequences of the stream program and the stream data required for computation. The loading and storing of data blocks are controlled by the scalar processing core, which, together with the stream controller, executes stream scheduling instructions in the data buffer controller. The data buffer is not randomly accessed but sequentially accessed, which makes it fast, and it is managed entirely by software (the stream scheduling instructions).
The data buffer is composed of a register file, m stream buffers, an arbitration controller, and m stream buffer control units. The register file uses a single-port SRAM structure and is divided into N parallel banks, one bank per compute cluster; the records of a stream are interleaved across the banks, and the other modules (the compute clusters, the cluster controller, and so on) access the data buffer through the stream buffers. An accessing module first sends a read-stream (or write-stream) request to a stream buffer; the arbitration controller then handles these requests, and the corresponding stream buffer control unit directs the stream buffer to read (or write) the register file. An accessing module can read a stream from its associated stream buffer (or write a stream into it) without consuming too much bandwidth, and in this way the single physical SRAM port provides the function of m logical ports.
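The following sketch shows one way the bank interleaving and single-port arbitration just described could work (the record-to-bank mapping, the bank and buffer counts, and the round-robin policy are assumptions for illustration, not details fixed by the patent): record i of a stream lands in bank i mod N, and each cycle the arbiter grants the single SRAM port to one pending stream buffer request.

    #define N_BANKS 8          /* one bank per compute cluster (assumed value) */
    #define M_SBUF  6          /* number of stream buffers (assumed value)     */

    /* Records of a stream are interleaved across the banks. */
    static inline int bank_of(int record_index) { return record_index % N_BANKS; }
    static inline int row_of(int record_index)  { return record_index / N_BANKS; }

    typedef struct { int pending; int record_index; int is_write; } sbuf_req_t;

    /* Round-robin arbitration for the single physical SRAM port: at most one
       stream buffer is granted per cycle, so m logical ports share one port. */
    int arbitrate(const sbuf_req_t req[M_SBUF], int last_grant) {
        for (int k = 1; k <= M_SBUF; k++) {
            int cand = (last_grant + k) % M_SBUF;
            if (req[cand].pending)
                return cand;     /* this stream buffer accesses the register file this cycle */
        }
        return -1;               /* no request pending */
    }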
At the same time, the processing units inside the compute clusters contain a large number of local register files, which hold the intermediate results of computation inside the clusters. The bandwidth of the three-level memory hierarchy falls into three grades: the bandwidth from the data buffer to the on-chip memory is on the order of gigabytes per second, the communication bandwidth from the local registers to the data buffer is on the order of 10 GB per second, and the communication bandwidth inside the local registers is on the order of 100 GB per second. The functional units can thus capture the locality in the stream program; the computation speed depends on the local register bandwidth (the largest of the three grades), a large number of time-consuming remote memory accesses are avoided, the pressure on external memory bandwidth is relieved, and the average memory access time of a computation is reduced.
5. Data buffer controller:
The data buffer controller provides the high-speed path from the data buffer to the on-chip memory, together with the associated address generators and address buffers, and is responsible for executing the stream scheduling instructions. The data buffer controller controls the loading and storing of stream data (including instructions and data) and also provides a data path for the scalar processing core to read individual scalar data. To reach the required bandwidth, the data buffer controller allows two memory-access streams to be active at the same time; each memory-access stream can have 8 records active simultaneously and uses its own address generator to produce the access addresses. The total peak memory bandwidth is 4 GB per second, and the transfer rate between the on-chip memory and the data buffer is 2 x 64 bits per beat. For store operations, the data come from a stream buffer in the data buffer; for load operations, the data read from memory are likewise deposited in the corresponding stream buffer.
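An address generator of the kind referred to above could look like the sketch below (the descriptor fields and the choice of a simple strided pattern are illustrative assumptions; the patent does not specify the generator's internals): each active memory-access stream walks its own base/stride descriptor and produces one record address per step.

    #include <stdint.h>

    /* Hypothetical descriptor for one memory-access stream. */
    typedef struct {
        uint64_t base;        /* address of the first record in memory */
        uint64_t stride;      /* distance in bytes between consecutive records */
        uint32_t n_records;   /* stream length */
        uint32_t next;        /* index of the next record to transfer */
    } addr_gen_t;

    /* Produce the address of the next record, or 0 when the stream is exhausted.
       Two such generators would run concurrently, one per active memory stream. */
    uint64_t addr_gen_next(addr_gen_t *g) {
        if (g->next >= g->n_records)
            return 0;                              /* stream exhausted */
        return g->base + (uint64_t)g->next++ * g->stride;
    }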
6. Cluster controller:
The cluster controller is the control unit for compute cluster operation. It receives parameters and control signals sent by the stream controller and loads the cluster operation code of the stream program from the data buffer into the instruction buffer inside the cluster controller. The computation code of a stream program consists of VLIW instructions generated by the compiler. The cluster controller decodes these VLIW instructions and broadcasts them to the N processing units of the compute clusters for execution in SIMD fashion, i.e. every processing unit executes the same instruction but processes different data. The cluster controller can also receive blocking requests from the scalar processing core or the data buffer, forwarded through the stream controller, and suspend cluster processing in order to synchronize with the scalar processing core or wait for the data buffer to become ready.
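The broadcast step described above might be sketched as follows (a behavioral model only; the VLIW word layout, the slot and PE counts, and the stall interface are assumptions): one decoded VLIW word is sent unchanged to every processing unit, and a pending block request pauses the whole cluster.

    #include <stdbool.h>
    #include <stdint.h>

    #define N_PE     8     /* processing units per cluster (assumed value) */
    #define N_SLOTS  4     /* operations per VLIW word (assumed value)     */

    typedef struct { uint32_t op[N_SLOTS]; } vliw_word_t;

    bool block_requested(void);                         /* from scalar core or data buffer */
    void pe_execute(int pe_id, const vliw_word_t *w);   /* same word, different data */

    /* Cluster controller main step: decode is implicit, broadcast is SIMD. */
    void broadcast_step(const vliw_word_t *w) {
        if (block_requested())
            return;                                /* suspend: synchronize or wait for data */
        for (int pe = 0; pe < N_PE; pe++)
            pe_execute(pe, w);                     /* every PE runs the identical instruction */
    }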
7. Compute cluster:
A compute cluster is a group of processing units PE0, PE1, ..., PEN of identical structure. All PEs execute simultaneously, in SIMD fashion, the same instruction or instruction sequence issued by the cluster controller. Each PE comprises several arithmetic/logic function units F1, F2, ..., Fn and local register files. The function units support VLIW execution and process several operations of different types in parallel (such as addition, multiplication, multiply-add, and logic operations). Each function unit has its own local register file, which directly supplies its operands and holds its results. Operands are first read from the data buffer into the local registers; because the stream controller has already brought the operands into the data buffer before the cluster controller is started to decode, dispatch, and map the instructions, the latency of moving operands from the data buffer into a PE's local registers is fixed, and the function units never stall on not-ready data while executing operations. The latency of any operation of a very long instruction word is therefore identical on all PEs, so the compiler can fully exploit instruction-level parallelism and generate the VLIW instruction sequence without hardware intervention. The local register files are connected by a network and can exchange the temporary data generated during computation; there is also an interconnection network between the PEs for the necessary synchronization and data exchange.
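As a behavioral illustration of SIMD-plus-VLIW execution in a cluster (the structure sizes and the example kernel are assumptions, not taken from the patent): every PE runs the same fixed-latency kernel, and because records are interleaved across the banks, PE p works on records p, p+N, p+2N, and so on.

    #define N_PE 8                     /* processing units per cluster (assumed) */

    /* Example kernel body: what one VLIW word might encode across its slots
       (a multiply and a dependent add scheduled by the compiler). */
    static double pe_kernel(double a, double b) {
        double t = a * b;              /* slot 1: multiplier */
        return t + a;                  /* slot 2: adder */
    }

    /* Behavioral model of one cluster: PE p processes records p, p+N_PE, ...,
       so all PEs execute the identical instruction on different data. */
    void cluster_run(const double *a, const double *b, double *out, int n_records) {
        for (int pe = 0; pe < N_PE; pe++)             /* in hardware these run in lockstep */
            for (int r = pe; r < n_records; r += N_PE)
                out[r] = pe_kernel(a[r], b[r]);
    }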
8. Network interface:
The network interface provides high-bandwidth connections between the processors of a multiprocessor system. Both the scalar processing core and the stream controller can control the network interface. The scalar processing core communicates with other processors by conventional message passing, which requires the support of a software communication protocol; the stream processing core communicates with other processors in stream fashion and exchanges stream data directly with the data buffer. The network uses source routing; the routing information is decided by the stream scheduler, which tracks link usage and distributes the load statically. A message is transmitted by setting the flow-control registers, which specify the various send and receive parameters. Each processor has four bidirectional external network channels, and the network can adopt an arbitrary topology with node degree 4. The link clock rate can range from arbitrarily low up to twice the on-chip clock frequency. To scale to multiprocessor systems spanning several boards, signals are transmitted in differential-current form; the link control bits are encoded on the link, and each input channel uses its own synchronizer.
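A sketch of how a message transfer could be initiated through the flow-control registers (the register map, field names, and source-route encoding are entirely illustrative; the patent only states that parameters are specified by setting flow-control registers): the sender writes the route, the length, and the buffer address, then sets a start bit.

    #include <stdint.h>

    /* Hypothetical memory-mapped flow-control registers of the network interface. */
    typedef struct {
        volatile uint64_t route;     /* source route: sequence of output-channel choices */
        volatile uint64_t src_addr;  /* address of the message (or stream) to send       */
        volatile uint64_t length;    /* message length in 64-bit words                   */
        volatile uint64_t control;   /* bit 0: start; bit 1: done (set by hardware)      */
    } ni_regs_t;

    /* Send one message: the route is computed statically by the stream scheduler. */
    void ni_send(ni_regs_t *ni, uint64_t route, uint64_t src, uint64_t len) {
        ni->route    = route;
        ni->src_addr = src;
        ni->length   = len;
        ni->control  = 1;                    /* start transmission */
        while ((ni->control & 2) == 0)       /* wait for the done bit */
            ;
    }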
The present invention is now described in detail with a specific embodiment. Fig. 1 shows the block diagram of the 64-bit stream processor architecture oriented to scientific computing. The whole processor adopts a heterogeneous multi-core structure and integrates processor cores of two different structures: a 64-bit scalar processing core and a 64-bit stream processing core. While the scalar processing core provides compatibility with traditional scalar programs, the stream processing core uses a large number of functional units to exploit the instruction-level and data-level parallelism in the program and a hierarchical register file to capture the locality in the stream program, thereby accelerating streamized scientific computing applications. The overall structure comprises the 64-bit scalar processing core, the on-chip memory and its controller, the network interface, and the 64-bit stream processing core. The 64-bit scalar processing core serves as the main processor and is responsible for executing scalar programs and scheduling the stream processing core; the on-chip memory and controller are the communication interface between the two processor cores and the outside world, store data and instructions, and provide a communication bandwidth on the order of gigabytes per second; the network interface is the interface through which the two processing cores communicate with other processors, and massively parallel processing systems can be built through it. The stream processing core comprises a stream controller, compute clusters, an instruction buffer, a cluster controller, a data buffer, and a data buffer controller. The stream controller is the stream-level instruction control unit, responsible for controlling and issuing stream scheduling instructions; the compute clusters execute the cluster operation instructions of the stream program; the instruction buffer stores the cluster operation instructions of the stream program; the cluster controller receives control signals from the stream controller, starts the compute clusters, and loads cluster operation instructions; the data buffer holds the operands required by cluster computation and the final results; and the data buffer controller receives control signals from the stream controller, reads stream data and stream instructions from memory, and writes the final results back to memory.
Fig. 2 is a schematic diagram of the scalar processing core pipeline. The most basic pipeline has five stages: fetch, decode, issue, execute, and write-back. The fetch stage reads application code from memory; the decode stage analyzes the function of each instruction; the issue stage, usually an instruction queue, detects data and resource hazards, reads operands, and allocates function units to ready instructions; the execute stage performs the instruction's function in a function unit; and the write-back stage updates the processor state with the result. If the decode stage finds a stream scheduling instruction to be executed, the instruction is sent to the stream controller, which starts stream program execution.
Fig. 3 is the structural diagram of the data buffer. The data buffer is composed of a register file, m stream buffers, an arbitration controller, and m stream buffer control units. The register file uses a single-port SRAM structure and is divided into N parallel banks, one bank per compute cluster; the records of a stream are interleaved across the banks, and the other modules (the compute clusters, the cluster controller, the data buffer controller, and the network interface) access the data buffer through the stream buffers. An accessing module first sends a read-stream (or write-stream) request to a stream buffer; the arbitration controller then handles these requests, and the corresponding stream buffer control unit directs the stream buffer to read (or write) the register file. An accessing module can read a stream from its associated stream buffer (or write a stream into it) without consuming too much bandwidth, and in this way the single physical SRAM port provides the function of m logical ports.
Fig. 4 is the structural diagram of a compute cluster. A compute cluster is a group of processing units PE0, PE1, ..., PEN of identical structure. All PEs execute simultaneously, in SIMD fashion, the same instruction or instruction sequence issued by the cluster controller. Each PE comprises several arithmetic/logic function units F1, F2, ..., Fn and local register files. The function units support VLIW execution and process several operations of different types in parallel (such as addition, multiplication, multiply-add, and logic operations). Each function unit has its own local register file, which directly supplies its operands and holds its results. Operands are first read from the data buffer into the local registers, and the final results are written back to the data buffer. The local register files are connected by a network and can exchange the temporary data generated during computation; there is also an interconnection network between the PEs for the necessary synchronization and data exchange.

Claims (3)

1. A 64-bit stream processor chip oriented to scientific computing, characterized in that it comprises a 64-bit scalar processing core, a 64-bit stream processing core, an on-chip memory and its controller, and a network interface, wherein the 64-bit scalar processing core serves as the main processor and is responsible for executing scalar programs and scheduling the stream processing core; the on-chip memory and controller are the communication interface between the 64-bit scalar processing core, the 64-bit stream processing core, and the outside world and store data and instructions; and the network interface is the interface through which the 64-bit scalar processing core and the 64-bit stream processing core communicate with other processors; the 64-bit stream processing core comprises a stream controller, compute clusters, an instruction buffer, a cluster controller, a data buffer, and a data buffer controller, wherein the stream controller is the stream-level instruction control unit and is responsible for controlling and issuing stream scheduling instructions; the compute clusters are responsible for executing the cluster operation instructions of the stream program; the instruction buffer stores the cluster operation instructions of the stream program; the cluster controller receives control signals from the stream controller, starts the compute clusters, and loads cluster operation instructions; the data buffer holds the operands required by cluster computation and the final results; and the data buffer controller receives control signals from the stream controller, reads stream data and stream instructions from memory, and writes the final results back to memory.
2. The 64-bit stream processor chip oriented to scientific computing according to claim 1, characterized in that each compute cluster is a group of processing units (PEs) of identical structure that simultaneously execute, in single-instruction multiple-data fashion, the same instruction or instruction sequence issued by the cluster controller; each PE comprises several arithmetic/logic function units and several local register files, the function units support very long instruction word execution and process several operations of different types in parallel, and each function unit has its own local register file, which directly supplies its operands and holds its results.
3. The 64-bit stream processor chip oriented to scientific computing according to claim 1 or 2, characterized in that the data buffer is composed of a register file, m stream buffers, an arbitration controller, and m stream buffer control units; the register file uses a single-port static random-access memory structure and is divided into N parallel banks, each bank corresponding to one compute cluster, with the records of a stream interleaved across the banks; and the compute cluster module, the cluster controller module, the data buffer controller module, and the network interface module in the 64-bit stream processing core access the data buffer through the stream buffers.
CNB2007100345666A 2007-03-19 2007-03-19 64 bit stream processor chip system structure oriented to scientific computing Expired - Fee Related CN100489830C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2007100345666A CN100489830C (en) 2007-03-19 2007-03-19 64 bit stream processor chip system structure oriented to scientific computing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB2007100345666A CN100489830C (en) 2007-03-19 2007-03-19 64 bit stream processor chip system structure oriented to scientific computing

Publications (2)

Publication Number Publication Date
CN101021831A CN101021831A (en) 2007-08-22
CN100489830C true CN100489830C (en) 2009-05-20

Family

ID=38709603

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2007100345666A Expired - Fee Related CN100489830C (en) 2007-03-19 2007-03-19 64 bit stream processor chip system structure oriented to scientific computing

Country Status (1)

Country Link
CN (1) CN100489830C (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102681796B (en) * 2012-05-18 2015-04-08 重庆大学 RAM (Random Access Memory) distribution structure in data multistage pipelining algorithm module
CN102779075B (en) 2012-06-28 2014-12-24 华为技术有限公司 Method, device and system for scheduling in multiprocessor nuclear system
US10140129B2 (en) * 2012-12-28 2018-11-27 Intel Corporation Processing core having shared front end unit
GB2544994A (en) * 2015-12-02 2017-06-07 Swarm64 As Data processing
CN110503179B (en) * 2018-05-18 2024-03-01 上海寒武纪信息科技有限公司 Calculation method and related product
CN111079911B (en) * 2018-10-19 2021-02-09 中科寒武纪科技股份有限公司 Operation method, system and related product
CN111208948B (en) * 2020-01-13 2022-08-09 华东师范大学 Request distribution method based on hybrid storage
CN114218152B (en) * 2021-12-06 2023-08-15 海飞科(南京)信息技术有限公司 Stream processing method, processing circuit and electronic equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
流处理器MASA内核的研究及实现 (Research and Implementation of the MASA Stream Processor Core). 伍楠. Master's thesis in Engineering, National University of Defense Technology, 2005 *

Also Published As

Publication number Publication date
CN101021831A (en) 2007-08-22

Similar Documents

Publication Publication Date Title
Mittal et al. A survey of techniques for optimizing deep learning on GPUs
Foley et al. Ultra-performance Pascal GPU and NVLink interconnect
CN100489830C (en) 64 bit stream processor chip system structure oriented to scientific computing
CN100456230C (en) Computing group structure for superlong instruction word and instruction flow multidata stream fusion
US10007527B2 (en) Uniform load processing for parallel thread sub-sets
KR20190044568A (en) Synchronization in a multi-tile, multi-chip processing arrangement
US20080250227A1 (en) General Purpose Multiprocessor Programming Apparatus And Method
TW201702866A (en) User-level fork and join processors, methods, systems, and instructions
US20130145124A1 (en) System and method for performing shaped memory access operations
CN103221933A (en) Method and apparatus for moving data to a SIMD register file from a general purpose register file
US11669443B2 (en) Data layout optimization on processing in memory architecture for executing neural network model
JP2021511576A (en) Deep learning accelerator system and its method
US20120331278A1 (en) Branch removal by data shuffling
CN108268385A (en) The cache proxy of optimization with integrated directory cache
TW202109286A (en) System and architecture of pure functional neural network accelerator
Lin et al. swFLOW: A dataflow deep learning framework on sunway taihulight supercomputer
Li et al. Dual buffer rotation four-stage pipeline for CPU–GPU cooperative computing
Zhang et al. An effective 2-dimension graph partitioning for work stealing assisted graph processing on multi-FPGAs
Liu et al. Ad-heap: An efficient heap data structure for asymmetric multicore processors
Zhang et al. Optimization Methods for Computing System in Mobile CPS
US20230385103A1 (en) Intelligent data conversion in dataflow and data parallel computing systems
Makino et al. Analysis of past and present processors
Sterling et al. The “MIND” scalable PIM architecture
CN112906877A (en) Data layout conscious processing in memory architectures for executing neural network models
Franz et al. Memory efficient multi-swarm PSO algorithm in OpenCL on an APU

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20090520

Termination date: 20110319