CN102053945B

CN102053945B - Concurrent computational system for multi-scale discrete simulation

Info

Publication number: CN102053945B
Application number: CN2009102370271A
Authority: CN
Inventors: 葛蔚; 李静海
Original assignee: Institute of Process Engineering of CAS
Current assignee: Institute of Process Engineering of CAS
Priority date: 2009-11-09
Filing date: 2009-11-09
Publication date: 2012-11-21
Anticipated expiration: 2029-11-09
Also published as: CN102053945A

Abstract

The invention relates to the technical field of high-performance computers and discloses a concurrent computational system for multi-scale discrete simulation. Addressing the multi-scale structure and discrete nature of complex real-world systems, the system maintains consistency and similarity among the simulated object, the computational model, the algorithmic framework and the computer architecture. It describes the local behavior of the system at each level with a large number of adjacent discrete units, describes the collective behavior at each level with long-range constraints and correlations, and applies two-way feedback between upper-level and lower-level units. Correspondingly, a group of special-purpose reconfigurable vectorized accelerators serves as the bottom-layer computing hardware and an array of general-purpose processors serves as the upper-layer computing hardware; adjacent processors or accelerators in the same layer, and some processors or accelerators in adjacent layers, can communicate directly or share memory. The system can therefore simulate complex processes and systems efficiently and realistically.

Description

A concurrent computational system for multi-scale discrete simulation

Technical field

The present invention relates to the field of numerical simulation on high-performance computers, and in particular to a concurrent computational system for multi-scale discrete simulation.

Background art

Computer simulation has become, alongside theory and experiment, a third basic method of scientific research and technological development, and it is the principal application of high-performance computing. The challenges it faces are nonetheless prominent: complex phenomena lack mechanistic models that are both sufficiently realistic and practical; the accuracy of computational methods is increasingly mismatched with the complexity of the simulated objects; the gap between the peak speed of computers and their actual computing capability keeps widening; and the gap between the availability of high-performance computing and the demand for it keeps growing.

To meet these challenges, one must first propose mechanistic models that reflect the essential characteristics of complex systems, and then build algorithms and computer architectures suited to the characteristics of those models. Structural consistency among the simulated object, the model, the algorithm and the computer hardware then enables simulation that is large-scale, efficient, low-cost, accurate and general-purpose.

In this regard, we note that multi-scale structure is a common feature of most simulated objects. Short-range interactions occur among a large number of units, which simultaneously interact with some upper-level units; the upper-level units in turn interact with one another at short range and with units at a still higher level, and so on, forming a multi-scale structural system. In the material world, this self-similar character is found from elementary particles up to the whole universe. Such systems of units also share some common properties that create favorable conditions for simulating them computationally.

First, whether the units exist in nature or are artificially constructed model units, the strength of the interaction between them generally decreases rapidly with increasing distance. Interactions between units that are sufficiently far apart can therefore generally be neglected, or the computation over every pair of units can be replaced by an estimate of the collective effect of many units. This gives rise to locality: although the whole system may contain arbitrarily many units, the instantaneous behavior of any one unit is determined mainly by a very small number of neighboring units. Locality is particularly well suited to parallel computation by domain decomposition (space decomposition), in which different processes on different compute nodes simulate the evolution of units in different spatial regions, and each process communicates or shares memory only with a limited number of neighboring processes. The communication topology among compute nodes thus does not grow more complex as the system scales up, and a linear speed-up can be maintained.
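The locality argument can be illustrated with a minimal Python sketch. It assumes 2D point units and a hypothetical cutoff radius, and bins the units into square cells so that each cell only needs data from itself and its eight neighboring cells; the function names (`build_cells`, `neighbor_pairs`, `brute_force_pairs`) are illustrative and do not come from the patent.

```python
from collections import defaultdict
from itertools import product


def build_cells(positions, cutoff):
    """Bin 2D points into square cells whose side equals the cutoff."""
    cells = defaultdict(list)
    for idx, (x, y) in enumerate(positions):
        cells[(int(x // cutoff), int(y // cutoff))].append(idx)
    return cells


def neighbor_pairs(positions, cutoff):
    """Pairs (i, j), i < j, closer than cutoff, found by looking only at
    the cell of i and the 8 cells around it."""
    cells = build_cells(positions, cutoff)
    pairs = set()
    for (cx, cy), members in cells.items():
        for dx, dy in product((-1, 0, 1), repeat=2):
            # .get() does not create missing cells, so the dict is not
            # modified while it is being iterated.
            for j in cells.get((cx + dx, cy + dy), ()):
                for i in members:
                    if i < j:
                        xi, yi = positions[i]
                        xj, yj = positions[j]
                        if (xi - xj) ** 2 + (yi - yj) ** 2 < cutoff ** 2:
                            pairs.add((i, j))
    return pairs


def brute_force_pairs(positions, cutoff):
    """Reference result that checks every pair of units."""
    n = len(positions)
    return {
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if (positions[i][0] - positions[j][0]) ** 2
        + (positions[i][1] - positions[j][1]) ** 2 < cutoff ** 2
    }
```

In a domain-decomposed run, each cell (or block of cells) would be owned by one process, which then exchanges data only with the owners of adjacent cells; the sketch shows only the neighbor search, not the communication.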

At the same time, the interaction function between a pair of units can generally be described by ordinary differential equations, and in many cases the pairwise interactions that a unit experiences simultaneously are superposable. That is, the interaction between each pair of units can be processed independently and in any order, and the results summed to obtain the new state of each unit. This additivity is well suited to multi-threaded parallelism within a process: the interactions of each unit with its neighbors can be assigned to a large number of different threads and computed without mutual interference, for example by multiple cores of one processor, and it is also particularly well suited to single-instruction multiple-data (SIMD) execution.
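The additivity described above can be sketched as follows. The pair law used here (a linear spring, `x_j - x_i`) is a hypothetical stand-in chosen only for illustration, and integer states are used so that sums are exact; with floating-point values, different summation orders can differ by rounding error.

```python
def pair_contribution(x_i, x_j):
    """Hypothetical pair law: unit j pulls unit i toward itself."""
    return x_j - x_i


def total_effect(states, i, order):
    """Sum the contributions of the other units on unit i, visited in `order`."""
    return sum(pair_contribution(states[i], states[j]) for j in order if j != i)


def total_effect_chunked(states, i, chunks):
    """Each chunk stands in for one thread: partial sums are formed
    independently and then added together."""
    partials = [total_effect(states, i, chunk) for chunk in chunks]
    return sum(partials)


states = [0, 3, 7, 12]
in_order = total_effect(states, 0, [0, 1, 2, 3])
reordered = total_effect(states, 0, [3, 1, 0, 2])
chunked = total_effect_chunked(states, 0, [[1], [2, 3]])
```

All three results coincide because each pair contribution depends only on the two units involved, which is the property that lets the work be split across threads or SIMD lanes.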

Based on the above, we can design a computing system with multi-level short-range connections in which, from top to bottom, the computing units go from complex to simple and from few to many, so that suitable mappings are established between computing units and model units, and between the interactions among model units and the connections among computing units. This exploits the performance of the computing hardware to the fullest and reduces unnecessary hardware overhead. In addition, with this approach and according to the physical stability conditions of the simulated object, computational errors can be corrected by the constraints that upper-level units impose on lower-level units, which safeguards the accuracy of the computation at the level of mechanism.

In Chinese patent applications 200510064799.1, 200710099551.8 and 200810057259.4, we addressed mainly the interactions between model units in the above approach, i.e. the so-called "particle methods", and proposed a general algorithm together with several design schemes for dedicated hardware systems. On this basis, Chinese invention patent application 200810224328.6 proposed a more general algorithm, a more complete variational multi-scale method, and an overall design scheme for its dedicated hardware system, focusing on how variational or extremum constraints are implemented and on the differentiated design and coupling of computing units at different levels. The present application proposes the key technologies for implementing it. To better explain the motivation, significance, technical solution and application prospects of the present invention, we first introduce the related simulation methods and computing technologies.

One) The variational multi-scale method

The variational multi-scale method referred to here is a method, developed by us, for simulating the evolution of complex systems in which stability conditions close the dynamic equations. Multi-scale methods in general fall into two types, descriptive and correlative. The former couples descriptions at different scales in space and/or time, for example simulating the behavior near a micro-crack in a material with molecular dynamics while describing the bulk behavior with the finite element method. The latter uses statistical correlations obtained from small-scale simulation as model equations in large-scale simulation; for example, a fluid equation of state and constitutive relations obtained from molecular dynamics simulation can serve as components of the hydrodynamic equations. Neither type, however, explicitly considers that under non-equilibrium conditions the large-scale behavior, being the collective behavior of small-scale units, can in fact exhibit emergent features that constrain the small-scale behavior. Moreover, unless a small-scale model is derived from first principles, it is explicitly or implicitly unclosed, as is the case for turbulence models.

The variational multi-scale method holds that, in a nonlinear non-equilibrium system, two or more dominant mechanisms must be at work. Although these mechanisms differ from system to system, each can be expressed as an extremum or variational condition. By analyzing the extremum or variational condition of each mechanism acting alone, and the rules by which these conditions are mutually coordinated, we can formulate the stability condition that the system must satisfy. This stability condition supplies exactly the closure that the dynamic description lacks. For multiphase flow, we have so far found stability conditions for the dynamic multi-scale structures in several systems, including gas-solid, gas-liquid and liquid-liquid systems. With these conditions, the dynamic equations obtained at each scale by scale decomposition can be closed and correlated across scales. The multi-scale coupled sub-element model can then be solved by mathematical multi-objective optimization. This makes it possible to understand the mechanisms by which multi-scale structures arise and evolve in space and time in complex systems, as well as laws such as abrupt transitions at critical conditions, and so to grasp the essence of how complex systems form and evolve.

Two) Discrete simulation methods

Discrete simulation methods offer an essential, general and simple way to describe the dynamic behavior of complex systems. Most of them discretize the simulated system into a large number of interacting particles and describe the behavior of each particle by dynamic calculation, so that the behavior of the system is reproduced directly or through statistics and combination; they are therefore also called many-particle methods. Typical examples include:

Molecular dynamics (MD). Atoms, atomic groups or molecules are simplified as particles that interact through potentials, rigid constraints and the like, in order to describe the microscopic behavior of molecules, molecular clusters and ultimately materials. MD is now widely used in the synthesis of chemicals, the study of biological macromolecules, the research, design and preparation of new materials, and the exploration of the nature of life. In a broad sense, simulations of nuclear radiation, such as neutron diffusion, can also be counted as molecular dynamics methods.

Discrete element method (distinct element method, DEM). For solid granular materials such as sand and gravel, grain and powders of various kinds, the interaction forces between the naturally discrete particles (such as the pressure and friction produced on contact, and the electrostatic forces that may also exist without contact) are calculated, and from them the trajectory of each particle; this is known as the discrete element method. It is now widely used in industrial processes, agricultural engineering, geology and hydrology.

Many-body dynamics (N-body dynamics). At cosmic scales, from celestial bodies and galaxies to galaxy clusters and even the whole universe, the discreteness of the world is also very evident: the objects at each scale can be regarded as the particles that make up the structure at the next larger scale. N-body dynamics tracks the trajectories and collective behavior of these huge "particles" by computing the gravitational forces between them, and is a mainstream tool of astrodynamics simulation. It provides a powerful means for exploring the formation and evolution of the universe and for the future space industry.

Coarse-grained models. Particle methods are not limited to systems that can intuitively be treated as collections of particles. In recent years, many particle methods built on coarse-grained or simplified model particles have also been proposed for phenomena traditionally simulated with continuum methods, such as fluid flow and material deformation. Examples are the mesoscopic dissipative particle dynamics (DPD) and lattice Boltzmann (LB) methods, and the macroscopic smoothed particle hydrodynamics (SPH) method. In terms of physical background, these model particles are roughly Lagrangian representations of a cluster of molecules or of a material element. These models overcome the problem that the computational cost necessarily grows with the number of natural particles in the system (an important reason for adopting continuum methods), and are particularly well suited to problems that challenge continuum methods, such as complex boundaries, multiphase media and large deformation. They are now widely used in fields including the design of ships, aircraft and vehicles; the research and design of nuclear weapons and reactors; energy, chemical engineering, water conservancy, geological exploration and development; and weather and ocean forecasting.

In fact, from a physical point of view, the explicit numerical schemes of many continuum models can also be regarded as discrete simulation models. The so-called agent models commonly used in simulating social, economic and similar systems can likewise be regarded as relatively complex discrete simulation models. The coverage of discrete simulation is therefore quite broad.

Three) Dedicated software and hardware systems for different discrete models

No general-purpose software or dedicated hardware for the variational multi-scale method has yet appeared, but dedicated software and hardware systems have been proposed internationally for various particle methods. An early example is U.S. patent US 4740894 (published 1988-04-26), which proposed a processing unit with multiple I/O ports; U.S. patent US 3970993 (published 1976-07-20) connects processing units in series through a unidirectional chaining channel so that data can be passed on to the next processing unit. Such units can be used to build processor arrays suited to some simple particle algorithms. A more typical example is U.S. patent US 5432718 (published 1995-07-11), which proposes a method and system that use combinational logic and a dual grid to compute particle motion on a regular lattice; it is very efficient for the corresponding lattice gas automata (LGA) and, with suitable modification, can also be applied to other lattice-based particle methods such as LBM. The limitations of processor arrays are nevertheless evident. In principle each processing unit can handle only the few operations predefined in its hardware, and cannot store and interpret instructions so as to run programs independently, so its generality is very poor.

In recent years, IBM and RIKEN (the Institute of Physical and Chemical Research, Japan) have cooperated in developing the MD-GRAPE family of chips, boards and special-purpose computers (MD for Molecular Dynamics, GRAPE for GRAvity PipE), dedicated to N-body problems and molecular dynamics simulation. They implement the typical inter-particle interaction algorithms of these problems as dedicated hardware pipelines; each chip has a large number of pipelines that operate in parallel, and each pipeline can process several inter-particle interactions in parallel. Later pipelines also used field-programmable gate array (FPGA) devices so that the hardware pipelines could be reconfigured for different inter-particle interaction algorithms, improving efficiency and generality.

The predecessor of the MD-GRAPE chip is the GRAPE chip, which won the IEEE Computer Society's Gordon Bell Prize for the fastest computer twice, in 1995 and 1996. The MD-GRAPE chip is optimized specifically for computing the interaction forces between particles in N-body problems and molecular dynamics simulation. It provides four parallel computing pipelines, each of which can compute the interaction forces of six particles simultaneously, and the chip can store the data of one million particles. Together with other special-purpose chips later designed by RIKEN, MD-GRAPE was used to build a 100 Tflops computer system dedicated to N-body problems and molecular dynamics simulation.

To make MD-GRAPE's dedicated N-body and molecular dynamics computing power convenient to use, an MD-GRAPE board was also designed that plugs directly into a computer expansion slot. The board uses a PCI interface and can be installed in machines ranging from a user's personal computer to an IBM RS6000SP high-performance computer. It integrates four MD-GRAPE chips, provides 64 Gflops of computing power on a PC or RISC workstation, and offers FORTRAN and C library interfaces to the user.

The second-generation MD-GRAPE chip integrates 9 million transistors and is built in IBM's 0.25 μm, 2.5 V process; it consumes 15 W at a clock frequency of 100 MHz and, again with four computing pipelines, provides 64 Gflops. The third-generation MD-GRAPE chip has 20 computing pipelines and provides 165 or 200 Gflops depending on the operating frequency. The MD3-PCIX board integrates two MD-GRAPE3 chips, delivers 330 Gflops, and can be plugged directly into a user's computer.

Besides chips and boards, there is also a dedicated MD-GRAPE-3 processing unit that integrates 12 MD-GRAPE3 chips in a 2U, 19-inch enclosure and connects through a dedicated 10 Gbit/s communication line to an interface card in a PCI-X slot of the host computer. Each MD-GRAPE-3 chip delivers 200 Gflops, and the peak performance of the complete computer reaches 1 Pflops at a power consumption of 300 kW, achieving high performance at low power.

QCDOC is a special-purpose chip developed by IBM for quantum chromodynamics (QCD) calculations, and several laboratories have used it to build QCDOC special-purpose computers.

The QCDOC chip is an application-specific integrated circuit (ASIC) developed around a PowerPC core. It contains a 500 MHz PowerPC 440 processor that provides 1 Gflops of 64-bit floating-point performance, and 4 MB of embedded memory (EDRAM) that holds code and data during standard lattice QCD calculations; the data path from the compute core to the EDRAM has a peak bandwidth of 8 GB/s. The ASIC also has DMA capability to move data automatically between the EDRAM and external memory, supports inter-node communication, and includes an Ethernet controller used for booting, diagnostics and the I/O network.

A QCDOC special-purpose computer is a parallel computer whose compute nodes are QCDOC chips. The nodes are interconnected as a six-dimensional torus mesh, and each node is linked to its 12 nearest neighbors at a full-duplex rate of 500 Mbit/s. The links use phase-locked receivers, provide single-bit error detection with automatic retransmission, and allow direct DMA access to the memory of neighboring nodes. The computer also has a 100 Mbit/s Fast Ethernet for booting, diagnostics and general-purpose I/O, and runs QOS, an operating system custom-designed for QCD calculations.
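Why a six-dimensional torus gives each node exactly 12 neighbors (two per dimension, with wrap-around at the edges) can be seen in a short Python sketch. The grid extents used below are hypothetical and are not the dimensions of any installed QCDOC machine.

```python
def torus_neighbors(coord, dims):
    """Nearest neighbors of `coord` on a periodic grid: one step forward
    and one step backward along every axis, wrapping at the edges."""
    neighbors = []
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            shifted = list(coord)
            shifted[axis] = (shifted[axis] + step) % size
            neighbors.append(tuple(shifted))
    return neighbors


dims = (4, 4, 4, 4, 4, 4)            # hypothetical extents, one per dimension
origin = (0, 0, 0, 0, 0, 0)
links = torus_neighbors(origin, dims)
```

The neighbor count depends only on the number of dimensions, not on the number of nodes, which is why this kind of short-range interconnect scales without the per-node wiring growing.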

Three QCDOC special-purpose computers are currently installed, at Columbia University (1,024 nodes), the RIKEN BNL Research Center (12,288 nodes) and the DOE Brookhaven National Laboratory; the latter two each reach 10 Tflops.

In 2007 Tilera released Tile64, a 64-core processor built in a 64 nm process with an operating frequency of 600 to 900 MHz. In existing multi-core processors the cores communicate mainly over a bus; with 16 or more cores, the bus data rate becomes a bottleneck and processor performance suffers severely. Tile64 has no central bus; its cores are linked directly to one another, which effectively avoids the speed bottleneck of existing processor architectures and allows operation at lower power. In addition, each Tile64 core is a fully functional processor that can run an operating system on its own.

Tilera's Multicore Development Environment (MDE) is the development environment for the Tile64 chip. Tilera currently offers two board models, TILExpress-64 and TILExpress-20G, which are aimed mainly at applications such as multimedia stream processing and network traffic inspection.

Four) Earlier work leading to the present invention

The systems above take the characteristics of particle methods into account in certain respects, adopting short-range connections (Tile64 and QCDOC) and dedicated stream computing (MD-GRAPE). However, none proposes a design for a general-purpose software and hardware system for discrete simulation, let alone considers the multi-scale discrete nature common to real complex systems and the variational constraints between simulation units at different levels so as to couple multi-scale algorithms with multi-level architectures.

In order to reduce significantly the computational cost of direct discrete simulation and widen the system's range of application at the level of the computational model and software, and to combine today's mainstream multi-threaded shared-memory and stream-computing parallel techniques with highly scalable short-range interconnects so that each compensates for the other's weaknesses, we summarized in Chinese patent applications 200510064799.1, 200710099551.8 and 200810057259.4 the following common features of discrete simulation and proposed corresponding computing system designs.

First, whether the model units considered are particles that exist in nature, artificially constructed model particles, or even rather complex agents, the strength of the interaction between them generally decreases rapidly with increasing distance (or some logical distance). Interactions between physical particles arise in essence from just four fundamental forces (in fact three or fewer). The strengths of gravitation and of the electromagnetic force are inversely proportional to the square of the inter-particle distance, and the strong and weak interactions decay faster still. Interactions between sufficiently distant particles can therefore generally be neglected, or the force calculation for each pair of particles can be replaced by an estimate of the resultant force of many particles. This gives rise to locality: although the whole system may contain arbitrarily many model units, the instantaneous motion of any model unit is determined mainly by a very small number of neighboring model units.

At the same time, the interaction function between a pair of model units can generally be described by one or a set of algebraic or ordinary differential equations, and the pairwise interactions that a model unit experiences simultaneously are superposable. That is, the interaction between each pair of units can be processed independently and in any order, and the total effect on a unit obtained by simple summation. The treatment is not quite so simple for hard-sphere particles, for composite particles formed from several particles under constraints (such as chain macromolecules), or for some agents in social and economic systems; but at a slightly larger scale, for example for a composite particle as a whole, the algorithm still has this property in general.

Can find that thus the discrete analog method has general application, and can significantly optimize to the design of hardware and software of these class methods.Be but that mode of action Modularly between various model units embeds in the general overall algorithm and data structure; And through space partition zone, the discrete analog method almost can obtain linear speed-up ratio, and each computing unit of hardware system can only provide memory shared or message transmission to specific only a few neighborhood calculation unit, expansion on a large scale quite easily; The complicacy and the scale of computing unit can reduce (as having only buffer memory, not having main memory) greatly simultaneously, thereby improve the ratio that is in the components and parts in the calculating operation, promptly improve its service efficiency, reduce cost.Compare with general general high-performance computer, though dwindle to some extent to the hardware system range of application of this Frame Design, but still have a large amount of demands.And the influence that the benefit that raising produced of the reduction of hardware cost and efficient will cause considerably beyond the former.Therefore this type systematic will have boundless prospect.

In line with these features, Chinese invention patent application 200510064799.1 proposed a scalable parallel architecture comprising multiple computing units and multiple storage units, each forming an array, in which every storage unit is connected to several adjacent computing units and every computing unit to several adjacent storage units; its local shared-memory model reflects the short-range and superposable nature of particle interactions. Chinese invention patent application 200710099551.8 provided a multi-layer directly connected cluster parallel computing system for particle simulation. That system consists of multiple computing units arranged in one or more layers of one- or multi-dimensional arrays; adjacent computing units in the same layer are directly connected for communication, and computing units in different layers communicate through switches, so that while accounting for short-range interactions it also gives preliminary consideration to the multi-scale nature of inter-particle interactions. Chinese invention patent application 200810057259.4 proposed a concrete design for the computing units in such a cluster parallel system, in particular concrete hardware and software solutions for coupling, within a computing unit, stream processing chips that are massively parallel in shared-memory mode (such as graphics processing units, GPUs) with general-purpose chips that offer small-scale multi-threaded parallelism (such as central processing units, CPUs). Chinese invention patent application 200810224328.6 considered dedicated computing systems for multi-scale discrete simulation in a more general sense, and proposed using the variational multi-scale method to constrain the interactions between model units, thereby simplifying the computation and improving efficiency. The present invention proposes the technology for implementing such dedicated computing systems.

Summary of the invention

One) Technical problem to be solved

The present invention provides a computing software and hardware architecture for multi-scale complex systems, based on the variational multi-scale method, together with its concrete implementation, so as to make the simulation of complex systems efficient and feasible.

Two) Technical solution

To achieve the above object, the invention provides a general algorithmic framework for simulating complex systems and a design for its computer software and hardware system.

The invention first proposes a system modeling technique based on multi-scale decomposition, in which the simulated system is discretized into three layers of subsystems: in the bottom-layer subsystem, adjacent units of the same layer interact; in the middle-layer subsystem, non-adjacent units also interact, and each unit additionally interacts with several units of the lower layer; and in the top-layer subsystem, each unit interacts with several units of the middle layer.
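The interaction pattern of the three layers can be made concrete with a small Python sketch. It assumes a hypothetical one-dimensional arrangement in which each middle-layer unit owns a contiguous block of bottom-layer units and middle-layer units are dealt out to top-layer units round-robin; these choices, and the name `build_hierarchy`, are illustrative and are not specified in the patent.

```python
def build_hierarchy(n_bottom, group_size, n_top):
    """Return the four interaction maps of a hypothetical 1D three-layer model."""
    n_middle = n_bottom // group_size
    # Bottom layer: each unit interacts only with adjacent units of its own layer.
    bottom_neighbors = {
        i: [j for j in (i - 1, i + 1) if 0 <= j < n_bottom]
        for i in range(n_bottom)
    }
    # Middle layer: non-adjacent units interact too, so every other unit is a peer.
    middle_peers = {
        m: [k for k in range(n_middle) if k != m] for m in range(n_middle)
    }
    # Middle layer also interacts with several units of the layer below.
    middle_children = {
        m: list(range(m * group_size, (m + 1) * group_size))
        for m in range(n_middle)
    }
    # Top layer: each unit interacts with several middle-layer units.
    top_children = {
        t: [m for m in range(n_middle) if m % n_top == t] for t in range(n_top)
    }
    return bottom_neighbors, middle_peers, middle_children, top_children


bottom_nb, mid_peers, mid_children, top_children = build_hierarchy(8, 4, 1)
```

With 8 bottom units in groups of 4, this yields two middle units that see each other and four bottom units each, and one top unit that sees both middle units.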

The interactions between these units proceed simultaneously: not only are the interactions between model units of the same layer processed in parallel, but the interactions between units of adjacent layers are processed in parallel at the same time, thereby achieving multi-scale parallel computation.

The units of the bottom-layer subsystem can be arranged in three ways: on the points of a static regular grid, on the points of a static irregular grid, or as movable particles.

The interactions between units of the bottom-layer subsystem are parallel: the interactions between any pairs of bottom-layer units are processed simultaneously, and the contribution of another unit to the state change of the unit being processed is computed from the state of the unit being processed and part or all of the state information of the other unit interacting with it.

The interactions between units of the bottom-layer subsystem are superposable: the total contribution of bottom-layer units to the state change of a unit being processed is a function of the individual contributions of those units to that state change.

The interactions between units of the bottom-layer subsystem are short-range: each such unit interacts only with neighboring units whose number does not exceed a specified upper limit.
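A bounded neighbor list of this kind might be built as in the following Python sketch. The text only requires that the number of interacting neighbors not exceed an upper limit; keeping the nearest ones within a cutoff, as done here, is an assumption made for illustration, as are the one-dimensional positions and the name `capped_neighbors`.

```python
def capped_neighbors(positions, i, cutoff, max_neighbors):
    """Indices of at most `max_neighbors` units within `cutoff` of unit i,
    nearest first (1D positions; ties are broken by index)."""
    candidates = [
        (abs(positions[j] - positions[i]), j)
        for j in range(len(positions))
        if j != i and abs(positions[j] - positions[i]) <= cutoff
    ]
    candidates.sort()
    return [j for _, j in candidates[:max_neighbors]]


positions = [0.0, 0.5, 1.0, 1.5, 9.0]
nearest_two = capped_neighbors(positions, 0, cutoff=2.0, max_neighbors=2)
```

A fixed upper limit means every unit needs the same, small amount of storage and wiring for its neighbors, which is what allows the bottom-layer hardware to stay simple.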

The interactions between units of the middle-layer subsystem include many-body interaction, in which every unit contributes to the state change of every other unit, and global optimization, in which the states of all units are compared and the unit that satisfies a given criterion is selected.
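The two kinds of middle-layer interaction can be sketched in Python as follows. The pair law passed in and the use of `min` or `max` as the selection criterion are hypothetical examples; the patent does not specify either.

```python
def many_body_changes(states, pair_contribution):
    """All-to-all interaction: every unit contributes to every other unit."""
    return [
        sum(pair_contribution(s_i, s_j) for j, s_j in enumerate(states) if j != i)
        for i, s_i in enumerate(states)
    ]


def select_by_criterion(states, criterion=min):
    """Global optimization: compare all states and return the index of the
    first unit whose state satisfies the criterion."""
    return states.index(criterion(states))


states = [4, 1, 7]
changes = many_body_changes(states, lambda a, b: b - a)
lowest = select_by_criterion(states)          # criterion assumed: smallest state
highest = select_by_criterion(states, max)    # criterion assumed: largest state
```

Both operations need information from all middle-layer units, which is consistent with the middle layer being much smaller than the bottom layer and served by general-purpose processors.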

The interaction between units of the middle-layer subsystem and units of the bottom-layer subsystem is a constraint-feedback mechanism: each bottom-layer unit directly connected to a middle-layer unit passes part or all of its state information to that middle-layer unit; the total contribution of the bottom-layer units to the state change of that middle-layer unit is a function of this information; and the contribution of the middle-layer unit to the state change of each bottom-layer unit is a function of the middle-layer unit's state and of its change.
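One exchange of this constraint-feedback mechanism might look like the Python sketch below. The specific update rules are assumptions chosen for illustration: the middle-layer unit relaxes toward the mean of the reported bottom-layer states with a hypothetical `damping` factor, and the bottom-layer units are then shifted against the middle-layer unit's change. The patent states only which quantities each contribution depends on, not these formulas.

```python
def constraint_feedback_step(bottom_states, middle_state, damping=0.5):
    """One upward report followed by one downward correction."""
    # Upward: bottom units report their states; the change of the middle
    # unit is a function of the reported information (here, their mean).
    mean = sum(bottom_states) / len(bottom_states)
    middle_change = damping * (mean - middle_state)
    new_middle = middle_state + middle_change

    # Downward: the correction applied to every bottom unit is a function
    # of the middle unit's state and its change (this assumed rule happens
    # to use only the change).
    correction = -middle_change
    new_bottom = [s + correction for s in bottom_states]
    return new_bottom, new_middle, middle_change


new_bottom, new_middle, middle_change = constraint_feedback_step([2, 4, 6], 2)
```

With `damping=0.5` the two layers meet halfway: the bottom-layer mean and the middle-layer state coincide after a single exchange, which illustrates the two-way nature of the coupling.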

For the above modeling technique, the present invention proposes an overall design of a concurrent computational system. The system comprises three layers of computing equipment. The top layer consists of management-and-computing nodes, coupled by shared storage or by fully interconnected high-speed communication. The middle layer consists of computing nodes arranged in a regular array, with direct communication links or shared storage between adjacent middle-layer nodes. The bottom layer consists of dedicated accelerators arranged inside the computing nodes.

The top-layer nodes of said computing system contain one or more general-purpose processors and no, or only a few, dedicated accelerators; the middle-layer nodes contain one or more general-purpose processors and one or more dedicated accelerators; the bottom-layer nodes contain one or more dedicated accelerators and no, or only a few, general-purpose processors.

The middle-layer and bottom-layer nodes of said computing system are connected to the top-layer nodes through a global switch, or are first connected through a local switch to the global switch and then to the top-layer nodes.

A processor in said computing system is a logical structural unit with independent computing and communication functions, comprising a chip, a chipset, a programmable gate array chip, a circuit board, or any combination of one or more of these.

Said general-purpose processor is a general central processing unit, or another processor that can handle various complex instructions such as branch decisions and that has a large cache.

Said dedicated accelerator is a special-purpose processor suited to a certain class of problems, including but not limited to single-instruction-multiple-data stream processors, digital signal processors and programmable gate array processors. A special-purpose processor may have no memory, or may share only a small amount of memory, but it can deliver its results to a general-purpose processor or to other connected special-purpose processors.

Said dedicated accelerator contains a large number of arithmetic units and registers, and a number of instruction control units; said arithmetic units include but are not limited to multipliers, adders and special-function solvers, each kind comprising scalar arithmetic units and vector arithmetic units of different word lengths.

The arithmetic units in said dedicated accelerator comprise scalar arithmetic units and vector arithmetic units of different word lengths; scalar units can be programmably combined into vector units, and vector units of shorter word length can be combined into vector units of longer word length.

In said dedicated accelerator, arithmetic units of the same function and word length are divided into several groups, each arranged together, and groups of different function and word length are arranged alternately; said registers and instruction control units are distributed between or near the arithmetic units.

In said dedicated accelerator, there are programmable, reconfigurable connections between the arithmetic units and between them and the registers and instruction control units; these connections are established by the instruction control units according to the programs and instructions received.

The arithmetic units in said dedicated accelerator run in a data-driven manner: as soon as a unit receives its specified data, it starts its corresponding arithmetic operation and outputs the result to a specified subsequent arithmetic unit or register; the subsequent arithmetic unit then starts immediately, and so on.
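The data-driven mode of operation can be sketched in software as a small token-passing simulation. This is an illustrative model only, not a description of the actual hardware: the `Unit` class, the token format `(destination, slot, value)`, and the rule that any destination name without a unit is treated as a register are all assumptions.

```python
from collections import deque


class Unit:
    """A hypothetical arithmetic unit that fires once all operands arrive."""

    def __init__(self, op, n_inputs, targets):
        self.op = op                # the arithmetic operation to perform
        self.n_inputs = n_inputs    # number of operands required to fire
        self.targets = targets      # list of (destination name, input slot)
        self.slots = {}             # operands received so far


def run(units, tokens):
    """Deliver tokens until none remain; return the register contents."""
    registers = {}
    queue = deque(tokens)
    while queue:
        name, slot, value = queue.popleft()
        if name not in units:       # destination is a register, not a unit
            registers[name] = value
            continue
        unit = units[name]
        unit.slots[slot] = value
        if len(unit.slots) == unit.n_inputs:    # all data present: fire
            args = [unit.slots[k] for k in range(unit.n_inputs)]
            unit.slots = {}
            result = unit.op(*args)
            for target, target_slot in unit.targets:
                queue.append((target, target_slot, result))
    return registers


def multiply_add(a, b, c):
    """Wire a multiplier into an adder and compute a * b + c."""
    units = {
        "mul": Unit(lambda x, y: x * y, 2, [("add", 0)]),
        "add": Unit(lambda x, y: x + y, 2, [("r0", 0)]),
    }
    tokens = [("mul", 0, a), ("mul", 1, b), ("add", 1, c)]
    return run(units, tokens)["r0"]


if __name__ == "__main__":
    print(multiply_add(3, 4, 5))
```

No instruction stream sequences the two operations in this sketch; the adder starts only because the multiplier's result has arrived, which mirrors the saving in instruction control that the data-driven mode is meant to provide.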

The array of nodes in said computing system is an arbitrarily extensible array, or an array formed by any repeatable arrangement, including at least arrays formed from rectangles or cuboids, triangles or tetrahedra, and hexagons or tetrakaidecahedra; the edges of said array are either open or connected to the corresponding opposite edges.
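For the rectangular case, the difference between open edges and edges connected to their opposite edges can be shown with a short Python sketch. The function name `neighbors`, the (row, column) indexing and the four-neighbor stencil are assumptions made for illustration.

```python
def neighbors(row, col, rows, cols, periodic=False):
    """Return the directly linked neighbors of node (row, col).

    With periodic=False the array edges are open, so edge nodes have
    fewer neighbors. With periodic=True each edge is connected to the
    opposite edge, so every node has four neighbors.
    """
    result = []
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r, c = row + dr, col + dc
        if periodic:
            result.append((r % rows, c % cols))     # wrap around
        elif 0 <= r < rows and 0 <= c < cols:
            result.append((r, c))                   # drop out-of-range links
    return result


if __name__ == "__main__":
    print(neighbors(0, 0, 3, 4))
    print(neighbors(0, 0, 3, 4, periodic=True))
```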

A switch in said computing system is any communication switching mechanism, with multiple inputs and a single output or with multiple inputs and multiple outputs, that supports said communication links, and includes cascades and stacks of such switches.

Based on the above modeling technique and computing system, the present invention also proposes a corresponding simulation technique. When a model built with said modeling technique is executed on said computing system, the topological relationship between units in the model corresponds to the topological relationship between nodes in said concurrent computational system. That is, the state variables of one or more bottom-layer units are stored in one bottom-layer dedicated accelerator, and the state variables of neighboring units are stored in neighboring nodes or in the same node; the state variables of the higher-layer unit corresponding to said bottom-layer units are stored in the computing node corresponding to said bottom-layer dedicated accelerator; the main computation of a unit's state change is carried out in the node that stores its state variables; and the results of a lower-layer node can be sent directly to a higher-layer node and applied immediately.

The simulation runs in parallel not only among nodes of the same layer but also among nodes of different layers, thereby forming multi-scale parallel computation.

While a parallel program is running, each layer of nodes keeps a number of redundant nodes that run independent programs, and the run-time state of the parallel program is either stored centrally or backed up, node by node, on other nodes. A top-layer node runs a node-monitoring program; when this program finds that a node executing the parallel program has failed, the system loads the backup data of that node onto a node running an independent program, stops the independent program, and then continues executing the parallel program.
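The recovery step can be sketched as a pure bookkeeping function in Python. Everything here is hypothetical scaffolding: the node names, the `assignment`, `spares` and `backups` structures, and the policy of taking the first available redundant node are assumptions, and failure detection itself is outside the sketch.

```python
def recover(assignment, spares, backups, failed):
    """Move the work of a failed node onto a redundant node.

    assignment: dict mapping node name -> partition of the parallel program
    spares:     list of redundant nodes currently running independent programs
    backups:    dict mapping partition -> backed-up run-time state
    failed:     name of the node reported as failed by the monitor
    Returns the new assignment, the remaining spares, and the restored state.
    """
    if failed not in assignment:
        raise KeyError(f"node {failed!r} is not running the parallel program")
    if not spares:
        raise RuntimeError("no redundant node available")
    partition = assignment[failed]
    replacement = spares[0]     # its independent program would be stopped here
    new_assignment = {n: p for n, p in assignment.items() if n != failed}
    new_assignment[replacement] = partition
    restored_state = backups[partition]
    return new_assignment, spares[1:], restored_state


if __name__ == "__main__":
    print(recover({"n0": 0, "n1": 1}, ["s0", "s1"], {0: [1.0], 1: [2.0]}, "n1"))
```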

Without changing the overall layout of the system, the topological relationship between said model units and said computing nodes can be adjusted flexibly according to the requirements of the computation at hand.

In said concurrent computational system, the top-layer management-and-computing nodes mainly run the operating system and software modules for monitoring, analyzing and scheduling the computational load of each node, statistically analyzing results, judging convergence of the computation, and controlling the process; the middle-layer computing nodes mainly compute long-range correlated interactions between units, global iteration and optimization algorithms; the bottom-layer dedicated accelerators mainly compute short-range interactions between units and the state changes and analysis of the units themselves.

In said concurrent computational system, the general-purpose processors mainly perform program flow control, branch decisions, searching, and complex logical and floating-point operations, while the dedicated accelerators mainly perform simple logical and floating-point operations that can be massively parallelized.

While computing, said parallel computing nodes also store, process and visualize the simulation data in parallel; visualization operates directly on the computed data in main memory or video memory, and a display device can be connected directly to show the results of that node.

The interactions between said units are merged into several groups; the units in each group have different states but interact in the same way, and their states are stored and operated on together as vectors in the dedicated accelerator.

The computation of interactions between said units and the transfer of information between units can proceed independently and asynchronously; the information transferred may include redundant information not needed for the computation at the current time; several computational time steps may correspond to one information transfer, or one computational time step may correspond to several information transfers.

(3) Beneficial effects

1. As can be seen from the above technical solution, the computing software and hardware system provided by the present invention is applicable to the simulation of a large number of different complex systems. From systems as large as celestial bodies and galaxies to systems as small as molecules and atoms, the essential characteristic is a multi-scale discrete structure. The invention therefore widens the range of application of existing dedicated discrete simulation systems and has strong generality.

2. By providing algorithms and computing hardware suited to the simulation of discrete structures, the present invention simplifies the design of high-performance computing systems, improves their scalability, fundamentally removes the upper limit that the computing capability of individual components places on overall performance, reduces manufacturing cost, operating cost and energy consumption, and at the same time improves the actual efficiency of the simulation system.

3. The combination of the two aspects above gives the computing system provided by the present invention both a distinct specialization and good generality, offering a good approach to resolving the major conflict between generality and computational efficiency in high-performance computing. In addition, the proposed multi-level reconfigurable hardware structure can make full use of component functions under different conditions, which further strengthens this advantage.

Description of drawings

Fig. 1 is an overall structural diagram of the hardware system provided by the present invention, in which 11 is a management-and-computing node, 12 is a switch or a cascade or stack of switches, 13 is a computing node, 14 is a network connection, and 15 is a simulated region corresponding to multiple search-grid cells;

Fig. 2 is a diagram of the main structure of a computing node provided by the present invention, in which 21 is a communication controller, 22 is a communication link to a neighboring node, 23 is a central processing unit, 24 is memory and a memory controller, 25 is a dedicated accelerator, 26 is the computational region corresponding to one search-grid cell, 27 is a memory access channel, and 28 is an intra-node communication link;

Fig. 3 is a basic structural diagram of the dedicated accelerator provided by the present invention, in which 31 is instruction generation and data-flow control, 32 is a data interface, 33 is a register array, 34 is arithmetic unit group A, 35 is arithmetic unit group D, 36 is the word length of arithmetic unit group A, 37 is the word length of arithmetic unit group D, and 38 is an illustration of the computation of the interactions within one group of units;

Fig. 4 is a schematic diagram of solving an optimization problem on the concurrent computational system, in which 41 is the free-parameter space, 42 is the stability criterion, 43 is an extremum of the stability criterion, 44 is a system of equations, 45 is a batch of coordinate points and their stability-criterion values, 46 is a new batch of coordinate points and their stability-criterion values, and 47 is a root of a nonlinear equation;

Fig. 5 is one implementation of a computing node, in which 51 is the external PCI-E bus, 52 is a north bridge chip, and 53 is a south bridge chip;

Fig. 6 is another implementation of a computing node, in which 61 is a local bus, 62 is local bus control, 63 is a dedicated network adapter, and 64 is a global bus;

Fig. 7 (a-c) shows three ways of reconfiguring the arithmetic units in the dedicated accelerator.

Detailed description of the embodiments

To make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below in conjunction with specific embodiments and with reference to the accompanying drawings.

(1) Implementation of the algorithm framework

In a multi-scale discrete system, an upper-layer unit in principle constrains and influences the lower-layer units that compose it, so that their behavior differs from that of fully independent individuals; for example, individuals in human society are led and managed by organizations and are bound by their systems and regulations. Any simulation that reflects this reality therefore naturally needs a feedback-constraint mechanism between lower-layer and upper-layer units. Neighboring upper-layer units also interact with one another, as in the cooperation and competition between organizations or between countries in human society. Moreover, this relationship can be nested, giving multiple levels of upper-lower unit relationships. In such cases, the multi-scale modeling algorithm framework described above is the most reasonable and natural description. In general, the higher the layer, the fewer its units but the more complex their interactions, so upper-layer units can be simulated mainly with a small number of general central processing units, while the large number of relatively simple interactions between lower-layer units are simulated with a large number of dedicated accelerators. Several typical cases are discussed below:

1) Search in an optimization function space. According to the variational multi-scale principle, in many physical systems this constraint mechanism appears as an overall stability condition that the lower-layer model units must satisfy, which is expressed mathematically as a multi-objective variational problem. In computer simulation, this multi-objective variational problem can ultimately be converted into an optimization problem with a finite number of degrees of freedom, which can be solved rapidly with the algorithm framework and computer hardware of the present invention. As shown in Fig. 4, the algorithm in general seeks the extremum (43) of the stability criterion (42) in a free-parameter space (41) of some dimension, where the value of the criterion at each coordinate point (45) in this space is obtained by solving a system of equations.

In this case, specific computation steps can be designed according to the form of the criterion function, and the arithmetic units of different functions in the dedicated accelerator (25) (such as 34 and 35) can accordingly be connected into multiple groups of processing pipelines which, under the control and scheduling of the central processing unit, compute the criterion values (45) at a large number of coordinate points simultaneously in a data-driven manner. The results from these units are gathered at the central processing unit (23), which can run more complex algorithms, such as simulated annealing or genetic algorithms, to search for the extremum and produce a new batch of coordinate points (46) to be computed; iterating in this way approaches the extremum step by step. The search for the roots (47) of the nonlinear equations can likewise be carried out by a large number of arithmetic units forming simultaneous pipelines. Overall task allocation, load balancing, data input and output and similar operations can be carried out on the management nodes. This computing pattern also suits a large class of applications that follow a "massive independent computation, analysis and judgment, recomputation" pattern, such as prediction of the stable structures of proteins and complex multiphase media, prediction of returns on financial products, and graphics and image processing.
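The "evaluate a batch, judge, propose a new batch" loop can be sketched in Python as follows. This is a simplified stand-in, not the patented method: the text names simulated annealing and genetic algorithms for the central processing unit's step, whereas this sketch substitutes a deterministic shrinking-grid refinement so that its behavior is easy to verify. The function names, the grid size `per_axis`, and the halving of `radius` are assumptions, and `evaluate_batch` runs serially where the described system would evaluate all points concurrently on accelerator pipelines.

```python
import itertools


def evaluate_batch(criterion, points):
    """Stand-in for the accelerator pipelines: one criterion value per point."""
    return [criterion(p) for p in points]


def propose_batch(center, radius, per_axis=5):
    """Central-processor step: a grid of new coordinate points around center."""
    step = 2 * radius / (per_axis - 1)
    axes = [[c - radius + k * step for k in range(per_axis)] for c in center]
    return list(itertools.product(*axes))


def search_minimum(criterion, start, radius=4.0, rounds=30):
    """Iterate batch evaluation and refinement toward a minimum."""
    best = tuple(start)
    best_value = criterion(best)
    for _ in range(rounds):
        points = propose_batch(best, radius)
        values = evaluate_batch(criterion, points)
        value, point = min(zip(values, points))
        if value < best_value:          # analysis and judgment
            best, best_value = point, value
        radius *= 0.5                   # narrow the next batch
    return best, best_value


if __name__ == "__main__":
    def stability(p):
        # Hypothetical criterion with its minimum at (1, -2).
        return (p[0] - 1.0) ** 2 + (p[1] + 2.0) ** 2

    print(search_minimum(stability, (0.0, 0.0)))
```

The evaluations within one batch are independent of each other, which is the property that lets them be spread across many pipelines; only the comparatively cheap selection step is sequential.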

2) Discrete unit simulation. The simulation technique provided by the present invention mainly handles the interactions among a large number of discrete units, or converts a continuous system into such interactions. Most of the computation consists of superposable pairwise interactions between neighboring units. In some cases, or for certain simulation purposes, long-range interactions cannot be neglected even though the forces between distant particles are small, and they must be handled in some simplified way. In essence, most such simplifications consider the collective effect of many pairwise inter-particle interactions. On the other hand, even where long-range interactions can be neglected, it may be desirable, in order to reduce the amount of computation, to compute directly only as few neighbor interactions as possible and to account for the rest only through their collective effect. Therefore, in a multi-scale discrete simulation system, neighbor interactions between units are computed mainly by the dedicated accelerators, while the collective effects of long-range interactions are computed mainly by the upper-layer nodes and/or the general-purpose processors.

Specifically, as shown in Fig. 1, the simulation of the whole discrete system corresponds to the complete hardware system. The management-and-computing nodes (11) are responsible for data input and output, scheduling of the computational load, and computation of a small amount of long-range interaction or global constraints; each computing node (13) is responsible for computing the discrete units of its own region or part (Fig. 2). Within a node, the central processing unit decomposes the computing task into a large number of parallel pipeline operations; for example, the superposable pairwise interactions between units can be processed as pipelines of vector multiply-add and similar steps (as shown in Figs. 3 and 5). Parts that are hard to decompose finely, such as those with many jumps and logical operations (which often correspond to complex many-body or long-range interactions), are computed directly by the central processing unit. In this mode, some redundant or meaningless computation is sometimes performed in order to improve overall efficiency. For example, after the states of multiple units are packed into one vector for processing, the same operation is also carried out for pairs of particles that do not interact, and their results are simply not used.
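The trade of redundant work for uniform vector processing can be illustrated in Python. The sketch below is an assumption-laden model: Python lists stand in for hardware vector lanes, `width` is a hypothetical vector length, and the linear pair law is a toy example. The point is only that every lane executes the same operation and a mask decides afterwards which results are kept.

```python
def masked_pair_sum(xi, neighbors, cutoff, width=4):
    """Sum the pair contributions on unit xi, one fixed-width vector at a time.

    Every lane runs the same arithmetic, including padded lanes and lanes
    whose pair lies outside the cutoff; those results are computed and
    then discarded through the mask.
    """
    total = 0.0
    for start in range(0, len(neighbors), width):
        lane = neighbors[start:start + width]
        pad = width - len(lane)
        valid = [True] * len(lane) + [False] * pad
        lane = lane + [xi] * pad                    # fill the unused lanes
        r = [xj - xi for xj in lane]                # same operation per lane
        contribution = list(r)                      # toy pair law
        mask = [v and abs(rv) < cutoff for v, rv in zip(valid, r)]
        total += sum(c if m else 0.0 for c, m in zip(contribution, mask))
    return total


if __name__ == "__main__":
    print(masked_pair_sum(0.0, [1.0, -0.5, 3.0, 0.25, 2.0], cutoff=1.5))
```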

3) Numerical solution of continuum models. A numerical solution ultimately appears as an "interaction", that is, a numerical dependence, among a set of discrete grid points. In many cases, chiefly under explicit schemes, a grid with this dependence can also be regarded as a special system of stationary particles, so these computations can likewise be carried out on the dedicated accelerators. Explicit numerical algorithms have traditionally been inferior in stability and/or accuracy and less widely used than implicit schemes; variational multi-scale simulation, however, can effectively improve their accuracy and stability by introducing global constraints, and these constraints can be computed mainly on the central processing unit.

(2) Hardware design example

Figs. 1 to 3 illustrate a typical implementation of the parallel hardware system as claimed in claim 2. Its overall structure is shown in Fig. 1: the computing nodes (13) are arranged in a two-dimensional matrix, and adjacent computing nodes are joined by grid-like communication links that wrap around cyclically. A three-dimensional matrix or other array forms, such as triangular, hexagonal or diamond-lattice arrays, can of course also be used.

At the same time, each computing node is connected through a switch, or a cascade or stack of switches, to a number of management-and-computing nodes (11). The management-and-computing nodes are responsible for task scheduling, data input and output, post-processing and statistical analysis for the whole system, and for processing and converting the data exchanged between distant computing nodes.

This overall structure corresponds to the computation of a system of discrete units. Typically, such a system distributes its computing tasks to the nodes by spatial decomposition; the decomposition matches the arrangement of the nodes, and each computing node handles the units in its corresponding region. To make it easier to find interacting pairs of units, a search grid is introduced into the computation whose cell edge length is generally greater than the range of the interaction concerned, so that when processing the interactions of the units in one cell, only the units in that cell or in adjacent cells need to be considered. In general, one computing node corresponds to a region containing multiple search-grid cells.
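The search grid described above corresponds to what is commonly called a cell list. A minimal two-dimensional Python sketch is given below; the function names, the choice of a cell edge exactly equal to the `cutoff`, and the use of plain coordinate tuples are assumptions for illustration.

```python
import math
from collections import defaultdict


def build_grid(points, cell):
    """Assign each unit index to the search-grid cell containing it."""
    grid = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        grid[(math.floor(x / cell), math.floor(y / cell))].append(idx)
    return grid


def interacting_pairs(points, cutoff):
    """Find all pairs closer than cutoff by scanning adjacent cells only.

    The cell edge is set to the cutoff, so two interacting units always
    lie in the same cell or in neighboring cells.
    """
    grid = build_grid(points, cutoff)
    pairs = set()
    for (cx, cy), members in grid.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for j in grid.get((cx + dx, cy + dy), ()):
                    for i in members:
                        if i < j and math.dist(points[i], points[j]) < cutoff:
                            pairs.add((i, j))
    return pairs


if __name__ == "__main__":
    pts = [(0.1, 0.1), (0.9, 0.2), (1.3, 0.1), (3.0, 3.0), (3.4, 3.3)]
    print(sorted(interacting_pairs(pts, 1.0)))
```

Compared with testing every pair of units, this restricts the distance checks to units in at most nine cells per unit in two dimensions, which is why the cell edge is tied to the interaction range.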

The management-and-computing nodes described above can be existing servers, and the main structure of a computing node is shown in Fig. 2. The central processing unit controls the node: it receives and generates instructions and data, sends them to other nodes, distributes them to the dedicated accelerators, and may even control the reconfiguration of the arithmetic units within the dedicated accelerators. It also carries out the complex parts of the computation that are not suited to the dedicated accelerators. To this end, a computing node can be equipped with two communication controllers (21), one controlling communication with neighboring nodes and the other controlling communication with the dedicated accelerators and upper-layer nodes; in a concrete implementation the two controllers may be merged into one. Furthermore, where technical conditions permit, the central processing unit and the dedicated accelerators can share memory and a memory controller, so that a large amount of asynchronously shared data need not be transferred by communication but is passed implicitly through time-shared access.

One implementation of a computing node is shown in Fig. 5. It can be based on current commercial server or workstation architectures: the north bridge chip (52) is connected to the central processing unit, the memory, the PCI-E bus and the south bridge chip (53); network adapters such as InfiniBand and Ethernet cards are attached to the south bridge; and multiple dedicated accelerator cards are attached to the PCI-E bus. This approach is simple and practical, but it provides no dedicated memory access channel between the central processing unit and the dedicated accelerators, and at the speed of current PCI-E buses the number of accelerators that can be attached is quite limited. Another approach is to design a communication control chip and motherboard specifically for the computing node. As shown in Fig. 6, a dedicated network adapter (63) connected to the global bus (64) provides communication between nodes, while the local bus (61) and local bus control (62) connect the dedicated accelerators with the memory and memory controller, providing direct pairwise communication among the central processing unit, the dedicated accelerators and the memory.

Said dedicated accelerator is a data-flow-driven processor composed mainly of vector arithmetic units whose interconnections are reconfigurable. Its basic composition is shown in Fig. 3.

The main body of said dedicated accelerator is a large number of arithmetic units and registers. The arithmetic units include but are not limited to multipliers, adders and special-function solvers. Each kind of arithmetic unit in turn comprises several groups of different word lengths, which may be scalar or vector units. The numbers and arrangement of arithmetic units of the various kinds and word lengths are determined by the characteristics of the application, so that all units are used as fully and efficiently as possible. The registers are distributed near the arithmetic units.

Said dedicated accelerator contains an instruction generation and data-flow controller, which is connected on one side to the intra-node communication link and the memory access channel, and on the other side has a data interface to the arithmetic units. There are programmable, reconfigurable connections between the arithmetic units and between them and the registers and instruction control units; the controller sets up different computing configurations (as illustrated in Fig. 7 a-c) according to the programs and instructions received, and can modify them dynamically during program execution. In this way the dedicated accelerator can adapt flexibly to different problems while maintaining high density and high speed in the low-level operations. Compared with a general FPGA chip, this design is a coarse-grained form of reconfigurability and can support a higher clock frequency; compared with a conventional vector processor, it adds flexibility and can handle more, and more complex, applications.

Another key design feature of said dedicated accelerator is that the arithmetic units run in a data-driven manner: as soon as a unit receives its specified data, it starts its corresponding arithmetic operation and outputs the result to a specified subsequent arithmetic unit or register; the subsequent unit then starts immediately, and so on. This mode of operation matches the reconfigurable organization of the arithmetic units, saves a large number of instruction control operations, and significantly improves computational efficiency.

For the simulation of discrete units, said dedicated accelerator mainly handles computations on units and unit pairs within the search grid, such as determining the distance between units, the interaction between units, the movement of units, and the splitting and merging of units. The operations on multiple units can be divided into several groups and executed simultaneously on the vector arithmetic units, with different algorithms corresponding to different organizations of the arithmetic units.

The specific embodiments described above further explain the objects, technical solutions and beneficial effects of the present invention. It should be understood that the above merely illustrates some typical implementations of the claims of the present invention and is not intended to limit the present invention. All other implementations proposed by those skilled in the art within the spirit and principles of the present invention, such as those using different communication software and hardware or different node configurations, and any modifications, equivalent replacements and improvements made, shall fall within the scope of protection of the present invention.

Claims

1. A concurrent computational system for multi-scale discrete simulation, characterized in that it comprises three layers of computing equipment: the top layer consists of management-and-computing nodes, coupled by shared storage or by fully interconnected high-speed communication; the middle layer consists of computing nodes arranged in a regular array, with direct communication links or shared storage between adjacent middle-layer nodes; and the bottom layer consists of dedicated accelerators arranged inside the computing nodes;

wherein said top-layer management-and-computing nodes run the operating system and software modules for monitoring, analyzing and scheduling the computational load of each node, statistically analyzing results, judging convergence of the computation, and controlling the process; the middle-layer computing nodes compute long-range correlated interactions between units, global iteration and optimization algorithms; and the bottom-layer dedicated accelerators compute short-range interactions between units and the state changes and analysis of the units themselves.

2. The concurrent computational system according to claim 1, characterized in that the simulated system is discretized into three layers of subsystems: in the bottom-layer subsystem, adjacent units of the same layer interact; in the middle-layer subsystem, non-adjacent units also interact, and each unit also interacts with multiple units of the lower layer; and each unit of the top-layer subsystem interacts with multiple units of the middle layer; the interactions between said units proceed simultaneously, that is, not only are the interactions between model units of the same layer processed in parallel, but the interactions between units of adjacent layers are also processed in parallel at the same time, thereby achieving multi-scale parallel computation.

3. The concurrent computational system according to claim 1, characterized in that a management-and-computing node contains one or more general-purpose processors and no, or only a few, dedicated accelerators; a computing node contains one or more general-purpose processors and multiple dedicated accelerators; and said general-purpose processor is a general central processing unit, or another processor that can handle various complex instructions such as branch decisions and that has a large cache.

4. The concurrent computational system according to claim 3, characterized in that the middle-layer nodes are connected to the top-layer nodes through a global switch, or are first connected through a local switch to the global switch and then to the top-layer nodes; and said processor is a logical structural unit with independent computing and communication functions, comprising a chip, a chipset, a programmable gate array chip, a circuit board, or any combination of one or more of these.

5. The concurrent computational system according to claim 1, characterized in that said dedicated accelerator is a special-purpose processor suited to a certain class of problems, including but not limited to single-instruction-multiple-data stream processors, digital signal processors and programmable gate array processors; said dedicated accelerator has no memory or shares only a small amount of memory, but can deliver its results to a general-purpose processor or to other connected dedicated accelerators; said dedicated accelerator contains a large number of arithmetic units and registers, and a number of instruction control units; said arithmetic units include but are not limited to multipliers, adders and special-function solvers, each kind comprising scalar arithmetic units and vector arithmetic units of different word lengths; said arithmetic units comprise scalar arithmetic units and vector arithmetic units of different word lengths, scalar units can be programmably combined into vector units, and vector units of shorter word length can be combined into vector units of longer word length; arithmetic units of the same function and word length are divided into several groups, each arranged together, and groups of different function and word length are arranged alternately; and said registers and instruction control units are distributed between or near the arithmetic units.

6. The concurrent computational system according to claim 5, characterized in that programmable, reconfigurable connections are provided among said arithmetic units and between the arithmetic units and the registers and instruction control units, these connections being established by the instruction control units according to the programs and instructions received.

7. The concurrent computational system according to claim 5, characterized in that said arithmetic units operate in a data-driven manner, that is, as soon as a unit receives the specified data it starts its corresponding arithmetic operation and outputs the result to a specified subsequent arithmetic unit or register; the subsequent arithmetic unit then starts immediately, and so on.
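
The data-driven operation described in claim 7 can be illustrated with a minimal software sketch. It is not the claimed hardware: the `Unit` class, the `run` function and the token queue are hypothetical names introduced here only to show a unit firing when its last operand arrives and forwarding the result to its successor.

```python
from collections import deque


class Unit:
    """Hypothetical arithmetic unit that fires once all its operands have arrived."""

    def __init__(self, name, op, n_inputs):
        self.name = name
        self.op = op
        self.n_inputs = n_inputs
        self.inputs = {}    # operand slot -> received value
        self.targets = []   # (downstream unit, operand slot); wiring is illustrative

    def connect(self, target, slot):
        self.targets.append((target, slot))


def run(initial_tokens):
    """Deliver data tokens; each unit starts as soon as its last operand arrives."""
    results = {}
    queue = deque(initial_tokens)   # items are (unit, slot, value)
    while queue:
        unit, slot, value = queue.popleft()
        unit.inputs[slot] = value
        if len(unit.inputs) == unit.n_inputs:
            out = unit.op(*(unit.inputs[i] for i in range(unit.n_inputs)))
            unit.inputs = {}
            if unit.targets:
                # Forward the result so the subsequent unit can start in turn.
                for target, target_slot in unit.targets:
                    queue.append((target, target_slot, out))
            else:
                # No successor: treat the output as a register write.
                results[unit.name] = out
    return results


# Assumed example wiring: (a * b) + c with a multiplier feeding an adder.
mul = Unit("mul", lambda a, b: a * b, 2)
add = Unit("add", lambda a, b: a + b, 2)
mul.connect(add, 0)
result = run([(mul, 0, 3.0), (mul, 1, 4.0), (add, 1, 5.0)])
```

No central scheduler decides the order here; the adder runs only because the multiplier's output reaches it, which is the chain reaction the claim describes.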

8. The concurrent computational system according to claim 1, characterized in that said array is an arbitrarily extensible array, or an array formed by any repeatable arrangement pattern, including at least arrays formed from rectangles or cuboids, triangles or tetrahedra, or hexagons or tetrakaidecahedra; the edges of said array are either open or connected to the corresponding opposite edges.
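
As a minimal sketch of the two edge treatments in claim 8, the following assumes a two-dimensional rectangular array and four nearest neighbors per node; the function name `neighbors` and the `periodic` flag are illustrative and do not appear in the patent.

```python
def neighbors(i, j, rows, cols, periodic=False):
    """Return the nearest-neighbor coordinates of node (i, j) on a rows x cols array.

    periodic=False models open edges (edge nodes have fewer neighbors);
    periodic=True models edges connected to the corresponding opposite edges.
    """
    out = []
    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        ni, nj = i + di, j + dj
        if periodic:
            out.append((ni % rows, nj % cols))   # wrap around to the opposite edge
        elif 0 <= ni < rows and 0 <= nj < cols:
            out.append((ni, nj))                 # drop neighbors beyond an open edge
    return out


corner_open = neighbors(0, 0, 3, 4)
corner_wrapped = neighbors(0, 0, 3, 4, periodic=True)
```

With open edges a corner node keeps two neighbors; with connected edges every node has four, so the array has no boundary nodes at all.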

9. The concurrent computational system according to claim 2, characterized in that, when the simulated system is executed on said concurrent computational system, the topological relations among the units of the simulated system correspond to the topological relations among the nodes of said concurrent computational system, that is, the state variables of one or more bottom-layer units are stored in one bottom-layer special-purpose accelerator, and the state variables of neighboring units are likewise stored in neighboring nodes or in the same node; the state variables of the higher-layer unit corresponding to said bottom-layer units are stored in the computing node corresponding to said bottom-layer special-purpose accelerator; the main calculation of the state change of a unit is carried out in the node that stores its state variables, and the calculation results of a lower-layer node can be sent directly to a higher-layer node and used immediately.
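
One way to picture the topology-preserving placement of claim 9 is a block decomposition of a two-dimensional grid of bottom-layer units. This is only a sketch under assumed parameters: the functions `owner` and `parent`, the block size of 4 and the group size of 2 are hypothetical and not taken from the patent.

```python
def owner(cell, block=4):
    """Map a bottom-layer unit at grid position `cell` to the accelerator storing it."""
    return (cell[0] // block, cell[1] // block)


def parent(accel, group=2):
    """Map an accelerator to the upper-layer node holding the higher-layer state."""
    return (accel[0] // group, accel[1] // group)


def adjacent(a, b):
    """True if two node coordinates are the same node or direct neighbors."""
    return abs(a[0] - b[0]) <= 1 and abs(a[1] - b[1]) <= 1


# Check on an assumed 16 x 16 grid: neighboring units always land in the
# same accelerator or in neighboring accelerators.
size = 16
locality_ok = all(
    adjacent(owner((x, y)), owner((x + dx, y + dy)))
    for x in range(size)
    for y in range(size)
    for dx, dy in ((1, 0), (0, 1))
    if x + dx < size and y + dy < size
)
```

Because the mapping keeps neighbors together, the calculation for a unit can run where its state is stored and only needs data from the same node or an adjacent one.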

10. The concurrent computational system according to claim 3, characterized in that said general-purpose processor mainly performs program-flow control, branch decisions, searching, and complex logic and floating-point operations, whereas the special-purpose accelerator mainly performs simple, massively parallel logic and floating-point operations.

11. The concurrent computational system according to claim 2, characterized in that the interactions between said units are merged into several groups; the units in each group have different states but interact in the same manner, and their states are stored and operated on collectively as vectors in the special-purpose accelerator.
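
The grouping in claim 11 can be sketched in software as follows. The interaction kinds `"spring"` and `"repel"`, the `LAWS` table and the one-dimensional positions are assumptions made purely for illustration; plain Python lists stand in for the accelerator's vector storage.

```python
from collections import defaultdict

# Hypothetical interaction laws; each returns the force on the first unit of a pair.
LAWS = {
    "spring": lambda xi, xj: xj - xi,
    "repel": lambda xi, xj: 1.0 / (xi - xj),
}


def group_pairs(pairs):
    """Merge interactions into groups that share the same interaction law."""
    groups = defaultdict(list)
    for kind, i, j in pairs:
        groups[kind].append((i, j))
    return dict(groups)


def forces(positions, pairs):
    """Evaluate each group as a whole: gather states into vectors, apply one law."""
    total = [0.0] * len(positions)
    for kind, members in group_pairs(pairs).items():
        law = LAWS[kind]
        xi = [positions[i] for i, _ in members]    # gathered state vector
        xj = [positions[j] for _, j in members]
        f = [law(a, b) for a, b in zip(xi, xj)]    # same operation on every element
        for (i, j), fij in zip(members, f):
            total[i] += fij
            total[j] -= fij                        # equal and opposite reaction
    return total


total = forces([0.0, 1.0, 3.0], [("spring", 0, 1), ("repel", 1, 2), ("spring", 0, 2)])
```

Within a group only the data differ, not the operation, which is what makes the group suitable for vector arithmetic units.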

12. The concurrent computational system according to claim 2, characterized in that the calculation of the interactions between said units and the transfer of information between units can be carried out independently and asynchronously; the transferred information may include redundant information not required for the calculation at the current time; and several calculation time steps may correspond to one information transfer, or one calculation time step may correspond to several information transfers.
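
The case in claim 12 where several calculation time steps share one information transfer can be sketched with a one-dimensional three-point stencil split across two subdomains. Everything here is an illustrative assumption (the averaging rule, the function names, the halo width `k`); the sketch covers only the redundant-information aspect and runs sequentially, so it does not model asynchronous execution.

```python
def step(u):
    """One stencil update; the two end values are kept unchanged."""
    inner = [(u[i - 1] + u[i] + u[i + 1]) / 3.0 for i in range(1, len(u) - 1)]
    return [u[0]] + inner + [u[-1]]


def run_global(u, steps):
    """Reference: update the whole array on a single node."""
    for _ in range(steps):
        u = step(u)
    return u


def run_split(u, steps, k):
    """Two subdomains that exchange a halo of width k once every k steps.

    Assumes k is no larger than either subdomain.
    """
    mid = len(u) // 2
    left, right = u[:mid], u[mid:]
    for start in range(0, steps, k):
        n = min(k, steps - start)
        # One transfer: each side receives k cells, more than one step needs.
        l = left + right[:k]
        r = left[-k:] + right
        for _ in range(n):
            # Stale data creeps inward one cell per step, so the owned cells
            # remain correct for up to k steps without further communication.
            l, r = step(l), step(r)
        left, right = l[:mid], r[k:]
    return left + right


u0 = [float(i * i % 7) for i in range(12)]
reference = run_global(u0, 6)
split = run_split(u0, 6, 3)
```

In this sketch the wider halo trades extra transferred data and some duplicated arithmetic for fewer transfers: six time steps need only two exchanges instead of six.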