CN110262996A - Supercomputer based on high-performance reconfigurable computing - Google Patents

Supercomputer based on high-performance reconfigurable computing

Info

Publication number
CN110262996A
CN110262996A (application CN201910406990.1A; granted publication CN110262996B)
Authority
CN
China
Prior art keywords
rpu
array
information
reconfigurable
machine
Prior art date
Legal status
Granted
Application number
CN201910406990.1A
Other languages
Chinese (zh)
Other versions
CN110262996B (en)
Inventor
向志宏
吴君安
杨延辉
Current Assignee
Qingdao Tiankuo Information Technology Co ltd
Original Assignee
Beijing Super Dimension Computing Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Super Dimension Computing Technology Co Ltd
Priority: CN201910406990.1A
Publication of CN110262996A
Application granted
Publication of CN110262996B
Legal status: Active

Classifications

    • G06F13/382 — Information transfer, e.g. on bus, using universal interface adapter
    • G06F13/4221 — Bus transfer protocol, e.g. handshake; synchronisation, on a parallel bus being an input/output bus, e.g. ISA, EISA, PCI, SCSI bus
    • G06F15/7871 — Reconfiguration support, e.g. configuration loading, configuration switching, or hardware OS
    • G06F2213/0026 — PCI express
    • G06F2213/3852 — Converter between protocols
    • Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The present invention relates to a supercomputer based on high-performance reconfigurable computing, comprising: a machine perceptron for acquiring reconfigurable data; an RPU array for computing the input reconfigurable data; a master control system for controlling the transfer of reconfigurable data to the RPU array; a machine behavior device for outputting computation results and/or executing supercomputer instructions; and a compiling system for marking and preprocessing an application task, decomposing it into master-control execution code and RPU execution code, and ultimately generating the control code of the master control system, the elastic-connection control information, and the configuration information of the RPU array. Under the control of the control code, data paths are formed between the machine perceptron and the RPU array and between the machine behavior device and the RPU array; the elastic-connection control information shapes the RPU array into an elastic computing architecture; and the configuration information of the RPU array configures the RPUs in the RPU array to compute the reconfigurable data.

Description

Supercomputer based on high-performance reconfigurable computing
Technical field
The present invention relates to the field of reconfigurable computing, and in particular to an AI supercomputer based on high-performance reconfigurable computing.
Background technique
With the rapid progress of science and technology, artificial intelligence (AI) has developed by leaps and bounds. However, the overwhelming majority of the platforms on which AI runs are still built from central processing units (CPUs), graphics processing units (GPUs), field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), or combinations thereof. At present, these platforms still cause considerable difficulty when AI products are deployed to developers and users.
For example, the CPU offers the highest flexibility, but in scenarios such as AI that require massive parallel computation its efficiency ratio is very low. GPUs and FPGAs solve part of the parallel-computation problem, but power consumption and cost remain the main obstacles to their deployment. ASICs achieve a good efficiency ratio, but an ASIC can only serve a fixed algorithm and is helpless in the face of algorithm evolution. Moreover, platforms combining one or more of CPU, GPU, FPGA and ASIC are unsatisfactory in system-architecture complexity, scalability, system power consumption, and cost per unit of computing power.
Under the existing x86 architecture, products that extend AI computing power through the high-speed serial computer expansion bus standard PCI Express (peripheral component interconnect express, PCIE) are, in practice, heavily constrained both in their support for rapidly iterating AI algorithms and in the flexibility of computing-power deployment. The computing platform has thus become the greatest factor limiting AI deployment.
Summary of the invention
Based on the characteristics of AI computation, the present invention connects one or more reconfigurable processing unit (RPU) arrays through the PCIE interface of an x86-based host processor. Computing power can be deployed elastically according to product requirements and the usage environment, supporting edge computing, large-scale and very-large-scale computation, various neural-network computations without instruction-driven execution, online training, and online algorithm iteration, with high versatility, flexibility, and energy efficiency.
To achieve the above object, one aspect of the present invention provides a supercomputer based on high-performance reconfigurable computing, comprising: at least one machine perceptron for acquiring environment-perception information and/or device input information as reconfigurable data; at least one reconfigurable processing unit (RPU) array for computing the input reconfigurable data; a master control system for controlling the transfer of reconfigurable data to the at least one RPU array; at least one machine behavior device for outputting computation results and/or executing supercomputer instructions; and a compiling system for marking and preprocessing an application task, decomposing it into master-control execution code and RPU execution code, converting and optimizing the RPU execution code according to the at least one RPU array, and finally generating the control code of the master control system, the elastic-connection control information, and the configuration information of the RPU array. Under the control of the control code, data paths are formed between the at least one machine perceptron and the at least one RPU array and between the at least one machine behavior device and the at least one RPU array; the elastic-connection control information shapes the at least one RPU array into an elastic computing architecture; and the configuration information of the RPU array configures the RPUs in the at least one RPU array to compute the reconfigurable data.
Preferably, the master control system includes a platform controller hub (PCH) and a master controller based on the X86/AMD64 architecture. The PCH and the master controller are connected through a direct media interface (DMI). The PCH is connected to the at least one machine perceptron and transfers environment-perception information and/or device input information to the master controller. The master controller is connected to the at least one RPU array through a PCIE interface and transfers reconfigurable data to the at least one RPU array for computation. The PCH is also connected to the at least one machine behavior device and transfers computation results from the master controller to the at least one machine behavior device.
Preferably, the RPU array includes an elastic connection system (HEC_link) and one or more RPUs. Under the control of the elastic-connection control information, HEC_link connects the one or more RPUs; the RPUs obtain their corresponding configuration information through HEC_link, obtain reconfigurable data from the master control system or from other RPUs through HEC_link, and transfer computation results through HEC_link to the master control system or to other RPUs.
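The role the claim assigns to HEC_link can be sketched as follows. This is a hypothetical illustration, not the patented implementation: all class names, the chain-shaped connection, and the per-RPU "scale" configuration are invented stand-ins for the unspecified hardware behavior.

```python
# Hypothetical sketch: HEC_link wires RPUs together under elastic-connection
# control information, delivers each RPU its configuration, and routes
# reconfigurable data between the master control system and the RPUs.

class RPU:
    def __init__(self, rpu_id):
        self.rpu_id = rpu_id
        self.config = None

    def compute(self, data):
        # Stand-in for a reconfigurable computation: apply the configured
        # operation (here, a simple scale factor) to the input data.
        scale = self.config["scale"] if self.config else 1
        return [x * scale for x in data]

class HECLink:
    def __init__(self, rpus):
        self.rpus = {r.rpu_id: r for r in rpus}
        self.chain = []  # connection order chosen by the control information

    def apply_control_info(self, control_info):
        # Elastic-connection control information decides which RPUs are
        # linked and in what order (modeled as a simple chain here).
        self.chain = [self.rpus[i] for i in control_info["chain"]]

    def load_configs(self, configs):
        # Each RPU obtains its corresponding configuration through HEC_link.
        for rpu_id, cfg in configs.items():
            self.rpus[rpu_id].config = cfg

    def run(self, data):
        # Data enters from the master control system, flows RPU-to-RPU along
        # the chain, and the final result returns to the master.
        for rpu in self.chain:
            data = rpu.compute(data)
        return data

link = HECLink([RPU(0), RPU(1)])
link.apply_control_info({"chain": [0, 1]})
link.load_configs({0: {"scale": 2}, 1: {"scale": 3}})
result = link.run([1, 2])  # each input is scaled by 2, then by 3
```

The design point the sketch mirrors is that RPUs never talk to each other directly; every configuration, data, and result transfer passes through the elastic connection system.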
Preferably, the at least one RPU array is connected to the master control system through a PCIE interface, and HEC_link includes a PCIE protocol converter for performing protocol conversion between PCIE interface messages and the configuration bus and reconfigurable-data bus inside the at least one RPU array.
Preferably, according to the elastic-connection control information, HEC_link extends the computing depth and computing width of the one or more RPUs in the at least one RPU array, and groups the one or more RPUs so that the groups can: receive different reconfigurable data and execute different tasks; receive different reconfigurable data and execute the same task; receive the same reconfigurable data and execute different tasks; or receive the same reconfigurable data and execute the same task.
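The four grouping modes share one underlying idea: each RPU group independently pairs an input stream with a task, so groups may agree or differ in either dimension. A minimal sketch, with invented task names:

```python
# Each RPU group is modeled as an (input_data, task) pair; the four claimed
# modes are just the four ways two groups can share or differ in data/task.

def run_groups(groups):
    """groups: list of (input_data, task_fn) pairs, one per RPU group."""
    return [task(data) for data, task in groups]

double = lambda xs: [2 * x for x in xs]   # illustrative "task A"
total = lambda xs: sum(xs)                # illustrative "task B"

# Same data, different tasks:
out_same_data = run_groups([([1, 2, 3], double), ([1, 2, 3], total)])
# Different data, same task:
out_same_task = run_groups([([1, 2], total), ([3, 4], total)])
```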
Preferably, for an RPU array whose RPU count has been determined, the compiling system adjusts the width and/or depth through HEC_link to change the connection relationship of the one or more RPUs.
Preferably, the supercomputer further includes an operating system for managing the software, hardware and peripheral resources of the supercomputer, executing the compiled files output by the compiling system, obtaining information from the machine perceptron, controlling the machine behavior device to act on computation results, driving the at least one RPU array according to the configuration information of the RPU array, and controlling the compiling system to perform online compilation.
Preferably, the compiling system works in an offline compilation mode, in which a completed compiled file is transferred to the operating system, or in an online compilation mode, in which the operating system compiles and deploys in real time.
Preferably, the machine perceptron includes: end sensors for collecting surrounding-environment information and self-state information; and sensor modules that perform secondary analysis and computation on the information collected by the end sensors to generate environment-perception information and/or device input information as reconfigurable data.
Preferably, the machine behavior device includes a communication unit, a human-machine interface, a servo mechanism, and a control unit.
Preferably, the end sensors include: image sensors, millimeter-wave radar, ultrasonic radar, lidar, inertial measurement units, microphones, global navigation satellite system receivers, touch screens, and force sensors. The sensor modules include: RGB-D depth cameras, binocular depth cameras, and VIO three-dimensional reconstruction cameras.
Preferably, the environment-perception information includes vision, hearing, touch, taste, geographic position, and position change.
The present invention is implemented on the x86 architecture: the host processor connects to the machine perceptron and the machine behavior device, and connects one or more RPU arrays through the PCIE interface, so that computing power can be deployed flexibly according to product requirements and the usage environment. It simultaneously supports edge computing, large-scale and very-large-scale computation, supports various neural-network computations without instruction-driven execution, supports online training and online algorithm iteration, and offers high versatility, flexibility, and energy efficiency.
Detailed description of the invention
Fig. 1 is a schematic diagram of the architecture of the supercomputer based on high-performance reconfigurable computing provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of one elastic deployment of computing power provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of another elastic deployment of computing power provided by an embodiment of the present invention;
Fig. 4a is a schematic diagram of one elastic adjustment of computing power for executing multiple tasks provided by an embodiment of the present invention;
Fig. 4b is a schematic diagram of another elastic adjustment of computing power for executing multiple tasks provided by an embodiment of the present invention.
Specific embodiment
The technical solution of the present invention is described in further detail below with reference to the drawings and embodiments.
Fig. 1 is a schematic diagram of the architecture of the supercomputer based on high-performance reconfigurable computing provided by an embodiment of the present invention. The supercomputer involved in the present invention may be an AI supercomputer.
As shown in Fig. 1, in one embodiment the present invention provides a supercomputer based on high-performance reconfigurable computing, comprising: at least one machine perceptron for acquiring environment-perception information and/or device input information as reconfigurable data; at least one reconfigurable processing unit (RPU) array for computing the input reconfigurable data; a master control system for controlling the transfer of reconfigurable data to the at least one RPU array; at least one machine behavior device for outputting computation results and/or executing supercomputer instructions; and a compiling system for marking and preprocessing an application task, decomposing it into master-control execution code and RPU execution code, converting and optimizing the RPU execution code according to the at least one RPU array, and finally generating the control code of the master control system, the elastic-connection control information, and the configuration information of the RPU array. Under the control of the control code, data paths are formed between the at least one machine perceptron and the at least one RPU array and between the at least one machine behavior device and the at least one RPU array; the elastic-connection control information shapes the at least one RPU array into an elastic computing architecture; and the configuration information of the RPU array configures the RPUs in the at least one RPU array to compute the reconfigurable data.
In one embodiment, the master control system is based on the X86/AMD64 architecture. It is the central processing unit of the AI supercomputer (AI super computer, AISC), providing the hardware platform on which the AISC operating system and online compiling system run. It also serves as the control platform connecting multiple RPU arrays, as the peripheral control platform connecting multiple machine perceptrons and machine behavior devices, as the execution platform for general-purpose programs, and as the control platform for reconfigurable-data computation.
In one example, the master control system based on the X86/AMD64 architecture may include a platform controller hub (PCH) and a master controller based on the X86/AMD64 architecture.
The PCH is the main interface controller connecting the X86/AMD64 host processor to peripherals. The PCH and the master controller are connected through a direct media interface (DMI). The PCH is connected to the at least one machine perceptron and transfers environment-perception information and/or device input information to the master controller. The master controller is connected to the at least one RPU array through the PCIE interface and transfers reconfigurable data to the at least one RPU array for computation. The PCH is connected to the at least one machine behavior device and transfers computation results, or instructions to be executed by the AI supercomputer, from the master controller to the at least one machine behavior device.
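The described topology can be traced hop by hop. A minimal sketch under the assumption that each component is a named node and each link is a named interconnect; the "peripheral link" label for unnamed sensor/actuator connections is an invention of this illustration:

```python
# Model of the two data paths in the embodiment: perception data flows
# perceptron -> PCH -> (DMI) -> master controller -> (PCIE) -> RPU array,
# and results flow back out through the PCH to a machine behavior device.

INBOUND = ["machine perceptron", "PCH", "master controller", "RPU array"]
OUTBOUND = ["RPU array", "master controller", "PCH", "machine behavior device"]
LINKS = {
    ("PCH", "master controller"): "DMI",
    ("master controller", "PCH"): "DMI",
    ("master controller", "RPU array"): "PCIE",
    ("RPU array", "master controller"): "PCIE",
}

def trace(path):
    # Return (source, link, destination) for each hop along the path.
    hops = []
    for src, dst in zip(path, path[1:]):
        hops.append((src, LINKS.get((src, dst), "peripheral link"), dst))
    return hops

inbound_hops = trace(INBOUND)
outbound_hops = trace(OUTBOUND)
```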
In one example, the host processor based on the X86/AMD64 architecture may serve as the system-level reconfiguration controller of the AI supercomputer while also performing all the functions of a general-purpose CPU. The system-level reconfiguration controller, the HEC_Link controller in each RPU array, and the reconfiguration controller in each RPU together constitute the main body of the AI supercomputer's reconfiguration control.
In another example, the master control system based on the X86/AMD64 architecture may further include double-data-rate synchronous dynamic random-access memory (double data rate SDRAM, abbreviated DDR), so that the master controller can temporarily store data in DDR, preserving task-execution state and improving task-execution efficiency.
In one example, the compiling system of the AI supercomputer marks and preprocesses an application program and decomposes it into master-control execution code and RPU execution code. It then converts and optimizes the RPU execution code according to the RPU array, for example through task temporal partitioning, task-to-RPU partitioning, and generation of task configuration information. Compilation finally produces the control code of the master controller, the control information of RPU_Link, and the configuration information of the RPU array.
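The compilation flow above can be sketched as a function from an application task to the three artifacts the embodiment names. Everything here is an illustrative assumption: the op dictionaries, the round-robin placement standing in for task-to-RPU partitioning, and the artifact field names.

```python
# Hedged sketch of the compile pipeline: split ops into host vs RPU code,
# place RPU ops onto the array, and emit the three described artifacts
# (host control code, RPU_Link control information, per-RPU configuration).

def compile_task(task, array_size):
    host_ops = [op for op in task if op["target"] == "host"]
    rpu_ops = [op for op in task if op["target"] == "rpu"]
    # "Task-to-RPU partitioning", modeled as round-robin placement.
    placement = {i: [] for i in range(array_size)}
    for n, op in enumerate(rpu_ops):
        placement[n % array_size].append(op["name"])
    return {
        "control_code": [op["name"] for op in host_ops],
        "link_control": {"chain": list(range(array_size))},
        "rpu_config": placement,
    }

artifacts = compile_task(
    [{"name": "load", "target": "host"},
     {"name": "conv", "target": "rpu"},
     {"name": "relu", "target": "rpu"},
     {"name": "store", "target": "host"}],
    array_size=2)
```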
In one embodiment, the at least one RPU array is the main computing unit with which the AI supercomputer performs elastic computation; all reconfigurable-data computation in the AI supercomputer is completed on the RPU arrays. According to actual computing-power demand, RPUs can be added to or removed from an RPU array, or the arrangement and network architecture of an RPU array can be changed, thereby achieving elastic configuration of reconfigurable computing power.
In one example, an RPU array may include an elastic connection system HEC_link and one or more RPUs. HEC_link connects the one or more RPUs under the control of the elastic-connection control information. The RPUs obtain their corresponding configuration information through HEC_link, obtain reconfigurable data from the master control system or from other RPUs through HEC_link, and likewise transfer computation results through HEC_link to the master control system or to other RPUs. Those skilled in the art should note that the input source of reconfigurable data and the output destination of computation results depend on the control of the X86/AMD64 master control system and of HEC_Link.
In one embodiment, HEC_Link is the main logic unit and carrier with which the AI supercomputer realizes elastic configuration of computing power. In one example, HEC_Link can be configured according to different modes of computing-power extension and different application scenarios. HEC_Link connects the X86/AMD64 master control system with the multiple RPUs in an RPU array, realizing high-speed data transfer between RPUs and between an RPU and the master control system, as well as configuration-state and information communication between the master control system and the RPUs.
In one example, HEC_Link can, as needed, change the connection relationship of an RPU array whose RPU count has been determined, allowing the RPUs to be combined in different ways.
In one example, the compiling system adjusts the width and/or depth of the at least one determined RPU array through HEC_link to change the connection relationship of the one or more RPUs. For example, according to the elastic-connection control information, HEC_link extends the computing depth and computing width of the RPUs in the array, yielding the capacity to execute larger or more numerous programs. HEC_link can also group the RPUs according to the elastic-connection control information, so that the groups receive different reconfigurable data and execute different tasks; receive different reconfigurable data and execute the same task; receive the same reconfigurable data and execute different tasks; or receive the same reconfigurable data and efficiently execute the same task.
In one example, the at least one RPU array is connected to the X86/AMD64 master control system through a PCIE interface, and HEC_link may include a PCIE protocol converter for performing protocol conversion between PCIE interface messages and the configuration bus and reconfigurable-data bus inside the at least one RPU array.
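One way to picture the protocol converter is as a demultiplexer that splits incoming PCIE traffic onto the array's two internal buses. The sketch below is a loose analogy, not the PCIE protocol itself: the message fields, the address windows, and routing-by-address are all assumptions made for illustration.

```python
# Hypothetical converter: route each incoming message to the configuration
# bus or the reconfigurable-data bus by which address window it targets.

CONFIG_SPACE = 0x0000  # assumed window for RPU configuration writes
DATA_SPACE = 0x1000    # assumed window for reconfigurable-data traffic

def convert(pcie_msg, config_bus, data_bus):
    # Configuration-window messages go to the config bus; everything else
    # is treated as reconfigurable-data traffic. Offsets are rebased so
    # each internal bus sees addresses relative to its own window.
    if pcie_msg["addr"] < DATA_SPACE:
        config_bus.append((pcie_msg["addr"] - CONFIG_SPACE, pcie_msg["payload"]))
    else:
        data_bus.append((pcie_msg["addr"] - DATA_SPACE, pcie_msg["payload"]))

cfg, dat = [], []
convert({"addr": 0x0004, "payload": 0xAB}, cfg, dat)
convert({"addr": 0x1010, "payload": 0xCD}, cfg, dat)
```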
The present invention uses a PCIE interface to connect the master control system and the at least one RPU array. Compared with existing schemes that use dedicated interfaces, using the standard PCIE interface gives better versatility and reduces hardware cost and development cost.
In another example, HEC_link further includes a reconfigurable-data bus controller, a reconfigurable-data bus, reconfigurable-data bus bridge circuits, a configuration bus controller, and a configuration bus. The reconfigurable-data bus controller, the reconfigurable-data bus, and the bus bridge circuits together form the reconfigurable-data path, while the configuration bus controller and the configuration bus together form the RPU configuration-information path.
In one embodiment, the machine perceptron includes: end sensors for collecting surrounding-environment information and self-state information; and sensor modules for performing secondary analysis and computation on the information collected by the end sensors to generate environment-perception information and/or device input information as reconfigurable data.
In one example, the machine perceptron is a peripheral device of the AI supercomputer that supplies it with environment-perception information such as vision, hearing, touch, taste, geographic position and pose change, or with device input information. In one example, the machine perceptron includes end sensors, which collect information about the terminal, the surrounding environment and the device's own state, and sensor modules with perception-analysis and computing capability, which perform secondary analysis and computation on the data collected by the end sensors to generate new environment-perception information. The end sensors may include: image sensors (cameras), millimeter-wave radar, ultrasonic radar, lidar, inertial measurement units (IMU), microphones (MIC), global navigation satellite system (GNSS) receivers, touch panels, force sensors, and the like. The sensor modules may include RGB-D depth cameras, binocular depth cameras, VIO three-dimensional reconstruction cameras, and the like. In another example, the perception-analysis and computing capability can be based on a CPU, GPU, FPGA or DSP, or on an RPU.
Those skilled in the art should note that any device capable of providing input information may serve as a machine perceptron; the present invention is not limited in this respect.
In one embodiment, the machine behavior device is a peripheral device of the AI supercomputer that outputs the results of computation or inference in the AI supercomputer and executes the instructions of the AI supercomputer. A computation result may be the trained neural-network model output when training a neural network; an inference result may be the result obtained after passing sample data through the computation of a neural-network model. The machine behavior device may include a communication unit, a human-machine interface, a servo mechanism, a control unit, and the like.
Those skilled in the art should note that any device capable of executing an output result may serve as a machine behavior device; the present invention is not limited in this respect.
In one embodiment, the AI supercomputer further includes an operating system for managing the software, hardware and peripheral resources of the supercomputer, executing the compiled files output by the compiling system, obtaining information from the machine perceptron, controlling the machine behavior device to act on computation results, driving the at least one RPU array according to the configuration information of the RPU array, and controlling the compiling system to perform online compilation.
In one example, the operating system of the AI supercomputer manages the software, hardware and peripheral resources of the AI supercomputer. It executes the compiled files output by the compiling system and obtains the information provided by the machine perceptron. It also executes control code and outputs configuration information and reconfigurable data to the corresponding RPU arrays, controlling the execution of the reconfigurable-computing program and the return of its results. And it controls the machine behavior device to act on the computation results of the AI supercomputer and performs online compilation.
In another example, the operating system can also drive an RPU array according to the information of that RPU array.
In one example, the input of the compiling system can be an application program written in a high-level programming language such as C or Python, or one that additionally relies on a high-level application framework such as the TensorFlow or Caffe neural-network frameworks. The output of the compiling system can be the control code of the master controller for reconfigurable computing, the control information of RPU_Link, the configuration information of the RPU arrays, and compilation artifacts such as the general-purpose executable program of the master control system and process-state information of the compiled tasks.
In one example, the compiling system supports running neural networks on the RPU array without instruction-driven execution: neural-network tasks execute on the RPU array with no need for instructions. For example, the compiled file formed after compiling a neural network may include the control code of the host processor, the control information of RPU_Link, and the configuration information of each RPU in the RPU array. During task execution no instruction control is needed; execution is data-driven, going directly from reconfigurable-data input to output, and the result obtained by the master control system is the end-to-end result of the neural network. In another example, various neural networks can also be supported by extending the compiling system. The training and inference of neural networks contain massive parallel and repeated computation; the compiling system can adapt to the characteristics of various neural networks and finally derive the neural-network structure and network parameters to be deployed onto the corresponding RPU array.
In one example, the compiling system can elastically deploy the RPU array in multiple modes at compile time. The input of the multi-mode compiling system is not limited to a particular form, such as an application program or a neural network. According to the characteristics of the RPU array, the compiling system can use RPU_Link to extend the array's columns and rows. Extending the columns of the RPU array favors more parallel computation, while extending the rows accelerates the pipeline of a longer calculation sequence. RPU_Link is an open architecture: it places no limit on the number of rows and columns in the RPU array; the extension of columns and rows depends mainly on computing-power demand.
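The column/row trade-off above can be made concrete with a back-of-the-envelope model. The numbers and formulas are illustrative assumptions, not measurements from the patent: columns multiply how many independent inputs advance per step, while rows lengthen a pipeline so a long calculation sequence still advances one stage per step once the pipeline is full.

```python
# Toy model of column extension (parallelism) vs row extension (pipelining).

def throughput_per_step(columns):
    # One independent data item completes per column per step.
    return columns

def pipelined_steps(rows, n_items):
    # A 'rows'-stage pipeline finishes n items in rows + n - 1 steps,
    # versus rows * n steps when each item runs the full sequence serially.
    return rows + n_items - 1

wide = throughput_per_step(8)       # 8 columns -> 8 items per step
deep = pipelined_steps(4, 10)       # 4-stage pipeline, 10 items
serial = 4 * 10                     # the same 10 items without pipelining
```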
In another example, the connection relationship of the RPUs can also be changed through HEC_Link at compile time. For an RPU array whose RPU count has been determined, the width or depth can likewise be adjusted, to suit tasks with more parallel computation or with longer computation cycles, respectively.
In a further example, the connection relationship of the RPUs can be changed through HEC_Link at compile time so that one or several RPUs in the array form an independent RPU group. Multiple RPU groups can be formed within one RPU array, executing multiple tasks asynchronously and in parallel without interfering with each other, thereby achieving the multitask, highly parallel and efficient computing targets of multiple-instruction multiple-data (MIMD) and multiple-instruction single-data (MISD) operation.
For example, one or more different training tasks can be executed simultaneously; or one or more different inference tasks can be executed simultaneously; or multiple training tasks and multiple inference tasks can be executed simultaneously; or the same data can be input simultaneously to a training network and an inference network.
In a further example, multiple RPU arrays can also be attached to the master control system according to actual needs, to accommodate larger-scale computing-power deployment and parallel computation.
In one example, the compiling system can support multi-mode compilation. Those skilled in the art should note that when the RPU array changes — including but not limited to a change in the number of RPU arrays, a change in the number of RPUs in an array, or a change in the RPU connection topology — a partial or global redeployment of the task can be achieved by recompiling.
In one embodiment, the compiling system works in an offline compilation mode, transferring the completed compiled file to the operating system; or the compiling system works in an online compilation mode, compiling and deploying in real time for the operating system.
In one example, the multi-mode compilation supported by the compiling system can be the offline mode or the online mode. In offline mode, the output compiled file is passed as a whole to the operating system for execution. In online mode, once a part of the compilation has finished, it can be executed directly by the operating system as needed; in other words, the complete compiled file may not yet be finished, but the part that has been compiled can already be executed by the operating system, and the remaining parts are executed after they finish compiling. In another example, the compiling system can deploy an online-trained network structure in real time through online compilation, realizing online real-time updating of the neural network.
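The contrast between the two modes can be sketched as follows: offline compilation hands the operating system one finished file, while online compilation dispatches each finished part immediately. Class and method names are illustrative, not from the patent:

```python
class Compiler:
    def __init__(self, units):
        self.units = list(units)       # compilation units of the task

    def compile_offline(self):
        # Compile everything, then return the complete file in one piece.
        return [f"compiled:{u}" for u in self.units]

    def compile_online(self, dispatch):
        # Dispatch each part to the operating system as soon as it is ready;
        # this is what allows online-trained networks to be updated in place.
        for u in self.units:
            dispatch(f"compiled:{u}")

executed = []
c = Compiler(["layer1", "layer2"])
c.compile_online(executed.append)      # parts run before the whole file exists
print(executed)  # ['compiled:layer1', 'compiled:layer2']
```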
The present invention is based on the X86 architecture: the primary processor is connected with the machine perceptron and the machine behavior device, and one or more RPU arrays are connected through the PCIE interface, so that computing power can be flexibly deployed according to product demand and use environment. It can also support edge computing, large-scale computing, and ultra-large-scale computing; it can support various non-instruction-driven neural network computations, support online training and online algorithm iteration, and offers high versatility, flexibility, and energy efficiency.
Fig. 2 is a schematic diagram of computing power under an elastic deployment, provided in an embodiment of the present invention.
As shown in Fig. 2, in this example of elastic computing-power deployment, the task to be executed currently needs 6 RPUs, namely RPU00, RPU01, RPU10, RPU11, RPU20 and RPU21 as shown in Fig. 2; these 6 RPUs form an RPU array of 3 rows and 2 columns. If the task to be executed instead requires 16 RPUs, the RPU array can be extended by rows or columns. The computing-power diagram after the extension can be as shown in Fig. 3. Fig. 3 is a schematic diagram of another elastic computing-power deployment provided in an embodiment of the present invention. The task now uses 16 RPUs instead of the original 6, namely RPU00, RPU01, RPU02, RPU03, RPU10, RPU11, RPU12, RPU13, RPU20, RPU21, RPU22, RPU23, RPU30, RPU31, RPU32 and RPU33 as shown in Fig. 3; these 16 RPUs form an RPU array of 4 rows and 4 columns, thereby realizing the elastic deployment of the RPU array.
Fig. 4a is a schematic diagram of flexibly adjusting computing power to execute multiple tasks, provided in an embodiment of the present invention.
As shown in Fig. 4a, another example of flexibly adjusting computing power to execute multiple tasks is provided. The RPUs in the RPU array can be grouped in different numbers, e.g., group A is RPU 0 to RPU 2 and group B is RPU 3 to RPU 7. The different RPU groups can then execute different tasks, e.g., group A executes task A while group B executes task B. As shown in Fig. 4a, task A needs to call RPU 0 to RPU 2 in the RPU array; task B needs to call RPU 3 to RPU 7.
The master control system transmits, through the configuration bus, the configuration information for configuring the bridge module and the configuration information for configuring the RPU array to the protocol controller. The protocol controller performs protocol conversion on the received configuration information, converting it into a serial signal, and transmits the converted serial signal to the bridge module through the first control bus in the first group of channels. Since the second control bus in the bridge module is connected with the first control bus in the first group of channels, the serial signal carried on the first control bus is transmitted through the second control bus to the bridge controller in the bridge module. The bridge controller transmits the configuration information for configuring the RPU array, through the RPU control buses by which the second control bus connects to each RPU in the RPU array, to the corresponding RPUs, thereby completing the configuration of the RPUs.
The bridge controller configures the bridging submodule according to the configuration information for configuring the bridge module. In one example, task A needs 4 RPUs for serial computation, namely RPU 0 to RPU 3, and when task A executes, the RPUs must be called in order, from RPU 0 through RPU 3. The bridge controller bridges the corresponding access matrix in the bridging submodule, such as the access matrix in the bridge module in Fig. 4a, where the black-dot intersections mark the bridged positions, connecting the RPU input channel and RPU output channel that meet at each intersection. For task A, the bridging shown in Fig. 4a is: the second matrix output channel (Output_Array) is bridged to the input channel of RPU 0 (Input_RPU 0); the output channel of RPU 0 (Output_RPU 0) is bridged to the input channel of RPU 1 (Input_RPU 1); the output channel of RPU 1 (Output_RPU 1) is bridged to the input channel of RPU 2 (Input_RPU 2); and the output channel of RPU 2 (Output_RPU 2) is bridged to the input channel of RPU 3 (Input_RPU 3). When task A can be completed with a single pass through RPU 0 to RPU 3, the output channel of RPU 3 (Output_RPU 3) is bridged to the second matrix input channel (Input_Array). In another example, if task A requires multiple iterations through RPU 0 to RPU 3, the output channel of RPU 3 (Output_RPU 3) is instead bridged to the input channel of RPU 0 (Input_RPU 0). Those skilled in the art should note that the bridge from the second matrix output channel (Output_Array) to the input channel of RPU 0 (Input_RPU 0) must be released at this point, so as to complete the recursive call of the RPUs. When the last iteration has been executed, the bridge between the output channel of RPU end (Output_RPU end) and the other RPU input channels (Input_RPU) is released, and the output channel of RPU end (Output_RPU end) is bridged to the second matrix input channel (Input_Array), so that the calculation result is transferred to the HEC master control system. Here RPU end denotes the last RPU to execute, which can be any one of RPU 0 to RPU 3.
For task B, the bridging shown in Fig. 4a is: the second matrix output channel (Output_Array) is bridged to the input channel of RPU 4 (Input_RPU 4); the output channel of RPU 4 (Output_RPU 4) is bridged to the input channel of RPU 5 (Input_RPU 5); the output channel of RPU 5 (Output_RPU 5) is bridged to the input channel of RPU 6 (Input_RPU 6); and the output channel of RPU 6 (Output_RPU 6) is bridged to the input channel of RPU 7 (Input_RPU 7). When task B can be completed with a single pass through RPU 4 to RPU 7, the output channel of RPU 7 (Output_RPU 7) is bridged to the second matrix input channel (Input_Array). In another example, if task B requires multiple iterations through RPU 4 to RPU 7, the output channel of RPU 7 (Output_RPU 7) is instead bridged to the input channel of RPU 4 (Input_RPU 4). Those skilled in the art should note that the bridge from the second matrix output channel (Output_Array) to the input channel of RPU 4 (Input_RPU 4) must be released at this point, so as to complete the recursive call of the RPUs. When the last iteration has been executed, the bridge between the output channel of RPU end (Output_RPU end) and the other RPU input channels (Input_RPU) is released, and the output channel of RPU end (Output_RPU end) is bridged to the second matrix input channel (Input_Array), so that the calculation result is transferred to the HEC master control system. Here RPU end denotes the last RPU to execute, which can be any one of RPU 4 to RPU 7.
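The bridging for tasks A and B amounts to a crossbar whose set points connect one output channel to one input channel at a time. The sketch below models only that bookkeeping — setting, releasing, and looping a chain back on itself. The class and its methods are invented for illustration; the actual bridging submodule is hardware:

```python
class BridgeMatrix:
    """Toy model of the bridging submodule: a map from each output
    channel to the single input channel it is currently bridged to."""
    def __init__(self):
        self.bridges = {}

    def bridge(self, out_ch, in_ch):
        self.bridges[out_ch] = in_ch

    def release(self, out_ch):
        self.bridges.pop(out_ch, None)

# Task A, single pass: Output_Array -> RPU0 -> RPU1 -> RPU2 -> RPU3 -> Input_Array
m = BridgeMatrix()
m.bridge("Output_Array", "Input_RPU0")
for i in range(3):                        # chain RPU i -> RPU i+1
    m.bridge(f"Output_RPU{i}", f"Input_RPU{i+1}")
m.bridge("Output_RPU3", "Input_Array")    # single pass: result goes to master

# Multi-iteration variant: release the feed-in bridge first (as the text
# requires), then loop RPU3 back to RPU0; after the last iteration the loop
# would be released and the final RPU bridged to Input_Array instead.
m.release("Output_Array")
m.bridge("Output_RPU3", "Input_RPU0")
print(m.bridges["Output_RPU3"])  # Input_RPU0
```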
Those skilled in the art should note that the number of RPUs depends on the actual computing task, and the present invention places no specific limit on the RPU count. Meanwhile, those skilled in the art should also note that, for different tasks, the second output channel must be bridged to different RPU input channels by time-division multiplexing, and the second input channel must likewise be bridged to different RPU output channels by time-division multiplexing, so as to realize the parallel execution of multiple tasks.
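The time-division multiplexing of the shared second output channel can be sketched as alternating time slots, each bridging the channel to a different task's head RPU. Round-robin scheduling here is an assumption for illustration — the patent fixes no slot policy, and the names are invented:

```python
import itertools

def tdm_schedule(tasks, n_slots):
    """Yield (slot, task, head_input_channel) assignments round-robin:
    in each slot the shared output channel is bridged to one task's
    head RPU input channel."""
    cycle = itertools.cycle(tasks.items())
    return [(slot, name, head) for slot, (name, head) in zip(range(n_slots), cycle)]

slots = tdm_schedule({"A": "Input_RPU0", "B": "Input_RPU4"}, 4)
print(slots[0])  # (0, 'A', 'Input_RPU0')
print(slots[1])  # (1, 'B', 'Input_RPU4')
```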
With the RPU array bridged by the above bridge module, when task A and task B need to execute in parallel, the reconfigurable data used for calculation is transmitted through the data bus to the protocol controller; the protocol controller performs protocol conversion on the reconfigurable data of task A and task B, converting it into serial signals, and transmits them through the first matrix output channel in the first group of channels to the second matrix output channel of the bridge module. According to the bridging described above, the bridge module transmits the serial signal of task A from the second matrix output channel to the RPU 0 input channel, and the serial signal of task B from the second matrix output channel to the RPU 4 input channel.
For task A, the serial signal is transmitted to RPU 0 through the RPU 0 input channel in the second group of channels connected with RPU 0, and the corresponding calculation is performed there. After RPU 0 finishes its calculation, the computed data is transmitted, through the RPU 0 output channel in the second group of channels connected with RPU 0, to the RPU 0 output channel in the bridge module. The bridge module then, according to the bridging described above, transmits the computed data from the RPU 0 output channel to the RPU 1 input channel, so that the data enters RPU 1 and the calculation continues. The above process is repeated until the calculation task is completed, whereupon the computed data is transmitted through the RPU end output channel in the second group of channels connected with RPU end to the RPU end output channel in the bridge module. According to the bridging described above, the bridge module transmits the computed data to the second matrix input channel, and then through the first group of channels to the protocol controller. After the protocol controller performs protocol conversion on the computed data, the data is transmitted to the HEC master control system, completing the calculation task.
For task B, the serial signal is transmitted to RPU 4 through the RPU 4 input channel in the second group of channels connected with RPU 4, and the corresponding calculation is performed there. After RPU 4 finishes its calculation, the computed data is transmitted, through the RPU 4 output channel in the second group of channels connected with RPU 4, to the RPU 4 output channel in the bridge module. The bridge module then, according to the bridging described above, transmits the computed data from the RPU 4 output channel to the RPU 5 input channel, so that the data enters RPU 5 and the calculation continues. The above process is repeated until the calculation task is completed, whereupon the computed data is transmitted through the RPU end output channel in the second group of channels connected with RPU end to the RPU end output channel in the bridge module. According to the bridging described above, the bridge module transmits the computed data to the second matrix input channel, and then through the first group of channels to the protocol controller. After the protocol controller performs protocol conversion on the computed data, the data is transmitted to the HEC master control system, completing the calculation task.
Those skilled in the art should note that, during the above task execution, the RPUs executing the task in the RPU array, as well as the bridge module, can also feed their own status information back through the respective control buses to the bridge controller of the bridge module and to the HEC master control system, so that when a complex task is executed, the bridging in the bridge module can be adjusted dynamically and the RPUs can be adjusted to execute different calculation tasks.
During the above task execution, the configuration information of each iteration is transmitted on demand to the RPU array through the control bus. According to the configuration information and the task-execution status of each RPU in the RPU array, the bridge module dynamically controls the bridging of input channels and output channels, so that each RPU performs the corresponding calculation on the data delivered by its input channel according to the corresponding configuration information, outputs the computed data through its output channel, and feeds back its own status information through the corresponding control bus.
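The feedback loop just described — RPUs report status over their control buses, and the bridge controller rewires accordingly — might look like the following in pseudocode form. The status values and the rewiring rule are illustrative assumptions, not the patent's mechanism:

```python
def control_loop(statuses, bridges):
    """statuses: {rpu_index: 'busy' | 'done'};
    bridges: {output_channel: input_channel} (mutated in place).

    When an RPU reports 'done', route its output channel back to the
    master (Input_Array) instead of the next RPU, freeing the rest of
    the chain for a new configuration."""
    for rpu, state in statuses.items():
        if state == "done":
            bridges[f"Output_RPU{rpu}"] = "Input_Array"
    return bridges

b = {"Output_RPU0": "Input_RPU1"}
control_loop({0: "done", 1: "busy"}, b)
print(b["Output_RPU0"])  # Input_Array
```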
Fig. 4b is a schematic diagram of another example of flexibly adjusting computing power to execute multiple tasks, provided in an embodiment of the present invention.
As shown in Fig. 4b, the logical architecture of Fig. 4a is illustrated. For the parallel execution of the multiple tasks in Fig. 4a, the master control system can configure, through the configuration bus, the calling order of the RPUs of the different tasks in the RPU array: for task A, RPU 0 to RPU 2 are called in order; for task B, RPU 3 to RPU 7 are called in order. The reconfigurable data for calculating task A and task B is transmitted through the data bus to the corresponding RPUs in the RPU array, and the calculation is performed. When the calculation is finished, the computed data of task A and task B is transmitted back to the master control system through the data bus. Those skilled in the art should note that, during the calculation, the master control system can also send new configuration information through the configuration bus according to the status information fed back by the bridge module and by the RPUs in use in the RPU array, for dynamically configuring the bridging in the bridging submodule of the bridge module and the RPUs in the RPU array.
The present invention is realized on the X86 architecture: the primary processor is connected with the machine perceptron and the machine behavior device, and one or more RPU arrays are connected through the PCIE interface, so that computing power can be flexibly deployed according to product demand and use environment. It can also support edge computing, large-scale computing, and ultra-large-scale computing; it can support various non-instruction-driven neural network computations, support online training and online algorithm iteration, and offers high versatility, flexibility, and energy efficiency.
Those skilled in the art should further appreciate that the units and algorithm steps described in connection with the embodiments disclosed herein can be implemented with electronic hardware, computer software, or a combination of the two. In order to clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are implemented in hardware or software depends on the specific application and the design constraints of the technical solution. A skilled professional may use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of the present invention.
The steps of the method or algorithm described in connection with the embodiments disclosed herein can be implemented with hardware, with a software module executed by a processor, or with a combination of the two. The software module can reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, a register, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium well known in the technical field.
The specific embodiments described above further explain the objects, technical solutions, and beneficial effects of the present invention in detail. It should be understood that the foregoing is merely a specific embodiment of the present invention and is not intended to limit the scope of protection of the present invention; any modification, equivalent substitution, improvement, etc. made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.

Claims (10)

1. A supercomputer based on high-performance reconfigurable computing, characterized by comprising:
at least one machine perceptron, for obtaining environment sensing information and/or equipment input information as reconfigurable data;
at least one reconfigurable computing unit (RPU) array, for calculating the input reconfigurable data;
a master control system, for controlling the transmission of the reconfigurable data to the at least one RPU array;
at least one machine behavior device, for outputting a calculation result and/or executing a supercomputer instruction; and
a compiling system, for marking and preprocessing an application task and decomposing it into master-control-system execution code and RPU execution code, and for performing code conversion and optimization on the RPU execution code according to the at least one RPU array, ultimately generating control code of the master control system, elastic connection control information, and each item of configuration information of the RPU array; so that, under the control of the control code, a data path between the at least one machine perceptron and the at least one RPU array is formed, and a data path between the at least one machine behavior device and the at least one RPU array is formed; the elastic connection control information causes the at least one RPU array to form an elastic computing architecture; and each item of configuration information of the RPU array configures the RPUs in the at least one RPU array for calculating the reconfigurable data.
2. The supercomputer according to claim 1, characterized in that the master control system comprises: a platform controller hub (PCH) and a master controller based on the X86/AMD64 architecture; the PCH is connected with the master controller through a direct media interface (DMI);
the PCH is connected with the at least one machine perceptron, for transmitting the input environment sensing information and/or equipment input information to the master controller based on the X86/AMD64 architecture;
the master controller based on the X86/AMD64 architecture is connected with the at least one RPU array through the PCIE interface, for transmitting the reconfigurable data to the at least one RPU array for calculation; and
the PCH is connected with the at least one machine behavior device, for transmitting a calculation result from the master controller based on the X86/AMD64 architecture to the at least one machine behavior device.
3. The supercomputer according to claim 1, characterized in that the RPU array comprises:
an elastic connection system HEC_link; and
one or more RPUs; wherein
the HEC_link connects the one or more RPUs under the control of the elastic connection control information;
the one or more RPUs obtain corresponding configuration information through the HEC_link; and
the one or more RPUs obtain the reconfigurable data from the master control system or from other RPUs through the HEC_link, and transmit calculation results to the master control system or to other RPUs through the HEC_link.
4. The supercomputer according to claim 3, characterized in that the at least one RPU array and the master control system are connected through a PCIE interface, and the HEC_link comprises:
a PCIE protocol converter, for performing protocol conversion between the PCIE interface information and the configuration bus and reconfigurable data bus of the at least one RPU array.
5. The supercomputer according to claim 3, characterized in that the HEC_link, according to the elastic connection control information, extends the calculation depth and calculation width of one or more RPUs in the at least one RPU array;
and groups one or more RPUs in the at least one RPU array, for:
inputting different reconfigurable data and executing different tasks, respectively; or
inputting different reconfigurable data and executing the same task, respectively; or
inputting the same reconfigurable data and executing different tasks, respectively; or
inputting the same reconfigurable data and executing the same task, respectively.
6. The supercomputer according to claim 3, characterized in that the compiling system adjusts the width and/or depth of the determined at least one RPU array by changing the connection relationship of the one or more RPUs through the HEC_link.
7. The supercomputer according to claim 1, characterized by further comprising:
an operating system, for managing the software and hardware resources and peripheral resources of the supercomputer; executing the compiled file output by the compiling system; obtaining the information from the machine perceptron; controlling the machine behavior device to execute a calculation result; driving the at least one RPU array according to each item of configuration information of the RPU array; and controlling the compiling system to perform online compilation.
8. The supercomputer according to claim 7, characterized in that the compiling system works in an offline compilation mode, transferring the completed compiled file to the operating system; or
the compiling system works in an online compilation mode, compiling and deploying in real time for the operating system.
9. The supercomputer according to claim 1, characterized in that the machine perceptron comprises:
an end sensor, for acquiring surrounding environment information and self-status information; and
a sensor module, for performing secondary analytical calculation on the surrounding environment information and self-status information acquired by the end sensor, generating the environment sensing information and/or equipment input information as the reconfigurable data.
10. The supercomputer according to claim 1, characterized in that the machine behavior device comprises: a communication unit, a man-machine interface, a servo mechanism, and a control unit.
CN201910406990.1A 2019-05-15 2019-05-15 Super computer based on high-performance reconfigurable calculation Active CN110262996B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910406990.1A CN110262996B (en) 2019-05-15 2019-05-15 Super computer based on high-performance reconfigurable calculation


Publications (2)

Publication Number Publication Date
CN110262996A true CN110262996A (en) 2019-09-20
CN110262996B CN110262996B (en) 2023-11-24

Family

ID=67914737

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910406990.1A Active CN110262996B (en) 2019-05-15 2019-05-15 Super computer based on high-performance reconfigurable calculation

Country Status (1)

Country Link
CN (1) CN110262996B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111859904A (en) * 2020-07-31 2020-10-30 南京三百云信息科技有限公司 NLP model optimization method and device and computer equipment
CN112202243A (en) * 2020-09-17 2021-01-08 许继集团有限公司 Full-acquisition intelligent terminal for power transmission line state monitoring
TWI798642B (en) * 2021-02-09 2023-04-11 寧茂企業股份有限公司 Array controlling system and controlling method thereof

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9646243B1 (en) * 2016-09-12 2017-05-09 International Business Machines Corporation Convolutional neural networks using resistive processing unit array
CN108804379A (en) * 2017-05-05 2018-11-13 清华大学 Reconfigurable processor and its configuration method


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Wang Yansheng, "Research on Energy-Efficient Key Configuration Techniques in Coarse-Grained Dynamically Reconfigurable Processors", China Doctoral Dissertations Full-text Database, Information Science and Technology *
Wei Shaojun et al., "Reconfigurable computing processor technology", Scientia Sinica Informationis *
Huang Shi et al., "Object-oriented simulation research on coarse-grained reconfigurable parallel computing", Computer Engineering and Design *


Also Published As

Publication number Publication date
CN110262996B (en) 2023-11-24


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20231030

Address after: Room 1802, Taixin Building, No. 33 Jinhua Road, Shibei District, Qingdao City, Shandong Province, 266011

Applicant after: Qingdao TianKuo Information Technology Co.,Ltd.

Address before: 100142 907, area 1, floor 9, No. 160, North West Fourth Ring Road, Haidian District, Beijing

Applicant before: BEIJING HYPERX AI COMPUTING TECHNOLOGY Co.,Ltd.

GR01 Patent grant