This application claims the benefit under 35 USC 120 of U.S. patent application Serial No. 12/794,996, filed June 7, 2010, entitled "System and Method for High-Performance, Low-Power Data Center Interconnect Fabric," which is incorporated herein by reference in its entirety. In addition, this application claims the benefit under 35 USC 119(e) and 120 of U.S. Provisional Patent Application Serial No. 61/383,585, filed September 16, 2010, entitled "Performance and Power Optimized Computer System Architectures and Methods Leveraging Power Optimized Tree Fabric Interconnect," which is incorporated herein by reference in its entirety.
Embodiments
Performance- and power-optimized computer system architectures and methods that leverage a power-optimized tree fabric interconnect are disclosed. One embodiment builds a low-power server cluster using tiled building-block structures; other embodiments implement storage solutions or cooling solutions. Yet another embodiment switches other types of traffic through the fabric.
Co-pending patent application 12/794,996 describes an architecture of a power-optimized server communication fabric that supports routing using a tree-like or graph topology that supports multiple links per node, where each link is designated as an up, down, or lateral link within the topology. The system uses a segmented MAC architecture which may re-purpose MAC IP so that it can be used for both internal MACs and external MACs, and leverage what would normally be the physical signaling for the MAC to feed into the switch. The Calxeda XAUI system interconnect reduces rack power, reduces rack wiring, and shrinks rack size. There is no need for expensive, high-power Ethernet switches or for high-power Ethernet PHYs on the individual servers. It dramatically reduces cabling (cable complexity, cost, and a significant source of failures). It also enables heterogeneous server mixing inside a rack, supporting any device that uses Ethernet, SATA, or PCIe. In this architecture, power savings come primarily from two architectural aspects: 1) minimizing Ethernet PHYs across the fabric, replacing them with point-to-point XAUI interconnects between nodes, and 2) dynamically adjusting the XAUI link width and speed based on load.
Fig. 3 illustrates a network aggregation 200. The network supports 10 Gb/sec Ethernet communication 201 (thick lines) between an aggregation router 202 and three racks 203a-c. In rack 203a, the Calxeda interconnect fabric provides multiple high-speed 10 Gb paths, represented by thick lines, between the servers 206a-d on shelves within the rack. The embedded switch in the servers 206a-d can replace a top-of-rack switch, saving a significant amount of power and cost while still providing a 10 Gb Ethernet port to the aggregation router. The Calxeda switching fabric can integrate traditional Ethernet (1 Gb or 10 Gb) into the Calxeda XAUI fabric, and the Calxeda servers can act as a top-of-rack switch for third-party Ethernet-connected servers.
The middle rack 203b illustrates another scenario, in which Calxeda servers 206e, f can be integrated into an existing data center rack that contains a top-of-rack switch 208a. In this case, the IT group can continue to have their other servers connected via 1 Gb Ethernet up to the existing top-of-rack switch. The Calxeda internal servers can be connected via the Calxeda 10 Gb XAUI fabric, and they can be integrated up to the existing top-of-rack switch with either 1 Gb or 10 Gb Ethernet interconnects. Rack 203c on the right shows the way data center racks have traditionally been deployed. The thin red lines represent 1 Gb Ethernet. Thus, the current deployment of data center racks is traditionally 1 Gb Ethernet up to the top-of-rack switch 308b, and then 10 Gb (thick red line 201) out from the top-of-rack switch to the aggregation router. Note that all servers are present in an unknown quantity; for clarity and brevity, they are shown here in a finite quantity. Also, with the enhanced Calxeda servers, no additional routers are needed, because they operate their own XAUI switching fabric, discussed below.
Fig. 4 illustrates an overview of an exemplary "data center in a rack" 400 according to one embodiment. It has 10 Gb Ethernet PHYs 401a-n and a 1 Gb private Ethernet PHY 402. Large computers (power servers) 403a-n support search, data mining, indexing, Hadoop (a Java software framework), MapReduce (a software framework introduced by Google to support distributed computing on large data sets on clusters of computers), cloud applications, etc. Computers (servers) 404a-n with local flash and/or solid-state disks (SSDs) support search, MySQL, CDN, software-as-a-service (SaaS), cloud applications, etc. A single, large, slow fan 405 augments the convection cooling of the vertically mounted servers above it. Data center 400 has an array 406 of hard disks, e.g., in a Just a Bunch of Disks (JBOD) configuration, and, optionally, Calxeda servers in a disk form factor (the green boxes in arrays 406 and 407), optionally acting as disk controllers. Hard disk servers or Calxeda disk servers may be used for web servers, user applications, cloud applications, etc. Also shown are an array 407 of storage servers and legacy servers 408a, b (any size, any vendor) with standard Ethernet interfaces for legacy applications.
Fig. 5 illustrates a high-level topology 500 of the network system described in co-pending patent application 12/794,996, showing the XAUI-connected SoC nodes connected by the switching fabric. The 10 Gb Ethernet ports Eth0 501a and Eth1 501b come from the top of the tree. Ovals 502a-n are Calxeda nodes that comprise both computational processors and an embedded switch. The nodes have five XAUI links connected to the internal switch. The switching layers use all five XAUI links for switching. Level 0 leaf nodes 502d, e (i.e., N0n nodes, or Nxy, where x = level and y = item number) use only one XAUI link to attach to the interconnect, leaving four high-speed ports that can be used as XAUI, 10 Gb Ethernet, PCIe, SATA, etc., for attachment to I/O. Most trees and fat trees have active nodes only as leaf nodes, while the other nodes are pure switching nodes. This approach makes routing much more straightforward. Topology 500 has the flexibility to permit every node to be a combined computational and switching node, or just a switching node. Most tree-type implementations have I/O on the leaf nodes, but topology 500 lets I/O be on any node. In general, placing the Ethernet at the top of the tree minimizes the average number of hops to the Ethernet.
Building power-optimized server fabric boards from tiled building blocks
Fig. 6 illustrates a server board composed of multiple server nodes interconnected with the described point-to-point interconnect. The server board has:
- Each oval in this diagram is a standalone server node comprising processor, memory, I/O, and the fabric switch.
- The fabric switch has the ability to dynamically modify the width (number of lanes) and speed of each link, independently for each link.
- The 14-node board example shows two Ethernet escapes from the fabric. Typically these Ethernet escapes would be routed to a standard Ethernet switch or router. These Ethernet escapes can be standard 1 Gb or 10 Gb Ethernet.
- The 14-node example topology is a butterfly fat tree, which provides redundant paths to allow adaptive routing around failed links and around localized hot spots.
- The 3-node aggregator board allows large server fabrics to be composed using only two board tiles.
- A second aggregator is added for redundancy.
- I/O:
- PCIe connectors for the Smooth-Stone fabric.
- Optional Ethernet support (none, or 1, 2, 5, 10, or 20 Gb/sec).
- The Ethernet bandwidth decision is based on the bandwidth required by the application.
- Nodes on the aggregator board can be pure switching nodes, or full computational nodes that include switching.
- The board I/O can be PCIe connectors supporting 2 x4 XAUI (2 Smooth-Stone fabric links) and/or optional Ethernet support (none, or 1, 2, 10, or 20 Gb/sec).
- The example fabric topology, illustrated by the 14-node example, minimizes the number of links crossing off the board so that connectors (size and number) and the associated cost are minimized, while still retaining the Ethernet escapes and multi-path redundancy.
- Two aggregator boards can be used to provide path redundancy when scaling the fabric.
- Power savings can be realized with static link configuration:
- The lowest-level nodes in the figure (labeled leaf nodes) can run at 1 Gb/sec.
- The first-layer switching nodes in the figure (labeled Layer 1 switches) have an incoming bandwidth of 3 Gb/sec from the leaf nodes. This allows a static link configuration of 2.5 or 5 Gb/sec between the Layer 1 and Layer 2 switches.
- The links fanning out from the Layer 2 switching layer can run at 10 Gb/sec.
- In this topology, since the majority of nodes are leaf nodes, the majority of links are running at the slower rate (1 Gb/sec in this example), thereby minimizing networking power consumption.
- Ethernet escapes are allowed to be pulled from any node in the fabric, allowing the fabric designer to trade off the bandwidth required for the Ethernet escapes, the number of ports used on the top-of-rack switch, and the cost and power associated with the Ethernet ports.
- Power savings can be further optimized via dynamic, link-usage-driven link configuration. In this example, each link and its associated port of the fabric switch contain bandwidth counters with configurable threshold events that allow the link width and speed to be reconfigured both up and down based on dynamic link usage; a sketch of this mechanism is given after this list.
- Since in many common server use cases Ethernet traffic is predominantly node to external Ethernet, rather than node to node, the proposed fabric structures, and specifically the butterfly fat tree example, minimize the number of hops across the fabric to the Ethernet, thereby minimizing latency. This allows the creation of a large low-latency fabric to the Ethernet, while using switches that have a relatively small number of switching ports (5 in this example).
- Server 209a in Fig. 2 illustrates another new use of the defined server fabric for integrating existing servers. In this case, to take advantage of the performance and power management of the server fabric, and to minimize the use of ports on the top-of-rack switch, existing servers are heterogeneously integrated onto the defined server fabric so that the Ethernet traffic from the existing server can be gateway'ed into the fabric, allowing communication with nodes in the fabric, and so that the 209a Ethernet traffic is carried through the fabric to the uplink Ethernet port 201.
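As a concrete illustration of the dynamic, link-usage-driven configuration mentioned in the list above, the following C sketch shows how per-port bandwidth counters with configurable thresholds could drive link width up and down. The structure fields, threshold values, and function names are illustrative assumptions, not the actual register interface of the fabric switch.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical per-link state: lane count and per-lane rate in Gb/sec. */
struct fabric_link {
    int      lanes;          /* current number of active XAUI lanes (1..4)   */
    double   gbps_per_lane;  /* current per-lane signaling rate              */
    uint64_t bytes_seen;     /* bandwidth counter sampled each interval      */
};

/* Assumed configurable thresholds, expressed as utilization fractions. */
#define UTIL_UP_THRESHOLD   0.80  /* widen the link above 80% utilization   */
#define UTIL_DOWN_THRESHOLD 0.20  /* narrow the link below 20% utilization  */

/* Reconfigure one link based on the bandwidth counter for the last interval. */
static void adjust_link(struct fabric_link *l, double interval_sec)
{
    double capacity_gbps = l->lanes * l->gbps_per_lane;
    double used_gbps = (l->bytes_seen * 8.0) / (interval_sec * 1e9);
    double util = used_gbps / capacity_gbps;

    if (util > UTIL_UP_THRESHOLD && l->lanes < 4) {
        l->lanes *= 2;                      /* widen the link */
    } else if (util < UTIL_DOWN_THRESHOLD && l->lanes > 1) {
        l->lanes /= 2;                      /* narrow the link to save power */
    }
    l->bytes_seen = 0;                      /* restart the counter */
    printf("util=%.2f -> %d lanes @ %.1f Gb/s per lane\n",
           util, l->lanes, l->gbps_per_lane);
}

int main(void)
{
    struct fabric_link leaf = { .lanes = 1, .gbps_per_lane = 2.5 };

    leaf.bytes_seen = 300000000;            /* ~2.4 Gb/s over a 1 s interval */
    adjust_link(&leaf, 1.0);                /* high utilization: widens link */

    leaf.bytes_seen = 20000000;             /* ~0.16 Gb/s */
    adjust_link(&leaf, 1.0);                /* low utilization: narrows link */
    return 0;
}
```

In a hardware implementation the thresholds would be programmable registers and the adjustment would be a link retraining event rather than a function call; the sketch only illustrates the counter-and-threshold decision.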
Figs. 6a-6c illustrate another fabric topology example, a 48-node fabric topology composed of 12 cards, each card containing 4 nodes, plugged into a system board. This topology provides some redundant links, but without massive redundancy. The topology has 4 Ethernet gateway escapes, each of which can be 1 Gb or 10 Gb, but not all of these Ethernet gateways need to be used or connected. In the example shown, 8 fabric links are brought off each four-node card, and in one example a PCIe x16 connector is used to bring the 4 fabric links off the card.
Summary/overview of building power-optimized server fabric boards from tiled building blocks
1. A server tree fabric that allows any number of Ethernet escapes across the server interconnect fabric, minimizing the number of Ethernet PHYs and thereby saving the power and cost consumed by the Ethernet PHYs and by the associated ports on the top-of-rack Ethernet switches/routers.
2. Switching nodes can be pure switching nodes that save power by turning off the computational subsystem, or full computational subsystems that include the fabric switch. Referring to Fig. 17, in one implementation multiple power domains are used to decouple the computational subsystem (block 905) from the management processor (block 906) and the fabric switch (the remainder of the blocks). This allows an SOC configured with a computational subsystem (block 905) to have that subsystem powered off while the management processing in block 906, as well as the hardware packet switching and routing done by the fabric switch, continue.
3. The butterfly fat tree topology server fabric provides a minimal number of on-board links (saving power and cost) and a minimal number of links crossing boards (saving power and cost), while allowing redundant link paths both on a board and across boards.
4. The proposed baseboards and aggregator boards allow scalable, fault-resilient server fabrics to be composed with only two board building blocks.
5. Tree-oriented server fabrics, and variants such as the example butterfly fat tree, allow static link width and speed to be specified, bounded by the aggregate bandwidth of a node's children, thereby allowing simple link configuration while minimizing interconnect power.
6. Power savings can be further optimized via dynamic, link-usage-driven link configuration. In this example, each link and its associated port of the fabric switch contain bandwidth counters with configurable threshold events that allow the link width and speed to be reconfigured both up and down based on dynamic link usage.
7. Since in many common server use cases Ethernet traffic is predominantly node to external Ethernet, rather than node to node, the proposed tree fabric structures, and specifically the butterfly fat tree example, minimize the number of hops across the fabric to the Ethernet, thereby minimizing latency. This allows the creation of a large low-latency fabric to the Ethernet, while using switches that have a relatively small number (5 in this example) of switching ports.
8. Heterogeneous integration of existing Ethernet-connected servers onto the defined server fabric, carrying their Ethernet traffic through the fabric.
Building power-optimized server shelves and racks from tiled building blocks
These board "tiles" of fabric-connected server nodes can now be composed into server shelves and racks. Fig. 7 illustrates an example of how a passive backplane can connect 8 of the 14-node boards and 2 aggregation boards to compose a shelf of 236 server nodes. As an example, each board can be 8.7" tall + mechanicals < 10.75" for 6U, with interleaved fins for density, and 16 boards fit in a 19-inch-wide rack. The backplane can be simple and cheap, with PCIe connectors and routing, where the routing carries the XAUI signals (blue and green) plus power, which is very simple and requires no other wires. Ethernet connections are shown at the 8-board aggregation points.
Fig. 8 illustrates an example of extending the fabric across shelves and linking shelves across a server rack. Ethernet escapes can be pulled from any node in the fabric; in this example, they are pulled from the passive backplanes that connect the multi-node blades.
Summary/overview of building power-optimized server shelves and racks from tiled building blocks
1. Use of PCIe connectors to bring Ethernet escapes and off-board XAUI links off the board, so that boards can be linked together into point-to-point server fabrics, not using PCIe signaling but using the physical connector for board power and XAUI signals, while keeping redundant communication paths for fail-over and hot-spot reduction.
2. Complete XAUI point-to-point server interconnect fabrics formed with passive backplanes.
3. Ethernet escapes from the fabric across the rack at every level of the tree, not just at the top of the tree.
4. Ethernet escapes across the fabric can be dynamically enabled and disabled to match bandwidth needs and optimize power usage.
5. Node-to-node traffic, including system management traffic, stays on the fabric across the rack and never needs to traverse the top-of-rack Ethernet switch.
Storage
Fig. 9a illustrates an exemplary server 700 in a disk form factor, typically a standard 2.5-inch or 3.5-inch hard disk drive (HDD) form factor with SCSI or SATA drives, according to one embodiment. The server board 701 fits within the same infrastructure as a disk drive 702 in a current disk rack. The server 701 is a full server with DDR, a server SoC, optional flash, local power management, and SATA connections to disks (1-16..., limited by connector size). Its output can be Ethernet or the Calxeda fabric (XAUI), with two XAUI outputs for fail-over. Alternatively, it can use PCIe instead of SATA (for SSDs or other devices that need PCIe), with 1 to 4 nodes to balance compute versus storage demand. Such a server can do RAID implementations and LAMP stack server applications. Using a Calxeda ServerNode on each disk provides a full LAMP stack server with 4 GB of DDR3 and multiple SATA interfaces. Optionally, a second node can be added, with 8 GB of DDR, if needed.
Figs. 9b and 9c illustrate exemplary arrays 710 and 720, respectively, of disks using the storage server 1-node SATA board/server combinations 700a-n described above, according to one embodiment. Interconnecting with a standard or proprietary high-speed network or interconnect eliminates the need for a large Ethernet switch, saving power, reducing cost, reducing heat, and reducing area. Each board 701 is smaller than the height and depth of the disks. The array can be arranged with alternating disks and boards, as shown in Fig. 9b, or one board can serve multiple disks, e.g., in a disk, disk, board, disk, disk arrangement, as shown in Fig. 9c. Thus, compute power can be flexibly matched to the ratio of disks. The connectivity of the boards 701a-n can be on a per-node basis, with SATA to link a disk and multiple SATAs to link multiple disks. It can also be node to node, where each node has two XAUIs for redundancy, in the fabric configuration described earlier and in application 61/256,723. The nodes connect via the XAUI fabric. Such connections form tree or fat tree topologies, node to node to node to node, where deterministic, oblivious, or adaptive routing moves data in the correct direction. Alternatively, a fully proprietary interconnect can be used, forwarding to other processing units. Some ports can forward to Ethernet outputs or any other I/O conduit. Each node can forward Ethernet or XAUI directly inside the "box" and then to a PHY, or XAUI to a XAUI-to-PHY aggregator (switch). Or any combination of the above can be used. In other cases, the SATA connections can be replaced with PCIe, to be used with SSDs that have PCIe connections. Some SSDs come in disk form factors with PCIe or SATA. PCIe and SATA can also be mixed. Ethernet outside the box can replace XAUI for the system interconnect. In some cases, standard SATA connectors can be used, but in other cases higher-density connectors with proprietary wiring through a proprietary backplane can be made.
In another scenario, the server capability can be placed inside the disk drive, adding a disk to provide a full server in a single disk drive form factor. For example, a ServerNode can be placed on a board inside the disk. This approach can be implemented with XAUI or Ethernet connectivity. In such cases, the inventors' server-on-chip approach can serve as a disk controller plus a server. Fig. 9d illustrates this concept. A standard 3.5-inch drive is shown in Fig. 9d, item 9d0. It has a circuit board 9d1 that controls the disk drive. A large amount of space, labeled 9d2, is unused within the drive, and a Calxeda low-power small server node PCB can be built to fit into this unused space within the disk drive.
Fig. 9e illustrates an implementation that puts multiple server nodes into a standard 3.5-inch disk drive form factor. In this case, the connector from the server PCB to the backplane provides the XAUI-based server fabric interconnect, to supply the network and inter-server communication fabric, as well as 4 SATA ports for connection to adjacent SATA drives.
Fig. 10 illustrates an implementation for deeply integrating servers with storage. The server node (101) is a complete low-power server that integrates computational cores, DRAM, integrated I/O, and the fabric switch. In this example, server node 101 is shown in the same form factor as a standard 2 1/2-inch disk drive (102). (103) shows these server nodes and disk drives combined in one-to-one pairs, where each server node has its own local storage. (104) shows a server node controlling 4 disk drives. System (105) illustrates combining these storage servers via the uniform server fabric, in this example then pulling 4 10 Gb/sec Ethernet escapes from the fabric to connect to an Ethernet switch or router.
Fig. 11 illustrates a specific implementation of this dense packing of storage and servers by showing the use of an existing 3.5-inch JBOD (Just a Bunch of Disks) storage box. In this case, the JBOD mechanicals, including the disk housing, are unchanged, and a one-to-one pairing of storage nodes with disk drives within the unmodified JBOD box is shown. This illustrates a concept in which the server nodes are pluggable modules that plug into an underlying motherboard containing the fabric links. In this illustration, 23 3.5-inch disks (shown as rectangles in the logical view) are placed in the standard JBOD box, and 31 server nodes (shown as ovals/circles in the logical view) are included in the JBOD box to control the 23 disks and expose two 10 Gb/sec Ethernet links (shown as dark wide lines in the logical view). The tightly integrated server/storage concept simply takes an off-the-shelf storage JBOD box and adds 31 server nodes, communicating by the power-optimized fabric, within the same form factor. This maps well to applications that favor having local storage.
Fig. 12 illustrates a related concept, using the fact that server nodes can be instantiated in the same form factor as 2.5-inch drives. In this case, they are integrated into a 2.5-inch JBOD with 46 disks. This concept shows 64 server nodes integrated within the same form factor as the JBOD storage. In this example, two 10 Gb Ethernet links, as well as a 1 Gb/sec management Ethernet link, are pulled from the fabric.
Summary/overview of storage
1. Use of PCIe connectors to bring Ethernet escapes and off-board XAUI links off the board, so that boards can be linked together into point-to-point server fabrics, not using PCIe signaling but using the physical connector for board power and XAUI signals, while keeping redundant communication paths for fail-over and load balancing.
2. Conversion of existing JBOD storage systems, by pairing small-form-factor, low-power-configuration server nodes with the disks and using the defined server fabric, into very high-density compute servers tightly paired with local storage and integrated via the power- and performance-optimized server fabric, creating new high-performance compute server and storage server solutions without affecting the physical and mechanical design of the JBOD storage system.
3. A method of packaging complete servers in the form factor of a hard disk drive, for use in high-density computing systems, so that some of the drives can be replaced with additional servers.
4. As in claim 3, wherein the servers are connected to a network via an additional switching fabric.
5. As in claim 3, wherein the backplane housed in the drive enclosure is replaced with a backplane adapted to create at least one switched interconnect channel.
6. A method of integrating a low-power server PCB into the empty space within a standard 3.5-inch disk drive, for use in high-density storage systems, to provide integrated computational capability within the disk drive.
Cooling of racks of integrated low-power servers
One aspect that drives low-power computer server solutions is managing temperature, cooling, and air movement through the chassis and across the rack. Minimization of fans is one aspect of reducing the total cost of ownership (TCO) of low-power servers. Fans add cost and complexity, reduce reliability because of their moving parts, consume significant power, and produce a great deal of noise. Reducing and removing fans can provide significant benefits in reliability, TCO, and power consumption.
Fig. 13 illustrates a new implementation supporting chimney cooling of a rack, either of the whole rack or of only one section of the rack. An important aspect of the chimney rack concept is the single fan: rising natural convection is used with the help of one fan. A large, slow fan can cool the entire rack. It can be placed at the bottom of the rack, or below a vertically arranged, convection-cooled subset of the rack. As cooling air arrives at the bottom, the fan pushes it up through the chimney and out the top. Because all boards are vertical, there is no horizontal blockage. Although the fan is shown at the bottom of the rack in this example, it can be at any position in the system. By contrast, a "traditional" arrangement with vents and fan cooling below and the top left as a vertical stack could suffer horizontal blockage. This vertical, bottom-cooled approach can work for small systems. The fan can be variable speed and temperature dependent.
Fig. 13a illustrates an exemplary view of the thermal convection principle 500 used in the chimney rack concept. Modules are placed in a slanted alignment so that the heat-generating double data rate (DDR) memory chips 503a-n are offset from the heat flows 501a-n rising from the printed circuit board 502, so that the heat-generating chips do not stack thermally or heat one another. In this example, the DDR chips are placed diagonally rather than stacked vertically above one another, because stacked chips tend to heat each other. Also, the DDR chips are placed above, rather than below, the large computing chips 504a such as ASICs, SOCs, or processors, because they would otherwise tend to heat the SOC. And the flash chips 506 (the coolest chips) are placed below the SOC. Similarly, as discussed below, the nodes are not stacked vertically. Fig. 14 extends this concept to show how the server nodes are placed diagonally relative to one another to minimize self-heating across server nodes.
Fig. 15 illustrates an exemplary 16-node system according to one embodiment, with heat waves rising from the printed circuit boards. For a typical 16-node system, the individual nodes are arranged so that the heat from each unit rises without heating the units above it. The overall enclosure is typically longer, not as tall, and less dense. Also, rather than mounting the PCBs diagonally as shown, the PCBs could be squarely aligned and rectangular, with the components placed in a diagonal alignment to minimize mutual heating. PCBs in different rows can have complementary layouts, or be correspondingly staggered, to reduce mutual heating. Similarly, Fig. 16 illustrates a higher-density variant of the 16-node system, with the nodes similarly arranged to minimize self-heating across nodes.
An additional cooling concept for racks of low-power servers is to create an updraft using aerodynamic air pressure differences, without fans. One technique for doing so is a sealed rack that creates an extended vertical vent stack. The vent stack is tall enough (on the order of 20-30 feet or more) to create enough of an air pressure difference to produce the updraft. This provides completely passive air movement and cooling for a rack of low-power servers.
Summary/overview of cooling for rack-mounted low-power servers
1. A method of placing heat-generating components on vertically mounted boards, for use in high-density computing systems, wherein no heat-generating component is placed directly above or below another heat-generating component.
2. As in claim 1, wherein the components are arranged substantially diagonally across the mounting board.
3. As in claim 1, wherein the components are arranged substantially in intersecting diagonals across the mounting board.
4. As in claims 1, 2, and 3, wherein the mounting board is a printed wiring board.
Server fabric switching of non-Ethernet packets
As described in co-pending patent application 12/794,996, Fig. 17 illustrates the internals of the server node fabric switch. Fig. 17 shows a block diagram of an exemplary switch 900 according to one aspect of the system and method disclosed herein. It has four areas of interest 910a-d. Area 910a corresponds to Ethernet packets between the CPUs and the internal MACs. Area 910b corresponds to Ethernet frames at the Ethernet physical interface at the internal MACs, which contain the preamble, start-of-frame, and inter-frame gap fields. Area 910c corresponds to Ethernet frames at the Ethernet physical interface at the external MAC, which contain the preamble, start-of-frame, and inter-frame gap fields. Area 910d corresponds to Ethernet packets with a routing header 901, between the routing header processor and the external MAC 904. This segmented MAC architecture is asymmetric. The internal MACs have an Ethernet physical signaling interface into the routing header processor, and the external MAC has an Ethernet packet interface into the routing header processor. Thus the MAC IP is re-purposed for both internal MACs and external MACs, and what would normally be the physical signaling for the MAC is used to feed into the switch. The MAC configuration is such that the operating system device drivers of the A9 cores 905 manage and control the internal Eth0 MAC 902 and the internal Eth1 MAC 903. The device driver of the management processor 906 manages and controls the internal Eth2 MAC 907. The external Eth MAC 904 is not controlled by a device driver. MAC 904 is configured in promiscuous mode to pass all frames without any filtering, for network monitoring. Initialization of this MAC is coordinated between the hardware instantiation of the MAC and any other necessary management processor initialization. The external Eth MAC 904 registers are visible in both the A9 905 and management processor 906 address maps. Interrupts from the external Eth MAC 904 can be routed to either the A9 or the management processor.
A key to the node is the routing header processor 910d, which adds the fabric routing header to a packet when receiving a packet from a MAC destined for the switch, and removes the fabric routing header when receiving a packet from the switch destined for a MAC. The fabric switch itself routes only on the node ID and other information contained in the fabric routing header, and does no packet inspection of the original packet, as sketched below.
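The following C sketch illustrates this behavior: a routing header is prepended to frames arriving from a MAC and stripped from frames leaving toward a MAC, while the switch looks only at the header. The field layout, sizes, and names are assumptions for illustration, not the actual header format.

```c
#include <stdint.h>
#include <string.h>
#include <stdio.h>

/* Hypothetical fabric routing header; the switch routes only on these fields. */
struct route_hdr {
    uint16_t dst_node;   /* destination node ID in the fabric  */
    uint16_t src_node;   /* originating node ID                */
    uint8_t  dst_port;   /* MAC/port index on the destination  */
    uint8_t  flags;
};

/* Ingress from a MAC toward the switch: prepend the routing header. */
static size_t add_route_hdr(const uint8_t *frame, size_t len,
                            uint16_t dst_node, uint16_t src_node,
                            uint8_t dst_port, uint8_t *out)
{
    struct route_hdr h = { dst_node, src_node, dst_port, 0 };
    memcpy(out, &h, sizeof h);
    memcpy(out + sizeof h, frame, len);   /* payload is carried opaquely */
    return sizeof h + len;
}

/* Egress from the switch toward a MAC: strip the routing header. */
static size_t strip_route_hdr(const uint8_t *rframe, size_t len, uint8_t *out)
{
    size_t payload = len - sizeof(struct route_hdr);
    memcpy(out, rframe + sizeof(struct route_hdr), payload);
    return payload;
}

int main(void)
{
    uint8_t eth_frame[64] = { 0xde, 0xad };   /* stand-in Ethernet frame */
    uint8_t routed[128], recovered[128];

    size_t rlen = add_route_hdr(eth_frame, sizeof eth_frame, 7, 2, 0, routed);
    size_t plen = strip_route_hdr(routed, rlen, recovered);
    printf("routed frame %zu bytes, recovered payload %zu bytes\n", rlen, plen);
    return 0;
}
```

Because only the header is examined, the same mechanism carries the non-Ethernet payloads described in the following sections.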
Distributed PCIe fabric
Fig. 18 illustrates a server node that includes a PCIe controller connected to the internal CPU bus fabric. This allows the creation of a new PCIe switching fabric that uses the high-performance, power-optimized server fabric to create a scalable, high-performance, power-optimized PCIe fabric.
The technique is as follows:
- The PCIe controller 902 is connected to a mux 902a that allows the PCIe controller to connect either directly to the external PCIe PHY or to the PCIe routing header processor 910c. When mux 902a is configured to pass the PCIe traffic to the local PCIe PHY, this is equivalent to a standard local PCIe connection. When mux 902a is configured to direct the PCIe traffic to the PCIe routing header processor 910c, this enables the new distributed PCIe fabric switching mechanism.
- The PCIe routing header processor 910c uses the routing information embedded in the packet (address, ID, or implicit) to create a fabric routing header that routes the PCIe packet to the PCIe controller of the destination fabric node; a sketch of this mapping is given after this list.
- This provides an advantage analogous to a distributed PCIe fabric: the server fabric provides the networking.
- PCIe transactions originating from the processor cores (905) can be routed to the local PCIe PHY (via the mux bypass or via the switch), routed to any other node in the fabric, or routed directly to the internal PCIe controller (902) or the external PCIe controller/PHY (904).
- Likewise, incoming PCIe transactions arrive at the external PCIe controller (904), get tagged with the fabric routing header by the PCIe routing header processor (910), and the fabric then transports the PCIe packet to its final destination.
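A minimal sketch of the address-to-node mapping step performed by the PCIe routing header processor is shown below, assuming a simple address-window table; the window layout and names are hypothetical and not the actual hardware tables.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical address window describing which fabric node owns a PCIe range. */
struct pcie_window {
    uint64_t base;
    uint64_t size;
    uint16_t node_id;    /* destination fabric node for this window */
};

static const struct pcie_window windows[] = {
    { 0x0000000000ULL, 0x40000000ULL, 0 },   /* local node            */
    { 0x0040000000ULL, 0x40000000ULL, 5 },   /* remote node 5's PCIe  */
    { 0x0080000000ULL, 0x40000000ULL, 9 },   /* remote node 9's PCIe  */
};

/* Resolve a PCIe memory address to the node whose PCIe controller owns it;
 * the returned node ID goes into the fabric routing header. */
static int addr_to_node(uint64_t addr, uint16_t *node_id)
{
    for (unsigned i = 0; i < sizeof windows / sizeof windows[0]; i++) {
        if (addr >= windows[i].base && addr < windows[i].base + windows[i].size) {
            *node_id = windows[i].node_id;
            return 0;
        }
    }
    return -1;   /* no window matched; transaction stays on the local PHY */
}

int main(void)
{
    uint16_t node;
    if (addr_to_node(0x0052000000ULL, &node) == 0)
        printf("PCIe packet routed to fabric node %u\n", node);
    return 0;
}
```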
Distributed bus protocol fabric
Fig. 18a illustrates an additional extension, showing that multiple protocol bridges can take advantage of the fact that the fabric switch routes on the routing header rather than directly on the underlying packet payload (e.g., a layer 2 Ethernet frame). In this illustration, three protocol bridges are shown: Ethernet, PCIe, and a bus protocol bridge.
The role of the bus protocol bridge is to take a processor or internal SOC fabric protocol, add the Calxeda fabric routing header to its packets, and then route them through the Calxeda fabric.
As one feasible example, consider a bus protocol within the SOC such as AMBA AXI, HyperTransport, or QPI (QuickPath Interconnect).
Consider the following data flow:
- A processor on the internal SOC bus fabric issues a memory load (or store) request.
- The physical address targeted by the memory operation has been mapped to a remote node on the fabric.
- The bus transaction passes to the bus protocol bridge, which:
- Packetizes the bus transaction.
- Maps the physical address of the memory transaction to the remote node, and uses that node ID when building the routing header.
- Builds the routing frame, which contains the routing header with the remote node ID and, as the payload, the packetized bus transaction (a sketch of this step is given after this list).
- The bus transaction routing frame passes through the fabric switch, across the fabric, and is received by the fabric switch of the destination node.
- The destination node's bus protocol bridge unpacks the packetized bus transaction, issues the bus transaction onto the target SOC fabric, completes the memory load, and returns the result through the same steps, with the result arriving back at the originating node.
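As a sketch of the bridging flow listed above, the following C fragment packetizes a memory load into a routing frame addressed to the remote node that owns the physical address. The frame layout and the address-to-node mapping are illustrative assumptions under a simple ownership scheme, not the actual bridge implementation.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Packetized bus transaction carried as an opaque payload over the fabric. */
struct bus_txn {
    uint8_t  opcode;      /* 0 = load, 1 = store */
    uint8_t  size;        /* access size in bytes */
    uint64_t phys_addr;   /* target physical address */
    uint64_t data;        /* store data / load result */
};

struct routing_frame {
    uint16_t dst_node;    /* remote node that owns phys_addr */
    uint16_t src_node;    /* originating node, for the reply */
    struct bus_txn txn;   /* packetized bus transaction */
};

/* Assumed mapping: each 1 GiB of physical address space belongs to one node. */
static uint16_t owner_of(uint64_t phys_addr)
{
    return (uint16_t)(phys_addr >> 30);
}

/* Build the routing frame for a remote memory load issued on the SOC bus. */
static struct routing_frame bridge_load(uint64_t phys_addr, uint8_t size,
                                        uint16_t self)
{
    struct routing_frame rf;
    memset(&rf, 0, sizeof rf);
    rf.dst_node = owner_of(phys_addr);
    rf.src_node = self;
    rf.txn.opcode = 0;                 /* load */
    rf.txn.size = size;
    rf.txn.phys_addr = phys_addr;
    return rf;                         /* handed to the fabric switch */
}

int main(void)
{
    struct routing_frame rf = bridge_load(0x0000000180001000ULL, 8, 3);
    printf("load of 0x%llx routed from node %u to node %u\n",
           (unsigned long long)rf.txn.phys_addr, rf.src_node, rf.dst_node);
    return 0;
}
```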
Network processor and server fabric integration
Fig. 19 shows an illustration of integrating the server fabric with a network processor (911). There are several use cases for integrating the server fabric with a network processor, including:
- The network processor can serve as a network packet processing accelerator for the local processor (905) and for any other processor on the fabric.
- The usage can be network-processor centric, where incoming packets from the external Ethernet are targeted at the network processor, and control plane processing can be offloaded to the larger processing cores (905).
- The server fabric can serve as the communication fabric between network processors.
To enable these new use cases, the network processor is assigned a MAC address. In the switch fabric shown in Fig. 19, no routing header processors are attached to ports 1-4. Agents connected directly to ports 1-4 therefore need to inject packets that have the fabric switch header prepended to the packet payload. The network processor adds fabric switch integration to its design with the following steps:
- Outgoing packets from the network processor are tagged with the fabric switch header, which encodes the destination node ID derived from the destination MAC.
- Incoming packets from the fabric switch to the network processor have the fabric switch header removed before Ethernet packet processing. A sketch of the tagging step follows this list.
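The sketch below shows, under assumed table and header formats, how an agent attached directly to a switch port (with no routing header processor in front of it) could derive the destination node ID from the destination MAC address and prepend the fabric switch header before injecting a packet. The MAC-to-node convention used here is a hypothetical stand-in for whatever lookup the real design uses.

```c
#include <stdint.h>
#include <string.h>
#include <stdio.h>

struct fabric_hdr {
    uint16_t dst_node;    /* node ID encoded from the destination MAC */
    uint16_t src_node;
};

/* Assumed convention: the low two bytes of a fabric-local MAC carry the
 * node ID.  A real implementation would consult a learned MAC table. */
static uint16_t mac_to_node(const uint8_t mac[6])
{
    return (uint16_t)((mac[4] << 8) | mac[5]);
}

/* Tag an outgoing Ethernet frame from the network processor. */
static size_t tag_for_fabric(const uint8_t *frame, size_t len,
                             uint16_t self, uint8_t *out)
{
    struct fabric_hdr h = { mac_to_node(frame), self };  /* dst MAC is first */
    memcpy(out, &h, sizeof h);
    memcpy(out + sizeof h, frame, len);
    return sizeof h + len;
}

int main(void)
{
    uint8_t frame[60] = { 0x02, 0x00, 0x00, 0x00, 0x00, 0x0c };  /* dst MAC */
    uint8_t tagged[128];
    size_t n = tag_for_fabric(frame, sizeof frame, 1, tagged);
    printf("injected %zu-byte frame toward node %u\n", n, mac_to_node(frame));
    return 0;
}
```

On receive, the same agent simply drops the first `sizeof(struct fabric_hdr)` bytes before handing the frame to its Ethernet packet processing.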
External device and server fabric integration
Fig. 19 also shows an illustration of integrating the server fabric with an arbitrary external device (912). The external device represents any processor, DSP, GPU, I/O, or other communication or processing device that needs an inter-device communication fabric. A typical use case would be a large processing system composed of DSP or GPU processors that require an interconnect fabric between the DSPs or GPUs.
The fabric switch routes packets based on the fabric routing header and does no packet inspection of the packet payload. There is no assumption that the packet payload is formatted as an Ethernet frame; it is treated as a completely opaque payload.
This allows an external device (e.g., a DSP or GPU processor) to attach to the fabric switch and use the scalable, high-performance, power-optimized communication fabric with the following steps:
- Add a routing frame header, containing the destination node ID for the packet, to any packet payload sent to the fabric switch.
- Strip the routing frame header from packets received from the fabric switch.
Load balancing
Considering the fabric topology shown in Fig. 5, each of the nodes in the fabric exposes at least one MAC address and IP address, with external Ethernet connectivity provided through the gateway nodes shown at 501a and 501b.
Exposing these fine-grained MAC and IP addresses works well for large-scale web operations that use hardware load balancers, because it presents the load balancer with a simple flat list of MAC/IP addresses to operate on, with the internal structure of the fabric invisible to the load balancer.
But smaller data centers can be overwhelmed by the potentially large number of new MAC/IP addresses exposed by high-density low-power servers. It would be advantageous to offer load balancing options that insulate the external data center infrastructure from having to deal individually with a large number of IP addresses at tiers such as web serving.
Consider Fig. 20, in which one port on the fabric switch is taken and attached to an FPGA that provides a service such as IP Virtual Server (IPVS). This IP virtualization can be done at a range of network levels, including layer 4 (transport) and layer 7 (application). In many cases, it is advantageous to do the load balancing at layer 7 for a data center tier such as web serving, so that HTTP session state can be kept locally by a specific web server node. The IPVS FPGAs are attached only to gateway nodes (nodes 501a and 501b in Fig. 5).
In this example, when the fabric shown in Fig. 5 is scaled up with IPVS FPGAs on the gateway nodes, a single IP address per gateway node can be exposed. The IPVS FPGA load-balances incoming requests (e.g., HTTP requests) across the nodes in the fabric. For layer 4 load balancing, the IPVS FPGA can operate statelessly, using algorithms including round-robin across the nodes, or dispatching a maximum number of requests per node before moving to the next node. For layer 7 load balancing, the IPVS FPGA needs to keep state, so that an application session can be targeted at a specific node.
The resulting flow becomes:
- An incoming request (e.g., an HTTP request) comes into the gateway node (port 0) in Fig. 20.
- The fabric switch routing table has been configured to direct incoming traffic from port 0 to the IPVS FPGA port on the fabric switch.
- The IPVS FPGA rewrites the routing header to target a specific node in the fabric and forwards the resulting packet to the destination node (a sketch of this dispatch step follows this list).
- The destination node processes the request, and the result typically goes back out through the gateway node.
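A minimal sketch of the stateless layer-4 behavior of such an IPVS block is given below: requests arriving on the gateway port are assigned a worker node round-robin, and the routing header is rewritten to that node. The worker list and header fields are assumptions for illustration only.

```c
#include <stdint.h>
#include <stdio.h>

struct route_hdr {
    uint16_t dst_node;
    uint16_t src_node;
};

/* Hypothetical set of worker nodes in the fabric behind this gateway. */
static const uint16_t workers[] = { 4, 5, 6, 7, 12, 13 };
#define NWORKERS (sizeof workers / sizeof workers[0])

/* Stateless layer-4 balancing: pick the next worker round-robin and
 * rewrite the routing header so the fabric delivers the request there. */
static void ipvs_balance(struct route_hdr *hdr)
{
    static unsigned next;                   /* round-robin cursor */
    hdr->dst_node = workers[next];
    next = (next + 1) % NWORKERS;
}

int main(void)
{
    for (int i = 0; i < 4; i++) {
        struct route_hdr h = { 0 /* gateway */, 0 };
        ipvs_balance(&h);
        printf("request %d dispatched to node %u\n", i, h.dst_node);
    }
    return 0;
}
```

A layer-7 variant would additionally key on a session identifier so that repeated requests from the same session keep landing on the same node.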
OpenFlow/software-defined networking enabled fabric
OpenFlow is a communications protocol that gives access to the forwarding plane of a switch or router over the network. OpenFlow allows the path of network packets through the network of switches to be determined by software running on a separate server. This separation of control from forwarding allows more sophisticated traffic management than is feasible today using ACLs and routing protocols. OpenFlow is considered an implementation of the general approach of software-defined networking.
Fig. 21 illustrates how OpenFlow (or, more generally, software-defined networking (SDN)) flow processing can be built into the Calxeda fabric. Each of the gateway nodes instantiates an OpenFlow-enabled FPGA on a port of that gateway node's fabric switch. The OpenFlow FPGA needs a communication path to the control plane processor; this can be done with a separate networking port on the OpenFlow FPGA, or the communication with the control plane processor can simply be carried over another port of the fabric switch.
The resulting flow becomes:
- An incoming request comes into the gateway node (port 0) in Fig. 20.
- The fabric switch routing table is configured to direct incoming traffic from port 0 to the OpenFlow/SDN FPGA port on the fabric switch.
- The OpenFlow/SDN FPGA performs standard OpenFlow processing, including optionally contacting the control plane processor when necessary. The OpenFlow/SDN FPGA rewrites the routing header to target a specific node in the fabric (via its MAC address) and forwards the resulting packet to the destination node (a simplified sketch follows this list).
- The destination node processes the request and sends the result back through the OpenFlow FPGA, which applies any outgoing flow processing.
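To illustrate the flow processing step, the following sketch performs a simplified OpenFlow-style match-action lookup: a match on destination IP and TCP port selects an action that sets the destination node in the routing header, and a miss would be referred to the control plane processor. The flow table contents and structures are illustrative assumptions, not the OpenFlow wire protocol.

```c
#include <stdint.h>
#include <stdio.h>

/* Simplified match-action entry: match on destination IPv4 and TCP port;
 * the action is the fabric node to forward the packet to. */
struct flow_entry {
    uint32_t dst_ip;
    uint16_t dst_port;
    uint16_t out_node;
};

static const struct flow_entry flow_table[] = {
    { 0x0A000005, 80,  9 },   /* 10.0.0.5:80  -> node 9  */
    { 0x0A000006, 443, 14 },  /* 10.0.0.6:443 -> node 14 */
};

/* Returns 1 on a table hit and fills *node; on a miss the packet would be
 * referred to the control plane processor. */
static int flow_lookup(uint32_t dst_ip, uint16_t dst_port, uint16_t *node)
{
    for (unsigned i = 0; i < sizeof flow_table / sizeof flow_table[0]; i++) {
        if (flow_table[i].dst_ip == dst_ip && flow_table[i].dst_port == dst_port) {
            *node = flow_table[i].out_node;
            return 1;
        }
    }
    return 0;
}

int main(void)
{
    uint16_t node;
    if (flow_lookup(0x0A000005, 80, &node))
        printf("flow matched, routing header rewritten to node %u\n", node);
    else
        printf("flow miss, consult the control plane processor\n");
    return 0;
}
```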
Integrating the power-optimized fabric with standard processors via PCIe
The power-optimized server fabric shown in Fig. 5 and described previously provides compelling advantages for existing standard processors and can be integrated with existing processors as a chip integration solution. Standard desktop and server processors typically support PCIe interfaces, either directly or via a chipset. Fig. 22 illustrates an example of integrating the power-optimized fabric switch with an existing processor via PCIe. Item 22a shows a standard processor that supports one or more PCIe interfaces, either directly or via a chipset. Item 22b shows the disclosed fabric switch, with its integrated Ethernet MAC controllers, having PCIe interfaces integrated into it. Item 22b would typically be realized as an FPGA or ASIC integrating the PCIe-integrated fabric switch.
In this disclosure, the nodes shown in Fig. 5 can be a heterogeneous mix of power-optimized server SOCs with integrated fabric switches and the disclosed PCIe-connected standard processors integrated with PCIe interface modules that contain the Ethernet MACs and the fabric switch.
Integrating the power-optimized fabric with standard processors via Ethernet
The power-optimized server fabric shown in Fig. 5 and described previously provides compelling advantages for existing standard processors and can be integrated with existing processors as a chip integration solution. Standard desktop and server processors typically provide Ethernet interfaces via a chipset, or potentially within the SOC. Fig. 23 illustrates an example of integrating the power-optimized fabric switch with an existing processor via Ethernet. Item 23a shows a standard processor that supports Ethernet interfaces via the SOC or via a chipset. Item 23b shows the disclosed fabric switch without the integrated internal Ethernet MAC controllers. Item 23b would typically be realized as an FPGA or ASIC integrating the fabric switch.
In this disclosure, the nodes shown in Fig. 5 can be a heterogeneous mix of power-optimized server SOCs with integrated fabric switches and Ethernet-connected standard processors integrated, in the manner of this disclosure, with integrated fabric switches realized as FPGAs or ASICs.
While the foregoing has been with reference to particular embodiments of the invention, it will be appreciated by those skilled in the art that changes to these embodiments may be made without departing from the principles and spirit of the disclosure, the scope of which is defined by the appended claims.