CN208013975U - The hardware device of on-line intelligence ability platform - Google Patents


Info

Publication number
CN208013975U
CN208013975U (application CN201820583759.0U)
Authority
CN
China
Prior art keywords
servers
gpu
server
storage server
hardware device
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201820583759.0U
Other languages
Chinese (zh)
Inventor
李宇歌 (Li Yuge)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SUZHOU CHAOJI INFORMATION TECHNOLOGY CO LTD
Original Assignee
SUZHOU CHAOJI INFORMATION TECHNOLOGY CO LTD
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SUZHOU CHAOJI INFORMATION TECHNOLOGY CO LTD filed Critical SUZHOU CHAOJI INFORMATION TECHNOLOGY CO LTD
Priority to CN201820583759.0U priority Critical patent/CN208013975U/en
Application granted granted Critical
Publication of CN208013975U publication Critical patent/CN208013975U/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The utility model relates to a hardware device of an online intelligence platform. A storage server is connected to the Ethernet through an external access network; the storage server, P4 GPU servers, P100 GPU servers, and a management server are connected to an InfiniBand switch through a compute network and to an Ethernet switch through a management network; a workgroup and the machine-room control centre are connected to the Ethernet switch through an internal access network. Through NVIDIA RDMA and GPUDirect technologies, GPU memory can be shared across nodes in a physical sense over the InfiniBand network. When a large volume of data is computed and analyzed, multiple servers and multiple GPUs are called on to complete the work jointly; when several types of data must be analyzed, the tasks are distributed to different servers, so that multiple models are trained concurrently.

Description

Hardware device of an online intelligence platform
Technical field
The utility model relates to a hardware device of an online intelligence platform.
Background technology
The patterns a neural network can recognize are numerical, so real-world data such as images, sound, text, and time series must first be converted into numerical form. In a deep-learning network, each layer of nodes learns to recognize a specific set of features based on the output of the previous layer. As the depth of the network increases, the features a node can recognize become increasingly complex, because each layer merges and recombines the features of the layer before it.
Utility model content
The purpose of the utility model is to overcome the shortcomings of the prior art and provide a hardware device of an online intelligence platform.
This purpose is achieved through the following technical solution:
The hardware device of the online intelligence platform is characterized by comprising a storage server, P4 GPU servers, P100 GPU servers, a management server, an InfiniBand switch, and an Ethernet switch. The storage server is connected to the Ethernet through an external access network; the storage server, the P4 GPU servers, the P100 GPU servers, and the management server are connected to the InfiniBand switch through a compute network and to the Ethernet switch through a management network; and the workgroup and the machine-room control centre are connected to the Ethernet switch through an internal access network.
Further, in the hardware device of the online intelligence platform above, the management server is an XG-22302EN server.
Further, in the hardware device above, the P4 GPU server is a PSC-HB1X 4U tower/rack convertible server.
Further, in the hardware device above, the P100 GPU server is an XG-48201GK server.
Further, in the hardware device above, the storage server is an XG-42301ST storage server.
Further, in the hardware device above, the InfiniBand switch is an SX6506 108-port InfiniBand switch.
Further, in the hardware device above, the Ethernet switch is a 24-port gigabit switch.
Compared with the prior art, the utility model has significant advantages and beneficial effects, embodied in the following aspects:
Through NVIDIA RDMA and GPUDirect technologies, the P4 GPU server nodes and P100 GPU server nodes of the hardware device can share GPU memory across nodes in a physical sense over the InfiniBand network. When a large volume of data is computed and analyzed, multiple servers and multiple GPUs can be called on to complete the task jointly; when several types of data must be analyzed, the tasks can be distributed to different servers according to the characteristics of each data type, so that multiple models are trained concurrently. The training flow is as follows: a worker initiates a training job to the management node; the management node requests compute resources; after the GPU cluster receives the job, it reads the training data from the storage server and trains locally; on completion, the results are written back to the storage server and reported to the management server node, which notifies the worker that the training job is complete.
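The training flow just described can be sketched in code. The sketch below is illustrative only and is not part of the utility model; all class and method names (`StorageServer`, `GpuCluster`, `ManagementNode`, `run_job`) are hypothetical, and the "training" step is a trivial placeholder.

```python
# Illustrative sketch of the training flow: a worker submits a job to the
# management node; the GPU cluster reads training data from the storage
# server, trains locally, writes results back, and the worker is notified.

class StorageServer:
    """Holds training data and receives results (storage server in Fig. 1)."""
    def __init__(self):
        self.data = {"dataset": [1, 2, 3]}
        self.results = {}

    def read(self, key):
        return self.data[key]

    def write(self, key, value):
        self.results[key] = value


class GpuCluster:
    """Stands in for the P4/P100 GPU nodes; 'training' is a placeholder sum."""
    def train(self, dataset):
        return sum(dataset)


class ManagementNode:
    """Allocates resources and relays status back to the worker."""
    def __init__(self, storage, cluster):
        self.storage = storage
        self.cluster = cluster

    def run_job(self, job_name, dataset_key):
        dataset = self.storage.read(dataset_key)  # GPU nodes read from storage
        model = self.cluster.train(dataset)       # local training on GPU nodes
        self.storage.write(job_name, model)       # results written back
        return f"{job_name}: training complete"   # worker notified


storage = StorageServer()
node = ManagementNode(storage, GpuCluster())
status = node.run_job("job-1", "dataset")
print(status)  # job-1: training complete
```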
Description of the drawings
Fig. 1: architecture diagram of the utility model.
The reference numerals in the figure have the following meanings: 2 — storage server; 3 — P100 GPU server; 4 — InfiniBand switch; 5 — management server; 6 — P4 GPU server; 7 — Ethernet switch; 8 — machine-room control centre; 9 — workgroup.
Specific embodiments
For a clearer understanding of the technical features, objectives, and effects of the utility model, the specific embodiments are now described in detail.
As shown in Fig. 1, the hardware device of the online intelligence platform includes a storage server 2, P4 GPU servers 6, P100 GPU servers 3, a management server 5, an InfiniBand switch 4, and an Ethernet switch 7. The storage server 2 is connected to the Ethernet through an external access network. The storage server 2, P4 GPU servers 6, P100 GPU servers 3, and management server 5 are connected to the InfiniBand switch 4 through the compute network (the data-transmission network) and to the Ethernet switch 7 through the management network. The workgroup 9 and the machine-room control centre 8 are connected to the Ethernet switch 7 through the internal access network.
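The wiring of Fig. 1 can be restated as a declarative table, one entry per network. This is only an illustrative summary of the topology; the labels are not identifiers from the patent.

```python
# The four networks of Fig. 1 and the nodes attached to each (illustrative).
topology = {
    "external_access": {"switch": "Ethernet",          "nodes": ["storage"]},
    "compute":         {"switch": "SX6506 InfiniBand", "nodes": ["storage", "p4_gpu", "p100_gpu", "management"]},
    "management":      {"switch": "24-port gigabit",   "nodes": ["storage", "p4_gpu", "p100_gpu", "management"]},
    "internal_access": {"switch": "24-port gigabit",   "nodes": ["workgroup", "control_centre"]},
}

# Consistency check: every server sits on both the compute network and the
# management network, as the description requires.
servers = {"storage", "p4_gpu", "p100_gpu", "management"}
for net in ("compute", "management"):
    assert servers <= set(topology[net]["nodes"])
print("topology consistent")  # topology consistent
```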
The management server 5 is an XG-22302EN server: a 2U chassis with 8 hot-plug 3.5-inch drive bays, supporting dual Intel Xeon E5-2600 v4 series CPUs, 16 DIMM slots, 3 PCI-E 3.0 x8 slots, and 3 PCI-E 3.0 x16 slots. To meet expansion needs, 2 enterprise-grade 480 GB SSDs in RAID1 serve as the system and scratch disks, and 4 enterprise-grade 4 TB HDDs in RAID10 serve as data disks, protected by an LSI 9271 RAID card with capacitor-backed cache, ensuring data safety and excellent storage performance. For networking, it integrates a dual-port 1 Gb NIC and a 56 Gb InfiniBand NIC. A 740 W 1+1 redundant power supply guarantees electrical stability over long-term operation and thereby the safety of the data.
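A back-of-envelope check of the usable capacity implied by the RAID levels above (ignoring filesystem overhead): RAID1 mirrors two disks, while RAID10 stripes over mirrored pairs.

```python
# Usable capacity of the management server's disk arrays (rough check).
def raid1_usable(disk_size):
    # Both disks hold the same mirrored copy, so one disk's worth is usable.
    return disk_size

def raid10_usable(n_disks, disk_size):
    # Striping over mirrored pairs keeps half the raw capacity.
    return n_disks * disk_size // 2

ssd_gb = raid1_usable(480)    # 2 x 480 GB SSD in RAID1
hdd_tb = raid10_usable(4, 4)  # 4 x 4 TB HDD in RAID10
print(ssd_gb, hdd_tb)  # 480 8
```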
The P4 GPU server 6 is a PSC-HB1X 4U tower/rack convertible server. As one part of the compute core of the cluster, it carries large-scale concurrency and provides substantial linear computing capability. It is a high-performance computing server with dedicated thermal design and supports the mainstream GPU processors on the market; it runs stably, with a fully redundant server-grade design, a tower/rack convertible form factor, and ample expansion room. The PSC-HB1X can hold 4 GPU cards and is fitted with NVIDIA Tesla P4 high-performance computing cards, providing single-precision computation. It uses dual E5-2650 v4 CPUs, and each compute node is configured with 256 GB of memory; networking integrates a dual-port 1 Gb NIC and a 56 Gb InfiniBand NIC. For storage it is configured with 3 enterprise-grade 8 TB mechanical hard disks in RAID5, on an LSI 9271 RAID card with capacitor-backed cache. The power supply is a 2000 W 1+1 redundant 80 PLUS Platinum unit.
The P100 GPU server 3 is an XG-48201GK server, the other part of the compute core of the cluster, providing double-precision computing capability. It can hold 8 P100 high-performance computing cards in a 4U space for high-density GPU deployment. It uses E5-2650 v4 CPUs, with each compute node configured with 256 GB of memory; networking integrates a dual-port 1 Gb NIC and a 56 Gb InfiniBand NIC. The power supply is a 1600 W 2+2 redundant unit with flexible power-mode adjustment.
The storage server 2 is an XG-42301ST storage server, the key component for storing the cluster's data: every data read by a GPU node passes through this node. It holds 24 3.5-inch hard disks in a 4U space; it uses dual E5-2620 v4 CPUs with 64 GB of memory, and networking integrates a dual-port 1 Gb NIC and a 56 Gb InfiniBand NIC. Each hard disk holds up to 8 TB, and a RAID50 disk array gives better protection for the data.
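The description specifies 24 disks of 8 TB in RAID50 but not the RAID5 group layout, so the usable capacity depends on an assumed group count (RAID50 loses one disk of parity per RAID5 group):

```python
# Usable capacity of a RAID50 array: one parity disk per RAID5 group.
def raid50_usable_tb(n_disks, disk_tb, groups):
    assert n_disks % groups == 0, "groups must divide the disk count evenly"
    return (n_disks - groups) * disk_tb

# e.g. two RAID5 groups of 12 disks each (an assumed layout, not from the patent):
print(raid50_usable_tb(24, 8, groups=2))  # 176
```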
The InfiniBand switch 4 is an SX6506 108-port InfiniBand switch. The Mellanox SX6506 switch system provides a peak-performance networking solution: in a 6U space it delivers up to 12.1 Tb/s of non-blocking bandwidth with port-to-port latency between 170 ns and 510 ns. Built on Mellanox's sixth-generation SwitchX-2 silicon, the SX6506 has 108 ports, each providing 56 Gb/s of full bidirectional bandwidth. Its switching capacity grows with the number of cluster nodes, so it scales on demand, offering a cost-effective interconnect for medium to very large clusters together with the availability and reliability of a core-class switch. In addition, its spine and leaf blades, management modules, power supplies, and fans are hot-swappable, which helps to minimize downtime, and the switch's built-in subnet manager enables out-of-the-box operation for networks of up to 648 nodes.
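The quoted figures are mutually consistent: 108 ports at 56 Gb/s each, counted bidirectionally, give the claimed 12.1 Tb/s aggregate.

```python
# Aggregate switch bandwidth: 108 ports x 56 Gb/s, full duplex.
ports, per_port_gbps = 108, 56
aggregate_tbps = ports * per_port_gbps * 2 / 1000  # x2: both directions
print(round(aggregate_tbps, 1))  # 12.1
```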
The Ethernet switch 7 is a 24-port gigabit switch that connects all nodes and serves as the management switch.
It should be noted that when building a high-performance computing cluster, the rational distribution of tasks must be considered: all compute tasks are controlled by the master server, which manages the sessions between the management node and the compute nodes. In fields such as deep learning, because the data involved are huge and complex, compute nodes must be called in different orders of magnitude depending on the demands of the data, and different types of data also call for different compute-node environments.
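The dispatch policy just described — node type chosen by data type, node count scaled with data volume — might be sketched as follows. The routing table and the sizing rule are illustrative assumptions, not part of the utility model.

```python
# Hypothetical master-server dispatcher: route by data type, size by volume.
ROUTE = {"image": "p100_gpu", "video": "p100_gpu", "text": "p4_gpu"}

def dispatch(task_type, data_gb, gb_per_node=64):
    """Return (node class, node count) for a task; the 64 GB/node rule is assumed."""
    nodes = max(1, -(-data_gb // gb_per_node))  # ceiling division
    return ROUTE.get(task_type, "p4_gpu"), nodes

print(dispatch("image", 200))  # ('p100_gpu', 4)
print(dispatch("text", 10))    # ('p4_gpu', 1)
```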
As noted, through NVIDIA RDMA and GPUDirect technologies, the P4 GPU server nodes and P100 GPU server nodes of the above hardware device can share GPU memory across nodes in a physical sense over the InfiniBand network. When a large volume of data is computed and analyzed, multiple servers and multiple GPUs can be called on to complete the task jointly; when several types of data must be analyzed, the tasks can be distributed to different servers according to the characteristics of each data type, so that multiple models are trained concurrently. The training flow is as follows: a worker initiates a training job to the management node; the management node requests compute resources; after the GPU cluster receives the job, it reads the training data from the storage server and trains locally; on completion, the results are written back to the storage server and reported to the management server node, which notifies the worker that the training job is complete.
It should be noted that the above description is merely a preferred embodiment of the utility model and does not limit its scope of protection. At the same time, the description should enable persons skilled in the relevant technical field to implement it, and any equivalent change or modification completed without departing from the spirit disclosed by the utility model shall therefore be covered by the claims.

Claims (7)

1. A hardware device of an online intelligence platform, characterized by comprising: a storage server, P4 GPU servers, P100 GPU servers, a management server, an InfiniBand switch, and an Ethernet switch, wherein the storage server is connected to the Ethernet through an external access network; the storage server, the P4 GPU servers, the P100 GPU servers, and the management server are connected to the InfiniBand switch through a compute network and to the Ethernet switch through a management network; and a workgroup and a machine-room control centre are connected to the Ethernet switch through an internal access network.
2. The hardware device of the online intelligence platform according to claim 1, characterized in that the management server is an XG-22302EN server.
3. The hardware device of the online intelligence platform according to claim 1, characterized in that the P4 GPU server is a PSC-HB1X 4U tower/rack convertible server.
4. The hardware device of the online intelligence platform according to claim 1, characterized in that the P100 GPU server is an XG-48201GK server.
5. The hardware device of the online intelligence platform according to claim 1, characterized in that the storage server is an XG-42301ST storage server.
6. The hardware device of the online intelligence platform according to claim 1, characterized in that the InfiniBand switch is an SX6506 108-port InfiniBand switch.
7. The hardware device of the online intelligence platform according to claim 1, characterized in that the Ethernet switch is a 24-port gigabit switch.
CN201820583759.0U 2018-04-23 2018-04-23 The hardware device of on-line intelligence ability platform Active CN208013975U (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201820583759.0U CN208013975U (en) 2018-04-23 2018-04-23 The hardware device of on-line intelligence ability platform


Publications (1)

Publication Number Publication Date
CN208013975U true CN208013975U (en) 2018-10-26

Family

ID=63893366

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201820583759.0U Active CN208013975U (en) 2018-04-23 2018-04-23 The hardware device of on-line intelligence ability platform

Country Status (1)

Country Link
CN (1) CN208013975U (en)


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109062929A (en) * 2018-06-11 2018-12-21 上海交通大学 A kind of query task communication means and system
CN109062929B (en) * 2018-06-11 2020-11-06 上海交通大学 Query task communication method and system
WO2020199560A1 (en) * 2019-04-03 2020-10-08 华为技术有限公司 Ai training network and method
WO2021063026A1 (en) * 2019-09-30 2021-04-08 华为技术有限公司 Inference service networking method and apparatus
CN113315794A (en) * 2020-02-26 2021-08-27 宝山钢铁股份有限公司 Hardware architecture of computing system network for online intelligent analysis of blast furnace production

Similar Documents

Publication Publication Date Title
CN208013975U (en) The hardware device of on-line intelligence ability platform
CN104917843B (en) Cloud storage and medical image seamless interfacing system
CN104135514B (en) Fusion type virtual storage system
CN102404201B (en) Method of realizing maximum bandwidth of Lustre concurrent file system
Shipman et al. The spider center wide file system: From concept to reality
CN102625608A (en) Design method for large-scale multiple-node server cabinets
US11102907B2 (en) Serviceability of a networking device with orthogonal switch bars
CN104951024B (en) A kind of large data all-in-one machine based on electric power application
CN105159617A (en) Pooled storage system framework
CN106919533B (en) 4U high-density storage type server
CN207764844U (en) A kind of data processing system
CN106814976A (en) Cluster storage system and apply its data interactive method
US11055252B1 (en) Modular hardware acceleration device
CN206649427U (en) A kind of server architecture for including dual control storage system
CN103677097B (en) Server rack system and server
CN107729200A (en) The method of testing and relevant apparatus of a kind of performance of storage system
CN108090011A (en) A kind of SAS Switch controllers extension framework and design method
CN206649421U (en) A kind of all-in-one machine structure
CN102799708B (en) Graphic processing unit (GPU) high-performance calculation platform device applied to electromagnetic simulation
CN205015812U (en) Big data all -in -one and rack based on electric power is used
CN204965251U (en) All -in -one device based on power equipment monitoring
CN106528463A (en) Four-subnode star server system capable of realizing hard disk sharing
CN206649424U (en) A kind of VHD green node server
CN206649422U (en) A kind of central processing unit hot-plug construction
CN113741642A (en) High-density GPU server

Legal Events

Date Code Title Description
GR01 Patent grant