CN112804297B - Assembled distributed computing and storage system and construction method thereof


Info

Publication number
CN112804297B
CN112804297B (application CN202011599244.8A)
Authority
CN
China
Prior art keywords: storage, data, pool, computing, memory
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011599244.8A
Other languages
Chinese (zh)
Other versions
CN112804297A (en)
Inventor
曾令仿
银燕龙
何水兵
杨弢
汤昭荣
任祖杰
陈刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab filed Critical Zhejiang Lab
Priority to CN202011599244.8A priority Critical patent/CN112804297B/en
Publication of CN112804297A publication Critical patent/CN112804297A/en
Application granted granted Critical
Publication of CN112804297B publication Critical patent/CN112804297B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 - Network arrangements or protocols for supporting network services or applications
    • H04L67/01 - Protocols
    • H04L67/10 - Protocols in which an application is distributed across nodes in the network
    • H04L67/1097 - Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 - Network arrangements or protocols for supporting network services or applications
    • H04L67/01 - Protocols
    • H04L67/10 - Protocols in which an application is distributed across nodes in the network
    • H04L67/104 - Peer-to-peer [P2P] networks
    • H04L67/1074 - Peer-to-peer [P2P] networks for supporting data block transmission mechanisms

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an assemblable distributed computing and storage system and a method for constructing it. The system comprises one or more domain servers providing computing or storage services, and a network switching unit responsible for connecting the domain servers into a distributed computing and storage system. Each domain server includes: an object processing unit, built from the thread groups of a multi-core processor, which is responsible for network connections within the domain server and provides management control and data processing programmable in a high-level language; computing units that provide computing power; memory units of dynamic random access memory; persistent memory units of nonvolatile memory; and storage units that provide persistent storage. Multiple computing units, memory units, nonvolatile memory units and storage units are connected through the network switching unit to form, respectively, a computing pool, a memory pool, a nonvolatile memory pool and a storage pool; one or more domain servers are connected through the network switching unit to form the distributed computing and storage system.

Description

Assembled distributed computing and storage system and construction method thereof
Technical Field
The present invention relates to the field of distributed computing, and more particularly, to an assemblable distributed computing and storage system and a method of constructing the same.
Background
One reason why current large-scale distributed systems store and move data inefficiently, under-utilize resources, and are routinely over-provisioned, driving the total cost of ownership of the data infrastructure (construction cost plus operation and maintenance cost) to extreme levels, is that when system resources are disaggregated and pooled, a communication architecture built on traditional networking offers low performance and high latency during data storage and movement.
In existing schemes, protocol processing in the network is offloaded to smart network cards (SmartNICs) as hardware acceleration in order to reduce interference with the CPU. From a system perspective, however, although SmartNICs occupy a key position on the network-protocol acceleration data path, they are only loosely coupled with the system's other resources (such as solid-state disks, hard disks, and the various processors or accelerators), so existing disaggregation schemes built on SmartNICs are inefficient and have limited effect on optimizing the performance of the system's whole data path. FPGAs provide a flexible way to implement function-specific hardware acceleration and are well suited to prototyping and validating hardware designs, but their software programmability is poor (it is hard to build an ecosystem around the FPGA). Hyper-convergence connects the relevant resources directly to a CPU at high speed over PCIe, but its drawback is that the data becomes attached to a specific CPU: other hyper-converged CPUs can access that data only inefficiently, and because the same data may be accessed by different workloads at the same time, pinning data inside one hyper-converged server is likewise inefficient.
Disclosure of Invention
To overcome the shortcomings of the prior art and to improve resource utilization while reducing the total cost of ownership of the data infrastructure, the invention adopts the following technical scheme:
an assemblable distributed computing and storage system comprising: the domain server comprises an object processing unit, a computing unit, a memory unit, a persistent memory unit and a storage unit, wherein the computing unit, the memory unit, the persistent memory unit and the storage unit are respectively connected with the object processing unit; the user can adopt high-level language programming according to the requirements of the specific field, and provides data infrastructure services in a library form, so that the calculation acceleration of the specific field is formed; the high-level language programming is different from the 'programming' of the FPGA level, is realized by referring to a 'software defined network', and has more programmability; a plurality of computing units are connected through the network switching unit to form a computing pool; a plurality of memory units are connected through the network switching unit to form a memory pool; a plurality of nonvolatile memory units are connected through the network switching unit to form a nonvolatile memory pool; a plurality of storage units are connected through the network switching unit to form a storage pool;
the object processing unit includes: the domain server comprises more than two processor thread groups, a network interface, a storage interface, a memory interface, a PCIe interface, a network on chip and a resource scheduler, wherein each processor thread group consists of multiple cores, the same processor thread group is from all hardware threads of the same multi-core processor, more than one processor thread group is used for managing and controlling the domain server to form a control plane, and more than one processor thread group is used for processing or accelerating data of data infrastructure services of the domain server to form a data plane of the domain server; the network interface is used for connecting with other domain servers or network switching units; the storage interface is used for connecting a storage unit; the memory interface is used for connecting a memory unit or a nonvolatile memory unit; the PCIe interface is used for high-speed serial point-to-point double-channel high-bandwidth transmission; the network on chip is used for the object processing unit to perform communication on chip; the resource scheduler is used for scheduling shared resources in the object processing unit facing multithreading;
the control plane comprises a control plane OS and a service agent, and the data plane comprises a data plane OS and a library supporting data infrastructure services under the support of the data plane OS; in the control plane, operating a control plane OS, supporting a service agent by the control plane OS, and providing an application programming interface for the service agent to receive a control command; in the data plane, a data plane OS is run, and various libraries supporting data infrastructure services are supported by the data plane OS to provide various data infrastructure services.
Further, the control plane may share the data-plane OS with the data plane. The data-plane OS then provides a control-plane software stack executing on top of it, comprising a virtual machine monitor, a multitasking control-plane OS executing on the virtual machine monitor, and one or more service agents executing on that control-plane OS; the virtual machine monitor isolates the control-plane OS from the functions executing on the data-plane OS. The virtual machine monitor lets multiple operating systems and applications share the hardware, a virtualization technique that makes it convenient to run assorted control software on the object processing unit. Relying on the shared data-plane OS, the control functions run in virtual machine form under the monitor's support, and virtualization keeps the control functions and the data-processing functions isolated from each other. This isolation ensures that upgrading the data-plane OS does not affect the control-plane software (reducing agent development cost) and strengthens containment of untrusted or unsafe code (easing migration), which benefits security, reliability and even load balancing.
Further, the service agents comprise application-level software configured to perform configuration and offloading of software structures. A service agent is controlled via the application programming interface and in turn configures or offloads the data infrastructure services on the data plane, supporting the various data-processing functions running on the data-plane OS.
Further, the object processing unit is a highly programmable I/O processor having a plurality of processing cores, implementing data infrastructure services through a data plane development framework.
Further, the control plane comprises a control plane OS executing in more than one processing core, the control plane OS comprising Linux, Unix, or a proprietary operating system.
Further, the library supporting data infrastructure services includes a storage function, a security function, a network function, and an analysis function.
Further, the computing units form various computing pools, including:
a central processing unit (CPU) computing pool, capable of independently handling complex logical operations and varied data types;
a graphics processing unit (GPU) computing pool, suited to processing large volumes of uniform data;
a tensor processing unit (TPU) computing pool, dedicated to accelerating deep neural network operations;
a neural network processing unit (NPU) computing pool that integrates storage with processing, breaking with the classical von Neumann architecture. Computing resources of both von Neumann and non-von Neumann architectures can thus be pooled as computing resources, letting the various heterogeneous chips play to their strengths in their respective fields and raising overall system efficiency.
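As a rough illustration of this pooling idea (the workload classes and the mapping below are assumptions for illustration, not taken from the patent), a dispatcher might route each workload to the pool whose processor type suits it:

```python
from collections import defaultdict

POOL_AFFINITY = {                    # workload class -> preferred pool
    "branching-logic": "CPU",        # complex control flow, mixed data types
    "bulk-uniform": "GPU",           # large volumes of uniform data
    "dnn-training": "TPU",           # deep neural network operations
    "in-memory-ai": "NPU",           # storage-with-processing workloads
}

def dispatch(workloads):
    assignments = defaultdict(list)
    for name, kind in workloads:
        pool = POOL_AFFINITY.get(kind, "CPU")   # fall back to the CPU pool
        assignments[pool].append(name)
    return dict(assignments)

jobs = [("etl-join", "branching-logic"), ("video-filter", "bulk-uniform"),
        ("resnet-train", "dnn-training")]
print(dispatch(jobs))
# {'CPU': ['etl-join'], 'GPU': ['video-filter'], 'TPU': ['resnet-train']}
```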
A method of constructing an assemblable distributed computing and storage system, comprising the steps of:
s1, when only one domain server exists, the domain server is used for constructing a calculation and storage system to form a super-fusion unit, a plurality of calculation units, a plurality of memory units, a plurality of persistent memory units and a plurality of storage units are respectively connected through an object processing unit to form a local calculation pool, a local memory pool, a local persistent memory pool and a local storage pool, and the step S3 is carried out; the method has the advantages that the calculation, storage, network and network functions (safety and optimization) are deeply merged into one standard server, the service requirements of storage, calculation, memory, nonvolatile memory and the like can be simultaneously expanded by only one domain server, the domain server is in a super-merging unit in a distributed system, and the super-merging has the advantages that:
(1) Purchase as needed: existing investment is protected, and no one-time large-scale purchase is required;
(2) Rapid delivery: existing hyper-convergence schemes generally support rapid deployment, shortening delivery time;
(3) Simplified management: existing hyper-convergence schemes generally manage computing, storage, virtualization and other resources in a unified way, simplifying operation and maintenance;
(4) Elastic expansion: hyper-converged architectures are generally distributed, making linear expansion convenient and supporting capabilities such as no single point of failure, local backup, and intra-city or remote disaster recovery.
S2, when there are two or more domain servers, the domain servers are connected through the network switching unit to form a distributed system. All computing units in the distributed system form a computing pool, all memory units form a memory pool, all persistent memory units form a persistent memory pool, and all storage units form a storage pool; these pools exist in resource-disaggregated form. The step comprises:
s21, if the storage server needs to be expanded, more than one domain server is used as the storage server, so that the storage units of different nodes form a storage pool;
s22, if the computation server needs to be expanded, using more than one domain server as computation servers to make the computation units of different nodes form computation pools;
s23, if the memory server needs to be expanded, using more than one domain server as the memory server to make the memory units of different nodes form a memory pool;
s24, if the non-volatile memory server needs to be expanded, using more than one domain server as the non-volatile memory server to make the non-volatile memory units of different nodes form a non-volatile memory pool;
s3, more than one processor thread group of the domain server forms a control plane, manages and controls the domain server;
s4, more than one processor thread group of the domain server forms a data plane, and processes or accelerates data of data infrastructure service;
s5, the flow ends.
Hyper-convergence also has drawbacks, which the invention overcomes through disaggregation:
(1) Information silos: hyper-convergence schemes generally do not support the pre-existing external storage in a data center, so information silos form, bringing low resource-utilization efficiency, difficult unified management, and similar problems.
In the invention: the pre-existing external storage is attached and pooled over a high-performance network in the form of resource disaggregation, overcoming the hyper-converged information-silo defect.
(2) Performance consistency: storage performance in a data center is crucial and is expected to be predictable and consistent, especially for indicators critical to core business systems such as latency, IOPS and bandwidth. In a hyper-converged architecture, computing and storage contend for physical resources such as CPU, memory and network and depend on each other; the essence of the problem is that the various data infrastructure services themselves need computing capacity, so computing, storage and networking interfere with one another.
In the invention: the advantage of disaggregation applies at the same time. Because remote computing or storage resources are attached and pooled on demand over a high-performance network, the expected performance can be guaranteed (predicted), solving the performance-consistency problem of hyper-convergence.
(3) Horizontal expansion: a key feature of hyper-converged architectures is easy expansion, with minimal deployment and capacity added on demand. In hyper-convergence, however, computing capacity, storage performance and capacity are generally expanded in lock-step, which cannot satisfy the real-world need to expand a single capability; some vendors also impose a minimum expansion unit, limiting flexibility.
In the invention: the advantage of disaggregation applies at the same time. Computing or storage is attached through the object processing unit (OPU): if only storage is needed, storage devices are attached; if only computing is needed, computing devices are attached; both may be attached at once. Since OPUs are interconnected over a high-performance network, expansion is very flexible.
(4) Data-processing functions: most current hyper-converged and software-defined storage (SDS) systems provide various data-processing functions, such as data redundancy, thin provisioning, snapshots and clones, and some even provide deduplication, data encryption and data compression, but these functions often degrade the performance of the computing devices.
In the invention: these data-processing tasks are offloaded to the object processing unit (OPU), so they do not interfere with the computing devices or drag down system performance.
A vital pillar of any disaggregation scheme is that remote resources must be attached and pooled over a high-performance network. Existing SmartNIC-based disaggregation schemes can partially achieve the effect this invention seeks by offloading the network protocol (or functions such as data compression/decompression) to the smart network card, but a SmartNIC avoids interfering with computing-device performance only by offloading part of the network I/O functions; it is confined to the network (protocol) level.
In the invention: the object processing unit (OPU) offloads far more data-processing functions, spanning storage, security, network and analysis, whereas a smart network card can only start from the network level. This is a higher-level abstraction and frees the computing resources to handle more of the computation each is good at.
In the traditional PCIe attachment model, the (storage) devices inside a single physical node are attached to that node's CPU, and the (hyper-converged) physical node can use its internally attached devices efficiently through virtualization (virtual machines deployed on that node). The limitation is that a virtual machine on one node cannot efficiently use, as if local, the (storage) devices attached to another hyper-converged node; that is, from the perspective of any one physical node, current hyper-converged schemes do not apply disaggregation between local and remote devices to achieve efficiency, which is one of the motivations for the disaggregation scheme proposed here.
Distributed expansion can proceed through either hyper-convergence or disaggregation. Hyper-converged expansion adds a new physical node containing all computing, storage and network resources; because the newly added resources must support the virtual machines (carrying services) running on that node, hyper-convergence must add computing, storage and network simultaneously. Disaggregated expansion also adds new physical nodes with computing, storage and network resources, but it can add resources individually, for example only DRAM, only solid-state disks, only CPUs, or only GPUs. The newly added physical node need not support service virtual machines: whatever resource is lacking is the resource added, joining the related resource pool if it exists, or creating a new one.
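The scaling difference can be sketched as follows; the functions and pool structures are illustrative assumptions, not the patent's interfaces. Hyper-convergence must add every resource type with each node, while disaggregation can grow a single pool:

```python
def expand_hyperconverged(pools, node):
    # A new hyper-converged node must carry every resource type at once.
    for kind in ("computing", "memory", "storage"):
        pools.setdefault(kind, []).extend(node[kind])

def expand_disaggregated(pools, kind, units):
    # Add only what is lacking, e.g. just DRAM or just SSDs; the pool is
    # created on first use if it does not exist yet.
    pools.setdefault(kind, []).extend(units)

pools = {"computing": ["cpu0"], "memory": ["dram0"], "storage": ["ssd0"]}
expand_disaggregated(pools, "storage", ["ssd1", "ssd2"])  # grow storage alone
print(pools["storage"])     # ['ssd0', 'ssd1', 'ssd2']
print(pools["computing"])   # ['cpu0'] (untouched)

node = {"computing": ["cpu9"], "memory": ["dram9"], "storage": ["nvme9"]}
expand_hyperconverged(pools, node)   # hyper-convergence adds all three
print(pools["computing"])            # ['cpu0', 'cpu9']
```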
Further, in step S3, the method by which the control plane manages and controls the domain server comprises the following steps:
S31, using the programmability of the object processing unit, monitoring, management or control is programmed in a high-level language;
S32, the processor thread groups programmed for monitoring, management or control form the control plane and manage and control the domain server.
Further, in step S4, the method by which the data plane processes or accelerates data for the data infrastructure services comprises the following steps:
S41, if an available processor thread group exists, the programmability of the object processing unit is used to select, in a high-level language, the data infrastructure services that need acceleration, and the processor thread group is programmed accordingly;
S42, the programmed processor thread groups form the data plane and carry out the processing or acceleration of data for the data infrastructure services.
The invention has the advantages and beneficial effects that:
the invention provides an assembled distributed persistent storage system, which unloads the traditional calculation of network, memory, storage, safety and virtualization data paths, is completed through an object processing unit OPU (operation processing unit), the object processing unit plays the role of system structure control and network layer, the efficiency of a large-scale distributed system is improved by reducing the waiting time of resources and dynamically forming the basic framework of various resource pools, the total ownership cost is reduced, and the system has the advantages of two schemes of hyper-decomposition (hyper-segregation) and hyper-convergence (hyper-convergence).
Resources such as calculation, storage and the like are efficiently decomposed and pooled, so that the resource utilization rate can be improved to the maximum extent; by approaching different work load demands and rapidly deploying the data infrastructure according to needs, the total cost of ownership of the data infrastructure can be reduced.
Drawings
FIG. 1 is a schematic diagram of the architecture of the assemblable distributed computing and storage system of the present invention.
Fig. 2 is a schematic diagram of the structure of an object processing unit in the present invention.
FIG. 3a is a functional diagram of the control plane and data plane of the object processing unit according to the present invention.
FIG. 3b is a functional diagram of the control plane and the data plane under the common data plane OS in the present invention.
FIG. 4 is a workflow diagram of a method of construction of an assemblable distributed computing and storage system of the present invention.
Fig. 5 is a flowchart of the method by which the control plane manages and controls the domain server in the system construction method of the present invention.
FIG. 6 is a flowchart of the method by which the data plane processes or accelerates data for the data infrastructure services in the system construction method of the present invention.
Detailed Description
The following describes in detail embodiments of the present invention with reference to the drawings. It should be understood that the detailed description and specific examples, while indicating the present invention, are given by way of illustration and explanation only, not limitation.
As shown in fig. 1, an assemblable distributed computing and storage system comprises two or more domain servers for computing services or storage services and at least one network switching unit (NSU) responsible for connecting the domain servers to form a distributed computing and storage system.
A domain server, comprising:
an Object Processing Unit (OPU) responsible for network connection within the domain server and capable of providing data processing capability through high level language programming, and collectively classifying CPU, GPU, TPU, FPGA, etc. as components providing computing capability, the Object Processing Unit (OPU) is introduced to offload work affecting the computing components, such as encryption/decryption, data compression, data deduplication, network protocol parsing, etc. (some may affect the CPU through an interrupt, and some may consume the computing capability of the computing components for computing), for example: unloading data infrastructure services (including a storage function, a security function, a network function, an analysis function and the like, specifically, encryption and decryption, data reduction, Hash calculation, Erasure Code (EC coding), disk array RAID construction, deep packet analysis DPI, Lookup, direct memory access DMA and the like, wherein all or part of the functions are provided in default configuration, and a user can adopt high-level language programming to realize the functions according to the requirements of a specific field, namely, the functions are provided in a 'library' form), so that the calculation acceleration in a specific field is formed; the high-level language programming is different from "programming" at the FPGA level, but refers to the relevant implementation of "software defined networking".
One or more computing units, i.e. central processing units (CPU), graphics processing units (GPU), tensor processing units (TPU), neural network processing units (NPU) or the like;
One or more memory units, i.e. the dynamic random access memory (DRAM) of the domain server;
One or more persistent memory units, i.e. the nonvolatile memory (NVM) of the domain server;
One or more storage units, providing persistent storage for the domain server, such as solid-state disks (SSD) and hard disks (HDD);
a plurality of computing units are connected through the network switching unit NSU to form a computing pool; a plurality of memory units are connected through the network switching unit NSU to form a memory pool; a plurality of nonvolatile memory units are connected through the network switching unit NSU to form a nonvolatile memory pool; a plurality of storage units are connected by the network switching unit NSU to form a storage pool.
As shown in fig. 2, the object processing unit OPU in the domain server includes:
more than two Processor Thread Groups (PTG), the processor thread groups are formed by multiple cores (multi-core), all hardware threads from 1 multi-core processor are referred to as 1 processor thread group for convenience of description, for example, a 6-core P6600 processor with (6 × 4 =) 24 hardware threads (hardware threads) using MIPS architecture Simultaneous Multithreading (SMT), if SMT =4, forming 1 processor thread group with 24 hardware threads, in each of which an OS runs; the at least one processor thread group is responsible for management and control of the domain server, and forms a control plane (control plane) of the domain server, that is, each hardware thread in the processor thread group runs an OS, and the OS supports a service agent (agent) that provides an Application Programming Interface (API) for external use to receive a control command. Wherein, more than one processor thread group is responsible for data processing or acceleration (including storage function, security function, network function and analysis function) in the data infrastructure service aspect of the domain server, and forms a data plane (data plane) of the domain server, that is, each hardware thread in each processor thread group runs an OS, and the OS supports various storage function, security function, network function and analysis function libraries to provide various data infrastructure services;
One or more network interfaces (e.g., IB interfaces) for network connection with other domain servers or network switching units;
One or more high-bandwidth memory interfaces for connecting high-bandwidth memory (HBM);
One or more memory interfaces for connecting memory or nonvolatile memory (e.g., DRAM or 3D XPoint);
One or more PCIe (peripheral component interconnect express) interfaces for high-speed serial point-to-point dual-channel high-bandwidth transmission;
An on-chip network for the object processing unit's on-chip communication;
A resource scheduler, configured to schedule the shared resources in the object processing unit across the multiple threads, e.g., the shared last-level cache (LLC), the DRAM controllers, main-memory bus bandwidth, and the prefetching hardware.
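One plausible policy for such a scheduler, sketched here with assumed names and numbers (the patent does not prescribe a policy), is to divide each shared-resource budget among the thread groups by weight:

```python
def schedule_shares(resources, weights):
    """Split each shared-resource budget among thread groups by weight."""
    total = sum(weights.values())
    return {group: {res: amount * w / total for res, amount in resources.items()}
            for group, w in weights.items()}

shared = {"llc_mib": 32, "membus_gbps": 100, "dram_ctrl_slots": 16}
weights = {"control-plane-ptg": 1, "data-plane-ptg": 3}   # favour data plane
for group, share in schedule_shares(shared, weights).items():
    print(group, share)
# data-plane-ptg receives 24 MiB of LLC, 75 GB/s and 12 controller slots
```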
As shown in fig. 3a, an Object Processing Unit (OPU) includes a control plane including a control plane OS (operating system) and various service agents (agents), and a data plane including a data plane OS and libraries supporting data infrastructure services under the support of the data plane OS.
Take the MIPS64 r6 P6600 from Wave Computing as an example: up to 6 cores can be used, and with SMT = 4 each core carries up to 4 hardware threads, so a 6-core P6600 behaves like 24 (= 6 x 4) central processing units, with an OS running in each hardware thread. For example, the control-plane OS supports the various service agents, which can use a high-level language such as P4 to program parsing, encapsulation, decapsulation, lookup, transmission/reception and the like; the data-plane OS supports the various function libraries, which are programmed in a similar way.
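The thread-group arithmetic can be made concrete with a small sketch; the identifiers are illustrative only:

```python
def make_thread_group(group_id: int, cores: int = 6, smt: int = 4):
    # One OS instance runs in each hardware thread of the group.
    return [f"ptg{group_id}-hwt{i}" for i in range(cores * smt)]

control_group = make_thread_group(0)   # runs the control-plane OS and agents
data_group = make_thread_group(1)      # runs the data-plane OS and libraries
assert len(control_group) == 6 * 4 == 24
print(control_group[0], "...", control_group[-1])   # ptg0-hwt0 ... ptg0-hwt23
```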
The object processing unit (OPU) is a highly programmable processor: the service agents of the control plane are driven through a control application programming interface (API), and those agents in turn configure or offload the working data-processing functions in the data plane.
An object processing unit (OPU) is a highly programmable I/O processor with multiple processing cores, and may implement data infrastructure services through a data plane development framework such as DPDK (Data Plane Development Kit). The control plane also includes a control-plane OS executing in one or more processing cores, which may be Linux, Unix, or a proprietary operating system.
As shown in fig. 3b, the control plane may also share a data-plane OS with the data plane. In this case the data-plane OS provides a control-plane software stack executing on the data-plane OS, which may include a virtual machine monitor such as a hypervisor, a multitasking control-plane OS executing on the virtual machine monitor, and one or more control-plane service agents executing on that control-plane OS. The virtual machine monitor may then isolate the control-plane OS from the functions executing on the data-plane OS.
The agents in the control plane represent the various application programs that realize control functions, while storage, security, network and analysis in the data plane are the application programs that realize data processing. The virtual machine monitor lets multiple operating systems and applications share the hardware; this virtualization makes it convenient for the object processing unit to run assorted control software (which may depend on complex software and hardware environments) to realize the control functions. Relying on the shared data-plane OS, the control plane realizes its control functions in virtual machine form under the monitor's support, and virtualization isolates the control functions from the data-processing functions. The advantages are: (1) upgrading the data-plane OS does not affect the control-plane software (reducing agent development cost); (2) isolation of untrusted or unsafe code is strengthened (and migration eased), benefiting security, reliability and even load balancing.
The control-plane service agents executing on the control-plane OS comprise application-level software configured to perform configuration and offloading of software structures so as to support the data infrastructure services (i.e., the various data-processing functions) running on the data-plane OS.
The libraries provided by the data-plane OS cover storage functions, security functions, network functions, analysis functions and so on. For example, the storage functions may include storage I/O data processing such as NVMe (non-volatile memory express), compression, encryption, replication, erasure coding and pooling; the security functions may include security data processing such as encryption, regular-expression processing and hashing; the network functions may include network I/O data processing with Ethernet, network protocols, encryption and firewalls; and the analysis functions may include customized data screening, data statistics and the like for data analysis.
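A registry in this spirit can be sketched as follows, with Python's zlib and hashlib standing in for real offload engines; the handler names and payloads are assumptions for illustration:

```python
import hashlib
import zlib

DATA_PLANE_LIBRARIES = {
    "storage":  {"compress": zlib.compress, "decompress": zlib.decompress},
    "security": {"hash": lambda b: hashlib.sha256(b).digest()},
    "network":  {"checksum": lambda b: zlib.crc32(b)},
    "analysis": {"count_records": lambda b: b.count(b"\n")},
}

def call_service(category, name, payload):
    return DATA_PLANE_LIBRARIES[category][name](payload)

payload = b"row1\nrow2\nrow3\n"
print(call_service("analysis", "count_records", payload))    # 3
print(call_service("security", "hash", payload).hex()[:16])  # digest prefix
```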
The computing units form various computing pools, including:
a central processing unit (CPU) computing pool, capable of independently handling complex logical operations and varied data types;
a graphics processing unit (GPU) computing pool, suited to processing large volumes of uniform data;
a tensor processing unit (TPU) computing pool, dedicated to accelerating deep neural network operations;
a neural network processing unit (NPU) computing pool that integrates storage with processing, breaking with the classical von Neumann architecture. Computing resources of both von Neumann and non-von Neumann architectures can thus be pooled as computing resources, letting the various heterogeneous chips play to their strengths in their respective fields and raising overall system efficiency.
As shown in FIG. 4, a method of constructing an assembled distributed computing and storage system, comprising the steps of:
1. If there is only one domain server in the system, the computing and storage system is built with that domain server. In this case the single domain server is a standard hyper-converged unit as in a hyper-convergence scheme, with the advantage that computing, storage, networking and network functions (security and optimization) are deeply fused into one standard x86 server; the flow then ends;
2. If there are two or more domain servers, the flow continues as follows:
(1) if storage servers need to be expanded, one or more domain servers are used as storage servers (the storage units of different nodes form a storage pool);
(2) if computing servers need to be expanded, one or more domain servers are used as computing servers (the computing units of different nodes form a computing pool);
(3) if memory servers need to be expanded, one or more domain servers are used as memory servers (the memory units of different nodes form a memory pool);
(4) if nonvolatile memory servers need to be expanded, one or more domain servers are used as nonvolatile memory servers (the nonvolatile memory units of different nodes form a nonvolatile memory pool);
(5) the processor thread groups of each domain server form a control plane to manage and control the domain server;
(6) the processor thread groups of each domain server form a data plane to process or accelerate data for the data infrastructure services;
(7) the flow ends.
In the flow above, a single domain server can simultaneously scale the service requirements of storage, computing, memory, nonvolatile memory and so on; such a domain server takes the form of a hyper-converged unit within the distributed system. Its multiple computing units are connected through the object processing unit to form a local computing pool, its memory units a local memory pool, its persistent memory units a local persistent memory pool, and its storage units a local storage pool.
Also in the flow above, multiple domain servers are connected through the network switching unit to form a distributed system: all computing units in the distributed system form a computing pool, all memory units a memory pool, all persistent memory units a persistent memory pool, and all storage units a storage pool, the pools existing in resource-disaggregated form. A local computing pool reached through another domain server's object processing unit is, relative to that other domain server, a remote computing pool; likewise a local memory pool is a remote memory pool, a local persistent memory pool a remote persistent memory pool, and a local storage pool a remote storage pool.
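The local/remote pool view can be sketched as follows; the classes are hypothetical and only model the bookkeeping, not the high-performance network path:

```python
class DomainServer:
    def __init__(self, name, storage_units):
        self.name = name
        self.local_storage = list(storage_units)   # local storage pool
        self.peers = []                            # reached over the network

    def storage_view(self):
        view = {"local": list(self.local_storage), "remote": {}}
        for peer in self.peers:        # a peer's local pool is remote here
            view["remote"][peer.name] = list(peer.local_storage)
        return view

a = DomainServer("ds-a", ["ssd0", "ssd1"])
b = DomainServer("ds-b", ["hdd0"])
a.peers.append(b)
b.peers.append(a)
print(a.storage_view())
# {'local': ['ssd0', 'ssd1'], 'remote': {'ds-b': ['hdd0']}}
```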
The advantages of hyper-convergence are:
(1) Purchase as needed: existing investment is protected, and no one-time large-scale purchase is required;
(2) Rapid delivery: existing hyper-convergence schemes generally support rapid deployment, shortening delivery time;
(3) Simplified management: existing hyper-convergence schemes generally manage computing, storage, virtualization and other resources in a unified way, simplifying operation and maintenance;
(4) Elastic expansion: hyper-converged architectures are generally distributed, making linear expansion convenient and supporting capabilities such as no single point of failure, local backup, and intra-city or remote disaster recovery.
Hyper-convergence also has drawbacks:
(1) Information silos: hyper-convergence schemes generally do not support the pre-existing external storage in a data center, so information silos form, bringing low resource-utilization efficiency, difficult unified management, and similar problems.
In the invention: the pre-existing external storage is attached and pooled over a high-performance network in the form of resource disaggregation, overcoming the hyper-converged information-silo defect.
(2) Performance consistency: storage performance in a data center is crucial and is expected to be predictable and consistent, especially for indicators critical to core business systems such as latency, IOPS and bandwidth. In a hyper-converged architecture, computing and storage contend for physical resources such as CPU, memory and network and depend on each other; the essence of the problem is that the various data infrastructure services themselves need computing capacity, so computing, storage and networking interfere with one another.
In the invention: the advantage of disaggregation applies at the same time. Because remote computing or storage resources are attached and pooled on demand over a high-performance network, the expected performance can be guaranteed (predicted), solving the performance-consistency problem of hyper-convergence.
(3) Horizontal expansion: a key feature of hyper-converged architectures is easy expansion, with minimal deployment and capacity added on demand. In hyper-convergence, however, computing capacity, storage performance and capacity are generally expanded in lock-step, which cannot satisfy the real-world need to expand a single capability; some vendors also impose a minimum expansion unit, limiting flexibility.
In the invention: the advantage of disaggregation applies at the same time. Computing or storage is attached through the object processing unit (OPU): if only storage is needed, storage devices are attached; if only computing is needed, computing devices are attached; both may be attached at once. Since OPUs are interconnected over a high-performance network, expansion is very flexible.
(4) Data-processing functions: most current hyper-converged and software-defined storage (SDS) systems provide various data-processing functions, such as data redundancy, thin provisioning, snapshots and clones, and some even provide deduplication, data encryption and data compression, but these functions often degrade the performance of the computing devices.
In the invention: these data-processing tasks are offloaded to the object processing unit (OPU), so they do not interfere with the computing devices or drag down system performance.
A vital pillar of any disaggregation scheme is that remote resources must be attached and pooled over a high-performance network. Existing SmartNIC-based disaggregation schemes can partially achieve the effect this invention seeks by offloading the network protocol (or functions such as data compression/decompression) to the smart network card, but a SmartNIC avoids interfering with computing-device performance only by offloading part of the network I/O functions; it is confined to the network (protocol) level.
In the invention: the object processing unit (OPU) offloads far more data-processing functions, spanning storage, security, network and analysis, whereas a smart network card can only start from the network level. This is a higher-level abstraction and frees the computing resources to handle more of the computation each is good at.
In the traditional PCIe attachment model, the (storage) devices inside a single physical node are attached to that node's CPU, and the (hyper-converged) physical node can use its internally attached devices efficiently through virtualization (virtual machines deployed on that node). The limitation is that a virtual machine on one node cannot efficiently use, as if local, the (storage) devices attached to another hyper-converged node; that is, from the perspective of any one physical node, current hyper-converged schemes do not apply disaggregation between local and remote devices to achieve efficiency, which is one of the motivations for the disaggregation scheme proposed here.
Distributed expansion can proceed through either hyper-convergence or disaggregation. Hyper-converged expansion adds a new physical node containing all computing, storage and network resources; because the newly added resources must support the virtual machines (carrying services) running on that node, hyper-convergence must add computing, storage and network simultaneously. Disaggregated expansion also adds new physical nodes with computing, storage and network resources, but it can add resources individually, for example only DRAM, only solid-state disks, only CPUs, or only GPUs. The newly added physical node need not support service virtual machines: whatever resource is lacking is the resource added, joining the related resource pool if it exists, or creating a new one.
As shown in fig. 5, the method by which the control plane manages and controls the domain server comprises the following steps:
1. using the programmability of the object processing unit, monitoring, management or control is programmed in a high-level language;
2. the processor thread groups programmed for monitoring, management or control form the control plane and manage and control the domain server.
As shown in fig. 6, the method by which the data plane processes or accelerates data for the data infrastructure services comprises the following steps:
1. if available processor thread groups exist, the programmability of the object processing unit is used to program them, in a high-level language, with the data infrastructure services (storage, security, network, analysis and other functions) that need acceleration;
2. the programmed processor thread groups form the data plane, which performs the corresponding processing or acceleration of data for those data infrastructure services (storage, security, network, analysis and other functions).
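A minimal sketch of this flow (the selection logic is an assumption; Fig. 6 does not prescribe one): idle processor thread groups are programmed with the services chosen for acceleration and together form the data plane.

```python
def program_data_plane(idle_groups, requested, available):
    selected = [s for s in requested if s in available]   # step 1 selection
    # Step 2: each programmed thread group joins the data plane.
    return {group: service for group, service in zip(idle_groups, selected)}

available = {"compression", "encryption", "erasure-coding", "dpi"}
plane = program_data_plane(["ptg1", "ptg2"], ["compression", "dpi"], available)
print(plane)   # {'ptg1': 'compression', 'ptg2': 'dpi'}
```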
The above embodiments are intended only to illustrate the technical solution of the present invention, not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described may still be modified, or some or all of their technical features equivalently replaced, and that such modifications or substitutions do not in essence depart from the scope of the embodiments of the invention.

Claims (9)

1. A method of constructing an assemblable distributed computing and storage system, the system comprising one or more domain servers for computing services or storage services and a network switching unit responsible for connecting the domain servers into a distributed computing and storage system, wherein each domain server comprises an object processing unit and computing units, memory units, persistent memory units and storage units respectively connected to the object processing unit;
the object processing unit comprising: two or more processor thread groups, a network interface, a storage interface, a memory interface, a PCIe interface, a network-on-chip and a resource scheduler, each processor thread group being formed from multiple cores, one processor thread group comprising all the hardware threads of one multi-core processor, at least one processor thread group being used for management and control of the domain server to form a control plane, and at least one other processor thread group being used for data processing or acceleration of the domain server's data infrastructure services to form a data plane of the domain server; the network interface being used for connection with other domain servers or network switching units; the storage interface being used for connecting storage units; the memory interface being used for connecting memory units or nonvolatile memory units; the PCIe interface being used for high-speed serial point-to-point dual-channel high-bandwidth transmission; the network-on-chip being used for the object processing unit's on-chip communication; and the resource scheduler being used for scheduling the shared resources in the object processing unit across the multiple threads;
the control plane comprising a control-plane OS and service agents, and the data plane comprising a data-plane OS and, supported by it, libraries of data infrastructure services; in the control plane, the control-plane OS runs and supports the service agents, which expose an application programming interface through which control commands are received; in the data plane, the data-plane OS runs and supports the various libraries that provide the data infrastructure services;
the construction method comprises the following steps:
S1, when there is only one domain server, that domain server is used to build the computing and storage system, forming a hyper-converged unit: multiple computing units, memory units, persistent memory units and storage units are connected through the object processing unit to form a local computing pool, local memory pool, local persistent memory pool and local storage pool, and the method proceeds to step S3;
S2, when there are two or more domain servers, the domain servers are connected through the network switching unit to form a distributed system, in which all computing units form a computing pool, all memory units form a memory pool, all persistent memory units form a persistent memory pool, and all storage units form a storage pool, the pools existing in resource-disaggregated form; this step comprises:
S21, if storage servers need to be expanded, one or more domain servers are used as storage servers, so that the storage units of different nodes form a storage pool;
S22, if computing servers need to be expanded, one or more domain servers are used as computing servers, so that the computing units of different nodes form a computing pool;
S23, if memory servers need to be expanded, one or more domain servers are used as memory servers, so that the memory units of different nodes form a memory pool;
S24, if nonvolatile memory servers need to be expanded, one or more domain servers are used as nonvolatile memory servers, so that the nonvolatile memory units of different nodes form a nonvolatile memory pool;
S3, one or more processor thread groups of each domain server form a control plane that manages and controls the domain server;
S4, one or more processor thread groups of each domain server form a data plane that processes or accelerates data for the data infrastructure services;
s5, the flow ends.
2. The method of claim 1, wherein the control plane and the data plane share a data plane OS, the data plane OS providing a control plane software stack executing on the data plane OS, the control plane software stack including a virtual machine monitor, a multitasking control plane OS executing on the virtual machine monitor, and one or more service agents executing on the control plane OS, the virtual machine monitor isolating the control plane OS from functions executing on the data plane OS.
3. A method of constructing an assemblable distributed computing and storage system as claimed in claim 1 or 2, wherein the service agents comprise application-level software configured to perform configuration and offloading of software structures, the service agents are controlled through application programming interfaces, and the data infrastructure services in the data plane are configured or offloaded by the service agents.
4. A method of constructing an assemblable distributed computing and storage system as claimed in claim 1 or 2, in which the object processing units are highly programmable I/O processors having a plurality of processing cores, implementing data infrastructure services through a data plane development framework.
5. A method of constructing an assemblable distributed computing and storage system as claimed in claim 1 or 2, wherein said control plane comprises a control plane OS executing in more than one processing core, the control plane OS comprising Linux, Unix or a proprietary operating system.
6. A method of constructing an assemblable distributed computing and storage system as claimed in claim 1 or 2, in which the library supporting data infrastructure services comprises storage functions, security functions, network functions and analysis functions.
7. A method of constructing an assemblable distributed computing and storage system as claimed in claim 1 or 2, in which the computing units form various computing pools, including:
a central processing unit (CPU) computing pool, capable of independently handling complex logical operations and varied data types;
a graphics processing unit (GPU) computing pool, suited to processing large volumes of uniform data;
a tensor processing unit (TPU) computing pool, dedicated to accelerating deep neural network operations;
and a neural network processing unit (NPU) computing pool that integrates storage with processing.
8. The method of constructing an assemblable distributed computing and storage system of claim 1, wherein in step S3 the method by which the control plane manages and controls the domain server comprises the following steps:
S31, using the programmability of the object processing unit, monitoring, management or control is programmed in a high-level language;
S32, the processor thread groups programmed for monitoring, management or control form the control plane and manage and control the domain server.
9. The method of claim 1, wherein in step S4 the data plane processes or accelerates data of the data infrastructure services by the following steps:
S41, if an available processor thread group exists, the programmability of the object processing unit is used to select, in a high-level language, the data infrastructure services that need acceleration, and the processor thread group is programmed accordingly;
S42, the programmed processor thread groups form the data plane and carry out the processing or acceleration of data for the data infrastructure services.
CN202011599244.8A 2020-12-30 2020-12-30 Assembled distributed computing and storage system and construction method thereof Active CN112804297B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011599244.8A CN112804297B (en) 2020-12-30 2020-12-30 Assembled distributed computing and storage system and construction method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011599244.8A CN112804297B (en) 2020-12-30 2020-12-30 Assembled distributed computing and storage system and construction method thereof

Publications (2)

Publication Number Publication Date
CN112804297A (en) 2021-05-14
CN112804297B (en) 2022-08-19

Family

ID=75804166

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011599244.8A Active CN112804297B (en) 2020-12-30 2020-12-30 Assembled distributed computing and storage system and construction method thereof

Country Status (1)

Country Link
CN (1) CN112804297B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114201427B (en) * 2022-02-18 2022-05-17 之江实验室 Parallel deterministic data processing device and method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105573959A (en) * 2016-02-03 2016-05-11 清华大学 Computation and storage integrated distributed computer architecture
US10140304B1 (en) * 2015-12-10 2018-11-27 EMC IP Holding Company LLC Distributed metadata servers in a file system with separate metadata servers for file metadata and directory metadata
CN110892380A (en) * 2017-07-10 2020-03-17 芬基波尔有限责任公司 Data processing unit for stream processing
CN111444020A (en) * 2020-03-31 2020-07-24 中国科学院计算机网络信息中心 Super-fusion computing system architecture and fusion service platform
WO2020236363A1 (en) * 2019-05-20 2020-11-26 Microsoft Technology Licensing, Llc Server offload card with soc and fpga

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3005621A1 (en) * 2013-05-30 2016-04-13 Thomson Licensing Networked data processing apparatus

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10140304B1 (en) * 2015-12-10 2018-11-27 EMC IP Holding Company LLC Distributed metadata servers in a file system with separate metadata servers for file metadata and directory metadata
CN105573959A (en) * 2016-02-03 2016-05-11 清华大学 Computation and storage integrated distributed computer architecture
CN110892380A (en) * 2017-07-10 2020-03-17 芬基波尔有限责任公司 Data processing unit for stream processing
WO2020236363A1 (en) * 2019-05-20 2020-11-26 Microsoft Technology Licensing, Llc Server offload card with soc and fpga
CN111444020A (en) * 2020-03-31 2020-07-24 中国科学院计算机网络信息中心 Super-fusion computing system architecture and fusion service platform

Also Published As

Publication number Publication date
CN112804297A (en) 2021-05-14

Similar Documents

Publication Publication Date Title
NL2029116B1 (en) Infrastructure processing unit
US11010053B2 (en) Memory-access-resource management
US9619270B2 (en) Remote-direct-memory-access-based virtual machine live migration
US10254987B2 (en) Disaggregated memory appliance having a management processor that accepts request from a plurality of hosts for management, configuration and provisioning of memory
WO2016155335A1 (en) Task scheduling method and device on heterogeneous multi-core reconfigurable computing platform
CN109542831B (en) Multi-core virtual partition processing system of airborne platform
US8387064B2 (en) Balancing a data processing load among a plurality of compute nodes in a parallel computer
US20150378762A1 (en) Monitoring and dynamic configuration of virtual-machine memory-management
US20070150895A1 (en) Methods and apparatus for multi-core processing with dedicated thread management
US20070061779A1 (en) Method and System and Computer Program Product For Maintaining High Availability Of A Distributed Application Environment During An Update
KR20060071307A (en) Systems and methods for exposing processor topology for virtual machines
US20160124872A1 (en) Disaggregated memory appliance
US12021898B2 (en) Processes and systems that translate policies in a distributed computing system using a distributed indexing engine
US20140068165A1 (en) Splitting a real-time thread between the user and kernel space
US8028017B2 (en) Virtual controllers with a large data center
US20190007483A1 (en) Server architecture having dedicated compute resources for processing infrastructure-related workloads
KR20230041593A (en) Scalable address decoding scheme for cxl type-2 devices with programmable interleave granularity
US10289306B1 (en) Data storage system with core-affined thread processing of data movement requests
CN112804297B (en) Assembled distributed computing and storage system and construction method thereof
Liu Fabric-centric computing
CN114281529B (en) Method, system and terminal for dispatching optimization of distributed virtualized client operating system
Jin et al.: Efficient Resource Disaggregation for Deep Learning Workloads
CN117075971A (en) Dormancy method and related equipment
US10261817B2 (en) System on a chip and method for a controller supported virtual machine monitor
US11650849B2 (en) Efficient component communication through accelerator switching in disaggregated datacenters

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant