CN106383791A - Memory block combination method and apparatus based on non-uniform memory access architecture - Google Patents

Memory block combination method and apparatus based on non-uniform memory access architecture

Info

Publication number
CN106383791A
Authority
CN
China
Prior art keywords
node
memory
block
window
memory block
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610844237.7A
Other languages
Chinese (zh)
Other versions
CN106383791B (en)
Inventor
张健 (Zhang Jian)
王梅 (Wang Mei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Polytechnic
Original Assignee
Shenzhen Polytechnic
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Polytechnic filed Critical Shenzhen Polytechnic
Priority to CN201610844237.7A priority Critical patent/CN106383791B/en
Publication of CN106383791A publication Critical patent/CN106383791A/en
Application granted granted Critical
Publication of CN106383791B publication Critical patent/CN106383791B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/06Addressing a physical block of locations, e.g. base addressing, module addressing, memory dedication
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/14Handling requests for interconnection or transfer
    • G06F13/16Handling requests for interconnection or transfer for access to memory bus
    • G06F13/1605Handling requests for interconnection or transfer for access to memory bus based on arbitration
    • G06F13/1652Handling requests for interconnection or transfer for access to memory bus based on arbitration in a multiprocessor architecture
    • G06F13/1657Access to multiple memories
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/25Using a specific main memory architecture
    • G06F2212/254Distributed memory
    • G06F2212/2542Non-uniform memory access [NUMA] architecture

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multi Processors (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention belongs to the technical field of cloud storage, and relates to a memory block combination method and apparatus based on a non-uniform memory access architecture. The method comprises three steps: 1) classifying the memory provided by available nodes according to node frequency, and logically connecting the memory of available nodes with the same frequency to form a memory block; 2) treating each memory block as a window block and, by adjusting the arrangement order of the window blocks and the arrangement order of the available nodes within each window block, determining the logical arrangement with the lowest connection cost, wherein the logical arrangement result includes the master node of that lowest-cost arrangement, and recording the logical arrangement result in a routing table; and 3) storing the routing table in a control processor connected to the master node, and allocating, by the control processor, a global address to each memory block, so as to construct a memory cloud. The method and apparatus overcome the low efficiency of the cluster interconnection network and the heterogeneity of the different memories, and build the highest-quality non-uniform memory access cloud storage possible.

Description

Memory block combination method and apparatus based on a non-uniform memory access architecture
Technical field
The invention belongs to the technical field of cloud storage, and in particular relates to a memory block combination method and apparatus based on a non-uniform memory access architecture.
Background technology
At present, cloud storage technology within cloud computing is developing ever faster, moving from disk arrays to SSD (Solid State Drive) arrays and now to RAM (Random Access Memory) cloud storage. RAM cloud storage keeps the data of an entire application in the RAM of up to hundreds or even thousands of servers; its throughput is hundreds to thousands of times higher than that of disk-based systems, while its latency is only a few hundredths to a few thousandths of theirs. MapReduce is a technique popularized by Google in recent years with the aim of improving data access speed and eliminating latency. It handles large-scale problems, but for continuous data access it confines programs to applications built around random data access. The MapReduce distributed computing framework has two main limitations: first, writing linear communication models with MapReduce is cumbersome; second, it is a framework based wholly or largely on batch processing. The RAMCloud project announced by Stanford University builds a memory array from memory of the same type and achieves a storage capacity of more than 1 PB, but its limitation is precisely that it uses only memory of the same type.
The NUMA (Non-Uniform Memory Access) architecture, by contrast, makes it possible to combine different types of memory into memory cloud storage. However, simply connecting the memory groups through an adapter card, bus, or network does not yield an optimized memory cloud store.
Summary of the invention
The purpose of the invention is to change the existing memory cloud architecture built from same-type memory arrays and to address other related problems. To this end, a memory block combination method and apparatus based on a non-uniform memory access architecture are proposed, which can efficiently sort and merge heterogeneous, non-uniformly accessed memory, transfer the logical arrangement result to a control processor, and construct the highest-quality non-uniform memory access cloud storage possible.
To achieve the above object, the present invention adopts the following technical scheme: a memory block combination method based on a non-uniform memory access architecture, comprising the following steps:
Step 1: classify the memory provided by available nodes according to node frequency, and logically connect the memory of available nodes with the same frequency to form a memory block;
Step 2: treat each memory block as a window block and, by adjusting the arrangement order of the window blocks and the arrangement order of the available nodes within each window block, determine the logical arrangement with the lowest connection cost, wherein the logical arrangement result includes the master node of that lowest-cost arrangement; record the logical arrangement result in a routing table;
Step 3: store the routing table in the control processor connected to the master node, and have the control processor allocate a global address to each memory block, so as to build the memory cloud.
The algorithm of the present invention is based on a NUMA and SIMD hardware environment. A node in the present invention is a network node, and an available node is a node that can provide part of its memory and is connected to the network through a NUMA card. Regarding node frequency: each server in the model and its connections has different memory, CPU, mainboard, and network interfaces, so connection speeds differ; the present invention reduces all such speed-affecting factors to the frequency of the node's memory. The master node is the available node whose total cost to all other available nodes is the lowest. Any factor affecting data transfer is regarded as connection cost. The cost from the master node to a memory block is the sum of the master node's costs to all nodes in that memory block.
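As an illustration of these definitions, the following Python sketch computes a master node and the cost from a master node to a memory block; the node names and cost values are illustrative only and are not taken from the filing:

```python
# Illustrative directed link costs between available nodes (hypothetical values).
link_cost = {
    ("A", "B"): 2, ("B", "A"): 1,
    ("A", "D"): 3, ("D", "A"): 1,
    ("D", "B"): 1, ("B", "D"): 2,  # (B, D) is assumed; not listed in Table 2 below
}

def cost(src, dst):
    """Connection cost from src to dst; a node costs nothing to reach itself."""
    return 0 if src == dst else link_cost[(src, dst)]

def master_node(nodes):
    """The master node is the available node with the lowest total cost to all other available nodes."""
    return min(nodes, key=lambda n: sum(cost(n, other) for other in nodes))

def block_cost(master, block):
    """Cost from the master node to a memory block: the sum of its costs to every node in the block."""
    return sum(cost(master, node) for node in block)

print(master_node(["A", "B", "D"]))   # -> "D" with these illustrative costs
print(block_cost("D", ["A", "B"]))    # -> 2
```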
Preferably, step 2 includes:
first selecting, by simulated annealing, one available node from the available nodes as the master node, the master node being the connection interface of the control processor;
arranging the window blocks in ascending order of the master node's connection cost to each window block, and arranging the available nodes within each window block in ascending order of the master node's connection cost to each of those nodes.
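A minimal sketch of this ordering step, reusing the cost() and block_cost() helpers above (the master node is assumed to be already selected):

```python
def arrange(master, window_blocks):
    """Order the window blocks by the master node's connection cost to each block (ascending),
    and order the available nodes inside each block by the master node's cost to them (ascending)."""
    ordered = [sorted(block, key=lambda node: cost(master, node)) for block in window_blocks]
    ordered.sort(key=lambda block: block_cost(master, block))
    return ordered
```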
Preferably, step 3 includes connecting the master node to the control processor through a bus.
In another aspect, the present invention also provides a memory block combination apparatus based on a non-uniform memory access architecture, the apparatus comprising:
a division module, configured to classify the memory provided by available nodes according to node frequency and to logically connect the memory of available nodes with the same frequency to form a memory block;
a processing module, configured to treat each memory block as a window block and, by adjusting the arrangement order of the window blocks and the arrangement order of the available nodes within each window block, determine the logical arrangement with the lowest connection cost, wherein the logical arrangement result includes the master node of that lowest-cost arrangement, and to record the logical arrangement result in a routing table;
a building module, configured to store the routing table in the control processor connected to the master node, and to have the control processor allocate a global address to each memory block, so as to build the memory cloud.
Preferably, the processing module is further configured to first select, by simulated annealing, one available node from the available nodes as the master node, the master node being the connection interface of the control processor;
and to arrange the window blocks in ascending order of the master node's connection cost to each window block, and to arrange the available nodes within each window block in ascending order of the master node's connection cost to each of those nodes.
The memory block combination method and apparatus based on a non-uniform memory access architecture of the present invention can efficiently sort and merge heterogeneous, non-uniformly accessed memory, forming an architecture in which processors and the operating system interconnect and share a memory bus. The present invention can be applied to large-scale NUMA memory cloud storage platforms; it overcomes the low efficiency of the cluster interconnection network and the heterogeneity of the different memories, and constructs the highest-quality non-uniform memory access cloud storage possible.
Brief description of the drawings
Fig. 1 shows the RAMCloud non-uniform memory access architecture in an embodiment of the present invention;
Fig. 2 shows a potential data center node topology in an embodiment of the present invention;
Fig. 3 shows the merged memory blocks in an embodiment of the present invention;
Fig. 4 shows the window-block simulated annealing in an embodiment of the present invention;
Fig. 5 shows the number of runs and the convergence state in an embodiment of the present invention.
Specific embodiment
Embodiment 1:
A memory block combination method based on a non-uniform memory access architecture comprises the following steps:
Step 1: classify the memory provided by available nodes according to node frequency, and logically connect the memory of available nodes with the same frequency to form a memory block;
Step 2: treat each memory block as a window block and, by adjusting the arrangement order of the window blocks and the arrangement order of the available nodes within each window block, determine the logical arrangement with the lowest connection cost, wherein the logical arrangement result includes the master node of that lowest-cost arrangement; record the logical arrangement result in a routing table;
Step 3: store the routing table in the control processor connected to the master node, and have the control processor allocate a global address to each memory block, so as to build the memory cloud.
A memory block combination apparatus based on a non-uniform memory access architecture comprises:
a division module, configured to classify the memory provided by available nodes according to node frequency and to logically connect the memory of available nodes with the same frequency to form a memory block;
a processing module, configured to treat each memory block as a window block and, by adjusting the arrangement order of the window blocks and the arrangement order of the available nodes within each window block, determine the logical arrangement with the lowest connection cost, wherein the logical arrangement result includes the master node of that lowest-cost arrangement, and to record the logical arrangement result in a routing table;
a building module, configured to store the routing table in the control processor connected to the master node, and to have the control processor allocate a global address to each memory block, so as to build the memory cloud.
The algorithm of the present invention is based on a NUMA and SIMD hardware environment. A node in the present invention is a network node, and an available node is a node that can provide part of its memory and is connected to the network through a NUMA card. Regarding node frequency: each server in the model and its connections has different memory, CPU, mainboard, and network interfaces, so connection speeds differ; the present invention reduces all such speed-affecting factors to the frequency of the node's memory. The master node is the available node whose total cost to all other available nodes is the lowest. Any factor affecting data transfer is regarded as connection cost. The cost from the master node to a memory block is the sum of the master node's costs to all nodes in that memory block.
This embodiment can be applied to large-scale NUMA memory cloud storage platforms, using an architecture in which processor and operating system clusters interconnect and share a memory bus. This structure overcomes the low efficiency of the cluster interconnection network and the heterogeneity of the memories, greatly improves availability, and constitutes a more optimized memory cloud storage.
Embodiment 2:
A memory block combination method based on a non-uniform memory access architecture comprises the following steps:
Step 1: classify the memory provided by available nodes according to node frequency, and logically connect the memory of available nodes with the same frequency to form a memory block;
Step 2: treat each memory block as a window block and, by adjusting the arrangement order of the window blocks and the arrangement order of the available nodes within each window block, determine the logical arrangement with the lowest connection cost, wherein the logical arrangement result includes the master node of that lowest-cost arrangement; record the logical arrangement result in a routing table;
Step 3: store the routing table in the control processor connected to the master node, and have the control processor allocate a global address to each memory block, so as to build the memory cloud.
Here, step 2 includes:
first selecting, by simulated annealing, one available node from the available nodes as the master node, the master node being the connection interface of the control processor;
arranging the window blocks in ascending order of the master node's connection cost to each window block, and arranging the available nodes within each window block in ascending order of the master node's connection cost to each of those nodes.
Step 3 includes connecting the master node to the control processor through a bus.
A memory block combination apparatus based on a non-uniform memory access architecture comprises:
a division module, configured to classify the memory provided by available nodes according to node frequency and to logically connect the memory of available nodes with the same frequency to form a memory block;
a processing module, configured to treat each memory block as a window block and, by adjusting the arrangement order of the window blocks and the arrangement order of the available nodes within each window block, determine the logical arrangement with the lowest connection cost, wherein the logical arrangement result includes the master node of that lowest-cost arrangement, and to record the logical arrangement result in a routing table;
a building module, configured to store the routing table in the control processor connected to the master node, and to have the control processor allocate a global address to each memory block, so as to build the memory cloud.
The processing module is further configured to first select, by simulated annealing, one available node from the available nodes as the master node, the master node being the connection interface of the control processor;
and to arrange the window blocks in ascending order of the master node's connection cost to each window block, and to arrange the available nodes within each window block in ascending order of the master node's connection cost to each of those nodes.
As shown in Fig. 1, the memory cloud under the non-uniform memory access architecture includes an application library, a data center, and a control processor. The data center organizes the memory cloud according to the non-uniform memory access architecture, and the control processor manages the data center.
For the memory cloud to achieve its low-latency target, a high-performance network technology with the following characteristics is required: low latency, high bandwidth, and full-duplex bandwidth.
The algorithm of the present invention is elaborated below by means of a model:
1. Model formulation
Assumption 1: each node has memory that may differ in type from that of other nodes, for example in frequency, bus, CPU model, or running speed; in this model, all these aspects are reduced to different frequencies;
Assumption 2: according to the prior art, sorting and merging nodes by frequency yields the best performance;
Assumption 3: connecting nodes incurs different costs, and any factor affecting data transfer is treated as connection cost.
2. Model
As shown in Fig. 2, nodes A, B, C, ..., H form a connection topology in which different frequencies simulate heterogeneous, non-uniform memory. Each node contributes a certain amount of memory to the cloud, and the connections between nodes have different costs.
3. Data model and initialization
Each of the above nodes has a memory capacity and a frequency. The related data are shown in Table 1.
Table 1: Node information
For any pair of connected nodes, Table 2 lists node 1, node 2, and the corresponding connection cost.
Table 2: Node connection costs
Node 1   Node 2   Connection cost
A        B        2
B        A        1
A        D        3
D        A        1
D        B        1
This model is a cloud storage under the non-uniform memory access architecture, and access follows three rules:
(1) adjacent memory nodes must not be written at random;
(2) adjacent memory nodes must not be read at random;
(3) adjacent memory nodes must not be synchronized.
Experiments show that violating any of these rules causes a sharp drop in performance. Memory performance test data for Kingston memory show that combinations of memory with the same clock are optimal; otherwise the memory may operate in single-channel or single-bandwidth mode, so that memory access speed drops sharply.
In the memory cloud, research on optimizing sort-merge join algorithms focuses mainly on NUMA and SIMD hardware environments. A parallel sort-merge join algorithm under a non-uniform memory access architecture can be divided into three phases: a sorting phase, a partitioning phase, and a connection phase. Accordingly, the present invention merges same-type memory and finds the access node with the minimum cost; that node is interconnected directly through the processor bus, such as AMD HT (HyperTransport) or Intel QPI (QuickPath Interconnect).
The following rules are defined:
Rule 1: to obtain the best performance, the memory of the available nodes is sorted by node frequency and merged; the sorting yields the corresponding set of memory blocks, and each memory block is a set, denoted {Mbi};
Rule 2: find a master node to serve as the connection interface of the control processor, such that the total cost from the master node to the other nodes is the minimum; meanwhile, the interior of each merged memory block is not changed logically;
Rule 3: from the second memory block onward, the nodes are ordered by their cost from the master node, closest first; this group is denoted {Ai}.
According to the above rules, heterogeneous, non-uniform memory can be sorted and merged quickly and efficiently, and the node connecting to the control processor can be found; the control processor then distributes the global addresses and builds the memory cloud for application programs to access.
Taking the model shown in Fig. 2 as an example, the algorithm has three phases: sort-merge, partition, and connection.
(1) Sort and merge ---- initialization
According to the data in Table 1, the node memory is sorted and merged: the memory of nodes with the same frequency is logically connected. Four memory blocks are obtained, {Mbi} = {6, 9, 6, 2}, as shown in Fig. 3.
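A minimal sketch of this sort-and-merge step; since Table 1 is not reproduced in this text, the node capacities and frequencies below are assumptions used only to show the grouping:

```python
from collections import defaultdict

# Hypothetical Table 1 data: node -> (memory capacity in GB, memory frequency in MHz).
node_info = {
    "A": (2, 1600), "B": (4, 1333), "C": (3, 1600), "D": (1, 1066),
    "E": (5, 1333), "F": (2, 1866), "G": (1, 1600), "H": (2, 1066),
}

def sort_and_merge(node_info):
    """Group available nodes by memory frequency; each frequency group is logically
    connected into one memory block, and blocks are listed in frequency order."""
    groups = defaultdict(list)
    for node, (_, freq) in node_info.items():
        groups[freq].append(node)
    return [groups[freq] for freq in sorted(groups, reverse=True)]

blocks = sort_and_merge(node_info)   # e.g. [['F'], ['A', 'C', 'G'], ['B', 'E'], ['D', 'H']]
capacities = [sum(node_info[n][0] for n in b) for b in blocks]   # per-block capacity
```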
(2) Partition ---- window-block simulated annealing
According to the data in Table 2, the system is initialized and the data in Table 3 are obtained: the minimum-cost path from any node to every other node. If the access detail is 0, the two nodes are directly connected; otherwise it is a string describing the routed path from one node to the other. The associated data are shown in Table 3.
Table 3: Minimum cost and corresponding access path from server to server
According to Table 3, the current overhead is
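The Table 3 data are not reproduced in this text; as one way to derive them, the sketch below runs an all-pairs shortest-path pass (Floyd–Warshall, a standard choice not named in the filing) over the Table 2 link costs to obtain the minimum cost and routed path between every pair of nodes:

```python
def all_pairs_min_cost(nodes, direct_cost):
    """Floyd–Warshall over directed link costs; returns the minimum cost and the routed
    path for every node pair. A path of just [src, dst] means a direct connection."""
    INF = float("inf")
    cost = {(i, j): (0 if i == j else direct_cost.get((i, j), INF)) for i in nodes for j in nodes}
    path = {(i, j): [i, j] for i in nodes for j in nodes}
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if cost[(i, k)] + cost[(k, j)] < cost[(i, j)]:
                    cost[(i, j)] = cost[(i, k)] + cost[(k, j)]
                    path[(i, j)] = path[(i, k)][:-1] + path[(k, j)]
    return cost, path

# The direct links listed in Table 2; pairs not listed are treated as having no direct link.
table2 = {("A", "B"): 2, ("B", "A"): 1, ("A", "D"): 3, ("D", "A"): 1, ("D", "B"): 1}
costs, paths = all_pairs_min_cost(["A", "B", "D"], table2)
print(costs[("B", "D")], paths[("B", "D")])   # -> 4 ['B', 'A', 'D']: routed through A
```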
The present invention applies the idea of simulated annealing, an approximate global optimization method for large search spaces.
According to Rule 1, the merged memory blocks must not be broken apart. The present invention therefore uses window blocks, each of which is treated as a single memory unit; inside a window block, the nodes can be reordered. During the process, the total cost of the current arrangement is computed and annealing proceeds. By moving the window blocks and reordering the nodes inside them, the best solution within a finite time budget is obtained.
Fig. 4a shows one possible solution: the master node is F, the coprocessor accesses the other nodes from F, and the total cost is 65.
Fig. 4b shows a better solution: the master node is B, the total cost from B to the other nodes is 27, and from the second window block onward the nodes are ordered according to Rule 3.
(3) Connect the combined memory cloud
Once the best solution is obtained, as shown in Fig. 4b, the coprocessor is connected to node B, and the routing table (similar to Table 3) is copied to and stored in the coprocessor. The coordinator then distributes a global address to each cluster.
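A minimal sketch of this connection stage; the contiguous address-range scheme below is an assumption, as the filing only states that the coordinator/control processor distributes a global address to each block (cluster):

```python
def assign_global_addresses(arranged_blocks, block_size_bytes):
    """Walk the blocks in their arranged order and give each one a contiguous
    global address range; block_size_bytes maps a block id to its capacity."""
    routing = {}
    base = 0
    for block_id in arranged_blocks:
        size = block_size_bytes[block_id]
        routing[block_id] = (base, base + size - 1)   # inclusive global address range
        base += size
    return routing
```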
For details not exhausted in this embodiment, refer to the corresponding description of Embodiment 1 above; they are not repeated here.
The simulated annealing adopted in this embodiment improves on the traditional algorithm: it not only sorts the memory blocks by cost but also sorts the nodes within each memory block at the same time. The simulated annealing is flexible and efficient; when a new node joins the memory cloud, the memory blocks and corresponding nodes in the memory cloud can be adjusted quickly, thereby constructing high-quality non-uniform memory access cloud storage.
The memory block combination method based on a non-uniform memory access architecture of this embodiment is illustrated below with a concrete application scenario, as follows:
(4) Algorithm description
According to Rule 1, initialization is performed first: Init() is called to sort and merge the nodes, producing the initial state S0 (see Table 3). Then window-block simulated annealing is applied according to Rule 2: Cost() computes and returns the cost of the current solution, and Neighbor(), as in traditional simulated annealing, produces a randomly selected neighbour of a given state. Finally, the best solution is obtained. The function Connect() connects the coprocessor to the master node and copies the routing information table (Table 3). The function AssignGlobalAddress() coordinates the allocation of the cluster memory's global addresses according to the block order.
Parameter S0 is the initial solution, Sbest is the best solution found so far, T0 is the initial temperature, α is the cooling rate, β is a constant, M represents the time until the next parameter update, and the maximum time limit is the total time of the annealing process.
The pseudo-code below gives the described memory block combination method for the non-uniform memory access architecture.
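That pseudo-code is not reproduced in this text. The following Python sketch reconstructs the loop from the functions and parameters named above (Init, Cost, Neighbor, Connect, AssignGlobalAddress; S0, Sbest, T0, α, β, M); the exact signatures, acceptance rule, and stopping test are assumptions:

```python
import math
import random

def memory_block_combination(Init, Cost, Neighbor, Connect, AssignGlobalAddress,
                             T0=100.0, alpha=0.95, beta=1.05, M=10, max_time=1000):
    """Window-block simulated annealing: Init() sorts and merges the nodes into the
    initial state S0; Neighbor() reorders the window blocks and the nodes inside them;
    Cost() returns the total connection cost of a state; the best state found is then
    connected to the control processor and given global addresses."""
    S0 = Init()
    S_current, S_best = S0, S0
    T, moves, elapsed = T0, M, 0
    while elapsed < max_time and T > 1e-3:
        for _ in range(int(moves)):              # M moves before the next parameter update
            S_new = Neighbor(S_current)          # random window-block / in-block reorder
            delta = Cost(S_new) - Cost(S_current)
            if delta < 0 or random.random() < math.exp(-delta / T):
                S_current = S_new                # accept better moves, or worse ones with e^(-delta/T)
            if Cost(S_current) < Cost(S_best):
                S_best = S_current
        T *= alpha                               # cooling rate
        moves *= beta                            # constant beta stretches the time to the next update
        elapsed += 1
    Connect(S_best)                              # attach the coprocessor to the master node, copy routing table
    AssignGlobalAddress(S_best)                  # distribute global addresses by block order
    return S_best
```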
In the algorithm, the most important function is Neighbor(), which generates a randomly selected neighbour of a given state. Inside a window block, the nodes are rearranged according to Rule 3; outside the window blocks, the window blocks themselves are rearranged.
In this model there are 8 nodes and 4 window blocks. Over many experimental runs, the overhead of the best solution finally converges to 27; the best case takes 3 runs and the worst case 15 runs, as shown in Fig. 5.

Claims (5)

1. A memory block combination method based on a non-uniform memory access architecture, characterized by comprising the following steps:
Step 1: classifying the memory provided by available nodes according to node frequency, and logically connecting the memory of available nodes with the same frequency to form a memory block;
Step 2: treating each memory block as a window block and, by adjusting the arrangement order of the window blocks and the arrangement order of the available nodes within each window block, determining the logical arrangement with the lowest connection cost, wherein the logical arrangement result includes the master node of that lowest-cost arrangement, and recording the logical arrangement result in a routing table;
Step 3: storing the routing table in the control processor connected to the master node, and having the control processor allocate a global address to each memory block, so as to build the memory cloud.
2. The memory block combination method based on a non-uniform memory access architecture according to claim 1, characterized in that step 2 comprises:
first selecting, by simulated annealing, one available node from the available nodes as the master node, the master node being the connection interface of the control processor;
arranging the window blocks in ascending order of the master node's connection cost to each window block, and arranging the available nodes within each window block in ascending order of the master node's connection cost to each of those nodes.
3. The memory block combination method based on a non-uniform memory access architecture according to claim 1, characterized in that step 3 comprises connecting the master node to the control processor through a bus.
4. A memory block combination apparatus based on a non-uniform memory access architecture, characterized in that the apparatus comprises:
a division module, configured to classify the memory provided by available nodes according to node frequency and to logically connect the memory of available nodes with the same frequency to form a memory block;
a processing module, configured to treat each memory block as a window block and, by adjusting the arrangement order of the window blocks and the arrangement order of the available nodes within each window block, determine the logical arrangement with the lowest connection cost, wherein the logical arrangement result includes the master node of that lowest-cost arrangement, and to record the logical arrangement result in a routing table;
a building module, configured to store the routing table in the control processor connected to the master node, and to have the control processor allocate a global address to each memory block, so as to build the memory cloud.
5. The memory block combination apparatus based on a non-uniform memory access architecture according to claim 4, characterized in that:
the processing module is further configured to first select, by simulated annealing, one available node from the available nodes as the master node, the master node being the connection interface of the control processor;
and to arrange the window blocks in ascending order of the master node's connection cost to each window block, and to arrange the available nodes within each window block in ascending order of the master node's connection cost to each of those nodes.
CN201610844237.7A 2016-09-23 2016-09-23 Memory block combination method and apparatus based on non-uniform memory access architecture Active CN106383791B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610844237.7A CN106383791B (en) 2016-09-23 2016-09-23 Memory block combination method and apparatus based on non-uniform memory access architecture

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610844237.7A CN106383791B (en) 2016-09-23 2016-09-23 Memory block combination method and apparatus based on non-uniform memory access architecture

Publications (2)

Publication Number Publication Date
CN106383791A true CN106383791A (en) 2017-02-08
CN106383791B CN106383791B (en) 2019-07-12

Family

ID=57936804

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610844237.7A Active CN106383791B (en) 2016-09-23 2016-09-23 Memory block combination method and apparatus based on non-uniform memory access architecture

Country Status (1)

Country Link
CN (1) CN106383791B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112558869A (en) * 2020-12-11 2021-03-26 北京航天世景信息技术有限公司 Remote sensing image caching method based on big data

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030009637A1 (en) * 2001-06-21 2003-01-09 International Business Machines Corporation Decentralized global coherency management in a multi-node computer system
CN104144194A (en) * 2013-05-10 2014-11-12 中国移动通信集团公司 Data processing method and device for cloud storage system
CN104199718A (en) * 2014-08-22 2014-12-10 上海交通大学 Dispatching method of virtual processor based on NUMA high-performance network cache resource affinity
CN104506362A (en) * 2014-12-29 2015-04-08 浪潮电子信息产业股份有限公司 Method for system state switching and monitoring on CC-NUMA (cache coherent-non uniform memory access architecture) multi-node server
CN104657198A (en) * 2015-01-24 2015-05-27 深圳职业技术学院 Memory access optimization method and memory access optimization system for NUMA (Non-Uniform Memory Access) architecture system in virtual machine environment
CN104850461A (en) * 2015-05-12 2015-08-19 华中科技大学 NUMA-oriented virtual cpu (central processing unit) scheduling and optimizing method
CN105391590A (en) * 2015-12-26 2016-03-09 深圳职业技术学院 Method and system for automatically obtaining system routing table of NUMA

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030009637A1 (en) * 2001-06-21 2003-01-09 International Business Machines Corporation Decentralized global coherency management in a multi-node computer system
CN104144194A (en) * 2013-05-10 2014-11-12 中国移动通信集团公司 Data processing method and device for cloud storage system
CN104199718A (en) * 2014-08-22 2014-12-10 上海交通大学 Dispatching method of virtual processor based on NUMA high-performance network cache resource affinity
CN104506362A (en) * 2014-12-29 2015-04-08 浪潮电子信息产业股份有限公司 Method for system state switching and monitoring on CC-NUMA (cache coherent-non uniform memory access architecture) multi-node server
CN104657198A (en) * 2015-01-24 2015-05-27 深圳职业技术学院 Memory access optimization method and memory access optimization system for NUMA (Non-Uniform Memory Access) architecture system in virtual machine environment
CN104850461A (en) * 2015-05-12 2015-08-19 华中科技大学 NUMA-oriented virtual cpu (central processing unit) scheduling and optimizing method
CN105391590A (en) * 2015-12-26 2016-03-09 深圳职业技术学院 Method and system for automatically obtaining system routing table of NUMA

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112558869A (en) * 2020-12-11 2021-03-26 北京航天世景信息技术有限公司 Remote sensing image caching method based on big data

Also Published As

Publication number Publication date
CN106383791B (en) 2019-07-12

Similar Documents

Publication Publication Date Title
Ajima et al. The Tofu Interconnect D
US8117288B2 (en) Optimizing layout of an application on a massively parallel supercomputer
JPH0766718A (en) Wafer scale structure for programmable logic
WO2015171461A1 (en) Interconnect systems and methods using hybrid memory cube links
JPH06243113A (en) Calculation model mapping method for parallel computer
Wang et al. A message-passing multi-softcore architecture on FPGA for breadth-first search
CN101441616B (en) Rapid data exchange structure based on register document and management method thereof
CN109101338A (en) A kind of block chain processing framework and its method based on the extension of multichannel chip
KR20160121380A (en) Distributed file system using torus network and method for configuring and operating of the distributed file system using torus network
CN108304261B (en) Job scheduling method and device based on 6D-Torus network
CN106383791B (en) A kind of memory block combined method and device based on nonuniform memory access framework
Fernández et al. Efficient VLSI layouts for homogeneous product networks
Lin et al. A distributed resource management mechanism for a partitionable multiprocessor system
Laili et al. Parallel transfer evolution algorithm
Ravindran et al. On topology and bisection bandwidth of hierarchical-ring networks for shared-memory multiprocessors
Pietracaprina et al. Constructive deterministic PRAM simulation on a mesh-connected computer
Odendahl et al. Optimized buffer allocation in multicore platforms
Raghunath et al. Designing interconnection networks for multi-level packaging
Mackenzie et al. Comparative modeling of network topologies and routing strategies in multicomputers
JP6991446B2 (en) Packet processing device and its memory access control method
Sudheer et al. Dynamic load balancing for petascale quantum Monte Carlo applications: The Alias method
Davis IV et al. The performance analysis of partitioned circuit switched multistage interconnection networks
KEERTHI Improving the Network Traffic Performance in MapReduce for Big Data Applications through Online Algorithm
CN117201406A (en) Network signal stream transmission method and device, electronic equipment and medium
Herbordt et al. Towards scalable multicomputer communication through offline routing

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant