CN108984308A - Workload-based cloud data processing method and system - Google Patents

Workload-based cloud data processing method and system Download PDF

Info

Publication number
CN108984308A
CN108984308A (application CN201810825782.0A)
Authority
CN
China
Prior art keywords
workload
data
hypergraph
copy
cloud data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810825782.0A
Other languages
Chinese (zh)
Inventor
严莉
赵鹏
刘范范
刘子雁
韩圣亚
汤耀庭
汤琳琳
黄振
张悦
朱韶松
张凯
赵忱
赵晓
李刚
林鹏
付本娟
赵阳
宫淑卿
朱璐
吕舒清
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Corp of China SGCC
Information and Telecommunication Branch of State Grid Shandong Electric Power Co Ltd
Original Assignee
State Grid Corp of China SGCC
Information and Telecommunication Branch of State Grid Shandong Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Corp of China SGCC, Information and Telecommunication Branch of State Grid Shandong Electric Power Co Ltd
Priority to CN201810825782.0A
Publication of CN108984308A
Legal status: Pending

Links

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 - Multiprogramming arrangements
    • G06F9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011 - Allocation of resources to service a request, the resources being hardware resources other than CPUs, servers and terminals
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 - Multiprogramming arrangements
    • G06F9/466 - Transaction processing

Abstract

The invention discloses a workload-based cloud data processing method, comprising the following steps: a cloud data replica placement step: obtaining the historical workload and completing the mapping of data items to storage partitions using an event-driven partitioning strategy; and a cloud data replica positioning step: after receiving a transaction request initiated by an application system, selecting and locating a replica using a greedy algorithm. A workload-based cloud data processing system is also disclosed. The invention supports elastic expansion of the data storage model and also solves the performance degradation caused by workload changes, ensuring efficient operation of applications.

Description

Workload-based cloud data processing method and system
Technical field
The present invention relates to the field of cloud data processing, and in particular to a workload-based cloud data processing method and system.
Background technique
Cloud computing is an emerging technical approach. With this computing model, shared hardware and software resources can be provided to users over the Internet on demand and in a customizable way. Cloud computing platforms have become a new platform serving enterprises and individuals. Cloud computing generally adopts a unified, centralized data storage model, and in a cloud computing platform how data are placed is an extremely important problem: in actual use, data need to be assigned to suitable nodes in the cloud.
The data placement problem in cloud computing platforms has been studied extensively and in depth by industry. Current research mainly concentrates on the initial data placement strategy, the selection of the number of data replicas, dynamic adjustment at runtime, and the routing algorithms for transaction requests. The problems with the above are that existing placement techniques may introduce performance bottlenecks and limit database scalability, while existing replica selection strategies easily lead to problems such as an increased number of distributed transactions and excessive cost.
Summary of the invention
The object of the present invention is to provide a workload-based cloud data processing method and system that solve the data placement problem in cloud computing platforms and support rapid development and interaction to the maximum extent.
To achieve the above object, the present invention adopts the following technical solutions:
A first aspect of the present invention provides a workload-based cloud data processing method, comprising the following steps:
a cloud data replica placement step: obtaining the historical workload and completing the mapping of data items to storage partitions using an event-driven partitioning strategy;
a cloud data replica positioning step: after receiving a transaction request initiated by an application system, selecting and locating a replica using a greedy algorithm.
With reference to the first aspect, in a first possible implementation of the first aspect, obtaining the historical workload and completing the mapping of data items to storage partitions using the event-driven partitioning strategy specifically includes:
obtaining workload information from the history log and modeling the workload using a hypergraph;
partitioning the workload hypergraph and establishing mapping relations from data items to physical data blocks;
processing the mapping relations using a hypergraph minimum-cut partitioning technique to realize placement of the physical data blocks among machines.
With reference to the first aspect, in a second possible implementation of the first aspect, obtaining the workload information from the history log and modeling the workload using a hypergraph specifically includes:
capturing the query workload over a period of time, taking data items as vertices and mapping each query to a hyperedge spanning multiple vertices, to establish a hypergraph;
compressing the hypergraph to generate the result hypergraph of the workload.
With reference to the first aspect, in a third possible implementation of the first aspect, selecting and locating a replica using the greedy algorithm after receiving the transaction request initiated by the application system specifically includes:
after receiving the transaction request initiated by the application system, computing and selecting the query span using a standard greedy algorithm, and selecting the minimum number of storage partitions that together contain all the data needed by the query.
With reference to the first aspect, in a fourth possible implementation of the first aspect, computing and selecting the query span using the standard greedy algorithm specifically includes:
calculating the size of the intersection of each storage partition with the query's item set, selecting the storage partition with the largest intersection, and removing all items of the query's item set contained in that storage partition;
iterating until no items remain in the query's item set, and outputting the resulting combination of storage partitions.
A second aspect of the present invention provides a workload-based cloud data processing system, comprising:
a cloud data replica placement module, which obtains the historical workload and completes the mapping of data items to storage partitions using an event-driven partitioning strategy;
a cloud data replica positioning module, which, after receiving a transaction request initiated by an application system, selects and locates a replica using a greedy algorithm.
With reference to the second aspect, in a first possible implementation of the second aspect, the cloud data replica placement module includes:
a workload processing unit, which takes the workload as input, performs modeling and management using a hypergraph, and outputs a result hypergraph after compressing the hypergraph;
a data partitioning unit, which partitions the result hypergraph and outputs mapping relations from data items to physical data partitions;
a data placement unit, which processes the mapping relations using the hypergraph minimum-cut partitioning technique to realize placement of the physical data blocks among machines.
With reference to the second aspect, in a second possible implementation of the second aspect, the cloud data replica positioning module includes:
an indexing unit, which completes the establishment of data item indexes, maintenance of mapping relations, and log management, and rapidly locates item positions;
a router unit, which automatically selects and sets routes according to transaction requests and forwards requests with an optimal route selection strategy;
a data engine unit, which receives transaction requests sent by the application system and returns processing results to the application system.
The workload-based cloud data processing system of the second aspect of the present invention can realize the method of the first aspect and each of its implementations, and obtains the same effects.
The effects given in this summary are only those of the embodiments, not all the effects of the invention. A technical solution among the above technical solutions has the following advantages or beneficial effects:
(1) The present invention supports elastic expansion of the data storage model and also solves the performance degradation caused by workload changes, ensuring efficient operation of applications.
(2) The hypergraph established from the workload can be modeled according to the data units accessed by transactions, realizing multi-level, fine-grained platform data management. Fine-grained management of the number of data replicas determines the required number of replicas for each data item, and a set-covering algorithm determines the minimum number of partitions required to satisfy a query.
(3) The replica selection and placement strategy of the invention minimizes the average query span of the transactions in the system. The strategy greatly reduces distributed transactions and improves the overall performance of the system. The system architecture makes data replicas more balanced and greatly improves the efficiency of platform data management. The dynamically adaptive data replica placement technique achieves higher scalability, increases fault tolerance, and improves responsiveness to workload changes.
Detailed description of the invention
Fig. 1 is a flowchart of the workload-based cloud data processing method;
Fig. 2 is a flowchart of step S1;
Fig. 3 is a flowchart of step S11;
Fig. 4 is a flowchart of step S2;
Fig. 5 is a structural schematic diagram of the workload-based cloud data processing system;
Fig. 6 is a structural schematic diagram of the cloud data replica placement module;
Fig. 7 is a structural schematic diagram of the cloud data replica positioning module.
Specific embodiment
To clarify the technical features of the invention, the invention is described in detail below through specific embodiments with reference to the accompanying drawings. It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the application. Unless otherwise specified, all technical and scientific terms used herein have the same meanings as commonly understood by those of ordinary skill in the technical field of the application. The following disclosure provides many different embodiments or examples for realizing different structures of the invention. To simplify the disclosure, components and settings of specific examples are described below. In addition, the invention may repeat reference numerals and/or letters in different examples. This repetition is for simplicity and clarity and does not in itself indicate a relationship between the various embodiments and/or settings discussed. Descriptions of known components, processing techniques, and processes are omitted to avoid unnecessarily limiting the invention.
Explanation of terms:
Hypergraph: a hypergraph is a generalized graph whose defining feature is that a single hyperedge can connect multiple vertices. A hypergraph H is a pair H = (V, E), where V is the set of vertices and E is a set of non-empty subsets of V (the hyperedges).
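As an illustration only (not part of the patent's disclosure; all identifiers below are hypothetical), such a hypergraph can be represented as a vertex set plus a collection of hyperedges, each hyperedge being a non-empty subset of the vertices:

```python
# Minimal illustrative hypergraph H = (V, E); identifiers are hypothetical, not from the patent.
class Hypergraph:
    def __init__(self):
        self.vertices = set()      # V: the set of vertices (here, data items)
        self.hyperedges = []       # E: each hyperedge is a non-empty subset of V

    def add_hyperedge(self, items):
        """Add one hyperedge connecting an arbitrary number of vertices."""
        edge = frozenset(items)
        if not edge:
            raise ValueError("a hyperedge must be non-empty")
        self.vertices |= edge
        self.hyperedges.append(edge)

# Example: one hyperedge connecting three data items.
h = Hypergraph()
h.add_hyperedge({"item_a", "item_b", "item_c"})
```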
A data item can be a relation in a database, a part of a relation, or any file. The goal is to store each data item on a machine while obeying the storage capacity constraints of the partitions. It should be noted that a partition need not be a machine; it can also represent a rack or even a data center.
The span of a query is defined as the smallest number of partitions that together contain all the data needed by the query.
In a cloud data storage environment, data nodes are unstable, data are easily lost, and users have different requirements for data availability, so replication is used. On the basis of analyzing the characteristics of the workload-based data replica placement problem, the present invention proposes a system architecture for cloud computing platform data management. As an improvement and refinement of existing data placement methods, the overall system architecture needs to be designed at a higher level, while a data replica placement method and a dynamic replica selection strategy are designed based on the workload information in the system history log.
To solve the data placement problem in cloud computing platforms while supporting rapid development and interaction to the maximum extent, the application designs an overall cloud data placement system framework based on the workload information in the system history log, and describes at a higher level a system architecture that supports the data replica placement method and the dynamic replica selection strategy. The historical query workload is modeled as a hypergraph whose vertices are data items, and the replica placement problem is modeled and analyzed with graph-theoretic concepts. A series of algorithms is developed to determine which data items need to be replicated and where these replicas should be placed. Based on this system architecture, a workload-driven data replica placement method and dynamic replica selection strategy are designed. Through these algorithms, the average query span of the transactions in the system is minimized.
As shown in Fig. 1, a workload-based cloud data processing method comprises the following steps:
S1: obtain the historical workload and complete the mapping of data items to storage partitions using an event-driven partitioning strategy;
S2: after receiving a transaction request initiated by an application system, select and locate a replica using a greedy algorithm.
As shown in Fig. 2, step S1 specifically includes:
S11: obtain workload information from the history log and model the workload using a hypergraph;
S12: partition the workload hypergraph and establish mapping relations from data items to physical data blocks;
S13: process the mapping relations using a hypergraph minimum-cut partitioning technique to realize placement of the physical data blocks among machines.
As shown in Fig. 3, step S11 specifically includes:
S111: capture the query workload over a period of time, take data items as vertices and map each query to a hyperedge spanning multiple vertices, and establish a hypergraph;
S112: compress the hypergraph to generate the result hypergraph of the workload.
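Purely as a sketch of step S111, with names assumed for illustration rather than taken from the patent, each captured query can contribute one hyperedge over the data items it accesses (compression per S112 is illustrated later, in the compression method section):

```python
# Sketch of S111: map each captured query to a hyperedge over the data items it accesses.
# Assumes `captured_queries` is an iterable of (query_id, accessed_item_ids) pairs taken
# from the history log; all names are illustrative only.
def build_workload_hypergraph(captured_queries):
    vertices = set()
    hyperedges = []
    for _query_id, accessed_items in captured_queries:
        edge = frozenset(accessed_items)
        if not edge:
            continue                      # skip queries that touched no data items
        vertices |= edge
        hyperedges.append(edge)
    return vertices, hyperedges

# Example workload captured over a period of time.
queries = [("q1", ["t1", "t2"]), ("q2", ["t2", "t3", "t4"])]
V, E = build_workload_hypergraph(queries)
```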
As shown in Fig. 4, step S2 specifically includes:
S21: calculate the size of the intersection of each storage partition with the query's item set, select the storage partition with the largest intersection, and remove all items of the query's item set contained in that storage partition;
S22: iterate until no items remain in the query's item set, and output the resulting combination of storage partitions.
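The following sketch illustrates steps S21 and S22 as the standard greedy covering procedure; the data structures and identifiers are assumptions made for illustration, not the patent's implementation:

```python
# Greedy selection sketch for S21/S22: repeatedly pick the storage partition whose contents
# overlap most with the remaining query items, until the query is fully covered.
def greedy_locate(query_items, partitions):
    """partitions: dict mapping partition_id -> set of data items stored in that partition."""
    remaining = set(query_items)
    chosen = []
    while remaining:
        # S21: choose the partition with the largest intersection with the remaining items.
        best = max(partitions, key=lambda p: len(partitions[p] & remaining))
        if not partitions[best] & remaining:
            break                          # remaining items are not stored in any partition
        chosen.append(best)
        remaining -= partitions[best]      # remove the items covered by the chosen partition
    # S22: output the combination of storage partitions (the query span is len(chosen)).
    return chosen

parts = {"p1": {"t1", "t2"}, "p2": {"t3"}, "p3": {"t2", "t3", "t4"}}
print(greedy_locate({"t1", "t3", "t4"}, parts))  # e.g. ['p3', 'p1']
```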
As shown in Fig. 5, a workload-based cloud data processing system comprises:
a cloud data replica placement module 10, which obtains the historical workload and completes the mapping of data items to storage partitions using an event-driven partitioning strategy;
a cloud data replica positioning module 11, which, after receiving a transaction request initiated by an application system, selects and locates a replica using a greedy algorithm.
As shown in Fig. 6, the cloud data replica placement module includes:
a workload processing unit 101, which takes the workload as input, performs modeling and management using a hypergraph, and outputs a result hypergraph after compressing the hypergraph.
The workload processing unit takes the workload of the system as input and performs modeling and management with a hypergraph: it establishes a large-scale hypergraph with tuples as vertices and the relationships between tuples as hyperedges, then compresses this hypergraph to generate the compressed result hypergraph of the workload. Unless otherwise stated, all hypergraphs mentioned hereinafter in the present invention refer to the result hypergraph.
a data partitioning unit 102, which partitions the result hypergraph and outputs mapping relations from data items to physical data partitions.
The data partitioning unit is responsible for the initial workload-aware data placement and replication decisions, and then carries out these decisions through appropriate data migration and replication. It processes the result hypergraph of the workload generated by the workload processing unit and partitions the result hypergraph using a general-purpose partitioning tool. The output of the data partitioning unit is the mapping from tuples to their physical data blocks.
a data placement unit 103, which processes the mapping relations using the hypergraph minimum-cut partitioning technique to realize placement of the physical data blocks among machines.
The data placement unit processes the mapping relations generated by the data partitioning unit and uses the minimum-cut technique of the hypergraph to realize placement of the physical data blocks among machines.
As shown in Fig. 7, the cloud data replica positioning module includes:
an indexing unit 111, which completes the establishment of data item indexes, maintenance of mapping relations, and log management, and rapidly locates item positions.
As the index management tool of the system architecture, the indexing unit provides functions such as metadata index establishment, maintenance of mapping relations, and log management. By rapidly locating item positions, the indexing unit improves the transaction processing capability and the throughput of the system.
a router unit 112, which automatically selects and sets routes according to transaction requests and forwards requests with an optimal route selection strategy.
The router unit connects the data engine with the index module, the data partitioning module, the data placement module, and the workload module, and is an important connecting component of the system architecture. According to the circumstances of the transaction processing requests, it automatically selects and sets routes and forwards requests with an optimal route selection strategy.
a data engine unit 113, which receives transaction requests sent by the application system and returns processing results to the application system.
The application system submits a transaction request through the interface provided by the data engine unit and then receives the returned result. The data engine unit uses a two-phase commit protocol to guarantee the atomicity, consistency, isolation, and durability required for the correct execution of transactions. The data engine unit sends transaction requests to the router unit; the router unit is responsible for computing the partitions a transaction needs to execute, and the transaction is then dispatched to the required data partitions and executed in parallel.
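As an informal illustration of the two-phase commit flow mentioned above (the interfaces and names are hypothetical assumptions, not the patent's code), the data engine could coordinate the partitions computed by the router roughly as follows:

```python
# Illustrative two-phase commit coordination sketch (hypothetical interfaces).
def execute_transaction(txn, router, participants):
    """router computes the partitions the transaction needs; each participant exposes
    prepare/commit/abort, the three actions of the two-phase commit protocol."""
    target_partitions = router.partitions_for(txn)          # partitions the transaction touches
    involved = [participants[p] for p in target_partitions]

    # Phase 1: ask every involved partition to prepare (vote yes/no).
    if all(node.prepare(txn) for node in involved):
        # Phase 2: all voted yes, so commit on every partition.
        for node in involved:
            node.commit(txn)
        return "committed"
    # Any "no" vote or failure: abort on every involved partition to preserve atomicity.
    for node in involved:
        node.abort(txn)
    return "aborted"
```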
Through abstraction at several levels, complexity is shielded from users; the data structures, external interfaces, and metadata-driven mechanisms of each level are refined, and the data management model is given an abstract, formalized description at every level.
To place data on the data nodes, the data are first defined as individual segments, each of which has its own replica set. Each replica of a segment is placed on some data node in the cloud according to a certain strategy. When the application program runs, it accesses the data engine module through an extended common data interface; according to the replica positioning strategy, the data engine module establishes a database connection with the data node where the data replica resides and returns the data responding to the user's request.
For the data placement method in the cloud environment, an event-driven partitioning strategy can be adopted: the workload of queries and transactions over a period of time is observed and captured, and this workload information is then used to realize data placement and reduce the number of distributed transactions to the maximum extent. According to the workload information, through a series of compression techniques, a hypergraph is established with tuples as vertices and the relationships between tuples as edges; a minimum-cut partitioning algorithm on the graph then yields the partitioning information of the tuples and the replica information of the partitions. This strategy greatly reduces distributed transactions and improves the overall performance of the system.
Specifically, the compression method is as follows:
Each node v ∈ V of the large-scale hypergraph is mapped to a virtual node v' ∈ V' of the result hypergraph by computing v' = f(pk_v), so that a group of nodes is ultimately compressed into one virtual node. Here pk_v denotes the primary key of node v, and f uses the hash function HF(pk_v) = hash(pk_v) mod N, where N is the required number of virtual nodes.
For each hyperedge e ∈ E of the large-scale hypergraph, let e' denote the set of virtual nodes onto which the nodes in e are mapped. If e' contains at least two virtual nodes, e' is added as a hyperedge of the result hypergraph H' = (V', E'). The hypergraph compression ratio (CR) is defined as the ratio of the number of nodes |V| of the large-scale hypergraph to the number of virtual nodes |V'| of the compressed result hypergraph, i.e. CR = |V| / |V'|. CR = 1 means no compression, and CR = |V| means that all original nodes are mapped onto one single virtual node.
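A minimal sketch of this compression method, under assumed helper names since the patent supplies no code, could look like the following:

```python
# Sketch of the compression method described above (illustrative names; the patent gives no code):
# each original node is mapped to a virtual node via hash(primary key) mod N, and only
# hyperedges that still span at least two virtual nodes are kept in the result hypergraph.
def compress_hypergraph(vertices, hyperedges, primary_key, n_virtual):
    """vertices: iterable of nodes; hyperedges: iterable of sets of nodes;
    primary_key: function node -> hashable key pk_v; n_virtual: required number N of virtual nodes."""
    def f(v):
        return hash(primary_key(v)) % n_virtual        # HF(pk_v) = hash(pk_v) mod N

    original = set(vertices)
    virtual_nodes = {f(v) for v in original}           # V': the virtual nodes actually used
    result_edges = set()
    for e in hyperedges:
        e_prime = frozenset(f(v) for v in e)           # e': virtual nodes the edge maps onto
        if len(e_prime) >= 2:                          # keep only edges spanning >= 2 virtual nodes
            result_edges.add(e_prime)

    cr = len(original) / max(len(virtual_nodes), 1)    # compression ratio CR = |V| / |V'|
    return virtual_nodes, result_edges, cr

# Example: compress a 4-node hypergraph onto N = 2 virtual nodes.
V = {"t1", "t2", "t3", "t4"}
E = [{"t1", "t2"}, {"t2", "t3", "t4"}]
print(compress_hypergraph(V, E, primary_key=lambda v: v, n_virtual=2))
```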
The historical query workload over a period of time is captured and expressed as a hypergraph: the nodes represent data items, and each query is correspondingly mapped to a hyperedge spanning multiple nodes. On the basis of the hypergraph, a replica count mark is added to each vertex. By using the minimum-cut partitioning technique, the minimum cut of the hypergraph gives the partitioning scheme of the data layout.
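For illustration only, the quantity that minimum-cut partitioning seeks to minimize, namely the number of hyperedges (queries) that cross partitions under a given layout, can be evaluated as in the sketch below; the partitioning algorithm itself is not prescribed here:

```python
# Sketch of the minimum-cut objective: count how many hyperedges (queries) span more than
# one partition under a given assignment. This only evaluates a layout; it is not a partitioner.
def hyperedge_cut(hyperedges, assignment):
    """hyperedges: iterable of item sets; assignment: dict item -> partition id."""
    cut = 0
    for edge in hyperedges:
        parts = {assignment[item] for item in edge}
        if len(parts) > 1:        # a query crossing partitions becomes a distributed transaction
            cut += 1
    return cut

edges = [{"t1", "t2"}, {"t2", "t3"}, {"t3", "t4"}]
layout = {"t1": 0, "t2": 0, "t3": 1, "t4": 1}
print(hyperedge_cut(edges, layout))  # 1: only the query {t2, t3} crosses partitions
```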
Setting replicas for data and replicating data aggressively bring the cost of distributed update transactions: as the number of replicas grows, consistency must be guaranteed whenever distributed transactions perform updates, which increases cost. To maintain the consistency of actively replicated copies while controlling the overhead associated with distributed updates, a fine-grained replication policy is adopted. In addition, the fine-grained replication policy also helps to improve fault tolerance by handling partition failures gracefully.
Based on the workload of the system, the relationships between data blocks are established with a hypergraph. Fine-grained management of the number of data replicas improves the replica utilization efficiency of the system. A fine-grained replica count control strategy is proposed; the strategy is defined at the level of tuple sets, which better controls the cost of distributed updates, improves system throughput, and provides adaptation to different workloads, enabling the system to better handle query workloads under different read and write access patterns.
The workload-based dynamic replica selection strategy comprises the following steps:
When a transaction request arrives, it must be quickly routed to a suitable data replica; the transaction module is responsible for responding to and processing the transaction request.
First, the cloud indexes and metadata need to be managed in the index module. Second, when a transaction request arrives, a sound and efficient replica location algorithm is needed: the index module first determines the physical data block of the replica, and the router module then determines the data partition of the replica, so that the transaction request is dispatched to a suitable data replica, reducing distributed access and improving the performance of the cloud system.
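A simplified, hypothetical sketch of this two-stage lookup (an index table mapping an item to its physical data block, and a routing table mapping the block to the nodes holding its replicas) might look as follows; neither structure is prescribed by the patent:

```python
# Illustrative two-stage replica lookup (hypothetical structures): the index maps a data item
# to its physical data block, the router maps the block to replica nodes, and one replica is chosen.
def locate_replica(item, index_table, routing_table, pick=min):
    block = index_table[item]               # index module: item -> physical data block
    replicas = routing_table[block]         # router module: block -> nodes holding replicas
    return pick(replicas)                   # route the request to one suitable replica

index_table = {"t1": "block_7"}
routing_table = {"block_7": ["node_2", "node_5"]}
print(locate_replica("t1", index_table, routing_table))  # node_2
```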
A standard greedy algorithm is used to locate data replicas by computing the span of queries and transaction requests. For each data partition, the size of its intersection with the query's item set is calculated. The partition with the largest intersection is selected, all items of the query's item set contained in that partition are removed, and the process iterates until no items remain in the query's item set. This is similar to the minimum set cover problem: given a group of subsets and a query set, the requirement is to find the minimum number of subsets needed to cover the query set.
Based on the data management model and data management techniques, a fast data replica positioning strategy is proposed for when transaction requests arrive. The size of the mapping table is determined by the compression ratio of the hypergraph: the heavier the compression, the smaller the mapping table. Depending on the computing and storage capacity of the data replica locator, a suitable compression ratio can be chosen to optimize overall performance. Two additional functions are incorporated to reduce query span and the cost of distributed updates: fine-grained management of the number of data replicas determines the required number of replicas for each data item, and a set-covering algorithm determines the minimum number of partitions required to satisfy a query. Replicas are selected with the standard greedy algorithm by computing the query span: for each partition, the size of its intersection with the query's item set is calculated; the partition with the largest intersection is selected, all items of the query's item set contained in that partition are removed, and the process iterates until no items remain.
Although the specific embodiments of the present invention have been described above with reference to the accompanying drawings, they do not limit the scope of protection of the present invention. Those skilled in the art should understand that, on the basis of the technical solutions of the present invention, various modifications or variations that can be made without creative effort still fall within the scope of protection of the present invention.

Claims (8)

1. A workload-based cloud data processing method, characterized by comprising the following steps:
a cloud data replica placement step: obtaining the historical workload and completing the mapping of data items to storage partitions using an event-driven partitioning strategy;
a cloud data replica positioning step: after receiving a transaction request initiated by an application system, selecting and locating a replica using a greedy algorithm.
2. The method according to claim 1, characterized in that obtaining the historical workload and completing the mapping of data items to storage partitions using the event-driven partitioning strategy specifically includes:
obtaining workload information from the history log and modeling the workload using a hypergraph;
partitioning the workload hypergraph and establishing mapping relations from data items to physical data blocks;
processing the mapping relations using a hypergraph minimum-cut partitioning technique to realize placement of the physical data blocks among machines.
3. The method according to claim 2, characterized in that obtaining the workload information from the history log and modeling the workload using a hypergraph specifically includes:
capturing the query workload over a period of time, taking data items as vertices and mapping each query to a hyperedge spanning multiple vertices, to establish a hypergraph;
compressing the hypergraph to generate the result hypergraph of the workload.
4. The method according to claim 1, characterized in that, after receiving the transaction request initiated by the application system, selecting and locating a replica using the greedy algorithm specifically includes:
after receiving the transaction request initiated by the application system, computing and selecting the query span using a standard greedy algorithm, and selecting the minimum number of storage partitions that together contain all the data needed by the query.
5. The method according to claim 4, characterized in that computing and selecting the query span using the standard greedy algorithm specifically includes:
calculating the size of the intersection of each storage partition with the query's item set, selecting the storage partition with the largest intersection, and removing all items of the query's item set contained in that storage partition;
iterating until no items remain in the query's item set, and outputting the resulting combination of storage partitions.
6. A workload-based cloud data processing system, characterized by comprising:
a cloud data replica placement module, which obtains the historical workload and completes the mapping of data items to storage partitions using an event-driven partitioning strategy;
a cloud data replica positioning module, which, after receiving a transaction request initiated by an application system, selects and locates a replica using a greedy algorithm.
7. The system according to claim 6, characterized in that the cloud data replica placement module includes:
a workload processing unit, which takes the workload as input, performs modeling and management using a hypergraph, and outputs a result hypergraph after compressing the hypergraph;
a data partitioning unit, which partitions the result hypergraph and outputs mapping relations from data items to physical data partitions;
a data placement unit, which processes the mapping relations using the hypergraph minimum-cut partitioning technique to realize placement of the physical data blocks among machines.
8. The system according to claim 6, characterized in that the cloud data replica positioning module includes:
an indexing unit, which completes the establishment of data item indexes, maintenance of mapping relations, and log management, and rapidly locates item positions;
a router unit, which automatically selects and sets routes according to transaction requests and forwards requests with an optimal route selection strategy;
a data engine unit, which receives transaction requests sent by the application system and returns processing results to the application system.
CN201810825782.0A 2018-07-25 2018-07-25 A kind of cloud data processing method and system based on workload Pending CN108984308A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810825782.0A CN108984308A (en) 2018-07-25 2018-07-25 A kind of cloud data processing method and system based on workload

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810825782.0A CN108984308A (en) 2018-07-25 2018-07-25 A kind of cloud data processing method and system based on workload

Publications (1)

Publication Number Publication Date
CN108984308A true CN108984308A (en) 2018-12-11

Family

ID=64550463

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810825782.0A Pending CN108984308A (en) 2018-07-25 2018-07-25 A kind of cloud data processing method and system based on workload

Country Status (1)

Country Link
CN (1) CN108984308A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021070013A1 (en) * 2019-10-07 2021-04-15 International Business Machines Corporation Ontology-based data storage for distributed knowledge bases
WO2021185338A1 (en) * 2020-03-19 2021-09-23 华为技术有限公司 Method, apparatus and device for managing transaction processing system, and medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103885839A (en) * 2014-04-06 2014-06-25 孙凌宇 Cloud computing task scheduling method based on multilevel division method and empowerment directed hypergraphs
CN103970879A (en) * 2014-05-16 2014-08-06 中国人民解放军国防科学技术大学 Method and system for regulating storage positions of data blocks
CN106294757A (en) * 2016-08-11 2017-01-04 上海交通大学 A kind of distributed data base divided based on hypergraph and clustered partition method thereof
CN106547854A (en) * 2016-10-20 2017-03-29 天津大学 Distributed file system storage optimization power-economizing method based on greedy glowworm swarm algorithm

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103885839A (en) * 2014-04-06 2014-06-25 孙凌宇 Cloud computing task scheduling method based on multilevel division method and empowerment directed hypergraphs
CN103970879A (en) * 2014-05-16 2014-08-06 中国人民解放军国防科学技术大学 Method and system for regulating storage positions of data blocks
CN106294757A (en) * 2016-08-11 2017-01-04 上海交通大学 A kind of distributed data base divided based on hypergraph and clustered partition method thereof
CN106547854A (en) * 2016-10-20 2017-03-29 天津大学 Distributed file system storage optimization power-economizing method based on greedy glowworm swarm algorithm

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
杨挺 et al.: "Energy-saving optimization algorithm for differentiated HDFS storage in cloud computing data centers" (云计算数据中心HDFS差异性存储节能优化算法), 《计算机学报》 (Chinese Journal of Computers) *
郭伟: "Research on data placement and replication strategies in cloud computing environments" (云计算环境中数据放置及复制策略研究), 《中国博士学位论文全文数据库 信息科技辑》 (China Doctoral Dissertations Full-text Database, Information Science and Technology) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021070013A1 (en) * 2019-10-07 2021-04-15 International Business Machines Corporation Ontology-based data storage for distributed knowledge bases
GB2604770A (en) * 2019-10-07 2022-09-14 Ibm Ontology-based data storage for distributed knowledge bases
WO2021185338A1 (en) * 2020-03-19 2021-09-23 华为技术有限公司 Method, apparatus and device for managing transaction processing system, and medium

Similar Documents

Publication Publication Date Title
CN106528773B (en) Map computing system and method based on Spark platform supporting spatial data management
DeWitt et al. Gamma-a high performance dataflow database machine
CN103106249B (en) A kind of parallel data processing system based on Cassandra
US10467245B2 (en) System and methods for mapping and searching objects in multidimensional space
CN105589951B (en) A kind of mass remote sensing image meta-data distribution formula storage method and parallel query method
US7457835B2 (en) Movement of data in a distributed database system to a storage location closest to a center of activity for the data
CN111327681A (en) Cloud computing data platform construction method based on Kubernetes
CN104484472B (en) A kind of data-base cluster and implementation method of a variety of heterogeneous data sources of mixing
CN103218404B (en) A kind of multi-dimensional metadata management method based on associate feature and system
CN110147407B (en) Data processing method and device and database management server
JPH05334165A (en) Parallel data base processing system and its secondary key retrieving method
CN105303456A (en) Method for processing monitoring data of electric power transmission equipment
CN109104464A (en) A kind of distributed data update method towards collaboration storage under edge calculations environment
Xiong et al. Data vitalization: a new paradigm for large-scale dataset analysis
CN104239377A (en) Platform-crossing data retrieval method and device
CN105550332B (en) A kind of provenance graph querying method based on the double-deck index structure
CN114647716B (en) System suitable for generalized data warehouse
CN102158533B (en) Distributed web service selection method based on QoS (Quality of Service)
CN105786918A (en) Data loading storage space-based data query method and device
CN106569896A (en) Data distribution and parallel processing method and system
CN106599190A (en) Dynamic Skyline query method based on cloud computing
CN105975345A (en) Video frame data dynamic equilibrium memory management method based on distributed memory
Fan et al. Intelligent resource scheduling based on locality principle in data center networks
CN108984308A (en) A kind of cloud data processing method and system based on workload
Kumar et al. M-Grid: a distributed framework for multidimensional indexing and querying of location based data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20181211