CN103593433A - Graph data processing method and system for massive time series data

Info

Publication number: CN103593433A
Application number: CN201310559846.4A
Authority: CN
Inventors: 周薇; 高赟; 冉攀峰; 韩冀中
Original assignee: Institute of Information Engineering of CAS
Current assignee: Institute of Information Engineering of CAS
Priority date: 2013-11-12
Filing date: 2013-11-12
Publication date: 2014-02-19
Anticipated expiration: 2033-11-12
Also published as: CN103593433B

Abstract

The invention relates to a graph data processing method and system for massive time series data. The method comprises: preprocessing social network data and abstracting a graph structure in which vertices represent persons and multiple timestamped edges represent the interactions between them, a representation that can effectively express social network relationships whose interactions are ordered in time; partitioning the graph structure into several graph structure blocks according to the celebrity effect and a preset Euclidean distance, and numbering the graph structure blocks and the vertices inside them; and importing the graph structure blocks into the corresponding memory locations according to the memory organization scheme, a storage scheme that makes full use of the distribution characteristics of the graph data and can achieve efficient storage and query performance. Guided by the principle of saving computation time and memory space, the method improves on the original programming model, which uses vertices as the unit of computation, by adopting a programming model that uses messages as the unit of computation, which saves a large amount of computation time as well as storage space.

Description

A graph data processing method and system for massive time series data

Technical field

The present invention relates to the field of large-scale graph data processing, and in particular to a graph data processing method and system for massive time series data.

Background art

With the rapid development of the Internet, social networking platforms have grown and spread quickly in recent years. When the relationships between people in a social network are represented as graph data, graph algorithms can be applied to mine the information hidden in those relationships. As social networks have developed, graph data has therefore attracted a new wave of interest.

However, as social networks have become more widespread, the volume of social network data has grown exponentially. A single-machine graph data processing program can no longer analyze and process graph data at this scale.

Given the popularity of social networks and the scale of their data, the approach currently used to process such data is to process it in parallel on many machines. These parallel schemes broadly follow the same idea. The huge graph is first partitioned according to some rule into many parts, and each part is stored on one of the machines. (The basis for dividing storage blocks is set using the denser regions as the benchmark. A storage block can therefore hold the data of a dense graph structure, but although it can also hold a sparser graph structure, doing so wastes a large amount of storage space.) When processing starts, each machine loads the graph data stored locally and computes on it, then exchanges intermediate results through a message-passing mechanism. After many iterations, the final result is obtained.

Among existing graph data processing solutions, two are most representative. The first is Pregel, proposed by Google in 2009, which performs computation using BSP (Bulk Synchronous Parallel), a synchronous parallel computing model. A graph application cannot obtain its final result in a single pass and needs many iterations, so a synchronization step takes place between every two iterations: the system waits until all tasks have finished computing before it synchronizes. Pregel's approach is simple and easy to understand, but the global synchronization between iterations is time-consuming, so processing efficiency is low. The other typical representative is Microsoft's Trinity system, announced in 2012. Compared with Pregel it has the following advantages. First, for the typical workload of graph data processing, graph data is stored in distributed memory instead of in a file system, which speeds up loading. Second, for specific graph applications, the synchronization between iterations is replaced with asynchronous processing, which reduces the overall performance cost of synchronization.
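The synchronization between iterations described above can be illustrated with a minimal single-process sketch. It is not Pregel's actual API: the function name `bsp_max_propagation`, the adjacency dictionary, and the choice of propagating the maximum value are all assumptions made for illustration. The point it shows is that messages produced in one superstep become visible only after every vertex has finished that superstep.

```python
from collections import defaultdict


def bsp_max_propagation(neighbors, values):
    """Propagate the largest value through the graph in synchronous supersteps.

    neighbors: {vertex: [adjacent vertices]} (hypothetical input format)
    values:    {vertex: initial value}
    """
    values = dict(values)

    # Superstep 0: every vertex sends its value to each of its neighbors.
    inbox = defaultdict(list)
    for vertex, adjacent in neighbors.items():
        for other in adjacent:
            inbox[other].append(values[vertex])

    supersteps = 1
    while inbox:
        outbox = defaultdict(list)
        # Compute phase: a vertex sees only messages from the previous superstep.
        for vertex, received in inbox.items():
            best = max(received)
            if best > values[vertex]:
                values[vertex] = best
                for other in neighbors[vertex]:
                    outbox[other].append(best)
        # Barrier: the new messages are handed over only after all vertices ran.
        inbox = outbox
        supersteps += 1
    return values, supersteps
```

In a distributed setting the barrier is where every machine waits for the slowest one, which is the cost the paragraph above attributes to this model.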

Nevertheless, graph data processing still faces the following problems:

Besides the connection relationships between people, graph data also contains interaction relationships that are tied to a time sequence. An interaction relationship differs from a connection relationship: a connection can be represented by a single edge in the graph structure, whereas an interaction is directly tied to time, so two vertices can have many interactions, each occurring at a different time. At present, however, large-scale graph data processing systems handle only the connection relationships between people and do not handle interactions that vary over time.

The distribution of a social network exhibits a celebrity effect: celebrities receive far more attention than ordinary people. The celebrity group therefore forms a dense part of the graph structure, and the people within this circle largely know one another. In a graph structure representing a social network, the dense part is thus relatively concentrated, and the great majority of ordinary people make up the sparse part. For a graph structure in which dense and sparse parts coexist, what storage and organization method achieves high efficiency? Existing graph data processing systems apply a uniform storage policy without considering the characteristics of the data, and without customization high performance cannot be reached. Efficient organization of graph data is therefore another significant problem for this field.

Graph data processing uses the graph vertex as the processing unit. When a batch of messages arrives, all of the messages are traversed for every vertex, which amounts to a matching process, and the arrived messages can be deleted only after every vertex has been visited. This matching process is time-consuming, because every message is traversed for each vertex, and it also occupies a large amount of memory, because the intermediate messages must be kept until all vertices have been traversed.

Summary of the invention

The technical problem to be solved by the present invention is to provide a graph data processing method and system for massive time series data that address the following shortcomings of existing graph data processing technology: temporal interaction relationships cannot be represented, graph data storage does not take full account of the distribution characteristics of the data, and the vertex-centric programming model wastes a great deal of computation time and storage space.

The technical solution by which the present invention solves the above technical problem is as follows: a graph data processing method for massive time series data, comprising the following steps:

Step 1: preprocess the social network data and abstract a graph structure in which vertices represent persons and multiple timestamped edges represent the interactions between persons;

Step 2: according to the celebrity effect, partition the graph structure into several graph structure blocks by a predetermined Euclidean distance, and number the graph structure blocks and the vertices inside them;

Step 3: distribute the graph structure blocks to different nodes for processing, each node importing the graph structure blocks it receives into the corresponding memory locations according to the memory organization scheme;

Step 4: the user writes an application job according to the message-based programming model and submits it to the application job processing unit;

Step 5: the application job processing unit obtains the data of the required graph structure blocks from memory and executes the application job in the message-based processing mode to obtain the result.

The beneficial effects of the present invention are as follows. The invention establishes multiple edges between two vertices; each edge represents one interaction between the two vertices and carries a timestamp recording when the edge was created, that is, when the interaction occurred. This representation can effectively express social network relationships whose interactions are ordered in time. For the particular distribution of graph data (the celebrity effect), the invention designs a storage scheme that makes full use of that distribution and can achieve efficient storage and query performance. Guided by the principle of saving computation time and memory space, the invention improves on the original vertex-centric programming model by adopting a programming model in which the message is the unit of computation, which saves a large amount of computation time as well as storage space.

On the basis of the above technical solution, the present invention can be further improved as follows.

Further, the specific steps of preprocessing the social network data in Step 1 are:

Step 1.1: extract the individual persons from the social network data to form the vertices of the graph structure;

Step 1.2: extract the interactions between persons from the social network data, each interaction forming one edge of the graph structure;

Step 1.3: set on each edge of the graph structure a timestamp recording when the edge was created.
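Steps 1.1 to 1.3 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the `Interaction` record, its field names, and the use of integer timestamps are assumptions, since the source does not specify the format of the raw social network data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Interaction:
    """One raw interaction record (hypothetical input format)."""
    source: str
    target: str
    timestamp: int   # assumed to be an integer, e.g. seconds since some epoch
    kind: str        # e.g. "comment", "repost", "like"


def build_temporal_graph(records):
    """Return (vertices, edges) for a multigraph with timestamped edges."""
    vertices = set()
    edges = defaultdict(list)        # (source, target) -> [(timestamp, kind), ...]
    for record in records:
        # Step 1.1: every person mentioned becomes a vertex.
        vertices.add(record.source)
        vertices.add(record.target)
        # Steps 1.2 and 1.3: every interaction becomes its own edge,
        # labeled with the time at which it occurred.
        edges[(record.source, record.target)].append((record.timestamp, record.kind))
    for pair in edges:
        edges[pair].sort()           # keep each pair's edges in time order
    return vertices, dict(edges)
```

Keeping one list entry per interaction, instead of one edge per pair of persons, is what allows the graph to hold several edges between the same two vertices.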

Further, in Step 2, partitioning the graph structure into several graph structure blocks by Euclidean distance according to the celebrity effect, and numbering the graph structure blocks and the vertices inside them, is implemented as follows:

Step 2.1: count the number of edges connected to each vertex in the graph structure and compute the average density of the graph structure;

Step 2.2: set the Euclidean distance used for partitioning according to the average density of the graph structure, and partition the graph structure by that Euclidean distance to obtain several graph structure blocks;

Step 2.3: number the resulting graph structure blocks to obtain block numbers, and number each vertex within each graph structure block; the identifier of each vertex is its block number followed by its vertex number.
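The counting in Step 2.1 and the numbering in Step 2.3 can be sketched as below. The source does not say how the Euclidean distance is derived from the average density or how the cut in Step 2.2 is made, so that step is deliberately left out: `number_blocks` simply takes an already partitioned list of blocks as input. The function names and the `(u, v, timestamp)` edge format are assumptions.

```python
from collections import Counter


def degree_stats(edges):
    """Step 2.1: count edges per vertex and return (degree, mean degree).

    edges: iterable of (u, v, timestamp) tuples; parallel edges all count.
    """
    degree = Counter()
    for u, v, _timestamp in edges:
        degree[u] += 1
        degree[v] += 1
    mean = sum(degree.values()) / len(degree) if degree else 0.0
    return degree, mean


def number_blocks(blocks):
    """Step 2.3: give each vertex the identifier (block number, vertex number).

    blocks: list of lists of vertices, i.e. the result of some partitioning
    step that is not modeled here.
    """
    identifiers = {}
    for block_no, block in enumerate(blocks):
        for vertex_no, vertex in enumerate(block):
            identifiers[vertex] = (block_no, vertex_no)
    return identifiers
```

Because the block number is part of every vertex identifier, a message addressed to a vertex also says which block, and therefore which task, should receive it.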

Further, in Step 3, the specific steps by which each node imports the graph structure blocks it receives into the corresponding memory locations according to the memory organization scheme are as follows:

Step 3.1: allocate a memory region and divide it into N memory partitions of fixed size for storing N graph structure blocks;

Step 3.2: within each memory partition, allocate a memory block of fixed size for each vertex to store the vertex data and relation data of that vertex;

Step 3.3: determine whether the vertex exhibits the celebrity effect. If it does, and the memory block cannot hold the vertex data and all relation data of the vertex, store the vertex data and part of the relation data in the memory block, allocate one or more additional memory blocks of fixed size, store the remaining relation data in the additional memory blocks, and point to the additional memory blocks with pointers. Otherwise, store the vertex data and all relation data of the vertex directly in the corresponding memory block;

Step 3.4: in the additional memory blocks, build an index with time as the primary key and the vertex in the graph structure as the secondary key.
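The fixed-size block with chained additional blocks from Steps 3.2 and 3.3 can be sketched as below. This is a simplified model under stated assumptions: capacity is counted in relation entries instead of bytes, vertex data is omitted, and the class and method names (`MemoryBlock`, `Partition`, `add_relation`) are hypothetical. The index of Step 3.4 is not included.

```python
class MemoryBlock:
    """A fixed-capacity block; `next` points to an additional block, if any."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.relations = []          # (timestamp, neighbor) entries
        self.next = None


class Partition:
    """One memory partition holding one graph structure block."""

    def __init__(self, block_capacity):
        if block_capacity < 1:
            raise ValueError("block_capacity must be at least 1")
        self.block_capacity = block_capacity
        self.primary = {}            # vertex -> its primary MemoryBlock

    def add_relation(self, vertex, timestamp, neighbor):
        block = self.primary.get(vertex)
        if block is None:
            block = self.primary[vertex] = MemoryBlock(self.block_capacity)
        # Walk the chain to the first block with free space, allocating an
        # additional block only when the current one is full.
        while len(block.relations) >= block.capacity:
            if block.next is None:
                block.next = MemoryBlock(self.block_capacity)
            block = block.next
        block.relations.append((timestamp, neighbor))

    def relations(self, vertex):
        """Yield all relations of a vertex, primary block first."""
        block = self.primary.get(vertex)
        while block is not None:
            yield from block.relations
            block = block.next

    def chain_length(self, vertex):
        """Number of blocks (primary plus additional) used by a vertex."""
        length, block = 0, self.primary.get(vertex)
        while block is not None:
            length, block = length + 1, block.next
        return length
```

With this layout a sparse vertex costs exactly one small block, while a dense vertex grows by whole additional blocks, so the block size can be chosen for the common sparse case.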

Further, in Step 5, the application job processing unit obtaining the required data from memory and executing the application job in the message-based processing mode to obtain the result is implemented as follows:

Step 5.1: the application job execution unit executes the application job, which comprises several tasks, each task managing one graph structure block;

Step 5.2: each task obtains the required data from the graph structure block it manages by vertex number, processes the data in the graph structure block according to the processing logic of the application job, generates a number of messages, and sends them to other tasks once processing is complete;

Step 5.3: the task receiving the messages parses each arriving message in turn and extracts the destination vertex that the message should reach;

Step 5.4: update the value of the destination vertex, and then delete the message;

Step 5.5: determine whether any unprocessed messages remain; if so, return to Step 5.3, otherwise finish.
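The receiving side of Steps 5.3 to 5.5 can be sketched as a loop that consumes one message at a time. The message format `((block_no, vertex_no), payload)`, the `Task` class, and the `update` callback standing in for the job's processing logic are assumptions for illustration; message generation in Step 5.2 is not modeled.

```python
from collections import deque


class Task:
    """One task manages one graph structure block (Step 5.1)."""

    def __init__(self, block_no, values):
        self.block_no = block_no
        self.values = dict(values)   # vertex number -> current value
        self.inbox = deque()

    def drain(self, update):
        """Process every queued message; return how many were handled."""
        handled = 0
        while self.inbox:                                        # Step 5.5
            # Step 5.3: take the next message and read its destination vertex.
            (_block_no, vertex_no), payload = self.inbox.popleft()
            # Step 5.4: update the destination vertex. The message has already
            # left the queue, so nothing is retained once the update is done.
            self.values[vertex_no] = update(self.values[vertex_no], payload)
            handled += 1
        return handled


def route(tasks, messages):
    """Deliver each message to the task that owns its destination block."""
    for message in messages:
        (block_no, _vertex_no), _payload = message
        tasks[block_no].inbox.append(message)
```

A job that sums incoming payloads would call `route(tasks, messages)` and then `task.drain(lambda old, payload: old + payload)` on each task.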

The technical solution by which the present invention solves the above technical problem is further as follows: a graph data processing system for massive time series data, comprising a storage layer, a computation layer, and a user layer;

The storage layer is used to build a graph structure that can represent the interactions between persons and to store the graph structure in a customized manner according to the celebrity effect;

The user layer is used to write application jobs based on message processing and submit them to the computation layer;

The computation layer is used to obtain the data of the required graph structure blocks from the storage layer and execute the application job in the message-based processing mode to obtain the result.

Further, the storage layer comprises a data preprocessing unit, a graph structure partitioning unit, a data import unit, and a memory unit;

The data preprocessing unit is used to preprocess the social network data and abstract a graph structure in which vertices represent persons and multiple timestamped edges represent the interactions between persons;

The graph structure partitioning unit is used to partition the graph structure into several graph structure blocks by a predetermined Euclidean distance according to the celebrity effect, and to number the graph structure blocks and the vertices inside them;

The data import unit is used to distribute the graph structure blocks to different nodes for processing, each node importing the graph structure blocks it receives into the corresponding memory locations according to the memory organization scheme;

The memory unit comprises the memory of several nodes, each used to store the imported graph structure block data.

Further, the user layer comprises an application job writing unit and an application job submission unit;

The application job writing unit is used to write application jobs based on message processing and send them to the application job submission unit;

The application job submission unit is used to submit the application job to the application job processing unit of the computation layer.

Further, the computation layer comprises an application job processing unit, which is used to execute the application job in the message processing mode.

Further, the graph application job comprises several tasks, each task being responsible for processing one graph structure block.
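How the three layers hand work to one another can be sketched with three small classes. This is only a structural illustration under assumptions: the class and method names are hypothetical, the individual units inside each layer are collapsed into one class per layer, and a job is reduced to a message handler plus a list of `((block_no, vertex_no), payload)` messages.

```python
class StorageLayer:
    """Holds imported graph structure blocks in memory."""

    def __init__(self):
        self._blocks = {}            # block number -> {vertex number: value}

    def import_block(self, block_no, vertex_values):
        self._blocks[block_no] = dict(vertex_values)

    def block(self, block_no):
        return self._blocks[block_no]


class ComputationLayer:
    """Executes a message-based job against blocks held by the storage layer."""

    def __init__(self, storage):
        self._storage = storage

    def execute(self, handler, messages):
        for (block_no, vertex_no), payload in messages:
            block = self._storage.block(block_no)
            block[vertex_no] = handler(block[vertex_no], payload)


class UserLayer:
    """Where a job is written and then submitted to the computation layer."""

    def __init__(self, computation):
        self._computation = computation

    def submit(self, handler, messages):
        self._computation.execute(handler, messages)
```

The user layer never touches stored data directly; it only passes a job down, and the computation layer is the sole reader and writer of the storage layer.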

Brief description of the drawings

Fig. 1 is a flowchart of the graph data processing method for massive time series data according to the present invention;

Fig. 2 is a flowchart of the specific implementation of Step 1 of the present invention;

Fig. 3 is a flowchart of the specific implementation of Step 2 of the present invention;

Fig. 4 is a flowchart of the specific implementation of Step 3 of the present invention;

Fig. 5 is a flowchart of the specific implementation of Step 5 of the present invention;

Fig. 6 is a structural block diagram of the graph data processing system for massive time series data according to the present invention;

Fig. 7 is a schematic diagram of part of the graph structure in Embodiment 1 of the present invention;

Fig. 8 is a schematic diagram of the storage organization of one memory partition in the memory of a node in Embodiment 2 of the present invention;

Fig. 9 is a schematic diagram of the prior-art processing procedure in which the vertex is the processing unit;

Fig. 10 is a schematic diagram of the processing procedure of the present invention in which the message is the processing unit.

The components denoted by the reference numerals in the drawings are as follows:

1, storage layer; 2, user layer; 3, computation layer; 1-1, data preprocessing unit; 1-2, graph structure partitioning unit; 1-3, data import unit; 1-4, memory unit; 2-1, application job writing unit; 2-2, application job submission unit; 3-1, application job processing unit; 101, memory block; 102, additional memory block; 201, vertex; 202, message.

Detailed description of the embodiments

The principles and features of the present invention are described below with reference to the accompanying drawings. The examples given serve only to explain the invention and are not intended to limit its scope.

As shown in Fig. 1, a graph data processing method for massive time series data comprises the following steps:

As shown in Fig. 2, the specific steps of preprocessing the social network data in Step 1 are:

As shown in Fig. 3, in Step 2, partitioning the graph structure into several graph structure blocks by Euclidean distance according to the celebrity effect, and numbering the graph structure blocks and the vertices inside them, is implemented as follows:

As shown in Fig. 4, in Step 3, the specific steps by which each node imports the graph structure blocks it receives into the corresponding memory locations according to the memory organization scheme are as follows:

As shown in Fig. 5, in Step 5, the application job processing unit obtaining the required data from memory and executing the application job in the message-based processing mode to obtain the result is implemented as follows:

Step 5.4: update the value of the destination vertex, and then delete the message;

As shown in Fig. 6, a graph data processing system for massive time series data comprises a storage layer 1, a user layer 2, and a computation layer 3;

The storage layer 1 is used to build a graph structure that can represent the interactions between persons and to store the graph structure in a customized manner according to the celebrity effect;

The user layer 2 is used to write application jobs based on message processing and submit them to the computation layer 3;

The computation layer 3 is used to obtain the data of the required graph structure blocks from the storage layer 1 and execute the application job in the message-based processing mode to obtain the result.

The storage layer 1 comprises a data preprocessing unit 1-1, a graph structure partitioning unit 1-2, a data import unit 1-3, and a memory unit 1-4;

The data preprocessing unit 1-1 is used to preprocess the social network data and abstract a graph structure in which vertices represent persons and multiple timestamped edges represent the interactions between persons;

The graph structure partitioning unit 1-2 is used to partition the graph structure into several graph structure blocks by a predetermined Euclidean distance according to the celebrity effect, and to number the graph structure blocks and the vertices inside them;

The data import unit 1-3 is used to distribute the graph structure blocks to different nodes for processing, each node importing the graph structure blocks it receives into the corresponding memory locations according to the memory organization scheme;

The memory unit 1-4 comprises the memory of several nodes, each used to store the imported graph structure block data.

The user layer 2 comprises an application job writing unit 2-1 and an application job submission unit 2-2;

The application job writing unit 2-1 is used to write application jobs based on message processing and send them to the application job submission unit 2-2;

The application job submission unit 2-2 is used to submit the application job to the application job processing unit 3-1 of the computation layer.

The computation layer 3 comprises an application job processing unit 3-1, which is used to execute the application job in the message processing mode.

The graph application job comprises several tasks, each task being responsible for processing one or more graph structure blocks with adjacent numbers.

Besides the connection relationships between people, graph data also contains interaction relationships between people that are tied to a time sequence. An interaction relationship differs from a connection relationship: a connection can be represented by a single edge in the graph structure, whereas an interaction is directly tied to time, so two vertices can have many interactions, each occurring at a different time. At present, however, large-scale graph data processing systems handle only the connection relationships between people and do not handle interactions that vary over time.

The present invention designs a graph structure in which vertices represent persons and timestamped edges represent the interactions between persons. This representation establishes multiple edges between two vertices. Fig. 7 is a schematic diagram of part of a graph structure in Embodiment 1 of the invention that can represent the interactions between persons; the value on each edge is the time the edge was created, that is, the time the interaction occurred. This representation can effectively express social network relationships whose interactions are ordered in time. Beyond representing the graph structure of the social network, this multi-version form of data can also represent the time-based interactions between vertices in the graph structure, referred to for short as temporal relationships. Once abstracted in this way, the graph structure differs fundamentally from the original one. In the original graph structure, every two vertices are connected by only one edge (in an undirected graph) or two edges (in a directed graph). In the improved graph structure, every two vertices can be connected by many edges, each representing one timestamped interaction between the two vertices, such as commenting on a photo, commenting on a status update, reposting, or "liking".
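The difference between the original and the improved graph structure can be made concrete with two small views over the same interaction list. The function names and the `(u, v, timestamp)` tuple format are assumptions for illustration.

```python
from collections import Counter


def connection_view(interactions):
    """Original structure: at most one directed edge per ordered pair."""
    return {(u, v) for u, v, _timestamp in interactions}


def edge_multiplicity(interactions):
    """Improved structure: one edge per interaction, so pairs can repeat."""
    return Counter((u, v) for u, v, _timestamp in interactions)
```

The connection view answers only whether two persons are linked, while the multiplicity view retains how many separate interactions stand behind each link, and the timestamps in the underlying list retain when they happened.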

Existing data storage methods set the basis for dividing storage blocks using the denser regions as the benchmark. A storage block can therefore hold the data of a dense graph structure, but although it can also hold a sparser graph structure, doing so wastes a large amount of storage space. As for querying, the prior art stores all the data of a graph structure block (its vertex data and relation data, where the relation data includes the direction of each edge representing an interaction and its timestamp) in the allocated memory block, so a query has to traverse all the data in the memory partition, which reduces query efficiency.

In a social network, people differ in how active and how popular they are. As one would expect, a celebrity's microblog receives more attention than an ordinary person's. The vertex representing a celebrity therefore has more edges connected to it and, by the same reasoning, more interactions. In a social network, then, the celebrity circle can be understood as a dense graph structure and ordinary people as a sparse one. Celebrities are nevertheless a very small group relative to ordinary people. For this particular distribution of graph data (the celebrity effect), the present invention designs a storage scheme that makes full use of the distribution and can achieve efficient storage and query performance.

Fig. 8 is a schematic diagram of one memory partition storing a graph structure block in the memory of a node in Embodiment 2 of the invention. According to how the graph structure has been partitioned into blocks, the invention allocates a region of memory, divides it into N memory partitions of fixed size, and stores each graph structure block in its corresponding memory partition. The specific storage method is as follows. Within a memory partition, a memory block 101 of fixed size is allocated for each vertex. The data of each vertex comprises vertex data and relation data, the relation data comprising edge data and timestamp data, and the data of each vertex is placed in one memory block. When a vertex has a high density (that is, it has a large amount of relation data), the vertex data and a small part of the relation data are stored in the allocated memory block, several additional memory blocks 102 are allocated and pointed to by pointers, and the remaining relation data is stored in the additional memory blocks 102. The invention also builds, in the additional memory blocks 102, an index with time as the first key and the vertex as the second key, which makes the data easy to look up.
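An index ordered first by time and then by vertex can be sketched with a sorted list of `(timestamp, vertex)` tuples and binary search. The source does not describe the index structure, so the sorted list, the class name `TimeVertexIndex`, and the assumption that timestamps are integers are all choices made for this illustration.

```python
from bisect import bisect_left, insort


class TimeVertexIndex:
    """Entries are kept sorted by (timestamp, vertex): time is the first key."""

    def __init__(self):
        self._keys = []

    def add(self, timestamp, vertex):
        insort(self._keys, (timestamp, vertex))

    def in_window(self, start, end, vertex=None):
        """Return entries with start <= timestamp <= end (integer timestamps).

        If `vertex` is given, only entries for that vertex are returned.
        """
        # A 1-tuple sorts before every 2-tuple that shares its first element,
        # so (start,) locates the first entry whose timestamp is >= start.
        low = bisect_left(self._keys, (start,))
        high = bisect_left(self._keys, (end + 1,))
        window = self._keys[low:high]
        if vertex is None:
            return window
        return [key for key in window if key[1] == vertex]
```

Because time is the leading key, a time-window query touches only the entries inside the window instead of scanning every relation stored for a dense vertex.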

Fig. 9 is a schematic diagram of the prior-art processing procedure in which the vertex is the processing unit. Traditional graph computation models all use the vertex as the processing unit. In each iteration, once the messages 202 sent by neighboring tasks have been received, processing consists of traversing all messages 202 for each vertex 201, finding the messages 202 that match that vertex 201, and extracting them. Only after all vertices 201 have been traversed can the messages 202 be deleted. This consumes a large amount of computing and storage resources.

Fig. 10 is a schematic diagram of the processing procedure of the present invention in which the message is the processing unit. In this programming model, each time a message 202 arrives it is parsed directly, the vertex 201 corresponding to the message 202 is located and its computation performed, and the message 202 can be deleted at once. In this mode the amount of computation depends only on the number of messages 202, and there is no need to hold all messages at the same time, which saves a large amount of computation time and storage space.
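The two processing modes can be compared with a small sketch that counts the work each one does on the same batch. Both functions, the `(destination, payload)` message format, and the use of addition as the update are assumptions for illustration; neither is taken from an existing system.

```python
from collections import deque


def vertex_centric(values, messages):
    """Prior-art style: each vertex scans the whole batch for its messages."""
    values = dict(values)
    comparisons = 0
    for vertex in values:
        for destination, payload in messages:
            comparisons += 1
            if destination == vertex:
                values[vertex] += payload
    # Only here, after every vertex has been visited, can the batch be freed.
    return values, comparisons


def message_centric(values, messages):
    """Message as the unit: look up the destination directly, then drop it."""
    values = dict(values)
    pending = deque(messages)
    lookups = 0
    while pending:
        destination, payload = pending.popleft()   # message is released here
        values[destination] += payload
        lookups += 1
    return values, lookups
```

For V vertices and M messages the first function performs V times M comparisons and holds the whole batch until the end, while the second performs M lookups and its queue shrinks with every message handled.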

The foregoing describes only preferred embodiments of the present invention and is not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims

1. A graph data processing method for massive time series data, characterized by comprising the following steps:

2. The graph data processing method for massive time series data according to claim 1, characterized in that the specific steps of preprocessing the social network data in Step 1 are:

3. The graph data processing method for massive time series data according to claim 1, characterized in that, in Step 2, partitioning the graph structure into several graph structure blocks by Euclidean distance according to the celebrity effect, and numbering the graph structure blocks and the vertices inside them, is implemented as follows:

4. The graph data processing method for massive time series data according to claim 1, characterized in that, in Step 3, the specific steps by which each node imports the graph structure blocks it receives into the corresponding memory locations according to the memory organization scheme are as follows:

5. The graph data processing method for massive time series data according to claim 1, characterized in that, in Step 5, the application job processing unit obtaining the required data from memory and executing the application job in the message-based processing mode to obtain the result is implemented as follows:

Step 5.4: update the value of the destination vertex, and then delete the message;

6. A graph data processing system for massive time series data, characterized by comprising a storage layer, a computation layer, and a user layer;

7. The graph data processing system for massive time series data according to claim 6, characterized in that the storage layer comprises a data preprocessing unit, a graph structure partitioning unit, a data import unit, and a memory unit;

8. The graph data processing system for massive time series data according to claim 6, characterized in that the user layer comprises an application job writing unit and an application job submission unit;

9. The graph data processing system for massive time series data according to claim 6, characterized in that the computation layer comprises an application job processing unit, which is used to execute the application job in the message processing mode.

10. The graph data processing system for massive time series data according to claim 6, characterized in that the graph application job comprises several tasks, each task being responsible for processing one graph structure block.