CN107038260A

CN107038260A - An efficient parallel loading method that maintains real-time data consistency in Titan

Info

Publication number: CN107038260A
Application number: CN201710390469.4A
Authority: CN
Inventors: 毛洪亮; 唐积强; 王秀文; 李焱余; 苏沐冉; 马秀娟; 吴震; 徐小磊; 张露晨; 李传海; 李斌斌; 蒲路; 谢铭
Original assignee: BEIJING SCISTOR TECHNOLOGY Co Ltd; National Computer Network and Information Security Management Center
Current assignee: BEIJING SCISTOR TECHNOLOGY Co Ltd; National Computer Network and Information Security Management Center
Priority date: 2017-05-27
Filing date: 2017-05-27
Publication date: 2017-08-11
Anticipated expiration: 2037-05-27
Also published as: CN107038260B

Abstract

The invention discloses an efficient parallel loading method that maintains real-time data consistency in Titan, and belongs to the field of big data processing. First, loading into Titan is divided into 7 modules that work concurrently. The cleaning rule management module updates the filtering rules in real time. The data receiving module receives pieceOfData records and puts them into queue1. The data cleaning module filters the data and puts the qualifying records into queue2. The ID conversion module interacts with the high-speed index module to determine whether the correspondence between the two points in the current pieceOfData record and their Titan internal IDs already exists in the graph database. If so, the points are replaced with the Titan internal ID attribute and ID value, and the result is saved as a pieceOfDataT record and put into queue4. Otherwise, the points that have not been loaded are put into a HashSet and the corresponding pieceOfData record is put into queue3. The remaining data loading module loads pieceOfDataT records into Titan with multiple threads in parallel. The point loading module adds the points in the HashSet to Titan and adds the correspondence between each point and its Titan ID to the high-speed index module. Each module completes part of the work alone or through interaction, which improves loading efficiency overall.

Description

An efficient parallel loading method that maintains real-time data consistency in Titan

Technical field

The invention belongs to the field of big data processing and relates to an efficient and safe method for preprocessing and loading real-time data into a graph database, specifically an efficient parallel loading method that maintains real-time data consistency in Titan.

Background art

With the continuous development of computer technology and the rising level of informatization, data volumes are growing rapidly and data structures are becoming increasingly complex. Traditional relational databases are difficult to use in many scenarios, and various non-relational databases have emerged as a result.

The graph database is one kind of non-relational database and is well suited to storing all kinds of relational network data. Among the many graph databases, Titan is an excellent and easy-to-use distributed graph database with high scalability: expanding the cluster raises the upper limit of graph storage linearly, and storage and retrieval of very large graphs are supported. It is therefore used in many scenarios. When loading real-time data, however, Titan can only load with a single thread in order to guarantee data consistency. Real-time loading is therefore inefficient and severely limited, and cannot meet the loading demands of high-volume real-time data.

Summary of the invention

To address the inefficiency and lack of safety of the graph database Titan in the prior art when handling high-volume real-time data, the invention provides an efficient parallel loading method that maintains real-time data consistency in Titan.

The specific steps are as follows:

Step 1: Loading into the graph database Titan is divided into 7 modules, and the 7 modules run concurrently;

The 7 modules are: the data receiving module, the cleaning rule management module, the data cleaning module, the ID conversion module, the high-speed index module, the point loading module, and the remaining data loading module;

The data receiving module receives the data to be processed and puts it into a bounded queue;

The cleaning rule management module updates the filtering rules dynamically by monitoring the rule file;

The data cleaning module filters unwanted data out of the bounded queue according to the rules given by the cleaning rule management module;

The ID conversion module replaces each point in the cleaned data with the ID of the corresponding point in the graph database.

The high-speed index module speeds up ID conversion.

The point loading module loads the points that were found during ID conversion not to exist in the graph database, and after loading adds each point and its ID correspondence to the high-speed index module.

The remaining data loading module greatly increases the loading speed of graph data through parallel loading.

Step 2: The data receiving module runs multiple threads concurrently; each thread loops, obtaining data from a data source such as a message queue or a CSV file, parsing it into multiple pieceOfData records, and putting them into bounded queue queue1.

A pieceOfData record consists of two points, the relation between the two points, and the attributes of the points and of the relation;

Bounded queue queue1 stores the data obtained from the data source;
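As an illustration of Step 2, the following minimal sketch has two receiver threads parse CSV text into pieceOfData records and put them into a bounded queue. The CSV column names, the dictionary layout of a record, and the queue capacity are assumptions made for the example; the patent does not specify them.

```python
import csv
import io
import queue
import threading

def parse_row(row):
    """Turn one CSV row into a pieceOfData dict (hypothetical layout)."""
    return {
        "src": ("uid", row["src_uid"]),      # first point, as a key-value pair
        "dst": ("uid", row["dst_uid"]),      # second point
        "relation": row["relation"],         # relation between the two points
        "attrs": {"weight": row["weight"]},  # attribute of the relation
    }

def receive(source, queue1):
    """Read every row of one data source; put() blocks while queue1 is full."""
    for row in csv.DictReader(source):
        queue1.put(parse_row(row))

CSV_TEXT = "src_uid,dst_uid,relation,weight\n9867,1234,calls,3\n1234,5555,calls,1\n"

queue1 = queue.Queue(maxsize=1000)  # bounded queue: producers wait when it is full
threads = [
    threading.Thread(target=receive, args=(io.StringIO(CSV_TEXT), queue1))
    for _ in range(2)  # each thread reads its own copy of the sample source
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because the queue is bounded, a slow downstream stage applies back-pressure to the receivers instead of letting memory grow without limit.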

Step 3: The cleaning rule management module reads the rule configuration file periodically, or reads it when a client request is received, and updates the filtering rules dynamically in real time;
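One way to picture Step 3 is a small manager object that re-reads the rule file whenever its modification time changes. This is only a sketch: the class name `RuleManager`, the method names, and the rule format are hypothetical, and the patent's actual rule file structure (its Annex 1) is not reproduced here. A timer thread or a client request handler would call `reload_if_changed()`.

```python
import json
import os
import tempfile
import threading

class RuleManager:
    """Holds the current filtering rules and reloads them when the file changes."""

    def __init__(self, path):
        self._path = path
        self._lock = threading.Lock()
        self._mtime = None
        self._rules = []
        self.reload_if_changed()

    def reload_if_changed(self):
        """Re-read the rule file if its modification time changed; True on reload."""
        mtime = os.stat(self._path).st_mtime_ns
        if mtime == self._mtime:
            return False
        with open(self._path, encoding="utf-8") as f:
            rules = json.load(f)
        with self._lock:  # cleaning threads read the rules concurrently
            self._rules, self._mtime = rules, mtime
        return True

    def rules(self):
        with self._lock:
            return list(self._rules)

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "rules.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump([{"field": "relation", "type": "exact", "pattern": "spam"}], f)
    manager = RuleManager(path)
    first = manager.rules()

    with open(path, "w", encoding="utf-8") as f:
        json.dump([], f)  # the operator edits the rule file
    st = os.stat(path)
    # Push the mtime forward so the change is visible even on coarse clocks.
    os.utime(path, ns=(st.st_atime_ns, st.st_mtime_ns + 10_000_000_000))
    reloaded = manager.reload_if_changed()
    second = manager.rules()
```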

Step 4: The data cleaning module works with multiple threads in parallel; each thread loops, taking one pieceOfData record at a time from bounded queue queue1 and checking it against the cleaning rules. If the record meets a filter condition it is discarded; otherwise it is put into bounded queue queue2.

queue2 stores the data from bounded queue queue1 that remain after filtering;
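The worker loop of Step 4 can be sketched as follows. The filter condition `should_drop` and the `STOP` sentinel used to shut the workers down are assumptions of this example; the patent ends its threads through the termination check in Step 7 rather than through a sentinel.

```python
import queue
import threading

STOP = object()  # sentinel (an assumption of this sketch) telling a worker to exit

def should_drop(piece):
    """Hypothetical filter condition: drop records whose relation is 'spam'."""
    return piece["relation"] == "spam"

def clean_worker(queue1, queue2):
    while True:
        piece = queue1.get()
        if piece is STOP:
            break
        if not should_drop(piece):
            queue2.put(piece)  # kept: passed on to the ID conversion stage

queue1 = queue.Queue(maxsize=100)
queue2 = queue.Queue(maxsize=100)
workers = [threading.Thread(target=clean_worker, args=(queue1, queue2)) for _ in range(3)]
for w in workers:
    w.start()
for rel in ["calls", "spam", "mails", "spam", "calls"]:
    queue1.put({"src": "uid=1", "dst": "uid=2", "relation": rel})
for _ in workers:
    queue1.put(STOP)  # one sentinel per worker
for w in workers:
    w.join()

kept = sorted(queue2.get()["relation"] for _ in range(queue2.qsize()))
```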

Step 5: The ID conversion module works with multiple threads in parallel; each thread loops, taking a cleaned and filtered pieceOfData record from bounded queue queue2 and processing it;

The specific steps are:

Step 501: Determine whether the correspondences between the two points in the current pieceOfData record and their Titan internal IDs are all present in the high-speed index module; if so, go to step 502; otherwise, go to step 503;

Step 502: The ID conversion module takes the correspondences from the high-speed index module, replaces the corresponding points in the pieceOfData record with the ID attribute and ID value, saves the result as a pieceOfDataT record, and puts the pieceOfDataT record into bounded queue queue4;

A pieceOfDataT record holds the pieceOfData record after its points have been replaced by the corresponding ID attribute and ID value;

queue4 stores pieceOfDataT records;

Step 503: The correspondence between at least one point in the current pieceOfData record and its Titan internal ID has not been loaded into the high-speed index module; the ID conversion module puts each point that has not been loaded into the HashSet and puts the pieceOfData record into bounded queue queue3;

queue3 stores the pieceOfData records taken from bounded queue queue2 for which the correspondence between at least one point and its Titan internal ID has not been loaded into the high-speed index module.
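Steps 501 to 503 amount to a lookup followed by a two-way branch. The sketch below models the high-speed index as a plain dictionary and the HashSet as a Python `set` guarded by a lock; the ID values, the record layout, and the function name `convert` are illustrative assumptions.

```python
import queue
import threading

# High-speed index: point -> Titan internal ID (hypothetical values).
index = {"uid=1": 101, "uid=2": 102}
pending_points = set()          # plays the role of the HashSet
pending_lock = threading.Lock()

def convert(piece, queue3, queue4):
    """Route one pieceOfData record; return True if it was converted."""
    src_id = index.get(piece["src"])                  # step 501: look up both points
    dst_id = index.get(piece["dst"])
    if src_id is not None and dst_id is not None:
        # Step 502: replace the points with their IDs (the pieceOfDataT record).
        queue4.put({"src_id": src_id, "dst_id": dst_id, "relation": piece["relation"]})
        return True
    # Step 503: remember every missing point and park the record in queue3.
    with pending_lock:
        for point, point_id in ((piece["src"], src_id), (piece["dst"], dst_id)):
            if point_id is None:
                pending_points.add(point)
    queue3.put(piece)
    return False

queue3 = queue.Queue(maxsize=10)
queue4 = queue.Queue(maxsize=10)
converted = convert({"src": "uid=1", "dst": "uid=2", "relation": "calls"}, queue3, queue4)
deferred = convert({"src": "uid=1", "dst": "uid=9", "relation": "calls"}, queue3, queue4)
```

Using a set for the missing points means a point referenced by many parked records is queued for creation only once.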

Step 6: The remaining data loading module runs multiple threads concurrently; each thread loops, obtaining pieceOfDataT records from bounded queue queue4 and loading them into the Titan database;
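A minimal sketch of Step 6 follows. `FakeGraph` is an in-memory stand-in written for this example and is not the Titan API; the `STOP` sentinel and the thread count are likewise assumptions. Every record taken from queue4 already carries the IDs of both points, so the loader threads only write relations.

```python
import queue
import threading

STOP = object()  # sentinel (an assumption of this sketch) telling a loader to exit

class FakeGraph:
    """In-memory stand-in for the graph database; not the Titan API."""

    def __init__(self):
        self._lock = threading.Lock()
        self.edges = []

    def add_edge(self, src_id, dst_id, relation):
        with self._lock:
            self.edges.append((src_id, dst_id, relation))

def load_worker(queue4, graph):
    while True:
        item = queue4.get()
        if item is STOP:
            break
        graph.add_edge(item["src_id"], item["dst_id"], item["relation"])

graph = FakeGraph()
queue4 = queue.Queue(maxsize=100)
loaders = [threading.Thread(target=load_worker, args=(queue4, graph)) for _ in range(4)]
for t in loaders:
    t.start()
for i in range(10):
    queue4.put({"src_id": i, "dst_id": i + 1, "relation": "calls"})
for _ in loaders:
    queue4.put(STOP)
for t in loaders:
    t.join()
```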

Step 7: The point loading module interacts with the high-speed index module, and all threads are terminated once the termination condition is met;

The specific steps are as follows:

Step 701: Determine whether the termination condition is met; if so, all threads terminate; otherwise, go to step 702;

Step 702: Determine whether bounded queue queue3 is full, or whether the time since the data in the HashSet were last loaded has exceeded the time threshold t; if so, perform step 703; otherwise, sleep for time t1 and return to step 701;

The threshold t is an input parameter at system initialization and is set according to actual conditions;

Step 703: Each thread of the point loading module loads the points in the HashSet and adds the correspondence between each point and its Titan internal ID to the high-speed index module;

Step 704: The point loading module resets the HashSet and records the current time as the time at which the data in the HashSet were loaded;

Step 705: Put all pieceOfData records in bounded queue queue3 into bounded queue queue2 and empty bounded queue queue3; return to step 701.
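One pass through steps 702 to 705 can be sketched as a single function, which keeps the example deterministic. The function name `flush_if_due`, the ID counter standing in for point creation in Titan, and the explicit `now` argument are assumptions of this sketch; a real control thread would call it in a loop, sleep for t1 between passes, and check the termination condition of step 701 first.

```python
import itertools
import queue
import time

def flush_if_due(queue3, queue2, pending_points, index, next_id, last_load, t, now=None):
    """Run steps 702-705 once and return the (possibly updated) last-load time."""
    now = time.monotonic() if now is None else now
    if not (queue3.full() or now - last_load > t):   # step 702: neither condition holds
        return last_load
    for point in list(pending_points):               # step 703: load the new points
        if point not in index:
            index[point] = next(next_id)             # stand-in for creating the point in Titan
    pending_points.clear()                           # step 704: reset the HashSet
    while True:                                      # step 705: recycle the parked records
        try:
            queue2.put(queue3.get_nowait())
        except queue.Empty:
            break
    return now

queue2 = queue.Queue(maxsize=10)
queue3 = queue.Queue(maxsize=2)
index = {"uid=1": 101}
pending = {"uid=2", "uid=3"}
ids = itertools.count(1000)

queue3.put({"src": "uid=1", "dst": "uid=2"})
last = flush_if_due(queue3, queue2, pending, index, ids, last_load=0.0, t=5.0, now=1.0)
queue3.put({"src": "uid=2", "dst": "uid=3"})         # queue3 is now full
last2 = flush_if_due(queue3, queue2, pending, index, ids, last_load=last, t=5.0, now=2.0)
```

After the flush, the recycled records pass through ID conversion again and now find both points in the index. Note that `queue2.put` blocks if queue2 is itself full, so a production version would need to size the queues accordingly.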

The advantages of the invention are:

1) The efficient parallel loading method that maintains real-time data consistency in Titan greatly improves the loading performance of Titan real-time data, raising loading speed by more than 20 times.

2) The efficient parallel loading method that maintains real-time data consistency in Titan is an efficient and safe method for preprocessing and loading real-time data; it greatly improves data loading efficiency while maintaining data consistency, and data filtering rules can be modified and added in real time.

Brief description of the drawings

Fig. 1 is a structural diagram of the 7 modules into which the graph database Titan is divided according to the invention;

Fig. 2 is a flow chart of the efficient parallel loading method of the invention that maintains real-time data consistency in Titan.

Specific embodiment

The specific implementation of the invention is described in detail below with reference to the accompanying drawings.

To greatly improve the loading performance of Titan real-time data while guaranteeing data consistency, the invention proposes an efficient parallel loading method that maintains real-time data consistency in Titan. It generally comprises three parts: real-time data cleaning, storage control, and new-point processing;

The rule management thread updates the filtering rules dynamically in real time. The main thread receives pieceOfData records and puts them into bounded queue queue1. The data cleaning module filters out unqualified data according to the cleaning rules and puts the remaining data into bounded queue queue2. The ID conversion module takes data from bounded queue queue2 and interacts with the high-speed index module to determine whether the correspondence between the two points in the current pieceOfData record and their Titan internal IDs already exists in the graph database. If so, it takes the Titan internal ID attribute and ID value for each point from the index, substitutes them for the points, saves the result as a pieceOfDataT record, and puts it into bounded queue queue4. Otherwise, it puts the points that have not been loaded into the HashSet and puts the corresponding pieceOfData record into bounded queue queue3. The remaining data loading module obtains pieceOfDataT records from bounded queue queue4 and loads them into Titan with multiple threads in parallel;

The point loading control thread determines whether data cleaning has finished. If it has not, the thread checks whether bounded queue queue3 is full. If queue3 is not full, the thread sleeps for a period and waits for it to fill. Otherwise, multiple threads load the points in the HashSet and add the correspondence between each point and its Titan internal ID to the high-speed index module. The point loading module then resets the HashSet, puts all pieceOfData records in bounded queue queue3 into bounded queue queue2, and empties bounded queue queue3.

The specific steps, shown in Fig. 2, are as follows:

As shown in Fig. 1, the 7 modules are: the data receiving module, the cleaning rule management module, the data cleaning module, the ID conversion module, the high-speed index module, the point loading module, and the remaining data loading module. Each module completes part of the work alone or through interaction, which improves loading efficiency overall.

The first module, the data receiving module, receives the data to be processed from sources such as a message queue or a CSV file and puts it into a bounded queue.

The second module, the data cleaning module, filters out unwanted data according to the given rules; the given rules include exact matching, fuzzy matching, and regular-expression matching.

The third module, the cleaning rule management module, updates the filtering rules dynamically by monitoring the rule file.

The filtering rule file is a JSON file; its concrete structure is shown in Annex 1.

Filtering rule file:

The fourth module, the ID conversion module, replaces each point in the data with the ID of the corresponding point in the graph database.

The fifth module, the high-speed index module, has a key-value structure; it speeds up ID conversion in the fourth module.

The sixth module, the point loading module, loads the points that were found during ID conversion in the fourth module not to exist in the graph database, and after loading adds each point and its ID correspondence to the high-speed index module.

The seventh module, the remaining data loading module, greatly increases the loading speed of graph data through parallel loading.
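The three rule types named above for the data cleaning module (exact, fuzzy, and regular-expression matching) could be evaluated along the following lines. The rule fields `field`, `type`, and `pattern` are a hypothetical format chosen for this sketch, not the structure of the patent's Annex 1, and "fuzzy" is assumed here to mean substring containment.

```python
import re

def matches(rule, value):
    """Return True if one rule matches one string value."""
    kind, pattern = rule["type"], rule["pattern"]
    if kind == "exact":
        return value == pattern
    if kind == "fuzzy":
        return pattern in value  # assumption: fuzzy means substring containment
    if kind == "regex":
        return re.search(pattern, value) is not None
    raise ValueError(f"unknown rule type: {kind}")

def should_drop(piece, rules):
    """A record is filtered out as soon as any rule matches its target field."""
    return any(matches(r, str(piece.get(r["field"], ""))) for r in rules)

rules = [
    {"field": "relation", "type": "exact", "pattern": "spam"},
    {"field": "src", "type": "fuzzy", "pattern": "test"},
    {"field": "dst", "type": "regex", "pattern": r"^uid=0+$"},
]
```

For example, `should_drop({"src": "uid=test7", "dst": "uid=2", "relation": "calls"}, rules)` is true because the fuzzy rule finds `test` inside the source point.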

Step 2: The data receiving module runs multiple threads concurrently; each thread loops, obtaining graph data from a data source such as a message queue or a CSV file, parsing it into multiple pieceOfData records, and putting them into bounded queue queue1.

Graph data is any topological graph data composed of points and edges;

A pieceOfData record consists of two points, the relation between the two points, and the attributes of the points and of the relation; a point is a key-value pair that uniquely identifies a specific point, for example uid=9867;
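The record described above could be modelled as follows. The class and field names are illustrative assumptions; only the shape (two points, a relation, and attributes of the points and of the relation) and the `uid=9867` example come from the text.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Point:
    """A key-value pair that uniquely identifies one point, e.g. uid=9867."""
    key: str
    value: str

    @classmethod
    def parse(cls, text):
        key, value = text.split("=", 1)
        return cls(key, value)

@dataclass
class PieceOfData:
    src: Point
    dst: Point
    relation: str
    point_attrs: dict = field(default_factory=dict)     # attributes of the points
    relation_attrs: dict = field(default_factory=dict)  # attributes of the relation

p = Point.parse("uid=9867")
piece = PieceOfData(p, Point.parse("uid=1234"), "calls", relation_attrs={"count": 3})
```

Making `Point` frozen keeps it hashable, so points can be used directly as dictionary keys in an index or as members of a set of points awaiting loading.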

Bounded queue queue1 stores the data obtained from the data source;

queue2 stores the data from bounded queue queue1 that remain after filtering;

Step 5: The ID conversion module works with multiple threads in parallel; each thread loops, taking a cleaned and filtered pieceOfData record from bounded queue queue2 and determining whether the correspondences between the two points in the current pieceOfData record and their Titan internal IDs are all present in the high-speed index module; if so, go to step 6; otherwise, go to step 8;

Step 6: The ID conversion module takes the correspondences from the high-speed index module, replaces the corresponding points in the pieceOfData record with the Titan internal ID attribute and ID value, saves the result as a pieceOfDataT record, and puts it into bounded queue queue4;

queue4 stores pieceOfDataT records;

Step 7: The remaining data loading module runs multiple threads concurrently; each thread loops, obtaining pieceOfDataT records from bounded queue queue4 and loading them into the Titan database; return to step 5;

Step 8: The correspondence between at least one point in the current pieceOfData record and its Titan internal ID has not been loaded into the high-speed index module; the ID conversion module puts each point that has not been loaded into the HashSet and puts the corresponding pieceOfData record into bounded queue queue3;

Step 9: Determine whether bounded queue queue3 is full or the time has reached the set threshold t; if so, perform step 10; otherwise, return to step 5;

When either condition holds, that is, bounded queue queue3 is full or the time has reached the set threshold t, processing continues with the subsequent steps. Conversely, when bounded queue queue3 is not full and the time has not reached the set threshold t, the current thread sleeps. Meanwhile, for data in bounded queue queue2 whose points have not been loaded into the high-speed index module, the ID conversion module continues to put the points into the HashSet and the corresponding pieceOfData records into bounded queue queue3, until bounded queue queue3 is full or the time reaches the set threshold t;

Step 10: The point loading module determines whether data cleaning has finished; if so, all threads terminate; otherwise, go to step 11;

Step 11: Each thread of the point loading module loads the points in the HashSet and adds the correspondence between each point and its Titan internal ID to the high-speed index module;

Step 12: The point loading module resets the HashSet, puts all pieceOfData records in bounded queue queue3 into bounded queue queue2, and empties bounded queue queue3; return to step 5.

It should be noted and understood that various modifications and improvements can be made to the invention described in detail above without departing from the spirit and scope of the invention as claimed in the appended claims. Accordingly, the scope of the claimed technical solution is not limited by any specific exemplary teaching given.

Claims

1. An efficient parallel loading method that maintains real-time data consistency in Titan, characterized in that the specific steps are as follows:

The 7 modules are: the data receiving module, the cleaning rule management module, the data cleaning module, the ID conversion module, the high-speed index module, the point loading module, and the remaining data loading module;

Step 2: The data receiving module runs multiple threads concurrently; each thread loops, obtaining data from a data source such as a message queue or a CSV file, parsing it into multiple pieceOfData records, and putting them into bounded queue queue1;

A pieceOfData record consists of two points, the relation between the two points, and the attributes of the points and of the relation;

Step 4: The data cleaning module works with multiple threads in parallel; each thread loops, taking one pieceOfData record at a time from bounded queue queue1 and checking it against the cleaning rules; if the record meets a filter condition it is discarded; otherwise it is put into bounded queue queue2;

Step 5: The ID conversion module works with multiple threads in parallel; each thread loops, taking a cleaned and filtered pieceOfData record from bounded queue queue2 and processing it;

Step 6: The remaining data loading module runs multiple threads concurrently; each thread loops, obtaining pieceOfDataT records from bounded queue queue4 and loading them into the Titan database;

Step 7: The point loading module interacts with the high-speed index module, and all threads are terminated once the termination condition is met.

2. The efficient parallel loading method that maintains real-time data consistency in Titan as claimed in claim 1, characterized in that, in step 1, the data receiving module receives the data to be processed and puts it into a bounded queue;

The ID conversion module replaces each point in the cleaned data with the ID of the corresponding point in the graph database;

The high-speed index module speeds up ID conversion;

The point loading module loads the points that were found during ID conversion not to exist in the graph database, and after loading adds each point and its ID correspondence to the high-speed index module.

3. The efficient parallel loading method that maintains real-time data consistency in Titan as claimed in claim 1, characterized in that step 5 is specifically:

Step 501: Determine whether the correspondences between the two points in the current pieceOfData record and their Titan internal IDs are all present in the high-speed index module; if so, go to step 502; otherwise, go to step 503;

Step 502: The ID conversion module takes the correspondences from the high-speed index module, replaces the corresponding points in the pieceOfData record with the ID attribute and ID value, saves the result as a pieceOfDataT record, and puts the pieceOfDataT record into bounded queue queue4;

A pieceOfDataT record holds the pieceOfData record after its points have been replaced by the corresponding ID attribute and ID value;

Step 503: The correspondence between at least one point in the current pieceOfData record and its Titan internal ID has not been loaded into the high-speed index module; the ID conversion module puts each point that has not been loaded into the HashSet and puts the pieceOfData record into bounded queue queue3;

queue3 stores the pieceOfData records taken from bounded queue queue2 for which the correspondence between at least one point and its Titan internal ID has not been loaded into the high-speed index module.

4. The efficient parallel loading method that maintains real-time data consistency in Titan as claimed in claim 1, characterized in that step 7 is specifically:

Step 702: Determine whether bounded queue queue3 is full, or whether the time since the data in the HashSet were last loaded has exceeded the time threshold t; if so, perform step 703; otherwise, sleep for time t1 and return to step 701;

Step 703: Each thread of the point loading module loads the points in the HashSet and adds the correspondence between each point and its Titan internal ID to the high-speed index module;

Step 704: The point loading module resets the HashSet and records the current time as the time at which the data in the HashSet were loaded;

Step 705: Put all pieceOfData records in bounded queue queue3 into bounded queue queue2 and empty bounded queue queue3; return to step 701.