CN106959948A

CN106959948A - System and method for preprocessing big data according to its distributed characteristics

Info

Publication number: CN106959948A
Application number: CN201610010843.9A
Authority: CN
Inventors: 顾青; 梁佐泉; 谢超; 梁艳敏; 王宁宁; 冯四风; 赵艳红; 田文晋; 王亚红; 黄奚芳
Original assignee: Waterhouse Integrity Information Technology Co Ltd
Current assignee: Waterhouse Integrity Information Technology Co Ltd
Priority date: 2016-01-08
Filing date: 2016-01-08
Publication date: 2017-07-18

Abstract

The invention discloses a system for preprocessing big data according to its distributed characteristics, comprising: a preprocessing adapter, which provides the entry point for raw-data preprocessing and is divided into an automated preprocessing adapter and a semi-automated preprocessing adapter; a data processing module, which divides the data sent by the preprocessing adapter into data blocks according to specified rules and a unified standard data format and distributes the resulting blocks across different storage nodes, such that mutually associated data are placed in the same data block and no association exists between data blocks; and a distributed storage module, which provides multiple storage nodes for storing the data blocks sent by the data processing module. The invention also provides a method for preprocessing big data according to its distributed characteristics. The invention can greatly improve the accuracy and efficiency of distributed computation and mining analysis on big data.

Description

System and method for preprocessing big data according to its distributed characteristics

Technical field

The present invention relates to the computer field, and in particular to a system that preprocesses big data according to the distributed characteristics of the data. The invention further relates to a method that preprocesses big data according to the distributed characteristics of the data.

Background technology

Big data technology is developing rapidly. Data technology has evolved from processing a single type of data on a single machine to processing multiple types of data on computer clusters, supporting data-analysis applications with loose time requirements. As data volumes grow to the PB and EB scale and beyond, and faster processing and analysis times are demanded, the development trends of big data technology include general technologies such as special-purpose big data computers, geographically distributed computer clusters, processing and analysis of multi-type and multi-source data, data-network analysis, and second-level time-series analysis of complex data types, together with various domain-oriented application technologies. General big data technologies and open source projects, represented by HDFS, GFS, MapReduce, Hadoop, Spark, Storm, HBase and MongoDB, are developing quickly. Big data preprocessing is an essential step in big data processing, and these big data technologies have all introduced the concepts of distributed computation and distributed mining analysis.

Big data comes from complex sources and has diverse data structures. The collected data must be preprocessed with big data preprocessing techniques so that the information is brought into a unified standard data format, which supports subsequent computation and mining analysis. To support distributed computation and mining analysis effectively, big data must be preprocessed according to its distributed characteristics, ensuring that related data reside on the same node and that no data exchange or computational interaction is needed between nodes.

The distributed characteristics of data include the distributed computation algorithm, the distributed mining analysis algorithm, and the mathematical model corresponding to the distributed mining analysis algorithm.

Summary of the invention

The technical problem to be solved by the invention is to provide a system that preprocesses big data using the distributed characteristics of the data (the distributed computation algorithm, the distributed mining analysis algorithm, and the mathematical model corresponding to the distributed mining analysis algorithm). The system quickly and effectively converts big data into a unified standard data format, places associated data on the same node, and distributes all data across different nodes in an orderly manner such that no association exists between the data on different nodes. This avoids interactive computation between nodes and effectively supports distributed computation and mining analysis on big data. The invention also provides a method that preprocesses big data using the distributed characteristics of the data.

To solve the above technical problem, the system provided by the invention for preprocessing big data according to its distributed characteristics comprises: a preprocessing adapter, a data processing module and a distributed storage module.

The preprocessing adapter provides the entry point for raw-data preprocessing and converts raw data into data in the target format. It is divided into an automated preprocessing adapter and a semi-automated preprocessing adapter.

Automated preprocessing adapter: a different automated adapter is provided for each data source format, and each converts raw data into data in the target format.

Semi-automated preprocessing adapter: either through secondary development against an open standard preprocessing interface, or by adding a corresponding configuration file that follows the standard of the automated preprocessing adapter, raw data are converted into data in the target format or into data that meet the format requirements of the automated preprocessing adapter.

Data processing module: divides the data sent by the preprocessing adapter into data blocks according to specified rules and a unified standard data format, and distributes the resulting data blocks across different storage nodes. Data that satisfy a preset association computation rule are placed in the same data block, and no association exists between data blocks.

Distributed storage module: provides multiple storage nodes for storing the data blocks sent by the data processing module.

The specified rules by which the data processing module divides data into blocks are: the distributed computation algorithm of the data, the distributed mining analysis algorithm, and the mathematical model corresponding to the distributed mining analysis algorithm.

When data blocks are divided according to the distributed computation algorithm, the division is carried out as follows:

Data aggregation: data are integrated into data blocks through sorting, classified summarization and grouping operations;

Data recombination: the corresponding data items are extracted according to specific rules and recombined into new data blocks;

Data association: an association rule is set, and data whose data items satisfy the rule are placed in one data block;

Data cutting: on the basis of the data aggregation, data association and data recombination operations, where the distributed computation would require calculations, under the set computation model, between different data blocks or between data on different machines, the data are cut by specified rules according to business requirements, so that the data are distributed across different nodes in an orderly manner;

Computation model: the mathematical formula abstracted from the business requirements;

The data items on which the specified rules may base the division include data category, data size and calculated data.
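As an illustration only, the aggregation and cutting operations can be sketched in Python as follows. The record fields (`category`, `size`), the function names and the round-robin placement of blocks on nodes are assumptions made for this sketch and are not specified by the invention.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def aggregate(records, key):
    """Data aggregation: sort the records, then group them into blocks by `key`."""
    ordered = sorted(records, key=itemgetter(key))
    return {k: list(group) for k, group in groupby(ordered, key=itemgetter(key))}


def cut(blocks, node_count):
    """Data cutting: place whole blocks on nodes so that no block spans two nodes.

    Round-robin placement is a hypothetical stand-in for the business rule.
    """
    nodes = defaultdict(list)
    for index, block_key in enumerate(sorted(blocks)):
        nodes[index % node_count].append(blocks[block_key])
    return dict(nodes)


# Hypothetical records; `category` plays the role of the association rule.
records = [
    {"id": 1, "category": "a", "size": 10},
    {"id": 2, "category": "b", "size": 20},
    {"id": 3, "category": "a", "size": 30},
    {"id": 4, "category": "c", "size": 40},
]

blocks = aggregate(records, "category")
placement = cut(blocks, node_count=2)
```

Because records that share a category always end up in the same block, and a block is never split, any calculation restricted to one category can run on a single node in this sketch.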

When data are divided according to the distributed mining analysis algorithm, the division is carried out as follows:

Data information extraction: according to the parameter requirements of the analysis algorithm, the data items to be analyzed are extracted and stored on the same data node;

Data processing: on the basis of the raw data and according to the business analysis target, the corresponding calculation formula is set, and new data items are generated by calculation among existing data;

Mining analysis algorithm data format conversion: the raw data are converted into the data format required by the mining analysis algorithm.
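The three operations above might look like this in Python. The field names, the derived ratio and the list-of-floats target format are illustrative assumptions; the invention does not fix a particular formula or format.

```python
def extract_items(records, wanted):
    """Data information extraction: keep only the items the algorithm needs."""
    return [{name: rec[name] for name in wanted} for rec in records]


def derive_item(records, name, formula):
    """Data processing: add a new item computed from existing items."""
    return [{**rec, name: formula(rec)} for rec in records]


def to_feature_rows(records, order):
    """Format conversion: turn each record into a row of floats."""
    return [[float(rec[name]) for name in order] for rec in records]


# Hypothetical raw records; `note` is not needed by the analysis and is dropped.
raw = [
    {"user": "u1", "clicks": 4, "views": 8, "note": "x"},
    {"user": "u2", "clicks": 1, "views": 4, "note": "y"},
]

items = extract_items(raw, ["clicks", "views"])
items = derive_item(items, "rate", lambda r: r["clicks"] / r["views"])
rows = to_feature_rows(items, ["clicks", "views", "rate"])
```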

When data blocks are divided according to the mathematical model corresponding to the distributed mining analysis algorithm, the division is carried out as follows:

Through data format conversion and data model extraction, the data items, data types and data formats required by the mathematical model are extracted, and the data are distributed to different nodes;

Mathematical model data format conversion: the raw data are converted into the data format required by the mathematical model;

Data model extraction: according to the needs of mining analysis, a portion of typical data is extracted from the raw data according to specified rules to build the mathematical model.

The configuration items of the automated preprocessing adapter can be mapped one-to-one to the field names or data items of the stored data. The data items of the automated preprocessing adapter can be selected through a configuration page, or selected by setting and modifying the parameter values of the configuration items.
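A minimal sketch of such a mapping, assuming a CSV source and a dictionary-based configuration. The names `FIELD_MAP`, `SELECTED` and `adapt_csv`, and all field names, are hypothetical.

```python
import csv
import io

# Configuration items: source field name -> stored field name (one-to-one).
FIELD_MAP = {"uid": "user_id", "ts": "timestamp", "amt": "amount"}
# Data items chosen for output, for example through a configuration page.
SELECTED = ["user_id", "amount"]


def adapt_csv(text, field_map=FIELD_MAP, selected=SELECTED):
    """Convert raw CSV text into records in the target format."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        # Rename source fields according to the mapping, ignoring unmapped ones.
        renamed = {field_map[k]: v for k, v in row.items() if k in field_map}
        # Keep only the selected data items, in the configured order.
        records.append({name: renamed[name] for name in selected})
    return records


raw = "uid,ts,amt\n7,2016-01-08,12.5\n9,2016-01-09,3.0\n"
records = adapt_csv(raw)
```

Changing the selected data items then only requires editing the configuration values, not the adapter code.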

The method provided by the invention for preprocessing big data according to its distributed characteristics comprises:

Step 1: convert the raw data into data in the target format according to the different data source formats; according to the data analysis target, set one of the data items as the primary key for calculation; based on the available data items, combine the data corresponding to the primary key items of any two records among all the data to obtain associated data pairs;

Step 2: based on the primary key items of each associated data pair and the set computation model, take the data items required by the association calculation as the value, and convert the pair into a <key, value> key-value pair;

Step 3: divide the key-value pairs into data blocks according to specified rules to obtain different data blocks, and perform parallel computation on the obtained data blocks to obtain new data blocks;

Step 4: distribute the new data blocks to different nodes, such that no association exists between the resulting data blocks.
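The steps above can be sketched on a single machine as follows. This is a sketch under stated assumptions: the record layout, the round-robin block division and the absolute-difference computation model are all hypothetical, and a thread pool stands in for the cluster.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def make_pairs(records, key):
    """Step 1: combine every two records, keyed by their primary keys."""
    return [((a[key], b[key]), (a, b)) for a, b in combinations(records, 2)]


def to_key_values(pairs, item):
    """Step 2: keep only the item the computation model needs as the value."""
    return [(k, (a[item], b[item])) for k, (a, b) in pairs]


def partition(key_values, block_count):
    """Step 3, first half: split the key-value pairs into independent blocks."""
    blocks = [[] for _ in range(block_count)]
    for index, kv in enumerate(key_values):
        blocks[index % block_count].append(kv)
    return blocks


def compute_block(block):
    """Step 3, second half: apply a hypothetical computation model."""
    return [(k, abs(x - y)) for k, (x, y) in block]


records = [
    {"id": "r1", "score": 5},
    {"id": "r2", "score": 9},
    {"id": "r3", "score": 2},
]
key_values = to_key_values(make_pairs(records, "id"), "score")
blocks = partition(key_values, block_count=2)

# Each block holds everything it needs, so blocks can be processed in parallel.
with ThreadPoolExecutor() as pool:
    new_blocks = list(pool.map(compute_block, blocks))
```

In this sketch, step 4 would correspond to writing each element of `new_blocks` to a different storage node; since every value already carries both operands, no block needs data from another block.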

When step 3 is carried out, the specified rules for dividing data into blocks are: the distributed computation algorithm of the data, the distributed mining analysis algorithm, and the mathematical model corresponding to the distributed mining analysis algorithm.

When data blocks are divided according to the distributed computation algorithm, the division is carried out as follows:

Data cutting: on the basis of the data aggregation, data association and data recombination operations, where the distributed computation would require calculations, under the set computation model, between different data blocks or between data on different machines, the data are cut by specified rules according to business requirements, so that the data are distributed across different nodes in an orderly manner.

When data blocks are divided according to the distributed mining analysis algorithm, the division is carried out as follows:

When step 1 is carried out, the configuration items of the target-format data can be mapped one-to-one to the field names or data items of the stored data, and the data items of the target-format data can be selected directly, or selected by setting and modifying the parameter values of the configuration items.

The working principle of the invention is illustrated below, taking the calculation of association values between data as an example.

Assume there are N records in total and that the unique identifier of each record is set as the primary key (key). The association value between any two records is obtained by a calculation between the same data item of the two records, so N*(N-1)/2 calculations are required in total.
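The count follows from choosing 2 records out of N without regard to order. A short check, using five hypothetical primary keys:

```python
from itertools import combinations


def pair_count(n):
    """Number of unordered pairs among n records: N * (N - 1) / 2."""
    return n * (n - 1) // 2


keys = ["k1", "k2", "k3", "k4", "k5"]  # hypothetical primary keys
pairs = list(combinations(keys, 2))
```

The number of enumerated pairs matches the formula, and it grows quadratically with N, which is why the placement of each pair matters at large scale.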

Fig. 1 shows the computation structure of the data after processing by a traditional preprocessing method.

The traditional data preprocessing method divides the data evenly across m nodes according to data volume. Because every pair of records must be computed to obtain its association value, the computations for the data in data block 1 of node 1 in Fig. 1 show that three types of data computation coexist: c1, computation between any two records in the same data block; c2, computation between data in different data blocks on the same machine; c3, computation between data in different data blocks on different machines.
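To make the three types concrete, the following sketch counts them for a small, hypothetical even split. The layout (8 records, 2 nodes, 2 blocks per node) and the function name are assumptions chosen for illustration.

```python
from collections import Counter
from itertools import combinations


def classify_pairs(locations):
    """Count pair computations by where the two records live.

    `locations` maps a record key to its (node, block) position.
    """
    counts = Counter()
    for a, b in combinations(sorted(locations), 2):
        node_a, block_a = locations[a]
        node_b, block_b = locations[b]
        if node_a != node_b:
            counts["c3"] += 1  # different blocks on different machines
        elif block_a != block_b:
            counts["c2"] += 1  # different blocks on the same machine
        else:
            counts["c1"] += 1  # same block
    return counts


# Hypothetical layout: 8 records, 2 nodes, 2 blocks per node, 2 records per block.
locations = {f"r{i}": (i // 4, (i % 4) // 2) for i in range(8)}
counts = classify_pairs(locations)
```

In this layout only 4 of the 28 pair computations stay within one block, while 16 require data from two machines.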

When data processed by the traditional preprocessing method are used for computation, frequent interaction is required between different records, between different data blocks and between different nodes, all of which make the computation time-consuming.

Fig. 2 shows the computation structure of the data after preprocessing by the present invention. In the preprocessed data, the data that must be computed together are stored as a single record, which avoids communication and interaction between different records, between different data blocks and between different nodes, and greatly improves the efficiency of distributed computation. The data preprocessed by the invention are also processed into the data format required for mining analysis, according to the business analysis target.

The invention can substantially improve the efficiency of distributed computation and mining analysis on big data.

Brief description of the drawings

The invention is described in further detail below with reference to the accompanying drawings and the embodiments:

Fig. 1 is a schematic diagram of the computation structure of the data after processing by a traditional preprocessing method.

Fig. 2 is a schematic diagram of the computation structure of the data after preprocessing by the present invention.

Fig. 3 is a schematic diagram of the structure of the preprocessing system of the present invention.

Embodiment

As shown in Fig. 3, the system provided by the invention for preprocessing big data according to its distributed characteristics comprises: a preprocessing adapter, a data processing module and a distributed storage module;

Computation model: the mathematical formula abstracted from the business requirements.

The invention provides a method for preprocessing big data according to its distributed characteristics, comprising:

The invention has been described in detail above through specific embodiments, but these do not limit the invention. Without departing from the principles of the invention, those skilled in the art can make many modifications and improvements, and these should also be regarded as falling within the scope of protection of the invention.

Claims

1. A system for preprocessing big data according to its distributed characteristics, characterized by comprising: a preprocessing adapter, a data processing module and a distributed storage module;

2. The system for preprocessing big data according to its distributed characteristics as claimed in claim 1, characterized in that: the specified rules by which the data processing module divides data into blocks are: the distributed computation algorithm of the data, the distributed mining analysis algorithm, and the mathematical model corresponding to the distributed mining analysis algorithm.

3. The system for preprocessing big data according to its distributed characteristics as claimed in claim 2, characterized in that: when data blocks are divided according to the distributed computation algorithm, the division is carried out as follows:

Computation model: the mathematical formula abstracted from the business requirements;

4. The system for preprocessing big data according to its distributed characteristics as claimed in claim 2, characterized in that: when data are divided according to the distributed mining analysis algorithm, the division is carried out as follows:

5. The system for preprocessing big data according to its distributed characteristics as claimed in claim 2, characterized in that: when data blocks are divided according to the mathematical model corresponding to the distributed mining analysis algorithm, the division is carried out as follows:

6. The system for preprocessing big data according to its distributed characteristics as claimed in claim 1, characterized in that: the configuration items of the automated preprocessing adapter can be mapped one-to-one to the field names or data items of the stored data, and the data items of the automated preprocessing adapter can be selected through a configuration page, or selected by setting and modifying the parameter values of the configuration items.

7. A method for preprocessing big data according to its distributed characteristics, characterized by comprising:

8. The method for preprocessing big data according to its distributed characteristics as claimed in claim 7, characterized in that: when data blocks are divided according to the distributed computation algorithm, the division is carried out as follows:

9. The method for preprocessing big data according to its distributed characteristics as claimed in claim 7, characterized in that: when data blocks are divided according to the distributed mining analysis algorithm, the division is carried out as follows:

10. The method for preprocessing big data according to its distributed characteristics as claimed in claim 7, characterized in that: when data blocks are divided according to the mathematical model corresponding to the distributed mining analysis algorithm, the division is carried out as follows: