CN106951475A - Big data distributed processing method and system based on cloud computing

Info

Publication number: CN106951475A
Application number: CN201710130418.8A
Authority: CN
Inventors: 梁明亮; 孙逸洁; 刘伟; 苏东民; 董黎生
Original assignee: Zhengzhou Railway Vocational and Technical College
Current assignee: Zhengzhou Railway Vocational and Technical College
Priority date: 2017-03-07
Filing date: 2017-03-07
Publication date: 2017-07-14

Abstract

A big data distributed processing method based on cloud computing, comprising the following steps: S1, receiving an input file, dividing it into input splits according to the input file size, and assigning one map task to each input split, each input split storing the split length and an array of positions recording the data; S2, performing mapping on the data storage nodes by a pre-written map function to obtain intermediate files; S3, merging duplicate key values in the intermediate files; S4, allocating a circular memory buffer in memory, the circular memory buffer being used to output the map output files; creating a configuration file in the circular memory buffer; a daemon thread pauses the writing of data into memory and writes a spill file in memory, the spill file determining the files to be written to disk, and the files in the circular memory buffer are written to disk until all map output files have been output; S5, merging all map output files and storing them on a distributed file storage system.

Description

Big data distributed processing method and system based on cloud computing

Technical field

The present invention relates to the technical field of big data and cloud computing, and more particularly to a big data distributed processing method and system based on cloud computing.

Background art

With the arrival of the cloud era, big data has attracted increasing attention. Big data usually refers to large amounts of unstructured and semi-structured data that are downloaded into relational databases for analysis. Big data analysis is often associated with cloud computing, because real-time analysis of large data sets requires a framework such as MapReduce to distribute the work across tens, hundreds, or even thousands of computers. Big data requires special techniques to process large volumes of data effectively within a tolerable elapsed time. Technologies applicable to big data include massively parallel processing (MPP) databases, data mining systems, distributed file systems, distributed databases, cloud computing platforms, the Internet, and scalable storage systems.

In a big data environment, data sources are very rich and data types are diverse; the volume of data to be stored, analyzed, and mined is huge; the requirements for data presentation are high; and great importance is attached to the efficiency and availability of data processing. However, traditional data processing methods have the following shortcomings: 1. Traditional data acquisition has a single source, and the volume of data stored, managed, and analyzed is relatively small; most of it is handled by relational databases and parallel data warehouses. As for improving data processing speed through parallel computing, traditional parallel database technology pursues high consistency and fault tolerance, so according to the CAP theorem it is difficult to guarantee availability and scalability. 2. Traditional data processing is processor-centric, which greatly increases computing overhead and cannot meet the demand for processing large amounts of unstructured data.

Summary of the invention

In view of this, the present invention proposes a big data distributed processing method and system based on cloud computing.

A big data distributed processing method based on cloud computing comprises the following steps:

S1, receiving an input file, dividing it into input splits according to the input file size, and assigning one map task to each input split, each input split storing the split length and an array of positions recording the data;

S2, performing mapping on the data storage nodes by a pre-written map function to obtain intermediate files;

S3, merging duplicate key values in the intermediate files to reduce redundancy in the map output files; serializing the merged key values to obtain map cache files; automatically obtaining the computing load value of each compute node, and assigning each map cache file to a compute node according to the computing load values of the compute nodes;

S4, allocating a circular memory buffer in memory, the circular memory buffer being used to output the map output files; creating a configuration file in the circular memory buffer, and configuring the memory occupancy threshold of the memory buffer in the configuration file; when the memory occupancy of the circular memory buffer is greater than or equal to the occupancy threshold, a daemon thread pauses the writing of data into memory and writes a spill file in memory, the spill file determining the files to be written to disk, and the files in the circular memory buffer are written to disk until all map output files have been output;

S5, merging all map output files and storing them on a distributed file storage system.
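The threshold-triggered spill described in step S4 can be illustrated with a minimal Python sketch. It is a simplified, single-threaded stand-in, not the patented implementation: the class name `SpillBuffer`, the byte-size estimate via `pickle`, the default threshold of 0.8, and the spill file naming are all assumptions made for illustration. Writing pauses implicitly while `spill()` runs, whereas the text assigns that role to a separate thread.

```python
import os
import pickle
import tempfile


class SpillBuffer:
    """Bounded in-memory buffer that spills to disk at an occupancy threshold.

    Hypothetical stand-in for the circular memory buffer of step S4.
    """

    def __init__(self, capacity_bytes, threshold=0.8, spill_dir=None):
        self.capacity_bytes = capacity_bytes
        self.threshold = threshold  # assumed occupancy fraction, e.g. 0.8
        self.spill_dir = spill_dir or tempfile.mkdtemp(prefix="spill_")
        self.records = []
        self.used_bytes = 0
        self.spill_files = []

    def write(self, key, value):
        """Buffer one map output record; spill when the threshold is reached."""
        self.records.append((key, value))
        # Approximate the memory footprint by the pickled size of the record.
        self.used_bytes += len(pickle.dumps((key, value)))
        if self.used_bytes >= self.threshold * self.capacity_bytes:
            self.spill()

    def spill(self):
        """Write the buffered records to a new spill file and empty the buffer."""
        if not self.records:
            return
        name = f"spill_{len(self.spill_files)}.pkl"
        path = os.path.join(self.spill_dir, name)
        with open(path, "wb") as f:
            pickle.dump(self.records, f)
        self.spill_files.append(path)
        self.records = []
        self.used_bytes = 0

    def close(self):
        """Flush whatever is left and return the list of spill files."""
        self.spill()
        return self.spill_files


def read_spills(paths):
    """Read all records back from the spill files, in write order."""
    records = []
    for path in paths:
        with open(path, "rb") as f:
            records.extend(pickle.load(f))
    return records
```

A caller would write every map output record through `write()` and call `close()` once the map task has finished, so that all output ends up in spill files on disk.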

In the cloud-computing-based big data distributed processing method of the present invention, dividing the input file into input splits according to its size in step S1 includes:

establishing an association table; splitting the input file into a position relation value, an activity relation value, a structure relation value, a function relation value, a functional relation value, a behavior relation value, and other relation values; and writing the correspondence of the relation values of each input file into the association table;

the data corresponding to each relation value are contained in the input split.
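One way to picture the association table is as a mapping from each input file to the relation values it carries. The sketch below is only an illustration under stated assumptions: the key names in `RELATION_KEYS`, the dictionary-shaped record, and the function `split_into_relations` are hypothetical, since the text does not specify a data layout.

```python
# Hypothetical names for the seven relation values listed in the text.
RELATION_KEYS = (
    "position", "activity", "structure",
    "function", "functional", "behavior", "other",
)


def split_into_relations(file_id, record, table):
    """Build an input split from a record and register it in the table.

    record: dict mapping field names to data (assumed input shape).
    table:  dict acting as the association table, updated in place.
    """
    # Keep only the recognized relation values, in the canonical order.
    relations = {key: record[key] for key in RELATION_KEYS if key in record}
    # Record which relation values this input file carries.
    table[file_id] = tuple(relations)
    return relations
```

Fields that are not relation values are dropped here; whether the method keeps them under the "other" relation value is left open by the text.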

In the cloud-computing-based big data distributed processing method of the present invention, step S2 includes:

mapping the input splits according to the map tasks by the pre-written map function, where the mapping includes aligning the content of each input split into a list according to a preset data format and judging whether the position relation value, activity relation value, structure relation value, function relation value, functional relation value, behavior relation value, and other relation values exist; a relation value that exists is retained directly, and if one or several relation values do not exist, the missing relation values are set to null; the relation values are arranged in a consistent order.
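The list alignment described above amounts to projecting each input split onto a fixed, ordered set of relation values and filling the gaps with null. A minimal sketch, assuming splits are dictionaries and using hypothetical key names:

```python
# Hypothetical names for the seven relation values, in a fixed order.
RELATION_KEYS = (
    "position", "activity", "structure",
    "function", "functional", "behavior", "other",
)


def align_split(split):
    """Return the split's relation values as a list in canonical order.

    Existing values are kept as they are; missing ones become None.
    """
    return [split.get(key) for key in RELATION_KEYS]
```

Because every aligned list has the same length and order, downstream nodes can address a relation value by index without inspecting each record.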

In the cloud-computing-based big data distributed processing method of the present invention,

step S5 includes:

querying the association table for all index information corresponding to each map output file; each map output file corresponds to one segment of data, which is inserted into a segment list; and recording the position relation value, activity relation value, structure relation value, function relation value, functional relation value, behavior relation value, and other relation values of the segment data.

In step S2, mapping the input splits according to the map tasks by the pre-written map function further includes judging, according to the association table, whether an input split has a logic error, and discarding the input split if it does.
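The text does not define what counts as a logic error. Purely as an illustration, the sketch below assumes that a split is inconsistent when its file is missing from the association table or when the relation values it carries differ from those recorded there; the function name and data shapes are hypothetical.

```python
def drop_inconsistent(splits, table):
    """Keep only the splits that agree with the association table.

    splits: list of (file_id, relations_dict) pairs (assumed shape).
    table:  dict mapping file_id to the recorded relation names.
    """
    kept = []
    for file_id, relations in splits:
        recorded = table.get(file_id)
        # Assumed rule: unknown files and mismatched relation sets are errors.
        if recorded is not None and set(relations) == set(recorded):
            kept.append((file_id, relations))
    return kept
```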

The present invention also provides a big data distributed processing system based on cloud computing, which includes the following units:

a splitting unit, for receiving an input file, dividing it into input splits according to the input file size, and assigning one map task to each input split, each input split storing the split length and an array of positions recording the data;

a mapping unit, for performing mapping on the data storage nodes by a pre-written map function to obtain intermediate files;

a computing unit, for merging duplicate key values in the intermediate files to reduce redundancy in the map output files; serializing the merged key values to obtain map cache files; automatically obtaining the computing load value of each compute node, and assigning each map cache file to a compute node according to the computing load values of the compute nodes;

an output unit, for allocating a circular memory buffer in memory, the circular memory buffer being used to output the map output files; creating a configuration file in the circular memory buffer, and configuring the memory occupancy threshold of the memory buffer in the configuration file; when the memory occupancy of the circular memory buffer is greater than or equal to the occupancy threshold, a daemon thread pauses the writing of data into memory and writes a spill file in memory, the spill file determining the files to be written to disk, and the files in the circular memory buffer are written to disk until all map output files have been output;

a merging and storage unit, for merging all map output files and storing them on a distributed file storage system.
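The work of the splitting unit and the computing unit listed above can be sketched in Python as follows. This is a hedged illustration, not the patented implementation: representing a split as an `(offset, length)` pair, merging duplicate keys by summing their values, and the greedy least-loaded assignment are all assumptions, as are the function names.

```python
import heapq
from collections import defaultdict


def make_splits(file_size, split_size):
    """Divide a file of file_size bytes into (offset, length) input splits."""
    return [
        (offset, min(split_size, file_size - offset))
        for offset in range(0, file_size, split_size)
    ]


def combine(pairs):
    """Merge duplicate keys; summing the values is an assumed merge rule."""
    merged = defaultdict(int)
    for key, value in pairs:
        merged[key] += value
    return dict(merged)


def assign_by_load(items, node_loads):
    """Assign each (item, cost) to the currently least-loaded node.

    node_loads: dict mapping node name to its current load value.
    """
    heap = [(load, node) for node, load in node_loads.items()]
    heapq.heapify(heap)
    assignment = {}
    for item, cost in items:
        load, node = heapq.heappop(heap)  # node with the lowest load
        assignment[item] = node
        heapq.heappush(heap, (load + cost, node))
    return assignment
```

For example, `make_splits(250, 100)` yields three splits, the last one covering the remaining 50 bytes, and each split would then be handed to its own map task.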

In the cloud-computing-based big data distributed processing system of the present invention, dividing the input file into input splits according to its size in the splitting unit includes:

the data corresponding to each relation value are contained in the input split.

In the cloud-computing-based big data distributed processing system of the present invention, the mapping unit includes:

In the cloud-computing-based big data distributed processing system of the present invention,

the merging and storage unit includes:

In the mapping unit, mapping the input splits according to the map tasks by the pre-written map function further includes judging, according to the association table, whether an input split has a logic error, and discarding the input split if it does.

Compared with the prior art, implementing the cloud-computing-based big data distributed processing method and system provided by the present invention has the following beneficial effects: massive data are divided into several parts according to preset rules and distributed to multiple processors for parallel processing; the results from each processor are then aggregated to obtain the final result. This makes it possible to process large amounts of unstructured data and improves both the range of data types that can be processed and the processing speed.

Brief description of the drawings

Fig. 1 is a flow chart of the cloud-computing-based big data distributed processing method of an embodiment of the present invention.

Detailed description of the embodiments

As shown in Fig. 1, a big data distributed processing method based on cloud computing comprises the following steps:

S5, merging all map output files and storing them on a distributed file storage system.

The data corresponding to each relation value are contained in the input split.

By implementing this embodiment of the present invention, data of various types can be uniformly split into the relation values, even if some types of data lack certain relation values. Distributed processing is then performed on each relation value, which can greatly improve data processing capability.

By implementing this embodiment, the content of each input split is aligned into a list according to the preset data format, so that subsequent compute nodes consume fewer processing resources.

Step S5 includes:

By implementing this embodiment, redundancy and errors in the data can be detected, which reduces the amount of computation.

The data corresponding to each relation value are contained in the input split.

The merging and storage unit includes:

Compared with the prior art, implementing the cloud-computing-based big data distributed processing method and system provided by the present invention has the following beneficial effects: massive data are divided into several parts according to preset rules and distributed to multiple processors for parallel processing; the results from each processor are then aggregated to obtain the final result. This makes it possible to process large amounts of unstructured data and improves both the range of data types that can be processed and the processing speed. The invention can be applied in fields such as intelligent robot control and rail transit control, and has broad application prospects.

It will be understood that a person of ordinary skill in the art can make various other corresponding changes and modifications according to the technical concept of the present invention, and all such changes and modifications shall fall within the scope of protection of the claims of the present invention.

Claims

1. A big data distributed processing method based on cloud computing, characterized in that it comprises the following steps:

S1, receiving an input file, dividing it into input splits according to the input file size, and assigning one map task to each input split, each input split storing the split length and an array of positions recording the data;

S3, merging duplicate key values in the intermediate files to reduce redundancy in the map output files; serializing the merged key values to obtain map cache files; automatically obtaining the computing load value of each compute node, and assigning each map cache file to a compute node according to the computing load values of the compute nodes;

S4, allocating a circular memory buffer in memory, the circular memory buffer being used to output the map output files; creating a configuration file in the circular memory buffer, and configuring the memory occupancy threshold of the memory buffer in the configuration file; when the memory occupancy of the circular memory buffer is greater than or equal to the occupancy threshold, a daemon thread pauses the writing of data into memory and writes a spill file in memory, the spill file determining the files to be written to disk, and the files in the circular memory buffer are written to disk until all map output files have been output;

S5, merging all map output files and storing them on a distributed file storage system.

2. The big data distributed processing method based on cloud computing as claimed in claim 1, characterized in that dividing the input file into input splits according to its size in step S1 includes:

establishing an association table; splitting the input file into a position relation value, an activity relation value, a structure relation value, a function relation value, a functional relation value, a behavior relation value, and other relation values; and writing the correspondence of the relation values of each input file into the association table;

the data corresponding to each relation value are contained in the input split.

3. The big data distributed processing method based on cloud computing as claimed in claim 2, characterized in that step S2 includes:

mapping the input splits according to the map tasks by the pre-written map function, where the mapping includes aligning the content of each input split into a list according to a preset data format and judging whether the position relation value, activity relation value, structure relation value, function relation value, functional relation value, behavior relation value, and other relation values exist; a relation value that exists is retained directly, and if one or several relation values do not exist, the missing relation values are set to null; the relation values are arranged in a consistent order.

4. The big data distributed processing method based on cloud computing as claimed in claim 3, characterized in that

step S5 includes:

querying the association table for all index information corresponding to each map output file; each map output file corresponds to one segment of data, which is inserted into a segment list; and recording the position relation value, activity relation value, structure relation value, function relation value, functional relation value, behavior relation value, and other relation values of the segment data.

5. The big data distributed processing method based on cloud computing as claimed in claim 3, characterized in that

in step S2, mapping the input splits according to the map tasks by the pre-written map function further includes judging, according to the association table, whether an input split has a logic error, and discarding the input split if it does.

6. A big data distributed processing system based on cloud computing, characterized in that it includes the following units:

a splitting unit, for receiving an input file, dividing it into input splits according to the input file size, and assigning one map task to each input split, each input split storing the split length and an array of positions recording the data;

a mapping unit, for performing mapping on the data storage nodes by a pre-written map function to obtain intermediate files;

a computing unit, for merging duplicate key values in the intermediate files to reduce redundancy in the map output files; serializing the merged key values to obtain map cache files; automatically obtaining the computing load value of each compute node, and assigning each map cache file to a compute node according to the computing load values of the compute nodes;

an output unit, for allocating a circular memory buffer in memory, the circular memory buffer being used to output the map output files; creating a configuration file in the circular memory buffer, and configuring the memory occupancy threshold of the memory buffer in the configuration file; when the memory occupancy of the circular memory buffer is greater than or equal to the occupancy threshold, a daemon thread pauses the writing of data into memory and writes a spill file in memory, the spill file determining the files to be written to disk, and the files in the circular memory buffer are written to disk until all map output files have been output;

7. The big data distributed processing system based on cloud computing as claimed in claim 6, characterized in that dividing the input file into input splits according to its size in the splitting unit includes:

the data corresponding to each relation value are contained in the input split.

8. The big data distributed processing method based on cloud computing as claimed in claim 7, characterized in that the mapping unit includes:

9. The big data distributed processing system based on cloud computing as claimed in claim 8, characterized in that

the merging and storage unit includes:

10. The big data distributed processing system based on cloud computing as claimed in claim 9, characterized in that

in the mapping unit, mapping the input splits according to the map tasks by the pre-written map function further includes judging, according to the association table, whether an input split has a logic error, and discarding the input split if it does.