CN108132970A

CN108132970A - Distributed big data processing method and system based on cloud computing

Info

Publication number: CN108132970A
Application number: CN201711259714.4A
Authority: CN
Inventors: 黄凯锋; 周岩; 王旭辉; 李莉; 孟庆超
Original assignee: Luoyang Normal University
Current assignee: Luoyang Normal University
Priority date: 2017-12-04
Filing date: 2017-12-04
Publication date: 2018-06-08

Abstract

A distributed big data processing method based on cloud computing, comprising the following steps: S1, receiving an input file, splitting it into input splits according to the input file size, and assigning each input split to one map task, each input split storing the split length and an array recording the positions of the data; S2, performing mapping on the data storage nodes with a pre-written map function to obtain intermediate files; S3, merging duplicate key-value pairs in the intermediate files; S4, allocating a circular memory buffer in memory, the circular memory buffer being used to output the map output files; creating a configuration file in the circular memory buffer; pausing, by a guard thread, the writing of data into memory and writing a spill file in memory, the spill file determining which files are written to disk; and writing the files of the circular memory buffer to disk until all map output files have been output; S5, merging all map output files and storing them on a distributed file storage system.

Description

Distributed big data processing method and system based on cloud computing

Technical field

The present invention relates to the technical field of big data and cloud computing, and more particularly to a distributed big data processing method and system based on cloud computing.

Background art

With the arrival of the cloud era, big data has attracted more and more attention. The term big data usually refers to large amounts of unstructured and semi-structured data, which are downloaded to a relational database for analysis. Big data analysis is often associated with cloud computing, because real-time analysis of large data sets requires a framework such as MapReduce to distribute work across tens, hundreds, or even thousands of computers. Big data requires special techniques to process large amounts of data effectively within a tolerable elapsed time. Technologies suitable for big data include massively parallel processing (MPP) databases, data mining systems, distributed file systems, distributed databases, cloud computing platforms, the Internet, and scalable storage systems.

In a big data environment, data sources are abundant and data types are diverse, the volume of data to be stored, analyzed, and mined is huge, the requirements for data presentation are high, and great importance is attached to the efficiency and availability of data processing. However, traditional data processing methods have the following shortcomings: 1. Traditional data acquisition relies on a single source, and the volume of data stored, managed, and analyzed is relatively small, so most of it can be handled with relational databases and parallel data warehouses. As for relying on parallel computing to increase data processing speed, traditional parallel database technology pursues high consistency and fault tolerance and, according to the CAP theorem, has difficulty guaranteeing availability and scalability. 2. Traditional data processing methods are processor-centric, which greatly increases computation overhead and cannot meet the demand for processing large amounts of unstructured data.

Summary of the invention

In view of this, the present invention proposes a distributed big data processing method and system based on cloud computing.

A distributed big data processing method based on cloud computing comprises the following steps:

S1, receiving an input file, splitting it into input splits according to the input file size, and assigning each input split to one map task, each input split storing the split length and an array recording the positions of the data;

S2, performing mapping on the data storage nodes with a pre-written map function to obtain intermediate files;

S3, merging duplicate key-value pairs in the intermediate files to reduce redundancy in the map output files; serializing the merged key-value pairs to obtain map cache files; automatically obtaining the compute load value of each compute node, and assigning each map cache file to a compute node according to the compute load values of the compute nodes;

S4, allocating a circular memory buffer in memory, the circular memory buffer being used to output the map output files; creating a configuration file in the circular memory buffer, the configuration file setting the memory occupancy threshold of the memory buffer; when the memory occupancy of the circular memory buffer is greater than or equal to the occupancy threshold, pausing, by a guard thread, the writing of data into memory and writing a spill file in memory, the spill file determining which files are written to disk; and writing the files of the circular memory buffer to disk until all map output files have been output;

S5, merging all map output files and storing them on a distributed file storage system.
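Step S1 can be illustrated with a minimal sketch. It assumes newline-delimited records and a byte-based split size; the names `InputSplit` and `make_splits` are hypothetical and do not come from the patent. Each split keeps its length and an array of record positions, as S1 describes.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class InputSplit:
    """One input split: its length plus the start offset of every record in it."""
    start: int                                   # byte offset of the split in the input
    length: int = 0                              # split length in bytes
    record_offsets: List[int] = field(default_factory=list)  # positions of the records


def make_splits(data: bytes, split_size: int) -> List[InputSplit]:
    """Cut `data` into splits of at least `split_size` bytes on record boundaries.

    Assumption: records are newline-delimited; the patent does not fix a format.
    """
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    splits: List[InputSplit] = []
    current: Optional[InputSplit] = None
    pos = 0
    while pos < len(data):
        newline = data.find(b"\n", pos)
        end = len(data) if newline == -1 else newline + 1
        if current is None:
            current = InputSplit(start=pos)
        current.record_offsets.append(pos)       # remember where this record begins
        current.length += end - pos
        if current.length >= split_size:         # split is full: close it
            splits.append(current)
            current = None
        pos = end
    if current is not None:                      # trailing, partially filled split
        splits.append(current)
    return splits
```

Each element of the returned list would then be handed to exactly one map task, for example by enumerating the list and using the index as a task identifier.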

In the distributed big data processing method based on cloud computing of the present invention, splitting into input splits according to the input file size in step S1 includes:

establishing an association table, splitting the input file into a position relation value, an activity relation value, a structural relation value, a function relation value, a functional relation value, a behavior relation value, and other relation values, and writing the correspondence between the relation values of each input file into the association table;

cutting the data corresponding to each relation value into an input split.
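A rough sketch of this association table follows. It assumes each input file has already been parsed into a dictionary keyed by relation category; the identifiers in `RELATIONS` and the function `split_by_relation` are hypothetical names chosen for illustration, since the patent does not specify a data layout.

```python
from typing import Any, Dict, List, Tuple

# The seven relation categories named in the text; the identifiers are illustrative.
RELATIONS = ("position", "activity", "structure", "function",
             "functional", "behavior", "other")

Split = Tuple[str, str, Any]  # (file name, relation category, data)


def split_by_relation(
    files: Dict[str, Dict[str, Any]]
) -> Tuple[Dict[str, Dict[str, int]], List[Split]]:
    """Build the association table and cut each present relation value into a split.

    table[file][relation] is the index of the split that holds that relation value.
    """
    table: Dict[str, Dict[str, int]] = {}
    splits: List[Split] = []
    for name, content in files.items():
        table[name] = {}
        for relation in RELATIONS:
            if relation in content:
                table[name][relation] = len(splits)   # record the correspondence
                splits.append((name, relation, content[relation]))
    return table, splits
```

Keeping the split index in the table is one possible way to find a file's relation values again later, which is what the lookup described for step S5 would need.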

In the distributed big data processing method based on cloud computing of the present invention, step S2 includes:

mapping the input splits according to the map tasks with the pre-written map function, the mapping including arranging the content of each input split into a row-and-column list according to a preset data format, and determining whether the position relation value, activity relation value, structural relation value, function relation value, functional relation value, behavior relation value, and other relation values exist; each relation value that exists is retained directly, and if one or several relation values do not exist, the missing relation values are set to null; the relation values are arranged in a consistent order.
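The row arrangement described for step S2 might look like the following sketch. The column identifiers and the functions `to_row` and `map_split` are hypothetical; `None` stands in for the null value assigned to a missing relation value.

```python
from typing import Any, Dict, List, Optional

# Fixed column order shared by every row; the identifiers are illustrative.
COLUMNS = ("position", "activity", "structure", "function",
           "functional", "behavior", "other")


def to_row(record: Dict[str, Any]) -> List[Optional[Any]]:
    """Arrange one record as a fixed-order row; missing relation values become None."""
    return [record.get(column) for column in COLUMNS]


def map_split(records: List[Dict[str, Any]]) -> List[List[Optional[Any]]]:
    """Map every record of an input split to a row with the same column order."""
    return [to_row(record) for record in records]
```

Because every row has the same length and column order, later stages can address a relation value by position instead of searching for it by name.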

In the distributed big data processing method based on cloud computing of the present invention,

step S5 includes:

querying the association table for all index information corresponding to each map output file, and inserting each segment of data corresponding to each map output file into a segment list; recording the position relation value, activity relation value, structural relation value, function relation value, functional relation value, behavior relation value, and other relation values of the segment data.

In step S2, mapping the input splits according to the map tasks with the pre-written map function further includes determining, according to the association table, whether an input split contains a logic error, and discarding the input split if it does.

The present invention also provides a distributed big data processing system based on cloud computing, comprising the following units:

a splitting unit, configured to receive an input file, split it into input splits according to the input file size, and assign each input split to one map task, each input split storing the split length and an array recording the positions of the data;

a mapping unit, configured to perform mapping on the data storage nodes with a pre-written map function to obtain intermediate files;

a computing unit, configured to merge duplicate key-value pairs in the intermediate files to reduce redundancy in the map output files; serialize the merged key-value pairs to obtain map cache files; automatically obtain the compute load value of each compute node; and assign each map cache file to a compute node according to the compute load values of the compute nodes;

an output unit, configured to allocate a circular memory buffer in memory, the circular memory buffer being used to output the map output files; create a configuration file in the circular memory buffer, the configuration file setting the memory occupancy threshold of the memory buffer; when the memory occupancy of the circular memory buffer is greater than or equal to the occupancy threshold, have a guard thread pause the writing of data into memory and write a spill file in memory, the spill file determining which files are written to disk; and write the files of the circular memory buffer to disk until all map output files have been output;

a merging and storage unit, configured to merge all map output files and store them on a distributed file storage system.
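The threshold-triggered spilling performed by the output unit can be approximated with the single-threaded sketch below. It is a simplification under stated assumptions: a plain bounded list replaces true circular indexing, the spill runs synchronously instead of in a separate guard thread (so writes are implicitly paused while it runs), capacity is counted in records, and `pickle` is an arbitrary choice of on-disk format. The class name `SpillBuffer` and its parameters are hypothetical.

```python
import os
import pickle
import tempfile
from typing import Any, List, Optional, Tuple


class SpillBuffer:
    """Bounded in-memory buffer that spills to disk once occupancy reaches a threshold."""

    def __init__(self, capacity: int, threshold: float,
                 spill_dir: Optional[str] = None) -> None:
        if capacity <= 0 or not 0.0 < threshold <= 1.0:
            raise ValueError("capacity must be > 0 and threshold in (0, 1]")
        self.capacity = capacity        # maximum number of buffered records
        self.threshold = threshold      # occupancy ratio that triggers a spill
        self.spill_dir = spill_dir or tempfile.mkdtemp(prefix="spill_")
        self.records: List[Tuple[Any, Any]] = []
        self.spill_files: List[str] = []

    def write(self, key: Any, value: Any) -> None:
        """Buffer one map output record, spilling when the threshold is reached."""
        self.records.append((key, value))
        if len(self.records) / self.capacity >= self.threshold:
            self.spill()

    def spill(self) -> None:
        """Write the buffered records to a new spill file and empty the buffer."""
        if not self.records:
            return
        path = os.path.join(self.spill_dir, f"spill_{len(self.spill_files)}.bin")
        with open(path, "wb") as handle:
            pickle.dump(list(self.records), handle)
        self.spill_files.append(path)
        self.records.clear()

    def close(self) -> List[str]:
        """Flush whatever is left and return the paths of all spill files."""
        self.spill()
        return self.spill_files
```

With `capacity=4` and `threshold=0.5`, every second record triggers a spill, and `close()` flushes the remainder so that all map output ends up on disk.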

In the distributed big data processing system based on cloud computing of the present invention, splitting into input splits according to the input file size in the splitting unit includes:

In the distributed big data processing system based on cloud computing of the present invention, the mapping unit includes:

In the distributed big data processing system based on cloud computing of the present invention,

the merging and storage unit includes:

In the mapping unit, mapping the input splits according to the map tasks with the pre-written map function further includes determining, according to the association table, whether an input split contains a logic error, and discarding the input split if it does.

Compared with the prior art, the distributed big data processing method and system based on cloud computing provided by the present invention have the following advantages: massive data is divided into several parts according to preset rules and distributed to multiple processors for parallel processing; the results produced by each processor are then aggregated to obtain the final result. This makes it possible to process large amounts of unstructured data, broadening the types of data that can be handled and increasing processing speed.
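Distributing work to multiple processors depends on the merging and load-based assignment of step S3, which the following sketch illustrates. Several points are assumptions rather than details from the patent: values sharing a key are combined by summation, JSON is used for serialization, a cache file's byte size approximates the load it adds, and the function names are hypothetical.

```python
import heapq
import json
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def merge_duplicate_keys(pairs: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Combine values that share a key (summation is an assumed combine rule)."""
    merged: Dict[str, int] = defaultdict(int)
    for key, value in pairs:
        merged[key] += value
    return dict(merged)


def serialize(merged: Dict[str, int]) -> bytes:
    """Serialize merged pairs into the bytes of one cache file (JSON is an assumption)."""
    return json.dumps(merged, sort_keys=True).encode("utf-8")


def assign_to_nodes(cache_files: List[bytes],
                    loads: Dict[str, float]) -> Dict[str, List[int]]:
    """Give each cache file to the node with the lowest current load.

    The byte size of a file is used as an assumed estimate of the load it adds.
    """
    if not loads:
        raise ValueError("at least one compute node is required")
    heap = [(load, node) for node, load in loads.items()]
    heapq.heapify(heap)                               # min-heap keyed on load
    assignment: Dict[str, List[int]] = {node: [] for node in loads}
    for index, blob in enumerate(cache_files):
        load, node = heapq.heappop(heap)              # least-loaded node
        assignment[node].append(index)
        heapq.heappush(heap, (load + len(blob), node))
    return assignment
```

A min-heap keeps each assignment at logarithmic cost in the number of nodes; a real deployment would refresh the load values from the nodes rather than estimate them from file sizes.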

Description of the drawings

Fig. 1 is a flow chart of the language transfer method in the improved wireless communication process according to an embodiment of the present invention.

Detailed description of the embodiments

As shown in Fig. 1, a distributed big data processing method based on cloud computing comprises the following steps:

By implementing this embodiment of the present invention, data of various types can be uniformly split into relation values, even if some relation values are absent for particular types of data. Distributed processing is then performed on each relation value, which can greatly improve data processing capacity.

By implementing this embodiment, the content of each input split is arranged into a row-and-column list according to the preset data format, so that subsequent compute nodes consume fewer processing resources.

Step S5 includes:

By implementing this embodiment, redundancy and error checks can be performed on the data, reducing the amount of computation.

The merging and storage unit includes:

It should be understood that those of ordinary skill in the art can make various other corresponding changes and modifications according to the technical concept of the present invention, and all such changes and modifications shall fall within the protection scope of the claims of the present invention.

Claims

1. A distributed big data processing method based on cloud computing, characterized in that it comprises the following steps:

S1, receiving an input file, splitting it into input splits according to the input file size, and assigning each input split to one map task, each input split storing the split length and an array recording the positions of the data;

S3, merging duplicate key-value pairs in the intermediate files to reduce redundancy in the map output files; serializing the merged key-value pairs to obtain map cache files; automatically obtaining the compute load value of each compute node, and assigning each map cache file to a compute node according to the compute load values of the compute nodes;

S4, allocating a circular memory buffer in memory, the circular memory buffer being used to output the map output files; creating a configuration file in the circular memory buffer, the configuration file setting the memory occupancy threshold of the memory buffer; when the memory occupancy of the circular memory buffer is greater than or equal to the occupancy threshold, pausing, by a guard thread, the writing of data into memory and writing a spill file in memory, the spill file determining which files are written to disk; and writing the files of the circular memory buffer to disk until all map output files have been output;

2. The distributed big data processing method based on cloud computing according to claim 1, characterized in that splitting into input splits according to the input file size in step S1 includes:

establishing an association table, splitting the input file into a position relation value, an activity relation value, a structural relation value, a function relation value, a functional relation value, a behavior relation value, and other relation values, and writing the correspondence between the relation values of each input file into the association table;

3. The distributed big data processing method based on cloud computing according to claim 2, characterized in that step S2 includes:

mapping the input splits according to the map tasks with the pre-written map function, the mapping including arranging the content of each input split into a row-and-column list according to a preset data format, and determining whether the position relation value, activity relation value, structural relation value, function relation value, functional relation value, behavior relation value, and other relation values exist; each relation value that exists is retained directly, and if one or several relation values do not exist, the missing relation values are set to null; the relation values are arranged in a consistent order.

4. The distributed big data processing method based on cloud computing according to claim 3, characterized in that

step S5 includes:

querying the association table for all index information corresponding to each map output file, and inserting each segment of data corresponding to each map output file into a segment list; recording the position relation value, activity relation value, structural relation value, function relation value, functional relation value, behavior relation value, and other relation values of the segment data.

5. The distributed big data processing method based on cloud computing according to claim 3, characterized in that

in step S2, mapping the input splits according to the map tasks with the pre-written map function further includes determining, according to the association table, whether an input split contains a logic error, and discarding the input split if it does.

6. A distributed big data processing system based on cloud computing, characterized in that it comprises the following units:

a splitting unit, configured to receive an input file, split it into input splits according to the input file size, and assign each input split to one map task, each input split storing the split length and an array recording the positions of the data;

a mapping unit, configured to perform mapping on the data storage nodes with a pre-written map function to obtain intermediate files;

a computing unit, configured to merge duplicate key-value pairs in the intermediate files to reduce redundancy in the map output files; serialize the merged key-value pairs to obtain map cache files; automatically obtain the compute load value of each compute node; and assign each map cache file to a compute node according to the compute load values of the compute nodes;

an output unit, configured to allocate a circular memory buffer in memory, the circular memory buffer being used to output the map output files; create a configuration file in the circular memory buffer, the configuration file setting the memory occupancy threshold of the memory buffer; when the memory occupancy of the circular memory buffer is greater than or equal to the occupancy threshold, have a guard thread pause the writing of data into memory and write a spill file in memory, the spill file determining which files are written to disk; and write the files of the circular memory buffer to disk until all map output files have been output;

7. The distributed big data processing system based on cloud computing according to claim 6, characterized in that splitting into input splits according to the input file size in the splitting unit includes:

8. The distributed big data processing method based on cloud computing according to claim 7, characterized in that the mapping unit includes:

9. The distributed big data processing system based on cloud computing according to claim 8, characterized in that

the merging and storage unit includes:

10. The distributed big data processing system based on cloud computing according to claim 9, characterized in that

in the mapping unit, mapping the input splits according to the map tasks with the pre-written map function further includes determining, according to the association table, whether an input split contains a logic error, and discarding the input split if it does.