CN104731968B - Clustering mining method for large-scale datasets on a single machine - Google Patents


Info

Publication number
CN104731968B
CN104731968B (application CN201510163967.6A)
Authority
CN
China
Prior art keywords
result
memory
data
cluster
unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510163967.6A
Other languages
Chinese (zh)
Other versions
CN104731968A (en)
Inventor
范仕良
张雪洁
骆融臻
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Evastellar Information Technology Co ltd
Original Assignee
Hohai University HHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hohai University HHU filed Critical Hohai University HHU
Priority to CN201510163967.6A priority Critical patent/CN104731968B/en
Publication of CN104731968A publication Critical patent/CN104731968A/en
Application granted granted Critical
Publication of CN104731968B publication Critical patent/CN104731968B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Abstract

The invention discloses a clustering mining method for large-scale datasets on a single machine, comprising three main steps: first, resolving the memory leak problems that commonly arise when reading a large-scale dataset; second, fully exploiting the hardware and the "divide and conquer" algorithmic idea to convert the big-data problem into easily solved small-data problems; third, building an appropriate mining model based on a clustering mining algorithm, completing the clustering mining of the small data blocks in turn, and finally merging the individual mining results into the final result. By redesigning the storage mode, extending virtual memory, and improving operational efficiency, the disclosed method effectively overcomes the memory limitations common in big-data mining, so that the clustering mining of a GB-scale dataset can be completed on a single, normally operating physical machine without relying on a network cluster.

Description

Clustering mining method for large-scale datasets on a single machine
Technical field
The present invention relates to a clustering mining method for large-scale datasets on a single machine, and belongs to the technical field of data mining.
Background technology
In recent years, with the rapid development and widespread application of computer and information technology, the scale of industry application systems has expanded rapidly, and the data generated by these applications has grown explosively. Industry and enterprise big data, which easily reaches hundreds of GB or even tens to hundreds of TB, far exceeds the processing capacity of existing traditional computing techniques and information systems.
Big data poses many new challenges to traditional computing techniques. Many traditional serial algorithms that are effective on small datasets cannot finish computing within an acceptable time when applied to big data; meanwhile, characteristics of big data such as heavier noise, sparse samples, and sample imbalance reduce the effectiveness of many existing machine learning algorithms. At the same time, alongside these enormous technological challenges, big data brings enormous opportunities for technological innovation and business. Therefore, seeking effective big-data processing techniques, approaches, and means has become an urgent demand of the real world.
Most of today's big-data processing techniques rely on network clusters to realize distributed data mining, which demands considerable resources and conditions and is difficult to implement. When only a single physical machine is available, a reasonably mature scheme is needed to complete the mining of a large-scale dataset.
Summary of the invention
Object of the invention: to address the problems existing in the prior art, the present invention provides a clustering mining method for large-scale datasets on a single machine.
The method provided by the invention effectively resolves the severe memory shortage of a single physical machine through virtual-memory and block-processing mechanisms; it improves the mining efficiency on large-scale datasets through a shared-memory mechanism, a memory-mapped-file mechanism, and a single-machine multi-core parallel processing mechanism; and it builds a clustering mining model in the R language, completes the mining of the sub-datasets, and merges their results, thereby realizing the clustering mining of large-scale datasets on a single machine.
Technical solution: a clustering mining method for large-scale datasets on a single machine, which uses several mechanisms to solve the memory leak and operational-efficiency problems encountered in big-data clustering mining on a single machine, mainly comprising:
A. Solving the memory leak problem: this is achieved mainly by improving the traditional way data is stored and read. Using virtual memory and a block-processing mechanism, the big data is partitioned into blocks of a fixed size; each time, only one or two blocks are read and loaded into memory, and memory and temporary space are released promptly after processing, so that the physical machine's memory is used in a time-shared manner.
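A minimal Python sketch of the block-wise reading scheme in A (the patent's own implementation uses R with C underneath; the file, handler, and block size here are illustrative stand-ins):

```python
import os
import tempfile

BLOCK_SIZE = 16 * 1024 * 1024  # a fixed block size in the 10-20 MB range

def process_in_blocks(path, handle_block):
    """Read a large file one block at a time, so at most one block
    resides in memory; each block is released as soon as it is done."""
    results = []
    with open(path, "rb") as f:
        while True:
            block = f.read(BLOCK_SIZE)      # load one block into memory
            if not block:
                break
            results.append(handle_block(block))
            del block                       # release promptly, as in step A
    return results

# Tiny demonstration: summarize a 100-byte temporary file block by block.
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(b"x" * 100)
tmp.close()
sizes = process_in_blocks(tmp.name, len)
os.unlink(tmp.name)
```

Because only the current block is referenced, peak memory stays near one block size regardless of the total file size.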
B. Solving the operational-efficiency problem: fully exploiting the hardware and the "divide and conquer" idea on a single machine is the key to improving the efficiency of big-data mining. A shared-memory mechanism, a memory-mapped-file mechanism, and a single-machine multi-core parallel processing mechanism are used: multiple processes share one memory region through shared memory, which makes parallel algorithms easy to implement; memory-mapped files allow files on disk to be processed without performing explicit I/O operations, saving run time; and the single-machine multi-core mechanism assigns the mining tasks of different data blocks to different processor cores for parallel processing.
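The memory-mapped-file idea in B can be illustrated with Python's standard `mmap` module (an illustrative stand-in for the C-level mechanism the patent describes; the file contents are made up):

```python
import mmap
import os
import tempfile

# A small file on disk stands in for one data block.
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(b"0123456789")
tmp.close()

# Map the file into the process address space: the bytes are read and
# modified through the mapping, without explicit read()/write() calls.
with open(tmp.name, "r+b") as f:
    with mmap.mmap(f.fileno(), 0) as mm:
        first = bytes(mm[0:4])   # read through the mapping
        mm[0:4] = b"abcd"        # write in place, as if it were memory

with open(tmp.name, "rb") as f:
    data = f.read()              # the change landed in the file itself
os.unlink(tmp.name)
```

The operating system pages the file in and out on demand, which is what saves the explicit I/O the patent refers to.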
C. Clustering mining: a clustering mining model is built in the R language. Model parameters are first customized, and after the clustering mining of all subtasks is completed, their results are merged: the center points of the clusters obtained so far are reorganized as the input of the mining model, and this iteration continues until the result meets the preset number of clusters.
Beneficial effects: with the above technical solution, the clustering mining method for large-scale datasets on a single machine provided by the invention effectively solves, by means of the shared-memory mechanism, the memory-mapped-file mechanism, and the single-machine multi-core parallel processing mechanism, the memory leak and operational-efficiency problems of big-data clustering mining on a single machine. It makes full use of the limited memory space and processor resources of a single physical machine and, where no network cluster is available or the conditions for building one are limited, provides a usable single-machine solution for the clustering mining of large-scale datasets.
Description of the drawings
Fig. 1 is the design flow chart of the embodiment of the present invention;
Fig. 2 is a schematic diagram of the implementation steps of the embodiment of the present invention;
Fig. 3 is a schematic diagram of the virtual-memory and block-processing mechanism in the embodiment of the present invention;
Fig. 4 shows the task-decomposition process of single-machine multi-core processing in the embodiment of the present invention;
Fig. 5 shows the result-aggregation process of single-machine multi-core processing in the embodiment of the present invention.
Specific embodiment
The present invention is further illustrated below with reference to specific embodiments. It should be understood that these embodiments are intended only to illustrate the invention and not to limit its scope; after reading the present invention, modifications of its various equivalent forms by those skilled in the art fall within the scope defined by the appended claims of this application.
The method provided by the present invention roughly divides the clustering mining of a large-scale dataset on a single machine into three major steps, as shown in Fig. 1: first building the clustering mining model, and then solving the memory leak problem and the operational-efficiency problem respectively. As shown in Fig. 2, the implementation steps of the method are: solving the memory leak problem, i.e., reading the big data into memory block by block based on the virtual-memory and block-processing mechanism; solving the operational-efficiency problem, i.e., greatly improving efficiency based on the shared-memory mechanism, the memory-mapped-file mechanism, and the single-machine multi-core parallel processing mechanism, by reducing process-switching time, saving the time of file I/O operations, and making full use of the processor's multiple cores for parallel computation; and building the clustering mining model and merging the mining results, i.e., building the model in the R language and, after the subtasks' mining work is completed, merging their results to obtain the final result.
In the embodiment of the present invention, the clustering mining method for large-scale datasets on a single machine comprises the following steps:
Step 1: solve the memory leak problem. First, according to the scale of the dataset, a storage region is dynamically carved out of the physical machine's hard disk as virtual memory, used to store the temporary data of the big data in the form of binary files. The big data is then partitioned into blocks of 10-20 MB, stored in the above temporary files, and a mapping index is established for each data block. During processing, one or two data blocks at a time are read from external storage into memory; after a data block is modified, its data are kept in the form of an R data frame so that they can serve as the input data of the mining model in Step 3. After processing is completed, the memory space and the virtual-memory space are released promptly. The virtual-memory and block-processing mechanism is shown in Fig. 3.
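The spill-to-disk storage with a per-block mapping index described in Step 1 can be sketched in Python as follows (block size shrunk from the patent's 10-20 MB for illustration; all names are hypothetical):

```python
import os
import tempfile

BLOCK = 4  # bytes per block; the patent partitions at 10-20 MB

def write_blocks(data, path):
    """Spill data to a binary file in fixed-size blocks, returning a
    mapping index of (offset, length) entries, one per block."""
    index = []
    with open(path, "wb") as f:
        for off in range(0, len(data), BLOCK):
            chunk = data[off:off + BLOCK]
            index.append((f.tell(), len(chunk)))
            f.write(chunk)
    return index

def read_block(path, index, i):
    """Load exactly one block back into memory via its index entry,
    mirroring the one-or-two-blocks-at-a-time policy of Step 1."""
    off, length = index[i]
    with open(path, "rb") as f:
        f.seek(off)
        return f.read(length)

path = tempfile.NamedTemporaryFile(delete=False).name
idx = write_blocks(b"abcdefghij", path)
second = read_block(path, idx, 1)
os.unlink(path)
```

The index lets any block be fetched directly by seek, so the whole dataset never has to be resident at once.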
Step 2: solve the operational-efficiency problem. First, the shared-memory mechanism is realized: multiple processes are allowed to share one block of physical memory and the data block within it, so that different processes can communicate through the shared memory and a data block can be shared by multiple processes, which makes it convenient for the single-machine multi-core mechanism below to realize parallel algorithms. Next, the memory-mapped-file mechanism is realized: a region of the address space is reserved, physical storage (a file already existing on disk) is committed to this region, and a mapping is established between the file and the corresponding memory region, so that the file on disk is processed in place and no further I/O operations need to be performed on it. Then, at the bottom layer, the data blocks obtained in Step 1 are processed with C-language data types: the data are stored as matrices, the matrices are allocated in shared memory or memory-mapped files, and a pointer object is made to point to each stored data block, realizing a file-caching mechanism that improves the efficiency of subsequent work. Finally, the single-machine multi-core parallel processing mechanism is realized: according to the specific configuration of the processor, one core is selected as the master and the other cores serve as workers. After the master receives a task, it decomposes the task into n-1 subtasks (n being the number of processor cores) and distributes them to the workers; the decomposition process is shown in Fig. 4. Each worker then independently processes its own small task (using the clustering mining model of Step 3) based on the shared-memory/memory-mapped-file mechanism; after finishing, each worker returns its intermediate result to the master, which finally summarizes the results and outputs them. The aggregation process is shown in Fig. 5.
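The master/worker decomposition and aggregation of Step 2 (Figs. 4 and 5) can be sketched with Python's `multiprocessing` pool; `mine_block` is a hypothetical stand-in for the per-block clustering task:

```python
import multiprocessing as mp

def mine_block(block):
    """Hypothetical per-block task; a real worker would run the
    clustering model here and return an intermediate result."""
    return sum(block) / len(block)

def master(blocks):
    """Decompose the job into one subtask per data block, run them on
    n-1 worker processes (one core plays master), and gather the
    intermediate results for the master to summarize."""
    n = mp.cpu_count()
    with mp.Pool(processes=max(1, n - 1)) as pool:
        partials = pool.map(mine_block, blocks)   # Fig. 4: scatter
    return partials                               # Fig. 5: gather

if __name__ == "__main__":
    print(master([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
```

Unlike the patent's shared-memory scheme, `Pool.map` copies each block to its worker; it is used here only to show the scatter/gather shape of the computation.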
Step 3: build the mining model and merge the mining results. The clara() function in the cluster package of the R language is wrapped with Java, which builds the clustering mining model; the model is deployed on each processor core, the relevant parameters are set (the parameters are customizable, such as the number of clusters), and the clustering mining of each data block is completed according to the single-machine multi-core mechanism of Step 2. After the data-block mining tasks on all workers are completed, all the cluster center points found in the mining results of these data blocks are reintegrated and mined as a new clustering task, so that the mining result now covers those data blocks. This is iterated until the mining result covers the initial large-scale dataset.
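The iterative merge of Step 3 can be sketched in plain Python; the tiny 1-D k-means below is only a stand-in for R's clara() (the deterministic initialization and the data are illustrative, not the patent's):

```python
def kmeans(points, k, iters=20):
    """Minimal 1-D k-means standing in for clara(); returns the
    sorted cluster center points."""
    pts = sorted(points)
    # deterministic init: spread initial centers over the sorted data
    centers = [pts[(len(pts) * i) // k] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in pts:
            i = min(range(k), key=lambda j: abs(p - centers[j]))
            groups[i].append(p)
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return sorted(centers)

def cluster_large_dataset(blocks, k):
    """Step 3: cluster each block, then repeatedly re-cluster the
    collected center points until only k centers remain."""
    centers = []
    for block in blocks:
        centers.extend(kmeans(block, k))     # per-block mining result
    while len(centers) > k:
        centers = kmeans(centers, k)         # merge by re-clustering
    return centers

res = cluster_large_dataset([[1.0, 1.1, 9.0, 9.1],
                             [0.9, 1.2, 8.9, 9.2]], k=2)
```

Each block contributes only its k centers to the merge step, so the re-clustering input stays small even when the original dataset is GB-scale.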

Claims (1)

  1. A clustering mining method for large-scale datasets on a single machine, characterized in that it provides a scheme that effectively solves the memory leak and operational-efficiency problems existing in big-data mining, specifically comprising the following steps:
    Step 1: when reading a large-scale dataset, solve the memory leak problem that frequently occurs due to the limited memory size of the operating machine; this is achieved mainly by improving the traditional way data is stored and read, using virtual memory and a block-processing mechanism, i.e., partitioning the big data into blocks, reading and loading one or two data blocks into memory at a time, and releasing memory and temporary space promptly after processing, so that the physical machine's memory is used in a time-shared manner;
    Step 2: fully exploit the hardware and the "divide and conquer" algorithmic idea to convert the big-data problem into easily solved small-data problems, thereby improving the operational efficiency on the large-scale dataset; using a shared-memory mechanism, a memory-mapped-file mechanism, and a single-machine multi-core parallel processing mechanism, multiple processes share one memory region through shared memory, which makes parallel algorithms easy to implement; files on disk processed through memory-mapped files need no explicit I/O operations, saving run time; the single-machine multi-core mechanism assigns the mining tasks of different data blocks to different processor cores for parallel processing;
    Step 3: build a mining model based on a clustering mining algorithm, complete in turn the clustering mining of the data blocks generated in Step 2, and finally merge the individual mining results to obtain the final result;
    At the bottom layer, the data blocks obtained in Step 1 are processed with C-language data types: the data are stored as matrices, the matrices are allocated in shared memory or memory-mapped files, and a pointer object is made to point to each stored data block, realizing a file-caching mechanism that improves the efficiency of subsequent work;
    One core is selected as the master and the other cores serve as workers; after the master receives a task, it decomposes the task into n-1 subtasks and distributes them to the workers, n being the number of processor cores; each worker then independently processes its own task based on the shared-memory/memory-mapped-file mechanism; after finishing, each worker returns its intermediate result to the master, which finally summarizes the results and outputs them;
    The clara() function in the cluster package of the R language is wrapped with Java, which builds the clustering mining model; the model is deployed on each processor core, the relevant parameters are set, and the clustering mining of each data block is completed according to the single-machine multi-core mechanism of Step 2; after the data-block mining tasks on all workers are completed, all the cluster center points found in the mining results of these data blocks are reintegrated and mined as a new clustering task, so that the mining result covers those data blocks; this is iterated until the mining result covers the initial large-scale dataset.
CN201510163967.6A 2015-04-08 2015-04-08 Clustering mining method for large-scale datasets on a single machine Active CN104731968B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510163967.6A CN104731968B (en) 2015-04-08 2015-04-08 Clustering mining method for large-scale datasets on a single machine

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510163967.6A CN104731968B (en) 2015-04-08 2015-04-08 Clustering mining method for large-scale datasets on a single machine

Publications (2)

Publication Number Publication Date
CN104731968A CN104731968A (en) 2015-06-24
CN104731968B true CN104731968B (en) 2018-06-19

Family

ID=53455855

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510163967.6A Active CN104731968B (en) Clustering mining method for large-scale datasets on a single machine

Country Status (1)

Country Link
CN (1) CN104731968B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105975574A (en) * 2016-05-04 2016-09-28 北京思特奇信息技术股份有限公司 R language-based large-data volume data screening method and system
CN110543940B (en) * 2019-08-29 2022-09-23 中国人民解放军国防科技大学 Neural circuit body data processing method, system and medium based on hierarchical storage
CN111212276A (en) * 2020-04-22 2020-05-29 杭州趣链科技有限公司 Monitoring method, system, equipment and storage medium based on camera module

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101359333A (en) * 2008-05-23 2009-02-04 中国科学院软件研究所 Parallel data processing method based on latent dirichlet allocation model
CN102193830A (en) * 2010-03-12 2011-09-21 复旦大学 Many-core environment-oriented division mapping/reduction parallel programming model
CN103020077A (en) * 2011-09-24 2013-04-03 国家电网公司 Method for managing memory of real-time database of power system

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070101332A1 (en) * 2005-10-28 2007-05-03 International Business Machines Corporation Method and apparatus for resource-based thread allocation in a multiprocessor computer system
CN102193831B (en) * 2010-03-12 2014-05-21 复旦大学 Method for establishing hierarchical mapping/reduction parallel programming model
CN102385588B (en) * 2010-08-31 2014-08-06 国际商业机器公司 Method and system for improving performance of data parallel insertion
CN102231121B (en) * 2011-07-25 2013-02-27 北方工业大学 Memory mapping-based rapid parallel extraction method for big data file

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101359333A (en) * 2008-05-23 2009-02-04 中国科学院软件研究所 Parallel data processing method based on latent dirichlet allocation model
CN102193830A (en) * 2010-03-12 2011-09-21 复旦大学 Many-core environment-oriented division mapping/reduction parallel programming model
CN103020077A (en) * 2011-09-24 2013-04-03 国家电网公司 Method for managing memory of real-time database of power system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Scalable multi-core simulation using parallel dynamic binary translation"; Oscar Almer et al.; Embedded Computer Systems (SAMOS), 2011 International Conference on; 2011-07-21; pp. 190-199 *

Also Published As

Publication number Publication date
CN104731968A (en) 2015-06-24

Similar Documents

Publication Publication Date Title
WO2018099299A1 (en) Graphic data processing method, device and system
US9830303B1 (en) Optimized matrix multiplication using vector multiplication of interleaved matrix values
US8676874B2 (en) Data structure for tiling and packetizing a sparse matrix
US8762655B2 (en) Optimizing output vector data generation using a formatted matrix data structure
WO2018160773A1 (en) Matrix transfer accelerator system and method
CN104731968B (en) Clustering mining method for large-scale datasets on a single machine
JP5778343B2 (en) Instruction culling in the graphics processing unit
CN107851004A (en) For the register spilling management of general register (GPR)
CN103425534A (en) Graphics processing unit sharing between many applications
CN103995827B (en) High-performance sorting method in the MapReduce computational framework
WO2022147518A1 (en) Neural network accelerator writable memory reconfigurability
US20240005446A1 (en) Methods, systems, and non-transitory storage media for graphics memory allocation
CN103713953A (en) Device and method for transferring data in memory
CN106708437A (en) VMware virtualization storage allocation method and system
Du et al. Feature-aware task scheduling on CPU-FPGA heterogeneous platforms
CN104376047A (en) Big table join method based on HBase
Siegel et al. Efficient sparse matrix-matrix multiplication on heterogeneous high performance systems
CN110502337A (en) For the optimization system and method for shuffling the stage in Hadoop MapReduce
CN110209631A (en) Big data processing method and its processing system
US20150248303A1 (en) Paravirtualized migration counter
US20080021938A1 (en) Technique for allocating objects in a managed run time environment
CN106991058B (en) Method and device for processing pre-fetched files
CN106844605A (en) Batch data logical process method and device
CN104699520B (en) Energy-saving method based on virtual machine migration scheduling
Liu et al. A-MapCG: an adaptive MapReduce framework for GPUs

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20190610

Address after: Room 805, Building B2, Huizhi Science Park, Hengtai Road, Nanjing Economic and Technological Development Zone, Jiangsu Province

Patentee after: JIANGSU EVASTELLAR INFORMATION TECHNOLOGY Co.,Ltd.

Address before: No. 8, West Road, Buddha city, Jiangning District, Nanjing, Jiangsu

Patentee before: HOHAI University