CN104731968B - Clustering mining method for large-scale datasets on a single machine - Google Patents


Info

Publication number
CN104731968B
CN104731968B (application CN201510163967.6A)
Authority
CN
China
Prior art keywords
result
memory
data
cluster
unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510163967.6A
Other languages
Chinese (zh)
Other versions
CN104731968A (en)
Inventor
范仕良
张雪洁
骆融臻
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Evastellar Information Technology Co ltd
Original Assignee
Hohai University HHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hohai University HHU filed Critical Hohai University HHU
Priority to CN201510163967.6A priority Critical patent/CN104731968B/en
Publication of CN104731968A publication Critical patent/CN104731968A/en
Application granted granted Critical
Publication of CN104731968B publication Critical patent/CN104731968B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Abstract

The invention discloses a clustering mining method for large-scale datasets on a single machine, comprising three main steps: first, resolving the memory leak problems that commonly arise when reading a large-scale dataset; second, fully exploiting the hardware and the "divide and conquer" algorithmic idea to convert the big-data problem into easily solved small-data problems; third, building an appropriate mining model based on a clustering mining algorithm, completing the clustering mining of the small data blocks in turn, and finally merging the individual mining results into the final result. By redesigning the storage mode, extending virtual memory, and improving operational efficiency, the disclosed method effectively overcomes the memory limitations common in big-data mining, so that the clustering mining of a GB-scale dataset can be completed on a single, normally operating physical machine without relying on a network cluster.

Description

Clustering mining method for large-scale datasets on a single machine
Technical field
The present invention relates to a clustering mining method for large-scale datasets on a single machine, and belongs to the technical field of data mining.
Background technology
In recent years, with the rapid development and widespread application of computer and information technology, the scale of industry application systems has expanded rapidly, and the data generated by these applications has grown explosively. Industry and enterprise big data, which easily reaches hundreds of GB or even tens to hundreds of TB, far exceeds the processing capacity of existing traditional computing techniques and information systems.
Big data poses many new challenges to traditional computing techniques. Many traditional serial algorithms that are effective on small datasets cannot finish computing within an acceptable time when applied to big data; meanwhile, characteristics of big data such as heavier noise, sparse samples, and sample imbalance reduce the effectiveness of many existing machine learning algorithms. At the same time, alongside these enormous technological challenges, big data brings enormous opportunities for technological innovation and business. Therefore, seeking effective big-data processing techniques, approaches, and means has become an urgent demand of the real world.
Most of today's big-data processing techniques rely on network clusters to realize distributed data mining, which demands considerable resources and conditions and is difficult to implement. When only a single physical machine is available, a reasonably mature scheme is needed to complete the mining of a large-scale dataset.
Summary of the invention
Object of the invention: to address the problems existing in the prior art, the present invention provides a clustering mining method for large-scale datasets on a single machine.
The method provided by the invention effectively resolves the severe memory shortage of a single physical machine through virtual-memory and block-processing mechanisms; it improves the mining efficiency on large-scale datasets through a shared-memory mechanism, a memory-mapped-file mechanism, and a single-machine multi-core parallel processing mechanism; and it builds a clustering mining model in the R language, completes the mining of the sub-datasets, and merges their results, thereby realizing the clustering mining of large-scale datasets on a single machine.
Technical solution: a clustering mining method for large-scale datasets on a single machine, which uses several mechanisms to solve the memory leak and operational-efficiency problems encountered in big-data clustering mining on a single machine, mainly comprising:
A. Solving the memory leak problem: this is achieved mainly by improving the traditional way data is stored and read. Using virtual memory and a block-processing mechanism, the big data is partitioned into blocks of a fixed size; each time, only one or two blocks are read and loaded into memory, and memory and temporary space are released promptly after processing, so that the physical machine's memory is used in a time-shared manner.
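A minimal Python sketch of the block-wise reading scheme in A (the patent's own implementation uses R with C underneath; the file, handler, and block size here are illustrative stand-ins):

```python
import os
import tempfile

BLOCK_SIZE = 16 * 1024 * 1024  # a fixed block size in the 10-20 MB range

def process_in_blocks(path, handle_block):
    """Read a large file one block at a time, so at most one block
    resides in memory; each block is released as soon as it is done."""
    results = []
    with open(path, "rb") as f:
        while True:
            block = f.read(BLOCK_SIZE)      # load one block into memory
            if not block:
                break
            results.append(handle_block(block))
            del block                       # release promptly, as in step A
    return results

# Tiny demonstration: summarize a 100-byte temporary file block by block.
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(b"x" * 100)
tmp.close()
sizes = process_in_blocks(tmp.name, len)
os.unlink(tmp.name)
```

Because only the current block is referenced, peak memory stays near one block size regardless of the total file size.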
B. Solving the operational-efficiency problem: fully exploiting the hardware and the "divide and conquer" idea on a single machine is the key to improving the efficiency of big-data mining. A shared-memory mechanism, a memory-mapped-file mechanism, and a single-machine multi-core parallel processing mechanism are used: multiple processes share one memory region through shared memory, which makes parallel algorithms easy to implement; memory-mapped files allow files on disk to be processed without performing explicit I/O operations, saving run time; and the single-machine multi-core mechanism assigns the mining tasks of different data blocks to different processor cores for parallel processing.
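The memory-mapped-file idea in B can be illustrated with Python's standard `mmap` module (an illustrative stand-in for the C-level mechanism the patent describes; the file contents are made up):

```python
import mmap
import os
import tempfile

# A small file on disk stands in for one data block.
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(b"0123456789")
tmp.close()

# Map the file into the process address space: the bytes are read and
# modified through the mapping, without explicit read()/write() calls.
with open(tmp.name, "r+b") as f:
    with mmap.mmap(f.fileno(), 0) as mm:
        first = bytes(mm[0:4])   # read through the mapping
        mm[0:4] = b"abcd"        # write in place, as if it were memory

with open(tmp.name, "rb") as f:
    data = f.read()              # the change landed in the file itself
os.unlink(tmp.name)
```

The operating system pages the file in and out on demand, which is what saves the explicit I/O the patent refers to.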
C. Clustering mining: a clustering mining model is built in the R language. Model parameters are first customized, and after the clustering mining of all subtasks is completed, their results are merged: the center points of the clusters obtained so far are reorganized as the input of the mining model, and this iteration continues until the result meets the preset number of clusters.
Beneficial effects: with the above technical solution, the clustering mining method for large-scale datasets on a single machine provided by the invention effectively solves, by means of the shared-memory mechanism, the memory-mapped-file mechanism, and the single-machine multi-core parallel processing mechanism, the memory leak and operational-efficiency problems of big-data clustering mining on a single machine. It makes full use of the limited memory space and processor resources of a single physical machine and, where no network cluster is available or the conditions for building one are limited, provides a usable single-machine solution for the clustering mining of large-scale datasets.
Description of the drawings
Fig. 1 is the design flow chart of the embodiment of the present invention;
Fig. 2 is a schematic diagram of the implementation steps of the embodiment of the present invention;
Fig. 3 is a schematic diagram of the virtual-memory and block-processing mechanism in the embodiment of the present invention;
Fig. 4 shows the task-decomposition process of single-machine multi-core processing in the embodiment of the present invention;
Fig. 5 shows the result-aggregation process of single-machine multi-core processing in the embodiment of the present invention.
Specific embodiment
The present invention is further illustrated below with reference to specific embodiments. It should be understood that these embodiments are intended only to illustrate the invention and not to limit its scope; after reading the present invention, modifications of its various equivalent forms by those skilled in the art fall within the scope defined by the appended claims of this application.
The method provided by the present invention roughly divides the clustering mining of a large-scale dataset on a single machine into three major steps, as shown in Fig. 1: first building the clustering mining model, and then solving the memory leak problem and the operational-efficiency problem respectively. As shown in Fig. 2, the implementation steps of the method are: solving the memory leak problem, i.e., reading the big data into memory block by block based on the virtual-memory and block-processing mechanism; solving the operational-efficiency problem, i.e., greatly improving efficiency based on the shared-memory mechanism, the memory-mapped-file mechanism, and the single-machine multi-core parallel processing mechanism, by reducing process-switching time, saving the time of file I/O operations, and making full use of the processor's multiple cores for parallel computation; and building the clustering mining model and merging the mining results, i.e., building the model in the R language and, after the subtasks' mining work is completed, merging their results to obtain the final result.
In the embodiment of the present invention, the clustering mining method for large-scale datasets on a single machine comprises the following steps:
Step 1: solve the memory leak problem. First, according to the scale of the dataset, a storage region is dynamically carved out of the physical machine's hard disk as virtual memory, used to store the temporary data of the big data in the form of binary files. The big data is then partitioned into blocks of 10-20 MB, stored in the above temporary files, and a mapping index is established for each data block. During processing, one or two data blocks at a time are read from external storage into memory; after a data block is modified, its data are kept in the form of an R data frame so that they can serve as the input data of the mining model in Step 3. After processing is completed, the memory space and the virtual-memory space are released promptly. The virtual-memory and block-processing mechanism is shown in Fig. 3.
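The spill-to-disk storage with a per-block mapping index described in Step 1 can be sketched in Python as follows (block size shrunk from the patent's 10-20 MB for illustration; all names are hypothetical):

```python
import os
import tempfile

BLOCK = 4  # bytes per block; the patent partitions at 10-20 MB

def write_blocks(data, path):
    """Spill data to a binary file in fixed-size blocks, returning a
    mapping index of (offset, length) entries, one per block."""
    index = []
    with open(path, "wb") as f:
        for off in range(0, len(data), BLOCK):
            chunk = data[off:off + BLOCK]
            index.append((f.tell(), len(chunk)))
            f.write(chunk)
    return index

def read_block(path, index, i):
    """Load exactly one block back into memory via its index entry,
    mirroring the one-or-two-blocks-at-a-time policy of Step 1."""
    off, length = index[i]
    with open(path, "rb") as f:
        f.seek(off)
        return f.read(length)

path = tempfile.NamedTemporaryFile(delete=False).name
idx = write_blocks(b"abcdefghij", path)
second = read_block(path, idx, 1)
os.unlink(path)
```

The index lets any block be fetched directly by seek, so the whole dataset never has to be resident at once.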
Step 2: solve the operational-efficiency problem. First, the shared-memory mechanism is realized: multiple processes are allowed to share one block of physical memory and the data block within it, so that different processes can communicate through the shared memory and a data block can be shared by multiple processes, which makes it convenient for the single-machine multi-core mechanism below to realize parallel algorithms. Next, the memory-mapped-file mechanism is realized: a region of the address space is reserved, physical storage (a file already existing on disk) is committed to this region, and a mapping is established between the file and the corresponding memory region, so that the file on disk is processed in place and no further I/O operations need to be performed on it. Then, at the bottom layer, the data blocks obtained in Step 1 are processed with C-language data types: the data are stored as matrices, the matrices are allocated in shared memory or memory-mapped files, and a pointer object is made to point to each stored data block, realizing a file-caching mechanism that improves the efficiency of subsequent work. Finally, the single-machine multi-core parallel processing mechanism is realized: according to the specific configuration of the processor, one core is selected as the master and the other cores serve as workers. After the master receives a task, it decomposes the task into n-1 subtasks (n being the number of processor cores) and distributes them to the workers; the decomposition process is shown in Fig. 4. Each worker then independently processes its own small task (using the clustering mining model of Step 3) based on the shared-memory/memory-mapped-file mechanism; after finishing, each worker returns its intermediate result to the master, which finally summarizes the results and outputs them. The aggregation process is shown in Fig. 5.
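The master/worker decomposition and aggregation of Step 2 (Figs. 4 and 5) can be sketched with Python's `multiprocessing` pool; `mine_block` is a hypothetical stand-in for the per-block clustering task:

```python
import multiprocessing as mp

def mine_block(block):
    """Hypothetical per-block task; a real worker would run the
    clustering model here and return an intermediate result."""
    return sum(block) / len(block)

def master(blocks):
    """Decompose the job into one subtask per data block, run them on
    n-1 worker processes (one core plays master), and gather the
    intermediate results for the master to summarize."""
    n = mp.cpu_count()
    with mp.Pool(processes=max(1, n - 1)) as pool:
        partials = pool.map(mine_block, blocks)   # Fig. 4: scatter
    return partials                               # Fig. 5: gather

if __name__ == "__main__":
    print(master([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
```

Unlike the patent's shared-memory scheme, `Pool.map` copies each block to its worker; it is used here only to show the scatter/gather shape of the computation.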
Step 3: build the mining model and merge the mining results. The clara() function in the cluster package of the R language is wrapped with Java, which builds the clustering mining model; the model is deployed on each processor core, the relevant parameters are set (the parameters are customizable, such as the number of clusters), and the clustering mining of each data block is completed according to the single-machine multi-core mechanism of Step 2. After the data-block mining tasks on all workers are completed, all the cluster center points found in the mining results of these data blocks are reintegrated and mined as a new clustering task, so that the mining result now covers those data blocks. This is iterated until the mining result covers the initial large-scale dataset.
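The iterative merge of Step 3 can be sketched in plain Python; the tiny 1-D k-means below is only a stand-in for R's clara() (the deterministic initialization and the data are illustrative, not the patent's):

```python
def kmeans(points, k, iters=20):
    """Minimal 1-D k-means standing in for clara(); returns the
    sorted cluster center points."""
    pts = sorted(points)
    # deterministic init: spread initial centers over the sorted data
    centers = [pts[(len(pts) * i) // k] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in pts:
            i = min(range(k), key=lambda j: abs(p - centers[j]))
            groups[i].append(p)
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return sorted(centers)

def cluster_large_dataset(blocks, k):
    """Step 3: cluster each block, then repeatedly re-cluster the
    collected center points until only k centers remain."""
    centers = []
    for block in blocks:
        centers.extend(kmeans(block, k))     # per-block mining result
    while len(centers) > k:
        centers = kmeans(centers, k)         # merge by re-clustering
    return centers

res = cluster_large_dataset([[1.0, 1.1, 9.0, 9.1],
                             [0.9, 1.2, 8.9, 9.2]], k=2)
```

Each block contributes only its k centers to the merge step, so the re-clustering input stays small even when the original dataset is GB-scale.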

Claims (1)

  1. A clustering mining method for large-scale datasets on a single machine, characterized in that it provides a scheme that effectively solves the memory leak and operational-efficiency problems existing in big-data mining, specifically comprising the following steps:
    Step 1: when reading a large-scale dataset, solve the memory leak problem that frequently occurs due to the limited memory size of the operating machine; this is achieved mainly by improving the traditional way data is stored and read, using virtual memory and a block-processing mechanism, i.e., partitioning the big data into blocks, reading and loading one or two data blocks into memory at a time, and releasing memory and temporary space promptly after processing, so that the physical machine's memory is used in a time-shared manner;
    Step 2: fully exploit the hardware and the "divide and conquer" algorithmic idea to convert the big-data problem into easily solved small-data problems, thereby improving the operational efficiency on the large-scale dataset; using a shared-memory mechanism, a memory-mapped-file mechanism, and a single-machine multi-core parallel processing mechanism, multiple processes share one memory region through shared memory, which makes parallel algorithms easy to implement; files on disk processed through memory-mapped files need no explicit I/O operations, saving run time; the single-machine multi-core mechanism assigns the mining tasks of different data blocks to different processor cores for parallel processing;
    Step 3: build a mining model based on a clustering mining algorithm, complete in turn the clustering mining of the data blocks generated in Step 2, and finally merge the individual mining results to obtain the final result;
    At the bottom layer, the data blocks obtained in Step 1 are processed with C-language data types: the data are stored as matrices, the matrices are allocated in shared memory or memory-mapped files, and a pointer object is made to point to each stored data block, realizing a file-caching mechanism that improves the efficiency of subsequent work;
    One core is selected as the master and the other cores serve as workers; after the master receives a task, it decomposes the task into n-1 subtasks and distributes them to the workers, n being the number of processor cores; each worker then independently processes its own task based on the shared-memory/memory-mapped-file mechanism; after finishing, each worker returns its intermediate result to the master, which finally summarizes the results and outputs them;
    The clara() function in the cluster package of the R language is wrapped with Java, which builds the clustering mining model; the model is deployed on each processor core, the relevant parameters are set, and the clustering mining of each data block is completed according to the single-machine multi-core mechanism of Step 2; after the data-block mining tasks on all workers are completed, all the cluster center points found in the mining results of these data blocks are reintegrated and mined as a new clustering task, so that the mining result covers those data blocks; this is iterated until the mining result covers the initial large-scale dataset.
CN201510163967.6A 2015-04-08 2015-04-08 Clustering mining method for large-scale datasets on a single machine Active CN104731968B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510163967.6A CN104731968B (en) 2015-04-08 2015-04-08 Clustering mining method for large-scale datasets on a single machine

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510163967.6A CN104731968B (en) 2015-04-08 2015-04-08 Clustering mining method for large-scale datasets on a single machine

Publications (2)

Publication Number Publication Date
CN104731968A CN104731968A (en) 2015-06-24
CN104731968B true CN104731968B (en) 2018-06-19

Family

ID=53455855

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510163967.6A Active CN104731968B (en) Clustering mining method for large-scale datasets on a single machine

Country Status (1)

Country Link
CN (1) CN104731968B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105975574A (en) * 2016-05-04 2016-09-28 北京思特奇信息技术股份有限公司 R language-based large-data volume data screening method and system
CN110543940B (en) * 2019-08-29 2022-09-23 中国人民解放军国防科技大学 Neural circuit body data processing method, system and medium based on hierarchical storage
CN111212276A (en) * 2020-04-22 2020-05-29 杭州趣链科技有限公司 Monitoring method, system, equipment and storage medium based on camera module

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101359333A (en) * 2008-05-23 2009-02-04 中国科学院软件研究所 Parallel data processing method based on latent dirichlet allocation model
CN102193830A (en) * 2010-03-12 2011-09-21 复旦大学 Many-core environment-oriented division mapping/reduction parallel programming model
CN103020077A (en) * 2011-09-24 2013-04-03 国家电网公司 Method for managing memory of real-time database of power system

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070101332A1 (en) * 2005-10-28 2007-05-03 International Business Machines Corporation Method and apparatus for resource-based thread allocation in a multiprocessor computer system
CN102193831B (en) * 2010-03-12 2014-05-21 复旦大学 Method for establishing hierarchical mapping/reduction parallel programming model
CN102385588B (en) * 2010-08-31 2014-08-06 国际商业机器公司 Method and system for improving performance of data parallel insertion
CN102231121B (en) * 2011-07-25 2013-02-27 北方工业大学 Memory mapping-based rapid parallel extraction method for big data file

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101359333A (en) * 2008-05-23 2009-02-04 中国科学院软件研究所 Parallel data processing method based on latent dirichlet allocation model
CN102193830A (en) * 2010-03-12 2011-09-21 复旦大学 Many-core environment-oriented division mapping/reduction parallel programming model
CN103020077A (en) * 2011-09-24 2013-04-03 国家电网公司 Method for managing memory of real-time database of power system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Scalable multi-core simulation using parallel dynamic binary translation"; Oscar Almer et al.; Embedded Computer Systems (SAMOS), 2011 International Conference on; 2011-07-21; pp. 190-199 *

Also Published As

Publication number Publication date
CN104731968A (en) 2015-06-24

Similar Documents

Publication Publication Date Title
WO2018099299A1 (en) Graphic data processing method, device and system
US9830303B1 (en) Optimized matrix multiplication using vector multiplication of interleaved matrix values
US8676874B2 (en) Data structure for tiling and packetizing a sparse matrix
US8762655B2 (en) Optimizing output vector data generation using a formatted matrix data structure
WO2018160773A1 (en) Matrix transfer accelerator system and method
CN104731968B (en) Clustering mining method for large-scale datasets on a single machine
JP5778343B2 (en) Instruction culling in the graphics processing unit
CN107851004A (en) For the register spilling management of general register (GPR)
CN103425534A (en) Graphics processing unit sharing between many applications
CN103995827B (en) High-performance sorting method in the MapReduce computational framework
WO2022147518A1 (en) Neural network accelerator writable memory reconfigurability
US20240005446A1 (en) Methods, systems, and non-transitory storage media for graphics memory allocation
CN103713953A (en) Device and method for transferring data in memory
CN106708437A (en) VMware virtualization storage allocation method and system
Du et al. Feature-aware task scheduling on CPU-FPGA heterogeneous platforms
CN104376047A (en) Big table join method based on HBase
Siegel et al. Efficient sparse matrix-matrix multiplication on heterogeneous high performance systems
CN110502337A (en) For the optimization system and method for shuffling the stage in Hadoop MapReduce
CN110209631A (en) Big data processing method and its processing system
US20150248303A1 (en) Paravirtualized migration counter
US20080021938A1 (en) Technique for allocating objects in a managed run time environment
CN106991058B (en) Method and device for processing pre-fetched files
CN106844605A (en) Batch data logical process method and device
CN104699520B (en) Energy-saving method based on virtual machine migration scheduling
Liu et al. A-MapCG: an adaptive MapReduce framework for GPUs

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20190610

Address after: Room 805, Building B2, Huizhi Science Park, Hengtai Road, Nanjing Economic and Technological Development Zone, Jiangsu Province

Patentee after: JIANGSU EVASTELLAR INFORMATION TECHNOLOGY Co.,Ltd.

Address before: No. 8, West Road, Buddha city, Jiangning District, Nanjing, Jiangsu

Patentee before: HOHAI University