CN104731968B - A clustering mining method for large-scale datasets on a single machine - Google Patents
Abstract
The invention discloses a clustering mining method for large-scale datasets on a single machine, comprising three main steps: first, solving the memory overflow problem that commonly arises when reading a large-scale dataset; second, making full use of the hardware and of the "divide and conquer" algorithmic idea to convert the big-data problem into easily solved small-data problems; third, building an appropriate mining model based on a clustering algorithm, completing the clustering of each small dataset in turn, and finally merging the sub-results into the final result. By redesigning the storage mode, extending virtual memory, and improving operational efficiency, the disclosed method effectively overcomes the memory limitations common in big-data mining, and completes the clustering of GB-scale datasets on a single, independently operating physical machine without relying on a network cluster.
Description
Technical field
The present invention relates to a clustering mining method for large-scale datasets on a single machine, and belongs to the field of data mining technology.
Background technology
In recent years, with the rapid development and widespread application of computer and information technology, industry application systems have expanded quickly, and the data generated by industry applications has grown explosively. Industry and enterprise "big data", which easily reaches hundreds of GB or even tens to hundreds of TB, far exceeds the processing capacity of existing traditional computing techniques and information systems.
Big data poses many new challenges to traditional computing. Traditional serial algorithms that are effective on small datasets often cannot finish within an acceptable time when applied to big data; at the same time, features of big data such as heavier noise, sparse samples, and sample imbalance reduce the effectiveness of many existing machine learning algorithms. Alongside these enormous technical challenges, big data also brings great opportunities for technological innovation and business. Effective big-data processing techniques, methods, and means have therefore become an urgent practical need.
Most of today's big-data processing techniques rely on network clusters to perform distributed data mining, which demands substantial resources and conditions and is difficult to deploy. When only a single physical machine is available, completing the mining of a large-scale dataset requires a mature, self-contained scheme.
Summary of the invention
Object of the invention: in view of the problems in the prior art, the present invention provides a clustering mining method for large-scale datasets on a single machine.
Based on virtual memory and block-wise processing, the method effectively solves the severe memory shortage of a single physical machine; based on shared memory, memory-mapped files, and single-machine multi-core parallel processing, it improves the mining efficiency for large-scale datasets; and based on a clustering model built with the R language, it completes the mining of each sub-dataset and merges the sub-results, thereby accomplishing the clustering of a large-scale dataset on a single machine.
Technical solution: a clustering mining method for large-scale datasets on a single machine, which uses several mechanisms to solve the memory overflow and operational efficiency problems encountered in big-data clustering on a single machine, mainly comprising:
A. Solving the memory overflow problem, chiefly by improving the traditional way data is stored and read. Using virtual memory and block-wise processing, the big dataset is partitioned into blocks of a fixed size; one or two blocks at a time are read and loaded into memory, and memory and temporary space are released promptly after processing, so that physical memory is used in a time-shared manner.
B. Solving the operational efficiency problem. Making full use of the hardware and of the "divide and conquer" idea on a single machine is the key to improving big-data mining efficiency. Shared memory, memory-mapped files, and single-machine multi-core parallel processing are used: shared memory lets multiple processes share one memory region, which simplifies the implementation of parallel algorithms; memory-mapped files let files on disk be processed without explicit I/O operations, saving run time; the multi-core mechanism assigns the mining tasks for different data blocks to different processor cores for parallel processing.
C. Clustering mining. A clustering model is built with the R language; the model parameters are first customized, the clustering of all subtasks is completed, and the sub-results are then merged: the center point of every cluster obtained is reorganized as input to the mining model, and this iteration continues until the result meets the preset number of clusters.
By adopting the above technical solution, the present invention achieves the following beneficial effects: the clustering mining method for large-scale datasets on a single machine effectively solves, through shared memory, memory-mapped files, single-machine multi-core parallel processing, and related mechanisms, the memory overflow and operational efficiency problems that arise when clustering big data on a single machine. It makes full use of the limited memory and processor resources of a single physical machine and, where no network cluster is available or the conditions for building one are limited, provides a complete single-machine solution for clustering large-scale datasets.
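The memory-mapped file mechanism of the technical solution above can be sketched as follows. This is an illustrative Python sketch, not the patent's C-level implementation: the operating system maps a disk file into the process address space, so the program modifies bytes directly in the mapped region and never issues explicit read()/write() calls. The function name and file layout are assumptions for illustration only.

```python
import mmap
import os

def double_in_place(path):
    """Process a file on disk through a memory mapping: each byte is
    modified directly in the mapped region, with no explicit I/O calls."""
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), size) as mm:
            for i in range(size):
                mm[i] = (mm[i] * 2) % 256  # in-place update via the mapping
            mm.flush()  # the OS writes the dirty pages back to disk
```

The saving the text describes comes from letting the OS page the file in and out on demand instead of copying it through read/write buffers.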
Description of the drawings
Fig. 1 is the design flow chart of the embodiment of the present invention;
Fig. 2 is a schematic diagram of the implementation steps of the embodiment of the present invention;
Fig. 3 shows the design principle of the virtual memory and block-wise processing mechanism in the embodiment of the present invention;
Fig. 4 shows the decomposition process of single-machine multi-core processing in the embodiment of the present invention;
Fig. 5 shows the aggregation process of single-machine multi-core processing in the embodiment of the present invention.
Specific embodiment
The present invention is further elucidated below with reference to specific embodiments. It should be understood that these embodiments are intended only to illustrate the invention and not to limit its scope; after reading the present invention, modifications of various equivalent forms made by those skilled in the art all fall within the scope defined by the appended claims of the application.
The method provided by the present invention divides the clustering of a large-scale dataset on a single machine into three broad steps, as shown in Fig. 1: first build the clustering model, then solve the memory overflow and operational efficiency problems respectively. As shown in Fig. 2, the implementation steps of the method are: solving the memory overflow problem, i.e., reading the big data into memory block by block based on the virtual memory and block-wise processing mechanism; solving the operational efficiency problem, i.e., greatly improving efficiency through shared memory, memory-mapped files, and single-machine multi-core parallel processing, by reducing process-switching time, saving file I/O time, and making full use of the processor's multiple cores for parallel computation; and building the clustering model and merging the sub-results, i.e., building the model with the R language, completing the subtask mining, and merging to obtain the final result.
The clustering mining method for large-scale datasets on a single machine in the embodiment of the present invention comprises the following steps:
Step 1: solve the memory overflow problem. First, a storage region is dynamically carved out of the physical machine's hard disk according to the scale of the dataset to serve as virtual memory, storing the temporary data of the big dataset in the form of binary files. The big dataset is then partitioned into blocks of 10-20 MB, stored in these temporary files, and a mapping index is established for each data block. During processing, one or two data blocks at a time are read from external storage into memory; when a data block is modified, its data are kept in the form of an R data frame, so that they can serve as input to the mining model in step 3. After processing completes, the memory and virtual memory space are released promptly. The virtual memory and block-wise processing mechanism is shown in Fig. 3.
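The block-wise reading scheme of step 1 can be sketched as follows. This is an illustrative Python sketch rather than the patent's R/C implementation; the block size constant and function names are assumptions, chosen to match the 10-20 MB range stated above.

```python
import os

BLOCK_SIZE = 16 * 1024 * 1024  # 16 MB, within the 10-20 MB range from the text

def build_block_index(path, block_size=BLOCK_SIZE):
    """Partition a big file into fixed-size blocks and build a mapping index
    of (offset, length) entries, without loading the file into memory."""
    total = os.path.getsize(path)
    return [(off, min(block_size, total - off))
            for off in range(0, total, block_size)]

def read_block(path, index, i):
    """Load one block into memory; the caller drops the reference after
    processing, so physical memory is used in a time-shared manner."""
    off, length = index[i]
    with open(path, "rb") as f:
        f.seek(off)
        return f.read(length)
```

Only the index (a few bytes per block) stays resident; each block is loaded, processed, and released in turn.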
Step 2: solve the operational efficiency problem. First, the shared memory mechanism is implemented: multiple processes share one block of physical memory and thus the data block stored in it, so different processes can communicate through shared memory and a data block can be shared by multiple processes, which paves the way for the multi-core parallel mechanism below. Next, the memory-mapped file mechanism is implemented: a region of the address space is reserved, physical storage (a file already present on disk) is committed to this region, and a mapping is established between the file and the corresponding memory region, so that the file on disk can be processed as if it were in memory, without further I/O operations on the file. The data blocks obtained in step 1 are then handled at the bottom layer using C data types: their data are stored as matrices, the matrices are allocated to shared memory or memory-mapped files, and a pointer object is directed at each stored data block, implementing a file cache mechanism that improves the efficiency of subsequent work. Finally, the single-machine multi-core parallel mechanism is implemented: according to the specific processor at hand, one core is selected as the master and the other cores as workers. When the master receives a task, it decomposes it into n-1 subtasks (n being the number of processor cores) and distributes them to the workers; the decomposition process is shown in Fig. 4. Each worker then independently processes its own small task (using the clustering model of step 3) on top of the shared memory / memory-mapped file mechanism, and returns its intermediate result to the master when finished; the master finally aggregates the results and outputs them, as shown in Fig. 5.
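The master/worker decomposition over a shared memory region can be sketched in Python (requires Python 3.8+ for `multiprocessing.shared_memory`). This is an illustrative stand-in, not the patent's C/R implementation: summing the bytes of a block stands in for the per-block mining task, and all names are assumptions.

```python
from multiprocessing import Pool, cpu_count, shared_memory

def worker(args):
    """A worker attaches to the shared region by name (no data copy) and
    processes its own block, returning an intermediate result."""
    name, off, length = args
    shm = shared_memory.SharedMemory(name=name)
    try:
        return sum(shm.buf[off:off + length])  # stand-in for the mining task
    finally:
        shm.close()

def master(data, block_size):
    """The master places the data in shared memory, decomposes the task into
    per-block subtasks for the workers, and aggregates their results."""
    shm = shared_memory.SharedMemory(create=True, size=len(data))
    shm.buf[:len(data)] = data
    tasks = [(shm.name, off, min(block_size, len(data) - off))
             for off in range(0, len(data), block_size)]
    n = max(cpu_count() - 1, 1)  # one core acts as master, the rest as workers
    try:
        with Pool(n) as pool:
            return sum(pool.map(worker, tasks))  # aggregation by the master
    finally:
        shm.close()
        shm.unlink()
```

Because every worker attaches to the same shared region, no data block is ever copied between processes, which is the efficiency point the step above makes.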
Step 3: build the mining model and merge the sub-results. The clara() function in the R cluster package is wrapped with Java, which builds the clustering model; the model is deployed on each processor core, the relevant parameters are set (the parameters are customizable, such as the number of clusters), and the clustering of each data block is completed using the single-machine multi-core mechanism of step 2. After the mining tasks for the data blocks on all workers are complete, all the cluster center points obtained in the sub-results of these blocks are reintegrated and mined as one new clustering task, whose result then covers those blocks. Iteration continues in this manner until the result covers the entire initial large-scale dataset.
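The divide-cluster-merge iteration above can be sketched as follows. This is a minimal illustrative sketch: a toy 1-D k-means stands in for the clara() model (the patent uses R's cluster package wrapped in Java), and all function names are assumptions.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal 1-D k-means; a stand-in for the clara() clustering model."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign each point to its nearest center
            clusters[min(range(k), key=lambda i: abs(p - centers[i]))].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

def cluster_blocks(blocks, k):
    """Cluster each data block separately, then reintegrate all per-block
    cluster centers and cluster them again, as described in step 3."""
    centers = [c for b in blocks for c in kmeans(b, k)]
    while len(centers) > k:  # iterate until only k final centers remain
        centers = kmeans(centers, k)
    return sorted(centers)
```

Each block's result is reduced to its cluster centers, so the merge step works on a dataset whose size is proportional to the number of blocks, not the number of points.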
Claims (1)
- 1. A clustering mining method for large-scale datasets on a single machine, characterized in that it provides a scheme that effectively solves the memory overflow and operational efficiency problems of big-data mining, and specifically comprises the following steps: Step 1: solve the memory overflow problem that frequently occurs when reading a large-scale dataset because of the limited memory of the operating machine. This is done mainly by improving the traditional way data is stored and read: using virtual memory and block-wise processing, the big dataset is partitioned into blocks; one or two blocks at a time are read and loaded into memory, and memory and temporary space are released promptly after processing, so that physical memory is used in a time-shared manner. Step 2: make full use of the hardware and of the "divide and conquer" algorithmic idea to convert the big-data problem into easily solved small-data problems, thereby improving the efficiency of processing the large-scale dataset. Shared memory, memory-mapped files, and single-machine multi-core parallel processing are used: multiple processes share one memory region through shared memory, which simplifies the implementation of parallel algorithms; files on disk handled through memory-mapped files need no explicit I/O operations, saving run time; the multi-core mechanism assigns the mining tasks for different data blocks to different processor cores for parallel processing. Step 3: build a mining model based on a clustering algorithm, complete the clustering of the data blocks produced in step 2 in turn, and finally merge the sub-results into the final result. The data blocks obtained in step 1 are handled at the bottom layer using C data types: their data are stored as matrices, the matrices are allocated to shared memory or memory-mapped files, and
a pointer object is then directed at each stored data block, implementing a file cache mechanism that improves the efficiency of subsequent work. One core is selected as the master and the other cores as workers; when the master receives a task, it decomposes it into n-1 subtasks and distributes them to the workers, n being the number of processor cores; each worker then independently processes its own task on top of the shared memory / memory-mapped file mechanism, returns its intermediate result to the master when finished, and the master finally aggregates the results and outputs them. The clara() function in the R cluster package is wrapped with Java, which builds the clustering model; the model is deployed on each processor core and, after the relevant parameters are set, the clustering of each data block is completed using the single-machine multi-core mechanism of step 2. After the mining tasks for the data blocks on all workers are complete, all the cluster center points obtained in the sub-results of these blocks are reintegrated and mined as one new clustering task, whose result then covers those blocks; iteration continues in this manner until the result covers the entire initial large-scale dataset.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510163967.6A CN104731968B (en) | 2015-04-08 | 2015-04-08 | A clustering mining method for large-scale datasets on a single machine |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104731968A CN104731968A (en) | 2015-06-24 |
CN104731968B true CN104731968B (en) | 2018-06-19 |
Family
ID=53455855
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510163967.6A Active CN104731968B (en) | 2015-04-08 | 2015-04-08 | A clustering mining method for large-scale datasets on a single machine |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104731968B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105975574A (en) * | 2016-05-04 | 2016-09-28 | 北京思特奇信息技术股份有限公司 | R language-based large-data volume data screening method and system |
CN110543940B (en) * | 2019-08-29 | 2022-09-23 | 中国人民解放军国防科技大学 | Neural circuit body data processing method, system and medium based on hierarchical storage |
CN111212276A (en) * | 2020-04-22 | 2020-05-29 | 杭州趣链科技有限公司 | Monitoring method, system, equipment and storage medium based on camera module |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101359333A (en) * | 2008-05-23 | 2009-02-04 | 中国科学院软件研究所 | Parallel data processing method based on latent dirichlet allocation model |
CN102193830A (en) * | 2010-03-12 | 2011-09-21 | 复旦大学 | Many-core environment-oriented division mapping/reduction parallel programming model |
CN103020077A (en) * | 2011-09-24 | 2013-04-03 | 国家电网公司 | Method for managing memory of real-time database of power system |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070101332A1 (en) * | 2005-10-28 | 2007-05-03 | International Business Machines Corporation | Method and apparatus for resource-based thread allocation in a multiprocessor computer system |
CN102193831B (en) * | 2010-03-12 | 2014-05-21 | 复旦大学 | Method for establishing hierarchical mapping/reduction parallel programming model |
CN102385588B (en) * | 2010-08-31 | 2014-08-06 | 国际商业机器公司 | Method and system for improving performance of data parallel insertion |
CN102231121B (en) * | 2011-07-25 | 2013-02-27 | 北方工业大学 | Memory mapping-based rapid parallel extraction method for big data file |
- 2015-04-08: CN application CN201510163967.6A filed; patent CN104731968B (en), status Active
Non-Patent Citations (1)
Title |
---|
"Scalable multi-core simulation using parallel dynamic binary translation"; Oscar Almer et al.; Embedded Computer Systems (SAMOS), 2011 International Conference on; 2011-07-21; pp. 190-199 * |
Also Published As
Publication number | Publication date |
---|---|
CN104731968A (en) | 2015-06-24 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | ||
Effective date of registration: 2019-06-10
Address after: Room 805, Building B2, Huizhi Science Park, Hengtai Road, Nanjing Economic and Technological Development Zone, Jiangsu Province
Patentee after: JIANGSU EVASTELLAR INFORMATION TECHNOLOGY Co., Ltd.
Address before: No. 8, Focheng West Road, Jiangning District, Nanjing, Jiangsu
Patentee before: Hohai University