CN110019414A

CN110019414A - Big data mining system based on distributed parallel computing

Info

Publication number: CN110019414A
Application number: CN201711491787.6A
Authority: CN
Inventors: 周峻松; 徐继峰; 祁建明; 陈墩金
Original assignee: Guangzhou Ming - Collar Gene Technology Co Ltd
Current assignee: Guangzhou Ming - Collar Gene Technology Co Ltd
Priority date: 2017-12-30
Filing date: 2017-12-30
Publication date: 2019-07-16

Abstract

The invention discloses a big data mining system based on distributed parallel computing. The system comprises a client module, a task scheduling control module, an algorithm module, and a data set module. The client module provides the user with a system interaction interface and an access interface. The task scheduling control module is the core of task scheduling for the whole system and is responsible for managing and calling each functional component and coordinating their operation. The algorithm module is mainly responsible for managing the parallel algorithm library. The data set module preprocesses the raw data and provides an efficient data access interface, so that the processed data can be submitted efficiently to the task scheduling control module and the algorithm module for use.

Description

Big data mining system based on distributed parallel computing

Technical field

The invention belongs to the technical field of big data mining and relates to a big data mining system based on distributed parallel computing.

Background art

In recent years, with the sharp increase in data volume, the contradiction between data complexity and limited system computing capability in data mining has become increasingly prominent. Traditional single-machine systems exhibit shortcomings such as slow speed, low efficiency, and high energy consumption during computation, so parallel computing is needed to achieve large-scale computation.

A cloud computing platform is a computing platform featuring dynamic resource allocation and scheduling, virtualization, and high availability. It can meet the computing performance requirements of data mining and provides strong support for parallel data mining.

Summary of the invention

The object of the present invention is to provide a big data mining system based on distributed parallel computing. To address the shortcomings of conventional single-machine systems during computation, such as slow speed, low efficiency, and high energy consumption, the system applies the idea of database sharding: the sharded data is stored on individual sub-nodes, and a unified central allocation unit is responsible for aggregating and maintaining the information of each sub-node. This effectively solves the problem of processing massive data, improves resource utilization, and achieves on-demand provision to users.

To solve the above technical problems, the present invention adopts the following technical solution: a big data mining system based on distributed parallel computing, comprising a client module, a task scheduling control module, an algorithm module, and a data set module. The client module provides the user with a system interaction interface and an access interface. The task scheduling control module is the core of task scheduling for the whole system and is responsible for managing and calling each functional component and coordinating their operation. The algorithm module is mainly responsible for managing the parallel algorithm library. The data set module preprocesses the raw data and provides an efficient data access interface, so that the processed data can be submitted efficiently to the task scheduling control module and the algorithm module for use.

Further, the client module includes user terminals such as computers and mobile phones.

Further, the task scheduling control module consists of a task scheduling control component and a knowledge base. The task scheduling control component receives client requests issued through the user interface and the open interface, and coordinates the other functional components to implement the system functions. The knowledge base, in the knowledge engineering sense, is a structured, easy-to-operate, easy-to-use, comprehensive, and organized cluster of knowledge: a set of interrelated knowledge pieces that are stored, organized, managed, and used in computer storage by means of a knowledge representation scheme. Data mining results that satisfy the user can be stored in the knowledge base as useful knowledge, to guide the user in evaluating mining results.

Further, the parallel data mining algorithm library in the algorithm module is an important functional component and the main support of the algorithm module; it is managed by the task scheduling control component.

Further, the data set module mainly includes a data source composed of a data warehouse and data files, together with functional components such as data preprocessing and data access management.

Compared with the prior art, the present invention has the following beneficial effects:

To address the shortcomings of conventional single-machine systems during computation, such as slow speed, low efficiency, and high energy consumption, the solution of the present invention applies the idea of database sharding: the sharded data is stored on individual sub-nodes, and a unified central allocation unit is responsible for aggregating and maintaining the information of each sub-node. This effectively solves the problem of processing massive data, improves resource utilization, and achieves on-demand provision to users.
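The sharding idea described above can be illustrated with a minimal sketch. The patent does not specify how records are assigned to sub-nodes, so the hash-based placement, the class name `CentralAllocator`, and the node names below are all assumptions made for illustration only.

```python
import hashlib
from collections import defaultdict


class CentralAllocator:
    """Hypothetical central allocation unit: decides which sub-node stores a
    record and keeps a summary of what each sub-node holds."""

    def __init__(self, node_names):
        self.node_names = list(node_names)
        # Summary maintained centrally: node name -> number of stored records.
        self.record_counts = {name: 0 for name in self.node_names}
        # In-memory stand-in for the local storage of each sub-node.
        self.storage = defaultdict(dict)

    def node_for(self, key):
        """Choose a sub-node by hashing the record key (assumed policy)."""
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.node_names[int(digest, 16) % len(self.node_names)]

    def put(self, key, value):
        """Store a record on its sub-node and update the central summary."""
        node = self.node_for(key)
        if key not in self.storage[node]:
            self.record_counts[node] += 1
        self.storage[node][key] = value
        return node

    def get(self, key):
        """Look a record up on the sub-node that owns its key."""
        return self.storage[self.node_for(key)].get(key)


allocator = CentralAllocator(["node-0", "node-1", "node-2"])
for i in range(100):
    allocator.put(f"record-{i}", {"value": i})
```

Here `record_counts` plays the role of the aggregated sub-node information. A real deployment would replace the in-memory dictionaries with networked storage nodes.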

Brief description of the drawings

Fig. 1 is an overall architecture diagram of the big data mining system based on distributed parallel computing.

Fig. 2 is a schematic diagram of the scheduling relationships among the components under task scheduling control in the big data mining system based on distributed parallel computing.

Detailed description of the embodiments

The present invention is described in further detail and more completely below with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and do not limit it.

The big data mining system based on distributed parallel computing is a complex system that involves multiple technologies. It is composed of three core modules: the task scheduling control module responsible for task scheduling, the algorithm module that manages the parallel algorithm library, and the data set module that organizes and manages data. The layout of each module in the platform model is shown in Fig. 1.

Referring to Fig. 1, a big data mining system based on distributed parallel computing according to the present invention comprises a client module, a task scheduling control module, an algorithm module, and a data set module. The client module provides the user with a system interaction interface and an access interface. The task scheduling control module is the core of task scheduling for the whole system and is responsible for managing and calling each functional component and coordinating their operation. The algorithm module is mainly responsible for managing the parallel algorithm library. The data set module preprocesses the raw data and provides an efficient data access interface, so that the processed data can be submitted efficiently to the task scheduling control module and the algorithm module for use.

Task scheduling control module

The task scheduling control module consists of two parts: the task scheduling control component and the knowledge base. The task scheduling control component receives client requests issued through the user interface and the open interface, and coordinates the other functional components to implement the system functions. The knowledge base, in the knowledge engineering sense, is a structured, easy-to-operate, easy-to-use, comprehensive, and organized cluster of knowledge: a set of interrelated knowledge pieces that are stored, organized, managed, and used in computer storage by means of a knowledge representation scheme. Data mining results that satisfy the user can be stored in the knowledge base as useful knowledge, which can then guide the user in evaluating mining results.

The task scheduling control component is the core of task scheduling for the whole system; it manages and calls each functional component and coordinates their operation.

When a user submits a data mining request, the user passes the parameters and basic information required for data mining to the task scheduling control component through the user interface or the open interface. The task scheduling control component uses this information to generate a configuration file, which contains the parameters needed to run the parallel data mining algorithm and the basic information needed by the data access component. At the same time, the task scheduling control component schedules the parallel data mining algorithm library according to the parameters and basic information, and selects from the library the data mining algorithm suited to this mining task. The task scheduling control component then schedules the data access component, which extracts data from the data warehouse and the data files. Once the data is ready, the data access component uses scheduling information to notify the task scheduling control component to start the selected parallel data mining algorithm and carry out the data mining. According to the user's requirements, the mining results can be output in visual form and can also be stored in the knowledge base for use in evaluating later mining results.
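The request flow described above can be sketched as a short, single-process example. Everything here is a hypothetical illustration: the patent does not define the configuration file format, so JSON is assumed, and the names `build_config`, `select_algorithm`, `extract_data`, `handle_request`, `ALGORITHM_LIBRARY`, and `DATA_WAREHOUSE` are invented for this sketch.

```python
import json


def item_frequency(rows, params):
    """Descriptive stand-in algorithm: count how often each item occurs."""
    counts = {}
    for row in rows:
        for item in row:
            counts[item] = counts.get(item, 0) + 1
    min_count = params.get("min_count", 1)
    return {item: n for item, n in counts.items() if n >= min_count}


# Hypothetical algorithm library and data source (in-memory stand-ins).
ALGORITHM_LIBRARY = {
    "item_frequency": {"task_type": "description", "run": item_frequency},
}
DATA_WAREHOUSE = {
    "baskets": [["milk", "bread"], ["milk", "eggs"], ["bread", "milk"]],
}


def build_config(request):
    """Step 1: generate the configuration file (JSON text is an assumption)."""
    return json.dumps({
        "task_type": request["task_type"],
        "algorithm_params": request.get("params", {}),
        "data_source": request["data_source"],
    })


def select_algorithm(config):
    """Step 2: pick an algorithm whose task type matches the request."""
    for name, entry in ALGORITHM_LIBRARY.items():
        if entry["task_type"] == config["task_type"]:
            return name
    raise LookupError("no algorithm registered for " + config["task_type"])


def extract_data(config):
    """Step 3: the data access component extracts the requested data."""
    return DATA_WAREHOUSE[config["data_source"]]


def handle_request(request, knowledge_base):
    """Steps 1 to 4, then store the result in the knowledge base."""
    config = json.loads(build_config(request))
    name = select_algorithm(config)
    rows = extract_data(config)
    result = ALGORITHM_LIBRARY[name]["run"](rows, config["algorithm_params"])
    knowledge_base.append({"algorithm": name, "result": result})
    return result


knowledge_base = []
result = handle_request(
    {"task_type": "description", "data_source": "baskets",
     "params": {"min_count": 2}},
    knowledge_base,
)
```

In this sketch the steps run sequentially in one process. In the described system they would be coordinated across separate components by scheduling information.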

Algorithm module

The parallel data mining algorithm library is an important functional component and the main support of the algorithm module. The main purpose of introducing an algorithm library into this data mining platform is to separate data from algorithms, which reduces coupling and makes each easier to upgrade and maintain independently. When a new algorithm needs to be added to the parallel data mining algorithm library, it only needs to be registered with the task scheduling control component to complete the expansion of the library.

The parallel data mining algorithm library is managed by the task scheduling control component: the creation and expansion of the library, as well as the invocation and deregistration of algorithms, are all managed and controlled by that component. Every parallel data mining algorithm must be registered with the task scheduling control component before it can be used. The task scheduling control component creates a control block for each algorithm in its registration table and records the algorithm's parameters there. During data mining, these parameters help the task scheduling control component select a suitable algorithm from the parallel data mining algorithm library.
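A registration table with one control block per algorithm might look like the following minimal sketch. The patent does not list the fields of a control block, so the fields of `ControlBlock` and the methods of `AlgorithmRegistry` are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class ControlBlock:
    """Hypothetical control block: what the scheduler records per algorithm."""
    name: str
    task_type: str  # e.g. "description" or "prediction"
    run: Callable
    default_params: Dict[str, object] = field(default_factory=dict)


class AlgorithmRegistry:
    """Registration table kept by the task scheduling control component."""

    def __init__(self):
        self._table = {}

    def register(self, block):
        """Add an algorithm; it cannot be used before this call."""
        if block.name in self._table:
            raise ValueError("already registered: " + block.name)
        self._table[block.name] = block

    def deregister(self, name):
        """Remove an algorithm from the table if it is present."""
        self._table.pop(name, None)

    def select(self, task_type):
        """Use the recorded parameters to choose a suitable algorithm."""
        for block in self._table.values():
            if block.task_type == task_type:
                return block
        raise LookupError("no algorithm for task type " + task_type)

    def call(self, name, data, **params):
        """Invoke a registered algorithm; raises KeyError if unregistered."""
        block = self._table[name]
        merged = {**block.default_params, **params}
        return block.run(data, **merged)


def mean_value(data, scale=1.0):
    """Toy algorithm used only to exercise the registry."""
    return scale * sum(data) / len(data)


registry = AlgorithmRegistry()
registry.register(
    ControlBlock("mean_value", "prediction", mean_value, {"scale": 1.0})
)
```

Because algorithms are reached only through the table, a new algorithm can be added with `register` without changing the data access code, which matches the decoupling goal stated above.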

From the perspective of data analysis, data mining can be divided into two types: descriptive data mining and predictive data mining. Descriptive data mining expresses significant properties present in the data in a concise, summarizing way. Predictive data mining applies specific methods to analyze the provided data set, obtains one or a set of data models, and uses those models to predict relevant properties of new data in the future. Descriptive data mining includes methods such as association analysis, sequence analysis, and cluster analysis, while predictive data mining includes methods such as classification and statistical regression; common predictive models include decision trees, neural networks, and linear regression. The parallel algorithm library of the parallel data mining platform contains these typical algorithms, and analysis and study show that the algorithms in the library have high efficiency and strong stability. In addition to these classic algorithms, the library reserves an extension interface, so that when an outstanding new data mining algorithm appears, the library can be expanded simply by calling the extension interface.
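The difference between the two types can be shown with a small, self-contained sketch. The descriptive step summarizes existing data, while the predictive step fits a model and applies it to a new input. Ordinary least-squares linear regression is used here only because it is one of the models named above; the function names and the sample data are assumptions.

```python
def describe(values):
    """Descriptive mining stand-in: summarize properties of existing data."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": sum(values) / len(values),
    }


def fit_line(xs, ys):
    """Predictive mining stand-in: least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply the fitted model to a new, unseen input."""
    slope, intercept = model
    return slope * x + intercept


xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
summary = describe(ys)
model = fit_line(xs, ys)
forecast = predict(model, 5.0)
```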

Data set module

Without a good data environment there can be no ideal data mining results. The main function of the data set module is to reduce the heterogeneity of the data, eliminate noisy, missing, and inconsistent data, and provide an efficient data access interface, so that the processed data can be submitted efficiently to the task scheduling control component and the parallel data mining algorithms, thereby improving the accuracy and efficiency of data mining. The data set module mainly includes the data warehouse, the data files, the data preprocessing component, and the data access management component. The function of each component is discussed below, and the structure of the data access management component is described in detail.

Data in the data warehouse is organized by subject, and the stored data can provide information from a historical perspective. When multiple data sources are involved, a data warehouse that has been cleaned and transformed provides an ideal knowledge discovery environment for data mining. A data file is a file of a database; the data files of a database contain all the data of that database and are the physical carrier of the logical database. During data mining, the data access management component can access the data warehouse and the data files directly. When the result of the first data mining run does not satisfy the user, the parallel data mining platform can use the data in the data files to guide the user through a second round of data mining and incremental mining, until a satisfactory result is obtained. Data preprocessing is a technique for improving the quality of the data and of the mining results, and it makes the mining process more effective and easier. Data preprocessing methods include data cleaning, data integration and transformation, data reduction, and automatic generation of attribute concept hierarchies.
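Two of the preprocessing steps named above, data cleaning and data transformation, can be sketched as follows. The specific rules (dropping records with missing fields, removing exact duplicates, min-max scaling) and the function names are assumptions, since the patent does not prescribe particular techniques.

```python
def clean(records, required_fields):
    """Data cleaning: drop records with missing fields and exact duplicates."""
    seen = set()
    cleaned = []
    for record in records:
        if any(record.get(name) is None for name in required_fields):
            continue  # missing data
        key = tuple(record[name] for name in required_fields)
        if key in seen:
            continue  # duplicate record
        seen.add(key)
        cleaned.append(record)
    return cleaned


def min_max_scale(records, field_name):
    """Data transformation: rescale one numeric field to the range [0, 1]."""
    values = [record[field_name] for record in records]
    low, high = min(values), max(values)
    span = high - low
    return [
        {**record,
         field_name: 0.0 if span == 0 else (record[field_name] - low) / span}
        for record in records
    ]


raw = [
    {"id": 1, "age": 20},
    {"id": 2, "age": None},  # missing value, removed by clean()
    {"id": 1, "age": 20},    # duplicate, removed by clean()
    {"id": 3, "age": 40},
    {"id": 4, "age": 30},
]
prepared = min_max_scale(clean(raw, ["id", "age"]), "age")
```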

The data access management component is the core component for data retrieval in the data set module and provides data support for the parallel data mining algorithms. Its functions include: responding to scheduling requests from the task scheduling control component and accessing data in the data warehouse or the data files; distributing and placing data using the MapReduce programming model; and providing read and write interfaces for the parallel data mining algorithms.
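The MapReduce programming model mentioned above can be illustrated with a minimal in-process simulation. This is not a Hadoop job and uses no cluster framework; the functions `split`, `map_phase`, `shuffle`, and `reduce_phase` are hypothetical names, and the example task (counting how many transactions contain each item, as a first step of association analysis) is an assumption.

```python
from collections import defaultdict


def split(data, n_parts):
    """Distribute records across partitions, as if placing them on nodes."""
    return [data[i::n_parts] for i in range(n_parts)]


def map_phase(partition):
    """Map: emit (item, 1) once for each transaction that contains the item."""
    for transaction in partition:
        for item in set(transaction):
            yield item, 1


def shuffle(mapped_pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: sum the values for each key to get per-item support counts."""
    return {key: sum(values) for key, values in groups.items()}


transactions = [
    ["milk", "bread"],
    ["milk", "eggs"],
    ["bread", "milk"],
    ["eggs"],
]
partitions = split(transactions, 2)
mapped = [pair for part in partitions for pair in map_phase(part)]
support = reduce_phase(shuffle(mapped))
```

Each partition is mapped independently, so the map step could run on separate nodes in parallel. Only the shuffle step needs to bring values for the same key together.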

The above is only a preferred embodiment of the present invention and is not intended to limit it. For those skilled in the art, various modifications and changes may be made to the invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its protection scope.

Claims

1. A big data mining system based on distributed parallel computing, characterized in that the system comprises: a client module, a task scheduling control module, an algorithm module, and a data set module; wherein the client module provides the user with a system interaction interface and an access interface; the task scheduling control module is the core of task scheduling for the whole system and is responsible for managing and calling each functional component and coordinating their operation; the algorithm module is mainly responsible for managing the parallel algorithm library; and the data set module preprocesses the raw data and provides an efficient data access interface, so that the processed data can be submitted efficiently to the task scheduling control module and the algorithm module for use.

2. The big data mining system based on distributed parallel computing according to claim 1, characterized in that the client module includes user terminals such as computers and mobile phones.

3. The big data mining system based on distributed parallel computing according to claim 1, characterized in that the task scheduling control module consists of a task scheduling control component and a knowledge base; wherein the task scheduling control component receives client requests issued through the user interface and the open interface and coordinates the other functional components to implement the system functions; the knowledge base, in the knowledge engineering sense, is a structured, easy-to-operate, easy-to-use, comprehensive, and organized cluster of knowledge, being a set of interrelated knowledge pieces that are stored, organized, managed, and used in computer storage by means of a knowledge representation scheme; and data mining results that satisfy the user can be stored in the knowledge base as useful knowledge, to guide the user in evaluating mining results.

4. The big data mining system based on distributed parallel computing according to claim 1, characterized in that the parallel data mining algorithm library in the algorithm module is an important functional component and the main support of the algorithm module, and is managed by the task scheduling control component.

5. The big data mining system based on distributed parallel computing according to claim 1, characterized in that the data set module mainly includes a data source composed of a data warehouse and data files, together with functional components such as data preprocessing and data access management.