CN110019414A - Big data digging system based on Distributed Parallel Computing - Google Patents

Big data digging system based on Distributed Parallel Computing

Info

Publication number
CN110019414A
Authority
CN
China
Prior art keywords
data
module
task schedule
schedule control
responsible
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711491787.6A
Other languages
Chinese (zh)
Inventor
周峻松
徐继峰
祁建明
陈墩金
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Ming - Collar Gene Technology Co Ltd
Original Assignee
Guangzhou Ming - Collar Gene Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/283 Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 Task transfer initiation or dispatching
    • G06F9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a big data mining system based on distributed parallel computing. The system comprises a client module, a task scheduling control module, an algorithm module, and a data set module. The client module provides users with the system interaction interface and access interfaces. The task scheduling control module is the core of task scheduling for the whole system; it manages and invokes each functional component and coordinates their operation. The algorithm module is mainly responsible for managing the parallel algorithm library. The data set module preprocesses the raw data and provides an efficient data access interface, so that the processed data can be submitted efficiently to the task scheduling control module and the algorithm module.

Description

Big data mining system based on distributed parallel computing
Technical field
The invention belongs to the technical field of big data mining and relates to a big data mining system based on distributed parallel computing.
Background art
In recent years, as data volumes have grown sharply, the conflict between the complexity of the data and the limited computing capacity of the system has become increasingly prominent in data mining implementations. Traditional single-machine systems are slow, inefficient, and energy-hungry during computation, so large-scale computation needs to be carried out with parallel computing.
A cloud computing platform is a computing platform characterized by dynamic resource allocation and scheduling, virtualization, and high availability. It can meet the computing performance requirements of data mining and provides strong support for parallel data mining.
Summary of the invention
The purpose of the present invention is to provide a big data mining system based on distributed parallel computing. To address the slow speed, low efficiency, and high energy consumption that traditional single-machine systems show during computation, the invention applies the idea of database sharding: the sharded data is stored on the individual sub-nodes, and a single unified central distribution unit is responsible for summarizing and maintaining the information of each sub-node. This effectively solves the problem of processing massive data, improves resource utilization, and provides service to users on demand.
To solve the above technical problem, the present invention adopts the following technical solution: a big data mining system based on distributed parallel computing, comprising a client module, a task scheduling control module, an algorithm module, and a data set module. The client module provides users with the system interaction interface and access interfaces. The task scheduling control module is the core of task scheduling for the whole system; it manages and invokes each functional component and coordinates their operation. The algorithm module is mainly responsible for managing the parallel algorithm library. The data set module preprocesses the raw data and provides an efficient data access interface, so that the processed data can be submitted efficiently to the task scheduling control module and the algorithm module.
Further, the client module includes user terminals such as computers and mobile phones.
Further, the task scheduling control module consists of a task scheduling control unit and a knowledge base. The task scheduling control unit receives client requests issued through the user interface and the open interface and coordinates the other functional components to implement the system functions. The knowledge base is, in the knowledge engineering sense, a structured, easy-to-operate, easy-to-use, and comprehensively organized knowledge cluster: a set of interrelated knowledge pieces that are stored, organized, managed, and used in computer memory by means of a knowledge representation scheme. Data mining results that the user is satisfied with can be stored in the knowledge base as useful knowledge, to guide the user in evaluating mining results.
Further, the parallel data mining algorithm library in the algorithm module is an important functional component and the main support of the algorithm module; it is managed by the task scheduling control unit.
Further, the data set module mainly includes the data sources, made up of a data warehouse and data files, together with functional components such as data preprocessing and data access management.
Compared with the prior art, the present invention has the following beneficial effects:
To address the slow speed, low efficiency, and high energy consumption that traditional single-machine systems show during computation, the solution of the present invention applies the idea of database sharding: the sharded data is stored on the individual sub-nodes, and a single unified central distribution unit is responsible for summarizing and maintaining the information of each sub-node. This effectively solves the problem of processing massive data, improves resource utilization, and provides service to users on demand.
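The sharding idea described above can be illustrated with a minimal sketch. The patent does not specify how records are assigned to sub-nodes, so the hash-based placement, the class name `CentralDistributionUnit`, and the node names below are assumptions made purely for illustration.

```python
import hashlib
from collections import defaultdict


class CentralDistributionUnit:
    """Hypothetical coordinator that assigns records to sub-nodes by hashing a key."""

    def __init__(self, node_names):
        self.node_names = list(node_names)
        # Each sub-node is modeled as an in-memory list of (key, record) pairs.
        self.nodes = defaultdict(list)

    def node_for(self, key):
        # A stable hash keeps the same key on the same sub-node across runs.
        digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
        return self.node_names[int(digest, 16) % len(self.node_names)]

    def store(self, key, record):
        self.nodes[self.node_for(key)].append((key, record))

    def summary(self):
        # Aggregate per-node record counts: the "summarizing" role of the unit.
        return {name: len(self.nodes[name]) for name in self.node_names}


unit = CentralDistributionUnit(["node-0", "node-1", "node-2"])
for i in range(100):
    unit.store(f"sample-{i}", {"value": i})
print(unit.summary())
```

In a real deployment each sub-node would be a separate machine or storage service; the in-memory lists here only stand in for that.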
Brief description of the drawings
Fig. 1 is the overall architecture diagram of the big data mining system based on distributed parallel computing.
Fig. 2 is a schematic diagram of the scheduling relationships among the components under task scheduling control in the big data mining system based on distributed parallel computing.
Detailed description of the embodiments
The present invention is described in further detail and more completely below with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and do not limit it.
The big data mining system based on distributed parallel computing is a complex system that combines multiple technologies. It consists of three core modules: the task scheduling control module responsible for task scheduling, the algorithm module that manages the parallel algorithm library, and the data set module that organizes and manages the data. The layout of the modules in the platform model is shown in Fig. 1.
Referring to Fig. 1, a big data mining system based on distributed parallel computing according to the present invention comprises a client module, a task scheduling control module, an algorithm module, and a data set module. The client module provides users with the system interaction interface and access interfaces. The task scheduling control module is the core of task scheduling for the whole system; it manages and invokes each functional component and coordinates their operation. The algorithm module is mainly responsible for managing the parallel algorithm library. The data set module preprocesses the raw data and provides an efficient data access interface, so that the processed data can be submitted efficiently to the task scheduling control module and the algorithm module.
Task scheduling control module
The task scheduling control module consists of two parts: the task scheduling control unit and the knowledge base. The task scheduling control unit receives client requests issued through the user interface and the open interface and coordinates the other functional components to implement the system functions. The knowledge base is, in the knowledge engineering sense, a structured, easy-to-operate, easy-to-use, and comprehensively organized knowledge cluster: a set of interrelated knowledge pieces that are stored, organized, managed, and used in computer memory by means of a knowledge representation scheme. Data mining results that the user is satisfied with can be stored in the knowledge base as useful knowledge, which can then guide the user in evaluating mining results.
The task scheduling control unit is the core of task scheduling for the whole system; it manages and invokes each functional component and coordinates their operation.
When a user submits a data mining request, the user supplies the parameters and basic data required for the mining task to the task scheduling control unit through the user interface or the open interface. The task scheduling control unit uses this information to generate a control configuration file, which contains the parameters of the parallel data mining algorithm and the basic information needed by the data access component. At the same time, the task scheduling control unit consults the parallel data mining algorithm library according to the parameters and basic data and selects from it a data mining algorithm suited to this task. The task scheduling control unit then schedules the data access component, which extracts data from the data warehouse and the data files. Once the data is ready, the data access component uses scheduling information to notify the task scheduling control unit, which starts the selected parallel data mining algorithm to perform the mining. Depending on the user's requirements, the mining results can be output in visual form and can also be stored in the knowledge base for use in evaluating later mining results.
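The request flow described above can be sketched as a short, single-process program. This is only an illustration under stated assumptions: the names `MiningRequest`, `TaskScheduler`, and `mean_threshold` are hypothetical, the configuration file is modeled as a plain dictionary, and the toy algorithm stands in for a real parallel data mining algorithm.

```python
from dataclasses import dataclass, field


@dataclass
class MiningRequest:
    task_type: str   # name used to look up an algorithm in the library
    source: str      # name of a data source known to the data access component
    params: dict = field(default_factory=dict)


class TaskScheduler:
    """Hypothetical stand-in for the task scheduling control unit."""

    def __init__(self, algorithms, data_sources):
        self.algorithms = algorithms      # task_type -> callable(data, **params)
        self.data_sources = data_sources  # source name -> callable returning data
        self.knowledge_base = []          # results kept for later evaluation

    def run(self, request, keep_result=True):
        # Step 1: build the configuration from the submitted parameters.
        config = {"algorithm": request.task_type,
                  "source": request.source,
                  "params": dict(request.params)}
        # Step 2: select a suitable algorithm from the library.
        algorithm = self.algorithms[config["algorithm"]]
        # Step 3: have the data access component fetch the data.
        data = self.data_sources[config["source"]]()
        # Step 4: start the chosen algorithm once the data is ready.
        result = algorithm(data, **config["params"])
        # Step 5: optionally store the result in the knowledge base.
        if keep_result:
            self.knowledge_base.append((config, result))
        return result


def mean_threshold(data, threshold=0.0):
    """Toy 'algorithm': report whether the mean exceeds a threshold."""
    mean = sum(data) / len(data)
    return {"mean": mean, "above": mean > threshold}


scheduler = TaskScheduler(
    algorithms={"summary": mean_threshold},
    data_sources={"warehouse": lambda: [1.0, 2.0, 3.0, 4.0]},
)
result = scheduler.run(MiningRequest("summary", "warehouse", {"threshold": 2.0}))
print(result)
```

Keeping the configuration as data separate from the algorithm callables mirrors the separation of scheduling, algorithms, and data access that the description relies on.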
Algorithm module
The parallel data mining algorithm library is an important functional component and the main support of the algorithm module. The main purpose of introducing an algorithm library into this data mining platform is to separate data from algorithms, which reduces the coupling between them and makes it easier to upgrade and maintain each independently. When a new algorithm needs to be added to the parallel data mining algorithm library, it only has to be registered with the task scheduling control unit to complete the extension.
The parallel data mining algorithm library is managed by the task scheduling control unit: creating and extending the library, as well as invoking and deregistering algorithms, are all managed and controlled by the task scheduling control unit. Every parallel data mining algorithm must be registered with the task scheduling control unit before it can be used. For each algorithm, the task scheduling control unit creates a control block in its registration table that records the algorithm's parameters; during data mining, these parameters help the task scheduling control unit select a suitable algorithm from the parallel data mining algorithm library.
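A registration table of this kind might look like the following minimal sketch. The patent does not say which parameters a control block records, so the fields `task_type` and `min_rows`, and the class name `AlgorithmRegistry`, are assumptions chosen only to show how recorded parameters can drive selection.

```python
class AlgorithmRegistry:
    """Hypothetical registration table kept by the task scheduling control unit."""

    def __init__(self):
        self._table = {}

    def register(self, name, func, task_type, min_rows=0):
        # The control block records the parameters used later for selection.
        self._table[name] = {"func": func, "task_type": task_type, "min_rows": min_rows}

    def unregister(self, name):
        self._table.pop(name, None)

    def select(self, task_type, n_rows):
        # Return the names of registered algorithms that match the request.
        return sorted(
            name for name, block in self._table.items()
            if block["task_type"] == task_type and n_rows >= block["min_rows"]
        )

    def call(self, name, data):
        # Unregistered algorithms cannot be used.
        if name not in self._table:
            raise KeyError(f"algorithm {name!r} is not registered")
        return self._table[name]["func"](data)


registry = AlgorithmRegistry()
registry.register("count_rows", len, task_type="description")
registry.register("max_value", max, task_type="description", min_rows=1)
print(registry.select("description", n_rows=0))
print(registry.call("max_value", [3, 9, 4]))
```

Because algorithms are looked up by name at call time, adding a new one requires only a `register` call and no change to the scheduling code.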
From the perspective of data analysis, data mining can be divided into two types: descriptive data mining and predictive data mining. Descriptive data mining expresses significant properties present in the data in a concise, summarized form. Predictive data mining applies specific methods to the supplied data set to obtain one model or a group of models, and uses them to predict relevant properties of future, new data. Descriptive data mining includes methods such as association analysis, sequence analysis, and cluster analysis, while predictive data mining includes methods such as classification and statistical regression; common predictive models include decision trees, neural networks, and linear regression. The parallel algorithm library of the parallel data mining platform contains these typical algorithms; according to analysis and study, the algorithms in the library have high efficiency and strong stability. In addition to these classic algorithms, the library reserves an extension interface: when a good new data mining algorithm appears, the library can be extended simply by calling that interface.
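As a small illustration of predictive data mining, the sketch below fits a linear regression model to a data set and then uses it to predict a value for new data. It is a plain, non-parallel least-squares fit written for clarity; the function names `fit_line` and `predict` are hypothetical and are not taken from the patent.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) model to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Fit on the provided data set, then predict for a new, unseen input.
model = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(predict(model, 10))
```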
Data set module
Without a good data environment there can be no ideal data mining results. The main function of the data set module is to reduce the heterogeneity of the data, eliminate noisy, missing, and inconsistent data, and provide an efficient data access interface, so that the processed data can be submitted efficiently to the task scheduling control unit and the parallel data mining algorithms, which in turn improves the accuracy and efficiency of data mining. The data set module mainly includes the data warehouse, the data files, data preprocessing, and the data access management component. The function of each component is discussed below, and the structure of the data access management component is described in detail.
Data in the data warehouse is organized by subject, and the stored data can provide information from a historical perspective. When faced with multiple data sources, a data warehouse that has been cleaned and transformed provides an ideal environment for knowledge discovery in data mining. Data files are the files of a database: the data files of a database contain all of its data and are the physical storage of the logical database. During data mining, the data access management component can access the data warehouse and the data files directly. When the result of the first round of data mining does not satisfy the user, the parallel data mining platform can use the data in the data files to guide the user through a second round of mining and incremental mining, until a satisfactory result is obtained. Data preprocessing is a technique for improving the quality of the data and of the data mining results, making the mining process more effective and easier. Data preprocessing methods include data cleaning, data integration and transformation, data reduction, and automatic generation of attribute concept hierarchies.
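Two of the preprocessing steps named above, data cleaning and data transformation, can be sketched as follows. The record layout (dictionaries with `id` and `value` fields) and the function names are assumptions for illustration; the patent does not prescribe a concrete data format or cleaning rule.

```python
def clean(records, required=("id", "value")):
    """Drop records with missing required fields and remove duplicate ids."""
    seen = set()
    cleaned = []
    for rec in records:
        if any(rec.get(k) is None for k in required):
            continue  # missing data
        if rec["id"] in seen:
            continue  # duplicate entry
        seen.add(rec["id"])
        cleaned.append(rec)
    return cleaned


def min_max_scale(records, key="value"):
    """Rescale one numeric field to the range [0, 1] (a simple transformation)."""
    if not records:
        return []
    values = [r[key] for r in records]
    lo, hi = min(values), max(values)
    span = hi - lo
    # When all values are equal there is nothing to scale, so use 0.0.
    return [{**r, key: (r[key] - lo) / span if span else 0.0} for r in records]


raw = [
    {"id": 1, "value": 10.0},
    {"id": 2, "value": None},
    {"id": 1, "value": 10.0},
    {"id": 3, "value": 30.0},
    {"id": 4, "value": 20.0},
]
prepared = min_max_scale(clean(raw))
print(prepared)
```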
The data access management component is the core component for data retrieval in the data set module and provides data support for the parallel data mining algorithms. Its functions are: responding to scheduling requests from the task scheduling control unit and accessing data in the data warehouse or the data files; distributing and placing data using the MapReduce programming model; and providing read and write interfaces for the parallel data mining algorithms.
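The MapReduce programming model mentioned above can be illustrated with a single-machine sketch that counts item frequencies across data partitions, a typical first step of association analysis. This is not a distributed implementation: the thread pool only stands in for workers on separate nodes, and all function names are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_partition(transactions):
    # Map: emit (item, 1) for every item in every transaction of one partition.
    return [(item, 1) for tx in transactions for item in tx]


def shuffle(mapped_partitions):
    # Shuffle: group emitted values by key across all partitions.
    groups = defaultdict(list)
    for pairs in mapped_partitions:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_counts(groups):
    # Reduce: sum the values collected for each key.
    return {key: sum(values) for key, values in groups.items()}


def item_counts(partitions, workers=2):
    # Each partition is mapped independently, as it would be on its own node.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_partition, partitions))
    return reduce_counts(shuffle(mapped))


partitions = [
    [["milk", "bread"], ["milk"]],
    [["bread", "eggs"], ["milk", "eggs"]],
]
print(item_counts(partitions))
```

Because the map step touches only its own partition, the same structure carries over to a cluster framework where each partition lives on a different node.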
The above is only a preferred embodiment of the present invention and is not intended to limit it; for those skilled in the art, the invention may have various modifications and changes. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (5)

1. A big data mining system based on distributed parallel computing, characterized in that the system comprises: a client module, a task scheduling control module, an algorithm module, and a data set module; wherein the client module provides users with the system interaction interface and access interfaces; the task scheduling control module is the core of task scheduling for the whole system and manages and invokes each functional component and coordinates their operation; the algorithm module is mainly responsible for managing the parallel algorithm library; and the data set module preprocesses the raw data and provides an efficient data access interface, so that the processed data can be submitted efficiently to the task scheduling control module and the algorithm module.
2. The big data mining system based on distributed parallel computing according to claim 1, characterized in that the client module includes user terminals such as computers and mobile phones.
3. The big data mining system based on distributed parallel computing according to claim 1, characterized in that the task scheduling control module consists of a task scheduling control unit and a knowledge base; wherein the task scheduling control unit receives client requests issued through the user interface and the open interface and coordinates the other functional components to implement the system functions; and the knowledge base is, in the knowledge engineering sense, a structured, easy-to-operate, easy-to-use, and comprehensively organized knowledge cluster, being a set of interrelated knowledge pieces that are stored, organized, managed, and used in computer memory by means of a knowledge representation scheme, wherein data mining results that the user is satisfied with can be stored in the knowledge base as useful knowledge to guide the user in evaluating mining results.
4. The big data mining system based on distributed parallel computing according to claim 1, characterized in that the parallel data mining algorithm library in the algorithm module is an important functional component and the main support of the algorithm module, and is managed by the task scheduling control unit.
5. The big data mining system based on distributed parallel computing according to claim 1, characterized in that the data set module mainly includes the data sources, made up of a data warehouse and data files, together with functional components such as data preprocessing and data access management.
CN201711491787.6A 2017-12-30 2017-12-30 Big data digging system based on Distributed Parallel Computing Pending CN110019414A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711491787.6A CN110019414A (en) 2017-12-30 2017-12-30 Big data digging system based on Distributed Parallel Computing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711491787.6A CN110019414A (en) 2017-12-30 2017-12-30 Big data digging system based on Distributed Parallel Computing

Publications (1)

Publication Number Publication Date
CN110019414A true CN110019414A (en) 2019-07-16

Family

ID=67187226

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711491787.6A Pending CN110019414A (en) 2017-12-30 2017-12-30 Big data digging system based on Distributed Parallel Computing

Country Status (1)

Country Link
CN (1) CN110019414A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110633308A (en) * 2019-08-28 2019-12-31 北京浪潮数据技术有限公司 Data mining method, system and related device
CN111260969A (en) * 2020-03-06 2020-06-09 华南农业大学 Data mining course teaching practice system and teaching practice method based on system
CN111260969B (en) * 2020-03-06 2021-12-14 华南农业大学 Data mining course teaching practice system and teaching practice method based on system

Similar Documents

Publication Publication Date Title
CN107491345B (en) Method for writing picture data and distributed NewSQL database system
US20210191954A1 (en) Push model for intermediate query results
CN104915450B (en) A kind of big data storage and retrieval method and system based on HBase
CN104123369B (en) A kind of implementation method of the configuration management Database Systems based on graphic data base
CN103345514B (en) Streaming data processing method under big data environment
CN107193967A (en) A kind of multi-source heterogeneous industry field big data handles full link solution
CN100594497C (en) System for implementing network search caching and search method
CN105956087B (en) Data version management system and method
CN103577605A (en) Data warehouse based on data fusion and data mining and application method of data warehouse
CN103430144A (en) Data source analytics
CN104834557B (en) A kind of data analysing method based on Hadoop
CN103077070B (en) Cloud computing management system and management method for cloud computing systems
CN109508355A (en) A kind of data pick-up method, system and terminal device
CN107402926A (en) A kind of querying method and query facility
CN114721833A (en) Intelligent cloud coordination method and device based on platform service type
Vo et al. A multi-core approach to efficiently mining high-utility itemsets in dynamic profit databases
CN104317957B (en) A kind of open platform of report form processing, system and report processing method
CN101986661A (en) Improved MapReduce data processing method under virtual machine cluster
CN109710668A (en) A kind of multi-source heterogeneous data access middleware construction method
CN112632025A (en) Power grid enterprise management decision support application system based on PAAS platform
Mukherjee Synthesis of non-replicated dynamic fragment allocation algorithm in distributed database systems
CN110019414A (en) Big data digging system based on Distributed Parallel Computing
US20160203409A1 (en) Framework for calculating grouped optimization algorithms within a distributed data store
CN110516985A (en) Warehouse selection method, system, computer system and computer readable storage medium storing program for executing
CN103365923A (en) Method and device for assessing partition schemes of database

Legal Events

Date Code Title Description
PB01 Publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20190716
