CN110019414A - Big data mining system based on distributed parallel computing - Google Patents
Big data mining system based on distributed parallel computing
- Publication number
- CN110019414A
- Authority
- CN
- China
- Prior art keywords
- data
- module
- task schedule
- schedule control
- responsible
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2465—Query processing support for facilitating data mining operations in structured databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/283—Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Data Mining & Analysis (AREA)
- Fuzzy Systems (AREA)
- Mathematical Physics (AREA)
- Probability & Statistics with Applications (AREA)
- Computational Linguistics (AREA)
- Quality & Reliability (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses a big data mining system based on distributed parallel computing. The system comprises a client module, a task scheduling control module, an algorithm module and a data set module. The client module provides users with the system's interactive interface and access interface. The task scheduling control module is the core of task scheduling for the whole system; it manages and invokes the functional components and coordinates their operation. The algorithm module is mainly responsible for managing the parallel algorithm library. The data set module preprocesses the raw data and provides an efficient data access interface, so that the processed data can be delivered efficiently to the task scheduling control module and the algorithm module.
Description
Technical field
The invention belongs to the technical field of big data mining and relates to a big data mining system based on distributed parallel computing.
Background art
In recent years, as data volumes have grown sharply, the conflict between data complexity and limited system computing capability has become increasingly prominent when implementing data mining. Traditional stand-alone systems are slow, inefficient and energy-hungry during computation, so parallel computing is needed to carry out large-scale computation.
A cloud computing platform is a computing platform that features dynamic resource allocation and scheduling, virtualization and high availability. It can meet the computing performance requirements of data mining and provides strong support for parallel data mining.
Summary of the invention
The object of the invention is to provide a big data mining system based on distributed parallel computing. To address the shortcomings of traditional stand-alone systems during computation, such as low speed, low efficiency and high energy consumption, the system applies the idea of database sharding: the sharded data is stored on individual sub-nodes, and a unified central distribution unit is responsible for aggregating and maintaining the information of each sub-node. This effectively solves the problem of processing massive data, improves resource utilization, and provides services to users on demand.
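The sharding idea described above can be illustrated with a minimal Python sketch. The patent does not specify how records are assigned to sub-nodes, so hash-based routing, the class name `CentralDistributionUnit`, the node names and the in-memory dictionaries standing in for sub-nodes are all assumptions made for illustration.

```python
import hashlib


class CentralDistributionUnit:
    """Hypothetical coordinator: routes records to sub-nodes and aggregates their state."""

    def __init__(self, node_names):
        self.node_names = list(node_names)
        # Each sub-node is modelled as an in-memory dict; a real node would be a remote store.
        self.nodes = {name: {} for name in self.node_names}

    def node_for(self, key):
        # A stable hash ensures the same key always maps to the same sub-node.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.node_names[int(digest, 16) % len(self.node_names)]

    def put(self, key, value):
        self.nodes[self.node_for(key)][key] = value

    def get(self, key):
        return self.nodes[self.node_for(key)].get(key)

    def summary(self):
        # Aggregate per-node record counts: the "summarizing" role of the central unit.
        return {name: len(store) for name, store in self.nodes.items()}


unit = CentralDistributionUnit(["node-0", "node-1", "node-2"])
for i in range(100):
    unit.put(f"record-{i}", {"value": i})
```

Calling `unit.summary()` then reports how many records each sub-node holds, and `unit.get("record-7")` is answered by the single sub-node that owns that key.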
To solve the above technical problem, the invention adopts the following technical solution: a big data mining system based on distributed parallel computing, comprising a client module, a task scheduling control module, an algorithm module and a data set module. The client module provides users with the system's interactive interface and access interface. The task scheduling control module is the core of task scheduling for the whole system; it manages and invokes the functional components and coordinates their operation. The algorithm module is mainly responsible for managing the parallel algorithm library. The data set module preprocesses the raw data and provides an efficient data access interface, so that the processed data can be delivered efficiently to the task scheduling control module and the algorithm module.
Further, the client module includes user terminals such as computers and mobile phones.
Further, the task scheduling control module consists of a task scheduling control unit and a knowledge base. The task scheduling control unit receives client requests issued through the user interface and the open interface, and coordinates the other functional components to realize the system functions. The knowledge base is, in the sense of knowledge engineering, a structured, easy-to-operate, easy-to-use and comprehensively organized knowledge cluster: a set of interrelated knowledge pieces that are stored, organized, managed and used in computer storage by means of a knowledge representation scheme. Data mining results that satisfy the user can be stored in the knowledge base as useful knowledge, to guide the user in evaluating mining results.
Further, the parallel data mining algorithm library in the algorithm module is an important functional component and the main support of the algorithm module; it is managed by the task scheduling control unit.
Further, the data set module mainly includes functional components such as the data sources, which consist of a data warehouse and data files, together with data preprocessing and data access management.
Compared with the prior art, the invention has the following beneficial effects:
To address the shortcomings of traditional stand-alone systems during computation, such as low speed, low efficiency and high energy consumption, the solution of the invention applies the idea of database sharding: the sharded data is stored on individual sub-nodes, and a unified central distribution unit is responsible for aggregating and maintaining the information of each sub-node. This effectively solves the problem of processing massive data, improves resource utilization, and provides services to users on demand.
Brief description of the drawings
Fig. 1 is the overall architecture diagram of the big data mining system based on distributed parallel computing.
Fig. 2 is a schematic diagram of the scheduling relationships among the components under task scheduling control in the big data mining system based on distributed parallel computing.
Detailed description of the embodiments
The invention is described below in further detail and more completely with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and do not limit it.
The big data mining system based on distributed parallel computing is a complex system involving multiple technologies. It consists of three core modules: the task scheduling control module responsible for task scheduling, the algorithm module that manages the parallel algorithm library, and the data set module that organizes and manages data. The layout of the modules in the platform model is shown in Fig. 1.
Referring to Fig. 1, a big data mining system based on distributed parallel computing according to the invention comprises a client module, a task scheduling control module, an algorithm module and a data set module. The client module provides users with the system's interactive interface and access interface. The task scheduling control module is the core of task scheduling for the whole system; it manages and invokes the functional components and coordinates their operation. The algorithm module is mainly responsible for managing the parallel algorithm library. The data set module preprocesses the raw data and provides an efficient data access interface, so that the processed data can be delivered efficiently to the task scheduling control module and the algorithm module.
Task scheduling control module
The task scheduling control module consists of two parts: the task scheduling control unit and the knowledge base. The task scheduling control unit receives client requests issued through the user interface and the open interface, and coordinates the other functional components to realize the system functions. The knowledge base is, in the sense of knowledge engineering, a structured, easy-to-operate, easy-to-use and comprehensively organized knowledge cluster: a set of interrelated knowledge pieces that are stored, organized, managed and used in computer storage by means of a knowledge representation scheme. Data mining results that satisfy the user can be stored in the knowledge base as useful knowledge, which in turn guides the user in evaluating later mining results.
The task scheduling control unit is the core of task scheduling for the whole system; it manages and invokes the functional components and coordinates their operation.
When a user submits a data mining request, the user passes the necessary data mining parameters and basic data to the task scheduling control unit through the user interface or the open interface. The task scheduling control unit uses this information to generate a configuration file, which contains the parameters of the parallel data mining algorithm and the basic information needed by the data access component. At the same time, the task scheduling control unit consults the parallel data mining algorithm library according to the parameters and basic data, and selects from the library the algorithm suited to this mining task. The task scheduling control unit then schedules the data access component, which extracts data from the data warehouse and the data files. Once the data is ready, the data access component uses scheduling information to notify the task scheduling control unit, which starts the selected parallel data mining algorithm to perform the mining. Depending on the user's requirements, the mining results can be output in visual form and can also be stored in the knowledge base for the evaluation of later mining results.
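The request flow described above can be sketched in a few lines of Python. This is an illustrative sketch only: the names `MiningRequest`, `TaskScheduler` and `count_frequent`, the dictionary-based configuration and the in-memory knowledge base are assumptions, since the patent does not define concrete interfaces or a configuration file format.

```python
from dataclasses import dataclass, field


@dataclass
class MiningRequest:
    task_type: str                      # hypothetical label used to pick an algorithm
    source: str                         # name of a data source
    params: dict = field(default_factory=dict)


class TaskScheduler:
    """Hypothetical stand-in for the task scheduling control unit."""

    def __init__(self, algorithms, data_sources):
        self.algorithms = algorithms        # task_type -> callable(data, **params)
        self.data_sources = data_sources    # source name -> callable returning records
        self.knowledge_base = []            # stored results for later evaluation

    def build_config(self, request):
        # Step 1: turn the user's parameters into a configuration.
        return {
            "algorithm": request.task_type,
            "source": request.source,
            "params": dict(request.params),
        }

    def run(self, request, store_result=True):
        config = self.build_config(request)
        algorithm = self.algorithms[config["algorithm"]]    # Step 2: select the algorithm
        data = self.data_sources[config["source"]]()        # Step 3: extract the data
        result = algorithm(data, **config["params"])        # Step 4: run the mining
        if store_result:                                    # Step 5: keep the result
            self.knowledge_base.append({"config": config, "result": result})
        return result


def count_frequent(records, min_count=1):
    # Toy mining algorithm: keep items that occur at least min_count times.
    counts = {}
    for item in records:
        counts[item] = counts.get(item, 0) + 1
    return {item: n for item, n in counts.items() if n >= min_count}


scheduler = TaskScheduler(
    algorithms={"frequency": count_frequent},
    data_sources={"sales": lambda: ["milk", "bread", "milk", "eggs", "milk", "bread"]},
)
result = scheduler.run(MiningRequest("frequency", "sales", {"min_count": 2}))
```

Here `result` holds the items that appear at least twice, and `scheduler.knowledge_base` keeps the configuration alongside the result so that a later run can be compared against it.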
Algorithm module
The parallel data mining algorithm library is an important functional component and the main support of the algorithm module. The main purpose of introducing an algorithm library in this data mining platform is to separate data from algorithms, which reduces coupling and makes it easier to upgrade and maintain each independently. When a new algorithm needs to be added to the parallel data mining algorithm library, it only has to be registered with the task scheduling control unit to complete the extension of the library.
The parallel data mining algorithm library is managed by the task scheduling control unit: the creation and extension of the library, and the invocation and deregistration of algorithms, are all managed and controlled by the task scheduling control unit. Every parallel data mining algorithm must be registered with the task scheduling control unit before it can be used. For each algorithm, the task scheduling control unit creates a control block in its registration table that records the algorithm's parameters. During data mining, these parameters help the task scheduling control unit select a suitable algorithm from the parallel data mining algorithm library.
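A registration table with one control block per algorithm might look like the following sketch. The fields of `ControlBlock`, the `AlgorithmRegistry` class and the two toy algorithms are hypothetical; the patent says only that a control block records the algorithm's parameters and that these parameters drive selection.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class ControlBlock:
    """Hypothetical registration record used by the scheduler to pick an algorithm."""
    name: str
    category: str   # "descriptive" or "predictive"
    method: str     # e.g. "summary", "regression"
    run: Callable[[List[float]], object]


class AlgorithmRegistry:
    def __init__(self) -> None:
        self._table: Dict[str, ControlBlock] = {}

    def register(self, block: ControlBlock) -> None:
        # An algorithm must be registered before it can be selected.
        if block.name in self._table:
            raise ValueError(f"algorithm already registered: {block.name}")
        self._table[block.name] = block

    def deregister(self, name: str) -> None:
        self._table.pop(name, None)

    def select(self, category: str, method: str) -> ControlBlock:
        # Match the request against the parameters recorded in each control block.
        for block in self._table.values():
            if block.category == category and block.method == method:
                return block
        raise LookupError(f"no registered algorithm for {category}/{method}")


def mean_baseline(values):
    # Trivial stand-in for a predictive model: predicts the mean of past values.
    return sum(values) / len(values)


def value_range(values):
    # Trivial stand-in for a descriptive summary.
    return (min(values), max(values))


registry = AlgorithmRegistry()
registry.register(ControlBlock("mean_baseline", "predictive", "regression", mean_baseline))
registry.register(ControlBlock("value_range", "descriptive", "summary", value_range))

chosen = registry.select("predictive", "regression")
prediction = chosen.run([2.0, 4.0, 6.0])
```

Adding a new algorithm then amounts to one `register` call, which mirrors the point that extending the library requires no change to the data side.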
From the perspective of data analysis, data mining falls into two types: descriptive data mining and predictive data mining. Descriptive data mining expresses significant properties present in the data in a concise, summarizing manner. Predictive data mining derives one model or a set of models by applying specific methods to the provided data set, and uses the model to predict relevant properties of future new data. Descriptive data mining includes methods such as association analysis, sequence analysis and cluster analysis, while predictive data mining includes methods such as classification and statistical regression; common prediction models include decision trees, neural networks and linear regression. The parallel algorithm library of the parallel data mining platform contains these typical algorithms, and analysis and study indicate that the algorithms in the library have relatively high efficiency and good stability. Besides these classic algorithms, the library also reserves an extension interface: when a good new data mining algorithm appears, the library can be extended simply by calling this interface.
Data set module
Without a good data environment there can be no ideal data mining results. The main function of the data set module is to reduce data heterogeneity, eliminate noisy data, missing data and inconsistent data, and provide an efficient data access interface, so that the processed data can be delivered efficiently to the task scheduling control unit and the parallel data mining algorithms, thereby improving the accuracy and efficiency of data mining. The data set module mainly includes the data warehouse, data files, data preprocessing and the data access management component. The functions of each component are discussed below, and the structure of the data access management component is described in detail.
Data in the data warehouse is organized by subject, and the stored data provides information from a historical perspective. When multiple data sources are involved, a data warehouse whose data has been cleaned and transformed offers an ideal environment for knowledge discovery in data mining. Data files are the files of a database: the data files of a database contain all of its data, and they are the physical carrier of the logical database. During data mining, the data access management component can access the data warehouse and the data files directly. When the result of the first round of mining does not satisfy the user, the parallel data mining platform can use the data in the data files to guide the user through a second round of mining and incremental mining, until a satisfactory result is obtained. Data preprocessing is a technique that improves data quality and the quality of mining results, making the mining process more effective and easier. Data preprocessing methods include data cleaning, data integration and transformation, data reduction, and automatic generation of concept hierarchies for attributes.
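As a small illustration of the data cleaning step named above, the following sketch drops incomplete records, normalizes inconsistent string values and removes duplicates. The function name `clean_records`, the record layout and the specific normalization rule (trim and lowercase) are assumptions for the example; the patent does not prescribe particular cleaning rules.

```python
def clean_records(records, required_fields):
    """Drop records with missing required fields, normalize strings, remove duplicates."""
    seen = set()
    cleaned = []
    for record in records:
        # Missing data: skip records where a required field is absent or None.
        if any(record.get(name) is None for name in required_fields):
            continue
        # Inconsistent data: trim whitespace and lowercase string values.
        normalized = {
            key: value.strip().lower() if isinstance(value, str) else value
            for key, value in record.items()
        }
        # Duplicates: identical normalized records are kept only once.
        fingerprint = tuple(sorted(normalized.items()))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        cleaned.append(normalized)
    return cleaned


raw = [
    {"id": 1, "city": " Beijing "},
    {"id": 1, "city": "beijing"},
    {"id": 2, "city": None},
    {"id": 3, "city": "Shanghai"},
]
cleaned = clean_records(raw, ["id", "city"])
```

In this example the second record collapses into the first after normalization and the third is dropped for its missing city, so two records remain.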
The data access management component is the core component for data access in the data set module and supplies data to the parallel data mining algorithms. Its functions include: responding to scheduling requests from the task scheduling control unit and accessing data in the data warehouse or the data files; distributing and placing data using the MapReduce programming model; and providing read/write interfaces for the parallel data mining algorithms.
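The MapReduce programming model mentioned above can be sketched on a single machine as follows. This is not Hadoop code and not the patent's implementation: the partitioning scheme, the thread pool standing in for worker nodes, and the item-support example (a building block of association analysis) are assumptions chosen to show the map, shuffle and reduce phases.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def split(records, num_partitions):
    # Distribute records round-robin across partitions (stand-ins for sub-nodes).
    return [records[i::num_partitions] for i in range(num_partitions)]


def map_partition(partition):
    # Map phase: emit (item, 1) once per transaction that contains the item.
    pairs = []
    for transaction in partition:
        for item in set(transaction):
            pairs.append((item, 1))
    return pairs


def shuffle(mapped):
    # Shuffle phase: group all emitted values by key.
    grouped = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    # Reduce phase: sum the values for each key.
    return {key: sum(values) for key, values in grouped.items()}


def item_support(transactions, num_partitions=2):
    partitions = split(transactions, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        mapped = list(pool.map(map_partition, partitions))
    return reduce_counts(shuffle(mapped))


transactions = [
    ["milk", "bread"],
    ["milk", "eggs"],
    ["bread", "eggs", "milk"],
    ["eggs"],
]
support = item_support(transactions)
```

The result counts how many transactions contain each item. Because the map phase only looks at its own partition, the same functions would work unchanged if the partitions lived on different nodes.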
The above is only a preferred embodiment of the invention and is not intended to limit it; for those skilled in the art, the invention may have various modifications and variations. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the invention shall fall within its scope of protection.
Claims (5)
1. A big data mining system based on distributed parallel computing, characterized in that the system comprises: a client module, a task scheduling control module, an algorithm module and a data set module; wherein the client module provides users with the system's interactive interface and access interface; the task scheduling control module is the core of task scheduling for the whole system, and manages and invokes the functional components and coordinates their operation; the algorithm module is mainly responsible for managing the parallel algorithm library; and the data set module preprocesses the raw data and provides an efficient data access interface, so that the processed data is delivered efficiently to the task scheduling control module and the algorithm module.
2. The big data mining system based on distributed parallel computing according to claim 1, characterized in that the client module includes user terminals such as computers and mobile phones.
3. The big data mining system based on distributed parallel computing according to claim 1, characterized in that the task scheduling control module consists of a task scheduling control unit and a knowledge base; wherein the task scheduling control unit receives client requests issued through the user interface and the open interface and coordinates the other functional components to realize the system functions; and the knowledge base is, in the sense of knowledge engineering, a structured, easy-to-operate, easy-to-use and comprehensively organized knowledge cluster, being a set of interrelated knowledge pieces that are stored, organized, managed and used in computer storage by means of a knowledge representation scheme, wherein data mining results that satisfy the user can be stored in the knowledge base as useful knowledge to guide the user in evaluating mining results.
4. The big data mining system based on distributed parallel computing according to claim 1, characterized in that the parallel data mining algorithm library in the algorithm module is an important functional component and the main support of the algorithm module, and is managed by the task scheduling control unit.
5. The big data mining system based on distributed parallel computing according to claim 1, characterized in that the data set module mainly includes functional components such as the data sources, which consist of a data warehouse and data files, together with data preprocessing and data access management.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711491787.6A CN110019414A (en) | 2017-12-30 | 2017-12-30 | Big data mining system based on distributed parallel computing |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711491787.6A CN110019414A (en) | 2017-12-30 | 2017-12-30 | Big data mining system based on distributed parallel computing |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110019414A (en) | 2019-07-16 |
Family
ID=67187226
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711491787.6A Pending CN110019414A (en) | 2017-12-30 | 2017-12-30 | Big data mining system based on distributed parallel computing |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110019414A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110633308A (en) * | 2019-08-28 | 2019-12-31 | 北京浪潮数据技术有限公司 | Data mining method, system and related device |
CN111260969A (en) * | 2020-03-06 | 2020-06-09 | 华南农业大学 | Data mining course teaching practice system and teaching practice method based on system |
-
2017
- 2017-12-30 CN CN201711491787.6A patent/CN110019414A/en active Pending
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110633308A (en) * | 2019-08-28 | 2019-12-31 | 北京浪潮数据技术有限公司 | Data mining method, system and related device |
CN111260969A (en) * | 2020-03-06 | 2020-06-09 | 华南农业大学 | Data mining course teaching practice system and teaching practice method based on system |
CN111260969B (en) * | 2020-03-06 | 2021-12-14 | 华南农业大学 | Data mining course teaching practice system and teaching practice method based on system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107491345B (en) | Method for writing picture data and distributed NewSQL database system | |
US20210191954A1 (en) | Push model for intermediate query results | |
CN104915450B (en) | A kind of big data storage and retrieval method and system based on HBase | |
CN104123369B (en) | A kind of implementation method of the configuration management Database Systems based on graphic data base | |
CN103345514B (en) | Streaming data processing method under big data environment | |
CN107193967A (en) | A kind of multi-source heterogeneous industry field big data handles full link solution | |
CN100594497C (en) | System for implementing network search caching and search method | |
CN105956087B (en) | Data version management system and method | |
CN103577605A (en) | Data warehouse based on data fusion and data mining and application method of data warehouse | |
CN103430144A (en) | Data source analytics | |
CN104834557B (en) | A kind of data analysing method based on Hadoop | |
CN103077070B (en) | Cloud computing management system and management method for cloud computing systems | |
CN109508355A (en) | A kind of data pick-up method, system and terminal device | |
CN107402926A (en) | A kind of querying method and query facility | |
CN114721833A (en) | Intelligent cloud coordination method and device based on platform service type | |
Vo et al. | A multi-core approach to efficiently mining high-utility itemsets in dynamic profit databases | |
CN104317957B (en) | A kind of open platform of report form processing, system and report processing method | |
CN101986661A (en) | Improved MapReduce data processing method under virtual machine cluster | |
CN109710668A (en) | A kind of multi-source heterogeneous data access middleware construction method | |
CN112632025A (en) | Power grid enterprise management decision support application system based on PAAS platform | |
Mukherjee | Synthesis of non-replicated dynamic fragment allocation algorithm in distributed database systems | |
CN110019414A (en) | Big data mining system based on distributed parallel computing | |
US20160203409A1 (en) | Framework for calculating grouped optimization algorithms within a distributed data store | |
CN110516985A (en) | Warehouse selection method, system, computer system and computer readable storage medium storing program for executing | |
CN103365923A (en) | Method and device for assessing partition schemes of database |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20190716 |