CN108153748A - A preliminary data preparation method for data mining - Google Patents
A preliminary data preparation method for data mining
- Publication number
- CN108153748A (application number CN201611097402.3A)
- Authority
- CN
- China
- Prior art keywords
- data
- node
- preparation
- mining
- module
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2465—Query processing support for facilitating data mining operations in structured databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5083—Techniques for rebalancing the load in a distributed system
Abstract
The present invention relates to a preliminary data preparation method for data mining. One compute node is selected to serve as the data preparation node, so that the preliminary data preparation work of data mining is separated from the control node, which reduces the operating load of the control node and speeds up system processing.
Description
【Technical field】
The invention belongs to the field of data cleansing, and in particular relates to a preliminary data preparation method for data mining.
【Background art】
Data mining is a service that a computing system commonly provides to external users. In prior-art multi-node systems, the data mining service is typically provided by the control node. Because the preliminary data preparation for a data mining service takes a relatively long time, having the control node perform the data cleansing inevitably occupies a considerable share of its computing resources. The control node is also responsible for task scheduling, task distribution, resource control and so on, so the cleansing work increases its processing load and places high demands on its hardware configuration. If the configuration of the control node cannot meet the operating load, the control node is liable to crash and paralyze the system.
In view of the above problem, a new preliminary data preparation method for data mining is urgently needed to reduce the operating load of the control node and speed up system processing.
【Summary of the invention】
To solve the above problem of the prior art, the present invention proposes a preliminary data preparation method for data mining.
The technical solution adopted by the present invention is as follows:
1. A preliminary data preparation method for data mining, characterized in that the method comprises the following steps:
(1) when the control node receives a data mining service request, one compute node is selected from multiple compute nodes as the data preparation node;
(2) the data preparation node receives the data preparation thread from the control node and stores it locally, wherein the data cleansing thread is implemented by multiple data preparation components;
(3) the data preparation node can select different data preparation components, combine them into a data preparation thread and run it, thereby realizing data preparation with different functions.
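Step (3) describes combining interchangeable components into one runnable unit. The following is a minimal sketch of that idea, assuming each component is a plain function from a list of records to a list of records; the names `compose`, `strip_whitespace` and `drop_empty` are hypothetical and do not come from the patent, which does not specify a component interface.

```python
from typing import Callable, Iterable, List

Record = dict
Component = Callable[[List[Record]], List[Record]]


def strip_whitespace(records: List[Record]) -> List[Record]:
    """Hypothetical component: trim surrounding whitespace from string fields."""
    return [
        {k: v.strip() if isinstance(v, str) else v for k, v in r.items()}
        for r in records
    ]


def drop_empty(records: List[Record]) -> List[Record]:
    """Hypothetical component: discard records whose fields are all empty."""
    return [r for r in records if any(v not in ("", None) for v in r.values())]


def compose(components: Iterable[Component]) -> Component:
    """Chain the selected components into a single callable, applied in order."""
    chain = list(components)

    def run(records: List[Record]) -> List[Record]:
        for component in chain:
            records = component(records)
        return records

    return run


# Selecting a different list of components yields a different preparation job.
pipeline = compose([strip_whitespace, drop_empty])
result = pipeline([{"name": " Li "}, {"name": ""}])
```

Treating components as functions keeps the combination step trivial; a real system would presumably run the composed job in its own thread on the data preparation node.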
The beneficial effects of the present invention include: the preliminary preparation workload of data mining is separated from the control node, which reduces the operating load of the control node and speeds up system processing.
【Brief description of the drawings】
The drawings described here are provided for further understanding of the present invention and form part of this application, but they do not constitute an improper limitation of the present invention. In the drawings:
Fig. 1 is a structural diagram of the system of the present invention.
Fig. 2 is a flowchart of the preliminary data preparation method for data mining of the present invention.
【Detailed description of the embodiments】
The present invention is described in detail below with reference to the drawings and specific embodiments. The illustrative examples and descriptions are only used to explain the present invention and do not limit it.
Referring to Fig. 1, the system to which the present invention is applied includes one control node and multiple compute nodes.
Referring to Fig. 2, a preliminary data preparation method for data mining comprises the following steps:
(1) the nodes in the system are divided by performance into multiple compute nodes and one control node. The control node stores a scheduling thread, a load monitoring thread, a data cleansing thread and a data mining thread. The control node is responsible for task scheduling within the system, for monitoring the load of each compute node, and for providing services externally. In one embodiment, the services include a data mining service; in other embodiments, other external services may also be included;
(2) the load monitoring thread in the control node monitors the operating load of each compute node in real time;
(3) when the control node receives a data mining service request, the load monitoring thread analyzes and compares the real-time monitoring data on the current load state of each compute node, and selects the compute node with the lowest operating load as the processing node for the preliminary preparation work of data mining; in the embodiments of the present invention this node is the data cleansing node. In one embodiment, if there are 1 control node and 5 compute nodes, and the current running tasks of the compute nodes occupy 60%, 65%, 70%, 75% and 80% of system resources respectively, the compute node with the lowest operating load, 60%, is selected as the data cleansing node. The control node then sends the stored data cleansing thread and the monitored current load state of that data cleansing node (60%) to the data cleansing node;
(4) the data cleansing node receives the data cleansing thread and the current load state (60%) and stores them locally. In one embodiment, the data cleansing thread of this application is implemented by abstract components. The cleaning service components include a data normalization module, an erroneous data search module, a data de-duplication module, and data association, data merging, data analysis and data enhancement modules. In one embodiment, the data normalization module, the erroneous data search module and the data de-duplication module are grouped into a basic cleaning module set; data association and data merging are grouped into an improved cleaning module set; and data analysis and data enhancement are grouped into an additional cleaning module set. In other embodiments, further cleaning service components and different cleaning module sets may also be added.
(5) the data cleansing node selects different data cleansing components according to its current load state, combines them into a data cleansing thread and runs it, thereby realizing basic data cleansing, improved data cleansing or additional data cleansing respectively.
In one embodiment, the data cleansing node compares the stored current load state with a first preset load threshold. If the current load state is not lower than the first preset load threshold (for example, the current load state is 60% and the first preset load threshold is 55%, so 60% is not lower than 55%), the data cleansing node selects the basic cleaning module set together with the input component, connection component, data container component and output component to form a new data cleansing thread and runs it, realizing the basic data cleansing task.
In other embodiments, if the current load state is lower than the first preset load threshold (for example, the current load state is 50% and the first preset load threshold is 55%, so 50% is lower than 55%), the current load state is then compared with a second preset load threshold. If the current load state is not lower than the second preset load threshold (for example, the second preset load threshold is 40%, so 50% is not lower than 40%), the data cleansing node selects the basic cleaning module set and the improved cleaning module set together with the input component, connection component, data container component and output component to form a new data cleansing thread and runs it, realizing the improved data cleansing task. If the current load state is lower than the second preset load threshold (for example, the current load state is 50% and the second preset load threshold is 52%, so 50% is lower than 52%), the data cleansing node selects the basic cleaning module set, the improved cleaning module set and the additional cleaning module set together with the input component, connection component, data container component and output component to form a new data cleansing thread and runs it, realizing the additional data cleansing task. In one embodiment, the first load threshold and the second load threshold are preset and can also be modified and adjusted by the control node.
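The two-threshold comparison in step (5) can be sketched as a small function. This is an illustration under stated assumptions, not the patented implementation: the function name `select_module_sets`, the string labels, and the default thresholds (55% and 40%, taken from the worked examples in the text) are all chosen here for demonstration. The sketch also assumes the second threshold does not exceed the first, which the text implies but does not state.

```python
from typing import List

# Components the text says are always included in the new cleansing thread.
FRAME_COMPONENTS = ["input", "connection", "data_container", "output"]


def select_module_sets(
    load: float,
    first_threshold: float = 55.0,
    second_threshold: float = 40.0,
) -> List[str]:
    """Return the cleaning module sets to run for a given load percentage.

    load >= first_threshold                    -> basic only
    second_threshold <= load < first_threshold -> basic + improved
    load < second_threshold                    -> basic + improved + additional
    """
    if second_threshold > first_threshold:
        raise ValueError("second_threshold must not exceed first_threshold")
    if load >= first_threshold:
        return ["basic"]
    if load >= second_threshold:
        return ["basic", "improved"]
    return ["basic", "improved", "additional"]


def plan_thread(load: float, first: float = 55.0, second: float = 40.0) -> List[str]:
    """Module sets plus the fixed frame components, in one list."""
    return select_module_sets(load, first, second) + FRAME_COMPONENTS
```

With the figures from the text, a load of 60% against thresholds 55%/40% yields only the basic set, 50% against 55%/40% yields basic plus improved, and 50% against 55%/52% yields all three sets. The busier the node, the less cleansing work it takes on.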
Because this application selects a cleansing task suited to the current load of the data cleansing node, the cleansing task is carried out while its impact on the tasks already running on the data cleansing node is kept as small as possible, so that the system load stays balanced.
In one embodiment, the data normalization module is used to solve the problem of inconsistent standards across multi-source data: following the unified standard description defined by a pre-established data warehouse, it brings all stored data into a standardized format. The erroneous data search module is used to find and delete unreasonable data, illogical data and inconsistent data. The data de-duplication module is used to identify and delete approximately duplicate data.
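The three basic cleaning modules can be illustrated with a short sketch. Everything specific here is an assumption made for the example: the record layout (`name`, `age`), the validity rule (age between 0 and 120), and the use of `difflib.SequenceMatcher` with a 0.85 similarity cutoff as the test for approximate duplicates. The patent does not prescribe any of these.

```python
from difflib import SequenceMatcher
from typing import List


def normalize(record: dict) -> dict:
    """Bring a record into one agreed format: tidy name casing, integer age."""
    return {
        "name": " ".join(str(record["name"]).split()).title(),
        "age": int(record["age"]),
    }


def is_valid(record: dict) -> bool:
    """Reject unreasonable values (hypothetical rule: age must be 0..120)."""
    return bool(record["name"]) and 0 <= record["age"] <= 120


def deduplicate(records: List[dict], threshold: float = 0.85) -> List[dict]:
    """Keep the first of any group of records with equal age and similar names."""
    kept: List[dict] = []
    for rec in records:
        is_dup = any(
            k["age"] == rec["age"]
            and SequenceMatcher(None, k["name"].lower(), rec["name"].lower()).ratio()
            >= threshold
            for k in kept
        )
        if not is_dup:
            kept.append(rec)
    return kept


raw = [
    {"name": "  zhang  wei ", "age": "34"},
    {"name": "Zhang Wie", "age": 34},   # near-duplicate with a transposition
    {"name": "Li Na", "age": "-5"},     # unreasonable age
    {"name": "Wang Fang", "age": "28"},
]
cleaned = deduplicate([r for r in map(normalize, raw) if is_valid(r)])
```

Normalizing first matters: the de-duplication step compares records only after they share one format, which is why the modules are applied in this order.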
In one embodiment, the data analysis module is used to perform correlation analysis on the raw data according to user-defined patterns, or targeted analysis according to user-defined, personalized analysis requirements. The data enhancement module uses external dictionaries and rules to supplement incomplete data and omitted fields in the raw data, or to add extra information by adding fields.
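As an illustration of the data enhancement module only, the sketch below fills a missing field from an external dictionary. The lookup table `CITY_TO_PROVINCE`, the field names and the function `enhance` are hypothetical; the patent mentions external dictionaries and rules without naming any.

```python
from typing import Dict, Optional

# Hypothetical external dictionary used to supplement omitted fields.
CITY_TO_PROVINCE: Dict[str, str] = {
    "Guangzhou": "Guangdong",
    "Chengdu": "Sichuan",
}


def enhance(record: dict, lookup: Optional[Dict[str, str]] = None) -> dict:
    """Return a copy of the record with a missing or empty 'province' filled in.

    Existing non-empty values are never overwritten, and records whose city
    is not in the dictionary are returned unchanged.
    """
    lookup = CITY_TO_PROVINCE if lookup is None else lookup
    out = dict(record)
    if not out.get("province"):
        province = lookup.get(out.get("city"))
        if province is not None:
            out["province"] = province
    return out
```

For example, `enhance({"city": "Chengdu", "province": ""})` supplements the empty field, while `enhance({"city": "Guangzhou"})` adds the field that was omitted entirely.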
In one embodiment, the data association module is used to find and identify related data and associate them; for example, the age field and the occupation field that relate to the same name field are associated, establishing an association relationship. The data merging module is used to find and identify data of the same kind and merge them; for example, multiple purchase records on the same date are merged by summing the purchase quantities, or the purchase quantities of the same item within one month are summed.
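The two examples in this paragraph, associating fields by name and summing purchase quantities per date, can be sketched as follows. The function names and the record layout (`date`, `item`, `quantity`) are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def associate_by_name(ages: Dict[str, int], jobs: Dict[str, str]) -> Dict[str, dict]:
    """Link the age and occupation values that belong to the same name."""
    shared = sorted(ages.keys() & jobs.keys())
    return {name: {"age": ages[name], "occupation": jobs[name]} for name in shared}


def merge_purchases(records: List[dict]) -> List[dict]:
    """Merge purchase records for the same date and item by summing quantities."""
    totals: Dict[tuple, int] = defaultdict(int)
    for rec in records:
        totals[(rec["date"], rec["item"])] += rec["quantity"]
    return [
        {"date": date, "item": item, "quantity": qty}
        for (date, item), qty in sorted(totals.items())
    ]


purchases = [
    {"date": "2016-12-02", "item": "pen", "quantity": 2},
    {"date": "2016-12-02", "item": "pen", "quantity": 3},
    {"date": "2016-12-03", "item": "pen", "quantity": 1},
]
merged = merge_purchases(purchases)
```

Grouping by month instead of by day, as the text also suggests, would only change the grouping key, for instance to `rec["date"][:7]` when dates are ISO-formatted strings.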
With the above method, the present invention selects one compute node to serve as the data preparation node according to the operating load of each compute node, and separates the preliminary preparation workload of data mining from the control node, which reduces the operating load of the control node and speeds up system processing.
The above is only a preferred embodiment of the present invention; all equivalent changes or modifications made according to the structures, features and principles described within the scope of this patent application are included in the scope of this patent application.
Claims (4)
1. A preliminary data preparation method for data mining, characterized in that the method comprises the following steps:
(1) when the control node receives a data mining service request, one compute node is selected from multiple compute nodes as the data preparation node;
(2) the data preparation node receives the data preparation thread from the control node and stores it locally, wherein the data cleansing thread is implemented by multiple data preparation components;
(3) the data preparation node can select different data preparation components, combine them into a data preparation thread and run it, thereby realizing data preparation with different functions.
2. The preliminary data preparation method for data mining according to claim 1, characterized in that the data preparation is data cleansing, the data preparation components are cleaning service components, and the cleaning service components include a data normalization module, an erroneous data search module and a data de-duplication module.
3. The preliminary data preparation method for data mining according to claim 2, characterized in that the cleaning service components include a data normalization module, an erroneous data search module and a data de-duplication module.
4. The preliminary data preparation method for data mining according to claim 2, characterized in that the data normalization module is used to solve the problem of inconsistent standards across multi-source data and, following the unified standard description defined by a pre-established data warehouse, brings all stored data into a standardized format; the erroneous data search module is used to find and delete unreasonable data, illogical data and inconsistent data; the data de-duplication module is used to identify and delete approximately duplicate data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611097402.3A CN108153748A (en) | 2016-12-02 | 2016-12-02 | A kind of early-stage preparations method of mining data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611097402.3A CN108153748A (en) | 2016-12-02 | 2016-12-02 | A kind of early-stage preparations method of mining data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108153748A true CN108153748A (en) | 2018-06-12 |
Family ID=62469339
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611097402.3A Pending CN108153748A (en) | 2016-12-02 | 2016-12-02 | A kind of early-stage preparations method of mining data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108153748A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101252546A (en) * | 2008-04-15 | 2008-08-27 | 中国科学技术大学 | Method and apparatus for migrating medium stream online service |
CN103365726A (en) * | 2013-07-08 | 2013-10-23 | 华中科技大学 | Resource management method and system facing GPU (Graphic Processing Unit) cluster |
CN104809194A (en) * | 2015-04-23 | 2015-07-29 | 重庆工业职业技术学院 | Data mining platform, system and method |
CN105094982A (en) * | 2014-09-23 | 2015-11-25 | 航天恒星科技有限公司 | Multi-satellite remote sensing data processing system |
CN106126601A (en) * | 2016-06-20 | 2016-11-16 | 华南理工大学 | A kind of social security distributed preprocess method of big data and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | CB02 | Change of applicant information | Address after: 101399 No. 2 East Airport Road, Shunyi Airport Economic Core Area, Beijing (1st, 5th and 7th floors of Industrial Park 1A-4); Applicant after: Zhongke Star Map Co., Ltd.; Address before: 101399 Building 1A-4, National Geographic Information Technology Industrial Park, Guomen Business District, Shunyi District, Beijing; Applicant before: Space Star Technology (Beijing) Co., Ltd. |
| | RJ01 | Rejection of invention patent application after publication | Application publication date: 20180612 |