CN108153748A - A preliminary preparation method for data mining - Google Patents

A preliminary preparation method for data mining

Info

Publication number
CN108153748A
CN108153748A CN201611097402.3A CN201611097402A CN108153748A CN 108153748 A CN108153748 A CN 108153748A CN 201611097402 A CN201611097402 A CN 201611097402A CN 108153748 A CN108153748 A CN 108153748A
Authority
CN
China
Prior art keywords
data
node
preparation
mining
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201611097402.3A
Other languages
Chinese (zh)
Inventor
安西民
林殷
朱巧霞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Space Star Technology (beijing) Co Ltd
Original Assignee
Space Star Technology (beijing) Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Space Star Technology (beijing) Co Ltd
Priority to CN201611097402.3A
Publication of CN108153748A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083 Techniques for rebalancing the load in a distributed system

Abstract

The invention relates to a preliminary preparation method for data mining. A compute node is selected to serve as the data preparation node, so that the preliminary data preparation work of data mining is separated from the control node, which reduces the control node's operating load and speeds up system processing.

Description

A preliminary preparation method for data mining
【Technical field】
The invention belongs to the field of data cleaning, and in particular relates to a preliminary preparation method for data mining.
【Background art】
Data mining is an external service that computing systems commonly provide. In prior-art multi-node systems, the data mining service is typically provided by the control node. Because the preliminary data preparation for a data mining service is relatively time-consuming, having the control node carry out the data cleaning work inevitably occupies a considerable share of its computing resources. Since the control node must also handle task scheduling, distribution, resource control and so on, its processing load increases and the demands on its hardware configuration become very high. If the control node's configuration cannot meet the operating load, the control node is liable to crash and paralyse the system.
In view of the above problem, a new preliminary preparation method for data mining is urgently needed, one that reduces the control node's operating load and speeds up system processing.
【Summary of the invention】
To solve the above problem of the prior art, the invention proposes a preliminary preparation method for data mining.
The technical solution adopted by the invention is as follows:
1. A preliminary preparation method for data mining, characterized in that the method comprises the following steps:
(1) when the control node receives a data mining service request, one compute node is selected from multiple compute nodes as the data preparation node;
(2) the data preparation node receives the data preparation thread from the control node and stores it locally, wherein the data cleaning thread is implemented by multiple data preparation components;
(3) the data preparation node can select different data preparation components, combine them into a data preparation thread and run it, thereby realizing data preparation with different functions.
The beneficial effects of the invention include: the load of the preliminary preparation work of data mining is separated from the control node, which reduces the control node's operating load and speeds up system processing.
【Description of the drawings】
The drawings described here are provided for further understanding of the invention and form part of this application, but do not unduly limit the invention. In the drawings:
Fig. 1 is a structural diagram of the system of the invention.
Fig. 2 is a flow chart of the preliminary preparation method for data mining of the invention.
【Specific embodiments】
The invention is described in detail below with reference to the drawings and specific embodiments. The illustrative examples and explanations are intended only to explain the invention and do not limit it.
Referring to Fig. 1, the system to which the invention is applied comprises one control node and multiple compute nodes.
Referring to Fig. 2, a preliminary preparation method for data mining comprises the following steps:
(1) the nodes in the system are divided, according to performance, into multiple compute nodes and one control node. The control node stores a scheduling thread, a load monitoring thread, a data cleaning thread and a data mining thread, and is responsible for task scheduling in the system, load monitoring of each compute node, and providing services externally. In one embodiment the services include a data mining service; in other embodiments other external services may also be included;
(2) the load monitoring thread in the control node monitors the operating load of each compute node in real time;
(3) when the control node receives a data mining service request, the load monitoring thread analyses and compares the real-time monitoring data on the current load state of each compute node, and selects the compute node with the lowest operating load as the processing node for the preliminary preparation work of data mining; in the embodiments of the invention this is the data cleaning node. In one embodiment, with 1 control node and 5 compute nodes whose current tasks occupy 60%, 65%, 70%, 75% and 80% of system resources respectively, the compute node with the lowest operating load, 60%, is selected as the data cleaning node. The control node sends its stored data cleaning thread, together with the monitored current load state of that data cleaning node (60%), to the data cleaning node;
(4) the data cleaning node receives the data cleaning thread and the current load state (60%) and saves them locally. In one embodiment, the data cleaning thread of this application is implemented by abstract components. The cleaning service components include a data standardization module, an erroneous-data lookup module, a duplicate-data deletion module, and data association, data merging, data analysis and data enhancement modules. In one embodiment, the data standardization, erroneous-data lookup and duplicate-data deletion modules are grouped into the basic cleaning module set; data association and data merging are grouped into the improved cleaning module set; and data analysis and data enhancement are grouped into the additional cleaning module set. In other embodiments, further cleaning service components and different cleaning module sets may be added;
(5) the data cleaning node selects different data cleaning components according to the current load state, combines them into a data cleaning thread and runs it, realizing basic data cleaning, improved data cleaning and additional data cleaning respectively.
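Steps (2) and (3) above amount to choosing the minimum over the monitored loads. The following is a minimal Python sketch of that choice, not the patent's implementation: the `ComputeNode` class, the node names and the representation of load as a fraction between 0 and 1 are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComputeNode:
    name: str    # hypothetical node identifier
    load: float  # assumed: fraction of system resources in use, 0.0 to 1.0


def select_cleaning_node(nodes):
    """Return the compute node with the lowest current operating load."""
    if not nodes:
        raise ValueError("no compute nodes available")
    return min(nodes, key=lambda node: node.load)


# The five loads from the embodiment: 60%, 65%, 70%, 75%, 80%.
loads = [0.60, 0.65, 0.70, 0.75, 0.80]
nodes = [ComputeNode(f"node-{i}", load) for i, load in enumerate(loads, start=1)]
chosen = select_cleaning_node(nodes)
print(chosen.name, chosen.load)
```

With the loads from the embodiment, the node at 60% is selected. If two nodes were tied, `min` would return the first one encountered; the description does not say how ties should be broken.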
In one embodiment, the data cleaning node compares the stored current load state with a first preset load threshold. If the current load state is not lower than the first preset load threshold (for example, a current load state of 60% against a first preset load threshold of 55%; 60% is not lower than 55%), the data cleaning node selects the basic cleaning module set together with the input component, connection component, data container component and output component to form a new data cleaning thread and runs it, realizing the basic data cleaning task.
In other embodiments, if the current load state is lower than the first preset load threshold (for example, a current load state of 50% against a first preset load threshold of 55%), the current load state is then compared with a second preset load threshold. If the current load state is not lower than the second preset load threshold (for example, with a second preset load threshold of 40%, 50% is not lower than 40%), the data cleaning node selects the basic and improved cleaning module sets together with the input, connection, data container and output components to form a new data cleaning thread and runs it, realizing the improved data cleaning task. If the current load state is lower than the second preset load threshold (for example, a current load state of 50% against a second load threshold of 52%), the data cleaning node selects the basic, improved and additional cleaning module sets together with the input, connection, data container and output components to form a new data cleaning thread and runs it, realizing the additional data cleaning task. In one embodiment, the first and second load thresholds are preset and can also be adjusted by the control node.
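The two-threshold rule can be sketched as a small Python function. This is an illustration under stated assumptions, not the patent's code: the string names of the modules and components are hypothetical, loads are assumed to be fractions between 0 and 1, and the order in which the framing components are placed around the cleaning modules is a guess, since the description does not specify it.

```python
# Module sets as grouped in the description; the string names are hypothetical.
MODULE_SETS = {
    "basic": ["data_standardization", "erroneous_data_lookup", "duplicate_data_deletion"],
    "improved": ["data_association", "data_merging"],
    "additional": ["data_analysis", "data_enhancement"],
}


def select_module_sets(load, first_threshold=0.55, second_threshold=0.40):
    """Pick cleaning module sets: the busier the node, the fewer sets it runs."""
    if second_threshold > first_threshold:
        raise ValueError("second threshold must not exceed the first")
    if load >= first_threshold:
        return ["basic"]
    if load >= second_threshold:
        return ["basic", "improved"]
    return ["basic", "improved", "additional"]


def build_cleaning_thread(load, **thresholds):
    """Combine the selected modules with the input/connection/container/output components."""
    modules = [m for s in select_module_sets(load, **thresholds) for m in MODULE_SETS[s]]
    # Assumed ordering: framing components wrap the cleaning modules.
    return ["input", "connection", "data_container", *modules, "output"]


print(select_module_sets(0.60))                         # load 60%, thresholds 55% / 40%
print(select_module_sets(0.50))                         # load 50%, thresholds 55% / 40%
print(select_module_sets(0.50, second_threshold=0.52))  # load 50%, thresholds 55% / 52%
```

The three calls mirror the three worked cases in the text: basic cleaning only, basic plus improved cleaning, and all three module sets.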
Because this application selects a suitable cleaning task according to the current load of the data cleaning node, the cleaning task is accomplished while its impact on the tasks already running on the data cleaning node is kept as small as possible, which keeps the system load balanced.
In one embodiment, the data standardization module addresses inconsistent standards across multi-source data: following the unified, standard description scheme of a pre-established data warehouse, it brings all data entering the warehouse into a standardized format. The erroneous-data lookup module finds and deletes unreasonable data, illogical data and inconsistent data. The duplicate-data deletion module identifies and deletes approximately duplicate data.
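To make the three basic cleaning modules concrete, here is a small Python sketch using only the standard library. Everything specific in it is an assumption for illustration: the record fields (`name`, `age`, `date`), the accepted source date formats, the validity rule, and the use of `difflib.SequenceMatcher` with a 0.9 similarity cut-off as the test for approximate duplicates. The patent does not prescribe any of these.

```python
from datetime import datetime
from difflib import SequenceMatcher

DATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%d.%m.%Y")  # assumed source formats


def standardize(record):
    """Map one source record onto the assumed warehouse format."""
    out = dict(record)
    out["name"] = " ".join(record["name"].split()).title()
    for fmt in DATE_FORMATS:
        try:
            out["date"] = datetime.strptime(record["date"], fmt).strftime("%Y-%m-%d")
            break
        except ValueError:
            continue
    else:
        out["date"] = None  # unparseable date, rejected by is_valid below
    return out


def is_valid(record):
    """Assumed plausibility rule: a parseable date and an age within 0..150."""
    return record["date"] is not None and 0 <= record["age"] <= 150


def deduplicate(records, threshold=0.9):
    """Keep the first of any records with the same date and near-identical names."""
    kept = []
    for rec in records:
        is_dup = any(
            rec["date"] == k["date"]
            and SequenceMatcher(None, rec["name"], k["name"]).ratio() >= threshold
            for k in kept
        )
        if not is_dup:
            kept.append(rec)
    return kept


raw = [
    {"name": "zhang  wei", "age": 34, "date": "2016/12/02"},
    {"name": "Zhang Wei", "age": 34, "date": "2016-12-02"},
    {"name": "Li Na", "age": -5, "date": "2016-12-02"},
    {"name": "Wang Fang", "age": 41, "date": "02.12.2016"},
]
cleaned = deduplicate([r for r in map(standardize, raw) if is_valid(r)])
print([r["name"] for r in cleaned])
```

In this sample the first two records collapse into one after standardization, the record with a negative age is dropped as unreasonable, and the remaining record has its date rewritten into the common format.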
In one embodiment, the data analysis module performs correlation analysis on the raw data according to user-defined patterns, or performs targeted analysis according to user-defined, individual analysis requirements. The data enhancement module uses external dictionaries and rules to supplement incomplete data and omitted fields in the raw data, or adds extra information by adding fields.
In one embodiment, the data association module finds and identifies related data and associates them, for example by associating the age field and occupation field that relate to the same name field and establishing an association relationship. The data merging module finds and identifies data of the same kind and merges them, for example by merging multiple purchase records on the same date and summing the purchase quantities, or by summing the purchase quantities of the same article within one month.
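The association and merging examples can be sketched in a few lines of Python. The record layout (`date`, `item`, `quantity`), the ISO date strings, and the choice to key merged records by period and item are assumptions for illustration; the description only states that quantities of like records are added together.

```python
from collections import defaultdict


def merge_purchases(records, by_month=False):
    """Sum purchase quantities per item, per day or (optionally) per month."""
    totals = defaultdict(int)
    for rec in records:
        # Assumed ISO dates: the first 7 characters are the year and month.
        period = rec["date"][:7] if by_month else rec["date"]
        totals[(period, rec["item"])] += rec["quantity"]
    return [
        {"period": period, "item": item, "quantity": quantity}
        for (period, item), quantity in sorted(totals.items())
    ]


def associate_by_name(ages, occupations):
    """Link the age and occupation entries that share the same name."""
    return {
        name: {"age": ages[name], "occupation": occupations[name]}
        for name in ages.keys() & occupations.keys()
    }


purchases = [
    {"date": "2016-12-02", "item": "pen", "quantity": 2},
    {"date": "2016-12-02", "item": "pen", "quantity": 3},
    {"date": "2016-12-15", "item": "pen", "quantity": 1},
]
print(merge_purchases(purchases))
print(merge_purchases(purchases, by_month=True))
print(associate_by_name({"Zhang Wei": 34, "Li Na": 28}, {"Zhang Wei": "engineer"}))
```

Only names present in both inputs are associated here; whether unmatched entries should be kept is left open by the description.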
With the above method, the invention selects one compute node, according to the operating load of each compute node, to serve as the data preparation node, and separates the load of the preliminary preparation work of data mining from the control node, which reduces the control node's operating load and speeds up system processing.
The above are only preferred embodiments of the invention; all equivalent changes or modifications made according to the structures, features and principles described within the scope of this patent application fall within the scope of this patent application.

Claims (4)

1. A preliminary preparation method for data mining, characterized in that the method comprises the following steps:
(1) when the control node receives a data mining service request, one compute node is selected from multiple compute nodes as the data preparation node;
(2) the data preparation node receives the data preparation thread from the control node and stores it locally, wherein the data cleaning thread is implemented by multiple data preparation components;
(3) the data preparation node can select different data preparation components, combine them into a data preparation thread and run it, thereby realizing data preparation with different functions.
2. The preliminary preparation method for data mining according to claim 1, characterized in that the data preparation is data cleaning, the data preparation components are cleaning service components, and the cleaning service components include a data standardization module, an erroneous-data lookup module and a duplicate-data deletion module.
3. The preliminary preparation method for data mining according to claim 2, characterized in that the cleaning service components include a data standardization module, an erroneous-data lookup module and a duplicate-data deletion module.
4. The preliminary preparation method for data mining according to claim 2, characterized in that the data standardization module addresses inconsistent standards across multi-source data and, following the unified, standard description scheme of a pre-established data warehouse, brings all data entering the warehouse into a standardized format; the erroneous-data lookup module finds and deletes unreasonable data, illogical data and inconsistent data; and the duplicate-data deletion module identifies and deletes approximately duplicate data.
CN201611097402.3A 2016-12-02 2016-12-02 A preliminary preparation method for data mining Pending CN108153748A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611097402.3A CN108153748A (en) 2016-12-02 2016-12-02 A preliminary preparation method for data mining

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611097402.3A CN108153748A (en) 2016-12-02 2016-12-02 A preliminary preparation method for data mining

Publications (1)

Publication Number Publication Date
CN108153748A (en) 2018-06-12

Family

ID=62469339

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611097402.3A Pending CN108153748A (en) 2016-12-02 2016-12-02 A preliminary preparation method for data mining

Country Status (1)

Country Link
CN (1) CN108153748A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101252546A (en) * 2008-04-15 2008-08-27 中国科学技术大学 Method and apparatus for migrating medium stream online service
CN103365726A (en) * 2013-07-08 2013-10-23 华中科技大学 Resource management method and system facing GPU (Graphic Processing Unit) cluster
CN104809194A (en) * 2015-04-23 2015-07-29 重庆工业职业技术学院 Data mining platform, system and method
CN105094982A (en) * 2014-09-23 2015-11-25 航天恒星科技有限公司 Multi-satellite remote sensing data processing system
CN106126601A (en) * 2016-06-20 2016-11-16 华南理工大学 A kind of social security distributed preprocess method of big data and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
Address after: 101399 No. 2 East Airport Road, Shunyi Airport Economic Core Area, Beijing (1st, 5th and 7th floors of Industrial Park 1A-4)
Applicant after: Zhongke Star Map Co., Ltd.
Address before: 101399 Building 1A-4, National Geographic Information Technology Industrial Park, Guomen Business District, Shunyi District, Beijing
Applicant before: Space Star Technology (Beijing) Co., Ltd.
RJ01 Rejection of invention patent application after publication
Application publication date: 20180612