CN108153748A

CN108153748A - A preliminary preparation method for data mining

Info

Publication number: CN108153748A
Application number: CN201611097402.3A
Authority: CN
Inventors: 安西民; 林殷; 朱巧霞
Original assignee: Space Star Technology (beijing) Co Ltd
Current assignee: Space Star Technology (beijing) Co Ltd
Priority date: 2016-12-02
Filing date: 2016-12-02
Publication date: 2018-06-12

Abstract

The present invention relates to a preliminary preparation method for data mining. A compute node is selected to serve as the data preparation node, so that the preliminary data preparation for data mining is separated from the control node, which reduces the control node's operating load and speeds up system processing.

Description

A preliminary preparation method for data mining

【Technical field】

The invention belongs to the field of data cleansing, and in particular relates to a preliminary preparation method for data mining.

【Background art】

Data mining is an external service that computing systems commonly provide. In prior-art multi-node systems, the data mining service is typically provided by the control node. Because the preliminary data preparation for a data mining service is time-consuming, having the control node perform the data cleansing work inevitably occupies a considerable share of its computing resources. Since the control node must also handle task scheduling, distribution, and resource control, this increases its processing load and places high demands on its hardware configuration. If the control node's configuration cannot meet the operating load requirement, the control node may crash and the system may be paralyzed.

In view of the above problem, a new preliminary preparation method for data mining is urgently needed that reduces the control node's operating load and speeds up system processing.

【Summary of the invention】

To solve the above problem of the prior art, the present invention proposes a preliminary preparation method for data mining.

The technical solution adopted by the present invention is as follows:

1. A preliminary preparation method for data mining, characterized in that the method comprises the following steps:

(1) when the control node receives a data mining service request, one compute node is selected from multiple compute nodes to serve as the data preparation node;

(2) the data preparation node receives a data preparation thread from the control node and stores it locally, wherein the data cleansing thread is implemented by multiple data preparation components;

(3) the data preparation node can select different data preparation components, combine them into a data preparation thread, and run it, thereby performing data preparation with different functions.

The beneficial effects of the present invention include: the preliminary preparation load of data mining is separated from the control node, which reduces the control node's operating load and speeds up system processing.

【Brief description of the drawings】

The accompanying drawings described here provide a further understanding of the present invention and form part of this application, but do not constitute an improper limitation of the present invention. In the drawings:

Fig. 1 is a structural diagram of the system of the present invention.

Fig. 2 is a flow chart of the preliminary preparation method for data mining of the present invention.

【Detailed description of the embodiments】

The present invention is described in detail below with reference to the drawings and specific embodiments. The illustrative examples and descriptions here serve only to explain the present invention and are not a limitation of it.

Fig. 1 shows the system to which the present invention is applied, which includes one control node and multiple compute nodes.

Referring to Fig. 2, a preliminary preparation method for data mining comprises the following steps:

(1) The nodes in the system are divided, according to performance, into multiple compute nodes and one control node. The control node stores a scheduling thread, a load monitoring thread, a data cleansing thread, and a data mining thread. The control node is responsible for task scheduling within the system, for monitoring the load of each compute node, and for providing services externally. In one embodiment the services include a data mining service; in other embodiments other external services may also be included;

(2) The load monitoring thread in the control node monitors the operating load of each compute node in real time;

(3) When the control node receives a data mining service request, the load monitoring thread analyzes and compares the real-time monitoring data on the current load state of each compute node and selects the compute node with the lowest operating load as the processing node for the preliminary preparation of data mining, which in the embodiments of the present invention is the data cleansing node. In one embodiment, with 1 control node and 5 compute nodes whose current tasks occupy 60%, 65%, 70%, 75%, and 80% of system resources respectively, the compute node with the lowest operating load, 60%, is selected as the data cleansing node. The control node then sends its stored data cleansing thread, together with the monitored current load state of the data cleansing node (60%), to the data cleansing node;

(4) The data cleansing node receives the data cleansing thread and the current load state (60%) and stores them locally. In one embodiment, the data cleansing thread of this application is implemented with abstract components. The cleansing service components include a data standardization module, an erroneous data lookup module, a duplicate data deletion module, and data association, data merging, data analysis, and data enhancement modules. In one embodiment, the data standardization, erroneous data lookup, and duplicate data deletion modules are grouped as the basic cleansing module set; data association and data merging are grouped as the improved cleansing module set; and data analysis and data enhancement are grouped as the additional cleansing module set. In other embodiments, further cleansing service components and different cleansing module sets may be added;

(5) The data cleansing node selects different data cleansing components according to the current load state, combines them into a data cleansing thread, and runs it, thereby performing basic data cleansing, improved data cleansing, or additional data cleansing.
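
The node selection in step (3) can be sketched as follows. This is a minimal illustration, assuming the load monitoring thread exposes its latest readings as a mapping from node name to the percentage of system resources in use; the function name `select_cleansing_node` and the node names are hypothetical and do not come from the patent.

```python
def select_cleansing_node(loads):
    """Return (node, load) for the compute node with the lowest current load.

    `loads` maps a node name to the percentage of system resources that
    its current tasks occupy (an assumed data shape, not from the patent).
    """
    if not loads:
        raise ValueError("no compute nodes to choose from")
    node = min(loads, key=loads.get)  # first node wins on a tie
    return node, loads[node]


# Figures from the embodiment: five compute nodes at 60% to 80% load.
loads = {"node1": 60, "node2": 65, "node3": 70, "node4": 75, "node5": 80}
node, load = select_cleansing_node(loads)
print(node, load)  # node1 60
```

The returned load value corresponds to the "current load state" that the control node forwards to the selected node along with the data cleansing thread.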

In one embodiment, the data cleansing node compares the stored current load state with a first preset load threshold. If the current load state is not lower than the first preset load threshold (for example, the current load state is 60% and the first preset load threshold is 55%, so 60% is not lower than 55%), the data cleansing node selects the basic cleansing module set together with the input component, connection component, data container component, and output component to form a new data cleansing thread and runs it, performing the basic data cleansing task;

In other embodiments, if the current load state is lower than the first preset load threshold (for example, the current load state is 50% and the first preset load threshold is 55%, so 50% is lower than 55%), the current load state is then compared with a second preset load threshold. If the current load state is not lower than the second preset load threshold (for example, the second preset load threshold is 40%, and 50% is not lower than 40%), the data cleansing node selects the basic cleansing module set and the improved cleansing module set together with the input component, connection component, data container component, and output component to form a new data cleansing thread and runs it, performing the improved data cleansing task. If the current load state is lower than the second preset load threshold (for example, the current load state is 50% and the second preset load threshold is 52%, so 50% is lower than 52%), the data cleansing node selects the basic cleansing module set, the improved cleansing module set, and the additional cleansing module set together with the input component, connection component, data container component, and output component to form a new data cleansing thread and runs it, performing the additional data cleansing task. In one embodiment, the first and second load thresholds are preset and may also be modified and adjusted by the control node.

Because this application selects a cleansing task suited to the current load of the data cleansing node, the cleansing task is carried out while its impact on the tasks already running on the data cleansing node is kept as small as possible, which keeps the system load balanced.
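
The two-threshold selection described above can be sketched as follows. This is an illustrative reading only: the module and component names are shortened labels chosen for the example, the function `choose_components` is hypothetical, and the sketch assumes the second threshold is lower than the first, as in the embodiment's figures.

```python
# Module sets as grouped in the embodiment (labels are illustrative).
BASIC = ["standardization", "error_lookup", "deduplication"]
IMPROVED = ["association", "merging"]
ADDITIONAL = ["analysis", "enhancement"]
# Components that accompany every cleansing thread in the embodiment.
FRAME = ["input", "connection", "data_container", "output"]


def choose_components(load, first_threshold, second_threshold):
    """Pick cleansing components for a node whose load is given in percent.

    load >= first threshold            -> basic set only
    second threshold <= load < first   -> basic + improved sets
    load < second threshold            -> basic + improved + additional sets
    """
    modules = list(BASIC)
    if load < first_threshold:
        modules += IMPROVED
        if load < second_threshold:
            modules += ADDITIONAL
    return modules + FRAME


print(choose_components(60, 55, 40))  # basic cleansing only
print(choose_components(50, 55, 40))  # basic + improved
print(choose_components(50, 55, 52))  # basic + improved + additional
```

The three calls use the load and threshold figures given in the embodiment. A lightly loaded node takes on more cleansing work, while a busy node performs only the basic set.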

In one embodiment, the data standardization module addresses the problem of inconsistent standards across multi-source data: following the unified standard description of the pre-established data warehouse, it brings all data loaded into the warehouse into a standardized format. The erroneous data lookup module finds and deletes unreasonable data, illogical data, and inconsistent data. The duplicate data deletion module identifies and deletes approximately duplicated data.
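
A minimal sketch of how the three basic modules might be chained is shown below. The record layout (`name`, `age`), the 0 to 150 age rule, and the 0.9 similarity threshold are all assumptions made for illustration; the patent does not specify any of them. Approximate matching here uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for whatever similarity measure an implementation would use.

```python
from difflib import SequenceMatcher


def standardize(record):
    """Bring a record into one format: trimmed lower-case name, integer age."""
    return {"name": record["name"].strip().lower(), "age": int(record["age"])}


def is_valid(record):
    """Reject unreasonable values; the 0-150 age range is an assumed rule."""
    return bool(record["name"]) and 0 <= record["age"] <= 150


def is_near_duplicate(a, b, threshold=0.9):
    """Treat records with equal age and very similar names as duplicates."""
    similarity = SequenceMatcher(None, a["name"], b["name"]).ratio()
    return a["age"] == b["age"] and similarity >= threshold


def basic_clean(records):
    """Standardize, drop invalid records, then drop near duplicates."""
    cleaned = []
    for record in map(standardize, records):
        if not is_valid(record):
            continue
        if any(is_near_duplicate(record, kept) for kept in cleaned):
            continue
        cleaned.append(record)
    return cleaned


raw = [
    {"name": " Zhang Wei ", "age": "34"},
    {"name": "zhang wei", "age": 34},  # duplicate once standardized
    {"name": "Li Na", "age": "-5"},    # unreasonable age
]
print(basic_clean(raw))  # [{'name': 'zhang wei', 'age': 34}]
```

Standardizing first matters: the two "Zhang Wei" records only compare as duplicates after their formats have been unified.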

In one embodiment, the data analysis module performs correlation analysis on the raw data according to user-defined patterns, or targeted analysis according to user-defined, individualized analysis requirements. The data enhancement module uses external dictionaries and rules to fill in incomplete data and omitted fields in the raw data, or to attach additional information by adding fields.
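
The data enhancement step can be illustrated with a small sketch. The external dictionary here (city to province) and the field names are hypothetical examples; the patent only states that external dictionaries and rules are used to fill in incomplete or omitted fields.

```python
# Hypothetical external dictionary: city name -> province.
CITY_TO_PROVINCE = {"Shenzhen": "Guangdong", "Hangzhou": "Zhejiang"}


def enhance(record, dictionary=CITY_TO_PROVINCE):
    """Return a copy of `record` with a missing province filled in.

    The original record is left unchanged. Records whose city is not in
    the dictionary are returned as they are.
    """
    enriched = dict(record)
    if not enriched.get("province"):
        province = dictionary.get(enriched.get("city"))
        if province is not None:
            enriched["province"] = province
    return enriched


print(enhance({"name": "Li Na", "city": "Hangzhou", "province": ""}))
# {'name': 'Li Na', 'city': 'Hangzhou', 'province': 'Zhejiang'}
```

The same function also covers the "add a field" case: a record with no `province` key at all gains one when its city is found in the dictionary.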

In one embodiment, the data association module finds and identifies related data and associates it, for example by linking the age and occupation fields that relate to the same name field to establish an association. The data merging module finds and identifies data of the same kind and merges it, for example by merging multiple purchase records from the same date and summing their purchase quantities, or by summing the purchase quantities of the same item within one month.
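
Both examples in this paragraph can be sketched in a few lines. The record fields and function names are assumptions made for illustration, and grouping purchases by the pair (date, item) is one possible reading of "multiple purchase records on the same date".

```python
from collections import defaultdict


def associate_by_name(records):
    """Link the age and occupation fields of records sharing a name."""
    linked = defaultdict(dict)
    for record in records:
        for field in ("age", "occupation"):
            if field in record:
                linked[record["name"]][field] = record[field]
    return dict(linked)


def merge_purchases(purchases):
    """Sum the quantities of purchase records with the same date and item."""
    totals = defaultdict(int)
    for purchase in purchases:
        totals[(purchase["date"], purchase["item"])] += purchase["quantity"]
    return dict(totals)


people = [
    {"name": "Wang Fang", "age": 29},
    {"name": "Wang Fang", "occupation": "engineer"},
]
print(associate_by_name(people))
# {'Wang Fang': {'age': 29, 'occupation': 'engineer'}}

purchases = [
    {"date": "2016-12-02", "item": "pen", "quantity": 2},
    {"date": "2016-12-02", "item": "pen", "quantity": 3},
]
print(merge_purchases(purchases))
# {('2016-12-02', 'pen'): 5}
```

Merging per month instead of per day would only change the grouping key, for example by truncating the date string to its year and month.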

With the above method, the present invention selects one compute node to serve as the data preparation node according to the operating load of each compute node, separates the preliminary preparation load of data mining from the control node, reduces the control node's operating load, and speeds up system processing.

The above is only a preferred embodiment of the present invention; all equivalent changes or modifications made according to the constructions, features, and principles described within the scope of this patent application are included within the scope of this patent application.

Claims

1. A preliminary preparation method for data mining, characterized in that the method comprises the following steps:

(1) when the control node receives a data mining service request, one compute node is selected from multiple compute nodes to serve as the data preparation node;

(2) the data preparation node receives a data preparation thread from the control node and stores it locally, wherein the data cleansing thread is implemented by multiple data preparation components;

(3) the data preparation node can select different data preparation components, combine them into a data preparation thread, and run it, thereby performing data preparation with different functions.

2. The preliminary preparation method for data mining according to claim 1, characterized in that the data preparation is data cleansing, the data preparation components are cleansing service components, and the cleansing service components include a data standardization module, an erroneous data lookup module, and a duplicate data deletion module.

3. The preliminary preparation method for data mining according to claim 2, characterized in that the cleansing service components include a data standardization module, an erroneous data lookup module, and a duplicate data deletion module.

4. The preliminary preparation method for data mining according to claim 2, characterized in that the data standardization module addresses the problem of inconsistent standards across multi-source data and, following the unified standard description of the pre-established data warehouse, brings all data loaded into the warehouse into a standardized format; the erroneous data lookup module finds and deletes unreasonable data, illogical data, and inconsistent data; and the duplicate data deletion module identifies and deletes approximately duplicated data.