CN110457370A

CN110457370A - Outlier detection system and cleaning method for data mining based on artificial intelligence

Info

Publication number: CN110457370A
Application number: CN201910740294.4A
Authority: CN
Inventors: 冷强奎; 秦玉平
Original assignee: Bohai University
Current assignee: Bohai University
Priority date: 2019-08-12
Filing date: 2019-08-12
Publication date: 2019-11-15

Abstract

The invention belongs to the field of data mining technology and discloses an artificial-intelligence-based outlier detection system and cleaning method for data mining. Outlier-information detection nodes are clustered according to the node data at a given moment; the data are linearly reduced in dimension and fitted to a data curve (the test curve); a detection curve is obtained in the same way; the trend and similarity of the test curve and the detection curve are compared to detect whether an abnormality exists; and any abnormal data found are preprocessed, detected, identified, and cleaned. This addresses two problems in existing outlier detection for data mining: one cannot reliably choose between the Pearson correlation coefficient and the Spearman correlation coefficient, and detected outlier data cannot be cleaned effectively. By detecting outliers, the invention finds the corresponding erroneous data and removes them, thereby improving the data quality of the data source and offering a new approach for data mining.

Description

Outlier detection system and cleaning method for data mining based on artificial intelligence

Technical field

The invention belongs to the field of data mining technology, and in particular relates to an artificial-intelligence-based outlier detection system and cleaning method for data mining.

Background technique

An outlier (isolated point) is a data item in a data set that is inconsistent with the characteristics of most of the data; put differently, outliers are data that do not fit the model of the data. When mining knowledge about normal classes, outliers are usually treated as interference signals. Once it was found that such data can provide useful information for certain applications (such as credit fraud and intrusion detection), a new research topic emerged for data mining: outlier analysis. Methods for discovering and detecting outliers have been widely discussed and fall mainly into three classes: those based on probability statistics, those based on distance, and those based on deviation. Barnett et al. established the concept of outlier detection based on statistical methods. Distance-based outlier detection methods are described in detail in a series of articles by Knorr and Ng. For deviation-based outlier detection, see the work of Arning, Agrawal, and others. Outlier analysis has become a research branch with considerable practical value as a means of security detection, for example for credit card fraud and illegal network intrusion. However, when data correlation analysis is performed during outlier detection in existing data mining, if it is not known what kind of association the analyzed data have, one cannot reliably choose between the Pearson correlation coefficient and the Spearman correlation coefficient. Moreover, because data errors often appear as outliers, detecting and removing the outliers in a data source can serve the purpose of data cleaning and improve the data quality of the source; but not all outliers are erroneous data, so it is still necessary to study how, after outliers are detected, domain knowledge or stored metadata can be used to find the corresponding erroneous data.

In conclusion problem of the existing technology is:

Available data is excavated during Outlier Detection when needing to carry out data dependence analysis, if there is to dividing Analysing data has the uncomprehending situation of which kind of incidence relation, then can not be accurately related from Pearson correlation coefficients or Spearman It is selected in coefficient.

Simultaneously as error in data often shows as isolated point, therefore can by detecting and removing the isolated point in data source Achieve the purpose that data scrubbing, improves the quality of data of data source；But and not all isolated point is all wrong data, therefore is needed The metadata how research can also combine domain knowledge or be stored after detecting isolated point, therefrom find out corresponding error number According to.

Summary of the invention

In view of the problems of the prior art, the present invention provides an artificial-intelligence-based outlier detection system and cleaning method for data mining.

The invention is realized as follows. An artificial-intelligence-based outlier cleaning method for data mining comprises:

Clustering the outlier-information detection nodes according to the node data at a given common moment; training a hyper-ellipsoid for each resulting cluster and calculating each axis length of the hyper-ellipsoid accordingly; linearly reducing the dimension of the outlier information data using the axis-length proportionality coefficients as coefficients; and fitting the reduced data to a data curve, which serves as the test curve.

Applying the same dimensionality reduction and curve fitting to the data of the same time period at the next moment; the fitted curve serves as the detection curve.

Comparing the trend and curve similarity of the test curve and the detection curve to detect whether the outlier information data collected by the nodes contain abnormal data, and taking any abnormal data found as outlier data.

Loading the obtained outlier data through a JDBC interface as the data to be cleaned.

Preprocessing the data to be cleaned.

Performing outlier detection, identification, and processing on the preprocessed data.

Exporting the processed result data to a new data source through the JDBC interface.

Further, the method for obtaining the outlier data specifically includes:

S1: select test data.

S2: cluster the nodes of the selected test data.

S3: for each cluster, train a hyper-ellipsoid that just contains all nodes in the cluster, and calculate the axis lengths of that hyper-ellipsoid.

S4: reduce the dimension of the data according to the axis lengths of each hyper-ellipsoid.

S5: fit a curve to the data reduced according to the axis lengths of each hyper-ellipsoid.

S6: select detection data.

S7: process the detection data.

S8: compare the similarity of the test curve and the detection curve to determine whether the data contain abnormal data.

Further, the detailed process of step S2 are as follows:

Data are calculated by the node data of selection to node clustering according to the data of the identical moment point of each node In the license radius of each dimension,

Judge r_i ^dWithIt is whether adjacent.If adjacent, node i, j belongs to a cluster on dimension direction, only meets section Point is when all belonging to the same cluster in all k dimensions, title node i, the same cluster of j, meanwhile, if two cluster C_iAnd C_jCluster sectionWithMeet

When being set up to all k, then cluster C_iAnd C_jCombinable is a cluster, and cluster radius is

CR=[MIN ({ min_i,min_j}),MAX({max_i,max_j})]。
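The merge rule for two clusters can be illustrated with a minimal Python sketch. The merged radius follows the CR formula above. The merge test itself is an assumption: the original condition is not reproduced in this text, so the sketch simply treats two clusters as mergeable when their intervals overlap in every dimension. The names `can_merge` and `merge` are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (min, max) of a cluster along one dimension


def can_merge(ci: List[Interval], cj: List[Interval]) -> bool:
    """Assumed merge test: the cluster intervals overlap in every dimension k."""
    return all(lo_i <= hi_j and lo_j <= hi_i
               for (lo_i, hi_i), (lo_j, hi_j) in zip(ci, cj))


def merge(ci: List[Interval], cj: List[Interval]) -> List[Interval]:
    """Merged radius per dimension: CR = [MIN(min_i, min_j), MAX(max_i, max_j)]."""
    return [(min(lo_i, lo_j), max(hi_i, hi_j))
            for (lo_i, hi_i), (lo_j, hi_j) in zip(ci, cj)]


# Two clusters described by their intervals in two dimensions.
c_i = [(0.0, 2.0), (1.0, 3.0)]
c_j = [(1.5, 4.0), (2.5, 5.0)]
merged = merge(c_i, c_j) if can_merge(c_i, c_j) else None
```

If the actual adjacency condition is stricter or looser than plain overlap, only `can_merge` would need to change; the merged radius is taken directly from the formula.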

The detailed process of step S3 are as follows:

Connection between data attribute described with the proportionate relationship between each axial length of super ellipsoids, super ellipsoids it is each Axial length is respectively σ_pl≥σ_p-1l≥σ_p-2l≥…≥σ₁l.Wherein, σ_i(1≤i≤p) indicates the covariance matrix Σ's of data set D The square root of characteristic value, the mean value of data set D is indicated with μ, then corresponds to the axial length of super ellipsoids

The detailed process of step S4 are as follows: calculate the corresponding proportionality coefficient a of each axial length of super ellipsoids_iAnd as linear drop The coefficient d of dimension, i.e.,

The detailed process of step S5 are as follows: carry out curve fitting to the data after dimensionality reduction in two-dimensional surface.Ten groups of data fittings At eight smooth nonlinear function curves and its starting point is moved to origin, the curve after translation is as test curve f (x)。
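Steps S4 and S5 can be sketched in Python under stated assumptions. The formulas for the coefficients and for d are not reproduced in this text, so the sketch assumes a_i is each axis length divided by the sum of all axis lengths and d is the weighted sum of the record's attributes. The curve fitting itself is left out; only the translation of the starting point to the origin is shown. All function names are hypothetical.

```python
from typing import List, Sequence


def proportional_coefficients(axis_lengths: Sequence[float]) -> List[float]:
    """Assumed form: a_i = length_i / sum(lengths), so the coefficients sum to 1."""
    total = sum(axis_lengths)
    if total <= 0:
        raise ValueError("axis lengths must have a positive sum")
    return [length / total for length in axis_lengths]


def reduce_linear(record: Sequence[float], coeffs: Sequence[float]) -> float:
    """Assumed reduction: collapse one p-dimensional record to d = sum(a_i * x_i)."""
    return sum(a * x for a, x in zip(coeffs, record))


def shift_to_origin(values: Sequence[float]) -> List[float]:
    """Translate a series so that its starting point sits at zero (step S5)."""
    return [v - values[0] for v in values]


axis_lengths = [3.0, 1.0]
coeffs = proportional_coefficients(axis_lengths)
records = [(2.0, 4.0), (4.0, 4.0), (6.0, 8.0)]
reduced = [reduce_linear(r, coeffs) for r in records]
curve_points = shift_to_origin(reduced)
```

With these inputs the coefficients are 0.75 and 0.25, so the attribute along the longer axis dominates the reduced value; the shifted points would then be the input to whatever curve-fitting routine is used.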

The specific process of step S7 is: apply the dimensionality reduction and curve fitting of steps S4 and S5 to the selected detection data to obtain the detection curve g(x).

Step S8 determines whether abnormal values exist by judging the degree of similarity of the two curves. The specific process is:

Let f(x) be the fitted test curve and g(x) the fitted curve to be detected. For a preset threshold c (0 < c < 1), if curves f(x) and g(x) satisfy, for every x ∈ X,

|f(x) - g(x)| < c

or satisfy a second condition [formula not reproduced in the source text],

then no abnormal value is considered to exist at the node; otherwise an abnormal value is considered to exist.
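The first similarity criterion of step S8 can be sketched as follows. This is a minimal illustration that checks only the condition |f(x) - g(x)| < c on a finite set of sample points; the second criterion is not reproduced in this text and is therefore not implemented. The function name `no_anomaly` and the example curves are assumptions.

```python
from typing import Callable, Iterable


def no_anomaly(f: Callable[[float], float],
               g: Callable[[float], float],
               xs: Iterable[float],
               c: float) -> bool:
    """True when |f(x) - g(x)| < c at every sampled x (first criterion only)."""
    if not 0 < c < 1:
        raise ValueError("threshold c must satisfy 0 < c < 1")
    return all(abs(f(x) - g(x)) < c for x in xs)


xs = [i / 10 for i in range(11)]            # sample points on [0, 1]


def f(x: float) -> float:                   # stand-in test curve
    return x * x


def g_close(x: float) -> float:             # detection curve that stays near f
    return x * x + 0.05


def g_far(x: float) -> float:               # detection curve that drifts away from f
    return x * x + 0.5 * x


normal = no_anomaly(f, g_close, xs, c=0.1)
abnormal = not no_anomaly(f, g_far, xs, c=0.1)
```

Sampling on a grid only approximates "for every x in X"; a denser grid or an analytic bound would be needed where the curves can diverge between sample points.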

Further, the method for performing outlier detection and identification on the preprocessed data includes:

When a new application needs to be deployed in the data network, application rules are formulated by the feature recognition module. The rules are then subjected to outlier-data detection: the rules of the new application are checked against the rules of applications already deployed. If no outlier data are found, the new application is deployed directly. If outlier data are found, they are eliminated from the rules: the priority of each rule is obtained according to a priority criterion, and the outlier data are eliminated according to priority. The rules from which outlier data have been eliminated are then configured into the data network.

Specifically:

Step 1): when a new application request arrives, the rules generated by the application undergo data-model conversion, that is, each rule is divided into a spatial domain S and an action field A. The rules are then forwarded to the outlier-data detection module, which judges whether the new application's own rules belong to an application type that can produce outlier data; if so, step 2) is executed, otherwise step 3) is executed.

Step 2): take one undetected rule from the rules of the new application and one undetected rule from the rules this application already has in the data network, and perform step 4) on the pair; if all rules have been detected, execute step 3).

Step 3): take one undetected rule from the rules of the new application and one undetected rule from the rules of the other deployed applications, and perform step 4) on the pair; if all rules have been detected, execute step 8).

Step 4): denote the spatial domains of the two rules as Sn and So, their action fields as An and Ao, and their priorities as Pn and Po. Then split the spatial domains and generate four new rules R₁, R₂, R₃, R₄, whose spatial domains are S₁ = Sn - So, S₂ = So - Sn, S₃ = So ∩ Sn, S₄ = Sn ∩ So, and whose action fields are A₁ = An, A₂ = Ao, A₃ = Ao, A₄ = An.

Step 5): inspect the result of the split. If S₃ and S₄ are not empty sets and the actions corresponding to A₃ and A₄ differ, outlier data are judged to exist and step 7) is executed; otherwise no outlier data are judged to exist and step 6) is executed.

Step 6): determine whether this step was reached from step 2); if so, return to step 2), otherwise return to step 3).

Step 7): eliminate the outlier data from the rules.

Step 8): configure the rules that contain no outlier data into the data network, thereby deploying the new application.
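Steps 4) and 5) can be sketched in Python by modelling each spatial domain as a set. The set representation, the `Rule` class, and the function names are assumptions made for illustration; the text does not say which priority the four derived rules carry, so the sketch simply lets each derived rule inherit the priority of the rule that supplies its action.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Rule:
    space: FrozenSet[str]   # spatial domain S, modelled here as a set of match keys
    action: str             # action field A
    priority: int           # priority P


def split(new: Rule, old: Rule) -> List[Rule]:
    """Step 4: derive R1..R4 from the new rule (Sn, An) and the old rule (So, Ao)."""
    return [
        Rule(new.space - old.space, new.action, new.priority),  # S1 = Sn - So, A1 = An
        Rule(old.space - new.space, old.action, old.priority),  # S2 = So - Sn, A2 = Ao
        Rule(old.space & new.space, old.action, old.priority),  # S3 = So ∩ Sn, A3 = Ao
        Rule(new.space & old.space, new.action, new.priority),  # S4 = Sn ∩ So, A4 = An
    ]


def has_isolated_point_data(new: Rule, old: Rule) -> bool:
    """Step 5: S3 and S4 are non-empty and the actions A3 and A4 differ."""
    _, _, r3, r4 = split(new, old)
    return bool(r3.space) and bool(r4.space) and r3.action != r4.action


r_new = Rule(frozenset({"10.0.0.1", "10.0.0.2"}), "drop", 2)
r_old = Rule(frozenset({"10.0.0.2", "10.0.0.3"}), "forward", 1)
flagged = has_isolated_point_data(r_new, r_old)
```

Here the two rules overlap on one key and prescribe different actions, so the pair would be passed on to step 7); rules with disjoint spatial domains, or with the same action on the overlap, would continue through step 6).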

After the outlier data are obtained, data matching is performed and then data cleaning is carried out. The data matching method includes:

Step 1: a keyword is entered in the relevant application, and at the same time a fuzzy data query is run against the back-end Oracle database to search for matching information.

Step 2: if the query matches, that is, if corresponding data information is found during the back-end search, a prompt is given and the relevant information is displayed for the user to select and use. If several identical records appear during the query, the record with the most complete basic data information is taken.

Step 3: if the query does not match, that is, if no corresponding data information is found during the back-end query, a prompt is given asking the user to re-enter and save the data.
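The three matching steps can be sketched in Python. This is an illustration only: it uses an in-memory SQLite database from the standard library as a stand-in for the back-end Oracle database, and the table layout, column names, and the "fewest empty fields" measure of completeness are all assumptions.

```python
import sqlite3
from typing import Optional, Tuple

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (name TEXT, phone TEXT, address TEXT)")
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?)",
    [("Zhang Wei", None, None),
     ("Zhang Wei", "555-0100", "Jinzhou"),
     ("Li Na", "555-0101", None)],
)


def fuzzy_match(keyword: str) -> Optional[Tuple]:
    """Return the matching row with the fewest NULL fields, or None if no match."""
    rows = conn.execute(
        "SELECT name, phone, address FROM records WHERE name LIKE ?",
        (f"%{keyword}%",),
    ).fetchall()
    if not rows:
        return None  # step 3: the caller prompts the user to re-enter and save data
    # step 2: among several hits, prefer the most complete record
    return max(rows, key=lambda row: sum(v is not None for v in row))


best = fuzzy_match("Zhang")
missing = fuzzy_match("Wang")
```

The keyword is passed as a bound parameter rather than spliced into the SQL string, which keeps the fuzzy query safe against injection regardless of the database used.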

Further, the fuzzy query method includes the following steps:

Step I: edit and save the initial letters of the outlier information; some outlier information corresponds to two or more initial letters.

Step II: establish the mapping between the outlier information and the initial letters.

Step III: establish the database table structure according to the search fields.

Step IV: when the user edits and saves information, obtain from the mapping the set of initial letters corresponding to each field contained in the information, and record the mapping between the field and its initial-letter set in the database.

Step V: the user enters the initial letters of the query field.

Step VI: obtain the outlier information corresponding to the query field from the mapping between fields and initial-letter sets, and display it.
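Steps I to VI can be sketched with an in-memory inverted index in Python. The sample entries, the dictionary standing in for the database table, and the choice of prefix matching on the typed initials are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Set

# Steps I-II: hand-edited mapping from each entry to its possible initial strings.
# An entry may carry two or more initial strings (for example, a polyphonic name).
INITIALS: Dict[str, Set[str]] = {
    "Chongqing": {"CQ", "ZQ"},
    "Changchun": {"CC"},
    "Chengdu": {"CD"},
}

# Steps III-IV: inverted index from initial string to the entries that carry it
# (a dictionary stands in for the database table here).
index: Dict[str, List[str]] = defaultdict(list)
for entry, initial_set in INITIALS.items():
    for key in initial_set:
        index[key].append(entry)


def lookup(query_initials: str) -> List[str]:
    """Steps V-VI: return entries whose initials start with the typed letters."""
    q = query_initials.upper()
    hits = {e for key, entries in index.items() if key.startswith(q) for e in entries}
    return sorted(hits)


by_c = lookup("c")
by_zq = lookup("zq")
```

Storing every alternative initial string for an entry is what lets a user find the same record under either reading.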

Further, the JDBC interface loads the data that need cleaning from the data source into the system, where data cleaning is executed.

Data preprocessing refers to standardizing the data record format: according to predefined rules, the corresponding fields in the data records are converted into the same format.
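Preprocessing of this kind can be sketched in Python. The field names, the list of accepted date formats, and the choice of ISO 8601 as the common target format are assumptions; the text only requires that corresponding fields end up in the same format according to predefined rules.

```python
from datetime import datetime
from typing import Dict

DATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%d.%m.%Y")  # assumed input variants


def normalize_date(text: str) -> str:
    """Convert a date in any known input format to ISO 8601 (YYYY-MM-DD)."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")


def preprocess(record: Dict[str, str]) -> Dict[str, str]:
    """Apply the predefined rules field by field."""
    return {
        "name": " ".join(record["name"].split()).title(),  # collapse spaces, fix case
        "date": normalize_date(record["date"]),
    }


clean = preprocess({"name": "  zhang   wei ", "date": "2019/08/12"})
```

Records that match none of the known formats raise an error instead of being silently passed through, so they can be routed to manual review.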

Outlier handling uses fuzzy set theory to imitate the way a person detects abnormal values and, assisted by an algorithm library, a rule base, and a data-cleaning log, completes the relevant operations on outliers.

Further, the artificial-intelligence-based outlier cleaning method for data mining specifically includes:

Step 1: the data retrieval module uses a retrieval program to retrieve data information from the database.

Step 2: the main control module, through the classification module, uses a classification program to classify the retrieved data.

Step 3: the correlation analysis module uses an analysis program to perform correlation analysis on the retrieved data. The feature recognition module uses a recognition program to identify the features of the retrieved data.

Step 4: the outlier judgment module uses a judgment program to judge the outliers in the data according to the recognized features. The cleaning module uses a cleaning program to clean the outliers from the data.

Step 5: the cloud storage module uses a cloud server to store the cleaned data.

Step 6: the display module uses a display to show the retrieved data information and the data processing results.

Further, the analysis method of the correlation analysis module includes:

(1) The analysis program calculates the Pearson correlation coefficient and the Spearman correlation coefficient between an independent variable and a dependent variable from the parameter values of the independent variable and of the dependent variable; the independent variable and the dependent variable have a correspondence.

(2) A correlation parameter between the independent variable and the dependent variable is determined from the Pearson correlation coefficient and the Spearman correlation coefficient. The correlation parameter is greater than or equal to a first value and less than or equal to a second value. If the Pearson and Spearman correlation coefficients are unequal, the first value is the smaller of the two and the second value is the larger of the two. If they are equal, the first value and the second value both equal that common coefficient.

Further, determining the correlation parameter between the independent variable and the dependent variable from the Pearson correlation coefficient and the Spearman correlation coefficient comprises:

Multiplying the Pearson correlation coefficient by the Spearman correlation coefficient to obtain a third value.

Adding the Pearson correlation coefficient and the Spearman correlation coefficient to obtain a fourth value.

Dividing the third value by the fourth value and multiplying the result by 2 to obtain a fifth value.

Determining that the correlation parameter between the independent variable and the dependent variable is the fifth value.

Determining the correlation parameter between the independent variable and the dependent variable from the Pearson correlation coefficient and the Spearman correlation coefficient comprises:

When the absolute value of the difference between the Pearson correlation coefficient and the Spearman correlation coefficient is greater than a first threshold, determining that the correlation parameter between the independent variable and the dependent variable is the second value;

When the absolute value of the difference between the Pearson correlation coefficient and the Spearman correlation coefficient is less than or equal to the first threshold, determining that the correlation parameter between the independent variable and the dependent variable is the fifth value.
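The correlation-parameter rule can be sketched in plain Python. The third, fourth, and fifth values follow the description above (the fifth value, 2·P·S/(P + S), is the harmonic mean of the two coefficients). The helper names, the tie handling by average ranks in the Spearman computation, and the guard against a zero sum are assumptions added for the sketch. Note that the harmonic mean lies between the two coefficients only when they have the same sign.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def correlation_parameter(p: float, s: float, first_threshold: float) -> float:
    """Second value if the coefficients disagree strongly, else the fifth value."""
    if abs(p - s) > first_threshold:
        return max(p, s)              # second value: the larger coefficient
    third, fourth = p * s, p + s
    if fourth == 0:
        raise ValueError("fifth value is undefined when P + S equals 0")
    return third / fourth * 2         # fifth value


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 4.0, 9.0, 16.0, 25.0]       # monotone but nonlinear in x
p, s = pearson(x, y), spearman(x, y)
r = correlation_parameter(p, s, first_threshold=0.1)
```

For this monotone but nonlinear sample the Spearman coefficient is essentially 1 while the Pearson coefficient is slightly lower, and the resulting parameter falls between the two, as the description requires.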

Another object of the present invention is to provide an artificial-intelligence-based outlier detection system for data mining, which includes:

A data retrieval module, connected to the main control module, for retrieving data information from the database by means of a retrieval program.

A main control module, connected to the data retrieval module, classification module, correlation analysis module, feature recognition module, outlier judgment module, cleaning module, cloud storage module, and display module, for controlling the normal operation of each module by means of a central processing unit.

A classification module, connected to the main control module, for classifying the retrieved data by means of a classification program.

A correlation analysis module, connected to the main control module, for performing correlation analysis on the retrieved data by means of an analysis program.

A feature recognition module, connected to the main control module, for identifying the features of the retrieved data by means of a recognition program.

An outlier judgment module, connected to the main control module, for judging the outliers in the data according to the recognized features by means of a judgment program.

A cleaning module, connected to the main control module, for cleaning the outliers from the data by means of a cleaning program.

A cloud storage module, connected to the main control module, for storing the cleaned data on a cloud server.

A display module, connected to the main control module, for showing the retrieved data information and the data processing results on a display.

The advantages and positive effects of the present invention are as follows:

Through the correlation analysis module, the invention calculates the Pearson correlation coefficient and the Spearman correlation coefficient between the two data sets of the independent variable and the dependent variable, and then determines from these two coefficients a new correlation parameter that characterizes the correlation between the independent variable and the dependent variable. The value of this correlation parameter lies between the Pearson correlation coefficient and the Spearman correlation coefficient. Because the correlation is characterized by the correlation parameter, there is no need to choose between the Pearson and Spearman correlation coefficients, and the correlation between the data can be determined even when it is not known what kind of association the analyzed data have. Meanwhile, to address the problem that erroneous data in a data source make the source harder to clean, reduce data quality, and impair data mining, the cleaning module introduces the concept of outliers. It exploits the fact that data errors often appear as outliers: by detecting outliers and combining domain knowledge or stored metadata, the corresponding erroneous data are found and removed, which achieves the purpose of data cleaning and improves the data quality of the data source.

The invention solves the problem that, when data correlation analysis is performed during outlier detection in existing data mining and it is not known what kind of association the analyzed data have, one cannot reliably choose between the Pearson correlation coefficient and the Spearman correlation coefficient. It also solves the problem of how, after outliers are detected, domain knowledge or stored metadata can be used to find the corresponding erroneous data.

The invention clusters the outlier-information detection nodes according to the node data at a given common moment, trains a hyper-ellipsoid for each resulting cluster and calculates each axis length of the hyper-ellipsoid accordingly, linearly reduces the dimension of the outlier information data using the axis-length proportionality coefficients as coefficients, and fits the reduced data to a data curve that serves as the test curve. The same dimensionality reduction and curve fitting are applied to the data of the same time period at the next moment, and the fitted curve serves as the detection curve. The trend and curve similarity of the test curve and the detection curve are compared to detect whether the outlier information data collected by the nodes contain abnormal data, and any abnormal data found are taken as outlier data.

Further, after the outlier data are obtained, the invention performs data matching and then data cleaning. The data matching method includes: Step 1: a keyword is entered in the relevant application, and at the same time a fuzzy data query is run against the back-end Oracle database to search for matching information. Step 2: if the query matches, that is, if corresponding data information is found during the back-end search, a prompt is given and the relevant information is displayed for the user to select and use; if several identical records appear during the query, the record with the most complete basic data information is taken. Step 3: if the query does not match, that is, if no corresponding data information is found during the back-end query, a prompt is given asking the user to re-enter and save the data. Accurate data can thus be obtained, and the security of the data is ensured.

In performing outlier detection and identification on the preprocessed data, when a new application needs to be deployed in the data network, the invention formulates application rules through the feature recognition module. The rules are then subjected to outlier-data detection, in which the rules of the new application are checked against the rules of applications already deployed. If no outlier data are found, the new application is deployed directly. If outlier data are found, they are eliminated from the rules: the priority of each rule is obtained according to a priority criterion, and the outlier data are eliminated according to priority. The rules from which outlier data have been eliminated are configured into the data network, so that accurate outlier data can be obtained.

Brief description of the drawings

Fig. 1 is a flow chart of the artificial-intelligence-based outlier detection method for data mining provided by an embodiment of the present invention.

Fig. 2 is a structural diagram of the artificial-intelligence-based outlier detection system for data mining provided by an embodiment of the present invention.

In the figures: 1, data retrieval module; 2, main control module; 3, classification module; 4, correlation analysis module; 5, feature recognition module; 6, outlier judgment module; 7, cleaning module; 8, cloud storage module; 9, display module.

Detailed description of the embodiments

For a further understanding of the content, features, and effects of the present invention, the following embodiments are given and described in detail with reference to the accompanying drawings.

The present invention is explained in detail below with reference to the accompanying drawings.

As shown in Fig. 1, the artificial-intelligence-based outlier detection method for data mining provided by the present invention includes the following steps:

S101: the data retrieval module uses a retrieval program to retrieve data information from the database.

S102: the main control module, through the classification module, uses a classification program to classify the retrieved data.

S103: the correlation analysis module uses an analysis program to perform correlation analysis on the retrieved data. The feature recognition module uses a recognition program to identify the features of the retrieved data.

S104: the outlier judgment module uses a judgment program to judge the outliers in the data according to the recognized features. The cleaning module uses a cleaning program to clean the outliers from the data.

S105: the cloud storage module uses a cloud server to store the cleaned data.

S106: the display module uses a display to show the retrieved data information and the data processing results.

As shown in Fig. 2, the artificial-intelligence-based outlier detection system for data mining provided by an embodiment of the present invention includes: a data retrieval module 1, a main control module 2, a classification module 3, a correlation analysis module 4, a feature recognition module 5, an outlier judgment module 6, a cleaning module 7, a cloud storage module 8, and a display module 9.

The data retrieval module 1 is connected to the main control module 2 and retrieves data information from the database by means of a retrieval program.

The main control module 2 is connected to the data retrieval module 1, classification module 3, correlation analysis module 4, feature recognition module 5, outlier judgment module 6, cleaning module 7, cloud storage module 8, and display module 9, and controls the normal operation of each module by means of a central processing unit.

The classification module 3 is connected to the main control module 2 and classifies the retrieved data by means of a classification program.

The correlation analysis module 4 is connected to the main control module 2 and performs correlation analysis on the retrieved data by means of an analysis program.

The feature recognition module 5 is connected to the main control module 2 and identifies the features of the retrieved data by means of a recognition program.

The outlier judgment module 6 is connected to the main control module 2 and judges the outliers in the data according to the recognized features by means of a judgment program.

The cleaning module 7 is connected to the main control module 2 and cleans the outliers from the data by means of a cleaning program.

The cloud storage module 8 is connected to the main control module 2 and stores the cleaned data on a cloud server.

The display module 9 is connected to the main control module 2 and shows the retrieved data information and the data processing results on a display.

The invention is further described below with reference to specific embodiments.

Embodiment 1

The analysis method of the correlation analysis module 4 provided by the present invention is as follows:

Determining the correlation parameter between the independent variable and the dependent variable from the Pearson correlation coefficient and the Spearman correlation coefficient comprises:

When the absolute value of the difference between the Pearson correlation coefficient and the Spearman correlation coefficient is greater than a first threshold, determining that the correlation parameter between the independent variable and the dependent variable is the second value.

Embodiment 2

The cleaning method of the cleaning module 7 provided by the present invention is as follows:

1) Load the data to be cleaned through the JDBC interface.

2) Preprocess the data.

3) Perform outlier detection, identification, and processing on the data.

4) Export the result data to a new data source through the JDBC interface.

In step 1), JDBC is the abbreviation of Java DataBase Connectivity, that is, Java database connectivity; this interface loads the data that need cleaning from the data source into the system, where data cleaning is executed.

In step 2), data preprocessing refers to standardizing the data record format: according to predefined rules, the corresponding fields in the data records are converted into the same format.

In step 3), fuzzy set theory is used to imitate the way a person detects abnormal values and, assisted by an algorithm library, a rule base, and a data-cleaning log, the relevant operations on outliers are completed.

Embodiment 3

The artificial-intelligence-based outlier cleaning method for data mining provided by an embodiment of the present invention includes:

The method for obtaining the outlier data specifically includes:

S1: select test data.

S2: cluster the nodes of the selected test data.

S6: select detection data.

S7: process the detection data.

The specific process of step S2 is:

The nodes are clustered according to the data of each node at the same time point, and the permitted radius of the data in each dimension is calculated from the selected node data.

When the merging condition holds for all k, clusters C_i and C_j can be merged into one cluster, whose cluster radius is CR = [MIN({min_i, min_j}), MAX({max_i, max_j})].

The specific process of step S3 is:

The specific process of S7 is: apply the dimensionality reduction and curve fitting of steps S4 and S5 to the selected detection data to obtain the detection curve g(x).

|f(x) - g(x)| < c

or satisfy

After the outlier data are obtained, data matching is performed and then data cleaning is carried out. The data matching method includes:

Step 1: a keyword is entered in the relevant application, and at the same time a fuzzy data query is run against the back-end Oracle database to search for matching information.

Step 2: if the query matches, that is, if corresponding data information is found during the back-end search, a prompt is given and the relevant information is displayed for the user to select and use. If several identical records appear during the query, the record with the most complete basic data information is taken.

Embodiment 4

The method for performing isolated point detection and identification on the preprocessed data provided by this embodiment of the present invention includes:

It specifically includes:

Step 7): the rules with isolated point data are eliminated.

The above are only preferred embodiments of the present invention and do not limit the present invention in any form. Any simple modification, equivalent variation, or alteration made to the above embodiments in accordance with the technical essence of the present invention falls within the scope of the technical solution of the present invention.

Claims

1. An isolated point cleaning method in data mining based on artificial intelligence, characterized in that the isolated point cleaning method in data mining based on artificial intelligence includes:

clustering the isolated point information detection nodes according to the data of those nodes at a given common time; after clustering, training a hyperellipsoid for each cluster and calculating each axial length of the hyperellipsoid accordingly; performing linear dimensionality reduction on the isolated point information data using the axial length proportionality coefficients as coefficients; and fitting the dimension-reduced data to a data curve, which serves as the test curve;

performing the same dimensionality reduction and curve fitting on the data of the same time period at a subsequent time, the fitted curve serving as the detection curve;

comparing the trend and the curve similarity of the test curve and the detection curve to detect whether abnormal data exists in the isolated point information data collected by the nodes, and taking any abnormal data that exists as isolated point data;

importing the acquired isolated point data through a JDBC interface as data to be cleaned;

preprocessing the data to be cleaned;

performing isolated point detection and identification on the preprocessed data;

exporting the processed result data to a new data source through the JDBC interface.

2. The isolated point cleaning method in data mining based on artificial intelligence according to claim 1, characterized in that the method for obtaining isolated point data specifically includes:

S1: choosing test data;

S2: performing node clustering on the chosen test data;

S3: for each resulting cluster, training a hyperellipsoid that just contains all nodes in the cluster, and calculating the axial lengths of the corresponding hyperellipsoid;

S4: performing data dimensionality reduction according to the axial lengths of each hyperellipsoid;

S5: performing the corresponding curve fitting on the data reduced according to the axial lengths of each hyperellipsoid;

S6: choosing detection data;

S7: processing the detection data.

3. The isolated point cleaning method in data mining based on artificial intelligence according to claim 2, characterized in that the detailed process of step S5 is: curve fitting is performed on the dimension-reduced data in a two-dimensional plane; ten groups of data are fitted to an eighth-order smooth nonlinear function curve, its starting point is translated to the origin, and the translated curve serves as the test curve f(x);

the detailed process of step S7 is: data dimensionality reduction and curve fitting are performed on the selected test data according to the methods of steps S4 and S5 to obtain the detection curve g(x);

|f(x) - g(x)| < c

Or meet

4. The isolated point cleaning method in data mining based on artificial intelligence according to claim 1, characterized in that the method for performing isolated point detection and identification on the preprocessed data includes:

when a new application needs to be deployed in the data network, application rules are formulated by the feature recognition module; isolated point data detection is then performed on the rules, checking the rules of the application to be deployed against the rules of the applications already deployed; if there is no isolated point data, the new application is deployed directly; if there is isolated point data, the rules with isolated point data are eliminated: the priority of each rule is obtained according to a priority judgment criterion, and the isolated point data is eliminated according to priority; the rules from which isolated point data has been eliminated are configured into the data network;

It specifically includes:

Step 1, when a new application request arrives, the rules generated by the application undergo data model conversion, that is, each rule is divided into a spatial domain S and an action domain A; the rules are then forwarded to the isolated point data detection module, which judges whether the rules of the new application itself belong to an application type that can produce isolated point data; if so, step 2 is executed, otherwise step 3 is executed;

Step 2, one undetected rule is taken from the rules of the new application, one undetected rule is taken from the existing rules of this application in the data network, and step 4 is executed; if all rules have been detected, step 3 is executed;

Step 3, one undetected rule is taken from the rules of the new application, one undetected rule is taken from the rules of the other deployed applications, and step 4 is executed; if all rules have been detected, step 8 is executed;

Step 4, the spatial domains of the two rules are denoted Sn and So, the action domains An and Ao, and the priorities Pn and Po; the spatial domains are then separated to generate four new rules R₁, R₂, R₃, R₄, whose spatial domains are respectively S₁ = Sn - So, S₂ = So - Sn, S₃ = So ∩ Sn, S₄ = Sn ∩ So, and whose action domains are respectively A₁ = An, A₂ = Ao, A₃ = Ao, A₄ = An;

Step 5, the content obtained after the spatial separation is checked; if S₃ and S₄ are not empty sets and the actions corresponding to A₃ and A₄ differ, it is judged that isolated point data exists and step 7 is executed; otherwise it is judged that there is no isolated point data and step 6 is executed;

Step 6, it is determined whether this step was reached from step 2; if so, return to step 2, otherwise return to step 3;

Step 7, the rules with isolated point data are eliminated;

Step 8, the rules without isolated point data are configured into the data network to deploy the new application;

Step 1: a keyword is entered in the relevant application, and a fuzzy data query is run at the same time against the back-end Oracle database to search for matching information;

Step 2: if the query matches, that is, corresponding data information is found during the back-end search, a prompt is given and the relevant information is displayed for the user to select and use; if several identical records appear during the query, the record with the most complete basic data information is taken;

Step 3: if the query does not match, that is, no corresponding data information is found during the back-end query, a prompt is given asking the user to re-enter and save the data.
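Steps 4 and 5 of claim 4 (the spatial separation of two rules and the subsequent check) amount to set operations on the two spatial domains. The following Python sketch is illustrative only. It assumes that a spatial domain can be modelled as a finite set of integers and an action as a string. The names `Rule`, `split_rules`, and `has_isolated_point_data` are hypothetical. The priority field is carried along but not used, because the claim does not detail the priority judgment criterion.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class Rule:
    space: FrozenSet[int]  # spatial domain S, modelled as a finite set (assumption)
    action: str            # action domain A
    priority: int          # priority P, unused in this sketch


def split_rules(new: Rule, old: Rule) -> List[Tuple[FrozenSet[int], str]]:
    """Build (S1, A1) .. (S4, A4) from the new rule (Sn, An) and the old rule (So, Ao)."""
    s1 = new.space - old.space   # S1 = Sn - So
    s2 = old.space - new.space   # S2 = So - Sn
    s3 = old.space & new.space   # S3 = So ∩ Sn
    s4 = new.space & old.space   # S4 = Sn ∩ So
    return [(s1, new.action), (s2, old.action), (s3, old.action), (s4, new.action)]


def has_isolated_point_data(new: Rule, old: Rule) -> bool:
    """Step 5: S3 and S4 are non-empty and the actions A3 and A4 differ."""
    _, _, (s3, a3), (s4, a4) = split_rules(new, old)
    return bool(s3) and bool(s4) and a3 != a4


new_rule = Rule(frozenset({1, 2, 3}), "forward", priority=2)
old_rule = Rule(frozenset({3, 4}), "drop", priority=1)
print(has_isolated_point_data(new_rule, old_rule))  # True: both cover 3 with different actions
```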

5. The isolated point cleaning method in data mining based on artificial intelligence according to claim 4, characterized in that the fuzzy query method includes the following steps:

Step I, the initials of the isolated point information are edited and saved, where some isolated point information corresponds to two or more initials;

Step II, the mapping relations between the isolated point information and the initials are established;

Step III, the database table structure is established according to the search fields;

Step IV, when the user edits and saves information, the initial-letter superset corresponding to the fields contained in the information is obtained according to the mapping relations, and the mapping relations between the fields and the initial-letter superset are recorded in the database;

Step V, the user inputs the initials of the query field;

Step VI, the isolated point information corresponding to the query field is obtained according to the mapping relations between the fields and the initial-letter superset, and is displayed.
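The initial-based lookup of claim 5 can be sketched as an index from initials to information entries. The Python below is a minimal illustration: a dictionary stands in for the database table, the initials of each entry are supplied by hand as in step I, and the names `build_index` and `query_by_initials` and the sample entries are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def build_index(initials_of: Dict[str, Iterable[str]]) -> Dict[str, Set[str]]:
    """Steps II and IV: map every saved initial string to the entries it denotes."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for info, initials in initials_of.items():
        for initial in initials:
            index[initial.lower()].add(info)
    return dict(index)


def query_by_initials(index: Dict[str, Set[str]], typed: str) -> List[str]:
    """Steps V and VI: return the entries recorded under the typed initials."""
    return sorted(index.get(typed.lower(), set()))


# Step I: some entries are saved under two or more initial strings.
initials_of = {
    "node 12 voltage spike": ["nvs", "n12vs"],
    "node 3 voltage sag": ["nvs"],
}
index = build_index(initials_of)
print(query_by_initials(index, "NVS"))
# ['node 12 voltage spike', 'node 3 voltage sag']
```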

6. The isolated point cleaning method in data mining based on artificial intelligence according to claim 1, characterized in that the JDBC interface imports the data to be cleaned from the data source into the system, where data cleaning is executed;

data preprocessing refers to standardizing the data record format: according to predefined rules, the corresponding fields in the data records are converted into the same format;

a method that uses fuzzy set theory to imitate manual detection of abnormal values, assisted by an algorithm library, a rule base, and a data cleaning log, completes the relevant operations on isolated points.
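The data preprocessing of claim 6, in which predefined rules convert corresponding fields to one format, can be illustrated with a small rule table. The Python sketch below is hypothetical throughout: the field names, the two rules, and the names `normalize_date`, `RULES`, and `standardize` are assumptions made for the example and are not part of the claim.

```python
from typing import Callable, Dict


def normalize_date(value: str) -> str:
    """Convert 'YYYY/M/D' or 'YYYY.M.D' to 'YYYY-MM-DD'."""
    year, month, day = value.replace("/", "-").replace(".", "-").split("-")
    return f"{int(year):04d}-{int(month):02d}-{int(day):02d}"


def normalize_name(value: str) -> str:
    """Collapse repeated whitespace and use title case."""
    return " ".join(value.split()).title()


# Predefined rules: one converter per field (hypothetical fields).
RULES: Dict[str, Callable[[str], str]] = {
    "date": normalize_date,
    "name": normalize_name,
}


def standardize(record: Dict[str, str], rules: Dict[str, Callable[[str], str]]) -> Dict[str, str]:
    """Apply the matching rule to each field; fields without a rule are kept as-is."""
    return {
        field: rules[field](value) if field in rules else value
        for field, value in record.items()
    }


raw = {"id": "7", "date": "2019/8/12", "name": "  node   a "}
clean = standardize(raw, RULES)
print(clean)  # {'id': '7', 'date': '2019-08-12', 'name': 'Node A'}
```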

7. The isolated point cleaning method in data mining based on artificial intelligence according to claim 1, characterized in that the isolated point cleaning method in data mining based on artificial intelligence specifically includes:

Step 1, the data retrieval module retrieves data information from the database using a retrieval program;

Step 2, the main control module classifies the retrieved data through the classification module using a classification program;

Step 3, the correlation analysis module performs correlation analysis on the retrieved data using an analysis program; the feature recognition module identifies the features of the retrieved data using a recognition program;

Step 4, the isolated point judgment module judges the isolated points of the data according to the identified features using a judgment program; the cleaning module cleans the isolated points of the data using a cleaning program;

Step 5, the cloud storage module stores the cleaned data on a cloud server.

8. The isolated point cleaning method in data mining based on artificial intelligence according to claim 7, characterized in that the analysis method of the correlation analysis module includes:

(1) calculating, by the analysis program, the Pearson correlation coefficient and the Spearman correlation coefficient between an independent variable and a dependent variable from the parameter values of the independent variable and the parameter values of the dependent variable, the independent variable and the dependent variable having a corresponding relationship;

(2) determining, according to the Pearson correlation coefficient and the Spearman correlation coefficient, a correlation parameter between the independent variable and the dependent variable, the correlation parameter being greater than or equal to a first value and less than or equal to a second value; if the Pearson correlation coefficient and the Spearman correlation coefficient are unequal, the first value is the smaller of the two coefficients and the second value is the larger of the two; if the Pearson correlation coefficient and the Spearman correlation coefficient are equal, the first value and the second value are both the Pearson correlation coefficient or the Spearman correlation coefficient.

9. The isolated point cleaning method in data mining based on artificial intelligence according to claim 8, characterized in that determining the correlation parameter between the independent variable and the dependent variable according to the Pearson correlation coefficient and the Spearman correlation coefficient includes:

multiplying the Pearson correlation coefficient by the Spearman correlation coefficient to obtain a third value;

adding the Pearson correlation coefficient to the Spearman correlation coefficient to obtain a fourth value;

dividing the third value by the fourth value and multiplying the result by 2 to obtain a fifth value;

determining that the correlation parameter between the independent variable and the dependent variable is the fifth value;

determining the correlation parameter between the independent variable and the dependent variable according to the Pearson correlation coefficient and the Spearman correlation coefficient includes:

when the absolute value of the difference between the Pearson correlation coefficient and the Spearman correlation coefficient is greater than a first threshold, determining that the correlation parameter between the independent variable and the dependent variable is the second value;

when the absolute value of the difference between the Pearson correlation coefficient and the Spearman correlation coefficient is less than or equal to the first threshold, determining that the correlation parameter between the independent variable and the dependent variable is the fifth value.
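The calculation in claims 8 and 9 can be illustrated with a short Python sketch. Writing P for the Pearson coefficient and S for the Spearman coefficient, the fifth value is 2PS / (P + S), the harmonic mean of the two; for coefficients of the same sign it lies between them, consistent with the bounds in claim 8. The function names, the tie handling by average ranks, and the threshold value are assumptions of this sketch. The case P + S = 0 is not addressed by the claims and is rejected here, and the inputs are assumed not to be constant.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long, non-constant samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman coefficient computed as the Pearson coefficient of the ranks."""
    return pearson(ranks(x), ranks(y))


def correlation_parameter(p: float, s: float, first_threshold: float) -> float:
    """Select the correlation parameter following the threshold rule of claim 9."""
    if p == s:
        return p                    # first value == second value
    if abs(p - s) > first_threshold:
        return max(p, s)            # the second value (the larger coefficient)
    if p + s == 0:
        raise ValueError("fifth value is undefined when P + S = 0")
    return 2 * (p * s) / (p + s)    # the fifth value


x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]               # monotonic but not linear in x
p, s = pearson(x, y), spearman(x, y)
param = correlation_parameter(p, s, first_threshold=0.1)
print(round(p, 4), round(s, 4), round(param, 4))
```

In this example the Spearman coefficient is 1 because y increases monotonically with x, while the Pearson coefficient is slightly lower; the two differ by less than the assumed threshold, so the fifth value is selected.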

10. A data mining outlier detection system based on artificial intelligence that implements the isolated point cleaning method in data mining based on artificial intelligence according to claim 1, characterized in that the data mining outlier detection system based on artificial intelligence includes:

a data retrieval module, connected to the main control module, for retrieving data information from the database through a retrieval program;

a main control module, connected to the data retrieval module, classification module, correlation analysis module, feature recognition module, isolated point judgment module, cleaning module, cloud storage module, and display module, for controlling the normal operation of each module through a central processing unit;

a classification module, connected to the main control module, for classifying the retrieved data through a classification program;

a correlation analysis module, connected to the main control module, for performing correlation analysis on the retrieved data through an analysis program;

a feature recognition module, connected to the main control module, for identifying the features of the retrieved data through a recognition program;

an isolated point judgment module, connected to the main control module, for judging the isolated points of the data according to the identified features through a judgment program;

a cleaning module, connected to the main control module, for cleaning the isolated points of the data through a cleaning program;

a cloud storage module, connected to the main control module, for storing the cleaned data on a cloud server;

a display module, connected to the main control module, for displaying the retrieved data information and the data processing results on a display.