CN109101661A - Method and device for detecting abnormal points in a data sample set

Info

Publication number: CN109101661A
Application number: CN201811069817.9A
Authority: CN
Inventors: 肖迪
Original assignee: Neusoft Corp
Current assignee: Neusoft Corp
Priority date: 2018-09-13
Filing date: 2018-09-13
Publication date: 2018-12-28

Abstract

The invention discloses a method and a device for detecting abnormal points in a data sample set. A quantitative calculation is performed on each data sample in the data sample set to obtain a target anomaly value for that data sample, and whether each data sample is an abnormal point in the data sample set is determined according to its target anomaly value. The target anomaly value of each data sample is determined according to the spatial distances between that data sample and the other data samples, and/or according to the target cluster in which the data sample is located in a target cluster analysis result for the data sample set. This effectively addresses the problem that abnormal points detected solely by the subjective judgment of technical staff are not accurate enough: by quantifying the target anomaly value of each data sample, the abnormal points in the data sample set can be determined accurately, which makes the results of subsequent data analysis on the data sample set more reliable and effective.

Description

Method and device for detecting abnormal points in a data sample set

Technical field

The present invention relates to the technical field of data processing, and in particular to a method and a device for detecting abnormal points in a data sample set.

Background art

In a data sample set, the features of some data samples differ significantly from the features of most other data samples; such data samples are the abnormal points in the data sample set. The preprocessing performed before data analysis usually needs to detect abnormal points in the data sample set, and the accuracy of abnormal point detection has a very important influence on the accuracy of the data analysis result.

At present, abnormal point detection is largely realized through the subjective judgment of technical staff. For example, the data samples in a data sample set are displayed as an image by means of data visualization, and the technical staff rely on their own subjective experience and knowledge to judge the feature differences between the data samples in the image, thereby identifying the abnormal points in the data sample set. However, when the feature differences between data samples are determined mainly by subjective judgment, it is difficult to guarantee that they are assessed objectively and scientifically, and therefore difficult to guarantee the accuracy of abnormal point detection.

Summary of the invention

The technical problem to be solved by the present invention is to provide a method and a device for detecting abnormal points in a data sample set, so that the feature difference of each data sample relative to the other data samples in the data sample set can be quantified objectively and scientifically, and the abnormal points in the data sample set can therefore be detected accurately.

In a first aspect, an embodiment of the present invention provides a method for detecting abnormal points in a data sample set, comprising:

Obtaining a target data sample in the data sample set;

Determining a target anomaly value of the target data sample according to the spatial distances between the target data sample and other data samples, and/or according to the target cluster in which the target data sample is located in a target cluster analysis result for the data sample set; wherein the other data samples are the data samples in the data sample set other than the target data sample;

Determining, according to the target anomaly value of the target data sample, whether the target data sample is an abnormal point in the data sample set.

Optionally, determining the target anomaly value of the target data sample according to the spatial distances between the target data sample and the other data samples and/or the target cluster in which the target data sample is located in the target cluster analysis result for the data sample set comprises:

Calculating the spatial distance between the target data sample and each of the other data samples, and calculating a first anomaly value of the target data sample according to those spatial distances;

Performing cluster analysis on the data sample set to obtain the target cluster analysis result, and calculating a second anomaly value of the target data sample according to the target cluster in which the target data sample is located in the target cluster analysis result;

Fusing, according to a fusion weight, the first anomaly value and the second anomaly value of the target data sample into the target anomaly value of the target data sample.

Optionally, calculating the first anomaly value of the target data sample according to the spatial distances between the target data sample and each of the other data samples comprises:

Determining, from the spatial distance between the target data sample and each of the other data samples, a superposition distance between the target data sample and that other data sample;

Summing the superposition distances between the target data sample and each of the other data samples to obtain the first anomaly value of the target data sample.

Optionally, the superposition distance is specifically a nonlinear distance obtained by applying a nonlinear mapping to the spatial distance.

Optionally, the spatial distance is specifically a cubic Minkowski distance (a Minkowski distance of order 3).

Optionally, performing cluster analysis on the data sample set to obtain the target cluster analysis result comprises:

Selecting, from the data sample set, data samples to serve as initial cluster centers;

Taking the initial cluster centers as current cluster centers, and performing cluster analysis on the data sample set using the current cluster centers to obtain a current cluster analysis result;

If an iteration stop condition is not satisfied, redetermining the current cluster centers according to the current clusters in the current cluster analysis result, and then returning to the step of performing cluster analysis on the data sample set using the current cluster centers;

If the iteration stop condition is satisfied, determining the current cluster analysis result as the target cluster analysis result.

Optionally, the first initial cluster center selected from the data sample set is the data sample with the smallest first anomaly value in the data sample set.

Optionally, redetermining the current cluster center according to a current cluster in the current cluster analysis result comprises:

Calculating, for the current cluster, the first anomaly value of each data sample in the current cluster;

Calculating the cluster center of the current cluster based on retained data samples in the current cluster; wherein the first anomaly value of every retained data sample is smaller than the first anomaly value of every data sample in the current cluster that is not a retained data sample;

Redetermining the current cluster center so that the cluster center of the current cluster becomes the current cluster center.
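The center update described above can be sketched as follows. This is only an illustrative sketch: the text does not say how many data samples are retained, so the parameter `n_keep` is a hypothetical choice, the function name `trimmed_cluster_center` is invented for illustration, and the Euclidean distance is assumed for the first anomaly value.

```python
import math


def trimmed_cluster_center(cluster, n_keep):
    """Mean of the n_keep samples with the smallest first anomaly values.

    `cluster` is a list of equal-length numeric vectors. `n_keep` is a
    hypothetical parameter: the source does not specify how many samples
    are retained.
    """
    if not 1 <= n_keep <= len(cluster):
        raise ValueError("n_keep must be between 1 and the cluster size")
    # First anomaly value inside the cluster: sum of distances to all members
    # (the distance of a sample to itself is 0 and does not change the sum).
    scores = [sum(math.dist(s, t) for t in cluster) for s in cluster]
    order = sorted(range(len(cluster)), key=lambda i: scores[i])
    retained = [cluster[i] for i in order[:n_keep]]
    # Component-wise mean of the retained samples.
    return [sum(column) / len(retained) for column in zip(*retained)]


# Hypothetical one-dimensional cluster containing one distant sample.
cluster = [[0.0], [1.0], [2.0], [3.0], [10.0]]
print(trimmed_cluster_center(cluster, n_keep=4))  # [1.5]
print(trimmed_cluster_center(cluster, n_keep=5))  # [3.2], the ordinary mean
```

Dropping the samples with the largest first anomaly values keeps the distant sample from pulling the center away from the dense part of the cluster.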

Optionally, calculating the second anomaly value of the target data sample according to the target cluster in which the target data sample is located in the target cluster analysis result comprises:

Determining, for the target cluster, the number of data samples in the target cluster and the spatial distance between the target data sample and the cluster center of the target cluster;

Calculating the second anomaly value of the target data sample according to the number of data samples in the target cluster and the spatial distance between the target data sample and the cluster center of the target cluster.

In a second aspect, an embodiment of the present invention further provides a device for detecting abnormal points in a data sample set, comprising:

An obtaining module, configured to obtain a target data sample in the data sample set;

A first determining module, configured to determine a target anomaly value of the target data sample according to the spatial distances between the target data sample and other data samples, and/or according to the target cluster in which the target data sample is located in a target cluster analysis result for the data sample set; wherein the other data samples are the data samples in the data sample set other than the target data sample;

A second determining module, configured to determine, according to the target anomaly value of the target data sample, whether the target data sample is an abnormal point in the data sample set.

Optionally, the first determining module comprises:

A first calculation submodule, configured to calculate the spatial distance between the target data sample and each of the other data samples, and to calculate a first anomaly value of the target data sample according to those spatial distances;

A second calculation submodule, configured to perform cluster analysis on the data sample set to obtain the target cluster analysis result, and to calculate a second anomaly value of the target data sample according to the target cluster in which the target data sample is located in the target cluster analysis result;

A fusion submodule, configured to fuse, according to a fusion weight, the first anomaly value and the second anomaly value of the target data sample into the target anomaly value of the target data sample.

Optionally, the first calculation submodule comprises:

A first determination unit, configured to determine, from the spatial distance between the target data sample and each of the other data samples, a superposition distance between the target data sample and that other data sample;

A superposition unit, configured to sum the superposition distances between the target data sample and each of the other data samples to obtain the first anomaly value of the target data sample.

Optionally, the spatial distance is specifically a cubic Minkowski distance (a Minkowski distance of order 3).

Optionally, the second calculation submodule comprises:

A selection unit, configured to select, from the data sample set, data samples to serve as initial cluster centers;

A clustering unit, configured to take the initial cluster centers as current cluster centers and perform cluster analysis on the data sample set using the current cluster centers to obtain a current cluster analysis result;

A second determination unit, configured to, if an iteration stop condition is not satisfied, redetermine the current cluster centers according to the current clusters in the current cluster analysis result, and then return to performing cluster analysis on the data sample set using the current cluster centers;

A third determination unit, configured to, if the iteration stop condition is satisfied, determine the current cluster analysis result as the target cluster analysis result.

Optionally, the second determination unit comprises:

A first calculation subunit, configured to calculate, for the current cluster, the first anomaly value of each data sample in the current cluster;

A second calculation subunit, configured to calculate the cluster center of the current cluster based on retained data samples in the current cluster; wherein the first anomaly value of every retained data sample is smaller than the first anomaly value of every data sample in the current cluster that is not a retained data sample;

A determination subunit, configured to redetermine the current cluster center so that the cluster center of the current cluster becomes the current cluster center.

Optionally, the second calculation submodule comprises:

A fourth determination unit, configured to determine, for the target cluster, the number of data samples in the target cluster and the spatial distance between the target data sample and the cluster center of the target cluster;

A calculation unit, configured to calculate the second anomaly value of the target data sample according to the number of data samples in the target cluster and the spatial distance between the target data sample and the cluster center of the target cluster.

In a third aspect, an embodiment of the present invention further provides an apparatus for detecting abnormal points in a data sample set, the apparatus comprising a processor and a memory:

The memory is configured to store program code and to transmit the program code to the processor;

The processor is configured to execute, according to instructions in the program code, the method for detecting abnormal points in a data sample set provided in the first aspect of the present invention.

In a fourth aspect, an embodiment of the present invention further provides a storage medium configured to store program code, the program code being used to execute the method for detecting abnormal points in a data sample set provided in the first aspect of the present invention.

Compared with the prior art, the embodiments of the present invention have the following advantages:

In the embodiments of the present invention, a quantitative calculation is performed on each data sample in the data sample set to obtain the target anomaly value of each data sample, and whether each data sample is an abnormal point in the data sample set is determined according to its target anomaly value. Specifically, the target anomaly value of each data sample may be determined according to the spatial distances between that data sample and the other data samples, and/or according to the target cluster in which the data sample is located in the target cluster analysis result for the data sample set. In this way, the features of the data samples in the data sample set are quantified accurately and scientifically, which effectively addresses the problem that abnormal points detected solely by the subjective judgment of technical staff are not accurate or reliable enough. With the quantified target anomaly value of each data sample, the abnormal points in the data sample set can be determined accurately, so that the results of subsequent data analysis on the data sample set are more reliable and effective.

Brief description of the drawings

In order to explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some of the embodiments recorded in the present invention, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic flowchart of a method for detecting abnormal points in a data sample set provided by an embodiment of the present invention;

Fig. 2 is an example flowchart of one implementation of step 102 provided by an embodiment of the present invention;

Fig. 3 is an example flowchart of another implementation of step 102 provided by an embodiment of the present invention;

Fig. 4 is an example flowchart of yet another implementation of step 102 provided by an embodiment of the present invention;

Fig. 5 is a schematic structural diagram of a device for detecting abnormal points in a data sample set provided by an embodiment of the present invention;

Fig. 6 is a schematic structural diagram of a device for detecting abnormal points in a data sample set provided by an embodiment of the present invention.

Detailed description of the embodiments

In order to enable those skilled in the art to better understand the solution of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.

At present, in the preprocessing performed before data analysis, abnormal point detection on the data samples in a data sample set usually depends on the subjective judgment of technical staff. For example, the data samples in the data sample set are presented to the technical staff as an image by means of data visualization, and the technical staff then rely on their own subjective experience and knowledge to judge whether each data sample shown in the image is an abnormal point. However, when abnormal points are detected solely by subjective judgment, it is difficult to guarantee that the abnormal points in the data sample set are determined and excluded objectively and scientifically. The detection is then not accurate or reliable enough, which may seriously affect the results of subsequent data analysis on the data sample set.

To solve the above problem, an embodiment of the present invention provides a method for detecting abnormal points in a data sample set. A quantitative calculation is performed on each data sample in the data sample set to obtain the target anomaly value of each data sample. Specifically, the target anomaly value of each data sample may be determined according to the spatial distances between that data sample and the other data samples, and/or according to the target cluster in which the data sample is located in the target cluster analysis result for the data sample set. Once the target anomaly value of each data sample has been quantified, whether each data sample is an abnormal point in the data sample set can be determined according to its target anomaly value. This effectively addresses the problem that abnormal points detected solely by the subjective judgment of technical staff are not accurate or reliable enough: the target anomaly values of the data samples are quantified accurately and scientifically, the abnormal points in the data sample set can be determined accurately, and the results of subsequent data analysis on the data sample set are therefore more reliable and effective.

Various non-limiting implementations of the embodiments of the present invention are described in detail below with reference to the drawings.

Referring to Fig. 1, which is a schematic flowchart of a method for detecting abnormal points in a data sample set provided by an embodiment of the present invention. In this embodiment, the method may specifically include the following steps 101 to 103:

Step 101: obtain a target data sample in the data sample set.

It can be understood that a data sample set is a set containing multiple data samples, where each data sample can be expressed as a vector of N-dimensional data (N being a positive integer). For example, a data sample set X includes data samples x₁, x₂, …, x_i, …, where the data sample x_i can be expressed as x_i = {x_i1, x_i2, …, x_iN}.

In a specific implementation, any one data sample may be selected from all the data samples of the data sample set as the target data sample. The target data sample is the object to be detected, and the subsequent steps are performed to detect whether it is an abnormal point. It should be noted that every data sample in the data sample set may need to be treated as an object of abnormal point detection; that is, this embodiment is performed once with each data sample as the target data sample.

Step 102: determine the target anomaly value of the target data sample according to the spatial distances between the target data sample and other data samples, and/or according to the target cluster in which the target data sample is located in the target cluster analysis result for the data sample set; wherein the other data samples are the data samples in the data sample set other than the target data sample.

It can be understood that the target anomaly value measures how much the features of a given data sample differ from the features of the other data samples in the data sample set. It indicates the degree of abnormality of that data sample and is the quantified result of abnormal point detection for it. For any data sample in the data sample set, the larger its target anomaly value, the greater the difference between its features and those of the other data samples, and the more likely it is to be an abnormal point; the smaller its target anomaly value, the smaller that difference, and the less likely it is to be an abnormal point.

In a specific implementation, step 102 has three concrete implementations. In the first implementation, the target anomaly value of the target data sample is determined according to the spatial distances between the target data sample and the other data samples. As an example, this implementation may specifically include: first, calculating the spatial distance between the target data sample and each of the other data samples; then, calculating the first anomaly value of the target data sample according to those spatial distances, and using it as the target anomaly value of the target data sample.

In the second implementation, the target anomaly value of the target data sample is determined according to the target cluster in which the target data sample is located in the target cluster analysis result for the data sample set. As an example, this implementation may specifically include: first, performing cluster analysis on the data sample set to obtain the target cluster analysis result; then, calculating the second anomaly value of the target data sample according to the target cluster in which the target data sample is located in the target cluster analysis result, and using it as the target anomaly value of the target data sample.

In the third implementation, the target anomaly value of the target data sample is determined according to both the spatial distances between the target data sample and the other data samples and the target cluster in which the target data sample is located in the target cluster analysis result for the data sample set. As an example, this implementation may include: first, obtaining the first anomaly value of the target data sample in the manner shown in the example of the first implementation; then, obtaining the second anomaly value of the target data sample in the manner shown in the example of the second implementation; finally, fusing, according to a fusion weight, the first anomaly value and the second anomaly value of the target data sample into the target anomaly value of the target data sample.
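The fusion step of the third implementation can be illustrated with a minimal sketch. The form of the fusion is not given in this passage, so the convex combination below (weight `w` on the first anomaly value, `1 - w` on the second) is an assumption, and the function name `fuse_anomaly_values` is hypothetical.

```python
def fuse_anomaly_values(first_value, second_value, weight):
    """Fuse two anomaly values with a single fusion weight.

    Assumption: the fusion is a convex combination. The source only states
    that the two values are fused "according to a fusion weight".
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * first_value + (1.0 - weight) * second_value


# weight = 1.0 reduces to the first implementation (first anomaly value only),
# weight = 0.0 reduces to the second implementation (second anomaly value only).
print(fuse_anomaly_values(2.0, 4.0, 0.5))  # 3.0
```

In practice the two anomaly values may live on different scales, so some normalization before fusing may be needed; that choice is outside what the text specifies.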

It can be understood that, in this embodiment, the first anomaly value is the anomaly value calculated for any data sample in the data sample set using the first implementation above. It may be called the local anomaly value and is one quantization parameter for abnormal point detection of a data sample. The second anomaly value is the anomaly value calculated for any data sample in the data sample set using the second implementation above. It may be called the global anomaly value and is another quantization parameter for abnormal point detection of a data sample. It should be noted that the target anomaly value used to determine whether a data sample is an abnormal point may be the first anomaly value, the second anomaly value, or the anomaly value obtained by fusing the first and second anomaly values.

For the first implementation, the spatial distance is the distance between the target data sample and another data sample. In one case, the spatial distance may be the Euclidean distance. In another case, in order to make the difference between abnormal and normal data samples more evident in the spatial distance and thereby improve the accuracy of abnormal point detection, the spatial distance may be the cubic Minkowski distance (a Minkowski distance of order 3).

For example, for data samples x₁ and x₂, the Euclidean distance between data sample x₁ and data sample x₂ is calculated as:

$$d_{12} = \sqrt{\sum_{k=1}^{N} \left( x_{1k} - x_{2k} \right)^{2}} \qquad (1)$$

The cubic Minkowski distance between data sample x₁ and data sample x₂ is calculated as:

$$d_{12} = \left( \sum_{k=1}^{N} \left| x_{1k} - x_{2k} \right|^{3} \right)^{1/3} \qquad (2)$$

It can be understood that, compared with the Euclidean distance, the cubic Minkowski distance reflects the difference between abnormal and normal data samples in spatial distance more clearly, so that abnormal points can be detected more accurately.
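The two spatial distances discussed above can be sketched as one helper parameterized by the order p, with p = 2 giving the Euclidean distance and p = 3 the cubic Minkowski distance. The function names are illustrative only and do not come from the source.

```python
def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(abs(u - v) ** p for u, v in zip(a, b)) ** (1.0 / p)


def euclidean_distance(a, b):
    """Order-2 Minkowski distance."""
    return minkowski_distance(a, b, 2)


def cubic_minkowski_distance(a, b):
    """Order-3 Minkowski distance."""
    return minkowski_distance(a, b, 3)


x1 = [0.0, 0.0]
x2 = [3.0, 4.0]
print(euclidean_distance(x1, x2))        # 5.0
print(cubic_minkowski_distance(x1, x2))  # cube root of 91, roughly 4.498
```

Because the per-dimension differences are raised to the third power before summing, the largest coordinate difference carries more weight than it does under the Euclidean distance.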

In a specific implementation, as shown in Fig. 2, step 102 may specifically include:

Step 201: calculate the spatial distance between the target data sample and each of the other data samples.

Step 202: determine, from the spatial distance between the target data sample and each of the other data samples, the superposition distance between the target data sample and that other data sample.

Step 203: sum the superposition distances between the target data sample and each of the other data samples to obtain the first anomaly value of the target data sample.

Here, the superposition distance refers to a distance that is used in the superposition, not to the sum of distances obtained after the spatial distances have been superposed. Superposition distances correspond one-to-one with spatial distances: each spatial distance corresponds to one superposition distance.

As an example, the superposition distance between the target data sample and each of the other data samples may simply be the spatial distance itself. In that case, following the procedure shown in Fig. 2, the first anomaly value of the target data sample x_i may be calculated as:

$$L(x_i, X) = \sum_{j=1}^{M} d_{ij} \qquad (3)$$

Here, in one case, M may be the number of data samples in the data sample set other than the target data sample. In another case, M may be the number of all data samples in the data sample set; since the spatial distance between the target data sample and itself is 0, formula (3) gives exactly the same result in both cases.

For example, assume that the data sample set X includes data samples x₁, x₂, x₃, x₄, x₅ and that the target data sample obtained in step 101 is x₁. The spatial distances d₁₂, d₁₃, d₁₄ and d₁₅ between the target data sample x₁ and the data samples x₂, x₃, x₄, x₅ can be calculated according to formula (1) or formula (2), and formula (3) then gives L(x₁, X) = d₁₂ + d₁₃ + d₁₄ + d₁₅.
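The summation in the example above can be sketched in a few lines. The sketch assumes the Euclidean distance (via the standard-library `math.dist`) as the spatial distance, and the five one-dimensional samples are hypothetical data chosen so that the sums are easy to check by hand.

```python
import math


def first_anomaly_value(target_index, samples):
    """Sum of Euclidean distances from the target sample to every other sample."""
    target = samples[target_index]
    return sum(
        math.dist(target, other)
        for j, other in enumerate(samples)
        if j != target_index  # including j == target_index would add 0
    )


# Hypothetical data set: the last sample lies far from the others.
samples = [[0.0], [1.0], [2.0], [3.0], [10.0]]
scores = [first_anomaly_value(i, samples) for i in range(len(samples))]
print(scores)  # [16.0, 13.0, 12.0, 13.0, 34.0]
```

The distant sample receives by far the largest first anomaly value, while the sample in the middle of the dense group receives the smallest.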

As another example, in order to make a large feature difference between the target data sample and the other data samples stand out more clearly in the first anomaly value of the target data sample, the spatial distance between the target data sample and each of the other data samples may first be processed to obtain the superposition distance. For instance, a nonlinear mapping may be applied to the spatial distance, and the resulting nonlinear distance is used as the superposition distance corresponding to that spatial distance. In this way, a larger spatial distance yields a larger superposition distance, and a smaller spatial distance yields a smaller superposition distance, so that the first anomaly value highlights the abnormal condition of the data sample. In this case, following the procedure shown in Fig. 2, the first anomaly value of the target data sample x_i may be calculated as:

$$L(x_i, X) = \sum_{j=1}^{M} \varphi\left( d_{ij} \right) \qquad (4)$$

Here, φ(·) is the nonlinear mapping. Specifically, the Sigmoid function may be used, that is, $\varphi(d) = \dfrac{1}{1 + e^{-d}}$. This function is an S-shaped growth function: it is monotonically increasing, its inverse function is monotonically increasing, and its value lies between 0 and 1. The larger the argument of the function (that is, the spatial distance), the closer the resulting value (that is, the superposition distance) is to 1; conversely, the smaller the argument, the closer the resulting value is to 0.

For example, assume that the target data sample in the data sample set X is x₁, and that the spatial distances between the target data sample x₁ and the data samples x₂, x₃, x₄, x₅ are d₁₂, d₁₃, d₁₄ and d₁₅. Applying the Sigmoid mapping to d₁₂, d₁₃, d₁₄ and d₁₅ gives the corresponding superposition distances S₁₂, S₁₃, S₁₄ and S₁₅. According to formula (4), L(x₁, X) = S₁₂ + S₁₃ + S₁₄ + S₁₅.
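A minimal sketch of the Sigmoid-mapped variant follows, assuming the standard logistic function and the Euclidean distance; the sample data are hypothetical. One point worth keeping in mind: because distances are non-negative, the standard logistic function maps them into the interval [0.5, 1), so in practice the superposition distances are bounded below by 0.5.

```python
import math


def sigmoid(d):
    """Standard logistic function, used here as the nonlinear mapping."""
    return 1.0 / (1.0 + math.exp(-d))


def first_anomaly_value_nonlinear(target_index, samples):
    """Sum of Sigmoid-mapped Euclidean distances to every other sample."""
    target = samples[target_index]
    return sum(
        sigmoid(math.dist(target, other))
        for j, other in enumerate(samples)
        if j != target_index
    )


# Hypothetical data set: the last sample lies far from the others.
samples = [[0.0], [1.0], [2.0], [3.0], [10.0]]
scores = [first_anomaly_value_nonlinear(i, samples) for i in range(len(samples))]
print(scores)
```

Each term is below 1, so the result is bounded by the number of other samples; the distant sample still receives the largest value because all of its mapped distances are close to 1.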

In this way, the first anomaly value of the target data sample can be determined according to the spatial distances between the target data sample and the other data samples in the data sample set. As a quantification of the local degree of abnormality of the target data sample, it objectively reflects how likely the target data sample is to be an abnormal point and overcomes the subjectivity of abnormal point detection in a data sample set. The first implementation therefore provides a scientific and reliable strategy for assessing abnormal points.

For the second implementation, referring to Fig. 3, step 102 may specifically include:

Step 301: perform cluster analysis on the data sample set to obtain the target cluster analysis result;

Step 302: calculate the second anomaly value of the target data sample according to the target cluster in which the target data sample is located in the target cluster analysis result.

It is understood that clustering, refers to and is grouped physics or abstract data sample, by similar data sample The analytic process of multiple class clusters of this composition.In the present embodiment, clustering is carried out to data sample set, such as can adopted With K-means++ clustering algorithm, clustering is carried out to data sample set.It, will be in data sample set after clustering Multiple data samples are categorized into multiple class clusters, obtain the relevant information of this multiple class cluster and each class cluster, and it is poly- to be denoted as target Alanysis result.

Taking the K-means++ clustering algorithm as an example, step 301 may be implemented as follows. First step: select the initial cluster centers from the data sample set. This includes choosing a first initial cluster center, then calculating the spatial distance between that center and each data sample and, using for example a roulette-wheel mechanism, choosing a data sample with a larger spatial distance as the second initial cluster center, and so on until K cluster centers have been selected. Second step: for each data sample in the data sample set, calculate its spatial distance to the K initial cluster centers and assign the data sample to the class cluster corresponding to the initial cluster center with the smallest spatial distance. Third step: for each class cluster, recalculate the cluster center of that class cluster. Fourth step: judge whether the iteration stop condition is met; if so, take the class clusters and related information obtained in this round of cluster analysis as the cluster analysis result; otherwise, repeat the second and third steps until the iteration stop condition is met.

In a specific implementation, step 301 may specifically include:

Step 3011, select data samples from the data sample set as the initial cluster centers.

As an example, when choosing the initial cluster centers, one data sample may first be selected at random from the data sample set as the first initial cluster center; data samples that are farther from the already selected initial cluster center are then chosen from the data sample set as the other initial cluster centers.

As another example, to make the chosen initial cluster centers more suitable, so that the cluster analysis uses fewer computing resources and converges faster, the first initial cluster center selected from the data sample set may be the data sample with the smallest first exceptional value in the data sample set; data samples that are farther from the already selected initial cluster center are then chosen as the other initial cluster centers. Because the first selected data sample has the smallest first exceptional value, its spatial distances to the other data samples in the data sample set are the smallest overall; that is, it lies in the region where the data samples are most dense. An initial cluster center selected in this way can effectively reduce the number of subsequent cluster analysis iterations.
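The selection of initial cluster centers described above might look like the following sketch. The first center is the data sample with the smallest first exceptional value, as in the text. The rule used for the remaining centers (take the data sample farthest from its nearest already chosen center) is an assumption that stands in for "farther away"; the Euclidean distance and the function names are likewise illustrative.

```python
import math

def euclidean(a, b):
    """Euclidean distance, used here only as a placeholder spatial distance."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def choose_initial_centers(samples, first_values, k, dist=euclidean):
    """Pick k initial centers: the densest sample first, then far-away samples."""
    if not 1 <= k <= len(samples):
        raise ValueError("k must be between 1 and the number of samples")
    # First center: the sample with the smallest first exceptional value.
    chosen = [min(range(len(samples)), key=lambda i: first_values[i])]
    while len(chosen) < k:
        candidates = [i for i in range(len(samples)) if i not in chosen]
        # Assumed rule: the sample farthest from its nearest chosen center.
        chosen.append(max(
            candidates,
            key=lambda i: min(dist(samples[i], samples[c]) for c in chosen),
        ))
    return [samples[i] for i in chosen]
```

A roulette-wheel draw weighted by distance, as in K-means++, could replace the deterministic `max` without changing the rest of the sketch.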

Step 3012, take the initial cluster centers as the current cluster centers, and perform cluster analysis on the data sample set using the current cluster centers to obtain the current cluster analysis result.

It can be understood that the current cluster analysis result may include multiple current class clusters divided around the multiple current cluster centers. Each current class cluster may include multiple data samples, and the abnormal points in the data sample set may all lie in one current class cluster or may be spread over several current class clusters. In one case, if a current class cluster includes both normal data samples and abnormal points, the features of an abnormal point differ significantly from those of the other data samples in that class cluster, so the spatial distance between the abnormal point and the class cluster center is relatively large. In another case, if a current class cluster includes only a small number of data samples, all data samples in that class cluster may be abnormal points; for example, if a current class cluster contains only one data sample, that data sample can be regarded as an abnormal point in the data sample set.

Step 3013, judge whether the iteration stop condition is met; if not, execute step 3014; otherwise, execute step 3015.

It can be understood that the iteration stop condition indicates whether the cluster analysis of the data sample set is complete. If the iteration stop condition is met, no further iteration or cluster analysis needs to be performed on the data sample set; if it is not met, new current cluster centers still need to be determined and step 3012 executed again. The iteration stop condition may include, but is not limited to, the following: in one case, the number of cluster analysis rounds reaches a preset number; in another case, comparing the current cluster analysis result with the result of the previous round, the mean of the spatial distances of the data samples is less than a preset threshold.
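The two stop conditions can be expressed as a small predicate. The sketch below reads the second condition as "the mean sample-to-center distance changed by less than a threshold between two consecutive rounds"; this reading, and all parameter names, are assumptions made for illustration.

```python
def should_stop(iteration, max_iterations, prev_mean_dist, curr_mean_dist, tol):
    """Return True when either assumed stop criterion is met."""
    # Criterion 1: the preset number of cluster analysis rounds is reached.
    if iteration >= max_iterations:
        return True
    # No previous round to compare with yet.
    if prev_mean_dist is None:
        return False
    # Criterion 2 (assumed reading): the mean distance has barely changed.
    return abs(curr_mean_dist - prev_mean_dist) < tol
```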

It should be noted that the iteration stop condition needs to be judged each time a current cluster analysis result is obtained; once the iteration stop condition is met, the cluster analysis of the data sample set is considered complete.

Step 3014, redetermine the current cluster centers according to the current class clusters in the current cluster analysis result, and return to step 3012.

It can be understood that, in the current cluster analysis result obtained in step 3012 with the initial cluster centers as the current cluster centers, each current class cluster includes multiple data samples, so the centroid (i.e., the class cluster center) of a current class cluster often differs from the current cluster center corresponding to that class cluster. The current cluster centers therefore need to be redetermined for the next iteration; for example, the class cluster center of each current class cluster, which better reflects the features of that class cluster, may be used as the redetermined current cluster center.

As an example, redetermining the current cluster centers in step 3014 may specifically include:

S1, calculate the first exceptional value of each data sample in the current class cluster according to the current class cluster.

In a specific implementation, each data sample in the data sample set may in turn be taken as the target data sample, and the first exceptional value of each data sample may be calculated in the manner shown in Fig. 2 above, specifically using formula (3) or formula (4).

It can be understood that, whether the abnormal points in the data sample set lie in one current class cluster or are spread over several, abnormal points and normal data samples can be accurately distinguished: a data sample that is more likely to be abnormal yields a larger first exceptional value, and a data sample that is less likely to be abnormal yields a correspondingly smaller first exceptional value.

S2, calculate the class cluster center of the current class cluster based on the retained data samples in the current class cluster, wherein the first exceptional value of every retained data sample is less than the first exceptional value of every data sample in the current class cluster other than the retained data samples.

It can be understood that, in a current class cluster of the current cluster analysis result in which normal data samples are in the majority, the abnormal points should not influence the current cluster center that is redetermined for that class cluster before the next round of cluster analysis. To this end, part of the data samples in the current class cluster may be taken as retained data samples, and the class cluster center of the current class cluster is recalculated based on these retained data samples only.

The retained data samples should cover as many of the normal data samples in the current class cluster as possible while excluding the abnormal points in it. Therefore, the data samples with smaller first exceptional values in the current class cluster may be taken as the retained data samples, and the data samples with larger first exceptional values may be rejected. Moreover, so that cluster analysis based on the retained data samples still converges quickly and effectively, the number of retained data samples must not be less than 90 percent of the number of data samples in the current class cluster.

As an example, the data samples in the current class cluster may be sorted in ascending order of their first exceptional values, and the first 90% of the data samples taken as the retained data samples. For example, suppose current class cluster C includes data samples x₁, x₂, ……, x₁₀₀, and sorting them in ascending order of first exceptional value gives x₁₀₀, x₉₉, ……, x₂, x₁. The retained data samples are then x₁₀₀, x₉₉, ……, x₁₂, x₁₁ (90 data samples), and the rejected data samples are x₁ to x₁₀.
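The sort-and-keep step in this example can be sketched as follows. The function name and the `keep_percent` parameter are hypothetical; integer ceiling division is used so that the retained share never falls below the requested percentage.

```python
def retained_samples(cluster, first_values, keep_percent=90):
    """Keep the samples with the smallest first exceptional values."""
    if not 90 <= keep_percent <= 100:
        raise ValueError("keep_percent must be between 90 and 100")
    # Sort sample indices in ascending order of first exceptional value.
    order = sorted(range(len(cluster)), key=lambda i: first_values[i])
    # Ceiling division: never retain less than keep_percent of the cluster.
    n_keep = (keep_percent * len(cluster) + 99) // 100
    return [cluster[i] for i in order[:n_keep]]
```

With the 100 data samples of the example, 90 are retained and the 10 with the largest first exceptional values are rejected.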

It should be noted that 90 percent is a preset minimum threshold: the proportion of retained data samples in the current class cluster may be set as needed, but the proportion that is set must be greater than or equal to 90 percent.

The class cluster center of the current class cluster is calculated based on the retained data samples; see the following formula:

Here, μ_i is the class cluster center of the i-th current class cluster, C_i is the set of data samples included in the i-th current class cluster, its retained subset is the set of retained data samples in the i-th current class cluster, and N_i is the number of retained data samples in the i-th current class cluster.
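Consistent with these symbol definitions, the class cluster center can be read as the arithmetic mean of the retained data samples. The expression below is a sketch of that reading; the symbol \tilde{C}_i for the set of retained data samples is an assumed notation.

```latex
\mu_i = \frac{1}{N_i} \sum_{x \in \tilde{C}_i} x,
\qquad \tilde{C}_i \subseteq C_i, \quad N_i = |\tilde{C}_i|
```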

S3, redetermine the current cluster center so that the class cluster center of the current class cluster serves as the current cluster center.

It can be understood that, according to S1 and S2 above, the corresponding class cluster center can be obtained for each current class cluster of the data sample set. The class cluster center of each current class cluster can then be taken as a new current cluster center, which completes the redetermination of the current cluster centers for the data sample set.

It should be noted that, each time new current cluster centers have been redetermined, they still need to be fed back to step 3012; step 3012 is then executed, followed by the judgment in step 3013, and so on, until the iteration stop condition is met. The cluster analysis of the data sample set is then considered complete, and step 3015 below can be executed.

Step 3015, determine the current cluster analysis result as the target cluster analysis result.

It can be understood that when the current cluster analysis result obtained in step 3012 meets the iteration stop condition (for example, the number of cluster analysis rounds reaches the preset number, or, compared with the result of the previous round, the mean of the spatial distances of the data samples is less than the preset threshold), no further iteration or cluster analysis is needed for the data sample set. The current cluster analysis result is directly taken as the target cluster analysis result, which provides the data basis for subsequently calculating the second exceptional value of the target data sample.
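Steps 3012 to 3015 can be put together in one loop, sketched below. Several choices are assumptions made only to keep the example self-contained: the first exceptional values are computed within each class cluster using a logistic mapping of order-3 Minkowski distances, the stop test checks how far the centers moved after the update, and all function names are hypothetical.

```python
import math

def dist(a, b, p=3):
    """Minkowski distance of order p (order 3 assumed)."""
    return sum(abs(u - v) ** p for u, v in zip(a, b)) ** (1.0 / p)

def first_value(x, others):
    """Sum of logistic-mapped distances from x to the other samples."""
    return sum(1.0 / (1.0 + math.exp(-dist(x, o))) for o in others)

def trimmed_center(cluster, keep_percent=90):
    """Mean of the retained samples (smallest first exceptional values)."""
    values = [
        first_value(x, cluster[:i] + cluster[i + 1:]) for i, x in enumerate(cluster)
    ]
    order = sorted(range(len(cluster)), key=values.__getitem__)
    n_keep = (keep_percent * len(cluster) + 99) // 100  # ceiling division
    kept = [cluster[i] for i in order[:n_keep]]
    return tuple(sum(c) / len(kept) for c in zip(*kept))

def cluster_samples(samples, centers, max_iterations=20, tol=1e-6):
    """Assign samples, update centers from retained samples, stop when stable."""
    clusters = [[] for _ in centers]
    for _ in range(max_iterations):
        # Step 3012: assign every sample to its nearest current center.
        clusters = [[] for _ in centers]
        for x in samples:
            nearest = min(range(len(centers)), key=lambda c: dist(x, centers[c]))
            clusters[nearest].append(x)
        # Step 3014: recompute each center from its retained samples.
        new_centers = [
            trimmed_center(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
        # Step 3013 (assumed criterion): stop once the centers stop moving.
        shift = max(dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    # Step 3015: the last result is taken as the target result.
    return centers, clusters
```

Because the data sample with the largest first exceptional value in a class cluster is left out of the mean, a single distant point does not pull the class cluster center away from the dense region.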

Having described the specific implementation of step 301, step 302 then calculates the second exceptional value of the target data sample according to the target class cluster in which the target data sample is located in the target cluster analysis result determined in step 301.

As an example, a specific implementation of step 302 may include:

Step 3021, according to the target class cluster, determine the number of data samples in the target class cluster and the spatial distance between the target data sample and the class cluster center of the target class cluster;

Step 3022, calculate the second exceptional value of the target data sample according to the number of data samples in the target class cluster and the spatial distance between the target data sample and the class cluster center of the target class cluster.

In a specific implementation, the second exceptional value of each data sample may be determined from the following two factors: first, the number of data samples included in the current class cluster to which the data sample belongs; second, the spatial distance between the data sample and the class cluster center of the current class cluster to which it belongs.

As an example, the calculation formula in step 3022 is as follows:

Here, C is the current class cluster to which x_i belongs; the formula uses the normalized number of data samples included in the current class cluster C; and ρ(x_i, C) reflects the spatial distance between x_i and the class cluster center of the current class cluster C. It can be understood that, in one case, ρ(x_i, C) may be the spatial distance itself between x_i and the class cluster center of the current class cluster C. In another case, to reflect the second exceptional value of each data sample more intuitively, ρ(x_i, C) may instead be the rank coefficient of that spatial distance among the distances from all data samples in the current class cluster C to the class cluster center, which indicates the position of x_i within the current class cluster C. Specifically, ρ(x_i, C) may be the rank coefficient obtained by normalizing the rank of the spatial distance between x_i and the class cluster center of the current class cluster C, so that ρ(x_i, C) ranges from 0 to 1. The closer ρ(x_i, C) is to 0, the closer the data sample is to the class cluster center of the current class cluster C and the earlier it ranks; conversely, the closer ρ(x_i, C) is to 1, the farther the data sample is from the class cluster center and the later it ranks.
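The two inputs to the second exceptional value can be sketched as follows. Only the factors are computed; how they are combined is not shown here. The normalization of the class cluster size by the total number of data samples, the convention for a single-sample class cluster, and the function names are all assumptions.

```python
def rank_coefficient(distances, index):
    """Normalized rank (0 to 1) of one sample's distance to the cluster center."""
    if len(distances) < 2:
        return 0.0  # assumed convention for a single-sample class cluster
    order = sorted(range(len(distances)), key=distances.__getitem__)
    # Position 0 is the closest sample, position len - 1 the farthest.
    return order.index(index) / (len(distances) - 1)

def normalized_cluster_size(cluster_size, total_samples):
    """Class cluster size as a fraction of the whole data sample set (assumed)."""
    return cluster_size / total_samples
```

For a class cluster of 5 data samples in a set of 100, the assumed normalized size is 0.05, and the data sample farthest from the class cluster center has a rank coefficient of 1.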

As an example, suppose the data sample set includes 100 data samples and, in the target cluster analysis result, the target class cluster in which target data sample x₁ is located is C₁, where C₁ includes data samples x₁, x₂, ……, x₅. According to step 3021, the number of data samples in the target class cluster is 5, and the spatial distance between target data sample x₁ and the class cluster center of C₁ is d_x. The second exceptional value of target data sample x₁ can then be calculated according to formula (6) above.

In this way, cluster analysis is performed on the data sample set to obtain the target cluster analysis result, and the second exceptional value of the target data sample is calculated according to the target class cluster in which the target data sample is located. This value quantifies the global degree of abnormality of the target data sample and objectively reflects how likely it is to be an abnormal point, which removes the subjectivity from abnormal point detection in the data sample set. The second implementation therefore provides a scientific and reliable abnormal point assessment strategy.

For the third implementation, to reflect the degree of abnormality of the target data sample in the data sample set more comprehensively, the first and second implementations above may be combined and the two exceptional values fused. The likelihood that the target data sample is an abnormal point is thus analyzed from both the local and the global aspect, giving a more comprehensive and scientific target exceptional value for the target data sample.

In a specific implementation, referring to Fig. 4, step 102 may specifically include:

Step 401, calculate the spatial distance between the target data sample and each of the other data samples, and calculate the first exceptional value of the target data sample according to these spatial distances.

Step 402, perform cluster analysis on the data sample set to obtain the target cluster analysis result, and calculate the second exceptional value of the target data sample according to the target class cluster in which the target data sample is located in the target cluster analysis result.

Step 403, fuse the first exceptional value and the second exceptional value of the target data sample into the target exceptional value of the target data sample according to a fusion weight.

It can be understood that, for the implementation of step 401, reference may be made to the description of the first implementation shown in Fig. 2 above, and for the implementation of step 402, to the description of the second implementation shown in Fig. 3 above; details are not repeated here.

It can be understood that the fusion weight may be an empirical value obtained by a technician through statistics and analysis of multiple sets of experimental data. It is used to fuse the first exceptional value and the second exceptional value of the target data sample, and the fused target exceptional value reflects the degree of abnormality of the target data sample.

As an example, step 403 can specifically be calculated according to the following formula:

I(x_i, X) = L(x_i, X) + αG(x_i, X)    …… formula (7)

Here, L(x_i, X) is the first exceptional value of target data sample x_i, G(x_i, X) is the second exceptional value of target data sample x_i, and α is the fusion weight.
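Formula (7) is a weighted sum and translates directly into code. The function name and the sample numbers below are illustrative only.

```python
def target_exceptional_value(first_value, second_value, alpha):
    """Formula (7): local value plus the fusion weight times the global value."""
    return first_value + alpha * second_value

# Illustrative numbers: L = 3.0, G = 0.5, fusion weight alpha = 2.0.
fused = target_exceptional_value(3.0, 0.5, 2.0)
```

A larger fusion weight gives the cluster-based (global) value more influence relative to the distance-based (local) value.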

In this way, by combining the first and second implementations above and fusing the two exceptional values, a more comprehensive and scientific target exceptional value of the target data sample is obtained, which removes the subjectivity from abnormal point detection in the data sample set. The third implementation therefore provides a scientific, comprehensive, accurate and objective abnormality assessment strategy.

Through any of the three specific implementations of step 102 above, the target exceptional value of the target data sample can be determined, which provides the numerical basis for judging whether the target data sample is an abnormal point in the data sample set.

Step 103, determine whether the target data sample is an abnormal point in the data sample set according to the target exceptional value of the target data sample.

It can be understood that the basis for determining whether the target data sample is an abnormal point in the data sample set can be set flexibly. In a specific implementation, it includes, but is not limited to, the following two approaches:

In one example, step 103 may be completed using a preset number of abnormal points to be detected. In a specific implementation: first, each data sample in the data sample set is taken as the target data sample in turn to obtain the target exceptional value of each data sample; second, the data samples are sorted by target exceptional value, in ascending or descending order; third, among the sorted data samples, those with the largest target exceptional values, up to the preset number of abnormal points to be detected, are taken as abnormal points. Thus, if the target exceptional value of the target data sample is not among the preset number of largest values, the target data sample is determined not to be an abnormal point in the data sample set; otherwise, it is determined to be an abnormal point in the data sample set.

In another example, step 103 may be completed using a preset exceptional value threshold. In a specific implementation, it is judged whether the target exceptional value of the target data sample is less than the preset exceptional value threshold; if so, the target data sample is determined not to be an abnormal point in the data sample set; otherwise, it is determined to be an abnormal point in the data sample set.
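The two decision rules for step 103 can be sketched as two small functions that return the indices of the data samples judged to be abnormal points. The function names are hypothetical.

```python
def abnormal_by_count(target_values, count):
    """Indices of the `count` samples with the largest target exceptional values."""
    order = sorted(
        range(len(target_values)), key=target_values.__getitem__, reverse=True
    )
    return set(order[:count])

def abnormal_by_threshold(target_values, threshold):
    """Indices of samples whose target exceptional value is not below the threshold."""
    return {i for i, v in enumerate(target_values) if v >= threshold}
```

The count-based rule always reports a fixed number of abnormal points, while the threshold-based rule may report none or many, depending on the data.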

It follows that, in the embodiments of the present invention, a quantified calculation is performed for each data sample in the data sample set to obtain its target exceptional value, and whether each data sample is an abnormal point in the data sample set can be determined according to that value. The target exceptional value of each data sample may be determined according to the spatial distances between the data sample and the other data samples and/or the target class cluster in which the data sample is located in the target cluster analysis result for the data sample set. In this way, the features of the data samples in the data sample set are quantified accurately and scientifically, which effectively solves the problem that abnormal points detected solely by the subjective judgment of technicians are insufficiently accurate and reliable. With the quantified target exceptional value of each data sample, the abnormal points in the data sample set can be determined accurately, so that subsequent data analysis results for the data sample set are more reliable and effective.

Correspondingly, an embodiment of the present invention further provides a device for detecting abnormal points in a data sample set. As shown in Fig. 5, the device may specifically include:

An acquisition module 501, configured to acquire the target data sample in the data sample set;

A first determining module 502, configured to determine the target exceptional value of the target data sample according to the spatial distance between the target data sample and other data samples and/or the target class cluster in which the target data sample is located in the target cluster analysis result for the data sample set, wherein the other data samples are the data samples in the data sample set other than the target data sample;

A second determining module 503, configured to determine whether the target data sample is an abnormal point in the data sample set according to the target exceptional value of the target data sample.

Optionally, the first determining module 502 may specifically include:

Optionally, the first calculation submodule includes:

Optionally, the spatial distance is specifically the Minkowski distance of order 3.

Optionally, the second calculation submodule includes:

Optionally, the second determination unit includes:

Optionally, the second calculation submodule includes:

The above describes the device for detecting abnormal points in a data sample set. For its specific implementation and the effects achieved, reference may be made to the description of the embodiment of the method for detecting abnormal points in a data sample set shown in Fig. 1; details are not repeated here.

In addition, an embodiment of the present invention further provides equipment for detecting abnormal points in a data sample set. As shown in Fig. 6, the equipment includes a processor 601 and a memory 602:

The memory 602 is configured to store program code and transmit the program code to the processor 601;

The processor 601 is configured to execute, according to the instructions in the program code, the method for detecting abnormal points in a data sample set provided by the embodiment shown in Fig. 1.

For the specific implementation of the equipment for detecting abnormal points in a data sample set and the effects achieved, reference may be made to the description of the method embodiment shown in Fig. 1; details are not repeated here.

In addition, an embodiment of the present invention further provides a storage medium. The storage medium is configured to store program code, and the program code is used to execute the method for detecting abnormal points in a data sample set provided by the embodiment shown in Fig. 1.

It should be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. The terms "include", "comprise" or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article or device that includes the element.

Since the device embodiments basically correspond to the method embodiments, reference may be made to the description of the method embodiments for the relevant parts. The device and equipment embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement this without creative effort.

The above are only specific embodiments of the present application. It should be noted that those of ordinary skill in the art may make several improvements and modifications without departing from the principle of the present application, and these improvements and modifications shall also be regarded as falling within the protection scope of the present application.

Claims

1. A method for detecting abnormal points in a data sample set, characterized by comprising:

acquiring a target data sample in the data sample set;

determining a target exceptional value of the target data sample according to the spatial distance between the target data sample and other data samples and/or the target class cluster in which the target data sample is located in a target cluster analysis result for the data sample set, wherein the other data samples are the data samples in the data sample set other than the target data sample;

determining, according to the target exceptional value of the target data sample, whether the target data sample is an abnormal point in the data sample set.

2. The detection method according to claim 1, characterized in that the determining of the target exceptional value of the target data sample according to the spatial distance between the target data sample and other data samples and/or the target class cluster in which the target data sample is located in the target cluster analysis result for the data sample set comprises:

calculating the spatial distance between the target data sample and each of the other data samples, and calculating a first exceptional value of the target data sample according to the spatial distances between the target data sample and each of the other data samples;

performing cluster analysis on the data sample set to obtain the target cluster analysis result, and calculating a second exceptional value of the target data sample according to the target class cluster in which the target data sample is located in the target cluster analysis result;

fusing, according to a fusion weight, the first exceptional value and the second exceptional value of the target data sample into the target exceptional value of the target data sample.

3. The detection method according to claim 2, characterized in that the calculating of the first exceptional value of the target data sample according to the spatial distances between the target data sample and each of the other data samples comprises:

determining, from the spatial distance between the target data sample and each of the other data samples, the superposition distance between the target data sample and each of the other data samples;

summing the superposition distances between the target data sample and each of the other data samples to obtain the first exceptional value of the target data sample.

4. The detection method according to claim 2, characterized in that the performing of cluster analysis on the data sample set to obtain the target cluster analysis result comprises:

selecting data samples from the data sample set as initial cluster centers;

taking the initial cluster centers as current cluster centers, and performing cluster analysis on the data sample set using the current cluster centers to obtain a current cluster analysis result;

if an iteration stop condition is not met, redetermining the current cluster centers according to the current class clusters in the current cluster analysis result, and then returning to the step of performing cluster analysis on the data sample set using the current cluster centers;

5. The detection method according to claim 4, characterized in that the first initial cluster center selected from the data sample set is the data sample with the smallest first exceptional value in the data sample set.

6. The detection method according to claim 4, characterized in that the redetermining of the current cluster centers according to the current class clusters in the current cluster analysis result comprises:

calculating the class cluster center of the current class cluster based on retained data samples in the current class cluster, wherein the first exceptional value of every retained data sample is less than the first exceptional value of every data sample in the current class cluster other than the retained data samples;

redetermining the current cluster center so that the class cluster center of the current class cluster serves as the current cluster center.

7. The detection method according to claim 2, characterized in that the calculating of the second exceptional value of the target data sample according to the target class cluster in which the target data sample is located in the target cluster analysis result comprises:

determining, according to the target class cluster, the number of data samples in the target class cluster and the spatial distance between the target data sample and the class cluster center of the target class cluster;

calculating the second exceptional value of the target data sample according to the number of data samples in the target class cluster and the spatial distance between the target data sample and the class cluster center of the target class cluster.

8. A device for detecting abnormal points in a data sample set, characterized by comprising:

a first determining module, configured to determine a target exceptional value of the target data sample according to the spatial distance between the target data sample and other data samples and/or the target class cluster in which the target data sample is located in a target cluster analysis result for the data sample set, wherein the other data samples are the data samples in the data sample set other than the target data sample;

a second determining module, configured to determine, according to the target exceptional value of the target data sample, whether the target data sample is an abnormal point in the data sample set.

9. Equipment for detecting abnormal points in a data sample set, the equipment comprising a processor and a memory:

the processor is configured to execute, according to instructions in the program code, the method for detecting abnormal points in a data sample set according to any one of claims 1 to 7.

10. A storage medium, configured to store program code, the program code being used to execute the method for detecting abnormal points in a data sample set according to any one of claims 1 to 7.