WO2020155756A1

WO2020155756A1 - Method and device for optimizing abnormal point proportion based on clustering and SSE

Info

Publication number: WO2020155756A1
Application number: PCT/CN2019/117363
Authority: WO
Inventors: 杨志鸿; 徐亮; 阮晓雯
Original assignee: 平安科技（深圳）有限公司
Priority date: 2019-01-28
Filing date: 2019-11-12
Publication date: 2020-08-06
Also published as: CN109961086A

Abstract

The present application discloses a method and a device for optimizing an abnormal point proportion based on clustering and SSE. The method comprises: receiving a set of data points to be classified, and clustering the set by k-means clustering to obtain multiple clusters; obtaining the data points corresponding to each of the multiple clusters, and constructing a single-class support vector machine for each cluster according to a preset current abnormal point proportion and that cluster; repeatedly reducing the current abnormal point proportion by a step size until the residual variation exceeds a variation threshold, and then taking the current abnormal point proportion plus the step size as the optimal abnormal point proportion; and classifying the selected cluster according to the single-class support vector machine and the optimal abnormal point proportion to obtain an optimal classification result.

Description

Method and device for optimizing the proportion of abnormal points based on clustering and SSE

This application claims priority to Chinese patent application No. 201910079217.9, filed with the Chinese Patent Office on January 28, 2019 and entitled "Method and Device for Optimizing the Proportion of Outliers Based on Clustering and SSE", the entire contents of which are incorporated herein by reference.

Technical field

This application relates to the technical field of intelligent decision-making, and in particular to a method and device for optimizing the proportion of abnormal points based on clustering and SSE.

Background

Outlier analysis is the process of checking whether data contains input errors or unreasonable values. Ignoring outliers is risky: including them in the calculation and analysis of the data without removing them adversely affects the results.

At present, there are often multiple normal point centers in the massive user data collected during the operation of enterprises. If the massive user data is not divided first and then abnormal point detection is performed, the unsupervised model used for abnormal point detection will have a poor discrimination effect and cannot detect abnormal point data finely.

Summary of the invention

The embodiments of the present application provide a method, device, computer equipment and storage medium for optimizing the proportion of abnormal points based on clustering and SSE, aiming to solve the following problem in the prior art: massive user data often contains multiple normal point centers, and performing abnormal point detection without first dividing the data results in poor discrimination by the unsupervised model used for detection, so that abnormal point data cannot be detected finely.

In the first aspect, an embodiment of the present application provides a method for optimizing the proportion of abnormal points based on clustering and SSE, which includes:

Receiving a set of data points to be classified, and clustering the set of data points to be classified through k-means clustering to obtain multiple clusters;

Obtaining the data points corresponding to each of the multiple clusters, and constructing, according to the preset current abnormal point ratio and each cluster, a single-class support vector machine for abnormal point detection in one-to-one correspondence with each cluster;

Classify the selected clusters according to the single-class support vector machine and the current abnormal point ratio to obtain the normal point center of the normal category in the classification result;

Obtaining the residual sum of squares of each data point of the abnormal category in the classification result and the center of the normal point to obtain the current residual sum of squares;

Subtract the preset step size from the current abnormal point ratio to update the current abnormal point ratio;

Classifying the selected cluster according to the single-class support vector machine and the current abnormal point ratio to obtain the data points of the current abnormal category, and obtaining the residual sum of squares between each data point of the current abnormal category and the normal point center as the next residual sum of squares;

By dividing the difference between the next residual sum of squares and the current residual sum of squares by the step size, the residual variation range is obtained;

Determine whether the residual variation range exceeds a preset variation range threshold;

If the residual variation range exceeds the variation range threshold, the current abnormal point ratio plus the step length is used as the optimal abnormal point ratio; and

The selected clusters are classified according to the single-class support vector machine and the optimal proportion of abnormal points to obtain the optimal classification result.

In the second aspect, an embodiment of the present application provides a device for optimizing the proportion of abnormal points based on clustering and SSE, which includes:

The clustering unit is configured to receive a set of data points to be classified, and cluster the set of data points to be classified through k-means clustering to obtain multiple clusters;

The multi-model construction unit is used to obtain the data points corresponding to each of the multiple clusters, and to construct, according to the preset current abnormal point ratio and each cluster, a single-class support vector machine for abnormal point detection in one-to-one correspondence with each cluster;

The normal point center obtaining unit is used to classify the selected cluster according to the single-class support vector machine and the current abnormal point ratio to obtain the normal point center of the normal category in the classification result;

The first residual calculation unit is configured to obtain the residual sum of squares of each data point of the abnormal category in the classification result and the center of the normal point to obtain the current residual sum of squares;

The first ratio update unit is configured to subtract a preset step size from the current abnormal point ratio to update the current abnormal point ratio;

The second residual calculation unit is used to classify the selected cluster according to the single-class support vector machine and the current abnormal point ratio to obtain the data points of the current abnormal category, and to obtain the residual sum of squares between each data point of the current abnormal category and the normal point center as the next residual sum of squares;

An amplitude calculation unit, configured to divide the difference between the next residual sum of squares and the current residual sum of squares by the step size to obtain the residual variation range;

A judging unit for judging whether the residual variation range exceeds a preset variation range threshold;

An optimal ratio obtaining unit, configured to, if the residual variation range exceeds the variation range threshold, use the current abnormal point ratio plus the step length as the optimal abnormal point ratio; and

The optimal classification unit is used to classify the selected clusters according to the single classification support vector machine and the optimal abnormal point ratio to obtain the optimal classification result.

In the third aspect, an embodiment of the present application provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method for optimizing the proportion of abnormal points based on clustering and SSE described in the first aspect above.

In the fourth aspect, the embodiments of the present application also provide a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program that, when executed by a processor, causes the processor to execute the method for optimizing the proportion of abnormal points based on clustering and SSE described in the first aspect above.

Description of the drawings

In order to describe the technical solutions of the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show some embodiments of the present application, and a person of ordinary skill in the art can obtain other drawings from them without creative work.

FIG. 1 is a schematic flowchart of a method for optimizing the proportion of abnormal points based on clustering and SSE provided by an embodiment of the application;

FIG. 2 is a schematic diagram of a sub-process of the method for optimizing the proportion of abnormal points based on clustering and SSE according to an embodiment of the application;

FIG. 3 is a schematic diagram of another sub-process of the method for optimizing the proportion of abnormal points based on clustering and SSE according to an embodiment of the application;

FIG. 4 is a schematic diagram of another sub-process of the method for optimizing the proportion of abnormal points based on clustering and SSE according to an embodiment of the application;

FIG. 5 is another flow diagram of the method for optimizing the proportion of abnormal points based on clustering and SSE according to an embodiment of the application;

FIG. 6 is a schematic block diagram of a device for optimizing the proportion of abnormal points based on clustering and SSE according to an embodiment of the application;

FIG. 7 is a schematic block diagram of subunits of the device for optimizing the proportion of abnormal points based on clustering and SSE according to an embodiment of the application;

FIG. 8 is a schematic block diagram of another subunit of the device for optimizing the proportion of abnormal points based on clustering and SSE according to an embodiment of the application;

FIG. 9 is a schematic block diagram of another subunit of the device for optimizing the proportion of abnormal points based on clustering and SSE according to an embodiment of the application;

FIG. 10 is another schematic block diagram of the device for optimizing the proportion of abnormal points based on clustering and SSE according to an embodiment of the application;

FIG. 11 is a schematic block diagram of a computer device provided by an embodiment of the application.

Detailed description

The following will clearly and completely describe the technical solutions in the embodiments of the present application in conjunction with the drawings in the embodiments of the present application. Obviously, the described embodiments are part of the embodiments of the present application, not all of the embodiments. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of this application.

It should be understood that when used in this specification and the appended claims, the terms "comprising" and "including" indicate the existence of the described features, wholes, steps, operations, elements and/or components, but do not exclude the existence or addition of one or more other features, wholes, steps, operations, elements, components, and/or collections thereof.

It should also be understood that the terms used in the specification of this application are only for the purpose of describing specific embodiments and are not intended to limit the application. As used in the specification of this application and the appended claims, unless the context clearly indicates other circumstances, the singular forms "a", "an" and "the" are intended to include plural forms.

It should be further understood that the term "and/or" used in the specification and appended claims of this application refers to any combination, and all possible combinations, of one or more of the associated listed items, and includes these combinations.

Please refer to FIG. 1. FIG. 1 is a schematic flowchart of an SSE-based abnormal point ratio optimization method provided by an embodiment of the application. The SSE-based abnormal point ratio optimization method is applied to a server and is executed by application software installed on the server.

As shown in Figure 1, the method includes steps S101 to S181.

S101. Receive a set of data points to be classified, and cluster the set of data points to be classified through k-means clustering to obtain multiple clusters.

In this embodiment, after the server of the enterprise receives the massive business data uploaded by each business end, this business data can be regarded as a set of data points to be classified. For example, the set of data points to be classified may be users' insurance policy data, including at least fields such as the name of the applicant, the age of the applicant, the applicant's policy number, the insured amount, the insurance period, and the phone number of the applicant. One of the fields can then be chosen as the main data, with the remaining fields serving as attribute data of that main field. For example, the insurance period field is used as the main data, and fields such as the telephone number and ID number of the applicant are used as its attribute data.

In an embodiment, as shown in FIG. 2, step S101 includes:

S1011. Select, from the set of data points to be classified, the same number of data points as the preset number of clusters, and use the selected data points as the initial cluster centers of the clusters;

S1012: Divide the set of data points to be classified according to the difference between each data point in the set of data points to be classified and each initial cluster center to obtain an initial clustering result;

S1013. Obtain the adjusted cluster center of each cluster according to the initial clustering result;

S1014. Divide the set of data points to be classified according to the difference between each data point and the adjusted cluster centers, until the clustering result has remained unchanged for more than a preset number of iterations, so as to obtain clusters whose number equals the preset number of clusters.

In this embodiment, when clustering the set of data points to be classified, one of the fields is selected as the primary key, and the remaining fields are used as the attribute data. Specifically, the k-means algorithm is used when clustering the set of data points to be classified, and the process is as follows:

a) Randomly select k data points from the set of n data points to be classified and use them as the initial cluster centers of the k clusters. Here n is the initial total number of data points in the set, and k (k<n) is a user-specified parameter, namely the expected number of clusters, i.e., the preset number of clusters.

b) Calculate the dissimilarity between each remaining data point and the initial cluster centers of the k clusters, and assign each remaining data point to the cluster with the lowest dissimilarity to obtain the initial clustering result. That is, each remaining data point is assigned to the same cluster as its closest initial cluster center; in this way the data points to be classified are divided into k clusters based on the initially selected cluster centers, each cluster having one initial cluster center.

c) Based on the initial clustering result, recalculate the cluster center of each of the k clusters. Specifically, take the arithmetic mean of the primary attributes of all data points in each cluster and choose the data point closest to that arithmetic mean as the new cluster center, so that a better cluster center is reselected within each cluster.

d) Re-cluster all the elements in the n data points to be classified according to the new cluster center;

e) Repeat steps c) and d) until the clustering result no longer changes, and obtain the clustering result corresponding to the preset number of clusters.
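Steps a) to e) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the application: points are tuples of numbers, dissimilarity is taken to be squared Euclidean distance, the loop stops the first time the assignment is unchanged (rather than after a preset number of unchanged rounds), and the names `kmeans`, `assign`, `init_centers` and `max_iter` are hypothetical.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance, used here as the dissimilarity (an assumption)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Arithmetic mean of a non-empty list of points, coordinate by coordinate."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def assign(points, centers):
    """Steps b) and d): index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]

def kmeans(points, k, init_centers=None, max_iter=100, seed=0):
    # Step a): pick k data points as the initial cluster centers.
    centers = list(init_centers) if init_centers else random.Random(seed).sample(points, k)
    labels = assign(points, centers)              # step b)
    for _ in range(max_iter):
        new_centers = []
        for j in range(k):                        # step c)
            members = [p for p, lab in zip(points, labels) if lab == j]
            if not members:                       # keep the old center of an empty cluster
                new_centers.append(centers[j])
                continue
            m = mean_point(members)
            # The new center is the member closest to the arithmetic mean.
            new_centers.append(min(members, key=lambda p: sq_dist(p, m)))
        new_labels = assign(points, new_centers)  # step d)
        centers = new_centers
        if new_labels == labels:                  # step e): assignment is stable
            break
        labels = new_labels
    return labels, centers
```

Because the new center is always an actual data point, the update in step c) behaves more like a medoid update than the mean update of textbook k-means.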

After the cluster classification is completed, the massive collection of data points to be classified can be grouped quickly to obtain multiple clusters.

S110. Obtain the data points corresponding to each of the multiple clusters, and construct, according to the preset current abnormal point ratio and each cluster, a single-class support vector machine for abnormal point detection in one-to-one correspondence with each cluster.

In this embodiment, for example, after the server receives the set of data points to be classified uploaded by the business end and completes the clustering, if the initial current abnormal point ratio is set to 0.5 (denoted, for example, as $m_0$), the expected ratio of normal point samples to abnormal point samples in the classification result of the single-class support vector machine is 1:1. Since normal points are assumed to outnumber abnormal points, the abnormal category then contains a large number of misclassified normal points; as the abnormal point ratio decreases, these normal points are removed from the abnormal category. A single-class support vector machine for abnormal point detection is therefore constructed according to the preset current abnormal point ratio and the samples to be classified, as the model basis for subsequently adjusting the current abnormal point ratio and reclassifying.

In an embodiment, as shown in FIG. 3, step S110 includes:

S111: Obtain, according to the preset current abnormal point ratio and each cluster, the first parameter and the second parameter of the hyperplane of the single-class support vector machine corresponding to each cluster;

S112: According to the first parameter and the second parameter of the hyperplane, and the current abnormal point ratio, construct a single-class support vector machine for abnormal point detection in a one-to-one correspondence with each cluster.

In this embodiment, the single-class support vector machine is OneClassSVM, and its classification model is as follows:

s.t. $(w \cdot \varphi(x_i)) \ge b - \xi_i,\ \xi_i \ge 0$;

Among them, $\xi_i$ represents the slack variable; $\nu$ is an upper bound on the fraction of outliers and a lower bound on the fraction of training examples used as support vectors;

According to the Lagrangian transformation, the above classification model is transformed into:

This method creates a hyperplane with parameters w and b that has the largest distance from the origin in the feature space and separates the origin from all data points.
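For reference, the constraint and the parameters $w$, $b$, $\xi_i$ and $\nu$ named above match the textbook $\nu$-parameterized one-class SVM. The objective and the Lagrangian dual below are that textbook form, given here as an assumption because the formulas of the application are not reproduced in this text; $n$ (the number of data points in the cluster), the kernel $K(x_i, x_j) = \varphi(x_i) \cdot \varphi(x_j)$ and the multipliers $\alpha_i$ are symbols introduced for this sketch.

```latex
% Primal problem (textbook form, assumed)
\min_{w,\,\xi,\,b}\ \frac{1}{2}\lVert w \rVert^{2}
  + \frac{1}{\nu n}\sum_{i=1}^{n}\xi_i - b
\quad \text{s.t.}\quad (w \cdot \varphi(x_i)) \ge b - \xi_i,\ \ \xi_i \ge 0

% Dual problem obtained with Lagrange multipliers \alpha_i
\min_{\alpha}\ \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}
  \alpha_i \alpha_j K(x_i, x_j)
\quad \text{s.t.}\quad 0 \le \alpha_i \le \frac{1}{\nu n},\ \ \sum_{i=1}^{n}\alpha_i = 1

% Resulting decision function: +1 for the normal category, -1 for the abnormal category
f(x) = \operatorname{sgn}\!\Big(\sum_{i=1}^{n}\alpha_i K(x_i, x) - b\Big)
```

Under this formulation, setting $\nu$ to the current abnormal point ratio is what ties the classifier to that ratio.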

Through the above method, after a single-class support vector machine is constructed for multiple clusters, each cluster is classified according to its corresponding single-class support vector machine.

S120: Classify the selected clusters according to the single-class support vector machine and the current abnormal point ratio, and obtain the normal point center of the normal category in the classification result.

In this embodiment, take as an example the case where one of the multiple clusters is selected as the target cluster for which the optimal abnormal point ratio is to be obtained. After the selected cluster is classified by its single-class support vector machine according to the initially set current abnormal point ratio, the normal point center corresponding to the data points of the normal category in the classification result can be determined, and this normal point center remains fixed in the subsequent process.

In an embodiment, as shown in FIG. 4, step S120 includes:

S121. Classify the selected cluster according to the corresponding single-class support vector machine and the current abnormal point ratio to obtain a classification result corresponding to the selected cluster, wherein the classification result includes data points of the normal category and data points of the abnormal category;

S122. Obtain the average value corresponding to the data points of the normal category in the classification result to obtain the initial normal point center;

S123. Among the data points of the normal category in the classification result, obtain the data point closest to the initial normal point center, and use it as the normal point center corresponding to the data points of the normal category.

In this embodiment, the selected cluster is first classified according to the single-class support vector machine and the current abnormal point ratio, and a classification result including data points of the normal category and data points of the abnormal category is obtained. To determine the normal point center, the average value of the data points of the normal category is obtained first, and the data point of the normal category closest to this average value is then used as the normal point center. Once the normal point center is fixed, the abnormal point ratio can be adjusted continuously, and the optimal abnormal point ratio can be obtained from the trend of a specified parameter (such as the average Euclidean distance between each data point of the current abnormal category and the normal point center).
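Steps S122 and S123 amount to a mean followed by a nearest-point lookup. The short sketch below illustrates this under the assumptions that data points are numeric tuples and that closeness is measured by squared Euclidean distance; the function name `normal_point_center` is hypothetical.

```python
def normal_point_center(normal_points):
    """Return the normal-category data point closest to the category mean."""
    n = len(normal_points)
    # S122: coordinate-wise average of the normal category (initial normal point center).
    mean = tuple(sum(c) / n for c in zip(*normal_points))
    # S123: the actual data point nearest to that average becomes the normal point center.
    return min(normal_points,
               key=lambda p: sum((a - b) ** 2 for a, b in zip(p, mean)))
```

Choosing an actual data point rather than the raw mean keeps the center inside the observed data even when the normal category is unevenly spread.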

S130. Obtain the residual sum of squares of each data point of the abnormal category in the classification result and the center of the normal point to obtain the current residual sum of squares.

In this embodiment, the residual sum of squares is a measure of goodness of fit in a linear model. It arises in curve fitting, a data-processing method in which a continuous curve is used to approximate discrete points on a plane in order to represent the functional relationship between coordinates. For example, in assessing measurement accuracy and the like, the residual sum of squares is $\sum_{i=1}^{n} V_i^2 = V_1^2 + V_2^2 + \dots + V_n^2$, where $V_i$ is the residual of the measured datum $l_i$; here $V_i$ can represent the residual of the data point $l_i$ of the abnormal category. To determine how far the data points of the abnormal category deviate from the normal point center, the residual sum of squares between each data point of the abnormal category and the normal point center is calculated as the current residual sum of squares, which indicates whether the data points of the abnormal category lie far from the normal point center.
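As a small worked illustration, the residual sum of squares of step S130 can be computed as below. The sketch assumes that the residual of an abnormal data point is its Euclidean distance to the normal point center, which the text suggests but does not state explicitly; the function name is hypothetical.

```python
def residual_sum_of_squares(abnormal_points, center):
    """Sum of squared Euclidean distances from each abnormal point to the center."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(point, center))
        for point in abnormal_points
    )
```

For two abnormal points (3, 4) and (0, 2) and a center at the origin, the squared distances are 25 and 4, so the function returns 29.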

S140: Subtract a preset step length from the current abnormal point ratio to update the current abnormal point ratio.

In this embodiment, the purpose of subtracting the preset step size from the current abnormal point ratio is to continuously adjust the current abnormal point ratio so as to obtain the optimal abnormal point ratio through the trial method.

S150. Classify the selected cluster according to the single-class support vector machine and the current abnormal point ratio to obtain the data points of the current abnormal category, and obtain the residual sum of squares between each data point of the current abnormal category and the normal point center as the next residual sum of squares.

In this embodiment, the current abnormal point ratio is updated by subtracting the step size from the current abnormal point ratio. At this time, there is no need to determine the normal point center again, only the data points of the abnormal category in the classification result are obtained, and then the abnormality is calculated. The residual sum of squares of each data point of the category and the center of the normal point is used as the next residual sum of squares.

S160: Divide the difference between the next residual sum of squares and the current residual sum of squares by the step size to obtain a residual variation range.

In this embodiment, for example, the current residual sum of squares obtained in step S130 is denoted $SSE_0$; the next residual sum of squares obtained in the first execution of step S150 is denoted $SSE_1$; the one obtained in the second execution of step S150 is denoted $SSE_2$ (the corresponding current residual sum of squares then being $SSE_1$); and so on, so that the next residual sum of squares obtained in the N-th execution of step S150 is denoted $SSE_N$ (the corresponding current residual sum of squares then being $SSE_{N-1}$). If the preset step size is denoted $l$, the residual variation range is calculated as $(SSE_N - SSE_{N-1})/l$, where N is a positive integer.

S170. Determine whether the residual variation range exceeds a preset variation range threshold.

In this embodiment, an abrupt change in the residual means that the latest current abnormal point ratio is not the optimal abnormal point ratio; the current abnormal point ratio of the preceding state can instead be regarded as the optimal abnormal point ratio.

S180. If the residual variation range exceeds the variation range threshold, the current abnormal point ratio plus the step size is used as the optimal abnormal point ratio.

In this embodiment, if the residual variation range exceeds the preset variation range threshold, it means that some real abnormal points are classified as normal points, resulting in a sudden increase in the sum of squared residuals from the abnormal point to the normal center point. The last state of the abnormal point ratio (that is, the current abnormal point ratio plus the step size) can be used as the optimal abnormal point ratio.

In an embodiment, as shown in FIG. 5, after step S170, the method further includes:

S190. If the residual variation range does not exceed the variation range threshold, subtract the step size from the current abnormal point ratio to update the current abnormal point ratio, and update the current residual square sum through the next residual square sum, Return to step S150.

In this embodiment, when the residual variation range still changes smoothly, the reduction in the abnormal point ratio has not been enough to significantly affect the residual sum of squares between each data point of the abnormal category and the normal point center. The step size is therefore subtracted from the current abnormal point ratio to update it, and the next residual sum of squares becomes the current residual sum of squares. For example, when $(SSE_N - SSE_{N-1})/l$ does not exceed the preset variation range threshold, first use $SSE_1$ as the current residual sum of squares and $(m_0 - l)$ as the current abnormal point ratio, and return to step S150 to obtain $SSE_2$; when the flow reaches step S170 again, $(SSE_2 - SSE_1)/l$ is used as the residual variation range, and so on, until the residual variation range exceeds the preset variation range threshold.
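The loop formed by steps S120 to S190 can be sketched as follows. Several parts are assumptions made only for illustration: `classify` is a stand-in that flags the given fraction of points farthest from the cluster mean (it is not a one-class SVM), the magnitude of the residual variation is compared with the threshold because the text does not fix a sign convention, and `None` is returned when the threshold is never exceeded. All function names are hypothetical.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def classify(points, ratio):
    """Stand-in classifier (NOT a one-class SVM): the `ratio` fraction of
    points farthest from the cluster mean is labeled abnormal."""
    center = mean_point(points)
    ranked = sorted(points, key=lambda p: sq_dist(p, center))
    split = len(points) - round(ratio * len(points))
    return ranked[:split], ranked[split:]        # (normal, abnormal)

def normal_point_center(normal):
    m = mean_point(normal)
    return min(normal, key=lambda p: sq_dist(p, m))

def sse(points, center):
    return sum(sq_dist(p, center) for p in points)

def optimize_ratio(points, initial_ratio, step, threshold):
    normal, abnormal = classify(points, initial_ratio)    # S120
    center = normal_point_center(normal)                  # stays fixed afterwards
    current_sse = sse(abnormal, center)                   # S130
    i = 1
    # The ratio is recomputed from the step index to avoid float drift.
    while initial_ratio - i * step > 1e-9:
        ratio = initial_ratio - i * step                  # S140 / S190
        _, abnormal = classify(points, ratio)             # S150
        next_sse = sse(abnormal, center)
        variation = abs(next_sse - current_sse) / step    # S160 (magnitude: assumption)
        if variation > threshold:                         # S170
            return ratio + step                           # S180
        current_sse = next_sse                            # S190
        i += 1
    return None  # threshold never exceeded (behavior assumed, not from the text)

# Toy data: eight points near zero and two far outliers (one-dimensional tuples).
data = [(0,), (1,), (-1,), (2,), (-2,), (0.5,), (-0.5,), (1.5,), (50,), (60,)]
best = optimize_ratio(data, initial_ratio=0.5, step=0.1, threshold=1000.0)
# best is roughly 0.2, the share of the two far points in this toy data.
```

In this toy run the variation stays below 100 while ordinary points leave the abnormal category and jumps above 20000 as soon as one of the two far points would be reclassified as normal, so the search stops one step earlier.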

S181. Classify the selected clusters according to the single classification support vector machine and the optimal anomaly point ratio to obtain an optimal classification result.

In this embodiment, after the optimal abnormal point ratio is determined, the selected cluster can be classified according to the single-class support vector machine and the optimal abnormal point ratio to obtain the optimal classification result, together with the unsupervised classification model that has the best classification effect.

In an embodiment, after step S181, the method further includes:

Send the optimal classification result and the optimal abnormal point ratio to the business end corresponding to the set of data points to be classified, and simultaneously send the optimal classification result and the optimal abnormal point ratio to the cloud server;

The storage area corresponding to the optimal classification result and the optimal abnormal point ratio is formatted and deleted.

In this embodiment, once the optimal classification result and the optimal abnormal point ratio corresponding to the set of data points to be classified have been obtained in the server, they are sent to the business end corresponding to the set of data points to be classified, so that the business end is effectively notified of the classification result.

Moreover, in order to reduce the data storage pressure on the server, the optimal classification result and the optimal abnormal point ratio can be sent to the cloud server in time, so that the cloud server effectively stores the optimal classification result and the optimal abnormal point ratio corresponding to the set of data points to be classified. In this process, the set of data points to be classified corresponding to the optimal classification result and the optimal abnormal point ratio may also be synchronized to the cloud server. When the set of data points to be classified, the optimal classification result, and the optimal abnormal point ratio are synchronized from the server to the cloud server, the unique machine identification code (such as the IMEI serial number) of the business end must be used as the data identification bit to uniquely identify the data.

After the optimal classification result and the optimal abnormal point ratio have been synchronously sent to the cloud server, the storage area in the server corresponding to the optimal classification result and the optimal abnormal point ratio can be formatted and deleted to effectively release storage space.

In an embodiment, before formatting and deleting the storage area corresponding to the optimal classification result and the optimal abnormal point ratio, the method further includes:

Dividing the difference between the preset current abnormal point ratio and the optimal abnormal point ratio by the step length to obtain the number of iterations;

The number of iterations is sent to the business end corresponding to the set of data points to be classified, and the number of iterations is synchronously sent to the cloud server.

In this embodiment, in order to know clearly how many iterations were needed to go from the preset current abnormal point ratio to the optimal abnormal point ratio, the difference between the preset current abnormal point ratio and the optimal abnormal point ratio is divided by the step size to obtain the number of iterations. Once the number of iterations is known, it can be sent to the business end corresponding to the set of data points to be classified, and the business end can accumulate experience in setting the optimal abnormal point ratio accordingly.
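The iteration count described here is a single division. The sketch below rounds the quotient to the nearest integer, which is an assumption added to absorb floating-point error; the function name is hypothetical.

```python
def iteration_count(initial_ratio, optimal_ratio, step):
    """Number of step-size reductions between the preset and the optimal ratio."""
    return round((initial_ratio - optimal_ratio) / step)
```

For a preset ratio of 0.5, an optimal ratio of 0.2 and a step size of 0.1, the function returns 3.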

This method realizes the accurate classification of massive data and the detection of abnormal points in each classification. The proportion of abnormal points in the detection process is automatically adjusted and obtained without setting based on experience.

The embodiment of the present application also provides a device for optimizing the proportion of abnormal points based on clustering and SSE, which is used to perform any of the foregoing embodiments of the method for optimizing the proportion of abnormal points based on clustering and SSE. Specifically, please refer to FIG. 6, which is a schematic block diagram of an abnormal point ratio optimization device based on clustering and SSE provided in an embodiment of the present application. The device 100 for optimizing the proportion of abnormal points based on clustering and SSE may be configured in a server.

As shown in FIG. 6, the device 100 for optimizing the proportion of abnormal points based on clustering and SSE includes a clustering unit 101, a multi-model construction unit 110, a normal point center obtaining unit 120, a first residual calculation unit 130, a first ratio update unit 140, a second residual calculation unit 150, an amplitude calculation unit 160, a judgment unit 170, an optimal ratio acquisition unit 180, and an optimal classification unit 181.

The clustering unit 101 is configured to receive a set of data points to be classified, and cluster the set of data points to be classified through k-means clustering to obtain multiple clusters.

In an embodiment, as shown in FIG. 7, the clustering unit 101 includes:

The initial cluster center obtaining unit 1011 is used to select the same number of data points as the preset number of cluster clusters from a plurality of data point sets to be classified, and use the selected data points as the initial cluster centers of the clusters;

The initial clustering unit 1012 is configured to divide the set of data points to be classified according to the difference between each data point in the set of data points to be classified and each initial cluster center to obtain an initial clustering result;

The cluster center adjustment unit 1013 is configured to obtain the adjusted cluster center of each cluster according to the initial clustering result;

The cluster adjustment unit 1014 is configured to re-divide the set of data points to be classified according to the difference between each data point and each adjusted cluster center, until the clustering result has remained unchanged for more than the preset number of times, to obtain the clusters corresponding to the preset number of clusters.
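The four sub-units above follow the usual k-means loop. The following is a minimal pure-Python sketch, not the application's implementation: the function name `kmeans`, the parameters `stable_rounds`, `max_iter` and `seed`, the random choice of initial centers, and the use of Euclidean distance as the "difference" are all assumptions made for illustration.

```python
import math
import random


def kmeans(points, k, stable_rounds=3, max_iter=100, seed=0):
    """Cluster a list of coordinate tuples into k clusters; returns (labels, centers)."""
    rng = random.Random(seed)
    # Initial cluster centers: k data points selected from the input set.
    centers = rng.sample(points, k)
    labels, unchanged = None, 0
    for _ in range(max_iter):
        # Assign every point to the nearest current center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        # Count consecutive rounds in which the clustering result did not change.
        unchanged = unchanged + 1 if new_labels == labels else 0
        labels = new_labels
        if unchanged > stable_rounds:
            break
        # Adjust each center to the mean of its current members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centers


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    print(kmeans(data, k=2)[0])
```

Stopping only after the assignment has stayed the same for more than `stable_rounds` rounds mirrors the stopping condition described for unit 1014; a production implementation would more likely rely on an existing library routine.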

The multi-model construction unit 110 is used to obtain the data points corresponding to each of the multiple clusters, and to construct, according to the preset current abnormal point ratio and each cluster, a single-class support vector machine for abnormal point detection in one-to-one correspondence with each cluster.

In an embodiment, as shown in FIG. 8, the multi-model construction unit 110 includes:

The classification parameter obtaining unit 111 is configured to obtain the first parameter and the second parameter of the hyperplane corresponding to the single classification support vector machine of each cluster according to the preset current abnormal point ratio and each cluster;

The model acquisition unit 112 is configured to construct a single-class support vector machine for abnormal point detection in a one-to-one correspondence with each cluster according to the first parameter and the second parameter of the hyperplane and the current abnormal point ratio.
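For orientation, the standard one-class support vector machine formulation is reproduced below. Reading the "first parameter" and "second parameter" of the hyperplane as the normal vector $w$ and the offset $\rho$, and the abnormal point ratio as $\nu$, is an assumption of this note rather than something this passage states; $n$ (number of data points in the cluster), $\Phi$ (feature map) and $\xi_i$ (slack variables) are the usual symbols of that formulation.

```latex
\min_{w,\,\xi,\,\rho}\quad \frac{1}{2}\lVert w\rVert^{2}
  + \frac{1}{\nu n}\sum_{i=1}^{n}\xi_i - \rho
\qquad \text{s.t.}\quad \langle w, \Phi(x_i)\rangle \ge \rho - \xi_i,\quad \xi_i \ge 0,\quad i = 1,\dots,n

f(x) = \operatorname{sgn}\bigl(\langle w, \Phi(x)\rangle - \rho\bigr)
```

In this formulation $\nu \in (0, 1]$ is an upper bound on the fraction of training points classified as outliers, which is why it can play the role of a tunable abnormal point ratio.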

The normal point center obtaining unit 120 is configured to classify the selected cluster according to the single-class support vector machine and the current abnormal point ratio to obtain the normal point center of the normal category in the classification result.

In an embodiment, as shown in FIG. 9, the normal point center obtaining unit 120 includes:

The initial classification unit 121 is configured to classify the selected cluster according to the corresponding single-class support vector machine and the current proportion of abnormal points to obtain a classification result corresponding to the selected cluster, wherein the classification result includes data points of the normal category and data points of the abnormal category;

The distance average calculation unit 122 is configured to obtain the average value corresponding to the data points of the normal category in the classification result to obtain the initial normal point center;

The normal point center adjustment unit 123 is configured to obtain the data point closest to the initial normal point center among the data points of the normal category in the classification result as the normal point center corresponding to the data points of the normal category.
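Units 122 and 123 can be read as a two-step computation: average the normal-category points, then snap that average to the nearest actual data point. A minimal sketch under that reading follows; the function name `normal_point_center` and the use of Euclidean distance are assumptions.

```python
import math


def normal_point_center(normal_points):
    """Return the normal-category data point closest to the mean of all such points."""
    # Initial normal point center: coordinate-wise average (unit 122).
    mean = tuple(sum(x) / len(normal_points) for x in zip(*normal_points))
    # Adjusted center: the actual data point nearest to that average (unit 123).
    return min(normal_points, key=lambda p: math.dist(p, mean))


if __name__ == "__main__":
    pts = [(0.0, 0.0), (2.0, 0.0), (1.0, 0.2), (1.0, 3.0)]
    print(normal_point_center(pts))
```

Using an existing data point rather than the raw average keeps the center inside the observed data, similar to a medoid.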

The first residual calculation unit 130 is configured to obtain the residual sum of squares between each data point of the abnormal category in the classification result and the normal point center, to obtain the current residual sum of squares.

The first ratio update unit 140 is configured to subtract a preset step size from the current abnormal point ratio to update the current abnormal point ratio.

The second residual calculation unit 150 is configured to classify the selected cluster according to the single-class support vector machine and the current abnormal point ratio to obtain the data points of the current abnormal category, and to obtain the residual sum of squares between each data point of the current abnormal category and the normal point center as the next residual sum of squares.

The amplitude calculation unit 160 is configured to divide the difference between the next residual sum of squares and the current residual sum of squares by the step size to obtain the residual variation amplitude.

The judgment unit 170 is configured to determine whether the residual variation range exceeds a preset variation range threshold.

The optimal ratio acquisition unit 180 is configured to, if the residual variation range exceeds the variation range threshold, use the current abnormal point ratio plus the step length as the optimal abnormal point ratio.

In an embodiment, as shown in FIG. 10, the device 100 for optimizing the proportion of abnormal points based on clustering and SSE further includes:

The second ratio update unit 190 is configured to, if the residual variation range does not exceed the variation range threshold, subtract the step size from the current abnormal point ratio to update the current abnormal point ratio, update the current residual sum of squares with the next residual sum of squares, and return to the step of classifying the sample to be classified according to the single-class support vector machine and the current abnormal point ratio to obtain the data points of the current abnormal category, and obtaining the residual sum of squares between each data point of the current abnormal category and the normal point center as the next residual sum of squares.
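The interplay of units 130 to 190 amounts to a search loop over the abnormal point ratio, sketched below under several loudly stated assumptions. The classifier `split_by_ratio` is a stand-in that is NOT a single-class support vector machine (it simply flags the given fraction of points farthest from the mean); the magnitude of the change in the residual sum of squares is used for the threshold test; the normal point center from the first classification is reused throughout; and the guard that stops before the ratio reaches zero is added only so the sketch terminates. All function and parameter names are hypothetical.

```python
import math


def center_of(normal):
    """Normal-category point nearest to the mean of the normal-category points."""
    mean = tuple(sum(x) / len(normal) for x in zip(*normal))
    return min(normal, key=lambda p: math.dist(p, mean))


def split_by_ratio(points, ratio):
    """Stand-in classifier (not an SVM): flag the `ratio` fraction farthest from the mean."""
    mean = tuple(sum(x) / len(points) for x in zip(*points))
    ranked = sorted(points, key=lambda p: math.dist(p, mean))
    cut = len(points) - round(ratio * len(points))
    return ranked[:cut], ranked[cut:]  # (normal, abnormal)


def sse(abnormal, center):
    """Residual sum of squares between the abnormal points and the normal point center."""
    return sum(math.dist(p, center) ** 2 for p in abnormal)


def search_optimal_ratio(points, ratio, step, threshold, classify=split_by_ratio):
    normal, abnormal = classify(points, ratio)
    center = center_of(normal)
    current = sse(abnormal, center)          # current residual sum of squares
    while ratio - step > 1e-12:              # termination guard (assumption)
        ratio -= step                        # ratio update
        _, abnormal = classify(points, ratio)
        nxt = sse(abnormal, center)          # next residual sum of squares
        amplitude = abs(nxt - current) / step  # magnitude of change (assumption)
        if amplitude > threshold:
            return ratio + step              # optimal ratio: current ratio plus step
        current = nxt
    return ratio                             # fallback if the threshold is never exceeded


if __name__ == "__main__":
    cluster = [(i * 0.1, j * 0.1) for i in range(4) for j in range(4)]
    outliers = [(10.0, 10.0), (10.0, -10.0), (-10.0, 10.0), (-10.0, -10.0)]
    print(search_optimal_ratio(cluster + outliers, 0.4, 0.1, 100.0))
```

In the usage example, 4 of 20 points are far from the rest, so the residual sum of squares changes sharply once the ratio drops below 0.2 and true outliers start being treated as normal; the search then backs up by one step.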

The optimal classification unit 181 is configured to classify the selected clusters according to the single classification support vector machine and the optimal anomaly point ratio to obtain an optimal classification result.

The device realizes accurate classification of massive data and detection of abnormal points in each classification, and the proportion of abnormal points in the detection process is automatically adjusted and obtained without setting based on experience.

The above-mentioned device for optimizing the proportion of abnormal points based on clustering and SSE can be implemented in the form of a computer program, which can be run on a computer device as shown in FIG.

Please refer to FIG. 11, which is a schematic block diagram of a computer device according to an embodiment of the present application. The computer device 500 is a server, and the server may be an independent server or a server cluster composed of multiple servers.

Referring to FIG. 11, the computer device 500 includes a processor 502, a memory, and a network interface 505 connected through a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.

The non-volatile storage medium 503 can store an operating system 5031 and a computer program 5032. When the computer program 5032 is executed, the processor 502 can execute the method for optimizing the proportion of abnormal points based on clustering and SSE.

The processor 502 is used to provide calculation and control capabilities, and support the operation of the entire computer device 500.

The internal memory 504 provides an environment for the operation of the computer program 5032 in the non-volatile storage medium 503. When the computer program 5032 is executed by the processor 502, the processor 502 can execute the method for optimizing the proportion of abnormal points based on clustering and SSE.

The network interface 505 is used for network communication, such as providing data information transmission. Those skilled in the art can understand that the structure shown in FIG. 11 is only a block diagram of part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device 500 to which the solution of the present application is applied. The specific computer device 500 may include more or fewer components than shown in the figure, or combine certain components, or have a different component arrangement.

The processor 502 is configured to run a computer program 5032 stored in a memory to implement the method for optimizing the proportion of abnormal points based on clustering and SSE disclosed in the embodiments of the present application.

Those skilled in the art can understand that the embodiment of the computer device shown in FIG. 11 does not constitute a limitation on the specific configuration of the computer device. In other embodiments, the computer device may include more or fewer components than those shown in the figure, combine certain components, or have a different component arrangement. For example, in some embodiments, the computer device may only include a memory and a processor. In such embodiments, the structures and functions of the memory and the processor are consistent with the embodiment shown in FIG. 11, and will not be repeated here.

It should be understood that, in this embodiment of the application, the processor 502 may be a central processing unit (Central Processing Unit, CPU), and the processor 502 may also be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor.

In another embodiment of the present application, a computer-readable storage medium is provided. The computer-readable storage medium may be a non-volatile computer-readable storage medium. The computer-readable storage medium stores a computer program, where the computer program is executed by a processor to implement the method for optimizing the proportion of abnormal points based on clustering and SSE disclosed in the embodiments of the present application.

The storage medium is a physical, non-transitory storage medium, such as a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a magnetic disk, an optical disk, or any other medium that can store program code.

Those skilled in the art can clearly understand that, for the convenience and conciseness of description, the specific working process of the equipment, device and unit described above can refer to the corresponding process in the foregoing method embodiment, which will not be repeated here.

The above are only specific implementations of this application, but the protection scope of this application is not limited to this. Anyone familiar with the technical field can easily think of various equivalent modifications or replacements within the technical scope disclosed in this application, and these modifications or replacements shall be covered within the protection scope of this application. Therefore, the protection scope of this application should be subject to the protection scope of the claims.

Claims

An optimization method for the proportion of abnormal points based on clustering and SSE, including:

Receiving a set of data points to be classified, and clustering the set of data points to be classified through k-means clustering to obtain multiple clusters;

Obtaining the data points corresponding to each of the multiple clusters, and constructing, according to the preset current abnormal point ratio and each cluster, a single-class support vector machine for abnormal point detection in one-to-one correspondence with each cluster;

Classify the selected clusters according to the single-class support vector machine and the current abnormal point ratio to obtain the normal point center of the normal category in the classification result;

Obtaining the residual sum of squares of each data point of the abnormal category in the classification result and the center of the normal point to obtain the current residual sum of squares;

Subtract the preset step size from the current abnormal point ratio to update the current abnormal point ratio;

Classifying the selected clusters according to the single-class support vector machine and the current abnormal point ratio to obtain the data points of the current abnormal category, and obtaining the residual sum of squares between each data point of the current abnormal category and the normal point center as the next residual sum of squares;

By dividing the difference between the next residual sum of squares and the current residual sum of squares by the step size, the residual variation range is obtained;

Determine whether the residual variation range exceeds a preset variation range threshold;

If the residual variation range exceeds the variation range threshold, the current abnormal point ratio plus the step length is used as the optimal abnormal point ratio; and

The selected clusters are classified according to the single-class support vector machine and the optimal proportion of abnormal points to obtain the optimal classification result.
The method for optimizing the proportion of abnormal points based on clustering and SSE according to claim 1, wherein the clustering of the set of data points to be classified by k-means clustering to obtain multiple clusters comprises:

Select the same number of data points as the preset number of cluster clusters from a plurality of data point sets to be classified, and use the selected data point as the initial cluster center of each cluster;

Dividing the set of data points to be classified according to the difference between each data point in the set of data points to be classified and each initial cluster center to obtain an initial clustering result;

According to the initial clustering results, obtain the adjusted cluster center of each cluster;

Re-dividing the set of data points to be classified according to the difference between each data point and each adjusted cluster center, until the clustering result has remained unchanged for more than the preset number of times, to obtain the clusters corresponding to the preset number of clusters.
The method for optimizing the proportion of abnormal points based on clustering and SSE according to claim 1, wherein after determining whether the residual variation range exceeds a preset variation range threshold, the method further comprises:

If the residual variation range does not exceed the variation range threshold, subtracting the step size from the current abnormal point ratio to update the current abnormal point ratio, updating the current residual sum of squares with the next residual sum of squares, and returning to the step of classifying the sample to be classified according to the single-class support vector machine and the current abnormal point ratio to obtain the data points of the current abnormal category, and obtaining the residual sum of squares between each data point of the current abnormal category and the normal point center as the next residual sum of squares.
The method for optimizing the proportion of abnormal points based on clustering and SSE according to claim 1, wherein the constructing, according to the preset current abnormal point ratio and each cluster, of a single-class support vector machine for abnormal point detection in one-to-one correspondence with each cluster comprises:

Obtain the first parameter and the second parameter of the hyperplane corresponding to the single-class support vector machine of each cluster according to the preset current abnormal point ratio and each cluster;

According to the first parameter and the second parameter of the hyperplane, and the current abnormal point ratio, a single-class support vector machine for abnormal point detection corresponding to each cluster is constructed.
The method for optimizing the proportion of abnormal points based on clustering and SSE according to claim 1, wherein the classifying of the selected cluster according to the single-class support vector machine and the current abnormal point ratio to obtain the normal point center of the normal category in the classification result comprises:

The selected clusters are classified according to the corresponding single-class support vector machine and the current proportion of abnormal points to obtain the classification results corresponding to the selected clusters, wherein the classification results include data points of the normal category and data points of the abnormal category;

Obtaining the average value corresponding to the normal category data points in the classification result to obtain the initial normal point center;

Obtaining, among the data points of the normal category in the classification result, the data point closest to the initial normal point center, and using it as the normal point center corresponding to the data points of the normal category.
The method for optimizing the proportion of abnormal points based on clustering and SSE according to claim 1, wherein after the classifying of the selected clusters according to the single-class support vector machine and the optimal abnormal point ratio to obtain the optimal classification result, the method further comprises:

Send the optimal classification result and the optimal abnormal point ratio to the business end corresponding to the set of data points to be classified, and simultaneously send the optimal classification result and the optimal abnormal point ratio to the cloud server;

Formatting and deleting the storage area corresponding to the optimal classification result and the optimal abnormal point ratio.
The method for optimizing the proportion of abnormal points based on clustering and SSE according to claim 6, wherein before formatting and deleting the storage area corresponding to the optimal classification result and the optimal proportion of abnormal points, the method further comprises:

Dividing the difference between the preset current abnormal point ratio and the optimal abnormal point ratio by the step length to obtain the number of iterations;

The number of iterations is sent to the business end corresponding to the set of data points to be classified, and the number of iterations is synchronously sent to the cloud server.
A device for optimizing the proportion of abnormal points based on clustering and SSE, including:

The clustering unit is configured to receive a set of data points to be classified, and cluster the set of data points to be classified through k-means clustering to obtain multiple clusters;

The multi-model construction unit is used to obtain the data points corresponding to each of the multiple clusters, and to construct, according to the preset current abnormal point ratio and each cluster, a single-class support vector machine for abnormal point detection in one-to-one correspondence with each cluster;

The normal point center obtaining unit is used to classify the selected cluster according to the single-class support vector machine and the current abnormal point ratio to obtain the normal point center of the normal category in the classification result;

The first residual calculation unit is configured to obtain the residual sum of squares of each data point of the abnormal category in the classification result and the center of the normal point to obtain the current residual sum of squares;

The first ratio update unit is configured to subtract a preset step size from the current abnormal point ratio to update the current abnormal point ratio;

The second residual calculation unit is used to classify the selected clusters according to the single-class support vector machine and the current abnormal point ratio to obtain the data points of the current abnormal category, and to obtain the residual sum of squares between each data point of the current abnormal category and the normal point center as the next residual sum of squares;

An amplitude calculation unit, configured to divide the difference between the next residual sum of squares and the current residual sum of squares by the step size to obtain the residual variation range;

A judging unit for judging whether the residual variation range exceeds a preset variation range threshold;

An optimal ratio obtaining unit, configured to, if the residual variation range exceeds the variation range threshold, use the current abnormal point ratio plus the step length as the optimal abnormal point ratio; and

The optimal classification unit is used to classify the selected clusters according to the single classification support vector machine and the optimal abnormal point ratio to obtain the optimal classification result.
The apparatus for optimizing the proportion of abnormal points based on clustering and SSE according to claim 8, wherein the clustering unit comprises:

The initial cluster center obtaining unit is used to select the same number of data points as the preset number of cluster clusters from a plurality of data point sets to be classified, and use the selected data points as the initial cluster center of each cluster;

The initial clustering unit is used to divide the set of data points to be classified according to the difference between each data point in the set of data points to be classified and each initial cluster center to obtain an initial clustering result;

The cluster center adjustment unit is used to obtain the adjusted cluster center of each cluster according to the initial clustering result;

The cluster adjustment unit is used to re-divide the set of data points to be classified according to the difference between each data point and each adjusted cluster center, until the clustering result has remained unchanged for more than the preset number of times, to obtain the clusters corresponding to the preset number of clusters.
The device for optimizing the proportion of abnormal points based on clustering and SSE according to claim 8, further comprising:

The second ratio update unit is used to, if the residual variation range does not exceed the variation range threshold, subtract the step size from the current abnormal point ratio to update the current abnormal point ratio, update the current residual sum of squares with the next residual sum of squares, and return to the step of classifying the sample to be classified according to the single-class support vector machine and the current abnormal point ratio to obtain the data points of the current abnormal category, and obtaining the residual sum of squares between each data point of the current abnormal category and the normal point center as the next residual sum of squares.
A computer device includes a memory, a processor, and a computer program that is stored on the memory and can run on the processor, and the processor implements the following steps when the processor executes the computer program:

Receiving a set of data points to be classified, and clustering the set of data points to be classified through k-means clustering to obtain multiple clusters;

Obtaining the data points corresponding to each of the multiple clusters, and constructing, according to the preset current abnormal point ratio and each cluster, a single-class support vector machine for abnormal point detection in one-to-one correspondence with each cluster;

Classify the selected clusters according to the single-class support vector machine and the current abnormal point ratio to obtain the normal point center of the normal category in the classification result;

Obtaining the residual sum of squares of each data point of the abnormal category in the classification result and the center of the normal point to obtain the current residual sum of squares;

Subtract the preset step size from the current abnormal point ratio to update the current abnormal point ratio;

Classifying the selected clusters according to the single-class support vector machine and the current abnormal point ratio to obtain the data points of the current abnormal category, and obtaining the residual sum of squares between each data point of the current abnormal category and the normal point center as the next residual sum of squares;

By dividing the difference between the next residual sum of squares and the current residual sum of squares by the step size, the residual variation range is obtained;

Determine whether the residual variation range exceeds a preset variation range threshold;

If the residual variation range exceeds the variation range threshold, the current abnormal point ratio plus the step length is used as the optimal abnormal point ratio; and

The selected clusters are classified according to the single-class support vector machine and the optimal proportion of abnormal points to obtain the optimal classification result.
The computer device according to claim 11, wherein the clustering of the set of data points to be classified by k-means clustering to obtain multiple clusters comprises:

Select the same number of data points as the preset number of cluster clusters from a plurality of data point sets to be classified, and use the selected data point as the initial cluster center of each cluster;

Dividing the set of data points to be classified according to the difference between each data point in the set of data points to be classified and each initial cluster center to obtain an initial clustering result;

According to the initial clustering results, obtain the adjusted cluster center of each cluster;

Re-dividing the set of data points to be classified according to the difference between each data point and each adjusted cluster center, until the clustering result has remained unchanged for more than the preset number of times, to obtain the clusters corresponding to the preset number of clusters.
The computer device according to claim 11, wherein after determining whether the residual variation range exceeds a preset variation range threshold, the method further comprises:

If the residual variation range does not exceed the variation range threshold, subtracting the step size from the current abnormal point ratio to update the current abnormal point ratio, updating the current residual sum of squares with the next residual sum of squares, and returning to the step of classifying the sample to be classified according to the single-class support vector machine and the current abnormal point ratio to obtain the data points of the current abnormal category, and obtaining the residual sum of squares between each data point of the current abnormal category and the normal point center as the next residual sum of squares.
The computer device according to claim 11, wherein the constructing, according to the preset current abnormal point ratio and each cluster, of a single-class support vector machine for abnormal point detection in one-to-one correspondence with each cluster comprises:

Obtain the first parameter and the second parameter of the hyperplane corresponding to the single-class support vector machine of each cluster according to the preset current abnormal point ratio and each cluster;

According to the first parameter and the second parameter of the hyperplane, and the current abnormal point ratio, a single-class support vector machine for abnormal point detection corresponding to each cluster is constructed.
The computer device according to claim 11, wherein the classifying of the selected cluster according to the single-class support vector machine and the current abnormal point ratio to obtain the normal point center of the normal category in the classification result comprises:

The selected clusters are classified according to the corresponding single-class support vector machine and the current abnormal point ratio to obtain the classification results corresponding to the selected clusters, wherein the classification results include data points of the normal category and data points of the abnormal category;

Obtaining the average value corresponding to the normal category data points in the classification result to obtain the initial normal point center;

Obtaining, among the data points of the normal category in the classification result, the data point closest to the initial normal point center, and using it as the normal point center corresponding to the data points of the normal category.
The computer device according to claim 11, wherein after the classifying of the selected clusters according to the single-class support vector machine and the optimal abnormal point ratio to obtain the optimal classification result, the following steps are further implemented:

Send the optimal classification result and the optimal abnormal point ratio to the business end corresponding to the set of data points to be classified, and simultaneously send the optimal classification result and the optimal abnormal point ratio to the cloud server;

Formatting and deleting the storage area corresponding to the optimal classification result and the optimal abnormal point ratio.
The computer device according to claim 16, wherein before formatting and deleting the storage area corresponding to the optimal classification result and the optimal proportion of abnormal points, the following steps are further implemented:

Dividing the difference between the preset current abnormal point ratio and the optimal abnormal point ratio by the step length to obtain the number of iterations;

The number of iterations is sent to the business end corresponding to the set of data points to be classified, and the number of iterations is synchronously sent to the cloud server.
A computer-readable storage medium that stores a computer program that, when executed by a processor, causes the processor to perform the following operations:

Receiving a set of data points to be classified, and clustering the set of data points to be classified through k-means clustering to obtain multiple clusters;

Obtain the data points corresponding to each of the multiple clusters, and, according to the preset current abnormal point ratio and each cluster, construct a single-class support vector machine for abnormal point detection in one-to-one correspondence with each cluster;

Classify the selected clusters according to the single-class support vector machine and the current abnormal point ratio to obtain the normal point center of the normal category in the classification result;

Obtaining the residual sum of squares between each abnormal-category data point in the classification result and the normal point center, as the current residual sum of squares;

Subtract the preset step size from the current abnormal point ratio to update the current abnormal point ratio;

Classify the selected clusters according to the single-class support vector machine and the current abnormal point ratio to obtain the data points of the current abnormal category, and obtain the residual sum of squares between each data point of the current abnormal category and the normal point center as the next residual sum of squares;

By dividing the difference between the next residual sum of squares and the current residual sum of squares by the step size, the residual variation range is obtained;

Determine whether the residual variation range exceeds a preset variation range threshold;

If the residual variation range exceeds the variation range threshold, the current abnormal point ratio plus the step length is used as the optimal abnormal point ratio; and

The selected clusters are classified according to the single-class support vector machine and the optimal proportion of abnormal points to obtain the optimal classification result.
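The ratio search in the operations above can be sketched for a single cluster as follows. This is an illustration under several assumptions, not the claimed implementation. `split_by_ratio` is a hypothetical stand-in that marks the given share of points farthest from the mean as abnormal; it is not a single-class support vector machine, and in practice a one-class SVM whose parameter bounds the outlier fraction would take its place. The sketch also assumes Euclidean distance, compares the absolute value of the residual variation with the threshold, and stops once the ratio can no longer be reduced.

```python
import math


def split_by_ratio(points, ratio):
    """Stand-in classifier (NOT a one-class SVM): the `ratio` share of points
    farthest from the overall mean is labeled abnormal."""
    n = len(points)
    mean = tuple(sum(c) / n for c in zip(*points))
    ranked = sorted(points, key=lambda p: math.dist(p, mean))
    cut = n - round(ratio * n)
    return ranked[:cut], ranked[cut:]  # (normal, abnormal)


def normal_center(normal):
    """Normal point nearest to the mean of the normal points."""
    mean = tuple(sum(c) / len(normal) for c in zip(*normal))
    return min(normal, key=lambda p: math.dist(p, mean))


def sse(points, center):
    """Residual sum of squares between the points and the center."""
    return sum(math.dist(p, center) ** 2 for p in points)


def optimal_ratio(points, ratio, step, threshold, classify=split_by_ratio):
    """Lower the abnormal point ratio step by step until the SSE jumps."""
    normal, abnormal = classify(points, ratio)
    center = normal_center(normal)          # fixed normal point center
    current_sse = sse(abnormal, center)     # current residual sum of squares
    while ratio - step > 0:                 # assumed guard: keep ratio positive
        ratio -= step
        _, abnormal = classify(points, ratio)
        next_sse = sse(abnormal, center)    # next residual sum of squares
        # Residual variation range; taking the absolute value is an assumption.
        variation = abs(next_sse - current_sse) / step
        if variation > threshold:
            return ratio + step             # last ratio before the jump
        current_sse = next_sse
    return ratio                            # assumed fallback: no jump found
```

For ten one-dimensional points 0 through 7 plus 50 and 60, starting at a ratio of 0.3 with a step of 0.1 and a threshold of 1000, the variation stays small while the point 0 is relabeled as normal and jumps once 50 would be relabeled, so the search returns approximately 0.2, matching the two evident outliers.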
The computer-readable storage medium according to claim 18, wherein the clustering of the set of data points to be classified by k-means clustering to obtain multiple clusters comprises:

Select, from the set of data points to be classified, as many data points as the preset number of clusters, and use the selected data points as the initial cluster centers of the respective clusters;

Dividing the set of data points to be classified according to the difference between each data point in the set of data points to be classified and each initial cluster center to obtain an initial clustering result;

According to the initial clustering results, obtain the adjusted cluster center of each cluster;

According to the adjusted cluster centers, divide the set of data points to be classified again according to the difference between each data point and each adjusted cluster center, until the clustering result has remained unchanged for more than a preset number of times, thereby obtaining the clusters corresponding to the preset number of clusters.
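The k-means steps above can be sketched in plain Python as follows. The sketch makes assumptions the claim leaves open: the "difference" between a point and a center is taken to be Euclidean distance, the initial centers are drawn at random with a fixed seed, and a `max_iter` cap is added as a safeguard. The names `k_means`, `stable_rounds`, `seed` and `max_iter` are hypothetical.

```python
import math
import random


def k_means(points, k, stable_rounds=1, seed=0, max_iter=100):
    """Cluster `points` (numeric tuples) into k clusters.

    Stops once the assignment has stayed unchanged for more than
    `stable_rounds` consecutive rounds, or after `max_iter` rounds.
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers are actual data points
    labels, unchanged = None, 0
    for _ in range(max_iter):
        # Assign every point to its nearest cluster center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        unchanged = unchanged + 1 if new_labels == labels else 0
        labels = new_labels
        if unchanged > stable_rounds:
            break
        # Adjust each center to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    clusters = [[p for p, lab in zip(points, labels) if lab == c] for c in range(k)]
    return clusters, centers
```

Calling `k_means(points, 2)` on two well-separated groups of points returns one list per group together with the final centers. Each returned cluster would then be handed to its own single-class support vector machine.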
18. The computer-readable storage medium according to claim 18, wherein after determining whether the residual variation range exceeds a preset variation range threshold, the method further comprises:

If the residual variation range does not exceed the variation range threshold, subtract the step size from the current abnormal point ratio to update the current abnormal point ratio, update the current residual sum of squares with the next residual sum of squares, and return to the step of classifying the selected clusters according to the single-class support vector machine and the current abnormal point ratio to obtain the data points of the current abnormal category, and obtaining the residual sum of squares between each data point of the current abnormal category and the normal point center as the next residual sum of squares.