WO2020155756A1 - Method and device for optimizing abnormal point proportion based on clustering and SSE

Info

Publication number
WO2020155756A1
Authority
WO
WIPO (PCT)
Prior art keywords
cluster
abnormal
point
current
clustering
Application number
PCT/CN2019/117363
Other languages
French (fr)
Chinese (zh)
Inventor
杨志鸿
徐亮
阮晓雯
Original Assignee
平安科技(深圳)有限公司
Application filed by 平安科技(深圳)有限公司
Publication of WO2020155756A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines

Definitions

  • This application relates to the technical field of intelligent decision-making, and in particular to a method and device for optimizing the proportion of abnormal points based on clustering and SSE.
  • Outlier analysis is the process of checking whether the data contains input errors or unreasonable values. Ignoring outliers is dangerous: including them in the calculation and analysis of the data without eliminating them will adversely affect the results.
  • the embodiments of the present application provide a method, device, computer equipment and storage medium for optimizing the proportion of abnormal points based on clustering and SSE, aiming to solve the following problem in the prior art: massive user data often contains multiple normal point centers, so performing outlier detection without first dividing the data results in poor discrimination by the unsupervised model used for outlier detection and makes fine-grained detection of outlier data impossible.
  • an embodiment of the present application provides a method for optimizing the proportion of abnormal points based on clustering and SSE, which includes:
  • the residual variation range is obtained
  • the current abnormal point ratio plus the step length is used as the optimal abnormal point ratio
  • the selected clusters are classified according to the single-class support vector machine and the optimal proportion of abnormal points to obtain the optimal classification result.
  • an embodiment of the present application provides a device for optimizing the proportion of abnormal points based on clustering and SSE, which includes:
  • the clustering unit is configured to receive a set of data points to be classified, and cluster the set of data points to be classified through k-means clustering to obtain multiple clusters;
  • the multi-model construction unit is used to obtain the data points corresponding to each cluster included in the multiple clusters, and to construct, according to the preset current abnormal point ratio and each cluster, a single-class support vector machine for outlier detection in one-to-one correspondence with each cluster;
  • the normal point center obtaining unit is used to classify the selected cluster according to the single-class support vector machine and the current abnormal point ratio to obtain the normal point center of the normal category in the classification result;
  • the first residual calculation unit is configured to obtain the residual sum of squares of each data point of the abnormal category in the classification result and the center of the normal point to obtain the current residual sum of squares;
  • the first ratio update unit is configured to subtract a preset step size from the current abnormal point ratio to update the current abnormal point ratio;
  • the second residual calculation unit is used to classify the selected clusters according to the single-class support vector machine and the current abnormal point ratio to obtain data points of the current abnormal category, and to take the residual sum of squares between each data point of the current abnormal category and the center of the normal point as the next residual sum of squares;
  • the amplitude calculation unit is configured to divide the difference between the next residual sum of squares and the current residual sum of squares by the step size to obtain the residual variation range;
  • the optimal ratio obtaining unit is configured to, if the residual variation range exceeds the variation range threshold, use the current abnormal point ratio plus the step length as the optimal abnormal point ratio;
  • the optimal classification unit is used to classify the selected clusters according to the single classification support vector machine and the optimal abnormal point ratio to obtain the optimal classification result.
  • an embodiment of the present application provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and runnable on the processor; when the processor executes the computer program, it implements the clustering and SSE-based abnormal point ratio optimization method described in the first aspect above.
  • the embodiments of the present application also provide a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program that, when executed by a processor, causes the processor to execute the clustering and SSE-based abnormal point ratio optimization method described in the first aspect above.
  • FIG. 1 is a schematic flowchart of a method for optimizing the proportion of abnormal points based on clustering and SSE provided by an embodiment of the application;
  • FIG. 2 is a schematic diagram of a sub-process of the method for optimizing the proportion of abnormal points based on clustering and SSE according to an embodiment of the application;
  • FIG. 3 is a schematic diagram of another sub-process of the method for optimizing the proportion of abnormal points based on clustering and SSE according to an embodiment of the application;
  • FIG. 4 is a schematic diagram of another sub-process of the method for optimizing the proportion of abnormal points based on clustering and SSE according to an embodiment of the application;
  • FIG. 5 is another flow diagram of the method for optimizing the proportion of abnormal points based on clustering and SSE according to an embodiment of the application;
  • FIG. 6 is a schematic block diagram of a device for optimizing the proportion of abnormal points based on clustering and SSE according to an embodiment of the application;
  • FIG. 7 is a schematic block diagram of subunits of the device for optimizing the proportion of abnormal points based on clustering and SSE according to an embodiment of the application;
  • FIG. 8 is a schematic block diagram of another subunit of the device for optimizing the proportion of abnormal points based on clustering and SSE according to an embodiment of the application;
  • FIG. 9 is a schematic block diagram of another subunit of the device for optimizing the proportion of abnormal points based on clustering and SSE according to an embodiment of the application;
  • FIG. 10 is another schematic block diagram of the device for optimizing the proportion of abnormal points based on clustering and SSE according to an embodiment of the application;
  • FIG. 11 is a schematic block diagram of a computer device provided by an embodiment of the application.
  • FIG. 1 is a schematic flowchart of an SSE-based abnormal point ratio optimization method provided by an embodiment of the application.
  • the SSE-based abnormal point ratio optimization method is applied to a server and is executed by application software installed in the server.
  • the method includes steps S101 to S181.
  • S101 Receive a set of data points to be classified, and cluster the set of data points to be classified through k-means clustering to obtain multiple clusters.
  • these business data can be regarded as a collection of data points to be classified.
  • the set of data points to be classified may be the user's insurance policy data, including at least fields such as the name of the applicant, the age of the applicant, the number of the applicant's insurance policy, the amount of insurance, the insurance period, and the phone number of the applicant.
  • one of the fields can be selected as the main data, and the remaining fields are used as the attribute data of that main field.
  • the insurance period field is used as the main data, and fields such as the telephone number and ID number of the applicant are used as its attribute data.
  • step S101 includes:
  • S1012 Divide the set of data points to be classified according to the difference between each data point in the set of data points to be classified and each initial cluster center to obtain an initial clustering result
  • the k-means algorithm is used when clustering the set of data points to be classified, and the process is as follows:
  • the specific calculation method is to take the arithmetic mean of the primary attributes of all data points to be classified in each cluster, and to choose the data point to be classified that is closest to that arithmetic mean as the new cluster center, so that a better cluster center is reselected within the cluster data.
  • Repeat step d) until the clustering result no longer changes, and the clustering result corresponding to the preset number of clusters is obtained.
  • the massive collection of data points to be classified can be grouped quickly to obtain multiple clusters.
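  • The clustering steps described above can be sketched as follows. This is a minimal illustration and not the applicant's implementation: the function names, squared Euclidean distance as the "difference" measure, random selection of the initial centers, and the `max_iter` cap are all assumptions. The center update follows the text: take the arithmetic mean of each cluster, then use the member closest to that mean as the new center.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_snap_to_member(points, k, max_iter=100, seed=0):
    """k-means variant in which every center is an actual data point.

    points: list of numeric tuples; k: preset number of clusters.
    Returns (centers, clusters).
    """
    rng = random.Random(seed)
    # Pick k data points as the initial cluster centers (assumed: at random).
    centers = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Assign every point to the nearest current center.
        new_assignment = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points
        ]
        # Stop once the clustering result no longer changes.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Recompute each center: arithmetic mean, then closest member.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if not members:
                continue  # keep the old center for an empty cluster
            mean = tuple(sum(c) / len(members) for c in zip(*members))
            centers[j] = min(members, key=lambda p: sq_dist(p, mean))
    clusters = [
        [p for p, a in zip(points, assignment) if a == j] for j in range(k)
    ]
    return centers, clusters
```

  • Because the centers are always members of the data set, the result is closer to a k-medoids-style update than to textbook k-means, which would keep the raw mean as the center.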
  • the server receives the set of data points to be classified uploaded by the business end and completes the clustering and grouping
  • the initial current abnormal point ratio is set to 0.5 (denoted $m_0$).
  • with $m_0$ as the initial current abnormal point ratio, the abnormal point category contains a large number of misclassified normal points.
  • a single-class support vector machine for outlier detection is constructed according to the preset current proportion of abnormal points and the samples to be classified, as a model basis for subsequent adjustment of the current proportion of abnormal points and reclassification.
  • step S110 includes:
  • S111 Obtain the first parameter and the second parameter of the hyperplane of the single-class support vector machine corresponding to each cluster according to the preset current abnormal point ratio and each cluster;
  • S112 According to the first parameter and the second parameter of the hyperplane, and the current abnormal point ratio, construct a single-class support vector machine for abnormal point detection in a one-to-one correspondence with each cluster.
  • the single-class support vector machine is OneClassSVM, and its classification model is as follows:
  • $\xi_i$ represents the slack variable
  • $\nu$ is an upper bound on the fraction of outliers and a lower bound on the fraction of training examples used as support vectors
  • This method creates a hyperplane with parameters w and b that has the largest distance from the origin in the feature space and separates the origin from all data points.
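  • For orientation, the standard ν-formulation of the one-class SVM is sketched below. It is offered as an assumption about the model the passage refers to, not as the applicant's formula: the symbols $\Phi$ (feature map), $n$ (number of training points) and $\rho$ (offset) do not appear in the source, and $\rho$ plays the role of the second hyperplane parameter that the text calls b.

```latex
\min_{w,\,\xi,\,\rho}\;
  \frac{1}{2}\lVert w\rVert^{2}
  + \frac{1}{\nu n}\sum_{i=1}^{n}\xi_i
  - \rho
\quad\text{subject to}\quad
  \langle w, \Phi(x_i)\rangle \ge \rho - \xi_i,
  \qquad \xi_i \ge 0,
  \quad i = 1,\dots,n

f(x) = \operatorname{sgn}\bigl(\langle w, \Phi(x)\rangle - \rho\bigr)
```

  • Under this formulation, a point with $f(x) = -1$ falls on the origin side of the hyperplane and is treated as abnormal, and setting $\nu$ to the current abnormal point ratio bounds the fraction of training points that can end up there.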
  • each cluster is classified according to its corresponding single-class support vector machine.
  • S120 Classify the selected clusters according to the single-class support vector machine and the current abnormal point ratio, and obtain the normal point center of the normal category in the classification result.
  • taking as an example the case where one of the multiple clusters is selected as the target cluster for obtaining the optimal abnormal point ratio: after the selected cluster is classified by its single-class support vector machine according to the initially set current abnormal point ratio, the normal point center corresponding to the normal-category data points in the classification result can be determined, and this normal point center remains constant in the subsequent process.
  • step S120 includes:
  • the selected clusters are first classified according to the single-class support vector machine and the current abnormal point ratio, and a classification result including data points of normal categories and data points of abnormal categories is obtained.
  • once the center of the normal point is fixed, the proportion of abnormal points can be continuously adjusted, and the optimal abnormal point ratio can be obtained from the change trend of a specified parameter (such as the average Euclidean distance between each data point of the current abnormal category and the center of the normal point).
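  • A minimal sketch of how the normal point center might be computed, following the device description (average the normal-category points, then take the normal point closest to that average). The function name and the use of squared Euclidean distance are assumptions.

```python
def normal_point_center(normal_points):
    """Return the normal-category data point closest to the category mean.

    normal_points: non-empty list of numeric tuples of equal length.
    """
    n = len(normal_points)
    # Initial normal point center: coordinate-wise arithmetic mean.
    mean = tuple(sum(coords) / n for coords in zip(*normal_points))
    # Final center: the actual data point nearest to that mean.
    return min(
        normal_points,
        key=lambda p: sum((x - m) ** 2 for x, m in zip(p, mean)),
    )
```

  • Because the center is computed once from the first classification and then held fixed, later changes in the residual sum of squares reflect only which points are labelled abnormal, not a moving reference point.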
  • the residual sum of squares is a measure of the degree of model fit in a linear model.
  • in curve fitting, a continuous curve is used to approximate or fit discrete points on a plane; it is a data processing method for representing the functional relationship between the coordinates.
  • $\sum_{i=1}^{n} V_i^2 = V_1^2 + V_2^2 + \dots + V_n^2$
  • $V_i$ is the residual of the measured data $l_i$; here, for example, it can represent the residual of the data point $l_i$ of the abnormal category.
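  • As a small illustration of the sum above, the residual of an abnormal-category point can be taken as its coordinate-wise difference from the normal point center, so that each squared residual is a squared Euclidean distance. That reading, and the function name, are assumptions made for this sketch.

```python
def residual_sum_of_squares(abnormal_points, center):
    """Sum of squared Euclidean distances from each abnormal point to center."""
    return sum(
        sum((x - c) ** 2 for x, c in zip(point, center))
        for point in abnormal_points
    )
```

  • For example, with the center at the origin, the points (3, 4) and (0, 1) contribute 25 and 1, giving a residual sum of squares of 26.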
  • S140 Subtract a preset step length from the current abnormal point ratio to update the current abnormal point ratio.
  • the purpose of subtracting the preset step size from the current abnormal point ratio is to continuously adjust the current abnormal point ratio so as to obtain the optimal abnormal point ratio through the trial method.
  • S150 Classify the selected clusters according to the single-class support vector machine and the current abnormal point ratio to obtain data points of the current abnormal category, and take the residual sum of squares between each data point of the current abnormal category and the center of the normal point as the next residual sum of squares.
  • the current abnormal point ratio has been updated by subtracting the step size from it. At this point the normal point center does not need to be determined again; only the data points of the abnormal category in the new classification result are obtained, and the residual sum of squares between each of them and the center of the normal point is calculated as the next residual sum of squares.
  • the current residual sum of squares obtained in step S130 is denoted $SSE_0$;
  • the next residual sum of squares obtained in the first execution of step S150 is denoted $SSE_1$;
  • the next residual sum of squares obtained in the second execution of step S150 is denoted $SSE_2$ (the corresponding current residual sum of squares is then $SSE_1$);
  • the next residual sum of squares obtained in the N-th execution of step S150 is denoted $SSE_N$ (the corresponding current residual sum of squares is then $SSE_{N-1}$).
  • the preset step length is denoted $l$.
  • the residual variation range is calculated as $(SSE_N - SSE_{N-1})/l$, where N is a positive integer.
  • the latest current anomaly point ratio at this moment is not the optimal anomaly point ratio.
  • the current abnormal point ratio of the state immediately before the latest current abnormal point ratio can be considered the optimal abnormal point ratio.
  • the residual variation range exceeds the preset variation range threshold, it means that some real abnormal points are classified as normal points, resulting in a sudden increase in the sum of squared residuals from the abnormal point to the normal center point.
  • the last state of the abnormal point ratio (that is, the current abnormal point ratio plus the step size) can be used as the optimal abnormal point ratio.
  • the method further includes:
  • step S190 If the residual variation range does not exceed the variation range threshold, subtract the step size from the current abnormal point ratio to update the current abnormal point ratio, and update the current residual square sum through the next residual square sum, Return to step S150.
  • the residual variation range still maintains a smooth transition, it means that the reduced proportion of abnormal points is not enough to significantly affect the sum of squared residuals between each data point of the abnormal category and the center of the normal point.
  • the step size is subtracted from the current abnormal point ratio to update it, and the next residual sum of squares is used to update the current residual sum of squares.
  • for example, when $(SSE_N - SSE_{N-1})/l$ does not exceed the preset variation range threshold, $SSE_1$ is first taken as the current residual sum of squares and $(m_0 - l)$ as the current abnormal point ratio, and step S150 is executed again to obtain $SSE_2$; when the flow reaches step S170 again, $(SSE_2 - SSE_1)/l$ is used as the residual variation range, and so on, until the residual variation range exceeds the preset variation range threshold.
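  • The iteration described above can be sketched as a loop. This is an illustrative sketch under stated assumptions, not the applicant's implementation: `classify` is a hypothetical callable standing in for running the single-class support vector machine at a given abnormal point ratio and returning `(normal_points, abnormal_points)`; the helper names, default values and the `None` return when the ratio reaches zero are also assumptions. The comparison uses the signed value $(SSE_N - SSE_{N-1})/l$, as in the text.

```python
def _sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _nearest_to_mean(points):
    mean = tuple(sum(c) / len(points) for c in zip(*points))
    return min(points, key=lambda p: _sq_dist(p, mean))


def _sse(points, center):
    return sum(_sq_dist(p, center) for p in points)


def optimize_abnormal_ratio(points, classify, m0=0.5, step=0.05, threshold=1.0):
    """Search for the abnormal point ratio at which the SSE trend jumps.

    classify(points, ratio) -> (normal_points, abnormal_points)  # hypothetical
    Returns the optimal ratio, or None if the threshold is never exceeded.
    """
    # S120: classify once at m0 and fix the normal point center.
    normal, abnormal = classify(points, m0)
    center = _nearest_to_mean(normal)
    # S130: current residual sum of squares (SSE_0).
    current_sse = _sse(abnormal, center)
    n = 1
    while True:
        # S140: lower the ratio by one step (rounded to limit float drift).
        ratio = round(m0 - n * step, 10)
        if ratio <= 0:
            return None
        # S150: reclassify and compute the next residual sum of squares.
        _, abnormal = classify(points, ratio)
        next_sse = _sse(abnormal, center)
        # Residual variation range: (SSE_N - SSE_{N-1}) / l.
        variation = (next_sse - current_sse) / step
        # S170: the previous ratio is optimal once the threshold is exceeded.
        if variation > threshold:
            return round(ratio + step, 10)
        # S190: otherwise carry the next SSE forward and continue.
        current_sse = next_sse
        n += 1
```

  • In practice `classify` could wrap a one-class SVM implementation, for instance scikit-learn's `OneClassSVM` with `nu` set to the ratio, whose `fit_predict` labels inliers +1 and outliers -1; that pairing is a suggestion and is not named in the source beyond "OneClassSVM".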
  • the selected cluster can be classified according to the single-class support vector machine and the optimal abnormal point ratio to obtain the optimal classification result and the unsupervised classification model with the best classification effect.
  • step S181 the method further includes:
  • the storage area corresponding to the optimal classification result and the optimal abnormal point ratio is formatted and deleted.
  • after the optimal classification result corresponding to the set of data points to be classified and the optimal abnormal point ratio are obtained in the server, the optimal classification result and the optimal abnormal point ratio are sent to the business end corresponding to the set of data points to be classified, so that the business end is effectively notified of the classification result.
  • the optimal classification result and the optimal abnormal point ratio can be sent to the cloud server in time at this time, and the corresponding data point set to be classified can be matched by the cloud server.
  • the set of data points to be classified corresponding to the optimal classification result and the optimal abnormal point ratio may also be synchronized to the cloud server.
  • the unique machine identification code of the business end (such as its IMEI serial number) must be used as the data identification bit to uniquely identify the data.
  • the storage area corresponding to the optimal classification result and the optimal abnormal point ratio in the server can then be formatted and deleted to effectively release storage space.
  • the method before formatting and deleting the storage area corresponding to the optimal classification result and the optimal abnormal point ratio, the method further includes:
  • the number of iterations is sent to the business end corresponding to the set of data points to be classified, and the number of iterations is synchronously sent to the cloud server.
  • the difference between the preset current abnormal point ratio and the optimal abnormal point ratio may be divided by the step size to obtain the number of iterations. Once the number of iterations is known, it can be sent to the business end corresponding to the set of data points to be classified, and the business end can accumulate experience in setting the optimal abnormal point ratio accordingly.
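  • The iteration count described above is a one-line calculation. In this sketch the function name is hypothetical, and rounding to the nearest integer is an added assumption to absorb floating-point error in the division.

```python
def iteration_count(initial_ratio, optimal_ratio, step):
    """Number of step-size reductions between the initial and optimal ratios."""
    return round((initial_ratio - optimal_ratio) / step)
```

  • For example, an initial ratio of 0.5, an optimal ratio of 0.4 and a step size of 0.05 give two iterations.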
  • This method realizes the accurate classification of massive data and the detection of abnormal points in each classification.
  • the proportion of abnormal points in the detection process is automatically adjusted and obtained without setting based on experience.
  • the embodiment of the present application also provides a device for optimizing the proportion of abnormal points based on clustering and SSE.
  • the device for optimizing the proportion of abnormal points based on clustering and SSE is used to perform any of the foregoing embodiments of the method for optimizing the proportion of abnormal points based on clustering and SSE.
  • FIG. 6 is a schematic block diagram of an abnormal point ratio optimization device based on clustering and SSE provided in an embodiment of the present application.
  • the device 100 for optimizing the proportion of abnormal points based on clustering and SSE may be configured in a server.
  • the device 100 for optimizing the proportion of abnormal points based on clustering and SSE includes a clustering unit 101, a multi-model construction unit 110, a normal point center acquisition unit 120, a first residual calculation unit 130, and a first ratio update unit 140.
  • the clustering unit 101 is configured to receive a set of data points to be classified, and cluster the set of data points to be classified through k-means clustering to obtain multiple clusters.
  • the clustering unit 101 includes:
  • the initial cluster center obtaining unit 1011 is used to select, from the set of data points to be classified, the same number of data points as the preset number of clusters, and to use the selected data points as the initial cluster centers of the clusters;
  • the initial clustering unit 1012 is configured to divide the set of data points to be classified according to the difference between each data point in the set of data points to be classified and each initial cluster center to obtain an initial clustering result;
  • the cluster center adjustment unit 1013 is configured to obtain the adjusted cluster center of each cluster according to the initial clustering result;
  • the cluster adjustment unit 1014 is configured to re-divide the set of data points to be classified according to each data point's difference from the adjusted cluster centers, until the clustering result has remained unchanged for more than a preset number of times, thereby obtaining the clusters corresponding to the preset number of clusters.
  • the multi-model construction unit 110 is used to obtain the data points corresponding to each cluster included in the multiple clusters, and to construct, according to the preset current abnormal point ratio and each cluster, a single-class support vector machine for outlier detection in one-to-one correspondence with each cluster.
  • the multi-model construction unit 110 includes:
  • the classification parameter obtaining unit 111 is configured to obtain the first parameter and the second parameter of the hyperplane corresponding to the single classification support vector machine of each cluster according to the preset current abnormal point ratio and each cluster;
  • the model acquisition unit 112 is configured to construct a single-class support vector machine for abnormal point detection in a one-to-one correspondence with each cluster according to the first parameter and the second parameter of the hyperplane and the current abnormal point ratio.
  • the normal point center obtaining unit 120 is configured to classify the selected cluster according to the single-class support vector machine and the current abnormal point ratio to obtain the normal point center of the normal category in the classification result.
  • the normal point center obtaining unit 120 includes:
  • the initial classification unit 121 is configured to classify the selected cluster according to the corresponding single-class support vector machine and the current proportion of abnormal points to obtain a classification result corresponding to the selected cluster, wherein the classification result includes normal category data points and abnormal category data points;
  • the distance average calculation unit 122 is configured to obtain the average value corresponding to the data points of the normal category in the classification result to obtain the initial normal point center;
  • the normal point center adjustment unit 123 is configured to obtain the data point closest to the initial normal point center among the data points of the normal category in the classification result as the normal point center corresponding to the data points of the normal category.
  • the first residual calculation unit 130 is configured to obtain the residual square sum of each data point of the abnormal category in the classification result and the center of the normal point to obtain the current residual square sum.
  • the first ratio update unit 140 is configured to subtract a preset step size from the current abnormal point ratio to update the current abnormal point ratio.
  • the second residual calculation unit 150 is configured to classify the selected cluster according to the single-class support vector machine and the current abnormal point ratio to obtain the data points of the current abnormal category, and to take the residual sum of squares between each data point of the current abnormal category and the center of the normal point as the next residual sum of squares.
  • the amplitude calculation unit 160 is configured to divide the difference between the next residual sum of squares and the current residual sum of squares by the step size to obtain the residual variation amplitude.
  • the determining unit 170 is configured to determine whether the residual variation range exceeds a preset variation range threshold.
  • the optimal ratio acquisition unit 180 is configured to, if the residual variation range exceeds the variation range threshold, use the current abnormal point ratio plus the step length as the optimal abnormal point ratio.
  • the device 100 for optimizing the proportion of abnormal points based on clustering and SSE further includes:
  • the second ratio update unit 190 is configured to, if the residual variation range does not exceed the variation range threshold, subtract the step size from the current abnormal point ratio to update the current abnormal point ratio, use the next residual sum of squares to update the current residual sum of squares, and return to the step of classifying the sample to be classified according to the single-class support vector machine and the current abnormal point ratio to obtain the data points of the current abnormal category and taking the residual sum of squares between each data point of the current abnormal category and the center of the normal point as the next residual sum of squares.
  • the optimal classification unit 181 is configured to classify the selected clusters according to the single classification support vector machine and the optimal anomaly point ratio to obtain an optimal classification result.
  • the selected cluster can be classified according to the single-class support vector machine and the optimal abnormal point ratio to obtain the optimal classification result and the unsupervised classification model with the best classification effect.
  • the device realizes accurate classification of massive data and detection of abnormal points in each classification, and the proportion of abnormal points in the detection process is automatically adjusted and obtained without setting based on experience.
  • the above-mentioned device for optimizing the proportion of abnormal points based on clustering and SSE can be implemented in the form of a computer program, which can be run on a computer device as shown in FIG.
  • FIG. 11 is a schematic block diagram of a computer device according to an embodiment of the present application.
  • the computer device 500 is a server, and the server may be an independent server or a server cluster composed of multiple servers.
  • the computer device 500 includes a processor 502, a memory, and a network interface 505 connected through a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.
  • the non-volatile storage medium 503 can store an operating system 5031 and a computer program 5032.
  • the processor 502 can execute the method for optimizing the proportion of abnormal points based on clustering and SSE.
  • the processor 502 is used to provide calculation and control capabilities, and support the operation of the entire computer device 500.
  • the internal memory 504 provides an environment for the operation of the computer program 5032 in the non-volatile storage medium 503.
  • the network interface 505 is used for network communication, such as providing data information transmission.
  • the structure shown in FIG. 11 is only a block diagram of part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device 500 to which the solution of the present application is applied.
  • the specific computer device 500 may include more or fewer components than shown in the figure, or combine certain components, or have a different component arrangement.
  • the processor 502 is configured to run a computer program 5032 stored in a memory to implement the method for optimizing the proportion of abnormal points based on clustering and SSE disclosed in the embodiments of the present application.
  • the embodiment of the computer device shown in FIG. 11 does not constitute a limitation on the specific configuration of the computer device.
  • the computer device may include more or fewer components than those shown in the figure, combine certain components, or have a different component arrangement.
  • the computer device may only include a memory and a processor. In such embodiments, the structures and functions of the memory and the processor are consistent with the embodiment shown in FIG. 11, and will not be repeated here.
  • the processor 502 may be a central processing unit (Central Processing Unit, CPU), and the processor 502 may also be other general-purpose processors, digital signal processors (Digital Signal Processors, DSPs), Application Specific Integrated Circuit (ASIC), Field-Programmable Gate Array (FPGA) or other programmable logic devices, discrete gates or transistor logic devices, discrete hardware components, etc.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor.
  • a computer-readable storage medium may be a non-volatile computer-readable storage medium.
  • the computer-readable storage medium stores a computer program, where the computer program is executed by a processor to implement the method for optimizing the proportion of abnormal points based on clustering and SSE disclosed in the embodiments of the present application.
  • the storage medium is a physical, non-transitory storage medium, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, an optical disc, or any other physical storage medium that can store program code.

Abstract

The present application discloses a method and a device for optimizing an abnormal point proportion based on clustering and SSE. The method comprises the steps of: receiving a set of data points to be classified, and clustering the set by k-means clustering to obtain multiple clusters; obtaining the data points corresponding to each of the multiple clusters, and constructing a one-class support vector machine corresponding to each cluster according to a preset current abnormal point proportion and each cluster; repeatedly adjusting the current abnormal point proportion until the residual variation exceeds a variation threshold, and then taking the current abnormal point proportion plus the step size as the optimal abnormal point proportion; and classifying the selected cluster according to the one-class support vector machine and the optimal abnormal point proportion to obtain an optimal classification result.

Description

Method and device for optimizing the proportion of abnormal points based on clustering and SSE
This application claims priority to Chinese patent application No. 201910079217.9, entitled "Method and Device for Optimizing the Proportion of Abnormal Points Based on Clustering and SSE", filed with the Chinese Patent Office on January 28, 2019, the entire contents of which are incorporated herein by reference.
Technical field
This application relates to the technical field of intelligent decision-making, and in particular to a method and device for optimizing the proportion of abnormal points based on clustering and SSE.
Background
Outlier analysis is the process of checking whether data contains entry errors or unreasonable values. Ignoring outliers is dangerous: including them in the calculation and analysis of the data without removing them adversely affects the results.
At present, the massive user data collected during an enterprise's operations often has multiple normal point centers. If the massive user data is not divided before abnormal point detection is performed, the unsupervised model used for abnormal point detection discriminates poorly and cannot detect abnormal point data at a fine granularity.
Summary of the invention
The embodiments of this application provide a method, device, computer equipment and storage medium for optimizing the proportion of abnormal points based on clustering and SSE. They aim to solve the prior-art problem that massive user data often has multiple normal point centers, so that if the data is not divided before abnormal point detection is performed, the unsupervised model used for abnormal point detection discriminates poorly and cannot detect abnormal point data at a fine granularity.
In a first aspect, an embodiment of this application provides a method for optimizing the proportion of abnormal points based on clustering and SSE, which includes:
receiving a set of data points to be classified, and clustering the set of data points to be classified by k-means clustering to obtain multiple clusters;
obtaining the data points corresponding to each of the multiple clusters, and constructing, according to a preset current abnormal point proportion and each cluster, a one-class support vector machine for abnormal point detection in one-to-one correspondence with each cluster;
classifying the selected cluster according to the one-class support vector machine and the current abnormal point proportion, to obtain the normal point center of the normal class in the classification result;
obtaining the residual sum of squares between each data point of the abnormal class in the classification result and the normal point center, as the current residual sum of squares;
subtracting a preset step size from the current abnormal point proportion to update the current abnormal point proportion;
classifying the selected cluster according to the one-class support vector machine and the current abnormal point proportion to obtain the data points of the current abnormal class, and obtaining the residual sum of squares between each data point of the current abnormal class and the normal point center, as the next residual sum of squares;
dividing the difference between the next residual sum of squares and the current residual sum of squares by the step size to obtain the residual variation;
determining whether the residual variation exceeds a preset variation threshold;
if the residual variation exceeds the variation threshold, taking the current abnormal point proportion plus the step size as the optimal abnormal point proportion; and
classifying the selected cluster according to the one-class support vector machine and the optimal abnormal point proportion to obtain an optimal classification result.
In a second aspect, an embodiment of this application provides a device for optimizing the proportion of abnormal points based on clustering and SSE, which includes:
a clustering unit, configured to receive a set of data points to be classified, and to cluster the set of data points to be classified by k-means clustering to obtain multiple clusters;
a multi-model construction unit, configured to obtain the data points corresponding to each of the multiple clusters, and to construct, according to a preset current abnormal point proportion and each cluster, a one-class support vector machine for abnormal point detection in one-to-one correspondence with each cluster;
a normal point center obtaining unit, configured to classify the selected cluster according to the one-class support vector machine and the current abnormal point proportion, to obtain the normal point center of the normal class in the classification result;
a first residual calculation unit, configured to obtain the residual sum of squares between each data point of the abnormal class in the classification result and the normal point center, as the current residual sum of squares;
a first proportion update unit, configured to subtract a preset step size from the current abnormal point proportion to update the current abnormal point proportion;
a second residual calculation unit, configured to classify the selected cluster according to the one-class support vector machine and the current abnormal point proportion to obtain the data points of the current abnormal class, and to obtain the residual sum of squares between each data point of the current abnormal class and the normal point center, as the next residual sum of squares;
a variation calculation unit, configured to divide the difference between the next residual sum of squares and the current residual sum of squares by the step size to obtain the residual variation;
a judging unit, configured to determine whether the residual variation exceeds a preset variation threshold;
an optimal proportion obtaining unit, configured to take the current abnormal point proportion plus the step size as the optimal abnormal point proportion if the residual variation exceeds the variation threshold; and
an optimal classification unit, configured to classify the selected cluster according to the one-class support vector machine and the optimal abnormal point proportion to obtain an optimal classification result.
In a third aspect, an embodiment of this application further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the method for optimizing the proportion of abnormal points based on clustering and SSE described in the first aspect.
In a fourth aspect, an embodiment of this application further provides a computer-readable storage medium storing a computer program that, when executed by a processor, causes the processor to perform the method for optimizing the proportion of abnormal points based on clustering and SSE described in the first aspect.
Description of the drawings
To describe the technical solutions of the embodiments of this application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show some embodiments of this application; a person of ordinary skill in the art can obtain other drawings from them without creative work.
FIG. 1 is a schematic flowchart of the method for optimizing the proportion of abnormal points based on clustering and SSE provided by an embodiment of this application;
FIG. 2 is a schematic diagram of a sub-process of the method for optimizing the proportion of abnormal points based on clustering and SSE provided by an embodiment of this application;
FIG. 3 is a schematic diagram of another sub-process of the method for optimizing the proportion of abnormal points based on clustering and SSE provided by an embodiment of this application;
FIG. 4 is a schematic diagram of another sub-process of the method for optimizing the proportion of abnormal points based on clustering and SSE provided by an embodiment of this application;
FIG. 5 is another schematic flowchart of the method for optimizing the proportion of abnormal points based on clustering and SSE provided by an embodiment of this application;
FIG. 6 is a schematic block diagram of the device for optimizing the proportion of abnormal points based on clustering and SSE provided by an embodiment of this application;
FIG. 7 is a schematic block diagram of subunits of the device for optimizing the proportion of abnormal points based on clustering and SSE provided by an embodiment of this application;
FIG. 8 is a schematic block diagram of another subunit of the device for optimizing the proportion of abnormal points based on clustering and SSE provided by an embodiment of this application;
FIG. 9 is a schematic block diagram of another subunit of the device for optimizing the proportion of abnormal points based on clustering and SSE provided by an embodiment of this application;
FIG. 10 is another schematic block diagram of the device for optimizing the proportion of abnormal points based on clustering and SSE provided by an embodiment of this application;
FIG. 11 is a schematic block diagram of a computer device provided by an embodiment of this application.
Detailed description
The technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings. The described embodiments are some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art from the embodiments in this application without creative work fall within the protection scope of this application.
It should be understood that, when used in this specification and the appended claims, the terms "comprising" and "including" indicate the presence of the described features, integers, steps, operations, elements and/or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or collections thereof.
It should also be understood that the terms used in this specification are only for the purpose of describing specific embodiments and are not intended to limit this application. As used in this specification and the appended claims, the singular forms "a", "an" and "the" are intended to include the plural forms unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" used in this specification and the appended claims refers to any combination, and all possible combinations, of one or more of the associated listed items, and includes these combinations.
Please refer to FIG. 1, which is a schematic flowchart of the method for optimizing the proportion of abnormal points based on clustering and SSE provided by an embodiment of this application. The method is applied in a server and is executed by application software installed on the server.
As shown in FIG. 1, the method includes steps S101 to S181.
S101. Receive a set of data points to be classified, and cluster the set of data points to be classified by k-means clustering to obtain multiple clusters.
In this embodiment, after the enterprise's server has received the massive business data uploaded by the business ends, this business data can be regarded as the set of data points to be classified. For example, the set of data points to be classified may be users' insurance policy data, including at least fields such as the policyholder's name, age, number of policies, insured amount, insurance period, and mobile phone number. One of these fields can then be selected as the main data, with the remaining fields serving as attribute data of that main field. For example, the insurance period field serves as the main data, and fields such as the policyholder's phone number and ID number serve as its attribute data.
In an embodiment, as shown in FIG. 2, step S101 includes:
S1011. Select, from the set of data points to be classified, as many data points as the preset number of clusters, and use the selected data points as the initial cluster centers of the clusters;
S1012. Divide the set of data points to be classified according to the dissimilarity between each data point in the set and each initial cluster center, to obtain an initial clustering result;
S1013. Obtain the adjusted cluster center of each cluster according to the initial clustering result;
S1014. Re-divide the set of data points to be classified according to the dissimilarity to the adjusted cluster centers, until the number of times the clustering result has stayed the same exceeds a preset number, to obtain the clusters corresponding to the preset number of clusters.
In this embodiment, when the set of data points to be classified is clustered, one of the fields is selected as the primary key and the remaining fields serve as attribute data. Specifically, the set of data points to be classified is clustered with the k-means algorithm as follows:
a) Arbitrarily select k of the n data points to be classified as the initial cluster centers of k clusters. Here n is the initial total number of data points in the set, and k (k<n) is a user-specified parameter: the desired number of clusters, that is, the preset number of clusters.
b) Calculate the dissimilarity between each remaining data point and the k initial cluster centers, and assign each remaining data point to the cluster with the lowest dissimilarity, to obtain the initial clustering result. In other words, each remaining data point chooses the nearest initial cluster center and is grouped with it, so that the initially selected centers divide the massive data points into k clusters, each with an initial cluster center.
c) Recalculate the cluster center of each of the k clusters according to the initial clustering result. Specifically, take the arithmetic mean of the main attribute of all data points in each cluster, and select the data point closest to that arithmetic mean as the new cluster center, so that a better cluster center is chosen for the cluster.
d) Re-cluster all n data points to be classified according to the new cluster centers;
e) Repeat steps c) and d) until the clustering result no longer changes, to obtain the clustering result corresponding to the preset number of clusters.
Once clustering is complete, the massive set of data points to be classified has been quickly grouped into multiple clusters.
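Steps a) to e) can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the patent's implementation: Euclidean distance stands in for the unspecified dissimilarity, and the names `kmeans`, `nearest_center`, `preset_repeats`, `seed` and `max_iter` are hypothetical. As in step c), each new center is the cluster member closest to the arithmetic mean rather than the mean itself.

```python
import math
import random

def nearest_center(point, centers):
    """Index of the center with the lowest dissimilarity (Euclidean distance here)."""
    return min(range(len(centers)), key=lambda j: math.dist(point, centers[j]))

def kmeans(points, k, preset_repeats=0, seed=0, max_iter=100):
    rng = random.Random(seed)
    centers = rng.sample(points, k)           # a) pick k data points as initial centers
    labels, unchanged = None, 0
    for _ in range(max_iter):
        new_labels = [nearest_center(p, centers) for p in points]  # b) / d) assign
        unchanged = unchanged + 1 if new_labels == labels else 0
        labels = new_labels
        if unchanged > preset_repeats:         # e) / S1014: result stayed the same
            break
        for j in range(k):                     # c) recompute each center
            members = [p for p, lab in zip(points, labels) if lab == j]
            if not members:
                continue                       # keep the old center for an empty cluster
            mean = tuple(sum(axis) / len(members) for axis in zip(*members))
            # the new center is the member closest to the arithmetic mean
            centers[j] = min(members, key=lambda p: math.dist(p, mean))
    return labels, centers

# made-up one-dimensional "main attribute" values forming two groups
points = [(1.0,), (1.2,), (0.8,), (10.0,), (10.3,), (9.7,)]
labels, centers = kmeans(points, k=2)
print(labels, centers)
```

Which group receives label 0 depends on the random initial centers, so only the grouping itself is meaningful, not the label values.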
S110. Obtain the data points corresponding to each of the multiple clusters, and construct, according to a preset current abnormal point proportion and each cluster, a one-class support vector machine for abnormal point detection in one-to-one correspondence with each cluster.
In this embodiment, for example, after the server has received the set of data points to be classified uploaded by the business end and completed the clustering, if the initial current abnormal point proportion is set to 0.5 (denote the initial current abnormal point proportion as $m_0$), the expected ratio of normal samples to abnormal samples in the classification result of the one-class support vector machine is 1:1. Because normal points are assumed to outnumber abnormal points, the abnormal class then contains a large number of misclassified normal points. As the abnormal point proportion decreases, the normal points in the abnormal class are removed. At this stage, a one-class support vector machine for abnormal point detection is first constructed from the preset current abnormal point proportion and the samples to be classified, as the model basis for subsequently adjusting the current abnormal point proportion and reclassifying.
In an embodiment, as shown in FIG. 3, step S110 includes:
S111. Obtain, according to the preset current abnormal point proportion and each cluster, the first parameter and the second parameter of the hyperplane of the one-class support vector machine corresponding to each cluster;
S112. Construct, according to the first parameter and the second parameter of the hyperplane and the current abnormal point proportion, a one-class support vector machine for abnormal point detection in one-to-one correspondence with each cluster.
In this embodiment, the one-class support vector machine is OneClassSVM, and its classification model is as follows:
Figure PCTCN2019117363-appb-000001
s.t. $(w \cdot \varphi(x_i)) \ge b - \xi_i,\ \xi_i \ge 0$;
where $\xi_i$ denotes the slack variable, and $v$ sets an upper bound on the fraction of outliers and a lower bound on the fraction of training samples that serve as support vectors;
By the Lagrangian transformation, the above classification model becomes:
Figure PCTCN2019117363-appb-000002
This method creates a hyperplane with parameters w and b that lies at the maximum distance from the origin in the feature space and separates the origin from all data points.
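The two formula images are not reproduced in this text. Assuming they show the standard ν-formulation of the one-class SVM, which matches the constraint and the description of v given above, the model and its Lagrangian dual would read as follows. Here $n$ (the number of training points in the cluster), $\alpha_i$ (the Lagrange multipliers) and $K$ (the kernel) are symbols introduced for this sketch and do not come from the source.

```latex
% Primal problem (assumed content of the first formula image)
\min_{w,\,\xi,\,b}\ \frac{1}{2}\lVert w\rVert^{2}
  + \frac{1}{\nu n}\sum_{i=1}^{n}\xi_i - b
\quad \text{s.t.}\quad (w\cdot\varphi(x_i)) \ge b - \xi_i,\ \ \xi_i \ge 0

% Dual problem after the Lagrangian transformation
% (assumed content of the second formula image)
\min_{\alpha}\ \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}
  \alpha_i \alpha_j K(x_i, x_j)
\quad \text{s.t.}\quad 0 \le \alpha_i \le \frac{1}{\nu n},\ \
  \sum_{i=1}^{n}\alpha_i = 1,
\qquad K(x_i, x_j) = \varphi(x_i)\cdot\varphi(x_j)

% Resulting decision function: +1 for the normal class, -1 for the abnormal class
f(x) = \operatorname{sgn}\Bigl(\sum_{i=1}^{n}\alpha_i K(x_i, x) - b\Bigr)
```

Under this reading, the current abnormal point proportion plays the role of $\nu$ when the model for a cluster is built.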
In this way, after a one-class support vector machine has been constructed for each of the multiple clusters, the data of each cluster is classified by its corresponding one-class support vector machine.
S120. Classify the selected cluster according to the one-class support vector machine and the current abnormal point proportion, to obtain the normal point center of the normal class in the classification result.
In this embodiment, when one of the multiple clusters is selected as the target cluster to illustrate how the optimal abnormal point proportion is obtained, the selected cluster is classified by the one-class support vector machine according to the initially set current abnormal point proportion. The normal point center corresponding to the data points of the normal class in the classification result can then be determined, and this normal point center stays constant in the subsequent process.
In an embodiment, as shown in FIG. 4, step S120 includes:
S121. Classify the selected cluster according to the corresponding one-class support vector machine and the current abnormal point proportion, to obtain the classification result corresponding to the selected cluster, where the classification result includes data points of the normal class and data points of the abnormal class;
S122. Obtain the mean of the data points of the normal class in the classification result, as the initial normal point center;
S123. Obtain, among the data points of the normal class in the classification result, the data point closest to the initial normal point center, as the normal point center corresponding to the data points of the normal class.
In this embodiment, the selected cluster is first classified according to the one-class support vector machine and the current abnormal point proportion, which yields a classification result containing data points of the normal class and data points of the abnormal class. To determine the normal point center, the mean of the normal-class data points is obtained first, and the normal-class data point closest to that mean is then taken as the normal point center. Once the normal point center is fixed, the abnormal point proportion can be adjusted repeatedly, and the optimal abnormal point proportion is obtained from the trend of a specified parameter (such as the average Euclidean distance between each data point of the current abnormal class and the normal point center).
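Steps S122 and S123 amount to taking the mean of the normal class and then snapping it to the nearest actual data point. A minimal sketch, assuming data points are numeric tuples and Euclidean distance is the distance measure (the function name `normal_point_center` and the sample values are hypothetical):

```python
import math

def normal_point_center(normal_points):
    """Return the normal-class data point closest to the class mean."""
    n = len(normal_points)
    # S122: initial normal point center = coordinate-wise mean of the normal class
    mean = tuple(sum(axis) / n for axis in zip(*normal_points))
    # S123: the normal point center is the actual data point nearest to that mean
    return min(normal_points, key=lambda p: math.dist(p, mean))

# made-up normal-class points; their mean is (2.0, 2.25)
normal = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (2.0, 3.0)]
print(normal_point_center(normal))  # (2.0, 2.0)
```

Choosing an actual data point rather than the raw mean keeps the center inside the observed data, which mirrors how cluster centers are chosen in the k-means step of this document.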
S130. Obtain the residual sum of squares between each data point of the abnormal class in the classification result and the normal point center, as the current residual sum of squares.
In this embodiment, the residual sum of squares is a quantity that measures how well a model fits in a linear model, where fitting is a data processing method that uses a continuous curve to approximate a set of discrete points on a plane in order to express the functional relationship between their coordinates. For example, under equal-precision measurement, the residual sum of squares is $\sum_{i=1}^{n} V_i^2 = V_1^2 + V_2^2 + \dots + V_n^2$, where $V_i$ is the residual of the measured datum $l_i$; here, the residual of $l_i$ can represent the residual of the abnormal-class data point $l_i$. To assess the residuals between the data points of the abnormal class and the normal points, the residual sum of squares between each data point of the abnormal class and the normal point center is calculated as the current residual sum of squares, which shows whether the data points of the abnormal class all lie far from the normal point center.
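The quantity computed in step S130 can be illustrated as follows. This sketch assumes that the residual of an abnormal-class point is its coordinate-wise difference from the normal point center, so that the sum of squares equals the sum of squared Euclidean distances; the function name `sse_to_center` and the sample values are hypothetical.

```python
def sse_to_center(abnormal_points, center):
    """Sum of squared residuals between each abnormal-class point and the center."""
    return sum(
        sum((x - c) ** 2 for x, c in zip(point, center))
        for point in abnormal_points
    )

# made-up values: squared distances are 3^2 + 4^2 = 25 and 1^2 + 0^2 = 1
abnormal = [(4.0, 6.0), (0.0, 2.0)]
center = (1.0, 2.0)
print(sse_to_center(abnormal, center))  # 26.0
```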
S140. Subtract a preset step size from the current abnormal point proportion to update the current abnormal point proportion.
In this embodiment, the preset step size is subtracted from the current abnormal point proportion in order to adjust the current abnormal point proportion repeatedly, so that the optimal abnormal point proportion is found by trial.
S150. Classify the selected cluster according to the one-class support vector machine and the current abnormal point proportion to obtain the data points of the current abnormal class, and obtain the residual sum of squares between each data point of the current abnormal class and the normal point center, as the next residual sum of squares.
In this embodiment, the current abnormal point proportion is updated by subtracting the step size from it. There is no need to determine the normal point center again: it is enough to obtain the data points of the abnormal class in the classification result and then calculate the residual sum of squares between each of them and the normal point center, as the next residual sum of squares.
S160. Divide the difference between the next residual sum of squares and the current residual sum of squares by the step size to obtain the residual variation.
In this embodiment, for example, if the current residual sum of squares obtained in step S130 is denoted $SSE_0$, then the next residual sum of squares obtained by the first execution of step S150 is $SSE_1$, the one obtained by the second execution is $SSE_2$ (the corresponding current residual sum of squares then being $SSE_1$), and so on, so that the one obtained by the N-th execution of step S150 is $SSE_N$ (the corresponding current residual sum of squares then being $SSE_{N-1}$). If the preset step size is denoted $l$, the residual variation is calculated as $(SSE_N - SSE_{N-1})/l$, where N is a positive integer.
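As a worked illustration of this quotient, the block below uses made-up numbers; the symbol $R_N$ for the residual variation is introduced here only for convenience and does not appear in the source.

```latex
R_N = \frac{SSE_N - SSE_{N-1}}{l}, \qquad N = 1, 2, \dots

% illustrative values: l = 0.05,\ SSE_0 = 120,\ SSE_1 = 118,\ SSE_2 = 90
R_1 = \frac{118 - 120}{0.05} = -40, \qquad
R_2 = \frac{90 - 118}{0.05} = -560
```

In this example the size of the variation jumps from 40 to 560 between the first and second adjustment. Reading the threshold comparison as applying to the magnitude $|R_N|$ is an assumption of this illustration; the text itself defines only the signed quotient.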
S170. Determine whether the residual variation amplitude exceeds a preset variation amplitude threshold.
In this embodiment, a sudden increase in the residual variation amplitude indicates that the latest current abnormal point ratio is not the optimal abnormal point ratio; the current abnormal point ratio of the immediately preceding state can then be taken as the optimal abnormal point ratio.
S180. If the residual variation amplitude exceeds the variation amplitude threshold, take the current abnormal point ratio plus the step size as the optimal abnormal point ratio.
In this embodiment, if the residual variation amplitude exceeds the preset variation amplitude threshold, some truly abnormal points have been classified as normal points, causing a sudden increase in the residual sum of squares from the abnormal points to the normal point center. The previous state of the current abnormal point ratio (that is, the current abnormal point ratio plus the step size) can then be taken as the optimal abnormal point ratio.
In an embodiment, as shown in FIG. 5, after step S170 the method further includes:
S190. If the residual variation amplitude does not exceed the variation amplitude threshold, subtract the step size from the current abnormal point ratio to update the current abnormal point ratio, update the current residual sum of squares with the next residual sum of squares, and return to step S150.
In this embodiment, when the residual variation amplitude remains stable, the reduction in the abnormal point ratio is not enough to noticeably affect the residual sum of squares between the data points of the abnormal category and the normal point center. In that case the step size is subtracted from the current abnormal point ratio to update it, and the current residual sum of squares is updated with the next residual sum of squares. For example, when (SSE_N - SSE_(N-1))/l does not exceed the preset variation amplitude threshold for N = 1, SSE_1 is first taken as the current residual sum of squares and (m_0 - l) as the current abnormal point ratio, and the method returns to step S150 to obtain SSE_2. When the flow reaches step S170 again, (SSE_2 - SSE_1)/l is used as the residual variation amplitude, and so on, until the residual variation amplitude exceeds the preset variation amplitude threshold.
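The loop formed by steps S130 to S190 can be sketched as follows. This is a simplified illustration under stated assumptions: `sse_at` is a hypothetical callable standing in for "classify the selected cluster at the given abnormal point ratio and compute the residual sum of squares", and the guard that stops the loop before the ratio reaches zero is an addition of this sketch, not something the text specifies.

```python
from typing import Callable


def find_optimal_ratio(sse_at: Callable[[float], float],
                       initial_ratio: float,
                       step: float,
                       threshold: float) -> float:
    """Lower the abnormal point ratio step by step until the SSE jumps."""
    current_sse = sse_at(initial_ratio)              # S130: SSE_0
    k = 0
    # Assumed guard: never let the ratio reach zero or below.
    while initial_ratio - (k + 1) * step > 0:
        k += 1
        ratio = initial_ratio - k * step             # S140 / S190: lower ratio
        next_sse = sse_at(ratio)                     # S150: SSE_k
        variation = (next_sse - current_sse) / step  # S160
        if variation > threshold:                    # S170
            return initial_ratio - (k - 1) * step    # S180: previous state
        current_sse = next_sse                       # S190
    return initial_ratio - k * step                  # threshold never exceeded


# Hypothetical SSE values per ratio, for illustration only.
table = {0.10: 10.0, 0.08: 10.5, 0.06: 11.0, 0.04: 40.0, 0.02: 41.0}
best = find_optimal_ratio(lambda r: table[round(r, 2)], 0.10, 0.02, 100.0)
```

The ratio is recomputed from the iteration index rather than decremented in place, which keeps floating-point drift from accumulating over many steps.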
S181. Classify the selected cluster using the one-class support vector machine and the optimal abnormal point ratio to obtain the optimal classification result.
In this embodiment, once the optimal abnormal point ratio has been determined, the selected cluster can be classified using the one-class support vector machine and the optimal abnormal point ratio to obtain the optimal classification result, which yields the unsupervised classification model with the best classification performance.
In an embodiment, after step S181 the method further includes:
sending the optimal classification result and the optimal abnormal point ratio to the business end corresponding to the set of data points to be classified, and synchronously sending the optimal classification result and the optimal abnormal point ratio to a cloud server;
formatting and deleting the storage area corresponding to the optimal classification result and the optimal abnormal point ratio.
In this embodiment, once the server has obtained the optimal classification result and the optimal abnormal point ratio corresponding to the set of data points to be classified, it can promptly send them to the business end corresponding to that set, so that the business end is effectively notified of the classification result.
To reduce the data storage pressure on the server, the optimal classification result and the optimal abnormal point ratio can also be promptly synchronized to the cloud server, which then stores them for the corresponding set of data points to be classified. During this process, the set of data points to be classified that corresponds to the optimal classification result and the optimal abnormal point ratio may also be synchronized to the cloud server. When the set of data points to be classified, the optimal classification result, and the optimal abnormal point ratio are synchronized from the server to the cloud server, the unique machine identification code of the business end (such as its IMEI number) must be used as the data identifier so that the data are uniquely identified.
After the optimal classification result and the optimal abnormal point ratio have been synchronized to the cloud server, the storage area on the server that corresponds to them can be formatted and deleted, which frees storage space.
In an embodiment, before formatting and deleting the storage area corresponding to the optimal classification result and the optimal abnormal point ratio, the method further includes:
dividing the difference between the preset current abnormal point ratio and the optimal abnormal point ratio by the step size to obtain the number of iterations;
sending the number of iterations to the business end corresponding to the set of data points to be classified, and synchronously sending the number of iterations to the cloud server.
In this embodiment, to make clear how many iterations separate the preset current abnormal point ratio from the optimal abnormal point ratio, the difference between the two can be divided by the step size to obtain the number of iterations. Once the number of iterations is known, it can be sent to the business end corresponding to the set of data points to be classified, so that the business end can accumulate experience in setting the abnormal point ratio.
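The iteration count described above is a single division, sketched below. Rounding to the nearest integer is an assumption of this sketch, added to absorb floating-point error; the function name `iteration_count` is hypothetical.

```python
def iteration_count(preset_ratio: float, optimal_ratio: float, step: float) -> int:
    """Number of step-size reductions between the preset and optimal ratios."""
    if step <= 0:
        raise ValueError("step size must be positive")
    # round() is an assumed safeguard against floating-point residue.
    return round((preset_ratio - optimal_ratio) / step)


# Hypothetical values: preset ratio 0.10, optimal ratio 0.06, step size 0.02.
n_iterations = iteration_count(0.10, 0.06, 0.02)
```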
The method achieves accurate classification of massive data and abnormal point detection within each class. The abnormal point ratio used during detection is obtained by automatic adjustment and does not need to be set from experience.
An embodiment of the present application further provides an abnormal point ratio optimization device based on clustering and SSE, which is used to execute any embodiment of the foregoing abnormal point ratio optimization method based on clustering and SSE. Specifically, refer to FIG. 6, which is a schematic block diagram of the abnormal point ratio optimization device based on clustering and SSE provided by an embodiment of the present application. The abnormal point ratio optimization device 100 based on clustering and SSE may be configured in a server.
As shown in FIG. 6, the abnormal point ratio optimization device 100 based on clustering and SSE includes a clustering unit 101, a multi-model construction unit 110, a normal point center acquisition unit 120, a first residual calculation unit 130, a first ratio update unit 140, a second residual calculation unit 150, an amplitude calculation unit 160, a judgment unit 170, an optimal ratio acquisition unit 180, and an optimal classification unit 181.
The clustering unit 101 is configured to receive a set of data points to be classified and cluster the set by k-means clustering to obtain multiple clusters.
In an embodiment, as shown in FIG. 7, the clustering unit 101 includes:
an initial cluster center acquisition unit 1011, configured to select from the set of data points to be classified as many data points as the preset number of clusters, and use the selected data points as the initial cluster centers of the clusters;
an initial clustering unit 1012, configured to partition the set of data points to be classified according to the dissimilarity between each data point in the set and each initial cluster center, to obtain an initial clustering result;
a cluster center adjustment unit 1013, configured to obtain the adjusted cluster center of each cluster from the initial clustering result;
a cluster adjustment unit 1014, configured to partition the set of data points to be classified according to the dissimilarity between each data point and the adjusted cluster centers, until the number of times the clustering result has remained unchanged exceeds a preset number, to obtain clusters corresponding to the preset number of clusters.
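The four sub-units above correspond to the usual k-means loop, sketched here in plain Python. Several choices are assumptions of this sketch rather than details from the text: the dissimilarity is taken to be squared Euclidean distance, initial centers are drawn at random, an empty cluster keeps its previous center, and `max_rounds` is an added safety limit. All function and parameter names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points: Sequence[Point], centers: Sequence[Point]) -> List[int]:
    """Label each point with the index of its least dissimilar center."""
    return [
        min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
        for p in points
    ]


def update_centers(points, labels, centers) -> List[Point]:
    """Move each center to the mean of its members (unit 1013)."""
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if not members:
            new_centers.append(old)  # assumed handling of an empty cluster
            continue
        new_centers.append(
            tuple(sum(p[d] for p in members) / len(members) for d in range(len(old)))
        )
    return new_centers


def k_means(points: List[Point], k: int, stable_rounds: int = 1,
            max_rounds: int = 100, seed: int = 0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # unit 1011: initial centers
    labels = assign(points, centers)         # unit 1012: initial clustering
    unchanged = 0
    for _ in range(max_rounds):
        centers = update_centers(points, labels, centers)
        new_labels = assign(points, centers)  # unit 1014: re-partition
        unchanged = unchanged + 1 if new_labels == labels else 0
        labels = new_labels
        if unchanged > stable_rounds:         # result stayed the same long enough
            break
    return labels, centers


data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
labels, centers = k_means(data, k=2)
```

Each resulting cluster would then be handed to its own one-class support vector machine, as the multi-model construction unit 110 describes.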
The multi-model construction unit 110 is configured to obtain the data points of each of the multiple clusters and, according to the preset current abnormal point ratio and each cluster, construct a one-class support vector machine for abnormal point detection in one-to-one correspondence with each cluster.
In an embodiment, as shown in FIG. 8, the multi-model construction unit 110 includes:
a classification parameter acquisition unit 111, configured to obtain, according to the preset current abnormal point ratio and each cluster, the first parameter and the second parameter of the hyperplane of the one-class support vector machine corresponding to each cluster;
a model acquisition unit 112, configured to construct, according to the first parameter and the second parameter of the hyperplane and the current abnormal point ratio, a one-class support vector machine for abnormal point detection in one-to-one correspondence with each cluster.
The normal point center acquisition unit 120 is configured to classify the selected cluster using the one-class support vector machine and the current abnormal point ratio, to obtain the normal point center of the normal category in the classification result.
In an embodiment, as shown in FIG. 9, the normal point center acquisition unit 120 includes:
an initial classification unit 121, configured to classify the selected cluster using the corresponding one-class support vector machine and the current abnormal point ratio, to obtain a classification result corresponding to the selected cluster, where the classification result includes data points of the normal category and data points of the abnormal category;
a distance mean calculation unit 122, configured to obtain the mean of the data points of the normal category in the classification result, to obtain an initial normal point center;
a normal point center adjustment unit 123, configured to obtain, among the data points of the normal category in the classification result, the data point closest to the initial normal point center, and use it as the normal point center of the normal category.
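Units 122 and 123 together pick an actual data point as the center: the mean is computed first, then the normal point nearest to that mean is selected. A minimal sketch follows, assuming Euclidean distance as the measure of closeness; the function name `normal_point_center` is hypothetical.

```python
from typing import Sequence, Tuple

Point = Tuple[float, ...]


def normal_point_center(normal_points: Sequence[Point]) -> Point:
    """Return the normal point closest to the mean of all normal points."""
    if not normal_points:
        raise ValueError("at least one normal point is required")
    n = len(normal_points)
    dim = len(normal_points[0])
    # Unit 122: coordinate-wise mean gives the initial normal point center.
    mean = tuple(sum(p[d] for p in normal_points) / n for d in range(dim))
    # Unit 123: snap to the nearest real data point (assumed Euclidean).
    return min(
        normal_points,
        key=lambda p: sum((x - m) ** 2 for x, m in zip(p, mean)),
    )


normal = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (1.0, 1.0)]
center = normal_point_center(normal)  # mean is (0.75, 0.75); nearest point is (1.0, 1.0)
```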
The first residual calculation unit 130 is configured to obtain the residual sum of squares between each data point of the abnormal category in the classification result and the normal point center, to obtain the current residual sum of squares.
The first ratio update unit 140 is configured to subtract a preset step size from the current abnormal point ratio to update the current abnormal point ratio.
The second residual calculation unit 150 is configured to classify the selected cluster using the one-class support vector machine and the current abnormal point ratio to obtain the data points of the current abnormal category, and obtain the residual sum of squares between each data point of the current abnormal category and the normal point center as the next residual sum of squares.
The amplitude calculation unit 160 is configured to divide the difference between the next residual sum of squares and the current residual sum of squares by the step size to obtain the residual variation amplitude.
The judgment unit 170 is configured to determine whether the residual variation amplitude exceeds a preset variation amplitude threshold.
The optimal ratio acquisition unit 180 is configured to, if the residual variation amplitude exceeds the variation amplitude threshold, take the current abnormal point ratio plus the step size as the optimal abnormal point ratio.
In an embodiment, as shown in FIG. 10, the abnormal point ratio optimization device 100 based on clustering and SSE further includes:
a second ratio update unit 190, configured to, if the residual variation amplitude does not exceed the variation amplitude threshold, subtract the step size from the current abnormal point ratio to update the current abnormal point ratio, update the current residual sum of squares with the next residual sum of squares, and return to the step of classifying the samples to be classified using the one-class support vector machine and the current abnormal point ratio to obtain the data points of the current abnormal category, and obtaining the residual sum of squares between each data point of the current abnormal category and the normal point center as the next residual sum of squares.
The optimal classification unit 181 is configured to classify the selected cluster using the one-class support vector machine and the optimal abnormal point ratio to obtain the optimal classification result.
In this embodiment, once the optimal abnormal point ratio has been determined, the selected cluster can be classified using the one-class support vector machine and the optimal abnormal point ratio to obtain the optimal classification result, which yields the unsupervised classification model with the best classification performance.
The device achieves accurate classification of massive data and abnormal point detection within each class. The abnormal point ratio used during detection is obtained by automatic adjustment and does not need to be set from experience.
The above abnormal point ratio optimization device based on clustering and SSE may be implemented in the form of a computer program, which can run on a computer device as shown in FIG. 11.
Refer to FIG. 11, which is a schematic block diagram of a computer device provided by an embodiment of the present application. The computer device 500 is a server, which may be an independent server or a server cluster composed of multiple servers.
Referring to FIG. 11, the computer device 500 includes a processor 502, a memory, and a network interface 505 connected through a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.
The non-volatile storage medium 503 can store an operating system 5031 and a computer program 5032. When executed, the computer program 5032 causes the processor 502 to execute the abnormal point ratio optimization method based on clustering and SSE.
The processor 502 provides computing and control capabilities and supports the operation of the entire computer device 500.
The internal memory 504 provides an environment for running the computer program 5032 stored in the non-volatile storage medium 503. When the computer program 5032 is executed by the processor 502, it causes the processor 502 to execute the abnormal point ratio optimization method based on clustering and SSE.
The network interface 505 is used for network communication, such as the transmission of data. Those skilled in the art will understand that the structure shown in FIG. 11 is only a block diagram of the part of the structure related to the solution of the present application and does not limit the computer device 500 to which the solution is applied. A specific computer device 500 may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
The processor 502 is configured to run the computer program 5032 stored in the memory to implement the abnormal point ratio optimization method based on clustering and SSE disclosed in the embodiments of the present application.
Those skilled in the art will understand that the embodiment of the computer device shown in FIG. 11 does not limit the specific configuration of the computer device. In other embodiments, the computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components. For example, in some embodiments the computer device may include only a memory and a processor; in such embodiments, the structures and functions of the memory and the processor are consistent with the embodiment shown in FIG. 11 and are not repeated here.
It should be understood that, in the embodiments of the present application, the processor 502 may be a central processing unit (CPU), or it may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor or any conventional processor.
Another embodiment of the present application provides a computer-readable storage medium. The computer-readable storage medium may be a non-volatile computer-readable storage medium. The computer-readable storage medium stores a computer program which, when executed by a processor, implements the abnormal point ratio optimization method based on clustering and SSE disclosed in the embodiments of the present application.
The storage medium is a physical, non-transitory storage medium, for example a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, an optical disc, or any other physical storage medium that can store program code.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the equipment, devices, and units described above can be found in the corresponding processes of the foregoing method embodiments and are not repeated here.
The above are only specific implementations of the present application, but the protection scope of the present application is not limited to them. Any person skilled in the art can readily conceive of various equivalent modifications or replacements within the technical scope disclosed in the present application, and these modifications or replacements shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (20)

  1. An abnormal point ratio optimization method based on clustering and SSE, comprising:
    receiving a set of data points to be classified, and clustering the set of data points to be classified by k-means clustering to obtain multiple clusters;
    obtaining the data points of each of the multiple clusters, and constructing, according to a preset current abnormal point ratio and each cluster, a one-class support vector machine for abnormal point detection in one-to-one correspondence with each cluster;
    classifying the selected cluster using the one-class support vector machine and the current abnormal point ratio to obtain the normal point center of the normal category in the classification result;
    obtaining the residual sum of squares between each data point of the abnormal category in the classification result and the normal point center, to obtain the current residual sum of squares;
    subtracting a preset step size from the current abnormal point ratio to update the current abnormal point ratio;
    classifying the selected cluster using the one-class support vector machine and the current abnormal point ratio to obtain the data points of the current abnormal category, and obtaining the residual sum of squares between each data point of the current abnormal category and the normal point center as the next residual sum of squares;
    dividing the difference between the next residual sum of squares and the current residual sum of squares by the step size to obtain the residual variation amplitude;
    determining whether the residual variation amplitude exceeds a preset variation amplitude threshold;
    if the residual variation amplitude exceeds the variation amplitude threshold, taking the current abnormal point ratio plus the step size as the optimal abnormal point ratio; and
    classifying the selected cluster using the one-class support vector machine and the optimal abnormal point ratio to obtain the optimal classification result.
  2. The abnormal point ratio optimization method based on clustering and SSE according to claim 1, wherein clustering the set of data points to be classified by k-means clustering to obtain multiple clusters comprises:
    selecting from the set of data points to be classified as many data points as the preset number of clusters, and using the selected data points as the initial cluster centers of the clusters;
    partitioning the set of data points to be classified according to the dissimilarity between each data point in the set and each initial cluster center, to obtain an initial clustering result;
    obtaining the adjusted cluster center of each cluster from the initial clustering result;
    partitioning the set of data points to be classified according to the dissimilarity between each data point and the adjusted cluster centers, until the number of times the clustering result has remained unchanged exceeds a preset number, to obtain clusters corresponding to the preset number of clusters.
  3. The abnormal point ratio optimization method based on clustering and SSE according to claim 1, wherein after determining whether the residual variation amplitude exceeds the preset variation amplitude threshold, the method further comprises:
    if the residual variation amplitude does not exceed the variation amplitude threshold, subtracting the step size from the current abnormal point ratio to update the current abnormal point ratio, updating the current residual sum of squares with the next residual sum of squares, and returning to the step of classifying the samples to be classified using the one-class support vector machine and the current abnormal point ratio to obtain the data points of the current abnormal category, and obtaining the residual sum of squares between each data point of the current abnormal category and the normal point center as the next residual sum of squares.
  4. The abnormal point ratio optimization method based on clustering and SSE according to claim 1, wherein constructing, according to the preset current abnormal point ratio and each cluster, a one-class support vector machine for abnormal point detection in one-to-one correspondence with each cluster comprises:
    obtaining, according to the preset current abnormal point ratio and each cluster, the first parameter and the second parameter of the hyperplane of the one-class support vector machine corresponding to each cluster;
    constructing, according to the first parameter and the second parameter of the hyperplane and the current abnormal point ratio, a one-class support vector machine for abnormal point detection in one-to-one correspondence with each cluster.
  5. The abnormal point ratio optimization method based on clustering and SSE according to claim 1, wherein classifying the selected cluster using the one-class support vector machine and the current abnormal point ratio to obtain the normal point center of the normal category in the classification result comprises:
    classifying the selected cluster using the corresponding one-class support vector machine and the current abnormal point ratio to obtain a classification result corresponding to the selected cluster, wherein the classification result includes data points of the normal category and data points of the abnormal category;
    obtaining the mean of the data points of the normal category in the classification result, to obtain an initial normal point center;
    obtaining, among the data points of the normal category in the classification result, the data point closest to the initial normal point center, and using it as the normal point center of the normal category.
  6. The abnormal point ratio optimization method based on clustering and SSE according to claim 1, wherein after classifying the selected cluster using the one-class support vector machine and the optimal abnormal point ratio to obtain the optimal classification result, the method further comprises:
    sending the optimal classification result and the optimal abnormal point ratio to the business end corresponding to the set of data points to be classified, and synchronously sending the optimal classification result and the optimal abnormal point ratio to a cloud server;
    formatting and deleting the storage area corresponding to the optimal classification result and the optimal abnormal point ratio.
  7. The method for optimizing the abnormal point ratio based on clustering and SSE according to claim 6, wherein, before formatting and deleting the storage area corresponding to the optimal classification result and the optimal abnormal point ratio, the method further comprises:
    Dividing the difference between the preset current abnormal point ratio and the optimal abnormal point ratio by the step size to obtain the number of iterations;
    Sending the number of iterations to the business end corresponding to the set of data points to be classified, and synchronously sending the number of iterations to the cloud server.
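The iteration count in claim 7 is a single division. A small sketch under stated assumptions: the function name `iteration_count` is hypothetical, and rounding to the nearest integer is added here only to absorb floating-point error; the claim itself does not mention rounding.

```python
def iteration_count(initial_ratio, optimal_ratio, step):
    """Number of iterations = (preset ratio - optimal ratio) / step size.

    Rounding is an assumption made to absorb floating-point error.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    return round((initial_ratio - optimal_ratio) / step)
```

For instance, a preset ratio of 0.30, an optimal ratio of 0.22 and a step size of 0.02 correspond to four iterations.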
  8. A device for optimizing the abnormal point ratio based on clustering and SSE, comprising:
    A clustering unit, configured to receive a set of data points to be classified, and cluster the set of data points to be classified through k-means clustering to obtain multiple clusters;
    A multi-model construction unit, configured to obtain the data points corresponding to each cluster among the multiple clusters, and construct, according to the preset current abnormal point ratio and each cluster, a single-class support vector machine for abnormal point detection in one-to-one correspondence with each cluster;
    A normal point center obtaining unit, configured to classify the selected cluster according to the single-class support vector machine and the current abnormal point ratio to obtain the normal point center of the normal category in the classification result;
    A first residual calculation unit, configured to obtain the residual sum of squares between each data point of the abnormal category in the classification result and the normal point center, as the current residual sum of squares;
    A first ratio update unit, configured to subtract a preset step size from the current abnormal point ratio to update the current abnormal point ratio;
    A second residual calculation unit, configured to classify the selected cluster according to the single-class support vector machine and the current abnormal point ratio to obtain the data points of the current abnormal category, and obtain the residual sum of squares between each data point of the current abnormal category and the normal point center as the next residual sum of squares;
    An amplitude calculation unit, configured to divide the difference between the next residual sum of squares and the current residual sum of squares by the step size to obtain the residual variation range;
    A judging unit, configured to judge whether the residual variation range exceeds a preset variation range threshold;
    An optimal ratio obtaining unit, configured to, if the residual variation range exceeds the variation range threshold, take the current abnormal point ratio plus the step size as the optimal abnormal point ratio; and
    An optimal classification unit, configured to classify the selected cluster according to the single-class support vector machine and the optimal abnormal point ratio to obtain the optimal classification result.
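The arithmetic performed by the residual calculation, amplitude calculation and judging units can be sketched in a few lines. The function names are hypothetical, Euclidean distance is assumed for the residual, and comparing the magnitude of the variation against the threshold is an assumption as well: the claim only says the variation "exceeds" the threshold, without stating how its sign is handled.

```python
import math

def residual_sum_of_squares(abnormal_points, center):
    """Sum of squared Euclidean distances from each abnormal point to the normal point center."""
    return sum(math.dist(p, center) ** 2 for p in abnormal_points)

def residual_variation(next_sse, current_sse, step):
    """(next SSE - current SSE) / step size, as stated in the claim (signed value)."""
    return (next_sse - current_sse) / step

def exceeds_threshold(variation, threshold):
    """Assumption: the magnitude of the variation is compared with the threshold."""
    return abs(variation) > threshold
```

Because the abnormal point ratio is decreased at each step, the abnormal set usually shrinks and the signed variation is typically negative, which is why the sketch compares its absolute value.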
  9. The device for optimizing the abnormal point ratio based on clustering and SSE according to claim 8, wherein the clustering unit comprises:
    An initial cluster center obtaining unit, configured to select, from the set of data points to be classified, as many data points as the preset number of clusters, and use the selected data points as the initial cluster center of each cluster;
    An initial clustering unit, configured to divide the set of data points to be classified according to the dissimilarity value between each data point in the set and each initial cluster center, to obtain an initial clustering result;
    A cluster center adjustment unit, configured to obtain the adjusted cluster center of each cluster according to the initial clustering result;
    A cluster adjustment unit, configured to divide the set of data points to be classified according to the dissimilarity value from the adjusted cluster centers, until the number of times the clustering result remains the same exceeds a preset number of times, to obtain the clusters corresponding to the preset number of clusters.
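The four sub-units of claim 9 map onto the usual k-means loop. The following is a minimal pure-Python sketch under stated assumptions: the dissimilarity value is taken to be Euclidean distance, initial centers are drawn at random from the data, and the names `kmeans`, `stable_rounds`, `seed` and `max_iter` are hypothetical (`max_iter` is only a safety cap that the claim does not mention).

```python
import math
import random

def kmeans(points, k, stable_rounds=1, seed=0, max_iter=1000):
    """Cluster `points` into k clusters; stop once the assignment has stayed
    the same more than `stable_rounds` consecutive times."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers are actual data points
    labels, unchanged = None, 0
    for _ in range(max_iter):
        # Assign each point to the center with the smallest dissimilarity.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                      for p in points]
        unchanged = unchanged + 1 if new_labels == labels else 0
        labels = new_labels
        if unchanged > stable_rounds:
            break
        # Adjust each center to the mean of its members (empty clusters keep theirs).
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centers
```

The stopping rule follows the claim's wording: the loop ends only after the clustering result has repeated more than the preset number of times, rather than on the first unchanged assignment.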
  10. The device for optimizing the abnormal point ratio based on clustering and SSE according to claim 8, further comprising:
    A second ratio update unit, configured to, if the residual variation range does not exceed the variation range threshold, subtract the step size from the current abnormal point ratio to update the current abnormal point ratio, update the current residual sum of squares with the next residual sum of squares, and return to the step of classifying the samples to be classified according to the single-class support vector machine and the current abnormal point ratio to obtain the data points of the current abnormal category, and obtaining the residual sum of squares between each data point of the current abnormal category and the normal point center as the next residual sum of squares.
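Taken together, claims 8 and 10 describe a search that lowers the abnormal point ratio step by step until the residual sum of squares changes sharply. A hedged sketch of that loop follows. The classifier is passed in as a hypothetical callable `classify_abnormal(ratio)` that returns the abnormal points for a given ratio (in practice, a single-class support vector machine). Comparing the magnitude of the variation, stopping before the ratio reaches zero, and returning the last ratio when the threshold is never exceeded are all assumptions not stated in the claims.

```python
import math

def sse(points, center):
    """Residual sum of squares of `points` around `center` (Euclidean, assumed)."""
    return sum(math.dist(p, center) ** 2 for p in points)

def optimize_ratio(classify_abnormal, center, initial_ratio, step, threshold):
    """Lower the abnormal point ratio by `step` until the SSE variation exceeds `threshold`."""
    ratio = initial_ratio
    current_sse = sse(classify_abnormal(ratio), center)
    while ratio - step > 0:  # assumption: never test a non-positive ratio
        ratio -= step        # update the current abnormal point ratio
        next_sse = sse(classify_abnormal(ratio), center)
        variation = (next_sse - current_sse) / step
        if abs(variation) > threshold:  # assumption: magnitude is compared
            return ratio + step         # optimal ratio = current ratio + step
        current_sse = next_sse          # next SSE becomes the current SSE
    return ratio  # assumption: fall back to the last ratio tested
```

Any classifier that maps a ratio to a set of abnormal points can be plugged in, for example a function that keeps the requested fraction of points farthest from the center.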
  11. A computer device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the following steps when executing the computer program:
    Receiving a set of data points to be classified, and clustering the set of data points to be classified through k-means clustering to obtain multiple clusters;
    Obtaining the data points corresponding to each cluster among the multiple clusters, and constructing, according to the preset current abnormal point ratio and each cluster, a single-class support vector machine for abnormal point detection in one-to-one correspondence with each cluster;
    Classifying the selected cluster according to the single-class support vector machine and the current abnormal point ratio to obtain the normal point center of the normal category in the classification result;
    Obtaining the residual sum of squares between each data point of the abnormal category in the classification result and the normal point center, as the current residual sum of squares;
    Subtracting a preset step size from the current abnormal point ratio to update the current abnormal point ratio;
    Classifying the selected cluster according to the single-class support vector machine and the current abnormal point ratio to obtain the data points of the current abnormal category, and obtaining the residual sum of squares between each data point of the current abnormal category and the normal point center as the next residual sum of squares;
    Dividing the difference between the next residual sum of squares and the current residual sum of squares by the step size to obtain the residual variation range;
    Judging whether the residual variation range exceeds a preset variation range threshold;
    If the residual variation range exceeds the variation range threshold, taking the current abnormal point ratio plus the step size as the optimal abnormal point ratio; and
    Classifying the selected cluster according to the single-class support vector machine and the optimal abnormal point ratio to obtain the optimal classification result.
  12. The computer device according to claim 11, wherein clustering the set of data points to be classified through k-means clustering to obtain multiple clusters comprises:
    Selecting, from the set of data points to be classified, as many data points as the preset number of clusters, and using the selected data points as the initial cluster center of each cluster;
    Dividing the set of data points to be classified according to the dissimilarity value between each data point in the set and each initial cluster center, to obtain an initial clustering result;
    Obtaining the adjusted cluster center of each cluster according to the initial clustering result;
    Dividing the set of data points to be classified according to the dissimilarity value from the adjusted cluster centers, until the number of times the clustering result remains the same exceeds a preset number of times, to obtain the clusters corresponding to the preset number of clusters.
  13. The computer device according to claim 11, wherein, after judging whether the residual variation range exceeds the preset variation range threshold, the steps further comprise:
    If the residual variation range does not exceed the variation range threshold, subtracting the step size from the current abnormal point ratio to update the current abnormal point ratio, updating the current residual sum of squares with the next residual sum of squares, and returning to the step of classifying the samples to be classified according to the single-class support vector machine and the current abnormal point ratio to obtain the data points of the current abnormal category, and obtaining the residual sum of squares between each data point of the current abnormal category and the normal point center as the next residual sum of squares.
  14. The computer device according to claim 11, wherein constructing, according to the preset current abnormal point ratio and each cluster, a single-class support vector machine for abnormal point detection in one-to-one correspondence with each cluster comprises:
    Obtaining, according to the preset current abnormal point ratio and each cluster, the first parameter and the second parameter of the hyperplane corresponding to the single-class support vector machine of each cluster;
    Constructing, according to the first parameter and the second parameter of the hyperplane and the current abnormal point ratio, a single-class support vector machine for abnormal point detection in one-to-one correspondence with each cluster.
  15. The computer device according to claim 11, wherein classifying the selected cluster according to the single-class support vector machine and the current abnormal point ratio to obtain the normal point center of the normal category in the classification result comprises:
    Classifying the selected cluster according to the corresponding single-class support vector machine and the current abnormal point ratio to obtain a classification result corresponding to the selected cluster, wherein the classification result includes data points of the normal category and data points of the abnormal category;
    Obtaining the average value of the data points of the normal category in the classification result as the initial normal point center;
    Obtaining, among the data points of the normal category in the classification result, the data point closest to the initial normal point center, and using it as the normal point center corresponding to the data points of the normal category.
  16. The computer device according to claim 11, wherein, after classifying the selected cluster according to the single-class support vector machine and the optimal abnormal point ratio to obtain the optimal classification result, the steps further comprise:
    Sending the optimal classification result and the optimal abnormal point ratio to the business end corresponding to the set of data points to be classified, and synchronously sending the optimal classification result and the optimal abnormal point ratio to a cloud server;
    Formatting and deleting the storage area corresponding to the optimal classification result and the optimal abnormal point ratio.
  17. The method for optimizing the abnormal point ratio based on clustering and SSE according to claim 16, wherein, before formatting and deleting the storage area corresponding to the optimal classification result and the optimal abnormal point ratio, the method further comprises:
    Dividing the difference between the preset current abnormal point ratio and the optimal abnormal point ratio by the step size to obtain the number of iterations;
    Sending the number of iterations to the business end corresponding to the set of data points to be classified, and synchronously sending the number of iterations to the cloud server.
  18. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the following operations:
    Receiving a set of data points to be classified, and clustering the set of data points to be classified through k-means clustering to obtain multiple clusters;
    Obtaining the data points corresponding to each cluster among the multiple clusters, and constructing, according to the preset current abnormal point ratio and each cluster, a single-class support vector machine for abnormal point detection in one-to-one correspondence with each cluster;
    Classifying the selected cluster according to the single-class support vector machine and the current abnormal point ratio to obtain the normal point center of the normal category in the classification result;
    Obtaining the residual sum of squares between each data point of the abnormal category in the classification result and the normal point center, as the current residual sum of squares;
    Subtracting a preset step size from the current abnormal point ratio to update the current abnormal point ratio;
    Classifying the selected cluster according to the single-class support vector machine and the current abnormal point ratio to obtain the data points of the current abnormal category, and obtaining the residual sum of squares between each data point of the current abnormal category and the normal point center as the next residual sum of squares;
    Dividing the difference between the next residual sum of squares and the current residual sum of squares by the step size to obtain the residual variation range;
    Judging whether the residual variation range exceeds a preset variation range threshold;
    If the residual variation range exceeds the variation range threshold, taking the current abnormal point ratio plus the step size as the optimal abnormal point ratio; and
    Classifying the selected cluster according to the single-class support vector machine and the optimal abnormal point ratio to obtain the optimal classification result.
  19. The computer-readable storage medium according to claim 18, wherein clustering the set of data points to be classified through k-means clustering to obtain multiple clusters comprises:
    Selecting, from the set of data points to be classified, as many data points as the preset number of clusters, and using the selected data points as the initial cluster center of each cluster;
    Dividing the set of data points to be classified according to the dissimilarity value between each data point in the set and each initial cluster center, to obtain an initial clustering result;
    Obtaining the adjusted cluster center of each cluster according to the initial clustering result;
    Dividing the set of data points to be classified according to the dissimilarity value from the adjusted cluster centers, until the number of times the clustering result remains the same exceeds a preset number of times, to obtain the clusters corresponding to the preset number of clusters.
  20. The computer-readable storage medium according to claim 18, wherein, after judging whether the residual variation range exceeds the preset variation range threshold, the operations further comprise:
    If the residual variation range does not exceed the variation range threshold, subtracting the step size from the current abnormal point ratio to update the current abnormal point ratio, updating the current residual sum of squares with the next residual sum of squares, and returning to the step of classifying the samples to be classified according to the single-class support vector machine and the current abnormal point ratio to obtain the data points of the current abnormal category, and obtaining the residual sum of squares between each data point of the current abnormal category and the normal point center as the next residual sum of squares.
PCT/CN2019/117363 2019-01-28 2019-11-12 Method and device for optimizing abnormal point proportion based on clustering and sse WO2020155756A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910079217.9 2019-01-28
CN201910079217.9A CN109961086A (en) 2019-01-28 2019-01-28 Abnormal point ratio optimization method and device based on cluster and SSE

Publications (1)

Publication Number Publication Date
WO2020155756A1 (en) 2020-08-06

Family

ID=67023504

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/117363 WO2020155756A1 (en) 2019-01-28 2019-11-12 Method and device for optimizing abnormal point proportion based on clustering and sse

Country Status (2)

Country Link
CN (1) CN109961086A (en)
WO (1) WO2020155756A1 (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112801137A (en) * 2021-01-04 2021-05-14 中国石油天然气集团有限公司 Petroleum pipe quality dynamic evaluation method and system based on big data
CN113780354A (en) * 2021-08-11 2021-12-10 国网上海市电力公司 Telemetry data anomaly identification method and device for dispatching automation master station system
CN114077872A (en) * 2021-11-29 2022-02-22 税友软件集团股份有限公司 Data anomaly detection method and related device
CN116416078A (en) * 2023-06-09 2023-07-11 济南百思为科信息工程有限公司 Audit supervision method for maintaining fund accounting safety
CN116781984A (en) * 2023-08-21 2023-09-19 深圳市华星数字有限公司 Set top box data optimized storage method
CN116796214A (en) * 2023-06-07 2023-09-22 南京北极光生物科技有限公司 Data clustering method based on differential features
CN116933107A (en) * 2023-07-24 2023-10-24 水木蓝鲸(南宁)半导体科技有限公司 Data distribution boundary determination method, device, computer equipment and storage medium
CN117520994A (en) * 2024-01-03 2024-02-06 深圳市活力天汇科技股份有限公司 Method and system for identifying abnormal air ticket searching user based on user portrait and clustering technology

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109919185A (en) * 2019-01-28 2019-06-21 平安科技(深圳)有限公司 Abnormal point ratio optimization method, apparatus and computer equipment based on SSE
CN109961086A (en) * 2019-01-28 2019-07-02 平安科技(深圳)有限公司 Abnormal point ratio optimization method and device based on cluster and SSE
CN110458581B (en) * 2019-07-11 2024-01-16 创新先进技术有限公司 Method and device for identifying business turnover abnormality of commercial tenant
CN110990867B (en) * 2019-11-28 2023-02-07 上海观安信息技术股份有限公司 Database-based data leakage detection model modeling method and device, and leakage detection method and system
CN111459926A (en) * 2020-03-26 2020-07-28 广西电网有限责任公司电力科学研究院 Park comprehensive energy anomaly data identification method
CN111540202B (en) * 2020-04-23 2021-07-30 杭州海康威视系统技术有限公司 Similar bayonet determining method and device, electronic equipment and readable storage medium
CN111612085B (en) * 2020-05-28 2023-07-11 上海观安信息技术股份有限公司 Method and device for detecting abnormal points in peer-to-peer group
CN111914942A (en) * 2020-08-12 2020-11-10 烟台海颐软件股份有限公司 Multi-table-combined one-use energy anomaly analysis method
WO2022155939A1 (en) * 2021-01-25 2022-07-28 深圳大学 Data attribute grouping method, apparatus and device, and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105389636A (en) * 2015-12-11 2016-03-09 河海大学 Low-voltage area KFCM-SVR reasonable line loss prediction method
CN106778908A (en) * 2017-01-11 2017-05-31 湖南文理学院 A kind of novelty detection method and apparatus
CN108322363A (en) * 2018-02-12 2018-07-24 腾讯科技(深圳)有限公司 Propelling data abnormality monitoring method, device, computer equipment and storage medium
CN109961086A (en) * 2019-01-28 2019-07-02 平安科技(深圳)有限公司 Abnormal point ratio optimization method and device based on cluster and SSE

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112801137A (en) * 2021-01-04 2021-05-14 中国石油天然气集团有限公司 Petroleum pipe quality dynamic evaluation method and system based on big data
CN113780354A (en) * 2021-08-11 2021-12-10 国网上海市电力公司 Telemetry data anomaly identification method and device for dispatching automation master station system
CN113780354B (en) * 2021-08-11 2024-01-23 国网上海市电力公司 Remote measurement data anomaly identification method and device for dispatching automation master station system
CN114077872A (en) * 2021-11-29 2022-02-22 税友软件集团股份有限公司 Data anomaly detection method and related device
CN116796214A (en) * 2023-06-07 2023-09-22 南京北极光生物科技有限公司 Data clustering method based on differential features
CN116796214B (en) * 2023-06-07 2024-01-30 南京北极光生物科技有限公司 Data clustering method based on differential features
CN116416078B (en) * 2023-06-09 2023-08-15 济南百思为科信息工程有限公司 Audit supervision method for maintaining fund accounting safety
CN116416078A (en) * 2023-06-09 2023-07-11 济南百思为科信息工程有限公司 Audit supervision method for maintaining fund accounting safety
CN116933107A (en) * 2023-07-24 2023-10-24 水木蓝鲸(南宁)半导体科技有限公司 Data distribution boundary determination method, device, computer equipment and storage medium
CN116781984A (en) * 2023-08-21 2023-09-19 深圳市华星数字有限公司 Set top box data optimized storage method
CN116781984B (en) * 2023-08-21 2023-11-07 深圳市华星数字有限公司 Set top box data optimized storage method
CN117520994A (en) * 2024-01-03 2024-02-06 深圳市活力天汇科技股份有限公司 Method and system for identifying abnormal air ticket searching user based on user portrait and clustering technology
CN117520994B (en) * 2024-01-03 2024-04-19 深圳市活力天汇科技股份有限公司 Method and system for identifying abnormal air ticket searching user based on user portrait and clustering technology

Also Published As

Publication number Publication date
CN109961086A (en) 2019-07-02

Similar Documents

Publication Publication Date Title
WO2020155756A1 (en) Method and device for optimizing abnormal point proportion based on clustering and sse
WO2020155755A1 (en) Spectral clustering-based optimization method for anomaly point ratio, device, and computer apparatus
WO2020155752A1 (en) Outlier detection model verification method and apparatus, and computer device and storage medium
WO2020143304A1 (en) Loss function optimization method and apparatus, computer device, and storage medium
TWI539298B (en) Metrology sampling method with sampling rate decision scheme and computer program product thereof
WO2021142916A1 (en) Proxy-assisted evolutionary algorithm-based airfoil optimization method and apparatus
US9037518B2 (en) Classifying unclassified samples
WO2022111327A1 (en) Risk level data processing method and apparatus, and storage medium and electronic device
WO2021051529A1 (en) Method, apparatus and device for estimating cloud host resources, and storage medium
WO2021179544A1 (en) Sample classification method and apparatus, computer device, and storage medium
JP2005535130A (en) Methods, systems, and media for handling misrepresented measurement data in modern process control systems
JP5733229B2 (en) Classifier creation device, classifier creation method, and computer program
KR102117637B1 (en) Apparatus and method for preprocessinig data
WO2021169445A1 (en) Information recommendation method and apparatus, computer device, and storage medium
TWI709932B (en) Method, device and equipment for monitoring transaction indicators
WO2021098384A1 (en) Data abnormality detection method and apparatus
WO2018006631A1 (en) User level automatic segmentation method and system
CN111176953A (en) Anomaly detection and model training method thereof, computer equipment and storage medium
WO2020155754A1 (en) Outlier proportion optimization method and apparatus, and computer device and storage medium
KR20190008515A (en) Process Monitoring Device and Method using RTC method with improved SAX method
CN114116828A (en) Association rule analysis method, device and storage medium for multidimensional network index
CN114881167B (en) Abnormality detection method, abnormality detection device, electronic device, and medium
CN104992050A (en) Method for selecting prediction model of time sequence characteristic evaluation based on statistical signal processing
CN105306252A (en) Method for automatically judging server failures
CN109257952A (en) The method that is carried out data transmission with the data volume of reduction, system and program product

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 19913220; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 19913220; Country of ref document: EP; Kind code of ref document: A1)