CN110895706B - Method and device for acquiring target cluster number and computer system - Google Patents

Info

Publication number
CN110895706B
Authority
CN
China
Prior art keywords
cluster
clustering
division
data set
classified
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911081062.9A
Other languages
Chinese (zh)
Other versions
CN110895706A (en)
Inventor
李朋
施斌
彭虎
孙迁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suning Cloud Computing Co Ltd
Original Assignee
Suning Cloud Computing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suning Cloud Computing Co Ltd
Priority to CN201911081062.9A
Publication of CN110895706A
Application granted
Publication of CN110895706B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/23: Clustering techniques

Abstract

The embodiments of the application disclose a method, a device and a computer system for acquiring a target cluster number. The method comprises the following steps: acquiring a data set to be classified; performing cluster division on the data set to be classified according to each of at least two set cluster numbers to obtain a cluster division result corresponding to each cluster number; calculating a cluster validity index corresponding to each cluster number according to the cluster number and its cluster division result; and determining the cluster number whose cluster validity index has the smallest value among all the cluster validity indexes as the target cluster number. The target cluster number obtained in this way can be used to cluster the data set, which improves the accuracy of the division.

Description

Method and device for acquiring target clustering number and computer system
Technical Field
The invention relates to the field of clustering algorithms, in particular to a method, a device and a computer system for acquiring a target clustering number.
Background
With the advent of the big data era, and especially with the development of e-commerce and social networks, data affects every aspect of our lives. However, valuable data is submerged in an ocean of data. Cluster analysis algorithms are used to extract that value.
Clustering is a form of unsupervised machine learning and an important subject in data mining research. Without prior information, a clustering algorithm divides a data set into clusters according to the characteristics of the data, so that sample points in the same cluster are highly similar and sample points in different clusters are less similar, much like the saying "birds of a feather flock together". At present, clustering is widely applied to the analysis of e-commerce big data.
In practical applications of cluster analysis, a key problem must be solved: how to determine the value of the optimal cluster number $K_{opt}$. In practice, the cluster number K is usually known only as a fuzzy range. The optimal cluster number is therefore difficult to determine, which leads to relatively large errors when clustering algorithms are applied.
In order to solve the technical problem, the application provides a method for acquiring the target cluster number, which can be used for determining the optimal cluster number and improving the accuracy of cluster division.
Disclosure of Invention
In order to solve the defects of the prior art, the present invention mainly aims to provide a method, an apparatus and a computer system for obtaining a target cluster number.
In order to achieve the above object, the present invention provides, in a first aspect, a method for obtaining a target cluster number, the method including:
acquiring a data set to be classified;
respectively carrying out cluster division on the data set to be classified according to at least two set cluster numbers to obtain cluster division results corresponding to all the cluster numbers;
calculating a clustering effectiveness index corresponding to each clustering number according to the clustering numbers and the corresponding clustering division results;
and determining the cluster number whose cluster validity index has the smallest value among all the cluster validity indexes as the target cluster number.
In some embodiments, the calculating the cluster validity indicator corresponding to each cluster number specifically includes:
using the formula
Figure BDA0002263972160000021
Calculating a clustering effectiveness index corresponding to the clustering number;
wherein K represents the cluster number of the current cluster division; NCVI(K) is the cluster validity index corresponding to the cluster number K; n represents the number of sample points in the data set to be classified $D = \{x_1, x_2, \ldots, x_n\}$ in the Euclidean space $R^m$; $C = \{C_1, C_2, \ldots, C_K\}$ is the corresponding cluster division result, which divides the data set D into K clusters; G represents the center points of all the clusters; $d(v_i, V)$ represents the distance from the coordinates of the i-th cluster center point to the coordinates of all the cluster center points; and $d(x_{ij}, v_i)$ represents the distance from the coordinates of the j-th sample point of the i-th cluster to the coordinates of the center point of the i-th cluster.
In some embodiments, before performing cluster partitioning on the to-be-classified data set according to at least two set cluster numbers, the method further includes:
and acquiring target data, and preprocessing the target data to obtain the data set to be classified.
In some embodiments, the preprocessing the target data comprises:
and cleaning the target data, quantifying preset dimension attributes, and generating a data set to be classified.
In some embodiments, the clustering number K ranges from a value of
Figure BDA0002263972160000022
And K is an integer.
In some embodiments, the performing cluster division on the to-be-classified data sets according to at least two set cluster numbers respectively includes:
performing cluster division on the data set to be classified according to at least two set cluster numbers by using a K-means algorithm.
In some embodiments, after determining the cluster number whose cluster validity index has the smallest value among all the cluster validity indexes as the target cluster number, the method further includes:
and determining a cluster division result corresponding to the target cluster number as a target cluster division result and outputting the target cluster number and the target cluster division result.
In a second aspect, the present invention provides an apparatus for obtaining a target cluster number, the apparatus comprising:
the data acquisition module is used for acquiring a data set to be classified;
the cluster division module is used for carrying out cluster division on the data set to be classified according to at least two set cluster numbers;
the calculation module is used for calculating the clustering effectiveness index corresponding to each clustering number according to the clustering numbers and the corresponding clustering division results;
and the processing module is used for determining the cluster number whose cluster validity index has the smallest value among all the cluster validity indexes as the target cluster number.
In some embodiments, the apparatus further comprises:
and the data processing module is used for acquiring target data and preprocessing the target data to obtain the data set to be classified.
In a third aspect, the present invention provides a computer system, comprising:
one or more processors;
and memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform operations comprising:
acquiring a data set to be classified;
respectively carrying out cluster division on the data set to be classified according to at least two set cluster numbers to obtain cluster division results corresponding to all the cluster numbers;
calculating a clustering effectiveness index corresponding to each clustering number according to the clustering numbers and the corresponding clustering division results;
and determining the cluster number whose cluster validity index has the smallest value among all the cluster validity indexes as the target cluster number.
According to the specific embodiments provided herein, the present application discloses the following technical effects:
according to the method and the device, a plurality of cluster division results corresponding to different cluster numbers are obtained through calculation by presetting a plurality of cluster numbers, and then the cluster effectiveness index corresponding to each cluster number is calculated based on the cluster numbers and the cluster division results so as to evaluate the cluster division results. Based on the optimal (namely minimum) clustering effectiveness index, the optimal clustering number can be determined, so that clustering can be performed based on the optimal clustering number, and the accuracy of clustering division results is improved.
Further, compared with the existing clustering effectiveness index, the clustering effectiveness index NCVI provided by the application greatly reduces the calculated amount, reduces the occupation of calculation resources and the requirement on the calculation capability of a calculation platform in the evaluation process, and improves the calculation efficiency.
Of course, it is not necessary for any product to achieve all of the above-described advantages at the same time for the practice of the present application.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a graph of the trend of the cluster validity indicator function NCVI (K) in the present application;
FIG. 2 is a two-dimensional spatial distribution diagram of a data set to be classified after cluster division according to the target cluster number;
FIG. 3 is a diagram illustrating the evaluation effect of each cluster partition evaluation index;
FIG. 4 is a block diagram of a computer system according to the present application;
FIG. 5 is a flow chart of the method of the present application;
FIG. 6 is a structural diagram of the apparatus of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
With the development of the internet field, large internet companies generate billions of user behavior log data every day. In order to provide more accurate and high-quality services for users, users need to be classified according to the log data, users with similar use habits are classified into one type, then different service schemes are formulated for different types of users, and the use experience of the users is improved. This process can be implemented by clustering the log data using a clustering algorithm to obtain clusters of users of different classifications.
When there is a data set with n samples, the cluster division process is a process of dividing the data set into a specified number of clusters according to a division rule. The purpose of cluster partitioning is to make the data similarity in the same cluster as large as possible, and the difference of the data not in the same cluster as large as possible. The specified number is the cluster number described in the present application.
As described in the background, in current practical clustering applications, the value of the clustering number K is usually in a fuzzy range rather than an exact value. And whether the K value is accurate or not is directly related to the clustering division result: the inaccurate K value makes the clustering partitioning result inaccurate, and conversely, the more accurate the K value is, the more accurate the clustering partitioning result is.
In order to select and determine the optimal clustering number, the concept of a clustering effectiveness index is utilized. The cluster effectiveness index is used for evaluating the advantages and disadvantages of the cluster division effect, when the cluster effectiveness index shows that the cluster division result is excellent, the corresponding cluster number is excellent, and when the cluster effectiveness index shows that the cluster division result is inferior, the corresponding cluster number is not the cluster number which is required to be determined. On the basis, the method sets the clustering number in the approximate range, then calculates the clustering division result corresponding to each clustering number of the data set to be classified in the range, and then evaluates each clustering division result by adopting the clustering effectiveness index, wherein the clustering division result corresponding to the minimum (optimal) clustering effectiveness index is optimal, and the clustering number is optimal at the moment. The method can accurately determine the optimal clustering number, and further improve the accuracy of clustering division results. Further, compared with the existing clustering effectiveness index, the clustering effectiveness index NCVI provided by the application greatly reduces the calculated amount, so that the calculation efficiency is improved.
The present application will be described in detail below by way of specific examples:
Example One
Taking user log data as an example, the scheme provided by the application can be implemented by the following steps:
step one, collecting log data;
Different log collection schemes may be employed for different terminal types. For example, a JS system can collect the access logs of users on the PC and mini-program clients, and an SDK embedded in the mobile APP can collect the access logs of mobile APP users.
After users generate log data such as clicks, exposures, visits, searches, orders and membership events on the APP, PC and mini-program clients, the log data is collected in real time into data middleware such as Kafka, MQ or Flume, from which it can then be obtained in real time as needed.
Step two, cleaning the collected log data, quantizing preset dimension attributes, and generating a data set to be classified;
after the log data is collected, the log data needs to be extracted, cleaned, converted and loaded, and the processed log data is formatted and stored. And finally, performing quantification operation on the preset dimension attributes according to the requirement so as to perform next clustering analysis.
The preset dimension attributes are dimension attributes related to clustering division, such as attributes of users.
The quantization operation on the preset dimension attribute may be a processing operation on the data, such as giving a weight to the dimension attribute of the data and removing an irrelevant dimension attribute.
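The cleaning and quantization described in step two can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name `preprocess`, the attribute names (`clicks`, `orders`, `session_id`) and the weight values are all hypothetical. Records with a missing retained attribute are dropped (cleaning), attributes without a weight are treated as irrelevant and removed, and the remaining attributes are multiplied by their weights.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[float]]


def preprocess(records: List[Record], weights: Dict[str, float]) -> List[List[float]]:
    """Clean records and quantify the chosen dimension attributes.

    Attributes absent from `weights` are treated as irrelevant and dropped;
    records missing any retained attribute are discarded.
    """
    attributes = sorted(weights)  # fixed column order
    data_set = []
    for record in records:
        values = [record.get(name) for name in attributes]
        if any(v is None for v in values):
            continue  # cleaning: skip incomplete records
        data_set.append([weights[name] * v for name, v in zip(attributes, values)])
    return data_set


# Hypothetical log-derived records; the attribute names are illustrative only.
raw_logs = [
    {"clicks": 10.0, "orders": 2.0, "session_id": 991.0},
    {"clicks": None, "orders": 1.0, "session_id": 992.0},
    {"clicks": 4.0, "orders": 0.0, "session_id": 993.0},
]
data_set = preprocess(raw_logs, weights={"clicks": 0.5, "orders": 2.0})
print(data_set)  # [[5.0, 4.0], [2.0, 0.0]]
```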
Step three, acquiring a data set to be classified;
The data set to be classified is imported in advance in preparation for the next operation.
Step four, respectively carrying out cluster division on the data set to be classified according to at least two set cluster numbers to obtain cluster division results corresponding to all the cluster numbers;
The at least two cluster numbers can be obtained by taking integers from a preset interval. Assuming that the cluster number is represented by k, the preset interval is preferably
Figure BDA0002263972160000071
Of course, the at least two cluster numbers may also be any given discontinuous positive integer, which is not limited in this application.
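As a rough sketch of how the candidate cluster numbers might be generated, the helper below either takes integers from an interval or accepts an explicit, possibly discontinuous, list. The preferred interval itself is given only in the figure above and is not reproduced here; the bounds used in the code (2 up to the integer square root of n) are an assumption based on a common rule of thumb, and the function name is hypothetical.

```python
import math
from typing import Iterable, List, Optional


def candidate_cluster_numbers(n_samples: int,
                              explicit: Optional[Iterable[int]] = None) -> List[int]:
    """Return at least two candidate cluster numbers, sorted ascending."""
    if explicit is not None:
        # Any given positive integers, not necessarily consecutive.
        ks = sorted({k for k in explicit if 1 <= k <= n_samples})
    else:
        # ASSUMPTION: interval [2, floor(sqrt(n))]; the patent's preferred
        # interval is defined in a figure that is not available here.
        ks = list(range(2, math.isqrt(n_samples) + 1))
    if len(ks) < 2:
        raise ValueError("at least two cluster numbers are required")
    return ks


print(candidate_cluster_numbers(100))             # [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(candidate_cluster_numbers(100, [7, 3, 5]))  # [3, 5, 7]
```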
Step five, calculating a cluster validity index corresponding to each cluster number according to the cluster numbers and the corresponding cluster division results;
specifically, the process of calculating the cluster validity index may be:
Let the cluster number be denoted by K. In the Euclidean space $R^m$, the data set to be classified $D = \{x_1, x_2, \ldots, x_n\}$ has n sample points, and the clustering algorithm divides D into K clusters $C = \{C_1, C_2, \ldots, C_K\}$ according to the cluster number K. The new cluster validity index NCVI provided by the present invention is then:
Figure BDA0002263972160000072
where G represents the center points of all the clusters; $d(v_i, V)$ represents the distance from the i-th cluster center point to all the cluster center points; and $d(x_{ij}, v_i)$ represents the distance from the j-th sample point of the i-th cluster to the center point of the i-th cluster.
Substituting a cluster number K into the formula NCVI(K) gives the cluster validity index corresponding to that cluster number.
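The NCVI formula itself is contained in the figure above and is not reproduced in this text, so the sketch below computes only the two kinds of distance that the definition refers to: the distance from each cluster center to an overall center, and the distance from each sample point to the center of its own cluster. It assumes that the overall center is the mean of the cluster centers, which is one possible reading of the description; the function name and the returned dictionary keys are hypothetical, and how the terms are combined into NCVI(K) is deliberately left out.

```python
import math
from typing import Dict, List, Sequence

Point = Sequence[float]


def ncvi_distance_terms(points: List[Point], labels: List[int],
                        centers: List[Point]) -> Dict[str, list]:
    """Return the raw distances referred to by the NCVI definition."""
    # ASSUMPTION: the overall center G is the mean of the cluster centers.
    g = [sum(col) / len(centers) for col in zip(*centers)]
    center_to_g = [math.dist(v, g) for v in centers]
    # One list of point-to-own-center distances per cluster.
    point_to_center: List[List[float]] = [[] for _ in centers]
    for x, i in zip(points, labels):
        point_to_center[i].append(math.dist(x, centers[i]))
    return {"center_to_G": center_to_g, "point_to_center": point_to_center}


terms = ncvi_distance_terms(
    points=[(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (12.0, 0.0)],
    labels=[0, 0, 1, 1],
    centers=[(1.0, 0.0), (11.0, 0.0)],
)
print(terms["center_to_G"])      # [5.0, 5.0]
print(terms["point_to_center"])  # [[1.0, 1.0], [1.0, 1.0]]
```

Only n point-to-center distances and K center-to-center distances are evaluated, which matches the text's remark that no pairwise distances between sample points are needed.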
Fig. 1 shows a trend graph of a cluster validity indicator function NCVI (K) according to a cluster number K, where the NCVI (K) function is composed of two parts:
(1)
Figure BDA0002263972160000073
representing the degree of dispersion of sample points between different clusters. The value of $f_1$ increases as K increases; an increasing $f_1$ indicates greater variability among the data in different clusters.
(2)
Figure BDA0002263972160000074
representing the degree of dispersion between the center points of all clusters. The value of $f_2$ decreases as K increases; a decreasing $f_2$ indicates that the difference between data in the same cluster is decreasing.
In the process of solving the optimal clustering number of the data set to be classified, experiments show that when the value of NCVI (K) is minimum, the corresponding K value is the optimal clustering number or the approximate optimal clustering number.
As shown in FIG. 1, graphical analysis of the two curves shows that the value of the cluster validity index function NCVI(K) is smallest when $f_1 = f_2$. That is, the K value at which the degree of dispersion of the sample points among all the clusters equals the degree of dispersion among the center points of all the clusters is the optimal or approximately optimal cluster number of the data set to be classified.
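The crossing-point observation can be illustrated with a small sketch. Because K takes only integer values, the two components will rarely be exactly equal, so the code picks the K at which they are closest; that nearest-point rule is an assumption added here, and the component values are made up purely for illustration (they are not results from the patent).

```python
from typing import Dict


def crossover_cluster_number(f1: Dict[int, float], f2: Dict[int, float]) -> int:
    """Return the K at which the two component curves are closest."""
    common = sorted(set(f1) & set(f2))
    if not common:
        raise ValueError("f1 and f2 share no cluster number")
    return min(common, key=lambda k: abs(f1[k] - f2[k]))


# Made-up component values, for illustration only.
f1_values = {2: 0.2, 3: 0.5, 4: 0.9, 5: 1.4}  # increases with K
f2_values = {2: 1.6, 3: 0.9, 4: 0.8, 5: 0.3}  # decreases with K
print(crossover_cluster_number(f1_values, f2_values))  # 4
```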
Therefore, the new clustering validity index function NCVI provided by the application considers the dispersion degree of all sample points in the data set from the global (inter-cluster) angle and the local (intra-cluster) angle together, and can truly reflect the dispersion degree between the inter-cluster samples and the intra-cluster samples.
Step six, determining the cluster number whose cluster validity index has the smallest value among all the cluster validity indexes as the target cluster number.
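Step six amounts to an arg-min over the computed index values. A minimal sketch, assuming the index values have already been computed and stored per cluster number; the function name is hypothetical, the tie-breaking rule (prefer the smaller K) is an assumption, and the scores are made-up numbers.

```python
from typing import Dict, Tuple


def select_target_cluster_number(index_by_k: Dict[int, float]) -> Tuple[int, float]:
    """Pick the cluster number whose validity index value is smallest."""
    if len(index_by_k) < 2:
        raise ValueError("at least two cluster numbers are required")
    # ASSUMPTION: ties are broken in favour of the smaller cluster number.
    best_k = min(index_by_k, key=lambda k: (index_by_k[k], k))
    return best_k, index_by_k[best_k]


scores = {2: 0.91, 3: 0.42, 4: 0.57, 5: 0.63}  # made-up index values
print(select_target_cluster_number(scores))    # (3, 0.42)
```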
Fig. 2 shows a two-dimensional spatial distribution diagram of a cluster partitioning result obtained by performing cluster partitioning using the number of target clusters determined by the method provided by the present application, including sample points and XY axes representing coordinates of the sample points.
The division results are evaluated with the cluster validity index NCVI and five commonly used cluster validity indexes (CVIs), and the results are shown in FIG. 3. As can be seen from the test results in FIG. 3, the NCVI index provided by the present invention finds the optimal cluster number of the data set to be classified; the COP index and the DBI index find only an approximately optimal cluster number; and the I index, CH index and DI index find neither the optimal nor an approximately optimal cluster number. Meanwhile, the new cluster validity index function NCVI provided by the application only needs to calculate the distances from the center point of each cluster to the cluster center points, and from every point in each cluster to the center point of that cluster, so the amount of calculation and the required computing capacity are reduced and the evaluation efficiency is improved.
Therefore, compared with the conventional clustering effectiveness index, the clustering effectiveness index function NCVI provided by the invention has higher accuracy and smaller calculated amount, and the accuracy and efficiency of clustering division are improved.
According to the target cluster number and the corresponding cluster division result obtained by the method, the optimal cluster division result of the log data is obtained, and subsequently, the users can be classified according to the result, so that personalized services are provided for the users of different types, and the use experience of the users is improved.
Example Two
Corresponding to the first embodiment, the present application provides a method for acquiring a target cluster number, and as shown in fig. 5, the method includes:
510. acquiring a data set to be classified;
the data to be classified may be pre-stored, or may be imported only when classification is required, which is not limited in the present application.
520. Respectively carrying out cluster division on a data set to be classified according to at least two set cluster numbers to obtain cluster division results corresponding to all the cluster numbers;
the clustering may be performed using a clustering algorithm, preferably a K-means algorithm.
The K-means algorithm is an iterative clustering algorithm. It randomly selects K objects as the initial cluster centers, calculates the distance between each object and each cluster center, and assigns each object to the nearest cluster center. A cluster center together with the objects assigned to it represents a cluster. Each time a sample is assigned, the center of the cluster is recalculated from the objects currently in the cluster. This process is repeated until a termination condition is met. The termination condition may be that no (or a minimum number of) objects are reassigned to different clusters, that no (or a minimum number of) cluster centers change, or that the sum of squared errors reaches a local minimum.
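The iteration described above can be sketched in a few lines of standard-library Python. Note one deliberate difference: the sketch uses the common batch variant, in which all cluster centers are recomputed after every point has been assigned, rather than recomputing a center after each individual assignment as the text describes. The termination condition used is "no object is reassigned", and the function name, parameters and sample points are illustrative assumptions.

```python
import math
import random
from typing import List, Sequence, Tuple

Point = Sequence[float]


def k_means(points: List[Point], k: int, max_iter: int = 100,
            seed: int = 0) -> Tuple[List[List[float]], List[int]]:
    """Batch K-means; returns (cluster centers, cluster label per point)."""
    rng = random.Random(seed)
    # Randomly select k objects as the initial cluster centers.
    centers = [list(p) for p in rng.sample(points, k)]
    labels: List[int] = []
    for _ in range(max_iter):
        # Assign each object to its nearest cluster center.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break  # termination: no object was reassigned
        labels = new_labels
        # Recompute each center from the objects currently in the cluster.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels


blobs = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = k_means(blobs, k=2)
print(labels)  # the first three points share one label, the last three the other
```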
530. Calculating a clustering effectiveness index corresponding to each clustering number according to the clustering numbers and the corresponding clustering division results;
preferably, the calculating the cluster validity index corresponding to each cluster number specifically includes:
using the formula
Figure BDA0002263972160000091
Calculating a clustering effectiveness index corresponding to the clustering number;
wherein n represents the number of sample points in the data set to be classified $D = \{x_1, x_2, \ldots, x_n\}$ in the Euclidean space $R^m$; K represents the cluster number of the current cluster division; the value given by the formula NCVI(K) is the cluster validity index corresponding to that cluster number K; the cluster division uses a clustering algorithm to divide the data set D to be classified into K clusters $C = \{C_1, C_2, \ldots, C_K\}$; G represents all of the cluster center points; $d(v_i, V)$ represents the distance from the coordinates of the i-th cluster center point to the coordinates of all the cluster center points; and $d(x_{ij}, v_i)$ represents the distance from the coordinates of the j-th sample point of the i-th cluster to the coordinates of the center point of the i-th cluster.
540. Determining the cluster number whose cluster validity index has the smallest value among all the cluster validity indexes as the target cluster number.
Preferably, before the data sets to be classified are respectively clustered and divided according to at least two set clustering numbers, the method further includes:
501. and acquiring target data, and preprocessing the target data to obtain a data set to be classified.
The preprocessing of the target data may specifically include:
502. and cleaning the target data, quantifying preset dimension attributes, and generating a data set to be classified.
In some embodiments, after determining the cluster number whose cluster validity index has the smallest value among all the cluster validity indexes as the target cluster number, the method further includes:
540. Outputting the target cluster number and the cluster division result corresponding to the target cluster number.
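The method of this example can be sketched end to end as follows, with the validity index passed in as a function. Two things in the sketch are swapped in and are not taken from the patent: (1) `stand_in_index` is a simple placeholder that is smaller for better partitions, used only because the NCVI formula is not reproduced in this text; (2) K-means is seeded with a deterministic farthest-point rule instead of the random seeding described above, so that the example is reproducible. All function names and the sample points are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

Point = Sequence[float]
Partition = Tuple[List[List[float]], List[int]]  # (centers, label per point)


def farthest_point_init(points: List[Point], k: int) -> List[List[float]]:
    """Deterministic seeding (an assumption; the text uses random seeds)."""
    centers = [list(points[0])]
    while len(centers) < k:
        nxt = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(list(nxt))
    return centers


def k_means(points: List[Point], k: int, max_iter: int = 100) -> Partition:
    centers = farthest_point_init(points, k)
    labels: List[int] = []
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break  # no object was reassigned
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels


def stand_in_index(points: List[Point], partition: Partition) -> float:
    """Placeholder index, NOT the NCVI formula: mean point-to-own-center
    distance divided by the smallest distance between two cluster centers."""
    centers, labels = partition
    intra = sum(math.dist(p, centers[c]) for p, c in zip(points, labels)) / len(points)
    separation = min(math.dist(a, b)
                     for i, a in enumerate(centers) for b in centers[i + 1:])
    return intra / separation


def acquire_target_cluster_number(
        points: List[Point], cluster_numbers: Sequence[int],
        validity_index: Callable[[List[Point], Partition], float],
) -> Tuple[int, Partition]:
    partitions = {k: k_means(points, k) for k in cluster_numbers}                  # step 520
    indexes = {k: validity_index(points, partitions[k]) for k in cluster_numbers}  # step 530
    target_k = min(indexes, key=lambda k: indexes[k])                              # step 540
    return target_k, partitions[target_k]


# Step 510: a small made-up data set with three well-separated groups.
data = [(0, 0), (0, 1), (1, 0), (10, 0), (10, 1), (11, 0), (5, 10), (5, 11), (6, 10)]
target_k, (centers, labels) = acquire_target_cluster_number(data, [2, 3, 4], stand_in_index)
print(target_k)  # 3
```

Passing the index as a function keeps the selection logic independent of the particular index, so the NCVI formula could be substituted for the placeholder without touching the rest of the sketch.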
Example Three
Corresponding to the second embodiment, the present application further provides an apparatus for acquiring a target cluster number, as shown in fig. 6, where the apparatus includes:
a data obtaining module 610, configured to obtain a data set to be classified;
the data to be classified may be pre-stored, or may be imported only when classification is required, which is not limited in the present application.
The cluster partitioning module 620 is configured to perform cluster partitioning on the to-be-classified data set according to at least two set cluster numbers;
the clustering division is preferably divided by using a K-means algorithm, and the set clustering number can be obtained by rounding in a preset interval.
If n samples are collected in the data set to be classified, the set clustering number is k, and the preferred value range of the clustering number k is
Figure BDA0002263972160000101
And k is an integer.
A calculating module 630, configured to calculate a cluster validity indicator corresponding to each cluster number according to the cluster numbers and corresponding cluster partitioning results;
preferably, the specific process of calculating the cluster validity index corresponding to the cluster number includes:
using the formula
Figure BDA0002263972160000111
Calculating a clustering effectiveness index corresponding to the clustering number;
wherein n represents the number of sample points in the data set to be classified $D = \{x_1, x_2, \ldots, x_n\}$ in the Euclidean space $R^m$; K represents the cluster number of the current cluster division; the value given by the formula NCVI(K) is the cluster validity index corresponding to that cluster number K; the cluster division uses a clustering algorithm to divide the data set D to be classified into K clusters $C = \{C_1, C_2, \ldots, C_K\}$; G represents all of the cluster center points; $d(v_i, V)$ represents the distance from the coordinates of the i-th cluster center point to the coordinates of all the cluster center points; and $d(x_{ij}, v_i)$ represents the distance from the coordinates of the j-th sample point of the i-th cluster to the coordinates of the center point of the i-th cluster.
Preferably, the value range of the clustering number K is
Figure BDA0002263972160000112
And K is an integer.
Preferably, the algorithm used for clustering is a K-means algorithm.
The processing module 640 is configured to determine the cluster number whose cluster validity index has the smallest value among all the cluster validity indexes as the target cluster number.
Preferably, the apparatus further comprises:
and the data processing module 650 is configured to acquire target data and preprocess the target data to obtain the to-be-classified data set.
Preferably, the preprocessing the target data may specifically include:
and cleaning the target data, quantifying preset dimension attributes, and generating a data set to be classified.
After the target data are collected, the target data are extracted, cleaned, converted and loaded, and the processed target data are formatted and stored. And finally, performing quantification operation on the preset dimension attributes according to the requirement so as to perform next clustering analysis.
The preset dimension attribute is a dimension attribute related to clustering division, such as an attribute of a user.
The quantization operation on the preset dimension attribute may be a processing operation on the data, such as giving a weight to the dimension attribute of the data and removing an irrelevant dimension attribute.
Preferably, the apparatus may further include:
the output module 660 is configured to output the target cluster number and the cluster division result corresponding to the target cluster number.
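One way to picture how the modules fit together is a small class whose fields stand for the modules and whose `run` method wires them in order. This is only a structural sketch: the class name, field names and the stub functions used in the demonstration are hypothetical, and the stubs do not perform real clustering or compute the NCVI.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Data = List[Sequence[float]]
Labels = List[int]  # cluster label per sample


@dataclass
class TargetClusterNumberDevice:
    acquire: Callable[[], Data]                   # data obtaining module 610
    partition: Callable[[Data, int], Labels]      # cluster partitioning module 620
    index: Callable[[Data, Labels, int], float]   # calculating module 630
    cluster_numbers: Sequence[int]

    def run(self) -> Tuple[int, Labels]:
        data = self.acquire()
        results = {k: self.partition(data, k) for k in self.cluster_numbers}
        scores = {k: self.index(data, results[k], k) for k in self.cluster_numbers}
        target = min(scores, key=lambda k: scores[k])  # processing module 640
        return target, results[target]                 # output module 660


# Stubs for illustration only: a round-robin "partition" and an index
# that simply favours K = 2.
device = TargetClusterNumberDevice(
    acquire=lambda: [[0.0], [1.0], [10.0], [11.0]],
    partition=lambda data, k: [i % k for i in range(len(data))],
    index=lambda data, labels, k: float(abs(k - 2)),
    cluster_numbers=[2, 3],
)
print(device.run())  # (2, [0, 1, 0, 1])
```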
Example Four
In accordance with the above embodiments, the present application also provides a computer system comprising one or more processors; and memory associated with the one or more processors, the memory for storing program instructions that, when read and executed by the one or more processors, perform operations comprising:
respectively carrying out cluster division on a data set to be classified according to at least two set cluster numbers to obtain cluster division results corresponding to all the cluster numbers;
calculating a clustering effectiveness index corresponding to each clustering number according to the clustering numbers and the corresponding clustering division results;
and determining the cluster number whose cluster validity index has the smallest value among all the cluster validity indexes as the target cluster number.
Fig. 4 illustrates an architecture of a computer system, which may specifically include a processor 1510, a video display adapter 1511, a disk drive 1512, an input/output interface 1513, a network interface 1514, and a memory 1520. The processor 1510, video display adapter 1511, disk drive 1512, input/output interface 1513, network interface 1514, and memory 1520 may be communicatively coupled via a communication bus 1530.
The processor 1510 may be implemented as a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits, and is configured to execute related programs to implement the technical solution provided by the present application.
The memory 1520 may be implemented in the form of a ROM (Read Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 1520 may store an operating system 1521 for controlling the operation of the computer system 1500 and a Basic Input Output System (BIOS) for controlling low-level operations of the computer system 1500. In addition, a web browser 1523, a data storage management system 1524, an icon font processing system 1525, and the like may also be stored. The icon font processing system 1525 may be an application program that implements the operations of the foregoing steps in this embodiment of the application. In summary, when the technical solution provided by the present application is implemented by software or firmware, the relevant program code is stored in the memory 1520 and called for execution by the processor 1510.
The input/output interface 1513 is used for connecting an input/output module to realize information input and output. The input/output module may be configured as a component in a device (not shown) or may be external to the device to provide a corresponding function. The input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output devices may include a display, a speaker, a vibrator, an indicator light, etc.
The network interface 1514 is used to connect a communication module (not shown) to enable the device to communicate with other devices. The communication module may communicate in a wired manner (such as USB or a network cable) or in a wireless manner (such as a mobile network, Wi-Fi or Bluetooth).
The bus 1530 includes a path to transfer information between the various components of the device, such as the processor 1510, the video display adapter 1511, the disk drive 1512, the input/output interface 1513, the network interface 1514, and the memory 1520.
In addition, the computer system 1500 may also obtain information of specific extraction conditions from the virtual resource object extraction condition information database 1541 for performing condition judgment, and the like.
It should be noted that although the above devices only show the processor 1510, the video display adapter 1511, the disk drive 1512, the input/output interface 1513, the network interface 1514, the memory 1520, the bus 1530, etc., in a specific implementation, the devices may also include other components necessary for proper operation. Furthermore, it will be understood by those skilled in the art that the apparatus described above may also include only the components necessary to implement the solution of the present application, and not necessarily all of the components shown in the figures.
From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus a necessary general-purpose hardware platform. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, or the like, and which includes several instructions for enabling a computer device (which may be a personal computer, a cloud server, or a network device) to execute the method according to the embodiments, or some parts of the embodiments, of the present application.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and should not be taken as limiting the scope of the present invention, which is intended to cover any modifications, equivalents, improvements, etc. within the spirit and scope of the present invention.

Claims (9)

1. A method for obtaining a target cluster number, the method comprising:
acquiring a data set to be classified;
respectively carrying out cluster division on the data set to be classified according to at least two set cluster numbers to obtain cluster division results corresponding to all the cluster numbers;
calculating a clustering effectiveness index corresponding to each clustering number according to the clustering numbers and the corresponding clustering division results;
determining, as a target clustering number, the clustering number corresponding to the clustering effectiveness index having the smallest value among all the clustering effectiveness indexes;
the calculating of the cluster validity index corresponding to each cluster number specifically includes:
using the formula
Figure FDA0003767563400000011
Calculating a clustering effectiveness index corresponding to the clustering number;
wherein K represents the cluster number of the current cluster division; NCVI(K) is the clustering effectiveness index corresponding to the cluster number K; n represents the number of sample points in the data set to be classified D = {x_1, x_2, …, x_n} in the Euclidean space R^m; C = {C_1, C_2, …, C_k} is the cluster division result obtained by dividing the data set D to be classified into K clusters; G represents the center point of all the clusters; d(v_i, V) represents the distance from the coordinates of the i-th cluster center point to the coordinates of the center point of all the clusters; and d(x_ij, v_i) represents the distance from the coordinates of the j-th sample point of the i-th cluster to the coordinates of the center point of the i-th cluster.
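For illustration, the two distance terms defined in claim 1 could be computed as in the Python sketch below. It does not evaluate NCVI(K) itself, because the formula is given only as an image. The sketch assumes Euclidean distance and assumes that the center point of all the clusters is the mean of the cluster center points; both are assumptions, and the function names are hypothetical.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def distance_terms(data, labels):
    """Return cluster centers, the overall center, and both distance terms."""
    clusters = {}
    for point, lab in zip(data, labels):
        clusters.setdefault(lab, []).append(point)
    centers = {lab: centroid(pts) for lab, pts in clusters.items()}
    # Assumption: the overall center is the mean of the cluster centers.
    overall = centroid(list(centers.values()))
    # Distance from each cluster center to the overall center.
    between = {lab: math.dist(c, overall) for lab, c in centers.items()}
    # Distance from each sample point to the center of its own cluster.
    within = {
        lab: [math.dist(p, centers[lab]) for p in pts]
        for lab, pts in clusters.items()
    }
    return centers, overall, between, within


data = [[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]]
labels = [0, 0, 1, 1]
centers, overall, between, within = distance_terms(data, labels)
print(centers)  # {0: [0.0, 1.0], 1: [4.0, 1.0]}
print(overall)  # [2.0, 1.0]
print(between)  # {0: 2.0, 1: 2.0}
print(within)   # {0: [1.0, 1.0], 1: [1.0, 1.0]}
```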
2. The obtaining method according to claim 1, wherein before performing cluster division on the data set to be classified according to the set at least two cluster numbers, the method further comprises:
and acquiring target data, preprocessing the target data, and generating the data set to be classified.
3. The acquisition method according to claim 2, wherein the preprocessing the target data comprises:
and cleaning the target data and quantifying a preset dimension attribute to generate the data set to be classified.
4. The acquisition method according to claim 1, wherein the value range of the clustering number K is
Figure FDA0003767563400000021
and K is an integer.
5. The obtaining method according to any one of claims 1, 3 and 4, wherein the performing cluster division on the data set to be classified according to at least two set cluster numbers respectively comprises:
performing cluster division on the data set to be classified according to the at least two set cluster numbers, respectively, by using a K-means algorithm.
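As a minimal sketch of the K-means step named in claim 5, the standard Lloyd iteration can be written in plain Python as follows. The random initialization from sample points, the fixed seed, the iteration cap and the handling of empty clusters are assumptions of this sketch and are not specified by the claim.

```python
import math
import random


def kmeans(data, k, max_iter=100, seed=0):
    """Lloyd's algorithm; returns (labels, centers) for k clusters."""
    rng = random.Random(seed)
    # Assumed initialization: k distinct sample points chosen at random.
    centers = [list(p) for p in rng.sample(data, k)]
    labels = [None] * len(data)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda i: math.dist(p, centers[i])) for p in data
        ]
        if new_labels == labels:  # assignments stable, so stop
            break
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for i in range(k):
            members = [p for p, lab in zip(data, labels) if lab == i]
            if members:  # keep the old center if a cluster becomes empty
                centers[i] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centers


data = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centers = kmeans(data, 2)
print(labels)
```

Running `kmeans` once per candidate cluster number yields one cluster division result per cluster number, which is the input required by the index calculation.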
6. The obtaining method according to claim 1, wherein after determining, as the target cluster number, the cluster number corresponding to the clustering effectiveness index having the smallest value among all the clustering effectiveness indexes, the method further comprises:
and determining a cluster division result corresponding to the target cluster number as a target cluster division result, and outputting the target cluster number and the target cluster division result.
7. An apparatus for obtaining a target cluster number, the apparatus comprising:
the data acquisition module is used for acquiring a data set to be classified;
the cluster division module is used for carrying out cluster division on the data set to be classified according to at least two set cluster numbers;
the calculation module is used for calculating the clustering effectiveness index corresponding to each clustering number according to the clustering numbers and the corresponding clustering division results;
the processing module is used for determining, as a target clustering number, the clustering number corresponding to the clustering effectiveness index having the smallest value among all the clustering effectiveness indexes;
the calculating the cluster effectiveness index corresponding to each cluster number specifically includes:
using the formula
Figure FDA0003767563400000031
Calculating a clustering effectiveness index corresponding to the clustering number;
wherein K represents the clustering number of the current clustering division; NCVI(K) is the clustering effectiveness index corresponding to the clustering number K; n represents the number of sample points in the data set to be classified D = {x_1, x_2, …, x_n} in the Euclidean space R^m; C = {C_1, C_2, …, C_k} is the cluster division result obtained by dividing the data set D to be classified into K clusters; G represents the center point of all the clusters; d(v_i, V) represents the distance from the coordinates of the i-th cluster center point to the coordinates of the center point of all the clusters; and d(x_ij, v_i) represents the distance from the coordinates of the j-th sample point of the i-th cluster to the coordinates of the center point of the i-th cluster.
8. The acquisition device according to claim 7, characterized in that the device further comprises:
and the data processing module is used for acquiring target data and preprocessing the target data to obtain the data set to be classified.
9. A computer system, the system comprising:
one or more processors;
and a memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform operations comprising:
acquiring a data set to be classified;
respectively carrying out cluster division on the data set to be classified according to at least two set cluster numbers to obtain cluster division results corresponding to all the cluster numbers;
calculating a clustering effectiveness index corresponding to each clustering number according to the clustering numbers and the corresponding clustering division results;
determining, as a target clustering number, the clustering number corresponding to the clustering effectiveness index having the smallest value among all the clustering effectiveness indexes;
the calculating the cluster effectiveness index corresponding to each cluster number specifically includes:
using the formula
Figure FDA0003767563400000041
Calculating a clustering effectiveness index corresponding to the clustering number;
wherein K represents the cluster number of the current cluster division; NCVI(K) is the clustering effectiveness index corresponding to the cluster number K; n represents the number of sample points in the data set to be classified D = {x_1, x_2, …, x_n} in the Euclidean space R^m; C = {C_1, C_2, …, C_k} is the cluster division result obtained by dividing the data set D to be classified into K clusters; G represents the center point of all the clusters; d(v_i, V) represents the distance from the coordinates of the i-th cluster center point to the coordinates of the center point of all the clusters; and d(x_ij, v_i) represents the distance from the coordinates of the j-th sample point of the i-th cluster to the coordinates of the center point of the i-th cluster.
CN201911081062.9A 2019-11-07 2019-11-07 Method and device for acquiring target cluster number and computer system Active CN110895706B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911081062.9A CN110895706B (en) 2019-11-07 2019-11-07 Method and device for acquiring target cluster number and computer system

Publications (2)

Publication Number Publication Date
CN110895706A CN110895706A (en) 2020-03-20
CN110895706B true CN110895706B (en) 2022-12-27

Family

ID=69786677

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911081062.9A Active CN110895706B (en) 2019-11-07 2019-11-07 Method and device for acquiring target cluster number and computer system

Country Status (1)

Country Link
CN (1) CN110895706B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114519101B (en) * 2020-11-18 2023-06-06 易保网络技术(上海)有限公司 Data clustering method and system, data storage method and system and storage medium
CN112381163B (en) * 2020-11-20 2023-07-25 平安科技(深圳)有限公司 User clustering method, device and equipment
CN112925990B (en) * 2021-02-26 2022-09-06 上海哔哩哔哩科技有限公司 Target group classification method and device

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108268526A (en) * 2016-12-30 2018-07-10 中国移动通信集团北京有限公司 A kind of data classification method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant