CN116610987A - Kmeans log classification method and device based on distributed sample screening


Info

Publication number
CN116610987A
Authority
CN
China
Prior art keywords
log
sample
center
cluster
centers
Prior art date
Legal status
Pending
Application number
CN202310721373.7A
Other languages
Chinese (zh)
Inventor
程永龙
王钰
范淑君
王睿
Current Assignee
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date
Application filed by Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202310721373.7A
Publication of CN116610987A
Legal status: Pending


Classifications

    • G06F18/24: Physics; Computing; Electric digital data processing; Pattern recognition; Analysing; Classification techniques
    • G06F16/335: Information retrieval of unstructured textual data; Querying; Filtering based on additional data, e.g. user or group profiles
    • G06F16/35: Information retrieval of unstructured textual data; Clustering; Classification
    • G06F18/23213: Pattern recognition; Clustering techniques; Non-hierarchical techniques using statistics or function optimisation, with fixed number of clusters, e.g. K-means clustering
    • Y02D10/00: Climate change mitigation technologies in information and communication technologies; Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The application provides a kmeans log classification method and device based on distributed sample screening, relating to the field of big data. The method comprises the following steps: acquiring N log sample sets and corresponding copies, determining K centers in each log sample set, and clustering the log samples in each copy to obtain the cluster center of each cluster of the copy; forming an initial center set from the cluster centers of all clusters in the copies, determining the cosine distances between the cluster centers, and fusing the cluster centers to obtain a first center set; calculating K minimum distances for each log sample set from the cosine distances between the cluster centers in the first center set and the K centers of the log sample set, and determining a grade label for the log sample set; and extracting a target number of log samples from all log sample sets based on the grade labels, determining K centroids from these log samples, and carrying out Kmeans clustering. The application improves the convergence speed and the clustering effect of the clustering algorithm.

Description

Kmeans log classification method and device based on distributed sample screening
Technical Field
The application relates to the technical field of big data, in particular to a kmeans log classification method and device based on distributed sample screening.
Background
Log clustering aims to find similar logs. When analyzing the errors encountered by each user in the process of using an application program, logs of the same error type are divided into one group; classification then reveals error-prone habits in user behavior, and suggestions can be given when similar problems are encountered later.
At present, the prior art generally uses the K-means clustering algorithm (kmeans clustering algorithm for short) for text clustering: the data are divided into K groups, K objects are randomly selected as initial cluster centers, the distance between each object and each seed cluster center is calculated, and each object is assigned to the cluster center closest to it. A cluster center and the objects assigned to it represent one cluster. Each time a sample is assigned, the cluster center of the cluster is recalculated from the objects currently in the cluster. This process repeats until a certain termination condition is met.
However, the existing kmeans clustering algorithm randomly extracts the initial centroids in the initialization stage, so the extracted centroids may be too scattered or too concentrated and insufficiently uniform, which slows clustering convergence and degrades the clustering effect.
Disclosure of Invention
The application provides a kmeans log classification method and device based on distributed sample screening, which are used for solving the problems of low convergence speed and poor clustering effect of the conventional clustering algorithm.
In a first aspect, the present application provides a kmeans log classification method based on distributed sample screening, including:
obtaining N log sample sets and corresponding copies of each log sample set, and determining K centers in each log sample set, wherein each log sample set comprises at least one log sample, the copies are identical to the log samples in the corresponding log sample sets, and N and K are positive integers;
according to K centers in each log sample set, carrying out cluster division on log samples in a copy of the log sample set to obtain K clusters of the copy and cluster centers of each cluster;
forming an initial center set by cluster centers of all clusters in the copy, and carrying out fusion processing on the cluster centers according to cosine distances among the cluster centers in the initial center set until a preset fusion ending condition is met to obtain a first center set;
according to cosine distances between cluster centers in the first center set and K centers in the log sample set, K minimum distances of the log sample set are calculated;
determining a grade label of the log sample set according to the K minimum distances of the log sample set;
extracting a target number of log samples from all log sample sets according to the grade label of each log sample set to form a sample data set;
and determining K centroids from the sample data set, and carrying out Kmeans clustering.
In a second aspect, the present application provides a kmeans log classification device based on distributed sample screening, including:
the acquisition module is used for acquiring N log sample sets and copies corresponding to each log sample set, determining K centers in each log sample set, wherein the log sample set comprises at least one log sample, the copies are identical to the log samples in the corresponding log sample sets, and N and K are positive integers;
the center determining module is used for carrying out cluster division on the log samples in the copies of the log sample sets according to the K centers in each log sample set to obtain K clusters of the copies and cluster centers of each cluster;
the center set module is used for forming cluster centers of all clusters in the copy into an initial center set, and carrying out fusion processing on the cluster centers according to cosine distances among the cluster centers in the initial center set until a preset fusion ending condition is met to obtain a first center set;
the distance calculation module is used for calculating K minimum distances of the log sample set according to cosine distances between cluster centers in the first center set and K centers in the log sample set;
the label determining module is used for determining the grade label of the log sample set according to the K minimum distances of the log sample set;
the data set composition module is used for extracting a target number of log samples from all log sample sets according to the grade label of each log sample set to form a sample data set;
and the clustering module is used for determining K centroids from the sample data set and carrying out Kmeans clustering.
In a third aspect, the present application provides an electronic device comprising: a processor, and a memory communicatively coupled to the processor; the memory stores computer-executable instructions; the processor executes the computer-executable instructions stored in the memory to implement the method as described above.
In a fourth aspect, the present application provides a computer readable storage medium having stored therein computer executable instructions for performing a method as described above when executed by a processor.
In a fifth aspect, the application provides a computer program product for implementing a method as described above when being executed by a processor.
According to the kmeans log classification method and device based on distributed sample screening, massive log samples are divided into N different log sample sets, so that the log samples can be processed in a distributed, parallel manner and the initialization time is reduced. The initialization stage of the kmeans algorithm is also improved: during initialization, the centroids of all sample sets are fused into a central centroid representing the whole sample population, and the quality of each sample set is measured by calculating its distance to the central centroid. Samples are then extracted in proportions that depend on the quality grade to form a high-quality sample set from which the centroids are selected, which improves centroid selection quality and accelerates initialization convergence.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
Fig. 1 is a schematic diagram of a clustering algorithm provided by an embodiment of the present application;
Fig. 2 is a flow chart of a kmeans log classification method based on distributed sample screening according to an embodiment of the present application;
Fig. 3 is a schematic diagram of a log sample according to an embodiment of the present application;
Fig. 4 is a schematic diagram of the first-stage MapReduce processing flow of the kmeans log classification method based on distributed sample screening according to an embodiment of the present application;
Fig. 5 is a schematic diagram of the second-stage MapReduce processing flow of the kmeans log classification method based on distributed sample screening according to an embodiment of the present application;
Fig. 6 is a schematic diagram of the initial centroid obtaining flow of the kmeans log classification method based on distributed sample screening according to an embodiment of the present application;
Fig. 7 is a schematic structural diagram of a kmeans log classification device based on distributed sample screening according to an embodiment of the present application;
Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Specific embodiments of the present application have been shown by way of the above drawings and will be described in more detail below. The drawings and the written description are not intended to limit the scope of the inventive concepts in any way, but rather to illustrate the inventive concepts to those skilled in the art by reference to the specific embodiments.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the application. Rather, they are merely examples of apparatus and methods consistent with aspects of the application as detailed in the accompanying claims.
It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are information and data authorized by the user or fully authorized by each party, and the collection, use and processing of the related data need to comply with related laws and regulations and standards, and provide corresponding operation entries for the user to select authorization or rejection.
It should be noted that the kmeans log classifying method and device based on distributed sample screening provided by the application can be used in the technical field of big data, and can also be used in any field except the technical field of big data.
K-means clustering algorithm: an iterative cluster analysis algorithm. The data are divided into K groups, K objects are randomly selected as initial cluster centers, the distance between each object and each seed cluster center is calculated, and each object is assigned to the cluster center closest to it. A cluster center and the objects assigned to it represent one cluster. Each time a sample is assigned, the cluster center of the cluster is recalculated from the objects currently in the cluster. This process repeats until a certain termination condition is met. The termination condition may be that no objects (or only a minimum number of objects) are reassigned to different clusters.
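For illustration only, the following is a minimal Python sketch of the standard kmeans procedure just described (random initial centers, assignment, center recomputation, termination check); the function name, the use of Euclidean distance and the convergence test are assumptions of this sketch, not details fixed by the application.

    import numpy as np

    def kmeans(data, k, max_iter=100, seed=0):
        """Standard kmeans on an (n, d) array: random init, assign, recompute, repeat."""
        rng = np.random.default_rng(seed)
        centers = data[rng.choice(len(data), size=k, replace=False)]
        for _ in range(max_iter):
            # Calculate the distance between each object and each cluster center,
            # and assign each object to the closest cluster center.
            dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Recalculate each cluster center from the objects currently in the cluster.
            new_centers = np.array([data[labels == j].mean(axis=0) if np.any(labels == j)
                                    else centers[j] for j in range(k)])
            if np.allclose(new_centers, centers):  # termination condition: centers stable
                break
            centers = new_centers
        return centers, labels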
Text clustering is the task of finding similar text and is significant for data mining. The conventional clustering algorithm has the following disadvantages: (1) centroid selection in the initialization stage is random, and the initial centroids may be too close together or too scattered and insufficiently uniform, which slows the convergence of the algorithm; (2) an initial centroid may fall on a discrete point or noise data, which hinders the convergence of subsequent clustering, so the clustering effect of the algorithm is not ideal; (3) the larger the data set, the more pronounced these defects are and the longer the initialization takes.
Aiming at the problems that the existing clustering algorithm converges slowly and clusters poorly when performing K-means clustering on large volumes of text, the application provides a kmeans log classification method and device based on distributed sample screening. The initialization of massive log samples is parallelized in a distributed way, which reduces the initialization time, and the initialization stage of the kmeans algorithm is improved. During initialization, the centroids of all sample sets are fused into a central centroid representing the whole sample population, and the quality of each sample set is measured by calculating its distance to the central centroid. Samples are then extracted in proportions that depend on the quality grade to form a high-quality sample set from which the centroids are selected. Finally, during centroid selection, each centroid is isolated by deleting the neighboring points around it, and the selected centroid positions are reasonably dispersed by combining a maximum-distance-product strategy with a virtual centroid. This accelerates initialization convergence and improves the quality of initial centroid selection.
The technical scheme of the present application, and how it solves the above technical problems, is described in detail below with specific embodiments. The following embodiments may be combined with each other, and the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
For example, fig. 1 is a schematic diagram of a clustering algorithm provided in an embodiment of the present application. As shown in fig. 1, taking ten sample logs Z1 to Z10 as an example, Z1 to Z8 are clustered into one cluster and Z9 to Z10 into another. If an abnormal outlier exists that belongs to neither cluster, the sample log it represents can be identified as an abnormal log. A log, in the computing sense, is generated by an application platform during operation; each log records the date, time, user, action and other information of an associated operation. The clustering algorithm can be applied to a text clustering scene: the input mass of historical error logs is clustered and divided into several groups. After high-quality centroids are obtained for the large volume of input log data, K log groups can be obtained quickly and with better quality; by organizing the error logs of each group, the error habits of users in the process of using the application programs and the corresponding solutions are compiled into solution assets, providing a reference for operation and maintenance personnel when analyzing similar problems later.
Fig. 2 is a flow chart of a kmeans log classification method based on distributed sample screening, which is provided in an embodiment of the present application, and the method may be applied to a text clustering scene, as shown in fig. 2, and the method may specifically include the following steps:
step S201, N log sample sets and corresponding copies of each log sample set are obtained, K centers in each log sample set are determined, each log sample set comprises at least one log sample, the copies are identical to the corresponding log samples in the corresponding log sample sets, and N and K are positive integers.
In this embodiment, the log samples included in each log sample set may differ; for example, log sample set R1 contains log samples R11, R12 and R13, while log sample set R2 contains log samples R21, R22 and R23.
The number of log sample sets may be two or more. Before the subsequent steps are performed, each log sample set obtains a corresponding copy by replication; the copy is identical to the log sample set. The log sample set may change during later processing, for example the number of log samples in it may be reduced, so the change can be observed through the copy, which facilitates targeted operations on the copy, as described in detail later.
In this embodiment, for a large number of log samples, the large number of log samples may be categorized into different sets, so that N log sample sets in this embodiment may be formed.
In this embodiment, since each log sample set may contain multiple log samples, some log sample lies at the center of the others and can serve as a center. The log samples in a log sample set may, for example, be divided into a region A, a region B and a region C, where a center exists in each of region A, region B and region C.
Step S202, according to K centers in each log sample set, carrying out cluster division on log samples in a copy of the log sample set to obtain K clusters of the copy and cluster centers of each cluster.
In this embodiment, the log sample set has K centers (each center can be understood as one log sample), and in addition, there may be other log samples in the log sample set, where the log samples may be clustered based on the K centers to obtain K clusters.
Taking the example that the log sample set R1 has 2 centers (for example, log samples R11 and R12 are both centers): when log samples R13, R14 and R15 are also present in the copy corresponding to log sample set R1, if log sample R13 is near center R12, then log sample R13 and center R12 are divided into a first cluster, and if log samples R14 and R15 are near center R11, then log samples R14 and R15 and center R11 are divided into a second cluster, so the copy has two clusters (i.e., the first cluster and the second cluster).
For example, fig. 3 is a schematic diagram of a log sample provided in an embodiment of the present application, as shown in fig. 3, by digitizing the log samples, replacing each log sample with a coordinate point, and selecting Z11 and Z12 as the center, where the log sample Z13 is near the center Z12, then the log sample Z13 and the center Z12 are divided into a first cluster, and the log sample Z14 and the log sample Z15 are near the center Z11, then the log samples Z14, Z11, and Z15 are divided into a second cluster.
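A minimal sketch of this cluster division step in Python, assuming the log samples have already been vectorized; the function names and the use of the member mean as the cluster center are assumptions of this sketch rather than details fixed by the application.

    import numpy as np

    def cosine_distance(a, b):
        """Cosine distance between two sample vectors: 1 - cosine similarity."""
        return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    def divide_into_clusters(copy_samples, centers):
        """Assign every log sample in the copy to its nearest center by cosine distance."""
        clusters = {i: [] for i in range(len(centers))}
        for sample in copy_samples:
            nearest = min(range(len(centers)),
                          key=lambda i: cosine_distance(sample, centers[i]))
            clusters[nearest].append(sample)
        # The cluster center of each cluster is taken here as the mean of its members.
        cluster_centers = [np.mean(clusters[i], axis=0) if clusters[i] else centers[i]
                           for i in clusters]
        return clusters, cluster_centers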
Step S203, cluster centers of all clusters in the copy are formed into an initial center set, and fusion processing is conducted on the cluster centers according to cosine distances among the cluster centers in the initial center set until a preset fusion ending condition is met, so that a first center set is obtained.
In this embodiment, the cluster centers and log samples may be vectorized so that the cosine distance between them can be calculated. When the cosine distance between two cluster centers is too small (e.g., less than a threshold), the two cluster centers may be fused to obtain a new cluster center.
The fusion end condition may be set, for example, when the number of cluster centers in the initial center set is fused and reduced to a certain number threshold, the fusion is stopped, and the initial center set is taken as the first center set.
Step S204, according to cosine distances between cluster centers in the first center set and K centers in the log sample set, K minimum distances of the log sample set are calculated.
In this embodiment, the first center set may include a plurality of cluster centers. For each center in the log sample set, the cosine distance to every cluster center is calculated and the nearest cluster center is found; the cosine distance between that cluster center and the center is the minimum distance corresponding to the center, denoted disti for the i-th center, so the minimum distance of the K-th center is distK.
Step S205, determining the grade label of the log sample set according to the K minimum distances of the log sample set.
In this embodiment, since a huge amount of log samples are divided into N different log sample sets, each log sample set has K minimum distances, and based on the K minimum distances, the class labels of the corresponding log sample sets can be determined.
Illustratively, the level labels of the log sample set may be classified into level 1, level 2, level 3, and so on, and the greater the K minimum distances, the higher the level of the level label of the log sample set corresponding to the K minimum distances.
Step S206, extracting a target number of log samples from all the log sample sets according to the grade label of each log sample set to form a sample data set.
In this embodiment, the extraction ratio may be set, for example, the extraction ratio corresponding to the level label 1 is set to be 60%, the extraction ratio corresponding to the level label 2 is set to be 40%, and so on, and the level labels of each log sample set are different, and the corresponding target number of extraction is different. The resulting log samples are finally extracted and combined to form a sample dataset.
And S207, determining K centroids from the sample data set, and carrying out Kmeans clustering.
In this embodiment, the K-means clustering algorithm described above randomly selects K objects as initial cluster centers at the start; here, the screened cluster centers (i.e., initial centroids) are used as those K objects, replacing the random selection step of the Kmeans clustering algorithm, so as to improve the convergence speed of the algorithm and the clustering effect.
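For example, with scikit-learn the screened centroids can be passed in as the initial cluster centers in place of random initialization; this is only a sketch of the idea, since the application does not prescribe a particular library, and the function name is illustrative.

    import numpy as np
    from sklearn.cluster import KMeans

    def kmeans_with_screened_centroids(samples, initial_centroids):
        """Run Kmeans with the K screened centroids instead of random initialization."""
        init = np.asarray(initial_centroids)
        model = KMeans(n_clusters=len(init), init=init, n_init=1)
        return model.fit_predict(np.asarray(samples))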
According to the embodiment of the application, massive log samples are divided into N different log sample sets, so that the log samples can be processed in a distributed, parallel manner and the initialization time is reduced. The initialization stage of the kmeans algorithm is also improved: during initialization, the centroids of all sample sets are fused into a central centroid representing the whole sample population, and the quality of each sample set is measured by calculating its distance to the central centroid. Samples are then extracted in proportions that depend on the quality grade to form a high-quality sample set from which the centroids are selected, which improves centroid selection quality and accelerates initialization convergence.
In some embodiments, the step S201 may be specifically implemented by the following steps: acquiring an initial log set, wherein the initial log set comprises at least one log sample; equally dividing the initial log set to obtain N log sample sets; backing up each log sample set to obtain a copy of each log sample set; randomly selecting a first target sample from the log sample set, and deleting a first log sample of which the cosine distance from the first target sample in the log sample set is smaller than a first preset threshold value; acquiring the center of the deleted first log sample as a first center; randomly selecting a second target sample from the log sample set, and deleting the second log sample of which the cosine distance from the second target sample in the log sample set is smaller than a first preset threshold value; acquiring the center of the deleted second log sample as a second center; randomly selecting a Kth target sample from the log sample set, and deleting the Kth log sample of which the cosine distance from the Kth target sample in the log sample set is smaller than a first preset threshold value; and acquiring the center of the deleted Kth log sample as the Kth center.
In this embodiment, the initial log set is equally divided into several parts dataseti, for example dataset1, dataset2 and so on, and then the i-th log sample set dataseti is backed up, the copy corresponding to log sample set dataseti being denoted dataset_base.
A first target sample Cn1 may be randomly selected from dataset1, the first log samples in dataset1 whose cosine distance to the first target sample Cn1 is smaller than the first preset threshold T1 are deleted, and the center of these first log samples is denoted C1. A second target sample Cn2 is then randomly selected from dataset1, the second log samples in dataset1 whose cosine distance to the second target sample Cn2 is smaller than the first preset threshold T1 are deleted, and the center of these second log samples is denoted C2. This continues until the K-th center Ck is found. The K centers are backed up as klist_bak1.
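The K-center selection just described can be sketched in Python as follows, assuming vectorized samples; taking the mean of the deleted neighborhood as its "center" is one plausible reading and an assumption of this sketch.

    import random
    import numpy as np

    def cosine_distance(a, b):
        return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    def select_k_centers(dataset, k, t1):
        """Pick K centers: repeatedly draw a random target sample, delete its
        neighborhood (cosine distance < t1) and keep the center of the deleted group."""
        pool = list(dataset)
        centers = []
        for _ in range(k):
            target = random.choice(pool)
            neighbors = [s for s in pool if cosine_distance(s, target) < t1]
            centers.append(np.mean(neighbors, axis=0))  # center of the deleted samples
            pool = [s for s in pool if cosine_distance(s, target) >= t1]
            if not pool:  # nothing left to draw from
                break
        return centers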
According to the embodiment of the application, each log sample set is processed separately, which can increase the processing speed and shorten the processing time. Meanwhile, by continuously screening the log samples in the log sample set by cosine distance, K center samples can be obtained accurately, which ensures the accuracy of the subsequent initial centroid selection and improves the clustering effect of Kmeans.
In some embodiments, the step S202 may be specifically implemented by the following steps: calculating a cosine distance between each log sample in the copy and each center; dividing each log sample and the center to which its cosine distance is smallest into the same cluster, to obtain K clusters and the cluster center of each cluster.
In this embodiment, the log samples in the copy corresponding to each log sample set are complete (nothing is deleted from the copy; it is the log sample set from which, as mentioned in the description of step S201, the first log sample, the second log sample and so on are deleted). After K centers are determined in the log sample set, the corresponding centers can be found in the copy, and the center with which a log sample is grouped into the same cluster can be determined from the cosine distance between each log sample and each center.
For example, referring to fig. 3 above, log sample Z13 has a small cosine distance to center Z12, so log sample Z13 and center Z12 form one cluster, and log samples Z14 and Z15 have small cosine distances to center Z11, so log samples Z14 and Z15 form a cluster with center Z11.
According to the embodiment of the application, log samples are divided into the corresponding clusters according to their cosine distances, which prevents a log sample that is far from a given center from being divided into the same cluster as that center; this effectively improves the accuracy of clustering and further improves the accuracy of initial centroid selection.
In some embodiments, the step S203 may be specifically implemented by the following steps: comparing cosine distances of the centers of all clusters in the initial center set to obtain a first cluster center and a second cluster center with the nearest cosine distances; fusing the first cluster center and the second cluster center to serve as new cluster centers, and calculating the total number of all cluster centers in the current initial center set; if the total number is K, stopping fusion processing on cluster centers to obtain a first center set; if the total number is not K, continuing to fuse to obtain a new cluster center.
In this embodiment, assume there are multiple Map functions, and each Map function performs cluster division on the log samples in one copy to obtain the K clusters of that copy and the cluster center of each cluster, so that an initial center set is formed from (number of Maps) × K cluster centers. The initial center set is recorded as the set list_cen, and the elements in list_cen are fused:
(1) The cosine distances between the cluster centers in the list_cen set are calculated pairwise, the two cluster centers nearest in cosine distance are found, and the two are fused. The fusion method is: given cluster center C1 and cluster center C2, the new fused cluster center is Cnew = (C1 + C2)/2;
(2) The two cluster centers just found and fused are deleted from the list_cen set, and the new fused cluster center is placed into the list_cen set. At this point the number of cluster centers in the list_cen set is (number of Maps) × K - 1.
Following this principle, the two cluster centers with the closest cosine distance are found in the list_cen set by pairwise calculation, deleted from the list_cen set, fused into a new cluster center and placed back into the list_cen set, and the process is repeated until the number of centers in the list_cen set reaches K. The K centers are finally output and recorded as the center set Kcenter.
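A minimal Python sketch of this fusion loop, assuming the cluster centers are vectors; the function name and list-based bookkeeping are assumptions of this sketch.

    import itertools
    import numpy as np

    def cosine_distance(a, b):
        return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    def fuse_centers(list_cen, k):
        """Repeatedly fuse the two cosine-nearest cluster centers as (C1 + C2) / 2
        until only K centers remain; the result corresponds to the set Kcenter."""
        centers = [np.asarray(c, dtype=float) for c in list_cen]
        while len(centers) > k:
            # Pairwise search for the two centers with the smallest cosine distance.
            i, j = min(itertools.combinations(range(len(centers)), 2),
                       key=lambda p: cosine_distance(centers[p[0]], centers[p[1]]))
            c_new = (centers[i] + centers[j]) / 2  # fusion rule from the text
            centers = [c for idx, c in enumerate(centers) if idx not in (i, j)]
            centers.append(c_new)
        return centers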
According to the embodiment of the application, the cluster centers in the copies corresponding to the log sample sets are fused into new cluster centers according to their cosine distances, so that the overall quality of the obtained cluster centers is improved, which further improves the selection quality of the subsequent initial centroids.
In some embodiments, the step S204 may be specifically implemented by the following steps: according to K centers in the log sample set, selecting a Kth log sample from the copies of the log sample set as a Kth element; acquiring a target cluster center corresponding to the Kth element from the first center set according to the Kth element; and calculating the cosine distance between the center of the target cluster and the Kth element to be used as the Kth minimum distance.
In this embodiment the processing is performed by a Map function. Its inputs are the central center set Kcenter output by the first-stage Reduce in the above embodiment and the K-center backup klist_bak1 of log sample set dataset1 (dataset1 and klist_bak1 are taken as examples; the remaining dataseti and klist_baki are processed along the same lines, as described below).
(1) The minimum distance between klist_bak1 and Kcenter is calculated by the following method:
An element is taken from klist_bak1 (i.e., a K-th element is selected from the copy of the log sample set), the cluster center nearest to this element is found in Kcenter (i.e., the cluster center in the first center set, namely Kcenter, with the smallest cosine distance to the K-th element is obtained), and the cosine distance dist1 between that cluster center and the element is calculated. At the same time, the center taken from Kcenter is marked with a "use" tag.
(2) The next element is fetched from klist_bak1, the nearest center of the Kcenter to the element is found, if the center has no use tag, the cosine distance dist2 of the two elements is calculated, and the center fetched from the Kcenter this time is marked with the use tag. If the use tag already exists in the center fetched from the Kcenter this time, the search is stopped.
Following this principle until all the elements of klist_bak1 have been taken out, K minimum distances disti are obtained.
Results of Map function output:
In case 1, every center found in Kcenter is different, i.e., the elements in klist_bak1 are in a one-to-one relationship with the centers in Kcenter, and the accumulated sum dist_sum1 of the K minimum distances is output.
In case 2, some center found in Kcenter repeats, i.e., the elements in klist_bak1 are not in a one-to-one relationship with the centers in Kcenter, and the output is null.
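A sketch of this Map step in Python; representing the "use" tags as a boolean list and returning None for the null output are assumptions of this sketch.

    import numpy as np

    def cosine_distance(a, b):
        return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    def map_min_distance_sum(klist_bak, kcenter):
        """Match each backed-up center to its nearest cluster center in Kcenter.
        Output the accumulated sum of the K minimum distances if the match is
        one-to-one (case 1), otherwise output null (case 2)."""
        used = [False] * len(kcenter)  # the "use" tags
        dist_sum = 0.0
        for element in klist_bak:
            nearest = min(range(len(kcenter)),
                          key=lambda i: cosine_distance(element, kcenter[i]))
            if used[nearest]:          # this center already carries a "use" tag
                return None            # case 2: not one-to-one, null output
            used[nearest] = True
            dist_sum += cosine_distance(element, kcenter[nearest])
        return dist_sum                # case 1: one-to-one correspondence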
In the embodiment of the application, the quality of each sample set is measured by calculating the minimum distances between the sample set and the central centroids, and log sample sets of different quality are given different grade labels, which improves the selection quality of the subsequent initial centroids and further improves the clustering effect of the Kmeans algorithm.
In some embodiments, the step S205 may be specifically implemented by the following steps: summing the K minimum distances of the log sample set to obtain a distance accumulation sum corresponding to the log sample set; the distance accumulation sums of all log sample sets are formed into a distance set, and the distance accumulation sums in the distance set are ordered according to the size of the distance accumulation sums; and determining the grade label of the log sample set corresponding to each distance accumulation sum according to the sorting order of the distance accumulation sums.
In this embodiment, the processing can be performed by a Reduce function:
(1) The input is the set list_dist_sum of the minimum-distance accumulated sums output by all Map functions.
(2) The elements in list_dist_sum are sorted in increasing order. The first element (i.e., the smallest accumulated sum) is taken, and the sample set dataseti to which this accumulated sum belongs is given a level 1 label. The second element (i.e., the second smallest accumulated sum) is then taken, and the sample set dataseti to which it belongs is given a level 2 label. The third and subsequent accumulated sums are given level 3 labels for the sample sets dataseti to which they belong.
Through the Reduce function processing, each dataseti is output with its label grade.
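A sketch of this Reduce step in Python; representing the input as (dataset id, accumulated sum) pairs is an assumption of this sketch.

    def reduce_grade_labels(list_dist_sum):
        """Sort the accumulated sums in increasing order and label the owning
        sample sets: smallest -> level 1, second -> level 2, the rest -> level 3."""
        ranked = sorted(list_dist_sum, key=lambda pair: pair[1])
        labels = {}
        for rank, (dataset_id, _) in enumerate(ranked):
            labels[dataset_id] = min(rank + 1, 3)  # levels 1, 2, then 3 for all others
        return labels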
According to the embodiment of the application, the log sample sets are ranked by their K minimum distances, so that the quality of each log sample set can be determined, and log samples are subsequently extracted from log sample sets of different quality in a targeted way, which improves the quality of the log samples used for selecting the initial centroids and thus ensures the quality of initial centroid selection.
In some embodiments, the step S206 may be specifically implemented by the following steps: extracting from each log sample set a number of log samples in the proportion corresponding to its level label; forming a high-quality sample set from the log samples extracted from all log sample sets, and extracting log samples in a target proportion from the high-quality sample set to form the sample data set.
In this embodiment, for sampling the dataseti samples from stage one: if the class label of dataseti is 1, 80% of the log samples in log sample set dataseti are randomly extracted as input; if the class label of dataseti is 2, 60% of the log samples of log sample set dataseti are randomly extracted as input; and if the class label of dataseti is 3, 40% of the log samples of log sample set dataseti are randomly extracted as input. Finally, the extracted samples form a high-quality sample set list_g, and a sample data set list_g1 is randomly extracted from list_g (30% of list_g, i.e., the target proportion here is 30%).
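A sketch of this grade-dependent sampling in Python, using the example ratios above; the dictionary-based data layout and function name are assumptions of this sketch.

    import random

    def build_sample_dataset(datasets, grade_labels, target_ratio=0.3):
        """Draw a grade-dependent fraction of each log sample set into list_g,
        then randomly draw target_ratio of list_g as the sample data set list_g1."""
        ratios = {1: 0.8, 2: 0.6, 3: 0.4}  # extraction ratios from the example above
        list_g = []
        for dataset_id, samples in datasets.items():
            share = int(len(samples) * ratios[grade_labels[dataset_id]])
            list_g.extend(random.sample(samples, share))
        list_g1 = random.sample(list_g, int(len(list_g) * target_ratio))
        return list_g, list_g1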
According to the embodiment of the application, the quality of each log sample set is determined from its grade label, so that samples are extracted in different proportions to form the high-quality sample set from which the initial centroids are selected. This can improve the accuracy of initial centroid selection and the clustering effect of the subsequent Kmeans clustering algorithm.
In some embodiments, the step S207 may be specifically implemented by the following steps: randomly selecting a sample centroid adjacent point from a sample data set, deleting a first log sample in the sample data set, wherein the cosine distance between the first log sample and the sample centroid adjacent point is smaller than a preset second threshold value, and acquiring the center of the first log sample as a first centroid; acquiring a farthest sample centroid adjacent point with the farthest center distance from the sample data set, deleting a second log sample with the cosine distance smaller than a preset second threshold value from the farthest sample centroid adjacent point in the sample data set, and acquiring the center of the second log sample as a second centroid; and determining K centroids from the sample data set according to the distances between the residual log samples in the sample data set and the first centroid and the second centroid.
In this embodiment, a sample centroid neighboring point Cen1 is randomly selected from the set list_g1, the first log samples in list_g1 whose cosine distance to Cen1 is smaller than a preset second threshold T2 are deleted, and the center of these first log samples is recorded as centroid Cz1. The sample centroid neighboring point Cen2 farthest from Cz1 is then found in the set list_g1, the second log samples in list_g1 whose cosine distance to Cen2 is smaller than the preset second threshold T2 are deleted, and the center of these second log samples is recorded as centroid Cz2.
Further, in other embodiments, the K centroids may be determined as follows: multiplying together the distances between each remaining log sample and all selected centroids to obtain the distance product of that log sample, forming a distance-product set from the distance products, and selecting the maximum distance product from the set; taking the log sample corresponding to the maximum distance product as the next new centroid neighboring point, deleting the third log samples in the sample data set whose cosine distance to the new centroid neighboring point is smaller than the preset second threshold, and obtaining the center of these third log samples as a third centroid; and determining whether the number of centroids obtained from the sample data set is K, and if not, continuing to select a new centroid neighboring point and determining a new centroid from it.
In this embodiment, the method specifically includes the following steps:
step 1: the distances between the rest sample data points in list_g1 and all selected centroids are calculated one by one, the distance products between the log samples and all centroids are recorded as the distance products of the log samples, the distance products of all log samples form a distance product set d, then the distance product with the largest value is selected from the set d, log sample data corresponding to the distance product is used as the next centroid adjacent point, and meanwhile, the log samples around the newly selected centroid adjacent point are deleted, and the centers of the log samples are used as centroids. Wherein, the distance product formula expresses:
Distance product set d: list { (distance of node v from centroid Cz1 (distance of node v from centroid Cz 2) }, where centroid Czk is the currently selected centroid.
In this embodiment, step 1 is repeated until K centroids are found. If list_g1 is exhausted before K centroids are found, the remaining required centroids are randomly extracted from (list_g - list_g1). Finally, Kmeans clustering is performed with the K centroids obtained.
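The whole initial-centroid selection (neighborhood deletion plus the maximum-distance-product strategy, with the random fallback from list_g - list_g1) can be sketched in Python as follows; the helper names and the mean-based neighborhood center are assumptions of this sketch.

    import random
    import numpy as np

    def cosine_distance(a, b):
        return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    def remove_neighborhood(pool, point, t2):
        """Delete samples within cosine distance t2 of point; return the survivors
        and the center (mean) of the deleted neighborhood."""
        near = [s for s in pool if cosine_distance(s, point) < t2]
        rest = [s for s in pool if cosine_distance(s, point) >= t2]
        center = np.mean(near, axis=0) if near else np.asarray(point, dtype=float)
        return rest, center

    def select_initial_centroids(list_g, list_g1, k, t2):
        """Select K initial centroids from list_g1; fall back to random draws
        from list_g - list_g1 if list_g1 is exhausted first."""
        pool = [np.asarray(s, dtype=float) for s in list_g1]
        cen1 = random.choice(pool)                      # first centroid neighbor point
        pool, cz1 = remove_neighborhood(pool, cen1, t2)
        centroids = [cz1]
        if pool:                                        # farthest point from Cz1 -> Cz2
            cen2 = max(pool, key=lambda s: cosine_distance(s, cz1))
            pool, cz2 = remove_neighborhood(pool, cen2, t2)
            centroids.append(cz2)
        while len(centroids) < k and pool:
            # Maximum distance product: the sample whose product of distances
            # to all selected centroids is largest becomes the next neighbor point.
            nxt = max(pool, key=lambda s: np.prod([cosine_distance(s, c)
                                                   for c in centroids]))
            pool, cz = remove_neighborhood(pool, nxt, t2)
            centroids.append(cz)
        if len(centroids) < k:                          # list_g1 exhausted: random top-up
            leftover = [s for s in list_g
                        if not any(np.array_equal(s, g) for g in list_g1)]
            centroids.extend(random.sample(leftover, k - len(centroids)))
        return centroids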
The embodiment of the application isolates each centroid by deleting the neighboring points around it, and at the same time reasonably disperses the selected centroid positions by combining a maximum-distance-product strategy with a virtual centroid, thereby accelerating initialization convergence and improving the quality of initial centroid selection.
The overall flow of the kmeans log classification method based on distributed sample screening provided by the embodiment of the application comprises a Map function processing stage and a Reduce function processing stage; a plurality of Map functions can be configured, and each Map function performs the same processing on one log sample set in parallel. The method can be divided into two stages of MapReduce processing. Fig. 4 is a schematic diagram of the first-stage MapReduce processing flow of the kmeans log classification method based on distributed sample screening provided by the embodiment of the present application. As shown in fig. 4, the first-stage MapReduce processing is divided into Map functions and a Reduce function; the Map function processing stage includes steps S4011, S4012 and S401i (i takes values 1 to N), and the Reduce function processing stage includes step S402.
Fig. 5 is a schematic diagram of the second-stage MapReduce processing flow of the kmeans log classification method based on distributed sample screening provided by the embodiment of the present application. As shown in fig. 5, the second-stage MapReduce processing is divided into Map functions and a Reduce function; the Map function processing stage includes steps S5011, S5012 and S501i (i takes values 1 to N), and the Reduce function processing stage includes step S502.
Fig. 6 is a schematic diagram of the initial centroid obtaining flow of the kmeans log classification method based on distributed sample screening provided by the embodiment of the present application. As shown in fig. 6, in step S601, log samples are extracted in different proportions from the stage-one dataseti log sample sets according to their class labels, and all the extracted log samples form a high-quality sample set list_g; a sample data set list_g1 is randomly extracted from list_g. In step S602, a sample centroid neighboring point Cen1 is randomly selected from the sample data set list_g1, the log samples in list_g1 whose cosine distance to Cen1 is smaller than the threshold T2 are deleted, and their center is recorded as centroid Cz1; then the sample centroid neighboring point Cen2 farthest from Cz1 is found in list_g1, the log samples in list_g1 whose cosine distance to Cen2 is smaller than the threshold T2 are deleted, and their center is recorded as centroid Cz2. In step S603, the distances between the remaining log samples in list_g1 and all selected centroids are calculated one by one, the product of the distances between a log sample and all centroids is recorded as the distance product of that log sample, the distance products of all log samples form a distance-product set d, the distance product with the largest value is selected from the set d, the log sample corresponding to it is taken as the next centroid neighboring point, the log samples around the newly selected centroid neighboring point are deleted, and their center is taken as a centroid. Step S604 determines whether the number of centroids is K. Step S605 determines whether the sample data set list_g1 is empty. In step S606, the remaining required centroids are randomly extracted from list_g - list_g1. Step S607 outputs the initial centroids.
According to the embodiment of the application, the initialization of massive log samples is parallelized in a distributed way, which reduces the initialization time, and the initialization stage of the kmeans algorithm is improved. During initialization, the centroids of all sample sets are fused into a central centroid representing the whole sample population, and the quality of each sample set is measured by calculating its distance to the central centroid. Samples are then extracted in proportions that depend on the quality grade to form a high-quality sample set for selecting the initial centroids. During centroid selection, each centroid is isolated by deleting the neighboring points around it, and the selected centroid positions are reasonably dispersed by combining a maximum-distance-product strategy with a virtual centroid. This accelerates initialization convergence and improves the quality of initial centroid selection.
The following are examples of the apparatus of the present application that may be used to perform the method embodiments of the present application. For details not disclosed in the embodiments of the apparatus of the present application, please refer to the embodiments of the method of the present application.
Fig. 7 is a schematic structural diagram of a kmeans log classification device based on distributed sample screening according to an embodiment of the present application. As shown in fig. 7, the log classification device 700 includes an obtaining module 710, a center determining module 720, a center set module 730, a distance calculating module 740, a label determining module 750, a data set composition module 760, and a clustering module 770. The obtaining module 710 is configured to obtain N log sample sets and the copy corresponding to each log sample set and determine K centers in each log sample set, where each log sample set includes at least one log sample, the copy is identical to the log samples in the corresponding log sample set, and N and K are positive integers. The center determining module 720 is configured to cluster-divide the log samples in the copy of each log sample set according to the K centers in that log sample set, to obtain K clusters of the copy and the cluster center of each cluster. The center set module 730 is configured to form an initial center set from the cluster centers of all clusters in the copies, and to fuse the cluster centers according to the cosine distances between cluster centers in the initial center set until a preset fusion ending condition is met, so as to obtain a first center set. The distance calculating module 740 is configured to calculate K minimum distances of a log sample set according to the cosine distances between the cluster centers in the first center set and the K centers in the log sample set. The label determining module 750 is configured to determine the grade label of the log sample set according to the K minimum distances of the log sample set. The data set composition module 760 is configured to extract a target number of log samples from all log sample sets according to the grade label of each log sample set, to compose a sample data set. The clustering module 770 is configured to determine K centroids from the sample data set for Kmeans clustering.
Optionally, the acquiring module may specifically be configured to: acquiring an initial log set, wherein the initial log set comprises at least one log sample; equally dividing the initial log set to obtain N log sample sets; backing up each log sample set to obtain a copy of each log sample set; randomly selecting a first target sample from the log sample set, and deleting a first log sample of which the cosine distance from the first target sample in the log sample set is smaller than a first preset threshold value; acquiring the center of the deleted first log sample as a first center; randomly selecting a second target sample from the log sample set, and deleting the second log sample of which the cosine distance from the second target sample in the log sample set is smaller than a first preset threshold value; acquiring the center of the deleted second log sample as a second center; randomly selecting a Kth target sample from the log sample set, and deleting the Kth log sample of which the cosine distance from the Kth target sample in the log sample set is smaller than a first preset threshold value; and acquiring the center of the deleted Kth log sample as the Kth center.
Optionally, the center determining module may specifically be configured to: calculating a cosine distance between each log sample in the copy and each center; dividing each log sample and the center to which its cosine distance is smallest into the same cluster, to obtain K clusters and the cluster center of each cluster.
Optionally, the center set module may specifically be configured to: comparing the cosine distances of the cluster centers of all clusters in the initial center set to obtain a first cluster center and a second cluster center with the nearest cosine distance; fusing the first cluster center and the second cluster center into a new cluster center, and calculating the total number of all cluster centers in the current initial center set; if the total number is K, stopping the fusion processing of cluster centers to obtain the first center set; if the total number is not K, continuing to fuse to obtain new cluster centers.
Optionally, the center set module may specifically be configured to: calculating the new cluster center as Cnew = (C1 + C2)/2, where Cnew is the new cluster center, C1 is the first cluster center, and C2 is the second cluster center.
Optionally, the distance calculating module may specifically be configured to: according to K centers in the log sample set, selecting a Kth log sample from the copies of the log sample set as a Kth element; acquiring a target cluster center corresponding to the Kth element from the first center set according to the Kth element; and calculating the cosine distance between the center of the target cluster and the Kth element to be used as the Kth minimum distance.
Optionally, the distance calculating module may specifically be configured to: acquiring the cluster center in the first center set closest in cosine distance to the K-th element as the target cluster center.
Optionally, the tag determination module may specifically be configured to: summing the K minimum distances of the log sample set to obtain a distance accumulation sum corresponding to the log sample set; the distance accumulation sums of all log sample sets are formed into a distance set, and the distance accumulation sums in the distance set are ordered according to the size of the distance accumulation sums; and determining the grade label of the log sample set corresponding to each distance accumulation sum according to the sorting order of the distance accumulation sums.
Optionally, the data set composition module may specifically be configured to: extracting from each log sample set a number of log samples in the proportion corresponding to its level label; forming a high-quality sample set from the log samples extracted from all log sample sets, and extracting log samples in a target proportion from the high-quality sample set to form the sample data set.
Alternatively, the clustering module may specifically be configured to: randomly selecting a sample centroid adjacent point from a sample data set, deleting a first log sample in the sample data set, wherein the cosine distance between the first log sample and the sample centroid adjacent point is smaller than a preset second threshold value, and acquiring the center of the first log sample as a first centroid; acquiring a farthest sample centroid adjacent point with the farthest center distance from the sample data set, deleting a second log sample with the cosine distance smaller than a preset second threshold value from the farthest sample centroid adjacent point in the sample data set, and acquiring the center of the second log sample as a second centroid; and determining K centroids from the sample data set according to the distances between the residual log samples in the sample data set and the first centroid and the second centroid.
Optionally, the clustering module may specifically be configured to: multiply the distances between each remaining log sample and all centroids to obtain the distance product of the log sample, form the distance products of all log samples into a distance set, and select the maximum distance product from the distance set; take the log sample corresponding to the maximum distance product as the next new centroid adjacent point, delete each third log sample whose cosine distance to the new centroid adjacent point is smaller than the preset second threshold value, and acquire the center of the deleted third log samples as a third centroid; and determine whether the number of centroids obtained from the sample data set is K, and if not, continue selecting new centroid adjacent points and determining new centroids from them.
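The two clustering-module configurations above might be sketched together as follows; taking the mean of each deleted neighbourhood as the centroid, and selecting the farthest adjacent point relative to the first centroid, are interpretations of the description rather than details it fixes:

```python
def prune_and_centroid(remaining, point, radius):
    # Delete every sample within the cosine radius (the second threshold)
    # of `point`; the mean of the deleted neighbourhood becomes a centroid.
    near = [s for s in remaining if cosine_distance(s, point) < radius]
    rest = [s for s in remaining if cosine_distance(s, point) >= radius]
    centroid = np.mean(near, axis=0) if near else point
    return centroid, rest

def select_k_centroids(dataset, k, radius):
    remaining = [np.asarray(s, dtype=float) for s in dataset]
    # First centroid: prune around a randomly chosen adjacent point.
    c, remaining = prune_and_centroid(remaining, random.choice(remaining), radius)
    centroids = [c]
    if len(centroids) < k and remaining:
        # Second centroid: prune around the sample farthest from the first.
        far_point = max(remaining, key=lambda s: cosine_distance(s, centroids[0]))
        c, remaining = prune_and_centroid(remaining, far_point, radius)
        centroids.append(c)
    while len(centroids) < k and remaining:
        # Next adjacent point: largest product of distances to all centroids.
        nxt = max(remaining,
                  key=lambda s: np.prod([cosine_distance(s, c) for c in centroids]))
        c, remaining = prune_and_centroid(remaining, nxt, radius)
        centroids.append(c)
    return centroids
```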
The device provided by the embodiments of the application can be used to execute the methods of the above embodiments; the implementation principle and technical effects are similar and are not repeated here.
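For orientation only, a hypothetical end-to-end run wiring the sketches above together might look like this; N, K, the ratios and the radius are illustrative values, and fusing the cluster centers of all N copies into one initial center set is one reading of the description:

```python
rng = np.random.default_rng(0)
sample_sets = {f"set{i}": rng.random((50, 16)) for i in range(3)}   # N = 3 sets
K = 4
center_indices, all_cluster_centers = {}, []
for name, samples in sample_sets.items():
    idx = rng.choice(len(samples), size=K, replace=False)           # K centers
    center_indices[name] = idx
    _, cc = assign_clusters(samples, samples[idx])                  # cluster the copy
    all_cluster_centers.extend(cc)
first_center_set = fuse_centers(all_cluster_centers, K)             # N*K -> K
min_dists = {name: k_minimum_distances(sample_sets[name], idx, first_center_set)
             for name, idx in center_indices.items()}
labels = grade_labels(min_dists)
dataset = build_sample_dataset(sample_sets, labels,
                               {0: 0.8, 1: 0.5, 2: 0.2}, target_ratio=0.5)
centroids = select_k_centroids(dataset, K, radius=0.05)
```

Standard Kmeans iterations would then start from these K centroids as initial seeds.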
It should be noted that the division of the above apparatus into modules is merely a division by logical function; in practice the modules may be fully or partially integrated into one physical entity, or kept physically separate. The modules may all be implemented as software invoked by a processing element, all in hardware, or partly as software invoked by a processing element and partly in hardware. For example, the acquisition module may be a separately established processing element, may be integrated into a chip of the above apparatus, or may be stored in the memory of the above apparatus as program code that a processing element of the apparatus calls to execute the module's function. The other modules are implemented similarly. In addition, the modules may be wholly or partly integrated, or implemented independently. The processing element here may be an integrated circuit with signal-processing capability. In implementation, each step of the above method, or each of the above modules, may be completed by an integrated logic circuit of hardware in a processor element or by instructions in software form.
Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in Fig. 8, the electronic device 800 includes at least one processor 801, a memory 802, a bus 803 and a communication interface 804, where the processor, the communication interface and the memory communicate with one another via the bus. The communication interface is used to communicate with other devices and includes a communication interface for data transmission and a display or operation interface for human-computer interaction. The processor executes the computer-executable instructions stored in the memory and may specifically perform the relevant steps of the methods described in the above embodiments.
The processor may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present application. The one or more processors included in the electronic device may be of the same type, such as one or more CPUs, or of different types, such as one or more CPUs together with one or more ASICs.
The memory is used to store the computer-executable instructions. The memory may comprise high-speed RAM and may also comprise non-volatile memory, such as at least one disk memory.
The present embodiment also provides a computer-readable storage medium having computer instructions stored therein which, when executed by at least one processor of an electronic device, cause the electronic device to perform the methods provided by the various embodiments described above.
The present embodiment also provides a computer program product comprising computer instructions stored in a readable storage medium. At least one processor of an electronic device may read the computer instructions from the readable storage medium, and execution of the computer instructions by the at least one processor causes the electronic device to implement the methods provided by the various embodiments described above.
In the present application, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes an association between objects and indicates three possible relationships; for example, "A and/or B" may mean: A alone, both A and B, or B alone, where A and B may each be singular or plural. The character "/" generally indicates an "or" relationship between the associated objects; in a formula, it indicates a "division" relationship. "At least one of" and similar expressions mean any combination of the listed items, including a single item or any combination of plural items. For example, "at least one of a, b, or c" may represent: a; b; c; a and b; a and c; b and c; or a, b and c, where a, b and c may each be single or multiple.
It will be appreciated that the various numerals referred to in the embodiments of the present application are merely for ease of description and are not intended to limit the scope of the embodiments. In the embodiments of the present application, the sequence numbers of the processes do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not limit the implementation of the embodiments in any way.
Other embodiments of the application will be apparent to those skilled in the art from consideration of the specification and practice of the application disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application that follow its general principles, including such departures from the present disclosure as come within known or customary practice in the art to which the application pertains. The specification and examples are to be considered exemplary only, with the true scope and spirit of the application being indicated by the following claims.
It is to be understood that the application is not limited to the precise arrangements and instrumentalities described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the application is limited only by the appended claims.

Claims (13)

1. A Kmeans log classification method based on distributed sample screening, characterized by comprising the following steps:
obtaining N log sample sets and corresponding copies of each log sample set, and determining K centers in each log sample set, wherein each log sample set comprises at least one log sample, the copies are identical to the log samples in the corresponding log sample sets, and N and K are positive integers;
according to K centers in each log sample set, carrying out cluster division on log samples in a copy of the log sample set to obtain K clusters of the copy and cluster centers of each cluster;
forming an initial center set by cluster centers of all clusters in the copy, and carrying out fusion processing on the cluster centers according to cosine distances among the cluster centers in the initial center set until a preset fusion ending condition is met to obtain a first center set;
calculating K minimum distances of the log sample set according to the cosine distances between the cluster centers in the first center set and the K centers in the log sample set;
determining a grade label of the log sample set according to K minimum distances of the log sample set;
extracting a target number of log samples from all log sample sets according to the grade label of each log sample set to form a sample data set;
and determining K centroids from the sample data set, and carrying out Kmeans clustering.
2. The method of claim 1, wherein the performing fusion processing on the cluster centers according to the cosine distances between the cluster centers in the initial center set until a preset fusion ending condition is met, to obtain the first center set, comprises:
comparing the cosine distances between the cluster centers in the initial center set to obtain a first cluster center and a second cluster center with the smallest cosine distance;
fusing the first cluster center and the second cluster center into a new cluster center, and calculating the total number of cluster centers in the current initial center set;
if the total number is K, stopping fusion processing on cluster centers to obtain the first center set;
and if the total number is not K, continuing to fuse to obtain a new cluster center.
3. The method of claim 2, wherein the fusing the first cluster center and the second cluster center as a new cluster center comprises:
Cnew=(C1+C2)/2
in the above formula, Cnew is the new cluster center, C1 is the first cluster center, and C2 is the second cluster center.
4. The method according to claim 1, wherein the calculating K minimum distances of the log sample set according to the cosine distances between the cluster centers in the first center set and the K centers in the log sample set comprises:
According to K centers in the log sample set, selecting a Kth log sample from the copies of the log sample set as a Kth element;
acquiring a target cluster center corresponding to the Kth element from the first center set according to the Kth element;
and calculating the cosine distance between the center of the target cluster and the Kth element to be used as the Kth minimum distance.
5. The method of claim 4, wherein the obtaining, from the first center set, a target cluster center corresponding to the kth element according to the kth element comprises:
and acquiring the cluster center in the first center set with the smallest cosine distance to the Kth element as the target cluster center.
6. The method of claim 1, wherein the determining the grade label of the log sample set according to the K minimum distances of the log sample set comprises:
summing the K minimum distances of the log sample set to obtain a distance accumulation sum corresponding to the log sample set;
the distance accumulation sums of all log sample sets are formed into a distance set, and the distance accumulation sums in the distance set are ordered according to the size of the distance accumulation sums;
And determining the grade label of the log sample set corresponding to each distance accumulation sum according to the sorting order of the distance accumulation sums.
7. The method according to claim 1, wherein the extracting a target number of log samples from all log sample sets according to the grade label of each log sample set to form a sample data set comprises:
extracting from the log sample set a number of log samples proportional to its grade label;
and forming the log samples extracted from all log sample sets into a high-quality sample set, and extracting log samples with a target proportion from the high-quality sample set to form the sample data set.
8. The method of claim 1, wherein the determining K centroids from the sample dataset comprises:
randomly selecting a sample centroid adjacent point from the sample data set, deleting each first log sample in the sample data set whose cosine distance to the sample centroid adjacent point is smaller than a preset second threshold value, and acquiring the center of the deleted first log samples as a first centroid;
acquiring a farthest sample centroid adjacent point, namely the sample in the sample data set farthest from the first centroid, deleting each second log sample whose cosine distance to the farthest sample centroid adjacent point is smaller than the preset second threshold value, and acquiring the center of the deleted second log samples as a second centroid;
And determining K centroids from the sample data set according to the distances between the remaining log samples in the sample data set and the first centroid and the second centroid.
9. The method of claim 8, wherein determining K centroids from the sample dataset based on the distances of the remaining log samples in the sample dataset from the first and second centroids, comprises:
multiplying the distances between each remaining log sample and all centroids to obtain the distance product of the log sample;
forming the distance products of all log samples into a distance set, and selecting a maximum distance product from the distance set;
taking the log sample corresponding to the maximum distance product as a next new centroid adjacent point, deleting a third log sample with the cosine distance from the new centroid adjacent point in the sample data set smaller than a preset second threshold value, and acquiring the center of the third log sample as a third centroid;
and determining whether the number of the centroids in the sample data set is K, if not, continuing to select a new centroid adjacent point and determining a new centroid according to the new centroid adjacent point.
10. A Kmeans log classification device based on distributed sample screening, characterized by comprising:
the acquisition module, used for acquiring N log sample sets and a copy corresponding to each log sample set, and determining K centers in each log sample set, wherein each log sample set comprises at least one log sample, the copy is identical to the log samples in the corresponding log sample set, and N and K are positive integers;
the center determining module is used for carrying out cluster division on the log samples in the copies of the log sample sets according to the K centers in each log sample set to obtain K clusters of the copies and cluster centers of each cluster;
the center set module is used for forming cluster centers of all clusters in the copy into an initial center set, and carrying out fusion processing on the cluster centers according to cosine distances among the cluster centers in the initial center set until a preset fusion ending condition is met to obtain a first center set;
the distance calculation module is used for calculating K minimum distances of the log sample set according to cosine distances between cluster centers in the first center set and K centers in the log sample set;
the label determining module is used for determining the grade label of the log sample set according to the K minimum distances of the log sample set;
the data set composition module is used for extracting a target number of log samples from all log sample sets according to the grade label of each log sample set to form a sample data set;
and the clustering module is used for determining K centroids from the sample data set and carrying out Kmeans clustering.
11. An electronic device, comprising: a processor, and a memory communicatively coupled to the processor;
the memory stores computer-executable instructions;
the processor executes computer-executable instructions stored in the memory to implement the method of any one of claims 1 to 9.
12. A computer readable storage medium having stored therein computer executable instructions which when executed by a processor are adapted to carry out the method of any one of claims 1 to 9.
13. A computer program product comprising computer instructions which, when executed by a processor, implement the method according to any one of claims 1 to 9.
CN202310721373.7A 2023-06-16 2023-06-16 Kmeans log classification method and device based on distributed sample screening Pending CN116610987A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310721373.7A CN116610987A (en) 2023-06-16 2023-06-16 Kmeans log classification method and device based on distributed sample screening

Publications (1)

Publication Number Publication Date
CN116610987A true CN116610987A (en) 2023-08-18

Family

ID=87676542

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310721373.7A Pending CN116610987A (en) 2023-06-16 2023-06-16 Kmeans log classification method and device based on distributed sample screening

Country Status (1)

Country Link
CN (1) CN116610987A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117671168A (en) * 2023-10-20 2024-03-08 湖南防灾科技有限公司 Terrain aggregation method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination