CN113449098A

CN113449098A - Log clustering method, device, equipment and storage medium

Info

Publication number: CN113449098A
Application number: CN202010219766.4A
Authority: CN
Inventors: 田吉华; 尤建; 张智勇
Original assignee: China Mobile Communications Group Co Ltd; China Mobile Shanghai ICT Co Ltd
Current assignee: China Mobile Communications Group Co Ltd; China Mobile Shanghai ICT Co Ltd
Priority date: 2020-03-25
Filing date: 2020-03-25
Publication date: 2021-09-28
Anticipated expiration: 2040-03-25
Also published as: CN113449098B

Abstract

Embodiments of the present invention provide a log clustering method, apparatus, device, and storage medium. The method includes: acquiring first log data, where the first log data includes a link tracking code (TID) and a log description; classifying the first log data based on the TID to obtain log data of multiple TID categories; and clustering, according to the K-means clustering algorithm and the edit distance algorithm, the text information corresponding to the log descriptions in the log data of the multiple TID categories to obtain a clustering result of the first log data. The invention can mine in depth, from massive logs, effective information that helps operation and maintenance personnel with detection, thereby compensating for the error produced by a vectorization model with a single structure and saving clustering time while improving the accuracy of the clustering result.

Description

Log clustering method, device, equipment and storage medium

Technical Field

The present invention relates to the field of log processing, and in particular, to a log clustering method, apparatus, device, and storage medium.

Background

With the development of internet platforms, the range and depth of internet applications continue to expand. When an application program encounters a fault, a log is generated that contains service state information such as current memory usage and Central Processing Unit (CPU) utilization. Faced with massive log information, conventional clustering is usually adopted to classify the logs and mine effective information from them to obtain a clustering result, and operation and maintenance personnel can trace system faults and debug and maintain the system accordingly by analyzing the clustering result.

In conventional clustering, a word segmentation method is usually adopted to perform word segmentation on logs, for example, a space included in a log is used to perform word segmentation on the log to obtain a log including a plurality of words, the similarity of the two logs is evaluated according to the number of the same words in the two logs, and each log is clustered based on the similarity between the logs to obtain a clustering result. However, the word segmentation method described above will reduce the relevance between log contents and enhance the independence between segmented words, so that structural information containing position relevance in the log cannot be distinguished during clustering, which results in lost word positions and ambiguity, for example: when clustering is performed on 'i take your things' and 'i take my things', both sentences can be classified into one category by neglecting structural information, so that errors occur when clustering results are generated.

In order to solve the above problems, the prior art is continuously improved on the log content vectorization and algorithm, and a vectorization model (or vectorization template) capable of performing structured processing on log information during clustering is provided. However, when one type of similar logs contains multiple structures, the vectorization model has a single structure, and multiple types of templates need to be generated when clustering is performed on the multiple structures, so that clustering time is increased, and the accuracy of clustering results is low.

Disclosure of Invention

The embodiments of the invention provide a log clustering method, apparatus, device, and storage medium. Logs are clustered and analyzed based on the link tracking code (TID) generated in the logs and on several clustering algorithms, so that effective information that helps operation and maintenance personnel with detection can be mined in depth from massive logs. This compensates for the errors produced by a vectorization model with a single structure, improving the accuracy of the clustering result while saving clustering time.

In a first aspect, a method for clustering logs is provided, where the method includes: acquiring first log data, wherein the first log data comprises a link tracking code TID and a log description; classifying the first log data based on the TID to obtain log data of a plurality of TID categories; and clustering text information corresponding to log descriptions in the log data of the TID categories according to a K-means clustering algorithm and an editing distance algorithm to obtain a clustering result of the first log data.

In some implementations of the first aspect, obtaining the first log data includes: acquiring second log data; and removing the semi-structured data or the log data with missing content in the second log data to obtain the first log data.

In some implementations of the first aspect, after acquiring the first log data, further comprising: cleaning the first log data by using a regular expression; and determining the TID and the log description in the first log data according to the cleaned first log data.

In some implementations of the first aspect, clustering the text information corresponding to the log descriptions in the log data of the plurality of TID categories according to the K-means clustering algorithm and the edit distance algorithm to obtain the clustering result of the first log data includes: vectorizing the TID and the log description in the log data of the plurality of TID categories respectively to obtain a plurality of feature dimensions; selecting the log description among the plurality of feature dimensions, and performing term frequency-inverse document frequency (TF-IDF) numerical conversion on the text information corresponding to the log descriptions in the log data of the plurality of TID categories to obtain a plurality of high-dimensional vectors; performing dimensionality reduction on the high-dimensional vectors by principal component analysis (PCA) to obtain low-dimensional vectors; and clustering the low-dimensional vectors according to the K-means clustering algorithm and the edit distance algorithm to obtain the clustering result of the first log data.

In some implementations of the first aspect, the method further includes: evaluating the clustering result according to clustering algorithm evaluation indexes to obtain an evaluation result, where the clustering algorithm evaluation indexes include the silhouette coefficient, the Calinski-Harabasz index, and the Davies-Bouldin index.

In some implementations of the first aspect, the first log data further includes: application system name, project name, host address, and log content.

In a second aspect, an apparatus for clustering logs is provided, the apparatus including: the acquisition module is used for acquiring first log data, and the first log data comprises a link tracking code TID and a log description; the classification module is used for classifying the first log data based on the TID to obtain log data of a plurality of TID categories; and the clustering module is used for clustering the text information corresponding to the log description in the log data of the TID categories according to a K-means clustering algorithm and an editing distance algorithm to obtain a clustering result of the first log data.

In some implementations of the second aspect, the obtaining module is specifically configured to: acquiring second log data; and removing the semi-structured data or the log data with missing content in the second log data to obtain the first log data.

In some implementations of the second aspect, the apparatus further includes a determining module configured to, after the first log data is obtained, clean the first log data by using a regular expression, and determine the TID and the log description in the first log data according to the cleaned first log data.

In some implementations of the second aspect, the clustering module is specifically configured to: vectorize the TID and the log description in the log data of the plurality of TID categories respectively to obtain a plurality of feature dimensions; select the log description among the plurality of feature dimensions, and perform term frequency-inverse document frequency (TF-IDF) numerical conversion on the text information corresponding to the log descriptions in the log data of the plurality of TID categories to obtain a plurality of high-dimensional vectors; perform dimensionality reduction on the high-dimensional vectors by principal component analysis (PCA) to obtain low-dimensional vectors; and cluster the low-dimensional vectors according to the K-means clustering algorithm and the edit distance algorithm to obtain the clustering result of the first log data.

In some implementations of the second aspect, the apparatus further includes an evaluation module configured to evaluate the clustering result according to clustering algorithm evaluation indexes to obtain an evaluation result, where the clustering algorithm evaluation indexes include the silhouette coefficient, the Calinski-Harabasz index, and the Davies-Bouldin index.

In some implementations of the second aspect, the first log data further includes: application system name, project name, host address, and log content.

In a third aspect, a log clustering device is provided, where the device includes: a processor and a memory storing computer program instructions; the processor, when executing the computer program instructions, implements the method of clustering logs of the first aspect or some realizations of the first aspect.

In a fourth aspect, a computer-readable storage medium is provided, on which computer program instructions are stored, which, when executed by a processor, implement the method for clustering logs of the first aspect or some realizations of the first aspect.

The invention relates to the technical field of log processing, in particular to a log clustering method, a device, equipment and a storage medium.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required to be used in the embodiments of the present invention will be briefly described below, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

Fig. 1 is a schematic flowchart of a log clustering method according to an embodiment of the present invention;

FIG. 2 is a diagram illustrating evaluation of clustering results including PCA dimensionality reduction according to an embodiment of the present invention;

FIG. 3 is a diagram illustrating evaluation of clustering results without PCA dimensionality reduction provided by the embodiment of the present invention;

FIG. 4 is a distance threshold versus runtime line graph provided by an embodiment of the present invention;

FIG. 5 is a line graph of distance threshold versus cluster number provided by an embodiment of the present invention;

FIG. 6 is a flowchart illustrating another log clustering method according to an embodiment of the present invention;

FIG. 7 is a silhouette coefficient evaluation index graph based on TID clustering according to an embodiment of the present invention;

FIG. 8 is a CHI index evaluation index graph based on TID clustering provided by the embodiment of the invention;

FIG. 9 is a DBI index evaluation index graph based on TID clustering according to an embodiment of the present invention;

FIG. 10 is a CHI index evaluation index graph based on a conventional hierarchical clustering algorithm according to an embodiment of the present invention;

FIG. 11 is a CHI index evaluation index graph based on TID clustering provided by the embodiment of the invention;

FIG. 12 is a schematic structural diagram of a log clustering apparatus according to an embodiment of the present invention;

FIG. 13 is a schematic structural diagram of a log clustering device according to an embodiment of the present invention.

Detailed Description

Features and exemplary embodiments of various aspects of the present invention will be described in detail below, and in order to make objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not to be construed as limiting the invention. It will be apparent to one skilled in the art that the present invention may be practiced without some of these specific details. The following description of the embodiments is merely intended to provide a better understanding of the present invention by illustrating examples of the present invention.

It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The term "and/or" herein merely describes an association between associated objects and means that three relationships may exist. For example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone.

In order to solve the problems that structural information containing position correlation in logs cannot be judged, and when one type of similar logs contain multiple structures, the clustering time is long and the accuracy of clustering results is low due to the fact that a vectorization model is single in structure, the embodiment of the invention provides a log clustering method, a log clustering device, log clustering equipment and a computer readable storage medium.

The technical solutions of the embodiments of the present invention are described below with reference to the accompanying drawings.

Fig. 1 is a schematic flowchart of a log clustering method provided in an embodiment of the present invention, and as shown in fig. 1, an execution subject of the method is a device for log clustering, and the log clustering method may include the following steps:

S101, first log data are acquired.

First, second log data is obtained, wherein the second log data is a large amount of original log data generated by daily work of the system, and the original log data can include normal log data generated when the system normally runs and abnormal log data generated when an application program in the system fails.

The original log data usually contains semi-structured data with incomplete structural information, or log data with missing content; for example, some logs lose their link tracking code (Trace ID, TID) during generation. Sample denoising is therefore performed on the original log data: sample noise points with missing information, that is, the semi-structured data and the log data with missing content, are removed to obtain first log data containing structured information. Clustering on the first log data can improve the subsequent clustering result.

Then, in one embodiment, feature analysis may be performed on the obtained first log data, including: using a regular expression to clean useless information such as useless numbers and punctuation marks from the first log data; analyzing the cleaned first log data by methods such as word segmentation, classification, and statistics; and determining the feature content of the first log data to obtain log features, where the log features include the TID and the log description.

Optionally, in some embodiments, the log features may also include application system name, project name, host address, and log content.
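As a minimal sketch of the denoising and regular-expression cleaning in S101, the following Python assumes each raw log is a dictionary with hypothetical keys `tid` and `description`, and assumes that "useless information" means digits and punctuation. Neither the record layout nor the exact pattern is specified in the text.

```python
import re

# Hypothetical record layout: the key names are assumptions for illustration.
REQUIRED_FIELDS = ("tid", "description")

# Assumed notion of "useless information": runs of digits, punctuation, whitespace.
NOISE = re.compile(r"[\d\W_]+")


def denoise(raw_logs):
    """Drop records whose TID or description is missing or empty."""
    return [
        log for log in raw_logs
        if all(isinstance(log.get(f), str) and log[f].strip() for f in REQUIRED_FIELDS)
    ]


def clean_description(text):
    """Replace each run of noise characters with one space, then lowercase."""
    return NOISE.sub(" ", text).strip().lower()


def extract_features(raw_logs):
    """Return (tid, cleaned description) pairs for structurally complete logs."""
    return [
        (log["tid"], clean_description(log["description"]))
        for log in denoise(raw_logs)
    ]
```

Because `\W` is Unicode-aware for Python strings, non-ASCII word characters in log descriptions are kept rather than stripped.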

S102, classifying the first log data based on the link tracking code TID to obtain log data of a plurality of TID categories.

The first log data is quickly classified by TID: log data with the same TID is placed in one class, thereby obtaining log data of a plurality of TID categories.
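The grouping in S102 amounts to bucketing records by their TID. A minimal sketch, assuming the records are `(tid, description)` pairs (a hypothetical layout chosen for illustration):

```python
from collections import defaultdict


def group_by_tid(records):
    """Group (tid, description) pairs into one bucket per TID."""
    groups = defaultdict(list)
    for tid, description in records:
        groups[tid].append(description)
    return dict(groups)
```

A single pass over the data is enough, which is why this step is cheap compared with the text clustering that follows.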

S103, clustering the text information corresponding to the log descriptions in the log data of the plurality of TID categories according to the K-means clustering algorithm and the edit distance algorithm to obtain a clustering result of the first log data.

In the embodiment of the present invention, a clustering algorithm may be adopted to perform clustering processing on the text information for the log descriptions in the log data of multiple TID categories, so as to obtain a clustering result of the first log data.

In one embodiment, the clustering algorithm may include a K-means clustering algorithm and an edit distance algorithm.

Firstly, before clustering processing is carried out on log data of a plurality of TID categories, vectorization is carried out on log features in the log data of the TID categories respectively, and a plurality of feature dimensions are obtained.

And then selecting log descriptions in the multiple characteristic dimensions, and clustering text information corresponding to the log descriptions in the log data of the multiple TID categories to obtain a clustering result of the first log data.

In some embodiments, clustering text information corresponding to log descriptions in log data of multiple TID categories may include the following steps:

S1031, performing term frequency-inverse document frequency (TF-IDF) numerical conversion on the text information corresponding to the log descriptions in the log data of the plurality of TID categories respectively.

First, according to formula (1), term frequency (TF) statistics are computed on the text information corresponding to the log descriptions in the log data of the plurality of TID categories, giving the frequency with which a given term appears in the log to which it belongs.

$$\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}} \tag{1}$$

where, for a given term $t_i$, $n_{i,j}$ is the number of occurrences of $t_i$ in log $d_j$, $\sum_k n_{k,j}$ is the total number of occurrences of all terms in log $d_j$, and $\mathrm{tf}_{i,j}$ is the frequency of $t_i$ in the log $d_j$ to which it belongs.

Then, according to formula (2), inverse document frequency (IDF) statistics are computed on the text information corresponding to the log descriptions in the log data of the plurality of TID categories, evaluating the importance of a given term across all logs.

$$\mathrm{idf}_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|} \tag{2}$$

where, for a given term $t_i$, $|\{j : t_i \in d_j\}|$ is the number of logs containing $t_i$, $|D|$ is the total number of logs, and $\mathrm{idf}_i$ is the inverse document frequency of $t_i$.

Finally, the high-dimensional vector of each log in the log data of all TID categories is calculated according to formula (3).

$$\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_i \tag{3}$$

After TF-IDF numerical conversion, each log yields a high-dimensional vector of fixed length.
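The TF-IDF conversion in S1031 can be sketched in plain Python as follows. The sketch assumes each log description has already been split into tokens and uses the natural logarithm for the IDF; the text does not state the tokenizer or the logarithm base, so both are assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns (vocabulary, list of TF-IDF vectors)."""
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of logs that contain each term at least once.
    df = Counter(term for doc in docs for term in set(doc))
    # Assumption: natural logarithm, no smoothing.
    idf = {term: math.log(n_docs / df[term]) for term in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vectors.append([
            (counts[term] / total) * idf[term] if total else 0.0
            for term in vocab
        ])
    return vocab, vectors
```

Every vector has one entry per vocabulary term, which is why all logs end up with vectors of the same fixed length. A term that appears in every log gets an IDF of zero and therefore contributes nothing to any vector.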

S1032, reducing the dimension of the high-dimensional vectors by principal component analysis (PCA).

After the high-dimensional vector is obtained, the high-dimensional vector is subjected to dimensionality reduction by Principal Component Analysis (PCA), so that the considered characteristic variables are reduced, and the low-dimensional vector is obtained.

In some embodiments, dimensionality reduction of the high-dimensional vector according to PCA may include the following steps:

Step 1, to eliminate the influence of differing dimensions and overly large numerical differences among the high-dimensional vectors, the high-dimensional vectors are standardized to obtain a standardized matrix.

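As a specific example, the high-dimensional vectors are represented as a data matrix $X = (x_{ij})_{n \times p}$, where $i = 1, 2, \dots, n$, $j = 1, 2, \dots, p$, and $x_{ij}$ is the value of the j-th index of the i-th unit.

First, the arithmetic mean $\bar{x}_j$ of the j-th column of the data matrix is calculated according to formula (4).

$$\bar{x}_j = \frac{1}{n} \sum_{i=1}^{n} x_{ij} \tag{4}$$

Then, the standard deviation $\sigma_j$ of the j-th column of the data matrix is calculated according to formula (5).

Finally, the entries $y_{ij}$ of the standardized data matrix are calculated according to formula (6).

$$y_{ij} = \frac{x_{ij} - \bar{x}_j}{\sigma_j} \tag{6}$$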
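The standardization in step 1 can be sketched as a column-wise z-score. The sketch below assumes the sample standard deviation (divisor n - 1); the text does not state which divisor is used, so that choice is an assumption, as is the handling of constant columns.

```python
import math


def standardize(matrix):
    """Column-wise z-score of an n x p matrix given as a list of rows (n >= 2)."""
    n = len(matrix)
    p = len(matrix[0])
    means = [sum(row[j] for row in matrix) / n for j in range(p)]
    # Assumption: sample standard deviation with divisor n - 1.
    stds = [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in matrix) / (n - 1))
        for j in range(p)
    ]
    # Assumption: constant columns (std == 0) are mapped to 0.0.
    return [
        [(row[j] - means[j]) / stds[j] if stds[j] else 0.0 for j in range(p)]
        for row in matrix
    ]
```

After this step every non-constant column has mean 0 and unit standard deviation, so no single TF-IDF dimension dominates the correlation matrix built in step 2.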

Step 2, a correlation matrix is established from the standardized data matrix, and the eigenvalues and eigenvectors of the correlation matrix are calculated.

As a specific example, the correlation matrix $R$ is determined from the standardized data matrix $Y$, and its eigenvalues $\lambda_j$, $j = 1, 2, \dots, p$, are obtained and arranged from large to small so that $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p$. The corresponding eigenvectors $\alpha_i = (\alpha_{i1}, \alpha_{i2}, \dots, \alpha_{ip})$, $i = 1, 2, \dots, p$, are then solved from the characteristic polynomial.

Step 3, the variance contribution rate and the cumulative variance contribution rate are calculated from the eigenvalues and eigenvectors of the correlation matrix.

The eigenvalue of the correlation matrix is equal to the variance of the corresponding principal component, and the magnitude of the eigenvalue reflects the proportion of all information of the original data contained in the ith principal component and the contribution of each principal component.

Step 4, the principal components of the high-dimensional vectors are calculated according to formula (7).

$$Z = Y\alpha \tag{7}$$

where $Y$ is the standardized data matrix and $\alpha$ is the matrix of eigenvectors of the correlation matrix.

If the cumulative variance contribution rate of the first $s$ principal components satisfies

$$\beta(s) = \frac{\sum_{i=1}^{s} \lambda_i}{\sum_{i=1}^{p} \lambda_i} \ge \alpha$$

then $Z_1, Z_2, \dots, Z_s$ are the principal components of the samples $X_1, X_2, \dots, X_p$ at significance level $\alpha$, and the principal components $Z_1, Z_2, \dots, Z_s$ are used in place of the samples $X_1, X_2, \dots, X_p$. This not only reduces the dimensionality of the input high-dimensional vectors but also eliminates the autocorrelation of the original sample space, thereby yielding the low-dimensional vectors.
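The selection rule above, keeping leading principal components until their cumulative variance contribution reaches the level α, can be sketched as follows. The function name `components_needed` is hypothetical, and the eigenvalues are assumed to have been computed already by whatever eigen-solver is in use.

```python
def components_needed(eigenvalues, alpha):
    """Smallest s whose cumulative variance contribution reaches alpha."""
    ordered = sorted(eigenvalues, reverse=True)  # largest eigenvalue first
    total = sum(ordered)
    cumulative = 0.0
    for s, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= alpha:
            return s
    return len(ordered)
```

For eigenvalues 2.5, 1.0, 0.4 and 0.1, the first two components explain 3.5 / 4.0 = 87.5% of the variance, so a level of 0.85 keeps two components while a level of 0.95 keeps three.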

S1033, clustering the low-dimensional vectors according to the K-means clustering algorithm and the edit distance algorithm to obtain a clustering result of the first log data.

After the low-dimensional vectors are obtained, preliminary clustering is first performed on them by using the K-means clustering algorithm to obtain a first clustering result.
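For illustration, the preliminary clustering can be sketched with the plain Lloyd iteration of K-means. The random initialization from the data points, the fixed seed, and the iteration cap are assumptions; the text does not say how centroids are initialized or when iteration stops.

```python
import random


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(points, k, iterations=100, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Assumption: initial centroids are k distinct data points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the run has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids
```

The label values themselves are arbitrary; only which points share a label is meaningful.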

The PCA dimension reduction in S1032 has a certain positive influence on the result of the preliminary clustering.

In some embodiments, K-means preliminary clustering is performed on nearly 100 abnormal log samples, and the resulting first clustering result is evaluated to obtain a clustering result evaluation graph. Fig. 2 is an evaluation graph of a clustering result with PCA dimension reduction provided in an embodiment of the present invention, where the vertical axis represents the evaluation coefficient and the horizontal axis represents the chosen K value; the higher the evaluation coefficient, the better the clustering effect. As shown in Fig. 2, for abnormal log samples subjected to PCA dimension reduction, the evaluation coefficient is highest for K in the interval 8 to 11, at about 0.997. Fig. 3 is an evaluation graph of a clustering result without PCA dimension reduction provided in the embodiment of the present invention. As shown in Fig. 3, for abnormal log samples without PCA dimension reduction, the evaluation coefficient is too low for K in the interval 2 to 5, and is substantially constant and generally below 0.99 for K in the interval 8 to 100. Therefore, performing PCA dimensionality reduction on the log data before clustering can improve the quality of the preliminary clustering to a certain extent.

On the basis of the preliminary clustering, the low-dimensional vectors are further clustered by using the edit distance (Edit Distance). The edit distance between two strings is the minimum number of edit operations required to convert one string into the other, so it represents the degree of similarity between different logs well: similar logs have a short edit distance, and dissimilar logs have a long edit distance.

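Specifically, the edit distance is calculated according to formula (8).

$$\operatorname{lev}_{a,b}(i,j) = \begin{cases} \max(i,j) & \text{if } \min(i,j) = 0, \\ \min \begin{cases} \operatorname{lev}_{a,b}(i-1,j) + 1 \\ \operatorname{lev}_{a,b}(i,j-1) + 1 \\ \operatorname{lev}_{a,b}(i-1,j-1) + 1_{(a_i \neq b_j)} \end{cases} & \text{otherwise.} \end{cases} \tag{8}$$

where $\operatorname{lev}_{a,b}(i,j)$ is the edit distance between the first $i$ characters of string $a$ and the first $j$ characters of string $b$, so that the edit distance between the two strings is $\operatorname{lev}_{a,b}(|a|,|b|)$, with $|a|$ and $|b|$ the lengths of $a$ and $b$.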
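The edit distance can be computed with the usual dynamic-programming table. The sketch below is a standard Levenshtein implementation that keeps only two rows of the table; it is an illustration, not necessarily the implementation used in the embodiment.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, free on a match
            ))
        previous = current
    return previous[-1]
```

The running time is proportional to the product of the two string lengths, which is consistent with edit-distance clustering being the slow stage of the pipeline.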

In some embodiments, a distance threshold is preset as the criterion for the current clustering. If the minimum edit distance between a log A to be clustered and a log B in an existing TID category is smaller than the distance threshold, log A is assigned to the sub-category with the minimum edit distance under the TID category of log B; otherwise, log A is assigned to a new TID category. Repeating this process completes the whole clustering.
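The threshold rule can be sketched as a greedy single pass over the logs. Two simplifications here are assumptions rather than statements from the text: the edit distance is normalized by the length of the longer string so that it is comparable with a fractional threshold, and each cluster is represented by its first member.

```python
def levenshtein(a, b):
    """Standard two-row Levenshtein distance."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def normalized_distance(a, b):
    """Assumption: divide by the longer length to get a value in [0, 1]."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def threshold_cluster(logs, threshold):
    """Join the nearest existing cluster if close enough, else open a new one."""
    clusters = []  # each cluster is a list; its first element is the representative
    for log in logs:
        best, best_dist = None, None
        for cluster in clusters:
            dist = normalized_distance(log, cluster[0])
            if best_dist is None or dist < best_dist:
                best, best_dist = cluster, dist
        if best is not None and best_dist < threshold:
            best.append(log)
        else:
            clusters.append([log])
    return clusters
```

With this rule a larger threshold lets more logs join existing clusters, so fewer clusters are created and fewer representatives have to be compared.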

In some embodiments, 5000 abnormal log samples containing TIDs are input, and the changes in threshold, number of clusters, and run time during edit distance clustering are recorded and analyzed. Table 1 shows these changes when clustering is performed according to the edit distance algorithm in an embodiment of the present invention. As shown in Table 1, for the 5000 abnormal log samples input here, a larger threshold generally gives fewer clusters and a shorter program run time.

TABLE 1

Threshold	Number of clusters	Run time (seconds)	Run time (minutes)
0.05	17	3248.016	54.1336
0.1	12	2120.227	35.33711667
0.15	11	2121.372	35.3562
0.2	10	1658.652	27.6442
0.25	9	1632.79	27.21316667
0.3	8	1798.63	29.97716667

In some embodiments, the relationship of run time and number of clusters to the distance threshold when clustering according to the edit distance algorithm is obtained from a number of tests. Fig. 4 is a line graph of distance threshold versus run time, in which the horizontal axis represents the distance threshold and the vertical axis represents the run time (unit: seconds). As shown in Fig. 4, the larger the distance threshold, the shorter the run time of clustering, and the run time levels off once the threshold exceeds 0.2. Fig. 5 is a line graph of distance threshold versus number of clusters provided in the embodiment of the present invention, in which the horizontal axis represents the distance threshold and the vertical axis represents the number of clusters. As shown in Fig. 5, the larger the distance threshold, the fewer clusters are obtained.

With the log clustering method of the embodiment of the invention, request calls can be traced through the TID, so that when an application program fails the source of the failure can be found quickly and the performance bottleneck on each link can be monitored. The logs are classified based on the TID generated in the logs and are then clustered and analyzed with an algorithm that fuses several clustering techniques, which effectively compensates for the case in which a single log category corresponds to log content with multiple structures, thereby effectively improving both the accuracy and the speed of log clustering.

Fig. 6 is a schematic flowchart of another log clustering method according to an embodiment of the present invention, and as shown in fig. 6, the log clustering method may include S101 to S104.

S101, first log data are acquired.

S104, evaluating the clustering result according to clustering algorithm evaluation indexes to obtain an evaluation result, where the clustering algorithm evaluation indexes include the silhouette coefficient, the Calinski-Harabasz index, and the Davies-Bouldin index.

The silhouette coefficient (Silhouette Coefficient) is a way to evaluate clustering quality. It combines two factors, cohesion and separation, and can be used to evaluate, on the same original data, the influence of different algorithms or of different operating modes of one algorithm on the clustering result.

The silhouette coefficient of each vector in the clustering is calculated according to formula (9).

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}} \tag{9}$$

where $a(i)$ is the average distance from vector $i$ to the other points in the cluster to which it belongs, and $b(i)$ is the average distance from vector $i$ to all points in the nearest other cluster. The silhouette coefficient lies in the range [-1, 1]; the closer it is to 1, the better the cohesion and separation.

The silhouette coefficients of all points are averaged to obtain the overall silhouette coefficient of the clustering result; the higher the silhouette coefficient, the better the clustering effect.
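A minimal sketch of the overall silhouette coefficient follows, assuming Euclidean distance between vectors and assigning a score of 0 to points in singleton clusters (a common convention). Neither choice is stated in the text.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        # a(i): mean distance to the other members of the same cluster
        # (the point's zero distance to itself does not change the sum).
        a = sum(math.dist(point, other) for other in own) / (len(own) - 1)
        # b(i): smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(point, other) for other in members) / len(members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Well-separated, compact clusters give values close to 1, while values near 0 indicate overlapping clusters.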

In some embodiments, 5000 abnormal log samples containing TIDs are input for edit distance clustering, the obtained clustering results are evaluated by using the silhouette coefficient, and the relationship between the silhouette coefficient and the number of clusters is analyzed. Fig. 7 is a silhouette coefficient evaluation index graph based on TID clustering according to the embodiment of the present invention, where the vertical axis represents the silhouette coefficient and the horizontal axis represents the number of clusters in the clustering result. As shown in Fig. 7, in TID-based clustering the silhouette coefficient is highest, and the clustering effect best, when the number of clusters is between 2 and 5; it is second highest when the number of clusters is between 8 and 11.

The Calinski-Harabasz (CHI) index is calculated according to formula (10); the higher the CHI index value, the better the clustering effect.

$$s(k) = \frac{\operatorname{tr}(B_k)}{\operatorname{tr}(W_k)} \times \frac{m - k}{k - 1} \tag{10}$$

where $m$ is the number of samples in the training set, $k$ is the number of classes, $B_k$ is the between-class covariance matrix, $W_k$ is the within-class covariance matrix, and $\operatorname{tr}$ is the trace of a matrix.
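The CHI index can be sketched without forming the covariance matrices explicitly, because the trace of the between-class matrix equals the size-weighted sum of squared distances from each class center to the overall center, and the trace of the within-class matrix equals the sum of squared distances from each point to its class center. The function name below is hypothetical.

```python
def calinski_harabasz(points, labels):
    """CHI = tr(B) / tr(W) * (m - k) / (k - 1), using squared Euclidean distances."""
    m = len(points)
    dim = len(points[0])
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    k = len(clusters)
    if k < 2 or k >= m:
        raise ValueError("need 2 <= k < m")

    def mean(rows):
        return [sum(r[d] for r in rows) / len(rows) for d in range(dim)]

    def sq(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    overall = mean(points)
    between = within = 0.0
    for members in clusters.values():
        center = mean(members)
        between += len(members) * sq(center, overall)  # contribution to tr(B)
        within += sum(sq(p, center) for p in members)  # contribution to tr(W)
    # Note: raises ZeroDivisionError if every point coincides with its class center.
    return (between / within) * (m - k) / (k - 1)
```

For four one-dimensional points 0, 1, 10 and 11 split into two classes {0, 1} and {10, 11}, tr(B) is 100 and tr(W) is 1, giving a CHI of 100 × (4 − 2) / (2 − 1) = 200.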

In some embodiments, 5000 abnormal log samples containing TIDs are input for K-means clustering and editing distance algorithm clustering, obtained clustering results are evaluated by using a CHI index, and the relationship between the CHI index and the number of clustered clusters is analyzed. Fig. 8 is a CHI index evaluation index graph based on TID clustering according to an embodiment of the present invention, where as shown in fig. 8, the vertical axis represents the CHI index, and the horizontal axis represents the cluster number of the clustering result, and in TID-based clustering, the CHI index value of the clustering result is the highest between 2 and 3 cluster numbers, the clustering effect is the best, and the clustering effect is the second between 5 and 9 cluster numbers.

The Davies-Bouldin index (DBI) is the average, over all classes, of the maximum ratio of the sum of the intra-class average distances of two classes to the distance between the centroids of those two classes; the smaller the DBI, the better the clustering effect.

Calculating the DBI index of the clustering result can comprise the following steps:

Step 1, the degree of dispersion is calculated according to formula (11).

$$ S_i = \left( \frac{1}{T_i} \sum_{j=1}^{T_i} \lvert X_j - A_i \rvert^{q} \right)^{1/q} \qquad (11) $$

Wherein, X_j denotes the jth data point in the ith class, A_i denotes the center of the ith class, T_i denotes the number of data points in the ith class, and S_i denotes the degree of dispersion of the data points in the ith class. When q is 1, S_i is the mean of the distances from each point to the center; when q is 2, S_i is the standard deviation of the distances from each point to the center.

Step 2, the distance between the categories is calculated according to formula (12).

$$ M_{ij} = \left( \sum_{k=1}^{N} \lvert a_{ki} - a_{kj} \rvert^{p} \right)^{1/p} \qquad (12) $$

Wherein, a_ki denotes the value of the kth attribute of the center point of the ith class, a_kj denotes the value of the kth attribute of the center point of the jth class, N denotes the number of attributes of a center point, and M_ij denotes the distance between the center of the ith class and the center of the jth class. When p is 1, M_ij is the Manhattan (city-block) distance between the two centers; when p is 2, M_ij is the Euclidean distance between the two centers.

Step 3, the similarity between the categories is calculated according to formula (13).

$$ R_{ij} = \frac{S_i + S_j}{M_{ij}} \qquad (13) $$

Wherein, S_i denotes the degree of dispersion of the data points in the ith class, S_j denotes the degree of dispersion of the data points in the jth class, M_ij denotes the distance between the center of the ith class and the center of the jth class, and R_ij denotes the similarity between the ith class and the jth class.

Step 4, for each class i, the maximum value of R_ij over all other classes j is selected,

$$ D_i = \max_{j \neq i} R_{ij} $$

that is, the maximum among the similarities between the ith class and the other classes, and the mean value of the maximum similarities of all classes is calculated according to formula (14).

$$ \mathrm{DBI} = \frac{1}{N} \sum_{i=1}^{N} D_i \qquad (14) $$

Wherein, N denotes the number of categories, and DBI denotes the mean value of the maximum similarities, namely the DBI index of the clustering result. The number of categories influences the value of the DBI index.
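By way of non-limiting illustration, steps 1 to 4 may be sketched as follows. The sketch assumes the standard Davies-Bouldin construction, with the point-to-center distance taken as Euclidean and the class centers taken as arithmetic means; the function name davies_bouldin and the helper center_distance are hypothetical.

```python
from math import dist


def davies_bouldin(points, labels, q=2, p=2):
    """Davies-Bouldin index following steps 1 to 4 (smaller is better)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("at least two classes are required")

    dim = len(points[0])
    centers = {
        label: tuple(sum(x[d] for x in members) / len(members) for d in range(dim))
        for label, members in clusters.items()
    }
    # Step 1: degree of dispersion S_i of each class
    scatter = {
        label: (sum(dist(x, centers[label]) ** q for x in members) / len(members)) ** (1 / q)
        for label, members in clusters.items()
    }

    # Step 2: distance M_ij between two class centers
    # (assumption: no two class centers coincide, so M_ij > 0)
    def center_distance(i, j):
        return sum(abs(a - b) ** p for a, b in zip(centers[i], centers[j])) ** (1 / p)

    # Steps 3 and 4: similarity R_ij, its maximum per class, then the mean
    total = 0.0
    for i in clusters:
        total += max(
            (scatter[i] + scatter[j]) / center_distance(i, j)
            for j in clusters
            if j != i
        )
    return total / len(clusters)
```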

In some embodiments, 5000 abnormal log samples containing TIDs are input for K-means clustering and edit distance clustering, the obtained clustering results are evaluated using the DBI index, and the relationship between the DBI index and the number of clusters is analyzed. Fig. 9 is a graph of the DBI evaluation index based on TID clustering according to an embodiment of the present invention. As shown in Fig. 9, the vertical axis represents the DBI index and the horizontal axis represents the number of clusters in the clustering result. In TID-based clustering, the DBI index indicates the best clustering effect when the number of clusters is between 2 and 3, and the second-best clustering effect when the number of clusters is between 8 and 11.

In some embodiments, after the clustering result is evaluated according to the clustering algorithm evaluation indexes, the clustering result, the evaluation result, the TF-IDF parameter conversion table, and the edit distance clustering parameter conversion table are output. The TF-IDF parameter conversion table and the edit distance clustering parameter conversion table can be stored in a database in document form, which makes them convenient for development and maintenance personnel to check.

In some embodiments, 5000 abnormal log samples are input for clustering with a conventional hierarchical algorithm (without prior classification based on TID), and the obtained clustering results are evaluated using the CHI index. Fig. 10 is a graph of the CHI evaluation index based on the conventional hierarchical clustering algorithm according to an embodiment of the present invention. As shown in Fig. 10, the vertical axis represents the CHI index and the horizontal axis represents the number of clusters in the clustering result. Clustering results with a CHI index greater than 0.7 are mostly distributed in the region with a high number of clusters (between 50 and 100), and the CHI index tends to keep increasing as the number of clusters increases. This clearly does not conform to the principle that clustering should extract a compact set of data classes, and the clustering effect is poor.

In some embodiments, 5000 abnormal log samples containing TIDs are input for K-means clustering and edit distance clustering, and the obtained clustering results are evaluated using the CHI index. Fig. 11 is a graph of the CHI evaluation index based on TID clustering according to an embodiment of the present invention. As shown in Fig. 11, the vertical axis represents the CHI index and the horizontal axis represents the number of clusters in the clustering result. In TID-based clustering, clustering results with a CHI index greater than 0.85 are mostly distributed in the region with a lower number of clusters (between 14 and 25), and the CHI index tends to keep decreasing as the number of clusters increases, so the clustering effect is better.

According to the log clustering method provided by the embodiment of the present invention, the clustering result of the logs is evaluated using the silhouette coefficient, the CHI index, and the DBI index, so that the relationship between each evaluation index and the number of clusters in the clustering result can be analyzed. Based on this relationship, a suitable log clustering algorithm is selected and the number of clusters in the clustering result is adjusted, which effectively improves the log clustering effect.
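By way of non-limiting illustration, adjusting the number of clusters from an evaluation index may be sketched as follows. The function name select_cluster_count is hypothetical, and the rule of picking the single most favorable value, with ties resolved toward the smaller cluster number, is an assumption rather than a feature of the embodiment.

```python
def select_cluster_count(scores, higher_is_better=True):
    """Return the cluster number whose evaluation index is most favorable.

    scores maps each candidate cluster number to its index value.
    Use higher_is_better=True for the silhouette coefficient and the CHI
    index, and higher_is_better=False for the DBI index.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    pick = max if higher_is_better else min
    best_value = pick(scores.values())
    # Ties are resolved in favor of the smaller cluster number.
    return min(k for k, value in scores.items() if value == best_value)
```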

Fig. 12 is a schematic structural diagram of a log clustering apparatus according to an embodiment of the present invention, and as shown in fig. 12, the log clustering apparatus 200 may include: an obtaining module 210, a classifying module 220, and a clustering module 230.

The obtaining module 210 is configured to obtain first log data, where the first log data includes a link tracking code TID and a log description; the classification module 220 is configured to classify the first log data based on the TID to obtain log data of multiple TID categories; and the clustering module 230 is configured to perform clustering processing on text information corresponding to log descriptions in the log data of the multiple TID categories according to a K-means clustering algorithm and an edit distance algorithm to obtain a clustering result of the first log data.
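By way of non-limiting illustration, the classification by TID and the edit distance part of the clustering may be sketched as follows. The record keys "tid" and "description", the function names, and the greedy threshold rule with the parameter max_ratio are all hypothetical assumptions; the K-means stage is omitted from this sketch.

```python
from collections import defaultdict


def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = cur
    return prev[-1]


def group_by_tid(records):
    """Classify log records into TID categories."""
    groups = defaultdict(list)
    for record in records:
        groups[record["tid"]].append(record["description"])
    return dict(groups)


def cluster_by_edit_distance(descriptions, max_ratio=0.3):
    """Greedy clustering: join the first cluster whose representative is close."""
    clusters = []  # the first element of each cluster is its representative
    for text in descriptions:
        for cluster in clusters:
            representative = cluster[0]
            limit = max_ratio * max(len(representative), len(text))
            if edit_distance(representative, text) <= limit:
                cluster.append(text)
                break
        else:
            clusters.append([text])
    return clusters
```

In this sketch each TID category would be clustered separately, for example by calling cluster_by_edit_distance on every list returned by group_by_tid.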

In some embodiments, the obtaining module 210 is specifically configured to: acquire second log data; and remove the semi-structured data or the log data with missing content from the second log data to obtain the first log data.

In some embodiments, the apparatus further includes a determining module 240 configured to, after the first log data is obtained, clean the first log data using a regular expression, and determine the TID and the log description in the first log data according to the cleaned first log data.
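By way of non-limiting illustration, cleaning with a regular expression may be sketched as follows. The raw log layout "[TID:<id>] <log description>", the masking of timestamps and numbers with the placeholder <*>, and the names TID_PATTERN, NOISE_PATTERN, and clean_log_line are all hypothetical assumptions; the actual log layout is not specified here.

```python
import re

# Hypothetical raw-log layout: "... [TID:<id>] <log description>"
TID_PATTERN = re.compile(r"\[TID:(?P<tid>[0-9A-Za-z._-]+)\]\s*(?P<description>.*)")
# Assumption: timestamps and bare numbers are volatile and are masked
NOISE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}\b|\b\d+\b")


def clean_log_line(line):
    """Return the TID and the cleaned log description, or None if no TID is found."""
    match = TID_PATTERN.search(line)
    if match is None:
        return None
    description = NOISE_PATTERN.sub("<*>", match.group("description"))
    description = re.sub(r"\s+", " ", description).strip()
    return {"tid": match.group("tid"), "description": description}
```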

In some embodiments, the clustering module 230 is specifically configured to: vectorize the TIDs and the log descriptions in the log data of the plurality of TID categories respectively to obtain a plurality of feature dimensions; select the log descriptions in the first log data corresponding to the plurality of feature dimensions, and perform term frequency-inverse document frequency (TF-IDF) numeralization on the text information corresponding to the log descriptions in the log data of the plurality of TID categories to obtain a plurality of high-dimensional vectors; perform dimensionality reduction on the high-dimensional vectors according to principal component analysis (PCA) to obtain low-dimensional vectors; and cluster the low-dimensional vectors according to the K-means clustering algorithm and the edit distance algorithm to obtain the clustering result of the first log data.
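By way of non-limiting illustration, the TF-IDF numeralization may be sketched as follows. The sketch assumes whitespace tokenization and the plain weighting tf × log(N / df), which may differ from the weighting used in a given embodiment; the function name tf_idf_vectors is hypothetical, and the PCA and clustering stages are omitted.

```python
from collections import Counter
from math import log


def tf_idf_vectors(descriptions):
    """Turn log descriptions into TF-IDF vectors over a shared vocabulary."""
    docs = [text.lower().split() for text in descriptions]
    vocabulary = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of descriptions that contain each term
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {term: log(n_docs / doc_freq[term]) for term in vocabulary}

    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([
            counts[term] / len(doc) * idf[term] if doc else 0.0
            for term in vocabulary
        ])
    return vocabulary, vectors
```

Each vector has one component per vocabulary term, which is why the result is high-dimensional for realistic log volumes and is reduced before clustering.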

In some embodiments, the apparatus further includes an evaluation module 250 configured to evaluate the clustering result according to clustering algorithm evaluation indexes to obtain an evaluation result, where the clustering algorithm evaluation indexes include the silhouette coefficient, the Calinski-Harabasz index, and the Davies-Bouldin index.

In some embodiments, the first log data further comprises: application system name, project name, host address, and log content.

The log clustering apparatus provided by the embodiment of the present invention can deeply mine, from massive logs, effective information that helps operation and maintenance personnel with detection, compensates for the error produced by a vectorization model with a single structure, and saves clustering time while improving the accuracy of the clustering result.

Fig. 13 is a schematic diagram of a hardware structure of a log clustering device according to an embodiment of the present invention.

As shown in Fig. 13, the log clustering device 300 in this embodiment includes an input device 301, an input interface 302, a central processor 303, a memory 304, an output interface 305, and an output device 306. The input interface 302, the central processor 303, the memory 304, and the output interface 305 are connected to each other through a bus 310, and the input device 301 and the output device 306 are connected to the bus 310 through the input interface 302 and the output interface 305, respectively, and thereby to the other components of the log clustering device 300.

Specifically, the input device 301 receives input information from the outside and transmits the input information to the central processor 303 through the input interface 302; the central processor 303 processes the input information based on computer-executable instructions stored in the memory 304 to generate output information, stores the output information temporarily or permanently in the memory 304, and then transmits the output information to the output device 306 through the output interface 305; the output device 306 outputs the output information to the outside of the log clustering device 300 for use by the user.

In one embodiment, the clustering device 300 of the log shown in fig. 13 includes: a memory 304 for storing programs; a processor 303 for executing the program stored in the memory to execute the method of the embodiment shown in fig. 1 or fig. 6 provided by the embodiment of the present invention.

An embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium has computer program instructions stored thereon; the computer program instructions, when executed by a processor, implement the method of the embodiment of fig. 1 or fig. 6 provided by embodiments of the present invention.

It is to be understood that the invention is not limited to the specific arrangements and instrumentality described above and shown in the drawings. A detailed description of known methods is omitted herein for the sake of brevity. In the above embodiments, several specific steps are described and shown as examples. However, the method processes of the present invention are not limited to the specific steps described and illustrated, and those skilled in the art can make various changes, modifications and additions or change the order between the steps after comprehending the spirit of the present invention.

The functional blocks shown in the above-described structural block diagrams may be implemented as hardware, software, firmware, or a combination thereof. When implemented in hardware, a functional block may be, for example, an electronic circuit, an Application Specific Integrated Circuit (ASIC), suitable firmware, a plug-in, a function card, or the like. When implemented in software, the elements of the invention are the programs or code segments used to perform the required tasks. The programs or code segments may be stored in a machine-readable medium or transmitted by a data signal carried in a carrier wave over a transmission medium or a communication link. A "machine-readable medium" may include any medium that can store or transfer information. Examples of machine-readable media include electronic circuits, semiconductor memory devices, Read-Only Memories (ROMs), flash memories, erasable ROMs (EROMs), floppy disks, CD-ROMs, optical disks, hard disks, fiber optic media, Radio Frequency (RF) links, and so forth. The code segments may be downloaded via computer networks such as the Internet or an intranet.

It should also be noted that the exemplary embodiments mentioned in this patent describe some methods or systems based on a series of steps or devices. However, the present invention is not limited to the order of the above-described steps, that is, the steps may be performed in the order mentioned in the embodiments, may be performed in an order different from the order in the embodiments, or may be performed simultaneously.

As described above, only the specific embodiments of the present invention are provided, and it can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the system, the module and the unit described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again. It should be understood that the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the present invention, and these modifications or substitutions should be covered within the scope of the present invention.

Claims

1. A method for clustering logs, the method comprising:

acquiring first log data, wherein the first log data comprises a link tracking code TID and a log description;

classifying the first log data based on the TID to obtain log data of a plurality of TID categories;

and clustering text information corresponding to the log description in the log data of the plurality of TID categories according to a K-means clustering algorithm and an edit distance algorithm to obtain a clustering result of the first log data.

2. The method of claim 1, wherein obtaining first log data comprises:

acquiring second log data;

and removing the semi-structured data or the log data with missing content in the second log data to obtain the first log data.

3. The method of claim 1 or 2, wherein after said obtaining the first log data, the method further comprises:

cleaning the first log data by using a regular expression;

and determining the TID and the log description in the first log data according to the cleaned first log data.

4. The method of claim 1, wherein the clustering the text information corresponding to the log description in the log data of the plurality of TID categories according to a K-means clustering algorithm and an edit distance algorithm to obtain a clustering result of the first log data comprises:

vectorizing the TIDs and the log description in the log data of the TID categories respectively to obtain a plurality of characteristic dimensions;

selecting log descriptions in the first log data corresponding to the characteristic dimensions, and performing word frequency-inverse file frequency TF-IDF numeralization on text information corresponding to the log descriptions in the log data of the TID categories to obtain a plurality of high-dimensional vectors;

performing dimensionality reduction on the high-dimensionality vector according to Principal Component Analysis (PCA) to obtain a low-dimensionality vector;

and clustering the low-dimensional vectors according to a K-means clustering algorithm and an edit distance algorithm to obtain a clustering result of the first log data.

5. The method of claim 1 or 4, further comprising:

and evaluating the clustering result according to a clustering algorithm evaluation index to obtain an evaluation result, wherein the clustering algorithm evaluation index comprises a silhouette coefficient, a Calinski-Harabasz index and a Davies-Bouldin index.

6. The method of claim 1, wherein the first log data further comprises:

application system name, project name, host address, and log content.

7. An apparatus for clustering logs, the apparatus comprising:

the acquisition module is used for acquiring first log data, and the first log data comprises a link tracking code TID and a log description;

the classification module is used for classifying the first log data based on the TID to obtain log data of a plurality of TID categories;

and the clustering module is used for clustering the text information corresponding to the log description in the log data of the plurality of TID categories according to a K-means clustering algorithm and an edit distance algorithm to obtain a clustering result of the first log data.

8. The apparatus of claim 7, wherein the clustering module is specifically configured to:

9. The apparatus of claim 7, further comprising:

and the evaluation module is used for evaluating the clustering result according to a clustering algorithm evaluation index to obtain an evaluation result, wherein the clustering algorithm evaluation index comprises a silhouette coefficient, a Calinski-Harabasz index and a Davies-Bouldin index.

10. An apparatus for clustering logs, the apparatus comprising: a processor and a memory storing computer program instructions;

the processor, when executing the computer program instructions, implements a method of clustering logs according to any one of claims 1 to 6.

11. A computer-readable storage medium, having stored thereon computer program instructions, which, when executed by a processor, implement a method of clustering logs according to any one of claims 1 to 6.