CN113449098A - Log clustering method, device, equipment and storage medium - Google Patents
- Publication number
- CN113449098A (application CN202010219766.4A)
- Authority
- CN
- China
- Prior art keywords
- clustering
- log data
- log
- tid
- algorithm
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Class or cluster creation or modification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
- G06F18/2135—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Probability & Statistics with Applications (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Embodiments of the invention provide a log clustering method, apparatus, device, and storage medium. The method comprises: acquiring first log data, wherein the first log data comprises a link tracking code (TID) and a log description; classifying the first log data based on the TID to obtain log data of a plurality of TID categories; and clustering the text information corresponding to the log descriptions in the log data of the plurality of TID categories according to a K-means clustering algorithm and an edit distance algorithm to obtain a clustering result of the first log data. The method can mine, from massive logs, information that helps operation and maintenance personnel detect faults, compensates for the errors produced by a vectorization model with a single structure, improves the accuracy of the clustering result, and reduces clustering time.
Description
Technical Field
The present invention relates to the field of log processing, and in particular, to a log clustering method, apparatus, device, and storage medium.
Background
With the development of internet platforms, the scope and depth of internet applications keep expanding. When an application program encounters a fault, it generates a log containing service state information such as current memory usage and central processing unit (CPU) utilization. Faced with massive log information, conventional clustering is usually used to classify the logs and mine useful information from them, producing a clustering result that operation and maintenance personnel can analyze to track system faults and to debug and maintain the system accordingly.
In conventional clustering, logs are usually tokenized with a word segmentation method; for example, a log is split on the spaces it contains to obtain a sequence of words. The similarity of two logs is then evaluated by the number of words they share, and the logs are clustered based on this similarity to obtain a clustering result. However, this word segmentation weakens the relationships within the log content and treats the segmented words as independent of one another, so structural information that depends on word position cannot be distinguished during clustering. Word positions are lost and ambiguity arises. For example, when "I take your things" and "I take my things" are clustered, both sentences may be placed in the same category because structural information is ignored, which introduces errors into the clustering result.
To solve these problems, the prior art has continued to improve log content vectorization and the clustering algorithms, and has proposed vectorization models (or vectorization templates) that process log information in a structured way during clustering. However, a vectorization model has a single structure. When one class of similar logs contains multiple structures, templates of multiple types must be generated to cluster them, which increases clustering time and lowers the accuracy of the clustering result.
Disclosure of Invention
Embodiments of the invention provide a log clustering method, apparatus, device, and storage medium. Logs are clustered and analyzed based on the link tracking code (TID) generated in the logs together with several clustering algorithms, so that information that helps operation and maintenance personnel detect faults can be mined from massive logs. This compensates for the errors produced by a vectorization model with a single structure, improves the accuracy of the clustering result, and reduces clustering time.
In a first aspect, a log clustering method is provided, where the method includes: acquiring first log data, wherein the first log data comprises a link tracking code (TID) and a log description; classifying the first log data based on the TID to obtain log data of a plurality of TID categories; and clustering the text information corresponding to the log descriptions in the log data of the plurality of TID categories according to a K-means clustering algorithm and an edit distance algorithm to obtain a clustering result of the first log data.
In some implementations of the first aspect, acquiring the first log data includes: acquiring second log data; and removing semi-structured data or log data with missing content from the second log data to obtain the first log data.
In some implementations of the first aspect, after the first log data is acquired, the method further includes: cleaning the first log data using a regular expression; and determining the TID and the log description in the first log data from the cleaned first log data.
In some implementations of the first aspect, clustering the text information corresponding to the log descriptions in the log data of the plurality of TID categories according to the K-means clustering algorithm and the edit distance algorithm to obtain the clustering result of the first log data includes: vectorizing the TID and the log description in the log data of each TID category to obtain a plurality of feature dimensions; selecting the log descriptions in the first log data corresponding to the plurality of feature dimensions, and performing term frequency-inverse document frequency (TF-IDF) numericalization on the text information corresponding to the log descriptions in the log data of the plurality of TID categories to obtain a plurality of high-dimensional vectors; reducing the dimensionality of the high-dimensional vectors according to principal component analysis (PCA) to obtain low-dimensional vectors; and clustering the low-dimensional vectors according to the K-means clustering algorithm and the edit distance algorithm to obtain the clustering result of the first log data.
In some implementations of the first aspect, the method further includes: evaluating the clustering result according to clustering algorithm evaluation indices to obtain an evaluation result, wherein the clustering algorithm evaluation indices include the silhouette coefficient, the Calinski-Harabasz index, and the Davies-Bouldin index.
In some implementations of the first aspect, the first log data further includes: application system name, project name, host address, and log content.
In a second aspect, a log clustering apparatus is provided, the apparatus including: an acquisition module configured to acquire first log data, the first log data comprising a link tracking code (TID) and a log description; a classification module configured to classify the first log data based on the TID to obtain log data of a plurality of TID categories; and a clustering module configured to cluster the text information corresponding to the log descriptions in the log data of the plurality of TID categories according to a K-means clustering algorithm and an edit distance algorithm to obtain a clustering result of the first log data.
In some implementations of the second aspect, the acquisition module is specifically configured to: acquire second log data; and remove semi-structured data or log data with missing content from the second log data to obtain the first log data.
In some implementations of the second aspect, the apparatus further includes a determining module configured to, after the first log data is acquired, clean the first log data using a regular expression and determine the TID and the log description in the first log data from the cleaned first log data.
In some implementations of the second aspect, the clustering module is specifically configured to: vectorize the TID and the log description in the log data of each TID category to obtain a plurality of feature dimensions; select the log descriptions in the first log data corresponding to the plurality of feature dimensions, and perform term frequency-inverse document frequency (TF-IDF) numericalization on the text information corresponding to the log descriptions in the log data of the plurality of TID categories to obtain a plurality of high-dimensional vectors; reduce the dimensionality of the high-dimensional vectors according to principal component analysis (PCA) to obtain low-dimensional vectors; and cluster the low-dimensional vectors according to the K-means clustering algorithm and the edit distance algorithm to obtain the clustering result of the first log data.
In some implementations of the second aspect, the apparatus further includes an evaluation module configured to evaluate the clustering result according to clustering algorithm evaluation indices to obtain an evaluation result, where the clustering algorithm evaluation indices include the silhouette coefficient, the Calinski-Harabasz index, and the Davies-Bouldin index.
In some implementations of the second aspect, the first log data further includes: application system name, project name, host address, and log content.
In a third aspect, a log clustering device is provided, where the device includes: a processor and a memory storing computer program instructions; the processor, when executing the computer program instructions, implements the log clustering method of the first aspect or of some implementations of the first aspect.
In a fourth aspect, a computer-readable storage medium is provided, on which computer program instructions are stored, and the computer program instructions, when executed by a processor, implement the log clustering method of the first aspect or of some implementations of the first aspect.
The invention relates to the technical field of log processing, in particular to a log clustering method, a device, equipment and a storage medium.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required to be used in the embodiments of the present invention will be briefly described below, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a schematic flowchart of a log clustering method according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating evaluation of clustering results including PCA dimensionality reduction according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating evaluation of clustering results without PCA dimensionality reduction provided by the embodiment of the present invention;
FIG. 4 is a distance threshold versus runtime line graph provided by an embodiment of the present invention;
FIG. 5 is a line graph of distance threshold versus cluster number provided by an embodiment of the present invention;
FIG. 6 is a flowchart illustrating another log clustering method according to an embodiment of the present invention;
FIG. 7 is a graph of the silhouette coefficient evaluation index for TID-based clustering according to an embodiment of the present invention;
FIG. 8 is a graph of the Calinski-Harabasz index (CHI) evaluation index for TID-based clustering according to an embodiment of the present invention;
FIG. 9 is a graph of the Davies-Bouldin index (DBI) evaluation index for TID-based clustering according to an embodiment of the present invention;
FIG. 10 is a graph of the CHI evaluation index for a conventional hierarchical clustering algorithm according to an embodiment of the present invention;
FIG. 11 is a graph of the CHI evaluation index for TID-based clustering according to an embodiment of the present invention;
FIG. 12 is a schematic structural diagram of a log clustering device according to an embodiment of the present invention;
FIG. 13 is a schematic structural diagram of a log clustering device according to an embodiment of the present invention.
Detailed Description
Features and exemplary embodiments of various aspects of the present invention will be described in detail below, and in order to make objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not to be construed as limiting the invention. It will be apparent to one skilled in the art that the present invention may be practiced without some of these specific details. The following description of the embodiments is merely intended to provide a better understanding of the present invention by illustrating examples of the present invention.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The term "and/or" herein merely describes an association between associated objects and means that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone.
The embodiments of the invention provide a log clustering method, apparatus, device, and computer-readable storage medium to solve two problems: structural information that depends on position in the logs cannot be distinguished, and, when one class of similar logs contains multiple structures, the single structure of the vectorization model leads to long clustering time and low accuracy of the clustering result.
The technical solutions of the embodiments of the present invention are described below with reference to the accompanying drawings.
FIG. 1 is a schematic flowchart of a log clustering method provided in an embodiment of the present invention. As shown in FIG. 1, the method is executed by a log clustering apparatus and may include the following steps:
S101, acquiring first log data.
First, second log data is obtained, wherein the second log data is a large amount of original log data generated by daily work of the system, and the original log data can include normal log data generated when the system normally runs and abnormal log data generated when an application program in the system fails.
The original log data usually contains semi-structured data with incomplete structural information or log data with missing content; for example, some logs may lose their link tracking code (Trace ID, TID) during generation. The original log data can therefore be denoised: sample noise points with missing information, that is, the semi-structured data or log data with missing content, are removed to obtain first log data that contains structured information. Clustering on the first log data can improve the subsequent clustering result.
Then, in one embodiment, feature analysis may be performed on the obtained first log data: useless information such as meaningless numbers and punctuation marks is cleaned from the first log data with regular expressions; the cleaned first log data is analyzed by methods such as word segmentation, classification, and statistics; and the feature content in the first log data is determined to obtain the log features, which include the TID and the log description.
Optionally, in some embodiments, the log features may also include application system name, project name, host address, and log content.
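The denoising and regular-expression cleaning in S101 can be illustrated with a minimal Python sketch. The log layout (`<timestamp> [TID:<id>] <description>`), the two patterns, and the function names `parse_log` and `build_first_log_data` are assumptions made for illustration; the embodiment does not specify a concrete log format or expression.

```python
import re

# Hypothetical log layout: "<timestamp> [TID:<id>] <description>".
# The embodiment does not give a concrete format; this one is assumed.
TID_PATTERN = re.compile(r"\[TID:(?P<tid>[0-9A-Za-z.\-]+)\]")
# Digits and punctuation are treated as useless information to be cleaned.
NOISE_PATTERN = re.compile(r"[0-9]+|[^\w\s]")


def parse_log(line):
    """Return (tid, cleaned description), or None for a noisy sample."""
    match = TID_PATTERN.search(line)
    if match is None:
        return None  # log with a missing TID: removed as sample noise
    description = line[match.end():]
    cleaned = NOISE_PATTERN.sub(" ", description)
    cleaned = " ".join(cleaned.lower().split())
    if not cleaned:
        return None  # log with missing content: removed as sample noise
    return match.group("tid"), cleaned


def build_first_log_data(raw_lines):
    """Turn raw (second) log data into cleaned (first) log data."""
    parsed = (parse_log(line) for line in raw_lines)
    return [item for item in parsed if item is not None]
```

Logs for which `parse_log` returns `None` correspond to the sample noise points that are dropped before clustering.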
S102, classifying the first log data based on the link tracking code TID to obtain log data of a plurality of TID categories.
The first log data is quickly classified using the TID: log data with the same TID is placed in one class, yielding log data of a plurality of TID categories.
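The TID-based classification in S102 amounts to grouping records by key. A minimal sketch, assuming each record of the first log data is a `(tid, description)` pair; the pair layout and the name `classify_by_tid` are illustrative and not taken from the embodiment:

```python
from collections import defaultdict


def classify_by_tid(first_log_data):
    """Group (tid, description) pairs so that logs sharing a TID form one category."""
    categories = defaultdict(list)
    for tid, description in first_log_data:
        categories[tid].append(description)
    return dict(categories)
```

Because the grouping is a single pass over the data, it costs far less than the text clustering that follows it.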
S103, clustering the text information corresponding to the log descriptions in the log data of the plurality of TID categories according to a K-means clustering algorithm and an edit distance algorithm to obtain a clustering result of the first log data.
In the embodiment of the present invention, a clustering algorithm may be adopted to perform clustering processing on the text information for the log descriptions in the log data of multiple TID categories, so as to obtain a clustering result of the first log data.
In one embodiment, the clustering algorithm may include a K-means clustering algorithm and an edit distance algorithm.
First, before the log data of the plurality of TID categories is clustered, the log features in the log data of each TID category are vectorized to obtain a plurality of feature dimensions.
Then, the log description is selected from the plurality of feature dimensions, and the text information corresponding to the log descriptions in the log data of the plurality of TID categories is clustered to obtain a clustering result of the first log data.
In some embodiments, clustering text information corresponding to log descriptions in log data of multiple TID categories may include the following steps:
S1031, performing term frequency-inverse document frequency (TF-IDF) numericalization on the text information corresponding to the log descriptions in the log data of each TID category.
First, according to formula (1), term frequency (TF) statistics are computed on the text information corresponding to the log descriptions in the log data of the plurality of TID categories, giving the frequency with which a given term appears in the log to which it belongs.
$\mathrm{tf}_{i,j} = \dfrac{n_{i,j}}{\sum_k n_{k,j}}$ (1)
where, for a given term $t_i$, $n_{i,j}$ is the number of times $t_i$ occurs in log $d_j$, $\sum_k n_{k,j}$ is the total number of occurrences of all terms in log $d_j$, and $\mathrm{tf}_{i,j}$ is the frequency with which $t_i$ occurs in its log $d_j$.
Then, according to formula (2), inverse document frequency (IDF) statistics are computed on the text information corresponding to the log descriptions in the log data of the plurality of TID categories, to evaluate the importance of a given term across all logs.
$\mathrm{idf}_i = \log \dfrac{|D|}{|\{j : t_i \in d_j\}|}$ (2)
where, for a given term $t_i$, $|\{j : t_i \in d_j\}|$ is the number of logs containing $t_i$, $|D|$ is the total number of logs, and $\mathrm{idf}_i$ is the inverse document frequency of $t_i$.
Finally, the high-dimensional vector of each log in the log data of all TID categories is calculated according to formula (3).
$\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_i$ (3)
According to TF-IDF numeralization, each log generates a high-dimensional vector with a fixed length.
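The TF-IDF numericalization in S1031 can be sketched in plain Python as follows. The sketch assumes that each log description has already been split into a list of terms, that the natural logarithm is used (the text does not state the base), and that no smoothing is applied; the name `tfidf_vectors` is illustrative.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """documents: list of term lists, one per log. Returns (vocabulary, vectors)."""
    vocabulary = sorted({term for doc in documents for term in doc})
    total_logs = len(documents)
    # Number of logs containing each term (each log counted once per term).
    doc_freq = Counter(term for doc in documents for term in set(doc))
    # Inverse document frequency; natural log is an assumption.
    idf = {term: math.log(total_logs / doc_freq[term]) for term in vocabulary}
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        total_terms = len(doc)
        vectors.append([
            (counts[term] / total_terms) * idf[term] if total_terms else 0.0
            for term in vocabulary
        ])
    return vocabulary, vectors
```

Every vector has one component per vocabulary term, which is why each log yields a vector of the same fixed length and why that length grows with the vocabulary.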
S1032, reducing the dimensionality of the high-dimensional vectors according to principal component analysis (PCA).
After the high-dimensional vectors are obtained, their dimensionality is reduced by PCA, which reduces the number of feature variables to be considered and yields low-dimensional vectors.
In some embodiments, reducing the dimensionality of the high-dimensional vectors according to PCA may include the following steps:
Step 1, standardizing the high-dimensional vectors to obtain a standardized data matrix.
As a specific example, the high-dimensional vectors are represented as a data matrix $X = (x_{ij})_{n \times p}$, where $i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, p$, and $x_{ij}$ is the value of the $j$-th index of the $i$-th unit.
Then, the standard deviation of the $j$-th column $X_j$ of the data matrix is calculated according to formula (5).
Finally, the standardized data matrix $Y = (y_{ij})$ is calculated according to formula (6).
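The bodies of the standardization formulas are not reproduced in this text. Under the assumption that the usual column-wise z-score standardization is intended, they would take a form such as the following, where $\bar{x}_j$ and $s_j$ are notation introduced here for the mean and standard deviation of column $j$, and the choice of $n-1$ rather than $n$ in the standard deviation is likewise an assumption:

```latex
\bar{x}_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}, \qquad
s_j = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_{ij}-\bar{x}_j\right)^{2}}, \qquad
y_{ij} = \frac{x_{ij}-\bar{x}_j}{s_j}
```

After this transformation every column of the data matrix has zero mean and unit variance, so the correlation matrix of the original data equals the covariance matrix of the standardized data.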
Step 2, establishing a correlation matrix from the standardized data matrix, and calculating the eigenvalues and eigenvectors of the correlation matrix.
As a specific example, a correlation matrix $R$ is determined from the standardized data matrix $Y$, and the eigenvalues $\lambda_j$ ($j = 1, 2, \ldots, p$) of $R$ are obtained. The eigenvalues are arranged from large to small, giving $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_p$. The corresponding eigenvectors $\alpha_i = (\alpha_{i1}, \alpha_{i2}, \ldots, \alpha_{ip})$, $i = 1, 2, \ldots, p$, are then solved from the characteristic polynomial.
Step 3, calculating the variance contribution rate and the cumulative variance contribution rate from the eigenvalues and eigenvectors of the correlation matrix.
Each eigenvalue of the correlation matrix equals the variance of the corresponding principal component, so the magnitude of the $i$-th eigenvalue reflects the proportion of the information in the original data that the $i$-th principal component contains, that is, the contribution of that principal component.
Step 4, calculating the principal components of the high-dimensional vectors according to formula (7).
$Z = Y\alpha$ (7)
where $Y$ is the standardized data matrix and $\alpha$ is the eigenvector of the correlation matrix.
If the cumulative variance contribution rate $\beta(s)$ up to the $s$-th principal component satisfies $\beta(s) \ge \alpha$ (here $\alpha$ denotes the significance level, not the eigenvector of formula (7)), then $Z_1, Z_2, \ldots, Z_s$ are the principal components of the samples $X_1, X_2, \ldots, X_p$ at significance level $\alpha$, and the principal components $Z_1, Z_2, \ldots, Z_s$ are used in place of the samples $X_1, X_2, \ldots, X_p$. This not only reduces the dimensionality of the input high-dimensional vectors but also eliminates the autocorrelation of the original sample space, yielding the low-dimensional vectors.
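The selection rule in steps 3 and 4 can be sketched as follows: given the eigenvalues of the correlation matrix, compute each variance contribution rate and keep the smallest number of components whose cumulative rate reaches the level $\alpha$. The function name `select_components` and the example level of 0.85 are assumptions for illustration; the text does not state the value of $\alpha$ that was used.

```python
def select_components(eigenvalues, alpha):
    """Return (s, rates): the number of principal components to keep and
    the variance contribution rate of every component, largest first."""
    ordered = sorted(eigenvalues, reverse=True)  # lambda_1 >= ... >= lambda_p
    total = sum(ordered)
    rates = [value / total for value in ordered]
    cumulative = 0.0
    for count, rate in enumerate(rates, start=1):
        cumulative += rate  # cumulative variance contribution rate beta(count)
        if cumulative >= alpha:
            return count, rates
    # Floating-point rounding may keep the sum just below alpha = 1.0.
    return len(ordered), rates


# Hypothetical eigenvalues of a 4 x 4 correlation matrix.
kept, contribution = select_components([2.5, 1.0, 0.4, 0.1], alpha=0.85)
```

In this example the first two components already account for 87.5% of the variance, so `kept` is 2 and the four-dimensional input would be replaced by two principal components.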
S1033, clustering the low-dimensional vectors according to the K-means clustering algorithm and the edit distance algorithm to obtain a clustering result of the first log data.
After the low-dimensional vectors are obtained, they are first preliminarily clustered using the K-means clustering algorithm to obtain a first clustering result.
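A minimal sketch of the K-means preliminary clustering is given below. It uses Lloyd's iteration with Euclidean distance and, to stay deterministic, takes the first `k` points as initial centroids; both choices are simplifying assumptions, since the embodiment does not state the initialization or the distance measure. The name `kmeans` is illustrative.

```python
import math


def kmeans(points, k, max_iterations=100):
    """Cluster points (sequences of floats) into k groups.
    Returns (labels, centroids)."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # Assumption: the first k points serve as initial centroids.
    centroids = [list(point) for point in points[:k]]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(point, centroids[c]))
            for point in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(axis) / len(members) for axis in zip(*members)]
    return labels, centroids


sample = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
sample_labels, sample_centroids = kmeans(sample, k=2)
```

For this toy input the two well-separated groups are recovered even though both initial centroids start in the same group.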
The PCA dimension reduction in S1032 has a certain positive influence on the result of the preliminary clustering.
In some embodiments, K-means preliminary clustering is performed on nearly 100 abnormal log samples, and the resulting first clustering result is evaluated to obtain a clustering result evaluation graph. FIG. 2 is an evaluation graph of a clustering result with PCA dimensionality reduction provided in an embodiment of the present invention, where the vertical axis represents the evaluation coefficient and the horizontal axis represents the selected K value; the higher the evaluation coefficient, the better the clustering effect. As shown in FIG. 2, for the abnormal log samples subjected to PCA dimensionality reduction, the evaluation coefficient is highest for K in the interval 8 to 11, at about 0.997. FIG. 3 is an evaluation graph of a clustering result without PCA dimensionality reduction provided in the embodiment of the present invention. As shown in FIG. 3, for the abnormal log samples without PCA dimensionality reduction, the evaluation coefficient is too low for K in the interval 2 to 5, and is substantially constant and generally below 0.99 for K in the interval 8 to 100. Therefore, performing PCA dimensionality reduction on the log data before clustering can improve the quality of the preliminary clustering to some extent.
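The text does not say which index underlies the evaluation coefficient plotted in FIG. 2 and FIG. 3. As an illustration only, the sketch below computes the silhouette coefficient, one of the three evaluation indices named in the disclosure, with Euclidean distance; the function name and the convention of scoring singleton clusters as 0 are assumptions.

```python
import math


def silhouette_coefficient(points, labels):
    """Mean silhouette value over all points; higher means better separation."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")
    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point's distance to itself is 0, so it does not affect the sum).
        a = sum(math.dist(point, other) for other in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(point, other) for other in members) / len(members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        denominator = max(a, b)
        scores.append((b - a) / denominator if denominator else 0.0)
    return sum(scores) / len(scores)
```

Evaluating this function for each candidate K and plotting the results would give a curve of the kind described for FIG. 2 and FIG. 3, from which a suitable K interval can be read off.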
On the basis of the preliminary clustering, the low-dimensional vectors are further clustered using the edit distance. The edit distance between two strings is the minimum number of edit operations required to convert one string into the other, so it represents the degree of similarity between different logs well: similar logs have a short edit distance, and dissimilar logs have a long one.
Specifically, the edit distance is calculated according to formula (8).
where $\mathrm{lev}_{a,b}(|i|, |j|)$ represents the edit distance of the two character strings $a$ and $b$, and $i$ and $j$ correspond to the lengths of $a$ and $b$, respectively.
In some embodiments, a distance threshold is preset as the criterion for the current clustering. If the minimum edit distance between a log A to be clustered and a log B in an existing TID category is smaller than the distance threshold, log A is placed in the sub-category with the minimum edit distance under the TID category of log B; otherwise, log A is placed in a new TID category. Repeating this process completes the whole clustering.
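The threshold-based assignment can be sketched as follows. The edit distance is computed with the standard Levenshtein dynamic program. Dividing it by the length of the longer string is an assumption made here so that fractional thresholds are meaningful; the text does not say how the distance is scaled. The names `edit_distance`, `normalized_distance`, and `cluster_by_edit_distance` are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def normalized_distance(a, b):
    """Edit distance scaled to [0, 1]; the scaling is an assumption."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def cluster_by_edit_distance(descriptions, threshold):
    """Put each description into the closest existing cluster if its minimum
    distance to that cluster is below the threshold, else start a new one."""
    clusters = []
    for text in descriptions:
        best_index, best_distance = None, None
        for index, members in enumerate(clusters):
            distance = min(normalized_distance(text, m) for m in members)
            if best_distance is None or distance < best_distance:
                best_index, best_distance = index, distance
        if best_distance is not None and best_distance < threshold:
            clusters[best_index].append(text)
        else:
            clusters.append([text])
    return clusters
```

Each new log is compared against all logs already clustered, so the cost grows quickly with the number of logs. A larger threshold merges more logs into existing clusters and therefore produces fewer clusters.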
In some embodiments, 5000 abnormal log samples containing TIDs are input, and the changes in threshold, number of clusters, and running time during edit distance clustering are recorded and analyzed. Table 1 shows these changes when clustering with the edit distance algorithm according to an embodiment of the present invention. As shown in Table 1, for the 5000 abnormal log samples input this time, a larger threshold generally gives fewer clusters and a shorter program running time.
TABLE 1
Threshold value | Number of clusters | Run time (seconds) | Run time (minutes) |
0.05 | 17 | 3248.016 | 54.1336 |
0.1 | 12 | 2120.227 | 35.33711667 |
0.15 | 11 | 2121.372 | 35.3562 |
0.2 | 10 | 1658.652 | 27.6442 |
0.25 | 9 | 1632.79 | 27.21316667 |
0.3 | 8 | 1798.63 | 29.97716667 |
In some embodiments, extensive testing yields the relationships between the distance threshold and, respectively, the running time and the number of clusters when clustering according to the edit distance algorithm. Fig. 4 is a line graph of the distance threshold and the running time. As shown in Fig. 4, the horizontal axis represents the distance threshold and the vertical axis represents the running time (unit: seconds); the larger the distance threshold, the shorter the running time during clustering, and the running time gradually levels off once the distance threshold exceeds 0.2. Fig. 5 is a line graph of the distance threshold and the number of clusters provided in the embodiment of the present invention. As shown in Fig. 5, the horizontal axis represents the distance threshold and the vertical axis represents the number of clusters; the larger the distance threshold, the smaller the number of clusters obtained by clustering.
According to the log clustering method, request calls can be tracked through the TID, so that when an application program fails, the source of the failure can be found quickly and the performance bottleneck on each link can be monitored. The logs are classified based on the TID generated in the logs and are then clustered and analyzed using an algorithm that fuses multiple clustering methods. This effectively compensates for the situation in which a single log category corresponds to log content of multiple structures, thereby effectively improving both the accuracy of the log clustering result and the clustering speed.
Fig. 6 is a schematic flowchart of another log clustering method according to an embodiment of the present invention, and as shown in fig. 6, the log clustering method may include S101 to S104.
S101, first log data are acquired.
S102, classifying the first log data based on the link tracking code TID to obtain log data of a plurality of TID categories.
S103, clustering the text information corresponding to the log description in the log data of the TID categories according to a K-means clustering algorithm and an edit distance algorithm to obtain a clustering result of the first log data.
S104, evaluating the clustering result according to clustering algorithm evaluation indexes to obtain an evaluation result, wherein the clustering algorithm evaluation indexes comprise the silhouette coefficient, the Calinski-Harabasz index and the Davies-Bouldin index.
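Step S102 above, classifying log data by TID, amounts to grouping records on one key. A minimal sketch follows; the record layout (dictionaries with `tid` and `description` keys), the function name `classify_by_tid`, and the sample records are assumptions made for illustration and are not specified in the text.

```python
from collections import defaultdict
from typing import Dict, List


def classify_by_tid(records: List[dict]) -> Dict[str, List[str]]:
    """Group log descriptions by their link tracking code (TID)."""
    categories: Dict[str, List[str]] = defaultdict(list)
    for record in records:
        # Hypothetical field names: "tid" and "description".
        categories[record["tid"]].append(record["description"])
    return dict(categories)


sample_records = [
    {"tid": "a1", "description": "connect timeout to db01"},
    {"tid": "b2", "description": "null pointer in OrderService"},
    {"tid": "a1", "description": "connect timeout to db02"},
]
tid_categories = classify_by_tid(sample_records)
```

Each resulting TID category can then be passed separately to the clustering of step S103.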
The silhouette coefficient is a way of evaluating the clustering effect. It combines two factors, cohesion and separation, and can be used to evaluate, on the basis of the same original data, the influence of different algorithms, or of different operating modes of an algorithm, on the clustering result.
The silhouette coefficient of each vector in a cluster is calculated according to formula (9).
$$s(i)=\frac{b(i)-a(i)}{\max\{a(i),\,b(i)\}} \tag{9}$$
Where a(i) represents the average distance from vector i to the other points in the cluster to which it belongs, and b(i) represents the minimum, over all other clusters, of the average distance from vector i to all points in that cluster, that is, the average distance to the points of the nearest cluster. The value of the silhouette coefficient lies in the range [-1, 1]; the closer it is to 1, the better the cohesion and separation.
The silhouette coefficients of all points are averaged to obtain the overall silhouette coefficient of the clustering result; the higher the silhouette coefficient, the better the clustering effect.
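The per-vector silhouette coefficient and its average can be sketched as follows. This is an illustrative pure-Python version using Euclidean distance; the function name `silhouette_score`, the choice of Euclidean distance, the sample points, and the convention of scoring singleton clusters as 0 are assumptions made here.

```python
import math
from typing import Dict, Hashable, List, Sequence


def silhouette_score(points: List[Sequence[float]],
                     labels: List[Hashable]) -> float:
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters: Dict[Hashable, List[Sequence[float]]] = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")
    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a(i): mean distance to the other members of the same cluster
        # (the point's distance to itself adds 0 to the sum).
        a = sum(math.dist(point, other) for other in own) / (len(own) - 1)
        # b(i): smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(point, other) for other in members) / len(members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        denominator = max(a, b)
        scores.append((b - a) / denominator if denominator > 0 else 0.0)
    return sum(scores) / len(scores)


square = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good_score = silhouette_score(square, [0, 0, 1, 1])  # compact, well separated
bad_score = silhouette_score(square, [0, 1, 0, 1])   # clusters cut across the gap
```

A grouping that keeps nearby points together scores close to 1, while a grouping that splits them scores below 0, which is how the coefficient distinguishes candidate cluster numbers.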
In some embodiments, 5000 abnormal log samples containing TIDs are input for edit distance clustering, the obtained clustering results are evaluated using the silhouette coefficient, and the relationship between the silhouette coefficient and the number of clusters is analyzed. Fig. 7 is a silhouette coefficient evaluation graph based on TID clustering according to the embodiment of the present invention. As shown in Fig. 7, the vertical axis represents the silhouette coefficient and the horizontal axis represents the number of clusters in the clustering result. In TID-based clustering, the silhouette coefficient is highest, and the clustering effect best, when the number of clusters is between 2 and 5; the clustering effect is second best when the number of clusters is between 8 and 11.
The Calinski-Harabasz index (CHI) is calculated according to formula (10); the higher the CHI value, the better the clustering effect.
$$s(k)=\frac{\operatorname{tr}(B_k)}{\operatorname{tr}(W_k)}\cdot\frac{m-k}{k-1} \tag{10}$$
Where m is the number of samples in the training set, k is the number of classes, $B_k$ is the between-class covariance matrix, $W_k$ is the covariance matrix of the data within the classes, and tr denotes the trace of a matrix.
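The CHI computation can be sketched without building the covariance matrices explicitly, because the trace of the between-class matrix equals the size-weighted sum of squared distances from each class center to the overall center, and the trace of the within-class matrix equals the sum of squared distances from each point to its class center. The function and helper names and the sample data below are illustrative assumptions.

```python
from typing import Dict, Hashable, List, Sequence


def _centroid(points: List[Sequence[float]]) -> List[float]:
    return [sum(axis) / len(points) for axis in zip(*points)]


def _squared_distance(p: Sequence[float], q: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(p, q))


def calinski_harabasz(points: List[Sequence[float]],
                      labels: List[Hashable]) -> float:
    clusters: Dict[Hashable, List[Sequence[float]]] = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    m, k = len(points), len(clusters)
    if not 1 < k < m:
        raise ValueError("need more than one cluster and fewer clusters than points")
    overall = _centroid(points)
    between = 0.0  # tr(B_k)
    within = 0.0   # tr(W_k)
    for members in clusters.values():
        center = _centroid(members)
        between += len(members) * _squared_distance(center, overall)
        within += sum(_squared_distance(p, center) for p in members)
    # Note: within is 0 when every point coincides with its class center,
    # in which case the index is undefined and this division raises an error.
    return (between / within) * ((m - k) / (k - 1))


square = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
chi_value = calinski_harabasz(square, [0, 0, 1, 1])
```

For the four sample points, tr(B_k) is 100 and tr(W_k) is 1, so with m = 4 and k = 2 the index evaluates to 200.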
In some embodiments, 5000 abnormal log samples containing TIDs are input for K-means clustering and edit distance clustering, the obtained clustering results are evaluated using the CHI index, and the relationship between the CHI index and the number of clusters is analyzed. Fig. 8 is a CHI index evaluation graph based on TID clustering according to an embodiment of the present invention. As shown in Fig. 8, the vertical axis represents the CHI index and the horizontal axis represents the number of clusters in the clustering result. In TID-based clustering, the CHI index is highest, and the clustering effect best, when the number of clusters is between 2 and 3; the clustering effect is second best when the number of clusters is between 5 and 9.
The Davies-Bouldin index (DBI) is obtained by taking, for each class, the maximum over all other classes of the ratio of the sum of the two classes' average intra-class distances to the distance between the two cluster centroids, and then averaging these maxima over all classes. The smaller the DBI, the better the clustering effect.
Calculating the DBI of the clustering result can comprise the following steps:
Step 1, calculating the degree of dispersion within each category according to formula (11).
$$S_i=\left(\frac{1}{T_i}\sum_{j=1}^{T_i}\lVert X_j-A_i\rVert^{q}\right)^{1/q} \tag{11}$$
Where $X_j$ denotes the j-th data point in the i-th class, $A_i$ denotes the center of the i-th class, and $T_i$ denotes the number of data points in the i-th class. When q is 1, $S_i$ is the mean of the distances from each point to the center; when q is 2, $S_i$ is the standard deviation of the distances from each point to the center. $S_i$ thus measures the degree of dispersion of the data points in the i-th class.
Step 2, calculating the distance between the categories according to formula (12).
$$M_{ij}=\left(\sum_{k=1}^{N}\lvert a_{ki}-a_{kj}\rvert^{p}\right)^{1/p} \tag{12}$$
Where $a_{ki}$ denotes the value of the k-th attribute of the center point of the i-th class, $a_{kj}$ denotes the value of the k-th attribute of the center point of the j-th class, N denotes the number of attributes of a center point, and $M_{ij}$ denotes the distance between the center of the i-th class and the center of the j-th class. When p is 1, $M_{ij}$ is the city-block (Manhattan) distance between the two centers; when p is 2, it is the Euclidean distance.
Step 3, calculating the similarity between the categories according to formula (13).
$$R_{ij}=\frac{S_i+S_j}{M_{ij}} \tag{13}$$
Where $S_i$ denotes the degree of dispersion of the data points in the i-th class, $S_j$ denotes the degree of dispersion of the data points in the j-th class, $M_{ij}$ denotes the distance between the centers of the i-th class and the j-th class, and $R_{ij}$ denotes the similarity between the i-th class and the j-th class.
Step 4, selecting from $R_{ij}$ the maximum value $D_i=\max_{j\neq i}R_{ij}$, that is, the largest of the similarities between the i-th class and the other classes, and calculating the mean of the maximum similarities of all classes according to formula (14).
$$\mathrm{DBI}=\frac{1}{N}\sum_{i=1}^{N}D_i \tag{14}$$
Where N denotes the number of categories, and the mean of the maximum similarities is the DBI of the clustering result. The number of categories affects the size of the DBI.
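The four steps above can be sketched in pure Python as follows. This is an illustration rather than the embodiment's code: the function name `davies_bouldin`, the defaults q = 1 and p = 2, the use of Euclidean distance from each point to its class center, and the sample data are assumptions made here.

```python
import math
from typing import Dict, Hashable, List, Sequence


def davies_bouldin(points: List[Sequence[float]], labels: List[Hashable],
                   q: float = 1.0, p: float = 2.0) -> float:
    clusters: Dict[Hashable, List[Sequence[float]]] = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")
    centers = {
        label: [sum(axis) / len(members) for axis in zip(*members)]
        for label, members in clusters.items()
    }
    # Step 1: dispersion S_i of each class around its center.
    scatter = {
        label: (sum(math.dist(x, centers[label]) ** q for x in members)
                / len(members)) ** (1.0 / q)
        for label, members in clusters.items()
    }

    # Step 2: distance M_ij between two class centers.
    def separation(i: Hashable, j: Hashable) -> float:
        return sum(abs(a - b) ** p
                   for a, b in zip(centers[i], centers[j])) ** (1.0 / p)

    # Steps 3 and 4: R_ij = (S_i + S_j) / M_ij, keep the maximum per class.
    # Note: coincident class centers give M_ij = 0 and raise ZeroDivisionError.
    worst = [
        max((scatter[i] + scatter[j]) / separation(i, j)
            for j in clusters if j != i)
        for i in clusters
    ]
    return sum(worst) / len(worst)


square = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
dbi_value = davies_bouldin(square, [0, 0, 1, 1])
```

For the sample points each class has dispersion 0.5 and the centers are 10 apart, so the index is (0.5 + 0.5) / 10 = 0.1; lower values indicate tighter, better separated classes.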
In some embodiments, 5000 abnormal log samples containing TIDs are input for K-means clustering and edit distance clustering, the obtained clustering results are evaluated using the DBI, and the relationship between the DBI and the number of clusters is analyzed. Fig. 9 is a DBI evaluation graph based on TID clustering according to an embodiment of the present invention. As shown in Fig. 9, the vertical axis represents the DBI and the horizontal axis represents the number of clusters in the clustering result. In TID-based clustering, the DBI indicates the best clustering effect when the number of clusters is between 2 and 3, and the second-best clustering effect when the number of clusters is between 8 and 11.
In some embodiments, after the clustering result is evaluated according to the clustering algorithm evaluation indexes, the clustering result, the evaluation result, the TF-IDF parameter conversion table, and the edit distance clustering parameter conversion table are output. The TF-IDF parameter conversion table and the edit distance clustering parameter conversion table can be stored in a database in document form, which makes them convenient for development and maintenance personnel to check.
In some embodiments, 5000 abnormal log samples are input for conventional hierarchical clustering (without classification based on TID), and the obtained clustering results are evaluated using the CHI index. Fig. 10 is a CHI index evaluation graph based on a conventional hierarchical clustering algorithm provided by an embodiment of the present invention. As shown in Fig. 10, the vertical axis represents the CHI index and the horizontal axis represents the number of clusters in the clustering result. The clustering results with a CHI index greater than 0.7 are mostly distributed in the region with a high number of clusters (between 50 and 100), and the CHI index tends to keep increasing as the number of clusters increases. This clearly runs counter to the aim of clustering, which is to group the data into a limited number of classes, so the clustering effect is poor.
In some embodiments, 5000 abnormal log samples containing TIDs are input for K-means clustering and edit distance clustering, and the obtained clustering results are evaluated by using the CHI index. Fig. 11 is a CHI index evaluation index graph based on TID clustering according to an embodiment of the present invention, where as shown in fig. 11, the vertical axis represents a CHI index, and the horizontal axis represents the cluster number of the clustering result, and in TID-based clustering, the clustering result with a CHI index greater than 0.85 is substantially distributed in an area with a lower cluster number (between 14 and 25), and as the cluster number increases, the CHI index tends to decrease continuously, so that the clustering effect is better.
According to the log clustering method provided by the embodiment of the invention, the clustering result of the logs is evaluated using the silhouette coefficient, the CHI index and the DBI, so that the relationship between each evaluation index and the number of clusters in the clustering result can be analyzed. Based on this relationship, a suitable log clustering algorithm is selected and the number of clusters in the clustering result is adjusted, thereby effectively improving the log clustering effect.
Fig. 12 is a schematic structural diagram of a log clustering apparatus according to an embodiment of the present invention, and as shown in fig. 12, the log clustering apparatus 200 may include: an obtaining module 210, a classifying module 220, and a clustering module 230.
The obtaining module 210 is configured to obtain first log data, where the first log data includes a link tracking code TID and a log description; the classification module 220 is configured to classify the first log data based on the TID to obtain log data of multiple TID categories; and the clustering module 230 is configured to perform clustering processing on text information corresponding to log descriptions in the log data of the multiple TID categories according to a K-means clustering algorithm and an edit distance algorithm to obtain a clustering result of the first log data.
In some embodiments, the obtaining module 210 is specifically configured to: acquiring second log data; and removing the semi-structured data or the log data with missing content in the second log data to obtain the first log data.
In some embodiments, the apparatus further includes a determining module 240 configured to, after the first log data is obtained, clean the first log data by using a regular expression and determine the TID and the log description in the first log data according to the cleaned first log data.
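The regular-expression cleaning performed by the determining module can be sketched as below. The text does not give the log line format, so the `[TID:...]` marker, the pattern `LOG_PATTERN`, the function name `extract_tid_and_description`, and the sample line are all hypothetical and serve only to illustrate the idea.

```python
import re
from typing import Optional, Tuple

# Hypothetical format: "... [TID:<code>] <log description>".
LOG_PATTERN = re.compile(r"\[TID:(?P<tid>[^\]]+)\]\s*(?P<description>.*)")


def extract_tid_and_description(line: str) -> Optional[Tuple[str, str]]:
    """Clean one raw log line and return (tid, description), or None."""
    # Cleaning: collapse runs of whitespace and trim both ends.
    cleaned = re.sub(r"\s+", " ", line).strip()
    match = LOG_PATTERN.search(cleaned)
    if match is None:
        return None  # no TID found, so the line cannot be classified by TID
    return match.group("tid"), match.group("description").strip()


raw_line = "2020-03-25 10:15:02  [TID:abc123]   ERROR  connection refused\n"
parsed = extract_tid_and_description(raw_line)
```

Returning `None` for lines without a TID lets the caller decide whether to drop such lines or handle them separately.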
In some embodiments, the clustering module 230 is specifically configured to: vectorize the TIDs and the log descriptions in the log data of the multiple TID categories respectively to obtain multiple feature dimensions; select the log descriptions in the first log data corresponding to the multiple feature dimensions, and convert the text information corresponding to the log descriptions in the log data of the multiple TID categories into numerical values using term frequency-inverse document frequency (TF-IDF) to obtain multiple high-dimensional vectors; perform dimension reduction on the high-dimensional vectors according to principal component analysis (PCA) to obtain low-dimensional vectors; and cluster the low-dimensional vectors according to a K-means clustering algorithm and an edit distance algorithm to obtain the clustering result of the first log data.
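The TF-IDF conversion in the clustering module can be illustrated with the following pure-Python sketch. It uses one common variant (term frequency as count divided by document length, inverse document frequency as the natural logarithm of N divided by the document frequency, whitespace tokenization); the exact variant and word segmentation used in the embodiment may differ, and the function name `tfidf_vectors` and the sample texts are assumptions.

```python
import math
from collections import Counter
from typing import List, Tuple


def tfidf_vectors(documents: List[str]) -> Tuple[List[str], List[List[float]]]:
    """Return (vocabulary, one TF-IDF vector per document)."""
    tokenized = [document.lower().split() for document in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    total_documents = len(tokenized)
    # Document frequency: in how many documents each term appears.
    document_frequency = Counter(
        token for tokens in tokenized for token in set(tokens)
    )
    idf = {
        token: math.log(total_documents / document_frequency[token])
        for token in vocabulary
    }
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        length = len(tokens) or 1  # avoid dividing by zero for empty text
        vectors.append([counts[token] / length * idf[token]
                        for token in vocabulary])
    return vocabulary, vectors


descriptions = ["connect timeout", "connect refused"]
vocabulary, vectors = tfidf_vectors(descriptions)
```

With this variant, a term that occurs in every description (here "connect") receives weight 0, so only the distinguishing terms contribute to the vectors passed on to dimension reduction.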
In some embodiments, the apparatus further includes an evaluation module 250 configured to evaluate the clustering result according to clustering algorithm evaluation indexes to obtain an evaluation result, where the clustering algorithm evaluation indexes include the silhouette coefficient, the Calinski-Harabasz index, and the Davies-Bouldin index.
In some embodiments, the first log data further comprises: application system name, project name, host address, and log content.
The log clustering device provided by the embodiment of the invention can mine, from massive logs, effective information that helps operation and maintenance personnel with detection, so that the error produced by a vectorization model with a single structure is compensated, the accuracy of the clustering result is improved, and clustering time is saved.
Fig. 13 is a schematic diagram of a hardware structure of a log clustering device according to an embodiment of the present invention.
As shown in Fig. 13, the log clustering device 300 in the present embodiment includes an input device 301, an input interface 302, a central processor 303, a memory 304, an output interface 305, and an output device 306. The input interface 302, the central processor 303, the memory 304, and the output interface 305 are connected to each other through a bus 310, and the input device 301 and the output device 306 are connected to the bus 310 through the input interface 302 and the output interface 305, respectively, and are thereby connected to the other components of the log clustering device 300.
Specifically, the input device 301 receives input information from the outside and transmits the input information to the central processor 303 through the input interface 302; the central processor 303 processes the input information based on computer-executable instructions stored in the memory 304 to generate output information, stores the output information temporarily or permanently in the memory 304, and then transmits the output information to the output device 306 through the output interface 305; the output device 306 outputs the output information to the outside of the log clustering device 300 for use by the user.
In one embodiment, the clustering device 300 of the log shown in fig. 13 includes: a memory 304 for storing programs; a processor 303 for executing the program stored in the memory to execute the method of the embodiment shown in fig. 1 or fig. 6 provided by the embodiment of the present invention.
An embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium has computer program instructions stored thereon; the computer program instructions, when executed by a processor, implement the method of the embodiment of fig. 1 or fig. 6 provided by embodiments of the present invention.
It is to be understood that the invention is not limited to the specific arrangements and instrumentality described above and shown in the drawings. A detailed description of known methods is omitted herein for the sake of brevity. In the above embodiments, several specific steps are described and shown as examples. However, the method processes of the present invention are not limited to the specific steps described and illustrated, and those skilled in the art can make various changes, modifications and additions or change the order between the steps after comprehending the spirit of the present invention.
The functional blocks shown in the above-described structural block diagrams may be implemented as hardware, software, firmware, or a combination thereof. When implemented in hardware, they may be, for example, an electronic circuit, an Application Specific Integrated Circuit (ASIC), suitable firmware, a plug-in, a function card, or the like. When implemented in software, the elements of the invention are the programs or code segments used to perform the required tasks. The programs or code segments may be stored in a machine-readable medium or transmitted by a data signal carried in a carrier wave over a transmission medium or a communication link. A "machine-readable medium" may include any medium that can store or transfer information. Examples of machine-readable media include electronic circuits, semiconductor memory devices, Read-Only Memories (ROMs), flash memories, erasable ROMs (EROMs), floppy disks, CD-ROMs, optical disks, hard disks, fiber optic media, Radio Frequency (RF) links, and so forth. The code segments may be downloaded via computer networks such as the Internet or an intranet.
It should also be noted that the exemplary embodiments mentioned in this patent describe some methods or systems based on a series of steps or devices. However, the present invention is not limited to the order of the above-described steps, that is, the steps may be performed in the order mentioned in the embodiments, may be performed in an order different from the order in the embodiments, or may be performed simultaneously.
As described above, only the specific embodiments of the present invention are provided, and it can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the system, the module and the unit described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again. It should be understood that the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the present invention, and these modifications or substitutions should be covered within the scope of the present invention.
Claims (11)
1. A method for clustering logs, the method comprising:
acquiring first log data, wherein the first log data comprises a link tracking code TID and a log description;
classifying the first log data based on the TID to obtain log data of a plurality of TID categories;
and clustering text information corresponding to the log description in the log data of the TID categories according to a K-means clustering algorithm and an edit distance algorithm to obtain a clustering result of the first log data.
2. The method of claim 1, wherein obtaining first log data comprises:
acquiring second log data;
and removing the semi-structured data or the log data with missing content in the second log data to obtain the first log data.
3. The method of claim 1 or 2, wherein after said obtaining the first log data, the method further comprises:
cleaning the first log data by using a regular expression;
and determining the TID and the log description in the first log data according to the cleaned first log data.
4. The method of claim 1, wherein the clustering the text information corresponding to the log description in the log data of the plurality of TID categories according to a K-means clustering algorithm and an edit distance algorithm to obtain a clustering result of the first log data comprises:
vectorizing the TIDs and the log description in the log data of the TID categories respectively to obtain a plurality of characteristic dimensions;
selecting log descriptions in the first log data corresponding to the characteristic dimensions, and converting text information corresponding to the log descriptions in the log data of the TID categories into numerical values using term frequency-inverse document frequency (TF-IDF) to obtain a plurality of high-dimensional vectors;
performing dimensionality reduction on the high-dimensionality vector according to Principal Component Analysis (PCA) to obtain a low-dimensionality vector;
and clustering the low-dimensional vectors according to a K-means clustering algorithm and an edit distance algorithm to obtain a clustering result of the first log data.
5. The method of claim 1 or 4, further comprising:
and evaluating the clustering result according to clustering algorithm evaluation indexes to obtain an evaluation result, wherein the clustering algorithm evaluation indexes comprise the silhouette coefficient, the Calinski-Harabasz index and the Davies-Bouldin index.
6. The method of claim 1, wherein the first log data further comprises:
application system name, project name, host address, and log content.
7. An apparatus for clustering logs, the apparatus comprising:
the acquisition module is used for acquiring first log data, and the first log data comprises a link tracking code TID and a log description;
the classification module is used for classifying the first log data based on the TID to obtain log data of a plurality of TID categories;
and the clustering module is used for clustering the text information corresponding to the log description in the log data of the TID categories according to a K-means clustering algorithm and an edit distance algorithm to obtain a clustering result of the first log data.
8. The apparatus of claim 7, wherein the clustering module is specifically configured to:
vectorizing the TIDs and the log description in the log data of the TID categories respectively to obtain a plurality of characteristic dimensions;
selecting log descriptions in the first log data corresponding to the characteristic dimensions, and converting text information corresponding to the log descriptions in the log data of the TID categories into numerical values using term frequency-inverse document frequency (TF-IDF) to obtain a plurality of high-dimensional vectors;
performing dimensionality reduction on the high-dimensionality vector according to Principal Component Analysis (PCA) to obtain a low-dimensionality vector;
and clustering the low-dimensional vectors according to a K-means clustering algorithm and an edit distance algorithm to obtain a clustering result of the first log data.
9. The apparatus of claim 7, further comprising:
and the evaluation module is used for evaluating the clustering result according to clustering algorithm evaluation indexes to obtain an evaluation result, wherein the clustering algorithm evaluation indexes comprise the silhouette coefficient, the Calinski-Harabasz index and the Davies-Bouldin index.
10. An apparatus for clustering logs, the apparatus comprising: a processor and a memory storing computer program instructions;
the processor, when executing the computer instructions, implements a method of clustering logs according to any one of claims 1 to 6.
11. A computer-readable storage medium, having stored thereon computer program instructions, which, when executed by a processor, implement a method of clustering logs according to any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010219766.4A CN113449098B (en) | 2020-03-25 | 2020-03-25 | Log clustering method, device, equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113449098A true CN113449098A (en) | 2021-09-28 |
CN113449098B CN113449098B (en) | 2024-08-13 |
Family
ID=77806842
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010219766.4A Active CN113449098B (en) | 2020-03-25 | 2020-03-25 | Log clustering method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113449098B (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114510518A (en) * | 2022-04-15 | 2022-05-17 | 北京快立方科技有限公司 | Self-adaptive aggregation method and system for massive structured data and electronic equipment |
CN114741673A (en) * | 2022-06-13 | 2022-07-12 | 深圳竹云科技股份有限公司 | Behavior risk detection method, clustering model construction method and device |
CN114816964A (en) * | 2022-06-29 | 2022-07-29 | 深圳竹云科技股份有限公司 | Risk model construction method, risk detection device and computer equipment |
CN114826876A (en) * | 2022-01-11 | 2022-07-29 | 杭州金硕信息技术有限公司 | Cloud service fault detection system and method based on log analysis and online simulation |
CN115617953A (en) * | 2022-11-15 | 2023-01-17 | 成都九洲电子信息系统股份有限公司 | Intelligent diagnosis method and system for network service link fault |
CN117390297A (en) * | 2023-12-13 | 2024-01-12 | 天津和光同德科技股份有限公司 | Large-scale talent intelligence library information optimization matching method |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101888309A (en) * | 2010-06-30 | 2010-11-17 | 中国科学院计算技术研究所 | Online log analysis method |
JP2014120001A (en) * | 2012-12-17 | 2014-06-30 | Kddi Corp | Monitoring device, monitoring method of monitoring object host, monitoring program, and recording medium |
CN104239436A (en) * | 2014-08-27 | 2014-12-24 | 南京邮电大学 | Network hot event detection method based on text classification and clustering analysis |
CN106100885A (en) * | 2016-06-23 | 2016-11-09 | 浪潮电子信息产业股份有限公司 | Network security alarm system and design scheme |
CN107368516A (en) * | 2017-05-25 | 2017-11-21 | 全球能源互联网研究院 | A kind of log audit method and device based on hierarchical clustering |
CN108320166A (en) * | 2018-02-06 | 2018-07-24 | 上海致趣广告有限公司 | A kind of business opportunity progress method for tracing and system |
CN109062763A (en) * | 2018-07-31 | 2018-12-21 | 云南大学 | One kind dynamic realtime from SVN log event stream excavates the movable method of software process |
CN109284371A (en) * | 2018-09-03 | 2019-01-29 | 平安证券股份有限公司 | Anti- fraud method, electronic device and computer readable storage medium |
CN110019070A (en) * | 2017-11-10 | 2019-07-16 | 北京安码科技有限公司 | A kind of security log clustering method based on Hadoop and system of calling to account |
CN110288004A (en) * | 2019-05-30 | 2019-09-27 | 武汉大学 | A kind of diagnosis method for system fault and device excavated based on log semanteme |
Non-Patent Citations (3)
Title |
---|
JINGWEN ZHOU 等: "A data set for user request trace-oriented montoring and its applications", 《IEEE TRANSACTIONS ON SERVICES COMPUTING》, 31 August 2018 (2018-08-31), pages 699 - 712 * |
唐文: "基于消息中间件的调用链跟踪设计与实现", 《电脑知识与技术》, vol. 15, no. 30, pages 54 - 55 * |
郑荣: "统一日志系统中的日志获取模块与日志检索模块的设计与实现", 《中国优秀硕士学位论文 全文数据库·信息科技辑》, no. 08, 15 August 2018 (2018-08-15), pages 1 - 72 * |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114826876A (en) * | 2022-01-11 | 2022-07-29 | 杭州金硕信息技术有限公司 | Cloud service fault detection system and method based on log analysis and online simulation |
CN114826876B (en) * | 2022-01-11 | 2024-05-03 | 杭州金硕信息技术有限公司 | Cloud service fault detection system and method based on log analysis and online simulation |
CN114510518A (en) * | 2022-04-15 | 2022-05-17 | 北京快立方科技有限公司 | Self-adaptive aggregation method and system for massive structured data and electronic equipment |
CN114510518B (en) * | 2022-04-15 | 2022-07-12 | 北京快立方科技有限公司 | Self-adaptive aggregation method and system for massive structured data and electronic equipment |
CN114741673A (en) * | 2022-06-13 | 2022-07-12 | 深圳竹云科技股份有限公司 | Behavior risk detection method, clustering model construction method and device |
CN114741673B (en) * | 2022-06-13 | 2022-08-26 | 深圳竹云科技股份有限公司 | Behavior risk detection method, clustering model construction method and device |
CN114816964A (en) * | 2022-06-29 | 2022-07-29 | 深圳竹云科技股份有限公司 | Risk model construction method, risk detection device and computer equipment |
CN115617953A (en) * | 2022-11-15 | 2023-01-17 | 成都九洲电子信息系统股份有限公司 | Intelligent diagnosis method and system for network service link fault |
CN117390297A (en) * | 2023-12-13 | 2024-01-12 | 天津和光同德科技股份有限公司 | Large-scale talent intelligence library information optimization matching method |
CN117390297B (en) * | 2023-12-13 | 2024-02-27 | 天津和光同德科技股份有限公司 | Large-scale talent intelligence library information optimization matching method |
Also Published As
Publication number | Publication date |
---|---|
CN113449098B (en) | 2024-08-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113449098B (en) | | Log clustering method, device, equipment and storage medium |
US8495429B2 (en) | | Log message anomaly detection |
CN108182523A (en) | | The treating method and apparatus of fault data, computer readable storage medium |
CN111930547A (en) | | Fault positioning method and device and storage medium |
CN101986296B (en) | | Noise data cleaning method based on semantic ontology |
US20220058171A1 (en) | | Leveraging a collection of training tables to accurately predict errors within a variety of tables |
CN112685324B (en) | | Method and system for generating test scheme |
US20170168911A1 (en) | | Computer-implemented method, information processing device, and recording medium |
CN110674442B (en) | | Page monitoring method, device, equipment and computer readable storage medium |
WO2023208136A1 (en) | | Kpi anomaly detection method and apparatus, device and medium |
CN116132263B (en) | | Alarm solution recommending method and device, electronic equipment and storage medium |
CN113407721A (en) | | Method, device and computer storage medium for detecting log sequence abnormity |
CN115098679A (en) | | Method, device, equipment and medium for detecting abnormality of text classification labeling sample |
CN111368534A (en) | | Application log noise reduction method and device |
CN116841779A (en) | | Abnormality log detection method, abnormality log detection device, electronic device and readable storage medium |
CN113723555A (en) | | Abnormal data detection method and device, storage medium and terminal |
CN114139636B (en) | | Abnormal operation processing method and device |
CN115953123A (en) | | Method, device and equipment for generating robot automation flow and storage medium |
CN114863574A (en) | | Handwritten signature recognition method, device, equipment, medium and program product |
CN115146062A (en) | | Intelligent event analysis method and system fusing expert recommendation and text clustering |
CN112905370A (en) | | Topological graph generation method, anomaly detection method, device, equipment and storage medium |
CN114416573A (en) | | Defect analysis method, device, equipment and medium for application program |
CN113723542A (en) | | Log clustering processing method and system |
Singh et al. | | Detection of file level clone for high level cloning |
CN110348005B (en) | | Distribution network equipment state data processing method and device, computer equipment and medium |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |