CN113535955B

CN113535955B - Method and device for quickly classifying logs

Info

Publication number: CN113535955B
Application number: CN202110804922.8A
Authority: CN
Inventors: 屠彧; 李家炎; 许广洋; 徐晨灿
Original assignee: Industrial and Commercial Bank of China Ltd ICBC
Current assignee: Industrial and Commercial Bank of China Ltd ICBC
Priority date: 2021-07-16
Filing date: 2021-07-16
Publication date: 2022-10-28
Anticipated expiration: 2041-07-16
Also published as: CN113535955A

Abstract

The document relates to the technical field of computers, is applicable to the fields of finance, banks and the like, and particularly relates to a method and a device for rapidly classifying logs. The method comprises the following steps: performing text vectorization processing on a log library to obtain a log text vector set, clustering the log text vector set to obtain a log clustering result, analyzing the log clustering result to obtain a log template, matching input logs according to the log template to obtain a log matching template list, and obtaining a log association template according to the log matching template list, wherein the log association template is used for finishing log classification. By the method and the device, the logs are classified quickly, and the efficiency of log classification is improved.

Description

Log rapid classification method and device

Technical Field

The invention relates to the technical field of computers, can be applied to the field of finance, and particularly relates to a method and a device for rapidly classifying logs.

Background

As technology services continue to improve, servers report large numbers of logs when a failure occurs so that operation and maintenance personnel can analyze the cause of the failure. The sheer volume of logs, however, creates the problem of classifying them quickly. In addition, because a single failure may generate many logs, it is difficult for operation and maintenance personnel to locate the cause of a failure quickly within a large number of logs.

At present, manually classifying logs based on experience is inefficient and labor-intensive. Classifying logs with a machine-learning clustering algorithm is a common alternative, but a conventional clustering algorithm can classify newly added logs only by clustering again; it cannot classify logs quickly by matching them against templates, and the computational cost of log clustering is very high.

At present, a method capable of rapidly classifying logs is needed, so that the problems of low efficiency and large calculation amount of log classification in the prior art are solved.

Disclosure of Invention

In order to solve the problems of low log classification efficiency and large calculation amount in the prior art, embodiments of the present disclosure provide a method and an apparatus for rapidly classifying logs, which can more accurately cluster logs to generate a log template, and further obtain a log association template, and achieve rapid classification of logs through the log association template, thereby achieving the purpose of rapidly locating faults.

Provided herein is a method for rapidly classifying logs, including,

performing text vectorization processing on the log library to obtain a log text vector set;

clustering the log text vector set to obtain a log clustering result;

analyzing the log clustering result to obtain a log template;

matching the input logs according to the log template to obtain a log matching template list;

and obtaining a log association template according to the log matching template list, wherein the log association template is used for finishing log classification.

Embodiments herein also provide a log fast classifying apparatus, including,

the text vectorization unit is used for performing text vectorization processing on the log library to obtain a log text vector set;

the log clustering unit is used for clustering the log text vector set to obtain a log clustering result;

the log template generating unit is used for analyzing the log clustering result to obtain a log template;

the log matching unit is used for matching the input logs according to the log template to obtain a log matching template list;

and the log association template generating unit is used for obtaining a log association template according to the log matching template list, and the log association template is used for finishing log classification.

Embodiments herein also provide a computer device comprising a memory, a processor, and a computer program stored on the memory, the processor implementing the above-described method when executing the computer program.

Embodiments herein also provide a computer storage medium having a computer program stored thereon, the computer program, when executed by a processor of a computer device, performing the above-described method.

By using the embodiment, the text vectorization unit performs text vectorization processing on the logs in the log library to obtain a log text vector set, then clusters the log text vector set to obtain a log clustering result, wherein the log clustering result comprises a plurality of log categories, then analyzes the plurality of categories in the log clustering result respectively to obtain a log template of each category, matches the input logs according to the log templates to obtain a log matching template list, wherein the log matching template list can include but is not limited to the corresponding relationship between the logs and the log templates, then obtains a log association template according to the log template matching list, and finally classifies the logs according to the log association template to quickly locate faults. The input logs are classified quickly by generating the log association template, and the log classification efficiency is improved.

Drawings

In order to more clearly illustrate the embodiments or technical solutions in the prior art, the drawings used in the embodiments or technical solutions in the prior art are briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present disclosure, and other drawings can be obtained by those skilled in the art without creative efforts.

FIG. 1 is a schematic structural diagram of a log fast classifying device according to an embodiment of the present disclosure;

FIG. 2 is a flowchart illustrating a method for fast classifying a log according to an embodiment of the present disclosure;

FIG. 3 is a detailed block diagram of a log fast classifying device according to an embodiment of the present disclosure;

FIG. 4 is a flow diagram illustrating fast categorization of logs according to an embodiment herein;

FIG. 5 is a flow diagram illustrating generation of a log template according to an embodiment herein;

FIG. 6 is a flow diagram illustrating generation of a log association template according to an embodiment herein;

FIG. 7 is a schematic structural diagram of a computing device according to an embodiment of the present disclosure.

[Description of reference numerals]:

101. a text vectorization unit;

102. a log clustering unit;

103. a log template generating unit;

104. a log matching unit;

105. a log association template generation unit;

301. a text vectorization unit;

3011. a data cleaning module;

3012. a public attribute replacement module;

3013. a text encoding module;

302. a log clustering unit;

3021. a log clustering module;

3022. a similarity calculation module;

3023. a similarity comparison module;

303. a log template generating unit;

3031. a discrete vector elimination module;

3032. a vocabulary amount calculation module;

3033. a log template generation module;

304. a log matching unit;

3041. a log text vectorization module to be classified;

3042. a log template matching module;

305. a log association template generating unit;

3051. a log template combination module;

3052. an associated template selection module;

3053. a log classification module;

701. a computer device;

702. a processor;

703. a memory;

704. a drive mechanism;

705. an input/output module;

706. an input device;

707. an output device;

708. a presentation device;

709. a graphical user interface;

710. a network interface;

711. a communication link;

712. a communication bus.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments herein without making any creative effort, shall fall within the scope of protection.

FIG. 1 is a schematic structural diagram of the log fast classifying device according to an embodiment of the present disclosure. The device includes a text vectorization unit 101, a log clustering unit 102, a log template generation unit 103, a log matching unit 104, and a log association template generation unit 105. The log template generation unit 103 generates a log template for each category in the log clustering result obtained by the log clustering unit 102. Input logs are then classified using the log association template generated by the log association template generation unit 105, which yields the fault cause represented by that log association template.

The text vectorization unit 101 performs text vectorization processing on the logs in the log library to obtain a log text vector set. The log clustering unit 102 clusters the log text vector set to obtain the log categories and the log text vectors included in each category. The log template generation unit 103 generates a log template for each category in the log clustering result obtained by the log clustering unit 102. The log matching unit 104 performs regular-expression matching between the input logs and the log templates to obtain the correspondence between input logs and log templates, and builds the log matching template list. The log association template generation unit 105 obtains log association templates from the log matching template list, and the input logs are classified according to the log association templates to obtain the fault causes. The log described in the embodiments herein may be, but is not limited to, a network device alarm log.

FIG. 2 is a flowchart of a method for quickly classifying logs according to an embodiment of the present disclosure. The figure describes performing text vectorization on a log library, clustering the logs, generating a log template for each log category in the clustering result, matching input logs against the log templates to obtain a log association template, and classifying the input logs through the log association template to obtain a fault cause. The method includes:

step 201: performing text vectorization processing on the log library to obtain a log text vector set;

step 202: clustering the log text vector set to obtain a log clustering result;

step 203: analyzing the log clustering result to obtain a log template;

step 204: matching the input logs according to the log template to obtain a log matching template list;

step 205: and obtaining a log association template according to the log matching template list, wherein the log association template is used for finishing log classification.

According to the method of this embodiment, text vectorization processing is first performed on the logs in the log library: the log text is segmented at spaces, punctuation marks and special symbols, and each resulting word is encoded into a text vector that a machine learning algorithm can process, yielding the log text vector set corresponding to the log library. The log text vector set is then clustered to obtain the log clustering result, that is, the log categories and the log text vectors belonging to each category. The clustering result is analyzed to generate a log template for each category. The input logs, which are the logs to be classified, are matched against the log templates by regular expression to obtain the correspondence between input logs and log templates, and a log matching template list is built. Finally, log association templates are obtained from the log matching template list, and the input logs are classified according to the log association templates to obtain the fault causes.

According to an embodiment of the present disclosure, performing text vectorization processing on a log library to obtain a log text vector set further includes performing data cleaning on the log library, segmenting specific fields of each log in the log library to obtain a log text, and encoding the log text to obtain a log text vector set.

In this step, the log file is read line by line and decoded as UTF-8. Text that cannot be decoded is discarded, so that errors in subsequent processing do not affect training precision.

The log text contains some common fields. These fields contribute nothing to log clustering or to log template generation, but they enlarge the vocabulary of the log text vector set. The common fields of a log are therefore replaced with wildcards, while the fields specific to the log are retained. For example, a network device log may contain common fields such as a time, an IP address, an Ethernet port, and a rule name; a regular-expression script replaces these with the placeholders TIME, IP, ETH, and RULE, which reduces the vocabulary of the logs and in turn the amount of clustering computation.
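The replacement of common fields described above can be sketched with Python's `re` module. The patterns below are hypothetical illustrations (the patent does not give its regular expressions), and the RULE placeholder is omitted because the rule-name format is not specified.

```python
import re

# Hypothetical patterns for illustration only; real logs need patterns
# tuned to their actual timestamp, address and port formats.
COMMON_FIELD_PATTERNS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "IP"),          # dotted IPv4 address
    (re.compile(r"\b\d{2}:\d{2}:\d{2}\b"), "TIME"),              # hh:mm:ss time
    (re.compile(r"\beth\d+(?:/\d+)*\b", re.IGNORECASE), "ETH"),  # e.g. eth0, eth0/1
]


def replace_common_fields(line: str) -> str:
    """Replace each common field in one log line with its placeholder."""
    for pattern, placeholder in COMMON_FIELD_PATTERNS:
        line = pattern.sub(placeholder, line)
    return line


print(replace_common_fields("15:17:02 link down on eth0 from 10.0.0.1"))
# TIME link down on ETH from IP
```

Collapsing every address, time, and port into a single token is what shrinks the vocabulary before clustering.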

Text vectorization processing is then carried out. Each log is segmented at spaces, punctuation marks and special symbols, and a tokenizer tool extracts each word in the log. Each word segmented from the log is then encoded with a one-hot encoding algorithm so that the log can be processed by a machine learning algorithm.
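As a minimal sketch of the segmentation step, the function below splits a log line at whitespace, punctuation and other non-word symbols. The patent does not name its tokenizer tool, so `re.split` is used here as a stand-in, and `tokenize` is a hypothetical helper name.

```python
import re


def tokenize(line: str) -> list[str]:
    """Split a log line at any run of non-word characters, dropping empty pieces."""
    return [tok for tok in re.split(r"[^\w]+", line) if tok]


print(tokenize("Interface ETH, changed state to down"))
# ['Interface', 'ETH', 'changed', 'state', 'to', 'down']
```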

The one-hot encoding algorithm uses an N-bit status register to encode N states. Each state corresponds to an independent register bit, and at any time, only one register bit is valid for the one-hot encoding. For example:

Sex: ["male", "female"]

Region: ["Europe", "US", "Asia"]

Browser: ["Firefox", "Chrome", "Safari", "Internet Explorer"]

When the sample ["male", "US", "Internet Explorer"] is encoded by the one-hot encoding method, "male" corresponds to [1,0], "US" corresponds to [0,1,0], and "Internet Explorer" corresponds to [0,0,0,1]. The result of digitizing the full feature set is: [1,0,0,1,0,0,0,0,1].
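The one-hot example above can be reproduced with a few lines of Python. The `FEATURES` table and the helper names are illustrative, and the second value of the sex feature is assumed to be "female".

```python
# Feature categories from the example; "female" is an assumed second value.
FEATURES = {
    "sex": ["male", "female"],
    "region": ["Europe", "US", "Asia"],
    "browser": ["Firefox", "Chrome", "Safari", "Internet Explorer"],
}


def one_hot(value, categories):
    """Return a vector with a single 1 at the position of `value`."""
    return [1 if value == category else 0 for category in categories]


def encode_sample(sample):
    """Concatenate the one-hot vectors of all features of one sample."""
    encoded = []
    for value, categories in zip(sample, FEATURES.values()):
        encoded.extend(one_hot(value, categories))
    return encoded


print(encode_sample(["male", "US", "Internet Explorer"]))
# [1, 0, 0, 1, 0, 0, 0, 0, 1]
```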

A word-embedding algorithm is then used to compress the dimensionality of each word, so that every log is converted into a vector of equal dimension, which improves the efficiency of model training.

The input of the word-embedding algorithm is the set of distinct words in the original text. Take the sentence "apple on a apple tree" as an example: for convenience of processing, its distinct words are placed in a dictionary ["apple", "on", "a", "tree"], and this dictionary can be regarded as the input of the word-embedding algorithm. The output of the word-embedding algorithm is a numerical representation of each word; for example, the vector corresponding to "apple" is [1,0,0,0] and the vector corresponding to "a" is [0,0,1,0]. A machine learning algorithm can then build a model on the numerical representation of the words.

According to an embodiment of the present disclosure, clustering the log text vector set to obtain a log clustering result further includes clustering the log text vector set multiple times by using a k-means algorithm to obtain multiple groups of log classification results, calculating cosine similarity of each group of log classification results, and selecting the log classification result with the largest sum of cosine similarity as the log clustering result.

In the step, in order to increase the accuracy of log clustering, the log text vector set is clustered for multiple times through a k-means algorithm to obtain multiple groups of log classification results, each group of log classification results includes a centroid, a cluster where the centroid is located, and the number of logs in each cluster, the centroid is a classification category in the log classification results, and the logs in each cluster are considered to belong to the same log category.

Although the logs within a category are similar to one another, each log is still a separate individual, so the similarity within each category must be calculated to obtain the similarity of each classification result. In the embodiments herein, the log clustering result is determined by cosine similarity, which measures the similarity of two vectors by the cosine of the angle between them. The cosine of a 0-degree angle is 1, and the cosine of any other angle is not greater than 1, so the cosine of the angle between two vectors indicates whether the two vectors point in approximately the same direction.

The cosine similarity calculation formula is as follows:

where K is the number of clusters, x is a log text vector, C_i is the i-th cluster, c_i is the centroid of cluster C_i, and m_i is the number of logs in the i-th cluster.

And finally, summing the cosine similarity of each category in the classification result, and selecting the classification result with the largest cosine similarity sum as the log clustering result.
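The selection rule can be sketched as follows. The score of one classification result is assumed here to be the sum, over every cluster and every vector in it, of the cosine similarity between the vector and the cluster centroid (the mean of the cluster's vectors). This form is an assumption consistent with the symbol definitions above, not a reproduction of the patent's formula, and all function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(cluster):
    """Component-wise mean of the vectors in one cluster."""
    m = len(cluster)
    return [sum(column) / m for column in zip(*cluster)]


def cosine_similarity_sum(clusters):
    """Assumed score: sum of cosine(x, centroid) over all clusters and vectors."""
    total = 0.0
    for cluster in clusters:
        c = centroid(cluster)
        total += sum(cosine(x, c) for x in cluster)
    return total


def best_clustering(candidates):
    """Pick the classification result with the largest cosine similarity sum."""
    return max(candidates, key=cosine_similarity_sum)
```

Here `candidates` is a list of classification results, each given as a list of clusters, and each cluster as a list of vectors.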

According to an embodiment herein, clustering the set of log text vectors a plurality of times by a K-means algorithm further includes selecting K points in the set of log text vectors as centroids, the K points representing K clustering results, assigning each log vector in the set of log text vectors to the nearest centroid to form K clusters, recalculating the centroid of each cluster until the centroid no longer changes, resulting in a log classification result.

In this step, the procedure is as follows:

(1) Randomly selecting K sample points as an initial clustering centroid, wherein the clustering centroid is a data point in a log text vector set:

a = {a_1, a_2, …, a_K}

where each a_j represents a category.

(2) For each log vector x_i in the log text vector set, calculate its distance to each of the K cluster centers and assign it to the category of the cluster center with the smallest distance.

(3) For each category a_j, recalculate its cluster center, i.e. the centroid of all samples belonging to that category:

where c_i is the i-th cluster, i.e. the set of vectors belonging to the i-th category.

(4) Repeat steps (2) and (3) until the centroids no longer change.
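Steps (1) to (4) can be sketched in plain Python as below. Euclidean distance and a mean-based centroid update are assumptions (the text does not state the distance measure), and the `seed` and `max_iter` parameters are added only to make the sketch deterministic and bounded.

```python
import math
import random


def kmeans(vectors, k, seed=0, max_iter=100):
    """Minimal k-means sketch; returns (centroids, clusters)."""
    rng = random.Random(seed)
    # Step (1): pick K data points as the initial centroids.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Step (2): assign each vector to its nearest centroid
        # (Euclidean distance is assumed here).
        clusters = [[] for _ in range(k)]
        for x in vectors:
            j = min(range(k), key=lambda i: math.dist(x, centroids[i]))
            clusters[j].append(x)
        # Step (3): recompute each centroid as the mean of its cluster;
        # an empty cluster keeps its previous centroid.
        new_centroids = [
            [sum(col) / len(c) for col in zip(*c)] if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        # Step (4): stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters
```

Running the function several times with different `seed` values gives the multiple groups of log classification results from which one is later selected.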

According to an embodiment herein, analyzing the log clustering result to obtain a log template further comprises,

calculating the vocabulary of different categories in the log clustering result, selecting k words with the largest occurrence frequency to generate a log template corresponding to the category, wherein k is a natural number more than or equal to 1, matching the log text vectors in the category through the log template, and reducing the value of k to regenerate the log template corresponding to the category when all the log text vectors cannot be matched.

In this step, calculating the vocabulary of the different categories in the log clustering result further includes performing a regularization operation on the log text vectors of each category in the log clustering result to remove discrete (outlier) log text vectors. In this implementation, all log text vectors in a category are selected, the mean X and the standard deviation S of their lengths are calculated, log text vectors whose length lies within X ± S are retained, and the other log text vectors are deleted from the category.
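A minimal sketch of this length-based filter follows. Each log is represented as its list of words, the population standard deviation is assumed (the sample standard deviation would be an equally plausible reading), and `drop_length_outliers` is a hypothetical name.

```python
import statistics


def drop_length_outliers(token_lists):
    """Keep only logs whose length lies within mean ± one standard deviation."""
    lengths = [len(tokens) for tokens in token_lists]
    mean = statistics.mean(lengths)
    std = statistics.pstdev(lengths)  # population standard deviation (assumed)
    return [t for t in token_lists if mean - std <= len(t) <= mean + std]
```

For example, with four logs of 5 words and one log of 20 words, the mean length is 8 and the standard deviation is 6, so only the 20-word log falls outside the range 2 to 14 and is removed.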

Then respectively calculating the vocabulary of each category in the log clustering result, and sequencing the words according to the descending order of the occurrence times to obtain a sequencing list of each category;

setting the initial length k to the length of the longest log in the category plus 1, selecting the first k words in the sorted list of the category, and generating a log template of the category in the form of a regular expression;

matching the log text vectors in the category through the log template of the category;

when the log template of the category can match all log text vectors in the category, the log template is determined as the final log template of the category.

And when the log template of the category cannot be matched with all log text vectors in the category, calculating k = k-1, and selecting the first k words in the sorted list of the category again to generate a regular expression for matching to obtain the log template which can be matched with all the log text vectors in the category. Meanwhile, the background staff analyze the regular expressions of the log templates to obtain fault reasons, and mark the fault reasons on the log templates respectively.
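The loop over k can be sketched as below. The text does not say how the k selected words are assembled into a regular expression, so this sketch assumes they are kept in the order in which they occur in the first log of the category and joined with `.*`; the function names are hypothetical.

```python
import re
from collections import Counter


def build_template(token_lists, keep):
    """Assumed construction: kept words in first-log order, joined by '.*'."""
    reference = token_lists[0]
    kept = [re.escape(tok) for tok in reference if tok in keep]
    return ".*" + ".*".join(kept) + ".*"


def generate_template(token_lists):
    """Lower k until the template matches every log in the category."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    ranked = [tok for tok, _ in counts.most_common()]  # descending frequency
    k = max(len(tokens) for tokens in token_lists) + 1  # longest log length + 1
    while k >= 1:
        pattern = build_template(token_lists, set(ranked[:k]))
        if all(re.search(pattern, " ".join(tokens)) for tokens in token_lists):
            return pattern
        k -= 1
    return ".*"  # fallback if no word is shared by all logs


logs = [
    ["protocol", "Interface", "ETH", "changed", "state", "to", "down"],
    ["protocol", "Interface", "ETH", "changed", "state", "to", "up"],
]
print(generate_template(logs))
# .*protocol.*Interface.*ETH.*changed.*state.*to.*
```

In this example the words "down" and "up" each occur only once, so they are the first to be dropped as k decreases, and the remaining six words form a template that matches both logs.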

According to an embodiment of the present disclosure, matching the input logs according to the log template to obtain a log matching template list further includes performing text vectorization processing on the input logs to obtain log text vectors, traversing the log templates corresponding to each category in the clustering result, matching the log text vectors, and recording a correspondence between the logs and the log templates in the log matching template list.

In this step, the input log is a log to be classified, and the log template obtained based on the log library is used for classifying the input log, so that the classification speed of the log is increased, and the classification efficiency is improved.

Firstly, performing text vectorization processing on an input log in the same way as the text vectorization processing on the log in a log library to obtain a log text vector, traversing log templates corresponding to various categories in a clustering result, performing regular matching on the log text vector through the log templates to obtain a log template capable of matching the log text vector, and then recording the corresponding relation between the log and the matched template in a log matching template list.
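As a minimal sketch of this traversal, the function below tries each template in turn on each input log and records the first one that matches. The template numbers and patterns in the usage example are hypothetical, and matching is done on the log text directly for simplicity.

```python
import re


def match_logs(logs, templates):
    """Return a list of (log number, template number or None) pairs.

    `templates` maps a template number to a regular-expression string.
    """
    compiled = {number: re.compile(rx) for number, rx in templates.items()}
    matches = []
    for log_number, line in enumerate(logs, start=1):
        hit = next(
            (number for number, rx in compiled.items() if rx.search(line)),
            None,
        )
        matches.append((log_number, hit))
    return matches


templates = {
    100: r".*protocol.*Interface.*changed state to down",
    101: r".*login failed.*",
}
logs = [
    "Line protocol on Interface ETH, changed state to down",
    "user login failed from IP",
]
print(match_logs(logs, templates))
# [(1, 100), (2, 101)]
```

Because only regular-expression matching is performed, a new log is classified without clustering the log library again.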

According to an embodiment of the present disclosure, calculating the correlation between log templates according to the log matching template list to obtain a log correlation template further includes continuously selecting m log templates from the log template matching list to obtain a plurality of log template combinations, and recording the occurrence frequency of each log template combination, where m is the number of log templates set according to a requirement, and selecting a log template combination with the occurrence frequency greater than or equal to a set occurrence frequency threshold as the log correlation template, and classifying logs according to the log correlation template to quickly locate a fault.

In the step, firstly, the number m of log templates in the log template combination is set according to the requirement;

continuously selecting m log templates from the log template matching list, carrying out OR operation to obtain a plurality of log template combinations, and recording the times of occurrence of each log template combination in the log template matching list;

selecting a template combination with the occurrence times more than or equal to a set occurrence time threshold value as a log association template;

and finally, classifying the logs corresponding to the log association templates based on a log template matching list, and simultaneously, respectively obtaining the fault reasons of the log association templates by background staff according to the fault reasons marked by the log templates in the log association templates so as to position the fault reasons of the input logs.

Fig. 3 is a detailed structure diagram of the log fast classifying device according to the embodiment of the present disclosure, and the detailed structure of the log fast classifying device is described in this diagram, and specifically includes a text vectorization unit 301, a log clustering unit 302, a log template generating unit 303, a log matching unit 304, and a log association template generating unit 305.

According to an embodiment herein, the text vectorization unit 301 further includes a data cleaning module 3011, configured to perform data cleaning on the log library.

According to an embodiment of the present disclosure, the text vectorization unit 301 further includes a common attribute replacement module 3012, which replaces common attributes of the logs in the log library with wildcards, so as to reduce the amount of vocabulary in the logs, and further reduce the amount of computation of the clusters.

According to an embodiment of the present disclosure, the text vectorization unit 301 further includes a text encoding module 3013, performs text vectorization on the log after replacing the common attribute, divides the log according to spaces, punctuation marks and special symbols, extracts each word in the log by using a tokenizer tool, and then encodes each word divided from the log by using a one-hot encoding algorithm, so that a machine learning algorithm can process each word in the log, and then compresses the dimension of each word by using a word-embedding algorithm, converts the log into vectors with equal dimensions, thereby improving the efficiency of model training.

According to an embodiment of the present disclosure, the log clustering unit 302 further includes a log clustering module 3021, which clusters the log text vector set obtained by the text vectorization unit 301 multiple times to obtain multiple groups of log classification results, where each group of log classification results includes the centroids, the cluster in which each centroid is located, and the number of logs in each cluster. The classification result of the text vector set is then determined by calculating the cosine similarity of each classification result.

According to an embodiment of the present disclosure, the log clustering unit 302 further includes a similarity calculating module 3022, configured to calculate cosine similarities of multiple log classification results obtained by clustering the log text vector set multiple times by the log clustering module 3021, and finally sum the cosine similarities of each category in the classification results to obtain a cosine similarity sum of each log classification result.

According to an embodiment herein, the log clustering unit 302 further includes a similarity comparison module 3023, configured to compare sizes of cosine similarity sums of each group of log classification results obtained by the similarity calculation module 3022, and select a classification result with largest cosine similarity sum as the log clustering result.

According to an embodiment of the present disclosure, the log template generating unit 303 further includes a discrete vector rejecting module 3031, which performs a regularization operation on log text vectors of each category in the log clustering result obtained by the log clustering unit 302 to reject discrete log text vectors.

According to an embodiment of the present disclosure, the log template generating unit 303 further includes a vocabulary amount calculating module 3032, which calculates the vocabulary amount of each category in the log clustering result retained by the discrete vector eliminating module 3031, and sorts the words according to the descending order of the occurrence times to obtain a sorted list of each category.

For example, a certain log in the log library is: feb 23.

The result calculated by the vocabulary amount calculation module 3032 is that the words, ranked by number of occurrences, are Interface, protocol, down, GigabitEthernet, changed, state, to. The remaining words occur too rarely (for example, fewer than 10 times) and are not counted.

According to an embodiment herein, the log template generating unit 303 further includes a log template generating module 3033, configured to generate a log template. Firstly, setting an initial length k as the length +1 of the longest log in the category, selecting the first k words in the sorted list of the category, and generating the log template of the category in the form of a regular expression.

For example, according to the sorted list obtained in the above embodiment, assuming that the longest log in the category has length 6, the initial length k is 7, and the top 7 words in the sorted list are taken: Interface, protocol, down, GigabitEthernet, changed, state, to.

And matching the log text vectors in the category through the log template of the category.

When the log template of the category can match all log text vectors in the category, the log template of the category is determined as a final log template of the category.

And when the log template of the category cannot be matched with all log text vectors in the category, calculating k = k-1, and selecting the first k words in the sorted list of the category again to generate a regular expression for matching to obtain the log template which can be matched with all the log text vectors in the category. For example, the log contents according to the above embodiment and the above ordered list:

First, the log Feb 23 15 17% LINK-3-UPDOWN: line protocol on Interface GigabitEthernet 0/8, changed state to down is matched while keeping 7 words, and the matching result is:

.*protocol.*Interface GigabitEthernet.*,changed state to down.

If the other logs in the same log category can also be matched with these 7 words, a regular expression is generated from the 7 words to obtain the log template of the category.

If at least one of the other logs in the same log category cannot be matched with these 7 words, k = 7 - 1 is computed, that is, the word "to" is dropped and 6 words are kept for the next round of matching. The matching result is then:

.*protocol.*Interface GigabitEthernet.*,changed state.*down.

This is repeated until the k retained words match all the logs in the log category, at which point a regular expression is generated from the k words to obtain the log template of the category.

According to an embodiment of the present disclosure, the log matching unit 304 further includes a log text vectorization module 3041 for performing text vectorization processing on the log to be classified by using the method in the text vectorization unit 301 to obtain a log text vector.

According to an embodiment of the present disclosure, the log matching unit 304 further includes a log template matching module 3042, which performs a regular matching on the log text vector obtained by the log text vectorization module 3041 to be classified according to the log template generated by the log template generating unit 303, to obtain a log template capable of matching the log text vector, and then records a corresponding relationship between the log and the matched template in the log matching template list. The log template matching list of the embodiment is shown in table 1.

TABLE 1

Log number	1	2	3	4	5	6	7	8	9	10
Template number	100	102	101	100	102	103	100	104	103	100
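As a non-limiting illustration of how a log matching template list such as Table 1 may be produced, the following Python sketch matches each input log against a set of regular-expression templates and records the template number. The function name `build_matching_list`, the two template patterns, the sample logs, and the rule that the first matching template wins are all hypothetical and are introduced only for this sketch.

```python
import re

def build_matching_list(logs, templates):
    """Return [(log_number, template_number or None), ...] in input order."""
    matching_list = []
    for log_number, log in enumerate(logs, start=1):
        matched = None
        for template_number, pattern in templates.items():
            if re.search(pattern, log):
                matched = template_number
                break  # assumption: the first matching template wins
        matching_list.append((log_number, matched))
    return matching_list

# Hypothetical templates and logs, for illustration only.
templates = {
    100: r".*changed state.*down",
    102: r".*authentication failed.*",
}
logs = [
    "Interface Gi0/8, changed state to down",
    "User admin authentication failed from console",
    "Interface Gi0/9, changed state to down",
]

matching_list = build_matching_list(logs, templates)
print(matching_list)  # [(1, 100), (2, 102), (3, 100)]
```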

According to an embodiment of the present disclosure, the log association template generating unit 305 further includes a log template combination module 3051, which consecutively selects m log templates from the log matching template list obtained by the log matching unit 304 and performs an OR operation on them to obtain a plurality of log template combinations, and records the number of occurrences of each template combination, where m is the number of log templates set by the user and can be adjusted as required. For example, according to the log matching template list of the above embodiment, the log template numbers are:

100,102,101,100,102,103,100,104,103,100

If the user sets m = 2, the log template combinations shown in Table 2 are obtained; Table 2 is the log template combination list of this embodiment.

TABLE 2

Template combination number	1001	1002	1003	1001	1004	1005	1006	1007	1005
Members	100|102	102|101	101|100	100|102	102|103	103|100	100|104	104|103	103|100

The number of occurrences of each template combination is then counted.

According to an embodiment of the present disclosure, the log association template generating unit 305 further includes an association template selecting module 3052. Based on the number of times each template combination obtained by the log template combination module 3051 appears in the log matching template list generated by the log matching unit 304, and on a threshold set by the user, the module 3052 selects the template combinations whose number of occurrences is greater than or equal to the threshold, thereby obtaining a plurality of log association templates. For example, according to the log matching template list of the above embodiment, if the threshold set by the user is 2, the log template combinations numbered 1001 and 1005 each appear 2 times, which is greater than or equal to the threshold. Therefore, the OR operation of the log templates 100 and 102 represented by the log template combination 1001 is taken as the log association template 1001, and the OR operation of the log templates 103 and 100 represented by the log template combination 1005 is taken as the log association template 1005.
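The combination and selection described above can be sketched in Python as follows. This is a minimal illustration only: the function name `association_templates` is hypothetical, and representing each combination as a tuple of template numbers (rather than as an OR-ed regular expression) is an assumption made to keep the sketch short.

```python
from collections import Counter

def association_templates(template_sequence, m=2, threshold=2):
    """Count every run of m consecutive templates and keep the frequent ones."""
    windows = [
        tuple(template_sequence[i:i + m])
        for i in range(len(template_sequence) - m + 1)
    ]
    counts = Counter(windows)
    return {combo: n for combo, n in counts.items() if n >= threshold}

# Template numbers from the log matching template list of the embodiment (Table 1).
sequence = [100, 102, 101, 100, 102, 103, 100, 104, 103, 100]

frequent = association_templates(sequence, m=2, threshold=2)
print(frequent)  # {(100, 102): 2, (103, 100): 2}
```

The two surviving combinations correspond to the log association templates 1001 (100|102) and 1005 (103|100) of the example.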

According to an embodiment of the present disclosure, the log association template generating unit 305 further includes a log classifying module 3053, which classifies the logs according to the log association templates obtained by the association template selecting module 3052. For example, according to the above embodiment, the log association templates shown in Table 3 are obtained; Table 3 is the list of log association templates of this embodiment.

TABLE 3

Therefore, logs 1, 2, 4 and 5 are classified under the log association template 1001, and their failure cause can be located as the failure cause represented by the log association template 1001; logs 6, 7, 9 and 10 are classified under the log association template 1005, and their failure cause can be located as the failure cause represented by the log association template 1005; log 3 is classified under the log template 101, and its failure cause can be located as the failure cause represented by the log template 101; log 8 is classified under the log template 104, and its failure cause can be located as the failure cause represented by the log template 104.
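As a non-limiting illustration, the classification result above can be reproduced with the following Python sketch. The function name `classify_logs` is hypothetical, and the tie-breaking rule (a log keeps the first association template whose window covers it) is an assumption, since the embodiment does not state how a log covered by two association templates is handled.

```python
def classify_logs(template_sequence, association_combos, m=2):
    """Map each 1-based log number to an association combination or to its own template."""
    classes = {}
    for start in range(len(template_sequence) - m + 1):
        window = tuple(template_sequence[start:start + m])
        if window in association_combos:
            for offset in range(m):
                # assumption: a log keeps the first association template that covers it
                classes.setdefault(start + offset + 1, window)
    # Logs not covered by any association template fall back to their single template.
    for log_number, template in enumerate(template_sequence, start=1):
        classes.setdefault(log_number, template)
    return classes

sequence = [100, 102, 101, 100, 102, 103, 100, 104, 103, 100]
combos = {(100, 102), (103, 100)}  # association templates 1001 and 1005 of the example

classes = classify_logs(sequence, combos)
for log_number in sorted(classes):
    print(log_number, classes[log_number])
```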

Fig. 4 is a flowchart of fast log classification in an embodiment of the present disclosure. In this embodiment, the logs in a log library are clustered, a log template is generated for each category in the clustering result, input logs are matched against the log templates, a log association template is obtained from the matching result, and finally the input logs are classified according to the log association template. The specific process is as follows:

Step 401: Extract the logs from the log library.

In this step, all logs recorded in the log library are first extracted so that they can be subjected to cluster analysis.

Step 402: Clean the logs.

In this step, the log text extracted in step 401 may be garbled because of different Chinese character encodings, and text that cannot be recognized is discarded.

Step 403: Replace the common fields in the logs with wildcards.

In this step, the common fields in the logs cleaned in step 402 are replaced with wildcards while the specific fields of the logs are retained, which reduces the vocabulary of the logs and thus the amount of computation required for clustering.
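A minimal Python sketch of the wildcard replacement in step 403 is given below for illustration. The embodiment does not enumerate the common fields here, so the choice of clock times, IPv4 addresses and bare numbers, the `*` wildcard, and the names `COMMON_FIELD_PATTERNS` and `mask_common_fields` are all assumptions of this sketch.

```python
import re

# Assumed "common fields": clock times, IPv4 addresses and bare numbers.
COMMON_FIELD_PATTERNS = [
    re.compile(r"\b\d{2}:\d{2}:\d{2}\b"),
    re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"),
    re.compile(r"\b\d+\b"),
]

def mask_common_fields(log, wildcard="*"):
    """Replace each assumed common field with a wildcard, keeping the other words."""
    for pattern in COMMON_FIELD_PATTERNS:
        log = pattern.sub(wildcard, log)
    return log

masked = mask_common_fields("15:17:02 login failed from 10.0.0.5 port 22")
print(masked)  # * login failed from * port *
```

Because variable values collapse to the same wildcard, logs that differ only in such fields share one vocabulary, which is the reduction in clustering cost that the step aims at.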

Step 404: Text vectorization processing.

In this step, the logs whose specific fields were retained in step 403 are subjected to text vectorization processing to obtain a log text vector set that can be processed by a machine learning algorithm.

Step 405: Cluster the log text vector set.

In this step, the log text vector set of step 404 is clustered multiple times using a k-means algorithm, the cosine similarity of each group of log classification results is then calculated, and the log classification result with the largest sum of cosine similarities is selected as the log clustering result.
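The selection among several k-means runs can be sketched in Python as follows. This is an illustration under stated assumptions: the k-means runs themselves are not reproduced and are represented by two hand-written candidate labelings, the score is taken to be the sum of the cosine similarities between each vector and the centroid of its cluster, and the names `cosine`, `total_cosine_similarity` and `select_clustering` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def total_cosine_similarity(vectors, labels):
    """Sum, over all clusters, of the cosine similarity of each member to its centroid."""
    clusters = {}
    for vector, label in zip(vectors, labels):
        clusters.setdefault(label, []).append(vector)
    total = 0.0
    for members in clusters.values():
        centroid = [sum(column) / len(members) for column in zip(*members)]
        total += sum(cosine(vector, centroid) for vector in members)
    return total

def select_clustering(vectors, candidate_labelings):
    """Pick the labeling with the largest total cosine similarity."""
    return max(candidate_labelings,
               key=lambda labels: total_cosine_similarity(vectors, labels))

vectors = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
# Stand-ins for the results of two k-means runs (assumption, not real runs).
candidates = [[0, 1, 0, 1], [0, 0, 1, 1]]

best = select_clustering(vectors, candidates)
print(best)
```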

Step 406: Generate the log templates.

In this step, a log template is generated for each category in the clustering result obtained in step 405. First, the vocabulary of each category in the log clustering result is calculated, and the k most frequent words are selected to generate a regular expression as the log template of the category, where k is a natural number greater than or equal to 1. The log text vectors in each category are then matched against the log template; when not all of the log text vectors can be matched, the value of k is reduced and the log template of the category is regenerated, until a log template that matches all the log text vectors in the category is obtained. The log templates of all categories are stored in a log template set.

Step 407: Match the input logs.

In this step, the log template set stored in step 406 is used to perform regular-expression matching on the input logs to obtain the correspondence between each input log and the log templates in the log template set, and the correspondence is stored in the log matching template list as the matching result.

Step 408: Obtain the log association templates according to the matching result.

In this step, m log templates are consecutively selected from the matching result obtained in step 407 to obtain a plurality of log template combinations, and the number of occurrences of each template combination is recorded, where m is the number of log templates set as required. The log template combinations whose number of occurrences is greater than or equal to a set occurrence threshold are selected as log association templates.

Step 409: Classify the input logs according to the log association templates.

In this step, the input logs corresponding to each log association template are classified according to the plurality of log association templates obtained in step 408.

Fig. 5 is a flowchart of generating a log template according to an embodiment of the present disclosure, in which a log template is generated from the log clustering result. For ease of description, the calculation is illustrated for one category in the log clustering result; the remaining categories are processed in the same way. The specific process is as follows:

Step 501: Input a category of the log clustering result.

Step 502: Remove the discrete log text vectors.

In this step, all the log text vectors in the category are selected, the mean X and the standard deviation S of the log text vector lengths are calculated, the log text vectors whose lengths fall within the range X ± S are retained, and the other log text vectors are deleted from the category.
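The length-based filtering of step 502 may be sketched in Python as follows. The sketch assumes that the "length" of a log text vector is its number of tokens and that S is the population standard deviation; neither choice is specified by the embodiment, and the function name `drop_length_outliers` is hypothetical.

```python
from statistics import mean, pstdev

def drop_length_outliers(token_lists):
    """Keep only the logs whose token count lies within mean +/- one standard deviation."""
    lengths = [len(tokens) for tokens in token_lists]
    x = mean(lengths)
    s = pstdev(lengths)  # assumption: population standard deviation
    return [tokens for tokens in token_lists if x - s <= len(tokens) <= x + s]

# Four logs of 5 or 6 tokens and one unusually long log of 20 tokens.
category = [["w"] * 5, ["w"] * 5, ["w"] * 6, ["w"] * 5, ["w"] * 20]

kept = drop_length_outliers(category)
print([len(tokens) for tokens in kept])  # [5, 5, 6, 5]
```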

Step 503: Calculate the vocabulary.

In this step, the vocabulary of the category is calculated after the discrete log text vectors have been removed in step 502.

Step 504: Sort the words in descending order of occurrence to obtain a sorted list.

In this step, the words in the vocabulary obtained in step 503 are sorted in descending order of occurrence to obtain the sorted list of the category.

Step 505: Set the initial value of k to the length of the longest log plus 1.

In this step, the initial value of k is set to the length of the longest log in the category plus 1.

Step 506: Select the top k words from the sorted list obtained in step 504.

Step 507: Generate a regular expression comprising the top k words as a log template.

In this step, a regular expression is generated as a log template from the top k words selected in step 506.

Step 508: Match all log text vectors in the category against the log template.

Step 509: Judge whether all logs are successfully matched.

In this step, if the log template successfully matches all logs in the category, the log template is used as the log template of the category; if the log template does not successfully match all logs in the category, k = k - 1 is calculated and steps 506 to 508 are repeated until the generated log template successfully matches all logs in the category.

Step 510: Take the log template as the log template of the category.
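Steps 503 to 510 can be sketched in Python as follows, purely as an illustration. To keep the sketch short, the template is assumed to be an order-independent regular expression built from lookaheads (each of the top k words must occur somewhere in the log), which is a simplification and not the template form shown in the embodiment; the names `build_template` and `generate_category_template` and the sample logs are hypothetical.

```python
import re
from collections import Counter

def build_template(words):
    # Assumed template form: every word must occur somewhere, in any order.
    return re.compile("".join(rf"(?=.*\b{re.escape(word)}\b)" for word in words))

def generate_category_template(logs):
    """Shrink k until the template built from the top k words matches every log."""
    tokenised = [re.findall(r"\w+", log) for log in logs]
    counts = Counter(word for tokens in tokenised for word in tokens)
    ranked = [word for word, _ in counts.most_common()]  # descending occurrence
    k = max(len(tokens) for tokens in tokenised) + 1     # longest log length + 1
    while k >= 1:
        template = build_template(ranked[:k])
        if all(template.search(log) for log in logs):
            return k, template
        k -= 1  # k = k - 1, then try again with fewer words
    return 0, None  # no template with k >= 1 matches every log

# Hypothetical logs of one category.
logs = [
    "Interface Gi0 changed state to down",
    "Interface Gi1 changed state to up",
]

k, template = generate_category_template(logs)
print(k, template.pattern)
```

With these two logs the loop stops at k = 4, because only the four words shared by both logs ("Interface", "changed", "state", "to") can be required without losing a match.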

Fig. 6 is a flowchart of matching input logs against the log templates of all categories in the clustering result, generating log association templates, and classifying the input logs according to the log association templates. In the embodiment illustrated in the present disclosure, the input may be a set of multiple logs, each of which is classified. The specific process is as follows:

Step 601: Input the logs to be classified.

In this step, the logs to be classified may be a set of multiple logs.

Step 602: Perform text vectorization processing on the logs.

In this step, the logs input in step 601 are subjected to text vectorization processing in the same manner as steps 402 to 404 in Fig. 4.

Step 603: Match against the log templates to obtain a log matching template list.

In this step, regular-expression matching is performed between the log text vectors obtained in step 602 and all log templates to obtain the correspondence between the log text vectors and the log templates, and the correspondence is stored in the log matching template list.

Step 604: The user sets m as required, for example m = 2.

Step 605: Consecutively select m log templates from the log matching template list to obtain a plurality of log template combinations.

In this step, m log templates are consecutively selected from the log matching template list obtained in step 603 and an OR operation is performed on them to obtain a plurality of log template combinations.

Step 606: Record the number of times each log template combination appears in the log matching template list.

Step 607: Compare the number of occurrences of each log template combination with the occurrence threshold set by the user.

Step 608: Select the log template combinations whose number of occurrences is greater than or equal to the threshold set by the user as log association templates.

Step 609: Classify the input logs according to the log association templates.

Fig. 7 is a schematic structural diagram of a computer device according to an embodiment herein. The log fast classifying apparatus may be the computer device of this embodiment, which executes the method described herein. The computer device 701 may include one or more processors 702, such as one or more Central Processing Units (CPUs), each of which may implement one or more hardware threads. The computer device 701 may also include any memory 703 for storing any kind of information, such as code, settings, data, etc. For example, and without limitation, the memory 703 may include any one or combination of the following: any type of RAM, any type of ROM, flash memory devices, hard disks, optical disks, etc. More generally, any memory may use any technology to store information. Further, any memory may provide volatile or non-volatile retention of information. Further, any memory may represent fixed or removable components of the computer device 701. In one case, when the processor 702 executes the associated instructions, which are stored in any memory or combination of memories, the computer device 701 can perform any of the operations of the associated instructions. The computer device 701 also includes one or more drive mechanisms 704, such as a hard disk drive mechanism, an optical disk drive mechanism, or the like, for interacting with any memory.

The computer device 701 may also include an input/output module 705 (I/O) for receiving various inputs (via an input device 706) and for providing various outputs (via an output device 707). One particular output mechanism may include a presentation device 708 and an associated Graphical User Interface (GUI) 709. In other embodiments, the input/output module 705 (I/O), the input device 706, and the output device 707 may be omitted, and the computer device may merely serve as one computer device in a network. The computer device 701 may also include one or more network interfaces 710 for exchanging data with other devices via one or more communication links 711. One or more communication buses 712 couple the above-described components together.

Communication link 711 may be implemented in any manner, such as over a local area network, a wide area network (e.g., the Internet), a point-to-point connection, etc., or any combination thereof. The communication link 711 may include any combination of hardwired links, wireless links, routers, gateway functions, name servers, etc., governed by any protocol or combination of protocols.

Embodiments herein also provide a computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:

clustering the log text vector set to obtain a log clustering result;

analyzing the log clustering result to obtain a log template;

The computer device provided by this embodiment can also implement the methods in Figs. 2 and 4 to 6.

Corresponding to the methods in fig. 2, 4-6, embodiments herein also provide a computer-readable storage medium having stored thereon a computer program, which, when executed by a processor, performs the steps of the above-described method.

Embodiments herein also provide computer readable instructions, wherein a program therein causes a processor to perform the method as shown in fig. 2, 4-6 when the instructions are executed by the processor.

It should be understood that, in various embodiments herein, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments herein.

It should also be understood that, in the embodiments herein, the term "and/or" merely describes an association relation between associated objects, meaning that three kinds of relations may exist. For example, "A and/or B" may represent: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the former and latter associated objects are in an "or" relationship.

Those of ordinary skill in the art will appreciate that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be embodied in electronic hardware, computer software, or combinations of both, and that, to clearly illustrate the interchangeability of hardware and software, the components and steps of the examples have been described above generally in terms of their functions. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the several embodiments provided herein, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may also be an electric, mechanical or other form of connection.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the elements may be selected according to actual needs to achieve the objectives of the embodiments herein.

In addition, functional units in the embodiments herein may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solutions of the present invention may be implemented in a form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

The principles and embodiments of this document are explained herein using specific examples, which are presented only to aid in understanding the methods and their core concepts; meanwhile, for a person skilled in the art, according to the idea of the present disclosure, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present disclosure should not be construed as a limitation to the present disclosure.

Claims

1. A method for rapidly classifying logs is characterized by comprising the following steps,

clustering the log text vector set to obtain a log clustering result;

analyzing the log clustering result to obtain a log template;

calculating the association among the log templates according to the log matching template list to obtain a log association template, wherein the log association template is used for completing log classification;

wherein calculating the association among the log templates according to the log matching template list to obtain the log association template further comprises,

consecutively selecting m log templates from the log matching template list and performing an OR operation on them to obtain a plurality of log template combinations, and recording the number of occurrences of each log template combination, wherein m is the number of log templates set according to requirements;

selecting a log template combination with the occurrence times larger than or equal to a set occurrence time threshold value as a log association template;

and classifying the log according to the log association template.

2. The method as claimed in claim 1, wherein the text vectorization process is performed on the log library to obtain a log text vector set further comprises,

performing data cleaning on the log library;

cutting the special fields of the logs in the log library to obtain log texts;

and encoding the log text to obtain a log text vector set.

3. The method as claimed in claim 1, wherein clustering the log text vector set to obtain a log clustering result further comprises,

clustering the log text vector set for multiple times through a k-means algorithm to obtain a plurality of groups of log classification results;

respectively calculating cosine similarity of each group of log classification results;

and selecting the log classification result with the largest cosine similarity sum as a log clustering result.

4. The method of claim 3, wherein clustering the log text vector set multiple times by a k-means algorithm further comprises,

selecting K points in the log text vector set as centroids, wherein the K points represent K clustering results;

assigning each log vector in the set of log text vectors to the nearest centroid, forming K clusters;

and recalculating the centroid of each cluster until the centroids no longer change, to obtain a log classification result.

5. The method as claimed in claim 3, wherein the cosine similarity of each group of log classification results is calculated by the following formula,

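$$\text{Total cosine similarity} = \sum_{i=1}^{K} \sum_{x \in C_i} \cos(x, c_i)$$ ;

$$c_i = \frac{1}{m_i} \sum_{x \in C_i} x$$ ;

wherein K is the number of clusters, x is a log text vector, C_i is the i-th cluster, c_i is the centroid of the cluster C_i, and m_i is the number of logs in the i-th cluster.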

6. The method of claim 1, wherein analyzing the log clustering results to obtain a log template further comprises,

calculating the vocabulary of different categories in the log clustering result, and selecting k words with the most occurrence times to generate a log template corresponding to the categories, wherein k is a natural number more than or equal to 1;

and matching the log text vectors in the category through the log template, and reducing the value of k and regenerating the log template corresponding to the category when all the log text vectors cannot be matched.

7. The method as claimed in claim 1, wherein the step of matching the inputted log according to the log template to obtain the log matching template list further comprises,

performing text vectorization processing on an input log to obtain a log text vector;

and traversing the log template corresponding to each category in the clustering result, matching the log text vectors, and recording the corresponding relation between the log and the log template in the log matching template list.

8. The method of claim 1, wherein analyzing the log clustering results to obtain the log template further comprises analyzing the log template to obtain a failure cause corresponding to the log template.

9. The method for rapidly classifying logs according to claim 8, wherein obtaining the log association template according to the log matching template list further comprises obtaining a fault reason of the log association template according to a fault reason corresponding to each log template associated in the log association template, and rapidly locating a fault.

10. A log fast classifying device is characterized by comprising,

the text vectorization unit is used for carrying out text vectorization processing on the log library to obtain a log text vector set;

the log association template generating unit is used for calculating the association between the log templates according to the log matching template list to obtain a log association template, and the log association template is used for finishing log classification;

the log association template generating unit further comprises a log template combination module, which consecutively selects m log templates from the log matching template list obtained by the log matching unit and performs an OR operation on them to obtain a plurality of log template combinations, and records the number of occurrences of each template combination, wherein m is the number of log templates set by a user;

the log association template generation unit further comprises an association template selection module which selects template combinations with the occurrence times more than or equal to a set occurrence time threshold value to obtain a plurality of log association templates;

the log association template generation unit further comprises a log classification module which classifies the log according to the log association template obtained by the association template selection module.

11. A computer arrangement comprising a memory, a processor, and a computer program stored on the memory, characterized in that the computer program, when executed by the processor, executes the instructions of the method according to any one of claims 1-7.

12. A computer storage medium on which a computer program is stored, characterized in that the computer program, when being executed by a processor of a computer device, executes instructions of a method according to any one of claims 1-7.