CN113986864A

CN113986864A - Log data processing method and device, electronic equipment and storage medium

Info

Publication number: CN113986864A
Application number: CN202111323610.1A
Authority: CN
Inventors: 张阳; 刘东阳
Original assignee: CCB Finetech Co Ltd
Current assignee: CCB Finetech Co Ltd
Priority date: 2021-11-11
Filing date: 2021-11-11
Publication date: 2022-01-28

Abstract

The disclosure provides a log data processing method which can be applied to the technical field of artificial intelligence and the technical field of finance. The log data processing method comprises the following steps: acquiring log data, wherein the log data comprises at least one log record; vectorizing each log record to obtain a vectorized log set containing log record vectors; clustering the log record vectors in the vectorized log set to form different log clusters, wherein the same log cluster comprises similar log records; identifying log records in the same log cluster to obtain a named entity; and generating a log template of the log cluster according to the named entity, wherein the log template is used for representing log structure characteristics of the log cluster. The present disclosure also provides a log data processing apparatus, a device, a storage medium, and a program product.

Description

Log data processing method and device, electronic equipment and storage medium

Technical Field

The present disclosure relates to the field of artificial intelligence and financial technologies, and more particularly, to a log data processing method, apparatus, device, medium, and program product.

Background

Logs are data generated by an information system. They record the running states of equipment, operating systems and application software, and reflect the actual internal condition of the information system. They are an important basis for daily operation and maintenance troubleshooting and for attack tracing analysis by information operation and maintenance managers and security managers, so the analysis and management of logs are becoming increasingly important. In the related art, logs are classified and identified: categories are established according to format, the corresponding data field processing rules are matched, and fields are extracted and assigned to designated fields so as to extract data.

In the process of implementing the technical scheme of the present disclosure, the inventor found that at least the following problems exist in the related art. Log formats need to be defined in advance, and each log format needs to be adapted. When a log format changes, new logs are identified with format errors and log identification fails, so technical personnel need to establish a data format matching rule again to solve the problem. The whole log analysis and processing process is therefore time-consuming, labor-intensive and inefficient.

Disclosure of Invention

In view of the above, the present disclosure provides a log processing method, apparatus, device, medium, and program product that improve log recognition speed.

According to a first aspect of the present disclosure, there is provided a log data processing method including:

acquiring log data, wherein the log data comprises at least one log record;

vectorizing each log record to obtain a vectorized log set containing log record vectors;

clustering the log record vectors in the vectorized log set to form different log clusters, wherein the same log cluster comprises similar log records;

identifying the log records in the same log cluster to obtain a named entity; and

generating a log template of the log cluster according to the named entity, wherein the log template is used for representing the log structure characteristics of the log cluster.

According to an embodiment of the present disclosure, the performing vectorization processing on each log record to obtain a vectorized log set including a log record vector includes:

preprocessing each log record;

performing word segmentation and stop-word removal on the preprocessed log records to obtain a word corpus;

inputting the word corpus into a vectorization model, vectorizing the words, and outputting a word vector library; wherein, the word vector library comprises a word vector corresponding to each word;

and determining a log record vector corresponding to each log record according to the word corresponding to each log record and the word vector corresponding to the word so as to obtain the vectorized log set.

According to an embodiment of the present disclosure, the clustering the log record vectors in the vectorized log set to form different log clusters includes:

clustering the log record vectors in the vectorized log set to form different vector clusters;

and determining the log record corresponding to each log record vector according to the log record vectors in the same vector cluster so as to form the log cluster corresponding to the vector cluster.

According to an embodiment of the present disclosure, the clustering the log record vectors in the vectorized log set to form different vector clusters includes:

determining the ε-neighborhood of each log record vector in the vectorized log set according to preset neighborhood parameters to obtain a core object set;

determining, according to a first core object in the core object set, the log record vectors in the vectorized log set that are density-reachable from the first core object to form a first vector cluster;

and determining, according to a second core object in the core object set, the log record vectors in the updated vectorized log set that are density-reachable from the second core object to form a second vector cluster, so as to obtain different vector clusters, wherein the updated vectorized log set is the vectorized log set with the log record vectors in the first vector cluster removed.

According to an embodiment of the present disclosure, the identifying the log records in the same log cluster to obtain a named entity includes:

training an initial model by using historical log data to obtain a named entity recognition model for recognizing the log records;

and inputting the log records in each log cluster into the named entity recognition model, and outputting named entities.

According to an embodiment of the present disclosure, the training of the initial model by using the historical log data to obtain the named entity recognition model for recognizing the log record includes:

determining an operation and maintenance entity in the information technology operation and maintenance field to form a log entity dictionary;

labeling the historical log data by adopting the log entity dictionary to form a labeling set;

constructing word characteristics and word boundary characteristics to form a characteristic set;

inputting the label set and the feature set into a target feature template to output a test data set;

and inputting the test data set into the initial model, and training the initial model to obtain the named entity recognition model.

According to an embodiment of the present disclosure, before the vectorizing processing is performed on each of the log records, the method further includes:

marking fields having a predetermined format by means of rule definitions or regular expressions, so as to mask the fields having the predetermined format.

According to an embodiment of the present disclosure, the fields having the predetermined format include a time, a special character, a universal resource identifier, an internet protocol address, a brace, a bracket, a parenthesis, an underscore, a slash, and a backslash.
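The masking step described above can be sketched as follows. This is only a minimal illustration: the regular expressions, the placeholder tokens such as `<TIME>`, and the function name `mask_fields` are assumptions made for the example and are not specified in the disclosure.

```python
import re

# Hypothetical rules: the disclosure lists the kinds of fields to mask,
# but not the concrete patterns or placeholder names used here.
MASK_RULES = [
    ("<TIME>", re.compile(r"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}")),
    ("<URI>", re.compile(r"[a-zA-Z][a-zA-Z0-9+.-]*://\S+")),
    ("<IP>", re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")),
    # Braces, brackets, parentheses, underscore, slash and backslash
    # are replaced by a space.
    (" ", re.compile(r"[{}\[\]()_/\\]")),
]


def mask_fields(record: str) -> str:
    """Apply each rule in order, replacing matched fields in the record."""
    for placeholder, pattern in MASK_RULES:
        record = pattern.sub(placeholder, record)
    return record


masked = mask_fields(
    "2021-11-11 08:00:01 GET http://a.example/x from 10.0.0.5 [ok]"
)
print(masked)
```

The rules are applied in a fixed order so that times, URIs and addresses are replaced before the single-character rule removes the slashes inside them.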

A second aspect of the present disclosure provides a log data processing apparatus including:

an acquisition module, configured to acquire log data, wherein the log data comprises at least one log record;

a vectorization module, configured to vectorize each log record to obtain a vectorized log set containing log record vectors;

a clustering module, configured to perform clustering processing on the log record vectors in the vectorized log set to form different log clusters, where the same log cluster includes similar log records;

a named entity identification module, configured to identify the log records in the same log cluster to obtain a named entity; and

a generating module, configured to generate a log template of the log cluster according to the named entity, wherein the log template is used for representing the log structure characteristics of the log cluster.

A third aspect of the present disclosure provides an electronic device, comprising: one or more processors; a memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the log data processing method.

A fourth aspect of the present disclosure also provides a computer-readable storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to perform the above-described log data processing method.

A fifth aspect of the present disclosure also provides a computer program product comprising a computer program which, when executed by a processor, implements the above-described log data processing method.

According to the embodiment of the disclosure, log data comprising at least one log record is acquired, and the log records are vectorized to obtain a vectorized log set comprising log record vectors; the log record vectors are then clustered to form log clusters containing similar log records; the log records in each log cluster are then identified to obtain named entities, and a log template characterizing the log structure features of the log cluster is generated according to the named entities. Because the log records are clustered into different log clusters and a log template is extracted for each log cluster, a large number of logs can be effectively structured and a large amount of redundant data removed. This enables intelligent analysis of the log data, reduces manual intervention, improves log analysis efficiency and operation and maintenance efficiency, provides an effective means of analysis for subsequent anomaly detection and automatic alarming, and helps ensure stable operation of the system.

Drawings

The foregoing and other objects, features and advantages of the disclosure will be apparent from the following description of embodiments of the disclosure, which proceeds with reference to the accompanying drawings, in which:

FIG. 1 schematically illustrates an application scenario diagram of a log data processing method, apparatus, device, medium, and program product according to an embodiment of the present disclosure;

FIG. 2 schematically shows a flow chart of a log data processing method according to an embodiment of the present disclosure;

FIG. 3 schematically illustrates a flow diagram of vectorizing log records;

FIG. 4 schematically shows a flow chart of a clustering method;

FIG. 5 schematically illustrates a flow chart of a method of determining a named entity recognition model for recognizing log records;

FIG. 6 schematically shows a block diagram of a log data processing apparatus according to an embodiment of the present disclosure;

FIG. 7 schematically shows a block diagram of a log data processing apparatus according to another embodiment of the present disclosure; and

FIG. 8 schematically shows a block diagram of an electronic device adapted to implement a method of log data processing according to an embodiment of the present disclosure.

Detailed Description

Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is illustrative only and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present disclosure.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.

All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It is noted that the terms used herein should be interpreted as having a meaning that is consistent with the context of this specification and should not be interpreted in an idealized or overly formal sense.

Where a convention analogous to "at least one of A, B and C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B and C" would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, C together, etc.).

A log is data generated by an information system and has the following characteristics. First, logs come from a wide range of sources: they include not only the logs and alarms generated by all devices and systems in a traditional information system environment, but also the large volume of logs generated by mobile clients and sensors. Second, the quantity is large: machines and systems generate data constantly, so the aggregate volume is large. Third, the formats are varied and not uniform: the formats generated by different systems differ, their meanings differ, and they place different demands on the professional skills of technicians. Fourth, storage is dispersed: different logs are stored in different systems and devices, and different methods are needed to collect and read them.

The log data reflects the actual internal condition of the information system, and the analysis of log data is an important basis for daily operation and maintenance troubleshooting and for attack tracing analysis by information operation and maintenance managers and security managers. In the related art, log data is mainly analyzed by classifying and identifying it: categories are established according to format, the corresponding data field processing rules are matched, and fields are extracted and assigned to designated fields to complete data extraction.

However, the scheme in the related art requires log formats to be defined in advance and each log format to be adapted. If a log format changes, new logs are identified with format errors and log identification as a whole fails; the problem can only be solved by establishing a new data format matching rule. Adaptability is therefore poor, and the whole process usually depends on technicians, which is time-consuming and labor-intensive, takes a long time, and results in a poor user experience.

In view of the above technical problem, the present disclosure provides a log data processing method, including: acquiring log data, wherein the log data comprises at least one log record; vectorizing each log record to obtain a vectorized log set containing log record vectors; clustering the log record vectors in the vectorized log set to form different log clusters, wherein the same log cluster comprises similar log records; identifying log records in the same log cluster to obtain a named entity; and generating a log template of the log cluster according to the named entity, wherein the log template is used for representing log structure characteristics of the log cluster.

It should be noted that the method and apparatus for processing log data in the embodiments of the present disclosure may be used in the financial field and the computer technology field, and may also be used in any technical field other than the financial field and the computer technology field.

In the technical scheme of the disclosure, the acquisition, storage, application and other handling of the personal information of the users involved all comply with the relevant laws and regulations, the necessary confidentiality measures are taken, and public order and good customs are not violated.

Fig. 1 schematically illustrates an application scenario diagram of a log data processing method, apparatus, device, medium, and program product according to an embodiment of the present disclosure.

As shown in fig. 1, the application scenario 100 according to this embodiment may include a network, a terminal device, and a server. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links, or fiber optic cables, among others.

The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have installed thereon various communication client applications, such as shopping applications, web browser applications, search applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only).

The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.

The server 105 may be a server providing various services, such as a background management server (for example only) providing support for websites browsed by users using the terminal devices 101, 102, 103. The background management server may analyze and otherwise process received data such as user requests, and feed back a processing result (e.g., a webpage, information, or data obtained or generated according to the user request) to the terminal device.

It should be noted that the log data processing method provided by the embodiment of the present disclosure may generally be executed by the server 105. Accordingly, the log data processing apparatus provided by the embodiment of the present disclosure may generally be disposed in the server 105. The log data processing method provided by the embodiment of the present disclosure may also be executed by a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105. Accordingly, the log data processing apparatus provided by the embodiment of the present disclosure may also be disposed in a server or a server cluster different from the server 105 and capable of communicating with the terminal devices 101, 102, 103 and/or the server 105.

It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

The log data processing method of the disclosed embodiment will be described in detail below through fig. 2 to 5 based on the scenario described in fig. 1.

Fig. 2 schematically shows a flow chart of a log data processing method according to an embodiment of the present disclosure.

As shown in fig. 2, the log data processing method of the embodiment includes operations S210 to S250, and the log data processing method may be performed by a terminal device or a server.

In operation S210, log data is acquired, wherein the log data includes at least one log record.

In operation S220, a vectorization process is performed on each log record to obtain a vectorized log set including a log record vector.

According to an embodiment of the present disclosure, the vectorization process may employ, for example, a Word2Vec word vector training model or a BERT (Bidirectional Encoder Representations from Transformers) word vector training model. The Word2Vec word vector training model uses a static vector representation that does not take context-dependent semantics into account: each word is represented by exactly one word vector, and the word vector obtained for a given word does not change. The BERT word vector training model uses a dynamic word vector representation that takes semantics into account: the same word obtains different vectorized representations depending on its context.

In operation S230, a clustering process is performed on the log record vectors in the vectorized log set to form different log clusters, where the same log cluster includes similar log records.

According to the embodiment of the disclosure, log records with high similarity are aggregated in the same log cluster through clustering processing, so that not only can log data be summarized, but also problem positioning and anomaly detection can be facilitated.

In operation S240, log records in the same log cluster are identified to obtain a named entity.

According to the embodiment of the disclosure, when identifying log records in the same log cluster, for example, a part of the log records may be selected for identification, or all the log records may be selected for identification.

According to an embodiment of the present disclosure, identifying log records in the same log cluster may include, for example, identifying named entities of log records in the same log cluster, that is, identifying named entities corresponding to the log records.

The process of specifically identifying the named entity corresponding to the log record includes: based on domain knowledge, determining a named entity of the domain log from operation and maintenance production practice, and forming the named entity into a log entity dictionary; then, marking historical log data by using a named entity in the log entity dictionary; then training the initial model by using the historical log data marked by the named entity to obtain a named entity recognition model; and then inputting the log records needing named entity recognition into the named entity recognition model obtained through training, and outputting the named entities corresponding to the log records, thereby realizing recognition of the log records.

According to embodiments of the present disclosure, a named entity may include, for example, a date, an internet protocol address, a port, a URL (Uniform Resource Locator), a level, a component, a connection_pid (pid is a process identifier), and the like.

According to embodiments of the present disclosure, a URL is an identification method for completely describing the addresses of web pages and other resources on the internet.

According to embodiments of the present disclosure, named entities may be accumulated from production practice based on log domain knowledge.

According to embodiments of the present disclosure, a named entity generally refers to an entity having a particular meaning or strong reference in the text, and generally includes a person name, a place name, an organization name, a date and time, a proper noun, and the like. Named entity recognition is the extraction of named entities from unstructured input text, and may identify more classes of named entities according to business needs. For named entities with strong regularity, such as percentage, website, E-mail and the like, processing can be carried out through a regular expression, and fragments which are not matched are delivered to a model to carry out log entity information extraction.

Named entity recognition is a sequence tagging problem: given a sequence, find the tag corresponding to each element in the sequence. For example, the input is the word sequence $s = \langle word_1, word_2, \ldots, word_n \rangle$ corresponding to a log record, and the output is the entity sequence $\langle entity_1, entity_2, \ldots, entity_n \rangle$ corresponding to the log record.

In operation S250, a log template of a log cluster is generated according to the named entity, wherein the log template is used for characterizing log structure features of the log cluster.

According to the embodiment of the disclosure, the log records are clustered into different log clusters, and named entities are obtained from each log cluster by named entity recognition so as to generate a log template for each log cluster. The log template structures the log data more effectively, removes a large amount of redundant data, and improves log analysis efficiency and operation and maintenance efficiency.
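One possible reading of operation S250 is sketched below. The disclosure does not spell out how the template is assembled from the named entities, so the rule used here (replace every token that carries an entity label with a placeholder for that label, then keep the most frequent result in the cluster) is an assumption, as are the function names and the sample records.

```python
from collections import Counter


def to_template(tokens, entity_labels):
    """Replace each token that carries an entity label with a <label> placeholder.

    The label "O" is used here (by assumption) for tokens that are not entities.
    """
    return " ".join(
        f"<{label}>" if label != "O" else token
        for token, label in zip(tokens, entity_labels)
    )


def cluster_template(tagged_records):
    """Pick the most frequent per-record template as the cluster's template."""
    counts = Counter(to_template(tokens, labels) for tokens, labels in tagged_records)
    return counts.most_common(1)[0][0]


# Hypothetical output of named entity recognition for one log cluster.
tagged = [
    (["2021-11-11", "ERROR", "connect", "10.0.0.5", "failed"],
     ["date", "level", "O", "ip", "O"]),
    (["2021-11-12", "ERROR", "connect", "10.0.0.9", "failed"],
     ["date", "level", "O", "ip", "O"]),
]
template = cluster_template(tagged)
print(template)  # <date> <level> connect <ip> failed
```

Because the variable parts of similar records are replaced by the same placeholders, the two records collapse into one template, which is the redundancy removal the paragraph above refers to.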

FIG. 3 schematically shows a flow diagram of a vectorization process for log records.

As shown in fig. 3, the vectorization process includes operations S310 to S340.

In operation S310, each log record is preprocessed.

According to an embodiment of the present disclosure, preprocessing each log record may include, for example, stemming, lemmatization, case conversion, and so on.

In operation S320, the preprocessed log records are subjected to word segmentation and stop-word removal to obtain a word corpus.

According to the embodiment of the disclosure, before the word segmentation and stop-word removal are performed, the method further comprises constructing a stop-word list for the log data field.

According to an embodiment of the present disclosure, the word segmentation and stop-word removal are performed on the preprocessed log records using, for example, a segmentation tool such as jieba or LTP.
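Operations S310 and S320 can be illustrated with a small standard-library sketch. It does not use jieba or LTP: a simple regular-expression split stands in for the segmentation tool, and the stop-word list and sample records are hypothetical.

```python
import re

# Hypothetical stop-word list; the disclosure builds one for the log data
# field but does not enumerate its contents.
STOP_WORDS = {"the", "a", "to", "from", "for", "of"}


def tokenize(record: str) -> list:
    """Lowercase the record, split it into alphanumeric words, drop stop words."""
    words = re.findall(r"[a-z0-9]+", record.lower())
    return [w for w in words if w not in STOP_WORDS]


records = ["Connection to the database failed", "Failed to open a connection"]
corpus = [tokenize(r) for r in records]
print(corpus)
# [['connection', 'database', 'failed'], ['failed', 'open', 'connection']]
```

Case conversion is applied before the split so that "Failed" and "failed" map to the same word in the corpus.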

In operation S330, the word corpus is input into the vectorization model, the words are vectorized, and a word vector library is output; the word vector library comprises a word vector corresponding to each word.

According to embodiments of the present disclosure, the vectorization model may include, for example, the Word2Vec model. The words in the word corpus are input into the Word2Vec model to be vectorized, and a word vector corresponding to each word is output.

In operation S340, according to the word corresponding to each log record and the word vector corresponding to the word, a log record vector corresponding to each log record is determined, so as to obtain a vectorized log set.

According to an embodiment of the present disclosure, the log record vector may be obtained as the vector sum of the word vectors of the words corresponding to each log record. For example, if the word vectors of the words corresponding to log record X are a, b, c, and d respectively, the log record vector corresponding to log record X is the vector sum of a, b, c, and d.
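The vector sum in operation S340 can be written out as a short sketch. The word vectors below are made-up two-dimensional values rather than the output of a trained model, and skipping words that are missing from the word vector library is an assumption of this example.

```python
def record_vector(words, word_vectors):
    """Sum the word vectors of a record's words element-wise."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    for word in words:
        vec = word_vectors.get(word)
        if vec is None:
            # Assumption: words without a vector are ignored.
            continue
        for k, value in enumerate(vec):
            total[k] += value
    return total


# Hypothetical word vector library with two-dimensional vectors.
word_vectors = {
    "connection": [1.0, 0.0],
    "failed": [0.0, 2.0],
    "database": [0.5, 0.5],
}
x = record_vector(["connection", "database", "failed"], word_vectors)
print(x)  # [1.5, 2.5]
```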

According to the embodiment of the disclosure, clustering the log record vectors in the vectorized log set to form different log clusters includes:

clustering the log record vectors in the vectorized log set to form different vector clusters;

and determining the log record corresponding to each log record vector according to the log record vectors in the same vector cluster to form the log cluster corresponding to the vector cluster.

According to the embodiment of the present disclosure, for example, the vectorized log set obtained in operation S220 is $D_M = (M_1, M_2, M_3, \ldots, M_n)$, where $M_1, M_2, M_3, \ldots, M_n$ are the log record vectors in the vectorized log set $D_M$ and $n$ represents the total number of log record vectors. The vectorized log set $D_M$ is clustered and thereby divided into a plurality of vector clusters $C_1, C_2, \ldots, C_m$, where $m$ represents the total number of vector clusters obtained by the clustering process. The log records corresponding to the log record vectors in vector cluster $C_1$ form the log cluster corresponding to vector cluster $C_1$.
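The mapping from vector clusters back to log clusters can be sketched as below. It assumes that the i-th log record corresponds to the i-th log record vector and that the clustering step returns one cluster label per vector; the function name and sample data are illustrative only.

```python
from collections import defaultdict


def build_log_clusters(records, labels):
    """Group log records by the cluster label assigned to their vectors.

    records[i] is assumed to be the log record whose vector received labels[i].
    """
    clusters = defaultdict(list)
    for record, label in zip(records, labels):
        clusters[label].append(record)
    return dict(clusters)


records = ["db connect failed", "db connect timeout", "user login ok"]
labels = [1, 1, 2]  # hypothetical vector cluster index per record
log_clusters = build_log_clusters(records, labels)
print(log_clusters)
# {1: ['db connect failed', 'db connect timeout'], 2: ['user login ok']}
```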

Fig. 4 schematically shows a flow chart of a clustering method.

As shown in FIG. 4, the log clustering method using DBSCAN (Density-Based Spatial Clustering of Applications with Noise) includes operations S410 to S430.

In operation S410, according to preset neighborhood parameters, the ε-neighborhood of each log record vector in the vectorized log set is determined to obtain a core object set.

According to an embodiment of the present disclosure, for example, the vectorized log set is a sample set $D = \{x_1, x_2, \ldots, x_m\}$, where $x_1, x_2, \ldots, x_m$ are log record vectors, and the neighborhood parameters are $(\epsilon, MinPts)$. For a log record vector $x_j \in D$, its $\epsilon$-neighborhood contains the samples in $D$ whose distance from $x_j$ is not greater than $\epsilon$. If the $\epsilon$-neighborhood of $x_j$ contains at least $MinPts$ samples, then $x_j$ is a core object. In this way, all the core objects in the sample set $D$ are determined to form the core object set.
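The core object test can be illustrated with a minimal sketch. The Euclidean distance is an assumption, since the disclosure does not name a distance measure, and the sample points and parameter values are invented for the example.

```python
import math


def eps_neighborhood(D, j, eps):
    """Indices of the samples in D whose distance from D[j] is at most eps.

    The sample D[j] itself is included, since its distance to itself is 0.
    """
    return [i for i, x in enumerate(D) if math.dist(x, D[j]) <= eps]


def core_objects(D, eps, min_pts):
    """Indices of the samples whose eps-neighborhood has at least min_pts samples."""
    return [j for j in range(len(D)) if len(eps_neighborhood(D, j, eps)) >= min_pts]


D = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0)]
cores = core_objects(D, eps=1.0, min_pts=3)
print(cores)  # [0]
```

Only the first sample is a core object here: its neighborhood holds three samples, while every other sample has at most two within distance 1.0.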

In operation S420, according to a first core object in the core object set, the log record vectors in the vectorized log set that are density-reachable from the first core object are determined to form a first vector cluster.

According to embodiments of the present disclosure, for example, for a core object $x_i$, if $x_j$ is in the $\epsilon$-neighborhood of $x_i$, then $x_j$ is said to be directly density-reachable from $x_i$. For $x_i$ and $x_j$, if there is a sample sequence $p_1, p_2, \ldots, p_n$ with $p_1 = x_i$ and $p_n = x_j$ such that $p_{k+1}$ is directly density-reachable from $p_k$ for $k = 1, \ldots, n-1$, then $x_j$ is said to be density-reachable from $x_i$.

According to an embodiment of the present disclosure, according to the core object $x_i$, the log record vectors that are density-reachable from the core object $x_i$ are determined to form the first vector cluster.

In operation S430, according to a second core object in the core object set, the log record vectors in the updated vectorized log set that are density-reachable from the second core object are determined to form a second vector cluster, so as to obtain different vector clusters, where the updated vectorized log set is the vectorized log set with the log record vectors in the first vector cluster removed.

According to the embodiment of the disclosure, after the log record vectors in the first vector cluster are removed from the vectorized log set, the log record vectors that are density-reachable from the second core object are determined according to the second core object to form the second vector cluster.

According to the embodiment of the present disclosure, each time a vector cluster is formed, the log record vectors in that vector cluster are removed from the vectorized log set, and operation S420 is then repeated on the updated vectorized log set until all core objects have been traversed or removed.
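Operations S410 to S430 can be put together in a compact sketch of the loop. It is an illustration under assumptions rather than the implementation of the disclosure: the Euclidean distance, the choice of the lowest-indexed remaining core object as the next seed, and the sample data are all assumed here.

```python
import math
from collections import deque


def dbscan(D, eps, min_pts):
    """Return the vector clusters as sorted lists of sample indices."""

    def neighbors(j):
        return [i for i in range(len(D)) if math.dist(D[i], D[j]) <= eps]

    # S410: determine the core object set.
    cores = {j for j in range(len(D)) if len(neighbors(j)) >= min_pts}
    unvisited = set(range(len(D)))
    remaining_cores = set(cores)
    clusters = []
    while remaining_cores:
        seed = min(remaining_cores)  # assumed selection rule for the next core object
        queue = deque([seed])
        cluster = {seed}
        unvisited.discard(seed)
        # S420 / S430: collect every sample density-reachable from the seed.
        while queue:
            q = queue.popleft()
            if q not in cores:
                continue  # only core objects extend reachability further
            for i in neighbors(q):
                if i in unvisited:
                    unvisited.discard(i)
                    cluster.add(i)
                    queue.append(i)
        clusters.append(sorted(cluster))
        # Remove the clustered vectors before handling the next core object.
        remaining_cores -= cluster
    return clusters


D = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5), (9, 9)]
clusters = dbscan(D, eps=1.0, min_pts=3)
print(clusters)  # [[0, 1, 2], [3, 4, 5]]
```

The last sample belongs to no cluster because it is not density-reachable from any core object; DBSCAN treats such samples as noise.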

According to the embodiment of the disclosure, the DBSCAN method can cluster dense data sets of any shape, the clustering result is not biased, the clustering calculation can be performed successively, serious degradation of the calculation speed is avoided, and the clustering effect is good.

According to the embodiment of the disclosure, identifying log records in the same log cluster to obtain a named entity comprises:

training the initial model by using historical log data to obtain a named entity recognition model for recognizing log records;

and inputting the log records in each log cluster into a named entity recognition model, and outputting the named entities.

FIG. 5 schematically illustrates a flow chart of a method of determining a named entity recognition model for recognizing log records.

As shown in FIG. 5, the method of determining a named entity recognition model for recognizing log records includes operations S510-S550.

In operation S510, an operation and maintenance entity of an information technology operation and maintenance domain is determined to form a log entity dictionary.

According to an embodiment of the present disclosure, an operation and maintenance entity may include, for example, a date, an Internet Protocol address, a port interface, a uniform resource locator, a level, a component, a process identifier, a person name, a place name, a time, and the like.

In operation S520, the historical log data is annotated with the log entity dictionary to form an annotation set.

According to the embodiment of the disclosure, the operation and maintenance entities in the log entity dictionary are used to annotate the historical log data to form the annotation set.

In operation S530, word features and word boundary features are constructed to form a feature set.

According to embodiments of the present disclosure, word features may include, for example, the part of speech, whether the word is a feature word, whether it is a place name, whether it ends a sentence, and the like. The word features include, for example, word-level features, dictionary features, and document-level features. Word-level features include whether the word ends with a period, whether it contains a number, its part of speech, word n-grams (a language model), and so on. Dictionary features rely on external dictionary definitions, such as predefined vocabularies. Document-level features, such as word frequency and co-occurring words, are computed over the entire document corpus. Part-of-speech features indicate whether a word is a noun, a verb, and so on. The word boundary feature labels the position of each word using the BIO labeling method, in which B marks the beginning of an entity, I marks the middle or end of an entity, and O marks a character that does not belong to any entity.
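As an illustration of the BIO labeling method, the following sketch converts entity spans into one tag per token. The function name, the span format, and the entity labels (`TIME`, `LEVEL`, `IP`) are hypothetical and chosen only for this example.

```python
def bio_tags(tokens, entities):
    """Return one BIO tag per token.

    entities is a list of (start, end, label) token spans with end exclusive.
    """
    tags = ["O"] * len(tokens)  # O: the token belongs to no entity
    for start, end, label in entities:
        tags[start] = f"B-{label}"  # B: first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"  # I: middle or last token of the entity
    return tags
```

For the tokens `["2021-11-11", "10:00:01", "ERROR", "connect", "to", "10.0.0.1", "failed"]` with spans `(0, 2, "TIME")`, `(2, 3, "LEVEL")`, and `(5, 6, "IP")`, the two time tokens are tagged `B-TIME` and `I-TIME`, and the tokens outside any span are tagged `O`.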

In operation S540, the annotation set and the feature set are input into the target feature template to output a test data set.

According to an embodiment of the present disclosure, the target feature templates may include, for example, a Unigram template and a Bigram template. The Unigram template generates the state feature functions s_l(y_i, x, i) of a CRF (conditional random field), where y_i is the label at the current position, x is the observation sequence, and i is the current node position. The Bigram template generates the transition feature functions t_k(y_i, y_(i-1), x, i), where y_i and y_(i-1) are the labels at the current and previous positions, x is the observation sequence, and i is the current node position.

According to the embodiment of the disclosure, the annotation set and the feature set are input into the target feature template and converted into a uniform representation format, which facilitates processing by the algorithm of the initial model.
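To make the two kinds of feature functions concrete, the sketch below generates state features from a window of observations around position i and transition features from the label pair at positions i-1 and i, then sums their weights into an unnormalized sequence score. The feature string format, the window offsets, and the weight dictionary are assumptions for illustration and are not taken from the disclosure.

```python
def unigram_features(x, i, y_i, offsets=(-1, 0, 1)):
    """State features s_l(y_i, x, i): observations near position i paired with label y_i."""
    feats = []
    for off in offsets:
        j = i + off
        token = x[j] if 0 <= j < len(x) else "<PAD>"  # pad outside the sequence
        feats.append(f"U[{off}]={token}|{y_i}")
    return feats


def bigram_features(x, i, y_prev, y_i):
    """Transition features t_k(y_i, y_(i-1), x, i): the label pair at positions i-1 and i."""
    return [f"B={y_prev}>{y_i}"]


def sequence_score(x, y, weights):
    """Unnormalized CRF score: the sum of the weights of all active features."""
    score = 0.0
    for i in range(len(x)):
        for feat in unigram_features(x, i, y[i]):
            score += weights.get(feat, 0.0)
        if i > 0:
            for feat in bigram_features(x, i, y[i - 1], y[i]):
                score += weights.get(feat, 0.0)
    return score
```

Training a CRF amounts to learning the weight attached to each such feature; features that never receive a weight contribute zero to the score.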

In operation S550, the test data set is input into the initial model, and the initial model is trained to obtain a named entity recognition model.

According to an embodiment of the present disclosure, the initial model may comprise a CRF model, for example.

According to an embodiment of the present disclosure, the test data set may include, for example, a training set and a test set; the training set is used to train the initial model, and the test set is used to test the accuracy of the named entity recognition model obtained by the training. The ratio of the training set to the test set may be 7:3.
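A 7:3 division of this kind can be sketched as follows. The function name, the shuffling step, and the fixed seed are assumptions for illustration; the disclosure specifies only the ratio.

```python
import random


def split_dataset(samples, train_ratio=0.7, seed=0):
    """Shuffle the samples reproducibly and split them into (training set, test set)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # local generator, global state untouched
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```

With ten annotated sequences, this yields seven for training and three for testing.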

According to an embodiment of the present disclosure, the named entity recognition model includes a loss function,

where D represents the training samples (m sequences): D = [(x₁, y₁), (x₂, y₂), …, (x_m, y_m)]. After the partial derivative with respect to λ is taken, the model parameters are optimized iteratively by a conventional optimization method such as gradient descent. Given an input sequence, the predicted output sequence, that is, the optimal sequence that maximizes the objective function, is computed; this is a dynamic programming problem and can be decoded using the Viterbi algorithm.
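The Viterbi decoding step can be sketched as below. The sketch assumes that the trained model has already been reduced to per-position label scores (`emissions`) and label-to-label scores (`transitions`); these names and the dictionary layout are hypothetical, and the input sequence is assumed to be non-empty.

```python
def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label sequence.

    emissions[t][y]   : score of label y at position t
    transitions[a][b] : score of moving from label a to label b
    """
    best = [{y: emissions[0][y] for y in labels}]  # best score of any path ending in y
    back = []  # back[t-1][y]: best predecessor of label y at position t
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + transitions[p][y])
            scores[y] = best[-1][prev] + transitions[prev][y] + emissions[t][y]
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final label to recover the whole path.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]
```

Because each position keeps only the best score per label, the cost grows linearly with the sequence length and quadratically with the number of labels, rather than exponentially as in exhaustive search.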

According to the embodiment of the present disclosure, before vectorizing processing is performed on each log record, the method further includes:

marking fields having a preset format by means of rule definitions or regular expressions, so as to mask the fields having the preset format.

According to the embodiment of the disclosure, masking the fields having the preset format helps to improve the accuracy of clustering.

According to an embodiment of the present disclosure, the fields having the preset format include time, special characters, uniform resource identifiers, Internet Protocol addresses, braces, square brackets, parentheses, underscores, slashes, and backslashes.
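Masking with regular expressions might look like the following sketch. The patterns, the placeholder tokens (`<TIME>`, `<URI>`, `<IP>`), and the choice to replace bracket and separator characters with spaces are all assumptions for illustration; the disclosure does not specify concrete rules.

```python
import re

# Hypothetical rules, applied in order: URIs are masked before IP addresses
# so that an address embedded in a URI is not masked twice.
MASK_RULES = [
    (re.compile(r"\b\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}\b"), "<TIME>"),
    (re.compile(r"\b[a-zA-Z][a-zA-Z0-9+.-]*://\S+"), "<URI>"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "<IP>"),
    (re.compile(r"[{}()\[\]_/\\]"), " "),  # braces, brackets, underscore, slashes
]


def mask_fields(record):
    """Replace preset-format fields with placeholders and collapse whitespace."""
    for pattern, placeholder in MASK_RULES:
        record = pattern.sub(placeholder, record)
    return " ".join(record.split())
```

After masking, records that differ only in timestamps, addresses, or resource identifiers map to the same token sequence, which is why such masking can help similar records fall into the same cluster.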

Based on the log data processing method, the disclosure also provides a log data processing device. The apparatus will be described in detail below with reference to fig. 6.

Fig. 6 schematically shows a block diagram of a log data processing apparatus according to an embodiment of the present disclosure.

As shown in fig. 6, the log data processing apparatus 600 of this embodiment includes an obtaining module 610, a vectorization module 620, a clustering module 630, a named entity identification module 640, and a generation module 650.

The obtaining module 610 is configured to obtain log data, where the log data includes at least one log record. In an embodiment, the obtaining module 610 may be configured to perform the operation S210 described above, which is not described herein again.

The vectorization module 620 is configured to perform vectorization processing on each log record to obtain a vectorized log set including a log record vector. In an embodiment, the vectorization module 620 may be configured to perform the operation S220 described above, which is not described herein again.

The clustering module 630 is configured to perform clustering processing on the log record vectors in the vectorized log set to form different log clusters, where the same log cluster includes similar log records. In an embodiment, the clustering module 630 may be configured to perform the operation S230 described above, which is not described herein again.

The named entity identifying module 640 is configured to identify log records in the same log cluster to obtain a named entity. In an embodiment, the named entity identifying module 640 may be configured to perform the operation S240 described above, which is not described herein again.

The generating module 650 is configured to generate a log template of a log cluster according to the named entity, where the log template is used to characterize a log structure of the log cluster. In an embodiment, the generating module 650 may be configured to perform the operation S250 described above, which is not described herein again.

According to the embodiment of the disclosure, the vectorization module comprises a preprocessing unit, a word segmentation and stop word removal unit, a vectorization unit, and a first determining unit.

The preprocessing unit is configured to preprocess each log record.

The word segmentation and stop word removal unit is configured to perform word segmentation and stop word removal on the preprocessed log records to obtain a word stock.

The vectorization unit is configured to input the word stock into a vectorization model, vectorize the words, and output a word vector library, where the word vector library comprises a word vector corresponding to each word.

The first determining unit is configured to determine the log record vector corresponding to each log record according to the words of that log record and the word vectors corresponding to those words, so as to obtain the vectorized log set.
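One simple way to derive a log record vector from word vectors is to average the vectors of the record's words, as sketched below. Averaging, whitespace segmentation, and lowercasing are assumptions made for this example; the passage above states only that the record vector is determined from the words and their word vectors.

```python
def record_vector(words, word_vectors):
    """Average the vectors of a record's words; words without a vector are skipped."""
    known = [word_vectors[w] for w in words if w in word_vectors]
    if not known:
        return None  # no word of the record has a vector
    dim = len(known[0])
    return [sum(v[d] for v in known) / len(known) for d in range(dim)]


def vectorize_logs(records, word_vectors, stop_words=frozenset()):
    """Map each log record to a vector after segmentation and stop word removal."""
    vectors = []
    for record in records:
        words = [w for w in record.lower().split() if w not in stop_words]
        vectors.append(record_vector(words, word_vectors))
    return vectors
```

Here `word_vectors` stands in for the word vector library output by the vectorization model, mapping each word to a list of floats of equal length.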

According to an embodiment of the present disclosure, the clustering module includes a clustering unit and a second determining unit.

The clustering unit is configured to cluster the log record vectors in the vectorized log set to form different vector clusters.

The second determining unit is configured to determine, according to the log record vectors in the same vector cluster, the log record corresponding to each log record vector, so as to form the log cluster corresponding to the vector cluster.

According to an embodiment of the present disclosure, the clustering unit includes a first determining subunit, a second determining subunit, and a third determining subunit.

The first determining subunit is configured to determine, according to preset neighborhood parameters, the ε-neighborhood of each log record vector in the vectorized log set to obtain a core object set.

The second determining subunit is configured to determine, according to a first core object in the core object set, the log record vectors in the vectorized log set that are density-reachable from the first core object, so as to form a first vector cluster.

The third determining subunit is configured to determine, according to a second core object in the core object set, the log record vectors in the updated vectorized log set that are density-reachable from the second core object, to form a second vector cluster, so as to obtain different vector clusters, where the updated vectorized log set is the vectorized log set from which the log record vectors in the first vector cluster have been removed.

According to an embodiment of the present disclosure, the named entity recognition module includes a model training unit and an input unit.

The model training unit is configured to train the initial model by using the historical log data to obtain a named entity recognition model for recognizing the log records.

The input unit is configured to input the log records in each log cluster into the named entity recognition model and output the named entities.

According to the embodiment of the disclosure, the model training unit comprises a fourth determining subunit, a labeling subunit, a construction subunit, an input subunit, and a model training subunit.

The fourth determining subunit is configured to determine the operation and maintenance entities of the information technology operation and maintenance field to form a log entity dictionary.

The labeling subunit is configured to annotate the historical log data with the log entity dictionary to form an annotation set.

The construction subunit is configured to construct the word features and the word boundary features to form a feature set.

The input subunit is configured to input the annotation set and the feature set into the target feature template so as to output the test data set.

The model training subunit is configured to input the test data set into the initial model and train the initial model to obtain the named entity recognition model.

Fig. 7 schematically shows a block diagram of a log data processing apparatus according to another embodiment of the present disclosure.

As shown in fig. 7, the log data processing apparatus 600 of this embodiment further includes a marking module 660 in addition to the above-mentioned obtaining module 610, vectorization module 620, clustering module 630, named entity identification module 640, and generation module 650.

The marking module 660 is configured to mark fields having a preset format by means of rule definitions or regular expressions, so as to mask the fields having the preset format.

According to the embodiment of the present disclosure, any plurality of the obtaining module 610, the vectorization module 620, the clustering module 630, the named entity recognition module 640, the generation module 650, and the marking module 660 may be combined and implemented in one module, or any one of them may be split into a plurality of modules. Alternatively, at least part of the functionality of one or more of these modules may be combined with at least part of the functionality of the other modules and implemented in one module. According to an embodiment of the present disclosure, at least one of the obtaining module 610, the vectorization module 620, the clustering module 630, the named entity recognition module 640, the generation module 650, and the marking module 660 may be implemented at least in part as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, or an Application Specific Integrated Circuit (ASIC), or may be implemented in hardware or firmware in any other reasonable manner of integrating or packaging a circuit, or in any one of the three implementations of software, hardware, and firmware, or in a suitable combination of any of them. Alternatively, at least one of the obtaining module 610, the vectorization module 620, the clustering module 630, the named entity recognition module 640, the generation module 650, and the marking module 660 may be implemented at least in part as a computer program module that performs the corresponding functions when executed.

As shown in fig. 8, an electronic device 800 according to an embodiment of the present disclosure includes a processor 801 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)802 or a program loaded from a storage section 808 into a Random Access Memory (RAM) 803. The processor 801 may include, for example, a general purpose microprocessor (e.g., CPU), an instruction set processor and/or associated chipset, and/or a special purpose microprocessor (e.g., Application Specific Integrated Circuit (ASIC)), among others. The processor 801 may also include onboard memory for caching purposes. The processor 801 may include a single processing unit or multiple processing units for performing different actions of the method flows according to embodiments of the present disclosure.

In the RAM 803, various programs and data necessary for the operation of the electronic apparatus 800 are stored. The processor 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. The processor 801 performs various operations of the method flows according to the embodiments of the present disclosure by executing programs in the ROM 802 and/or RAM 803. Note that the programs may also be stored in one or more memories other than the ROM 802 and RAM 803. The processor 801 may also perform various operations of method flows according to embodiments of the present disclosure by executing programs stored in the one or more memories.

According to an embodiment of the present disclosure, the electronic device 800 may also include an input/output (I/O) interface 805, which is also connected to the bus 804. The electronic device 800 may also include one or more of the following components connected to the I/O interface 805: an input section 806 including a keyboard, a mouse, and the like; an output section 807 including a display such as a cathode ray tube (CRT) or a liquid crystal display (LCD), and a speaker; a storage section 808 including a hard disk and the like; and a communication section 809 including a network interface card such as a LAN card, a modem, or the like. The communication section 809 performs communication processing via a network such as the Internet. A drive 810 is also connected to the I/O interface 805 as necessary. A removable medium 811, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 810 as necessary, so that a computer program read from it is installed into the storage section 808 as necessary.

The present disclosure also provides a computer-readable storage medium, which may be contained in the apparatus/device/system described in the above embodiments; or may exist separately and not be assembled into the device/apparatus/system. The computer-readable storage medium carries one or more programs which, when executed, implement the method according to an embodiment of the disclosure.

According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to embodiments of the present disclosure, a computer-readable storage medium may include the ROM 802 and/or RAM 803 described above and/or one or more memories other than the ROM 802 and RAM 803.

Embodiments of the present disclosure also include a computer program product comprising a computer program containing program code for performing the method illustrated in the flow chart. When the computer program product runs in a computer system, the program code causes the computer system to implement the log data processing method provided by the embodiments of the disclosure.

The computer program performs the above-described functions defined in the system/apparatus of the embodiments of the present disclosure when executed by the processor 801. The systems, apparatuses, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the present disclosure.

In one embodiment, the computer program may be hosted on a tangible storage medium such as an optical storage device, a magnetic storage device, or the like. In another embodiment, the computer program may also be transmitted in the form of a signal on a network medium, distributed, downloaded and installed via communication section 809, and/or installed from removable media 811. The computer program containing program code may be transmitted using any suitable network medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.

In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 809 and/or installed from the removable medium 811. The computer program, when executed by the processor 801, performs the above-described functions defined in the system of the embodiments of the present disclosure. The systems, devices, apparatuses, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the present disclosure.

In accordance with embodiments of the present disclosure, program code for executing the computer programs provided by embodiments of the present disclosure may be written in any combination of one or more programming languages; in particular, these computer programs may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. The programming languages include, but are not limited to, Java, C++, Python, C, and the like. The program code may execute entirely on the user computing device, partly on the user device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the Internet using an Internet service provider).

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

Those skilled in the art will appreciate that the features recited in the various embodiments and/or claims of the present disclosure can be combined and/or associated in various ways, even if such combinations or associations are not expressly recited in the present disclosure. In particular, such combinations and/or associations may be made without departing from the spirit or teaching of the present disclosure. All such combinations and/or associations are within the scope of the present disclosure.

The embodiments of the present disclosure have been described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described separately above, this does not mean that the measures in the embodiments cannot be used in advantageous combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be devised by those skilled in the art without departing from the scope of the present disclosure, and such alternatives and modifications are intended to be within the scope of the present disclosure.

Claims

1. A log data processing method comprises the following steps:

acquiring log data, wherein the log data comprises at least one log record;

2. The method of claim 1, wherein the vectorizing each of the log records to obtain a vectorized log set comprising a vector of log records comprises:

preprocessing each log record;

performing word segmentation and stop word removal processing on the preprocessed log records to obtain a word stock;

inputting the word stock into a vectorization model, vectorizing the words, and outputting a word vector library; wherein the word vector library comprises a word vector corresponding to each word;

and determining a log record vector corresponding to each log record according to the word corresponding to each log record and the word vector corresponding to the word to obtain the vectorized log set.

3. The method of claim 1, wherein the clustering of the log record vectors in the vectorized log set to form different log clusters comprises:

4. The method of claim 3, wherein the clustering of the log record vectors in the vectorized log set to form different vector clusters comprises:

determining the ε-neighborhood of each log record vector in the vectorized log set according to preset neighborhood parameters to obtain a core object set;

determining, according to a second core object in the core object set, the log record vectors in the updated vectorized log set that are density-reachable from the second core object, to form a second vector cluster, thereby obtaining different vector clusters, wherein the updated vectorized log set is the vectorized log set from which the log record vectors in the first vector cluster have been removed.

5. The method of claim 1, wherein identifying the log records in the same log cluster to obtain a named entity comprises:

training an initial model by using historical log data to obtain a named entity recognition model for recognizing the log record;

6. The method of claim 5, wherein the training of an initial model with historical log data to obtain a named entity recognition model for recognizing the log records comprises:

7. The method of claim 1, further comprising, prior to the vectorizing processing each of the log records:

8. The method of claim 7, wherein the fields having the preset format include time, special characters, uniform resource identifiers, Internet Protocol addresses, braces, parentheses, underscores, slashes, and backslashes.

9. A log data processing apparatus comprising:

a clustering module configured to cluster the log record vectors in the vectorized log set to form different log clusters, wherein the same log cluster comprises similar log records;

a named entity identification module configured to identify the log records in the same log cluster to obtain a named entity; and

10. An electronic device, comprising:

one or more processors;

a storage device for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method of any of claims 1-8.

11. A computer readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to perform the method of any one of claims 1 to 8.

12. A computer program product comprising a computer program which, when executed by a processor, implements a method according to any one of claims 1 to 8.