CN117436440A - Log identification method, system, terminal equipment and storage medium - Google Patents

Info

Publication number
CN117436440A
Authority
CN
China
Prior art keywords
log
word segmentation
identified
key
content
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311563040.2A
Other languages
Chinese (zh)
Inventor
李妙杏
傅宇
陈澄广
黄滔
黄桂泉
杨盛辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Yitong Hengrui Technology Co ltd
Original Assignee
Guangdong Yitong Hengrui Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Yitong Hengrui Technology Co ltd filed Critical Guangdong Yitong Hengrui Technology Co ltd
Priority to CN202311563040.2A priority Critical patent/CN117436440A/en
Publication of CN117436440A publication Critical patent/CN117436440A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a log identification method, a system, terminal equipment and a storage medium, wherein the method comprises the following steps: extracting a message body from a log to be identified, and removing variable information in the message body to obtain the trunk content of the log; inputting the trunk content of the log into a word segmentation model for word segmentation processing to obtain a word segmentation dictionary of the log to be identified; inputting the word segmentation dictionary of the log to be identified into a classification model for classification processing to obtain keywords of the log to be identified; matching the key words of the log to be identified with key fields in a pre-configured key log table to obtain matched log content, setting a label for the matched log content according to the corresponding relation between the key fields in the key log table, the logic expression and the label, and outputting the identification result of the log to be identified. By adopting the technical scheme of the invention, the accuracy of log screening and the efficiency of log processing can be improved.

Description

Log identification method, system, terminal equipment and storage medium
Technical Field
The invention relates to the technical fields of log processing, log anomaly detection, machine learning and the like, in particular to a log identification method.
Background
With the rapid development of internet technology, communication systems run on huge computer clusters that generate hundreds of millions of log records per hour, and log formats differ between manufacturers, so how to read massive log data quickly has become an urgent problem. The common way to simplify logs at present is either to set log keywords and screen logs with regular expressions to select the records of interest, or to use a classification method, thereby reducing the number of logs and making them readable.
However, existing classification methods do not actually screen logs: the real number of logs is not reduced, the logs are merely divided into smaller groups to narrow the reading range, the number of logs in each group depends on the classification depth, and logs with specific characteristics cannot be extracted accurately. Setting keywords and filtering logs with regular expressions does solve the problem of filtering granularity, but the filtering efficiency is low. If several labels have to be added to one log, the logic expression becomes very complex, and adding a new label requires modifying the underlying code, so this approach cannot be applied to a real-time log processing system.
Disclosure of Invention
The invention aims to solve the technical problem of providing a log identification method, a system, a terminal device and a storage medium, which can improve the accuracy of log screening, quickly complete log identification and can be applied to a real-time log processing system.
To solve the above technical problem, in a first aspect, the present invention provides a log identifying method, including:
extracting a message body from a log to be identified, and removing variable information in the message body to obtain the trunk content of the log;
inputting the trunk content of the log into a word segmentation model for word segmentation processing to obtain a word segmentation dictionary of the log to be identified;
inputting the word segmentation dictionary of the log to be identified into a classification model for classification processing to obtain keywords of the log to be identified;
when the key words of the log to be identified are matched with key fields in a pre-configured key log table, obtaining matched log content, setting a label for the matched log content according to the corresponding relation between the logic expression of the key fields in the key log table and the label, and outputting the identification result of the log to be identified.
Preferably, the variable information in the message body includes at least one of a timestamp, IP address information, entity name and log category generated by the log to be identified.
Preferably, the word segmentation model is obtained through training of the following steps:
based on the SentencePiece algorithm, word segmentation training is carried out at the word level to obtain a trained word segmentation model.
Preferably, before inputting the word segmentation dictionary of the log to be identified into a classification model for classification processing, the method further comprises:
and comparing the words in the word segmentation dictionary of the log to be recognized with key fields in a preset key log table to obtain an optimized word segmentation dictionary of the log to be recognized.
Preferably, the classification model is obtained by training the following steps:
constructing a neural network model based on a Transformer;
the neural network model carries out classification training according to the frequency of word occurrence in the word segmentation dictionary, and a trained classification model is obtained; wherein the lower the frequency of occurrence of the word, the higher the probability that the word becomes a keyword.
Preferably, after the outputting of the identification result of the log to be identified, the method further includes:
de-duplicating the logs to be identified to form a log trunk template set;
using the log trunk template set as a training sample to perform word segmentation training to obtain an optimized word segmentation model; the optimized word segmentation model is used for word segmentation processing of the trunk content of the log to be identified next time.
Preferably, the number of the labels set for the matched log content is related to the corresponding relation between the key fields recorded in the key log table and the logic expressions and the labels, and one or more labels are set for the matched log content at one time according to the corresponding relation.
In a second aspect, the present invention provides a log identifying system, configured to implement the log identifying method according to any one of the first aspect, including:
the extraction module is used for extracting a message body from the log to be identified, removing variable information in the message body and obtaining the trunk content of the log;
the word segmentation module is used for inputting the trunk content of the log into a word segmentation model for word segmentation processing to obtain a word segmentation dictionary of the log to be identified;
the classification module is used for inputting the word segmentation dictionary of the log to be identified into a classification model for classification processing to obtain keywords of the log to be identified;
the identification module is used for matching the key words of the log to be identified with key fields in a pre-configured key log table to obtain matched log contents, setting labels for the matched log contents according to the key fields in the key log table and the corresponding relations between the logic expressions and the labels, and outputting the identification results of the log to be identified.
In a third aspect, the present invention also provides a terminal device, including a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, where the processor implements the log identifying method according to any one of the above when executing the computer program.
In a fourth aspect, the present invention further provides a computer readable storage medium, where the computer readable storage medium includes a stored computer program, and when the computer program runs, the device on which the computer readable storage medium resides is controlled to execute the log identifying method according to any one of the above.
Compared with the prior art, the invention has the following beneficial effects:
The invention provides a log identification method, a system, a terminal device and a storage medium. The method comprises: extracting a message body from a log to be identified and removing variable information in the message body to obtain log trunk content; inputting the log trunk content into a word segmentation model for word segmentation processing to obtain a word segmentation dictionary of the log to be identified; inputting the word segmentation dictionary into a classification model for classification processing to obtain keywords of the log to be identified; and finally matching the keywords with key fields in a pre-configured key log table to obtain matched log content, setting a label for the matched log content according to the correspondence between the key fields, the logic expressions and the labels in the key log table, and outputting an identification result of the log to be identified. In the invention, a word segmentation model based on the SentencePiece algorithm is trained to segment words and output a word segmentation dictionary, which improves the accuracy of log word segmentation; words are classified by a trained classification model that supports updating through a feedback mechanism, so the method can be applied to a real-time log processing system; finally, according to the correspondence in the pre-configured key log table, one or more pieces of label information can be set for a log at one time, which realizes rapid identification of the log.
Drawings
FIG. 1 is a flow chart of a preferred embodiment of a log identification method provided by the present invention;
FIG. 2 is a log to be identified according to a preferred embodiment of a log identification method provided by the present invention;
FIG. 3 is a log backbone content of a preferred embodiment of a log identification method provided by the present invention;
FIG. 4 is a word segmentation dictionary of a log to be identified according to a preferred embodiment of a log identification method provided by the present invention;
FIG. 5 is a key log table of a preferred embodiment of a log identification method provided by the present invention;
FIG. 6 is a word segmentation dictionary of an optimized log to be identified according to a preferred embodiment of a log identification method provided by the present invention;
FIG. 7 is a diagram showing keywords and non-keywords of a log to be identified according to a preferred embodiment of a log identifying method provided by the present invention;
FIG. 8 is a log identification result of a preferred embodiment of a log identification method provided by the present invention;
FIG. 9 is a block diagram of a preferred embodiment of a log identification system provided by the present invention;
FIG. 10 is a block diagram of a preferred embodiment of a terminal device according to the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, are intended to fall within the scope of the present invention.
Referring to fig. 1 to 8, fig. 1 is a flowchart of a preferred embodiment of a log identifying method provided by the present invention, and the method includes steps S1 to S4:
step S1, extracting a message body from a log to be identified, and removing variable information in the message body to obtain a log trunk content;
it will be appreciated that, for a 5G computer cluster, the generated logs include running records of computer hardware, running records of computer software, running records of switch virtualization software, and so on. For security reasons, a 5G network includes equipment from multiple manufacturers; logs are internal records of the equipment with no standardized specification, so log format and content can differ greatly between manufacturers and between equipment models. In step S1, when a new log is collected as the log to be identified, the scheme extracts key information such as the log message body from the collected log and, at the same time, removes the interfering variable information from the message body.
It should be noted that the variable information in the message body includes the timestamp at which the log to be identified was generated, IP address information, an entity name, a log category, the module that generated the log, and the like.
Referring to fig. 2, fig. 2 is a log to be identified according to a preferred embodiment of the log identifying method provided by the present invention. The log to be identified mainly comprises computer running records from different manufacturers, and each record includes the timestamp at which the log was generated, IP address information, an entity name, the module that generated the log, the trunk content of the log, and the like. Referring to fig. 3, fig. 3 is the log trunk content of a preferred embodiment of the log identifying method provided by the present invention. In implementation, the message body is extracted from the log to be identified shown in fig. 2, and variable information such as the time, IP address and entity name in the message body is removed to obtain the log trunk content shown in fig. 3.
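The variable-removal part of step S1 can be illustrated with a minimal Python sketch. The regular expressions, the name VARIABLE_PATTERNS and the function extract_trunk are assumptions made for illustration only; the source does not specify how variable information is detected, and real logs from different manufacturers would need their own rules (entity names, for instance, are not handled here).

```python
import re

# Hypothetical patterns for variable fields; these are illustrative
# assumptions, not rules taken from the source.
VARIABLE_PATTERNS = [
    # Timestamps such as 2023-11-21 10:15:02 or 2023-11-21T10:15:02.123
    re.compile(r"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}(?:\.\d+)?"),
    # IPv4 addresses with an optional port
    re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}(?::\d+)?\b"),
    # Remaining standalone numbers (identifiers, counters)
    re.compile(r"\b\d+\b"),
]


def extract_trunk(message_body: str) -> str:
    """Remove variable information and return the log trunk content."""
    trunk = message_body
    for pattern in VARIABLE_PATTERNS:
        trunk = pattern.sub(" ", trunk)
    # Collapse the whitespace left behind by the removed fields.
    return " ".join(trunk.split())
```

The patterns are applied in order, so a timestamp is removed as a whole before the generic number pattern can split it into pieces.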
S2, inputting the trunk content of the log into a word segmentation model for word segmentation processing to obtain a word segmentation dictionary of the log to be identified;
specifically, the word segmentation model is obtained through training of the following steps:
based on the SentencePiece algorithm, word segmentation training is carried out at the word level to obtain a trained word segmentation model.
It should be noted that an unsupervised tokenizer based on the SentencePiece algorithm is constructed; segmentation is performed at the word level, the vocabulary size is set mainly on the principle of keeping words intact, and a word segmentation dictionary is output.
It should be noted that key fields in the key log table can take many forms. A key field may be a single word, such as "restart"; a logical combination of several words, such as "full" AND "pool" AND "FAILED" AND "APN"; or a phrase or a logical combination involving a phrase, such as "upload file failed" AND "taskId". General-purpose tokenizers are currently not suited to such flexible keyword settings, so the scheme trains a dedicated word segmentation model for 5G communication network logs based on the SentencePiece algorithm and uses it to segment the log trunk content, which gives more accurate word segmentation results and solves the low matching rate of general-purpose word segmentation models.
Referring to fig. 4, fig. 4 is a word segmentation dictionary of the log to be identified according to a preferred embodiment of the log identifying method provided by the present invention. Taking the log to be identified of fig. 2 as an example, the log trunk content shown in fig. 3 is input into the word segmentation model for word segmentation processing to obtain the vocabulary of the word segmentation dictionary shown in fig. 4, for example words such as "complete", "done", "Delete", "User", "ADD" and "bindbndupfswitch", which are segmented on the principle of keeping words intact.
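The idea of a word-level dictionary with a bounded vocabulary size can be sketched in plain Python. This is a simplified stand-in that splits on whitespace and keeps the most frequent whole words; it is not the SentencePiece training used in the scheme, and the function name build_word_dictionary and its parameters are assumptions for illustration.

```python
from collections import Counter


def build_word_dictionary(trunk_lines, vocab_size):
    """Build a word-level dictionary from log trunk content.

    Words are kept whole (never split into sub-word pieces) and the
    dictionary is capped at vocab_size entries, most frequent first.
    """
    counts = Counter()
    for line in trunk_lines:
        counts.update(line.split())
    return [word for word, _ in counts.most_common(vocab_size)]
```

Capping the vocabulary by frequency is only one possible policy; the source states that the vocabulary size is chosen to keep words intact, without giving a concrete rule.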
After the log trunk content is input into a word segmentation model to perform word segmentation processing to obtain a word segmentation dictionary of the log to be identified, in order to further improve the accuracy of word segmentation, the method further comprises the following steps:
and comparing the words in the word segmentation dictionary of the log to be recognized with key fields in a preset key log table to obtain an optimized word segmentation dictionary of the log to be recognized.
Specifically, after the word segmentation dictionary is preliminarily obtained, the key fields in the key log table can be queried and compared with the words in the word segmentation dictionary. When a word in the word segmentation dictionary matches part of a key field in the key log table but is not identical to that key field, the word is converted according to the key field so that it becomes identical to it, giving the corrected word segmentation dictionary. When the conversion is ambiguous, the longest-match rule is applied.
Referring to fig. 5, fig. 5 is a key log table of a preferred embodiment of the log identification method provided by the present invention; the key log table records label information, manufacturer, entity type, log category, key field, logic expression and other contents. Referring to fig. 6, fig. 6 is the optimized word segmentation dictionary of the log to be identified according to a preferred embodiment of the log identification method provided by the present invention. In a specific implementation, the key log table shown in fig. 5 is queried to obtain key fields such as "limit done", "Delete User Template" and "ADD BINDUPFSWITCH", and these key fields are compared with the words in the word segmentation dictionary shown in fig. 4. For example, the words "limit" and "done" in the word segmentation dictionary are compared with the key field "limit done" and then converted according to that key field, so that the optimized word segmentation dictionary shown in fig. 6 contains the word "limit done". In addition, the conversion may be ambiguous: for example, only the two words "Delete" and "User" exist in the word segmentation dictionary of fig. 4, while the corresponding key field in the key log table combines words that exist in fig. 4 with a word that does not. If that key field is "Delete User Template", the two words are converted according to the complete key field, giving the corrected word "Delete User Template" in the word segmentation dictionary of fig. 6.
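The dictionary optimization described above can be sketched as follows. This is a minimal illustration under stated assumptions: a word "partially matches" a key field when it equals one of the key field's space-separated parts, and the longest rule is read as "prefer the key field with the most characters". The function name optimize_dictionary is hypothetical, and the source does not define either notion more precisely.

```python
def optimize_dictionary(words, key_fields):
    """Convert dictionary words to the key fields they partially match."""
    key_field_set = set(key_fields)
    optimized = []
    for word in words:
        if word in key_field_set:
            # Already identical to a key field: keep it unchanged.
            replacement = word
        else:
            # Key fields that contain this word as one of their parts.
            candidates = [f for f in key_fields if word in f.split()]
            # Assumed longest rule: pick the longest candidate key field.
            replacement = max(candidates, key=len) if candidates else word
        if replacement not in optimized:
            optimized.append(replacement)
    return optimized
```

With the key fields "Delete User Template" and "ADD BINDUPFSWITCH", the words "Delete" and "User" both collapse into the single entry "Delete User Template", while a word such as "dateTime" that matches no key field is left as it is.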
S3, inputting the word segmentation dictionary of the log to be identified into a classification model for classification processing to obtain keywords of the log to be identified;
it should be noted that the words in the word segmentation dictionary obtained in step S2 include both keywords and non-keywords, so in step S3 the word segmentation dictionary is further classified and the keywords are picked out, which improves the matching efficiency of the subsequent steps and therefore the processing efficiency of the log.
Specifically, the classification model is obtained through training of the following steps:
constructing a neural network model based on a Transformer;
the neural network model carries out classification training according to the frequency of word occurrence in the word segmentation dictionary, and a trained classification model is obtained; wherein the lower the frequency of occurrence of the word, the higher the probability that the word becomes a keyword.
It should be noted that, to meet the requirements of an online system, the scheme trains a Transformer neural network model to classify words, and the model supports updating its training parameters through feedback information. Whether a word is a keyword is determined by calculating its frequency of occurrence. A word's frequency of occurrence is strongly correlated with its criticality: the higher the frequency, the lower the probability that the word is a keyword. When an engineer finds that a word has been classified incorrectly, the classification can be corrected through the feedback function of the Transformer model, and the model uses the new classification information the next time it makes a judgment, which meets the requirement of updating the system online.
Referring to fig. 7, fig. 7 is a diagram showing keywords and non-keywords of a log to be identified according to a preferred embodiment of a log identifying method provided by the present invention. In specific implementation, the word segmentation dictionary shown in fig. 6 is input into the classification model to obtain the classified keywords and non-keywords shown in fig. 7, and exemplary keywords such as "limit done", "Delete User Template", "ADD BINDUPFSWITCH" and the like matched with the key fields in the key log table and non-keywords such as "operation log", "Opad", "dateTime" and the like are obtained respectively.
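The frequency principle and the feedback correction can be illustrated with a small Python sketch. It deliberately replaces the Transformer classifier with a plain frequency threshold, so it shows only the decision rule ("rarer words are more likely keywords") and the effect of engineer feedback, not the model itself. The class name, the max_keyword_ratio parameter and the threshold value are assumptions for illustration.

```python
from collections import Counter


class FrequencyKeywordClassifier:
    """Threshold-based stand-in for the keyword classification model."""

    def __init__(self, max_keyword_ratio=0.5):
        # A word is a keyword if it appears in at most this share of logs.
        self.max_keyword_ratio = max_keyword_ratio
        self.log_count = 0
        self.doc_freq = Counter()
        self.overrides = {}  # engineer feedback: word -> is_keyword

    def fit(self, tokenized_logs):
        """Count in how many logs each dictionary word occurs."""
        for tokens in tokenized_logs:
            self.log_count += 1
            self.doc_freq.update(set(tokens))

    def feedback(self, word, is_keyword):
        """Record a manual correction that takes priority from now on."""
        self.overrides[word] = is_keyword

    def is_keyword(self, word):
        if word in self.overrides:
            return self.overrides[word]
        if self.log_count == 0:
            return False  # nothing learned yet
        return self.doc_freq[word] / self.log_count <= self.max_keyword_ratio
```

In this sketch feedback is stored as an override table, which is the simplest way to make a correction effective on the next call; a trained model would instead update its parameters.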
And S4, matching the key words of the log to be identified with key fields in a pre-configured key log table to obtain matched log contents, setting a label for the matched log contents according to the key fields in the key log table and the corresponding relation between the logic expression and the label, and outputting the identification result of the log to be identified.
Specifically, key log fields and logic expressions are set in the key log table in advance, based on expert experience or on the log information that operation and maintenance personnel care about. When a log needs to be identified, it is matched against the records in the key log table according to the log category of the real-time log content and the keywords picked out by classification. For successfully matched data, labels are set for the log according to the label information configured in the key log table, and the identification result of the log is output.
Referring to fig. 8, fig. 8 is a log identification result of a preferred embodiment of the log identification method provided by the present invention. In an exemplary implementation, the message body is extracted from the log to be identified in fig. 2 to obtain the log trunk content shown in fig. 3. The log trunk content is segmented by the word segmentation model to obtain words such as "commit" and "done" in the word segmentation dictionary shown in fig. 4, and after dictionary optimization and processing by the classification model, keywords such as "commit done" shown in fig. 7 are obtained. The log content matching "commit done" is then searched for in the log to be identified, and label information such as "high-risk operation" is set for the matched log content according to the label corresponding to "commit done" in the key log table. Finally, the complete log identification result shown in fig. 8 is output, combined with the manufacturer, entity type, log category and other information of the log content.
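The matching in step S4 can be sketched in Python as a lookup over table rows. The row layout below (a label, a log category and a tuple of key fields joined by AND) is an assumed simplification of the key log table, which in the source also holds manufacturer and entity type and may carry more general logic expressions. The names KeyLogRule and identify are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class KeyLogRule:
    """One assumed row of the key log table."""
    label: str
    log_category: str
    all_of: Tuple[str, ...]  # key fields combined with logical AND


def identify(keywords: Sequence[str], log_category: str,
             rules: Sequence[KeyLogRule]) -> List[str]:
    """Return every label whose rule matches the log's keywords."""
    present = set(keywords)
    labels: List[str] = []
    for rule in rules:
        same_category = rule.log_category == log_category
        # The AND expression holds when every key field is present.
        if same_category and present.issuperset(rule.all_of):
            if rule.label not in labels:
                labels.append(rule.label)
    return labels
```

Because labels live in table rows rather than in code, several rows can match the same log and assign several labels in one pass, and a new label only needs a new row.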
It should be noted that the key log table includes label information, manufacturer, entity type, log category, key field, logic expression and other information. The table is pre-configured, but while the system is running its entries can be added, deleted and queried as actually required. When new log label information is to be added, it is first checked whether the words in the word segmentation dictionary output by the word segmentation model meet the requirement. If they do, the new label and its related information can be added directly to the key log table. If the words of the existing word segmentation dictionary do not meet the requirement, new word segmentation information needs to be added to the dictionary in a user-defined manner: the new word is added as a new token through the add_token function; the word segmentation model is notified to update the vocabulary size through the resize_token_embeddings function; and finally the newly added vocabulary is saved.
It should be noted that the number of labels set for the matched log content depends on the correspondence between the key fields, the logic expressions and the labels recorded in the key log table, and one or more labels are set for the matched log content at one time according to that correspondence. As shown in fig. 8, the output log identification result includes both log content with one piece of label information and log content with several pieces of label information. The label information can be stored in the database that stores the logs, which has a separate label column holding the label information corresponding to each log. For example, high-risk operation events are all the same kind of event, but because logs have no standard format, different manufacturers record them differently; an upper-layer application therefore cannot count them in a single pass over the raw logs, but can count them through the unified "high-risk operation" label information.
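The separate label column and the unified counting can be illustrated with a short Python sketch using an in-memory SQLite database. The table layout, the JSON encoding of the label list and all function names are assumptions for illustration; the source only states that the log database has an independent column for label information.

```python
import json
import sqlite3


def open_store():
    """Create an in-memory log table with a separate label column."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE logs ("
        "id INTEGER PRIMARY KEY, vendor TEXT, content TEXT, labels TEXT)"
    )
    return conn


def save_log(conn, vendor, content, labels):
    """Store a log together with its labels (encoded as a JSON list)."""
    conn.execute(
        "INSERT INTO logs (vendor, content, labels) VALUES (?, ?, ?)",
        (vendor, content, json.dumps(labels)),
    )


def count_label(conn, label):
    """Count logs carrying a label, whatever their vendor format."""
    rows = conn.execute("SELECT labels FROM logs")
    return sum(1 for (labels,) in rows if label in json.loads(labels))
```

The count depends only on the label column, so differently worded records from different manufacturers are counted together once they share the same label.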
The invention provides a log identification method, a system, a terminal device and a storage medium. The method comprises: extracting a message body from a log to be identified and removing variable information in the message body to obtain log trunk content; inputting the log trunk content into a word segmentation model for word segmentation processing to obtain a word segmentation dictionary of the log to be identified; inputting the word segmentation dictionary into a classification model for classification processing to obtain keywords of the log to be identified; and finally, when the keywords match key fields in a pre-configured key log table, obtaining the matched log content, setting a label for the matched log content according to the correspondence between the key fields, the logic expressions and the labels in the key log table, and outputting an identification result of the log to be identified. In the embodiment of the invention, a word segmentation model based on the SentencePiece algorithm is trained to segment words and output a word segmentation dictionary, which improves the accuracy of log word segmentation; words are classified by a trained classification model that supports updating through a feedback mechanism, so the method can be applied to a real-time log processing system; finally, according to the correspondence in the pre-configured key log table, one or more pieces of label information can be set for a log at one time, which realizes rapid identification of the log.
In another preferred embodiment, after the outputting the identification result of the log to be identified, the method further includes:
de-duplicating the logs to be identified to form a log trunk template set;
using the log trunk template set as a training sample to perform word segmentation training to obtain an optimized word segmentation model; the optimized word segmentation model is used for word segmentation processing of the trunk content of the log to be identified next time.
It is worth noting that de-duplicating the collected logs improves the quality of the training set, mitigates the problem of a language model generating repeated content, and avoids the overfitting caused by test set leakage. Next, the de-duplicated log content is input into a log trunk extraction model to extract and form the log trunk template set: the variable parameters in the logs are removed and only the log trunk is kept. One characteristic of 5G network switch logs is that they contain many variable parameters, which need to be cleaned before the word segmentation model is trained; these parameters are similar to the variable information in the message body and are not described again here.
Based on the method embodiments above, the invention correspondingly provides a system embodiment.
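Forming the log trunk template set can be sketched in a few lines of Python. The function build_template_set and the toy extractor strip_numbers are assumptions for illustration; in the scheme the trunk is produced by a log trunk extraction model whose details the source does not give.

```python
import re
from typing import Callable, Iterable, List


def strip_numbers(log: str) -> str:
    """Toy trunk extractor (an assumption): drop digits, tidy spaces."""
    return " ".join(re.sub(r"\d+", " ", log).split())


def build_template_set(logs: Iterable[str],
                       extract_trunk: Callable[[str], str]) -> List[str]:
    """De-duplicate logs, then keep one copy of each distinct trunk."""
    templates = {}
    # dict.fromkeys drops exact duplicate logs and keeps their order.
    for log in dict.fromkeys(logs):
        trunk = extract_trunk(log)
        if trunk:
            templates.setdefault(trunk, None)
    return list(templates)
```

Logs that differ only in their variable parameters collapse into a single template, so the training sample for the word segmentation model contains each trunk once.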
As shown in fig. 9, another preferred embodiment of the present invention provides a log identifying system, comprising:
the extraction module 21 is used for extracting a message body from the log to be identified, removing variable information in the message body and obtaining log trunk content;
the word segmentation module 22 is configured to input the trunk content of the log into a word segmentation model for word segmentation processing, so as to obtain a word segmentation dictionary of the log to be identified;
the classification module 23 is configured to input the word segmentation dictionary of the log to be identified into a classification model for classification processing, so as to obtain keywords of the log to be identified;
the identification module 24 is configured to match the keyword of the log to be identified with a key field in a pre-configured key log table, obtain a matched log content, set a label for the matched log content according to the key field in the key log table and a corresponding relationship between a logic expression and the label, and output an identification result of the log to be identified.
It should be noted that the log identifying system provided by the embodiment of the present invention is used to execute all the flow steps of the log identifying method in the above embodiments; the working principles and beneficial effects of the system and the method correspond one to one, so the description is not repeated here.
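As a non-limiting illustration of the work performed by the identification module 24, the following sketch matches keywords against a key log table and sets one or more labels at a time. The row layout (`KeyLogRule`), the "AND"/"OR" encoding of the logic expression, and the example labels are hypothetical; the embodiment does not fix the table format.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass(frozen=True)
class KeyLogRule:
    """One hypothetical row of the key log table."""
    key_fields: Tuple[str, ...]
    logic: str  # assumed encoding of the logic expression: "AND" or "OR"
    labels: Tuple[str, ...]


def rule_matches(rule: KeyLogRule, keywords: Set[str]) -> bool:
    """Evaluate the rule's logic expression over its key fields."""
    hits = [field.lower() in keywords for field in rule.key_fields]
    if not hits:
        return False
    if rule.logic == "AND":
        return all(hits)
    if rule.logic == "OR":
        return any(hits)
    raise ValueError(f"unsupported logic expression: {rule.logic!r}")


def label_log(keywords: Iterable[str], table: Iterable[KeyLogRule]) -> List[str]:
    """Return every label whose rule matches, without repeats, in table order."""
    present = {keyword.lower() for keyword in keywords}
    labels: List[str] = []
    for rule in table:
        if rule_matches(rule, present):
            for label in rule.labels:
                if label not in labels:
                    labels.append(label)
    return labels


table = [
    KeyLogRule(("link", "down"), "AND", ("link-failure", "alarm")),
    KeyLogRule(("timeout", "unreachable"), "OR", ("connectivity",)),
]
print(label_log(["Link", "down", "interface"], table))  # ['link-failure', 'alarm']
```

A single matching row can carry several labels, which is how one pass over the table can set multiple labels for the same log.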
An embodiment of the invention also provides a terminal device; fig. 10 is a block diagram of a preferred embodiment of the terminal device. The terminal device comprises a processor 31, a memory 32, and a computer program stored in the memory 32 and configured to be executed by the processor 31, the processor 31 implementing the log identification method according to any of the embodiments above when executing the computer program.
In addition, an embodiment of the present invention further provides a computer-readable storage medium comprising a stored computer program, wherein, when the computer program runs, the device in which the computer-readable storage medium is located is controlled to execute the log identifying method according to any one of the embodiments above.
The processor 31, when executing the computer program, implements the steps of the above-described log identification method embodiment, for example, all the steps of the log identification method shown in fig. 1. Alternatively, the processor 31 may implement the functions of the modules in the above embodiment of the log recognition system when executing the computer program, for example, the functions of the modules of the log recognition system shown in fig. 9.
Preferably, the computer program may be divided into one or more modules/units, which are stored in the memory 32 and executed by the processor 31 to complete the present invention. The one or more modules/units may be a series of computer program instruction segments capable of performing the specified functions, which instruction segments are used for describing the execution of the computer program in the terminal device.
The processor 31 may be a central processing unit (Central Processing Unit, CPU), or may be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like; alternatively, the processor 31 may be any conventional processor. The processor 31 is the control center of the terminal device and connects the various parts of the terminal device through various interfaces and lines.
The memory 32 mainly includes a program storage area, which may store an operating system, application programs required for at least one function, and the like, and a data storage area, which may store related data and the like. In addition, the memory 32 may be a high-speed random access memory, a nonvolatile memory such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash Card (Flash Card), etc., or the memory 32 may be other volatile solid-state memory devices.
It should be noted that the above terminal device may include, but is not limited to, a processor and a memory. Those skilled in the art will understand that the structural block diagram shown in fig. 10 is merely an example of the terminal device and does not limit its structure; the terminal device may include more or fewer components than those shown, may combine some components, or may use different components.
In summary, according to the log identifying method, system, terminal device, and storage medium provided by the embodiments of the invention, a message body is extracted from a log to be identified and variable information in the message body is removed to obtain log trunk content; the log trunk content is input into a word segmentation model for word segmentation processing to obtain a word segmentation dictionary of the log to be identified; the word segmentation dictionary is input into a classification model for classification processing to obtain keywords of the log to be identified; finally, the keywords are matched against key fields in a pre-configured key log table to obtain matched log content, a label is set for the matched log content according to the correspondence among the key fields, logic expressions, and labels recorded in the key log table, and an identification result of the log to be identified is output. In the embodiments of the invention, a word segmentation model trained on the basis of the SentencePiece algorithm segments the log and outputs a word segmentation dictionary, which improves the accuracy of log word segmentation; a trained classification model classifies the words and supports a feedback mechanism for updating the trained model, so the method can be applied to a real-time log processing system; finally, according to the correspondence in the pre-configured key log table, one or more labels can be set for a log at one time, which enables rapid identification of the log.
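For the classification step, the method relates word frequency to keyword likelihood: the less frequently a word occurs, the more likely it is to be a keyword. The following non-limiting sketch illustrates only that frequency relationship, using an inverse-document-frequency score as a deliberately simple substitute; it is not the Transformer-based classification model, and the scoring formula and function names are assumptions of this sketch.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List


def keyword_scores(corpus: Iterable[List[str]]) -> Dict[str, float]:
    """Score each word by a smoothed inverse document frequency.

    Rarer words receive higher scores, mirroring the stated relationship
    between low frequency and keyword probability.
    """
    docs = [set(tokens) for tokens in corpus]
    doc_freq = Counter(word for doc in docs for word in doc)
    total = len(docs)
    return {
        word: math.log((1 + total) / (1 + count))
        for word, count in doc_freq.items()
    }


def top_keywords(tokens: List[str], scores: Dict[str, float], k: int = 2) -> List[str]:
    """Return up to k known tokens with the highest scores."""
    known = [token for token in dict.fromkeys(tokens) if token in scores]
    return sorted(known, key=lambda token: scores[token], reverse=True)[:k]


corpus = [
    ["interface", "down"],
    ["interface", "up"],
    ["interface", "flapping"],
]
scores = keyword_scores(corpus)
print(top_keywords(["interface", "down"], scores, k=1))  # ['down']
```

Here "interface" appears in every log and scores zero, so the rarer word "down" is selected as the keyword.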
While the foregoing is directed to the preferred embodiments of the present invention, it will be appreciated by those skilled in the art that changes and modifications may be made without departing from the principles of the invention, and such changes and modifications are also intended to fall within the scope of the invention.

Claims (10)

1. A log identifying method, comprising:
extracting a message body from a log to be identified, and removing variable information in the message body to obtain the trunk content of the log;
inputting the trunk content of the log into a word segmentation model for word segmentation processing to obtain a word segmentation dictionary of the log to be identified;
inputting the word segmentation dictionary of the log to be identified into a classification model for classification processing to obtain keywords of the log to be identified;
matching the key words of the log to be identified with key fields in a pre-configured key log table to obtain matched log content, setting a label for the matched log content according to the corresponding relation between the key fields in the key log table, the logic expression and the label, and outputting the identification result of the log to be identified.
2. The method of claim 1, wherein the variable information in the message body includes at least one of a timestamp, IP address information, an entity name, and a log category generated by the log to be identified.
3. The log recognition method as claimed in claim 1, wherein the word segmentation model is obtained by training:
based on the SentencePiece algorithm, word segmentation training is carried out in word form, and a trained word segmentation model is obtained.
4. The log recognition method as set forth in claim 1, further comprising, before inputting the word segmentation dictionary of the log to be recognized into a classification model for classification processing:
and comparing the words in the word segmentation dictionary of the log to be recognized with key fields in a preset key log table to obtain an optimized word segmentation dictionary of the log to be recognized.
5. The log identification method as claimed in claim 1, wherein the classification model is obtained by training the steps of:
constructing a neural network model based on a Transformer;
the neural network model carries out classification training according to the frequency of word occurrence in the word segmentation dictionary, and a trained classification model is obtained; wherein the lower the frequency of occurrence of the word, the higher the probability that the word becomes a keyword.
6. The log identifying method as set forth in claim 1, further comprising, after said outputting the identification result of the log to be identified:
de-duplicating the logs to be identified to form a log trunk template set;
using the log trunk template set as a training sample to perform word segmentation training to obtain an optimized word segmentation model; the optimized word segmentation model is used for word segmentation processing of the trunk content of the log to be identified next time.
7. The method for identifying logs according to claim 1, wherein the number of the labels set for the matched log content is related to a correspondence between key fields recorded in the key log table and logical expressions and labels thereof, and one or more labels are set for the matched log content at one time according to the correspondence.
8. A log recognition system for implementing the log recognition method according to any one of claims 1 to 7, comprising:
the extraction module is used for extracting a message body from the log to be identified, removing variable information in the message body and obtaining the trunk content of the log;
the word segmentation module is used for inputting the trunk content of the log into a word segmentation model for word segmentation processing to obtain a word segmentation dictionary of the log to be identified;
the classification module is used for inputting the word segmentation dictionary of the log to be identified into a classification model for classification processing to obtain keywords of the log to be identified;
the identification module is used for matching the key words of the log to be identified with key fields in a pre-configured key log table to obtain matched log contents, setting labels for the matched log contents according to the key fields in the key log table and the corresponding relations between the logic expressions and the labels, and outputting the identification results of the log to be identified.
9. A terminal device comprising a processor, a memory and a computer program stored in the memory and configured to be executed by the processor, the processor implementing the log identifying method according to any of claims 1 to 7 when executing the computer program.
10. A computer readable storage medium, characterized in that the computer readable storage medium comprises a stored computer program, wherein the computer program, when run, controls a device in which the computer readable storage medium is located to perform the log identifying method according to any one of claims 1 to 7.
CN202311563040.2A 2023-11-21 2023-11-21 Log identification method, system, terminal equipment and storage medium Pending CN117436440A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311563040.2A CN117436440A (en) 2023-11-21 2023-11-21 Log identification method, system, terminal equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311563040.2A CN117436440A (en) 2023-11-21 2023-11-21 Log identification method, system, terminal equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117436440A true CN117436440A (en) 2024-01-23

Family

ID=89548021

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311563040.2A Pending CN117436440A (en) 2023-11-21 2023-11-21 Log identification method, system, terminal equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117436440A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117827991A (en) * 2024-03-06 2024-04-05 南湖实验室 Method and system for identifying personal identification information in semi-structured data
CN117827991B (en) * 2024-03-06 2024-05-31 南湖实验室 Method and system for identifying personal identification information in semi-structured data

Similar Documents

Publication Publication Date Title
EP3846048A1 (en) Online log analysis method, system, and electronic terminal device thereof
CN110263009B (en) Method, device and equipment for generating log classification rule and readable storage medium
CN110851598B (en) Text classification method and device, terminal equipment and storage medium
CN107704539A (en) The method and device of extensive text message batch structuring
CN111191012B (en) Knowledge graph generation device and method and computer readable storage medium thereof
CN111259144A (en) Multi-model fusion text matching method, device, equipment and storage medium
CN112307741B (en) Insurance industry document intelligent analysis method and device
CN110580308A (en) information auditing method and device, electronic equipment and storage medium
CN112784009B (en) Method and device for mining subject term, electronic equipment and storage medium
US20230401121A1 (en) Fault log classification method and system, and device and medium
CN111488314A (en) Simulation log analysis method based on Python
CN114785606A (en) Log anomaly detection method based on pre-training LogXLNET model, electronic device and storage medium
CN115953123A (en) Method, device and equipment for generating robot automation flow and storage medium
CN112395881B (en) Material label construction method and device, readable storage medium and electronic equipment
CN110737770B (en) Text data sensitivity identification method and device, electronic equipment and storage medium
CN112732655B (en) Online analysis method and system for format-free log
CN111581057B (en) General log analysis method, terminal device and storage medium
CN117194255A (en) Test data maintenance method, device, equipment and storage medium
CN114842982B (en) Knowledge expression method, device and system for medical information system
CN116578700A (en) Log classification method, log classification device, equipment and medium
CN117436440A (en) Log identification method, system, terminal equipment and storage medium
CN112115362B (en) Programming information recommendation method and device based on similar code recognition
CN115294593A (en) Image information extraction method and device, computer equipment and storage medium
CN115186240A (en) Social network user alignment method, device and medium based on relevance information
CN114722960A (en) Method and system for detecting incomplete track of event log in business process

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination