CN117724925A - Log data pattern recognition method, system and electronic device based on distribution - Google Patents

Info

Publication number
CN117724925A
Authority
CN
China
Prior art keywords
log
template
mode
signature
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311509412.3A
Other languages
Chinese (zh)
Other versions
CN117724925B (en)
Inventor
屈兴
艾丽平
崔文正
王拓
陆林峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Xishu Technology Co ltd
Original Assignee
Shenzhen Xishu Technology Co ltd

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Debugging And Monitoring (AREA)

Abstract

The invention provides a distributed log data pattern recognition method, system and electronic device. The method comprises: S1, a writing stage, in which the collected original log data is classified and aggregated, and pattern matching and extraction are performed according to preset patterns; and S2, a query stage, in which the log signature IDs within the log range of a selected time are queried and deduplicated based on the log signature ID, and the corresponding log templates are retrieved from a storage system and merged to obtain the final log patterns. The invention can be deployed on the node of each data source of a distributed system, so it supports horizontal scaling and concurrent processing of large-scale log data at higher speed.

Description

Log data pattern recognition method, system and electronic device based on distribution
Technical Field
The present disclosure relates to the field of information processing technologies, and in particular, to a method, a system, and an electronic device for identifying a log data pattern based on distribution.
Background
A software system generates a large amount of log information while in use, and this information is important for fault analysis, system performance improvement, security analysis, user behavior analysis and the like. Logs are one of the three basic data sources of an observability system (logs, metrics, traces) and provide the clearest record for diagnosing the state of the system and reproducing the state at the time a problem occurred.
Log analysis therefore helps users understand how the system is running, find problems, optimize performance, monitor security and gain business insight, thereby improving the reliability, security and efficiency of the system.
Traditional log data is small in scale; for a single company or a single service system, the software can be deployed on a single machine or a single cluster to complete data collection and analysis.
However, with the explosion of big data, in real operation and maintenance, diagnosis, alerting and analysis scenarios the sources of logs are complex and their formats differ. Large-scale services provide uninterrupted service to terminals or devices across a country or even the world, so huge volumes of new data arrive in real time. The prior art needs to first cluster the logs, merge similar patterns and then extract the log patterns.
Tools such as LogMine, Spell, Logram and Drain extract log patterns based on frequent item mining, clustering or parse trees, but they are limited by single-machine memory or run as offline batch jobs. Their processing time is long, and they cannot extract log patterns from massive real-time logs in real time. Extracting patterns efficiently and in real time from massive distributed data sources and complex logs of many source types, discovering new patterns and providing them to downstream applications is therefore a very important problem.
Disclosure of Invention
In order to solve the above problems, the present invention provides a distributed log data pattern recognition method, the method comprising:
S1, a writing stage, in which the collected original log data is classified and aggregated, and pattern matching and extraction are performed according to preset patterns;
S2, a query stage, in which the log signature IDs within the log range of the selected time are queried and deduplicated based on the log signature ID, and the corresponding log templates are retrieved from the storage system and merged to obtain the final log patterns.
Further, step S1 includes:
step S11: acquiring log data, reading the already recognized patterns from the storage system, and initializing a log parser;
step S12: batch processing the streaming log data to obtain word lists;
step S13: checking whether the log matches a log pattern preset by the user; if so, returning the template reference of the matched pattern, otherwise executing step S14;
step S14: inputting the word list of each log into the log parser one by one, searching for and updating the template to which the log belongs in the log parser, creating a new template when no such template is found, and generating the signature ID of the log;
step S15: storing the log with its signature ID and the log templates with new signature IDs.
Further, step S12, batch processing the streaming log data to obtain word lists, comprises:
word segmentation step S121: splitting the batch of data into word lists, one log at a time, by a tokenizer;
combining step S122: combining the word list according to predefined word combining rules to obtain a combined and simplified word list;
masking step S123: masking the combined and simplified word list according to domain knowledge, that is, masking domain-specific words, timestamps, IP addresses and the like, to obtain a masked word list.
Further, step S14 further includes thread safety processing, comprising:
S141, setting the child node set Map to the ConcurrentHashMap type;
S142, using the CopyOnWriteArraySet type for the template reference set;
S143, limiting the number of child nodes and updating the child node set;
S144, updating the template references: when the threshold is not exceeded, putting the new pattern into the set; when the threshold is exceeded, selecting the first template in the template set and forcibly merging it with the new template;
S145, masking differing words at the same position and locking the whole update logic.
Further, step S2 includes:
S21, querying the data in the time range selected by the user, and grouping and deduplicating by signature ID to obtain a signature ID list of the preliminary log patterns;
S22, looking up the corresponding log templates based on the signature ID list and deduplicating them;
S23, treating the retrieved log templates as original logs, performing pattern extraction on them, and merging similar log templates to obtain the final log patterns.
The invention also provides a distributed log data pattern recognition system comprising a writing device and a query device, wherein:
the writing device is used for classifying and aggregating the collected original log data and performing pattern matching and extraction according to preset patterns;
the query device is used for querying the log signature IDs within the log range of the selected time, deduplicating based on the log signature ID, retrieving the corresponding log templates from the storage system and merging them to obtain the final log patterns.
Further, the writing device comprises a log acquisition unit, a writing unit and a data storage unit, wherein:
the log acquisition unit is used for acquiring log data, reading the already recognized patterns from the storage system, and initializing the log parser;
the writing unit is used for batch processing the streaming log data to obtain word lists and checking whether they match a log pattern preset by the user; if so, the template reference of the matched pattern is returned; if not, the word list of each log is input into the log parser one by one, the log parser is updated, the template to which the log belongs is searched for and updated, and the signature ID of the log is generated;
the data storage unit is used for storing the logs with their signature IDs and all discovered log templates.
Further, the data storage unit includes a pattern recognition cache module configured to:
cache the signatures of recently queried patterns and the log parser of each specific source type;
before pattern recognition, read the content of the recognized templates, initialize the log pattern parser, and cache that content so that it can be compared for similarity with word lists to decide whether a new pattern needs to be generated;
at query time, cache the IDs of recently queried signatures and their template content, which avoids fetching the templates from the storage system again and speeds up queries.
The present invention also provides a computer storage medium storing an executable program which, when run on a computer, performs the distributed log data pattern recognition method.
The invention also provides an electronic device comprising a processor and a memory, wherein the memory stores an executable program and the processor executes the executable program to implement the distributed log data pattern recognition method.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings that are required to be used in the description of the embodiments will be briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained from these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a distributed log data pattern recognition method according to a first embodiment.
Fig. 2 is a schematic structural diagram of a distributed log data pattern recognition system according to an embodiment of the present application.
Reference numerals: query device 2; log acquisition unit 11; writing unit 12; data storage unit 13.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. It will be apparent that the described embodiments are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
A log typically consists of fixed descriptive text and variable parts; the variables include, for example, timestamps, IP addresses and specific values. Log pattern extraction masks the variable words in a log, replacing them with a wildcard symbol such as <*>, to generate a pattern that represents that kind of log. Examining log patterns greatly reduces the scale of retrieval and thus the cost of data analysis, and enables advanced functions such as anomaly detection and alert configuration on, for example, trend changes of a specific log pattern.
In the current log analysis field, log pattern extraction methods such as LogMine, Spell, Logram and Drain are based on frequent item mining, clustering or parse trees, but each has limitations, such as single-machine memory limits or an offline batch processing mode. The processing time is long, so real-time log pattern extraction cannot be performed on massive real-time logs.
For single-machine systems or offline batch processing, the log data has already been collected and stored, so no writing stage is involved, only a reading stage. In the reading stage, all the required log data is read, the algorithm parameters are adjusted, and the log patterns are recognized; the parameters influence the number of extracted log patterns.
In the prior art, Drain uses a single-machine pattern recognition algorithm: when logs are read, the pattern recognition algorithm is run on the logs to be read and the log patterns are output. The advantage is that the input requirement is low, since only the original logs are needed, but the processing time grows linearly with the volume of logs to be recognized (for example, counting the patterns in several weeks or months of logs).
Offline batch algorithms generally follow the map-reduce divide-and-conquer idea: data on multiple nodes is preprocessed in parallel, then summarized and aggregated to produce the final log patterns. This solves distributed processing of the data, and growth in data size is handled by adding nodes. Although the offline approach can handle distributed log data, its main disadvantages are:
1. The final log patterns can be generated only after all data has been analyzed; with large data volumes this still takes a long time, on the order of minutes.
2. There is a delay: for a period after a new log is stored its log pattern is unknown, and it must wait for the next batch to be processed.
The application provides a distributed log data pattern recognition method. For real-time streaming log data, preliminary log patterns are first extracted in the ingestion (writing) stage; then, in the query stage, patterns are extracted again from the preliminary results to obtain more accurate log patterns. Through this two-stage extraction, the log patterns in the data stream can be extracted concurrently and without blocking. The method specifically comprises:
S1, a writing stage, in which the collected original log data is classified and aggregated, and pattern matching is performed according to preset patterns;
S2, a query stage, in which the log signature IDs within the log range of the selected time are queried and deduplicated based on the log signature ID, the corresponding log templates are retrieved from the storage system, and similar preliminary log patterns are merged by the pattern extraction method to obtain the final log patterns.
Referring to fig. 1, the following describes the technical solution of the present application in detail in connection with various preferred embodiments.
The writing stage S1 classifies and aggregates the collected original log data, performs pattern matching and extraction according to preset patterns, and extracts and stores the preliminary log patterns with the help of a log pattern parser. As a preferred implementation, step S1 comprises the following steps:
step S11: acquiring log data, then reading the already recognized patterns from the storage system, and initializing the log parser;
The first step in log analysis is log data collection, which obtains the original log data; the source of the logs may be a file or a data stream system such as Kafka. The already recognized patterns are the predefined patterns stored in the system together with the patterns obtained from historical updates.
Step S12: batch processing is carried out on the stream log data to obtain a word list;
Logs are divided into batches by source type and uploaded; each batch is committed for storage and pattern recognition, and the batch size is typically 1024 original logs.
Specifically, the original log data is assigned a source type to distinguish the logs of different systems; for example, the logs of distributed application systems such as Spark and Hadoop and the logs of a host operating system are configured as different source types. Logs of different source types generally have different formats and are unlikely to aggregate into the same pattern, so subsequent log pattern extraction is performed independently for each log source type.
The batch processing is specifically as follows:
word segmentation step S121: the batch of data is split into word lists one by one through a word splitter;
Each log is processed by a tokenizer and split into a word list. The tokenization process includes, but is not limited to:
(1) Splitting the string on spaces or other delimiters such as #, |, or commas;
(2) Extracting JSON and XML structured data with a JSON or XML parser and splitting it into word lists;
(3) Segmenting logs that contain Chinese with a Chinese tokenizer such as ansj, HanLP or IK.
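As an illustration of option (1) only, the following minimal Java sketch splits a log line on whitespace and a few delimiter characters. The class name `LogTokenizer` and the exact delimiter set are assumptions made for this example; the application does not prescribe them, and JSON, XML and Chinese segmentation are not covered.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits one raw log line into a word list. */
final class LogTokenizer {
    // Assumed delimiter set: whitespace plus '#', '|' and ','.
    private static final String DELIMITERS = "[\\s#|,]+";

    static List<String> tokenize(String line) {
        List<String> words = new ArrayList<>();
        for (String w : line.trim().split(DELIMITERS)) {
            if (!w.isEmpty()) {          // skip the empty piece produced by a leading delimiter
                words.add(w);
            }
        }
        return words;
    }
}
```

Calling `LogTokenizer.tokenize("a b|c,d#e")` yields the five words a, b, c, d and e.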
Combining step S122: combining the word list according to a predefined word combining rule to obtain a combined and simplified word list;
The word lists produced by tokenization are combined according to preset combining rules, with the aim of merging variables of unequal length as needed. The combining rules allow several word prefixes and word suffixes to be configured; the defaults are "[" and "]". The content inside "[ ]" in a log is typically a tuple of variables, so all parts of the tuple are treated as a single variable word.
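The combining rule can be sketched as follows. This is a minimal illustration under the assumption that a group opens at a word starting with the prefix and closes at the next word ending with the suffix; `WordMerger` and its behavior for an unterminated group are choices made for this example, not details given in the application.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: merges the words between a prefix word and a suffix word into one word. */
final class WordMerger {
    static List<String> merge(List<String> words, String prefix, String suffix) {
        List<String> out = new ArrayList<>();
        StringBuilder group = null;                     // non-null while inside an open group
        for (String w : words) {
            if (group == null) {
                if (w.startsWith(prefix) && !w.endsWith(suffix)) {
                    group = new StringBuilder(w);       // open a new group
                } else {
                    out.add(w);                         // ordinary word, or a complete "[x]" word
                }
            } else {
                group.append(' ').append(w);
                if (w.endsWith(suffix)) {               // close the group
                    out.add(group.toString());
                    group = null;
                }
            }
        }
        if (group != null) {
            out.add(group.toString());                  // assumption: keep an unterminated group as one word
        }
        return out;
    }
}
```

With the default prefix and suffix, the word list `[a`, `b`, `c]`, `d` becomes the two words `[a b c]` and `d`, so a tuple of any length occupies a single position.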
Masking step S123: masking the combined and simplified word list according to domain knowledge, that is, masking domain-specific words, timestamps, IP addresses and the like, to obtain a masked word list.
In this step, based on prior domain knowledge, specially formatted information such as timestamps and IP addresses is masked by regular expression matching, which preserves more format information than a generic character mask.
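A minimal sketch of regex-based masking is shown below. The two patterns and the mask names `<timestamp>` and `<ip>` are assumptions for illustration; a real deployment would configure its own domain rules.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper: replaces specially formatted words with named masks. */
final class DomainMasker {
    // Assumed formats: ISO-like timestamp with millisecond part, and dotted IPv4.
    private static final Pattern TIMESTAMP =
            Pattern.compile("\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}[,.]\\d{3}");
    private static final Pattern IPV4 = Pattern.compile("\\d{1,3}(\\.\\d{1,3}){3}");

    static String maskWord(String word) {
        if (TIMESTAMP.matcher(word).matches()) return "<timestamp>";
        if (IPV4.matcher(word).matches()) return "<ip>";
        return word;                                    // not a known format: keep as-is
    }

    static List<String> mask(List<String> words) {
        List<String> out = new ArrayList<>(words.size());
        for (String w : words) {
            out.add(maskWord(w));
        }
        return out;
    }
}
```

Because the mask carries a name, the masked word list still records that the position held a timestamp or an IP address, which a single generic wildcard would lose.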
The technical effect of this streaming processing is that the log only needs to be scanned once in the writing (storage) stage; combined with the following steps, the log signature ID is generated and any newly generated log template is stored.
Compared with the prior art, in which the preprocessing results of every node must be summarized and aggregated, the method needs no aggregation in the writing stage and no synchronization of the preliminary recognition results across nodes; it blocks neither the storage of log data nor the extraction of the final patterns on the query node.
Step S13: checking whether the log matches a log pattern preset by the user; if so, returning the template reference of the matched pattern, otherwise executing step S14;
In this step, predefined log patterns are recognized by checking whether the log matches the regular expression rules or templates predefined by the user. Log patterns generated purely by an algorithm reduce manual effort but cannot always meet the needs of real scenarios. Classifying logs by matching user-preset regular expressions or mask templates remedies the problem that, for some specific types of logs, automatic extraction may produce too many or too few final patterns. The rules in some specific embodiments are as follows:
Regular expression mode writing rules:
a. Content inside ( ) is a regular expression matching rule for the part to be masked, for example (\d+); the placeholder can be given a name, as in (?<name>...);
b. Content outside ( ) is treated as a constant string, and regular expression characters in it do not need to be escaped manually;
c. To match ( as an ordinary character, escape it as \(; a literal \ character must also be escaped.
Examples:
a regular template:
[(?<yyyy-MM-hhTHH:mm:ss.sss>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2},\d{3})][INFO][i.k.a.s.ApmServiceSyncHandler][wmmM50Y]APMServiceSyncstartTime(\d+)endTime(\d+)
logs that can be matched:
[2023-10-27T17:18:30,009][INFO][i.k.a.s.ApmServiceSyncHandler][wmmM50Y]APMServiceSyncstartTime0endTime1698398310009\n
generating a template:
[yyyy-MM-hhTHH:mm:ss.sss][INFO][i.k.a.s.ApmServiceSyncHandler][wmmM50Y]APM Service Sync startTime<*>endTime<*>
Template mode writing rules:
a. Template mode compares the template with the tokenization result word by word; a mask word matches any word;
b. Content enclosed in < > is content to be masked: <*> is the generic mask, while <aaa> or <yyyy-MM-hhTHH:mm:ss.sss> is a mask with a specified name. Note that with the default whitespace tokenizer the mask name must not contain spaces, otherwise the template cannot be matched;
c. Content not enclosed in < > is treated as a constant string;
d. If a word in the original log itself begins with < and ends with >, the < must be escaped manually as \<.
Examples:
Pattern template: <yyyy-MM-hhTHH:mm:ss.sss> [INFO] [i.k.a.s.ApmServiceSyncHandler] [wmmM50Y] APM Service Sync startTime <*> endTime <*>
Matching log: [2023-10-27T19:11:00,020] [INFO] [i.k.a.s.ApmServiceSyncHandler] [wmmM50Y] APM Service Sync startTime 0 endTime 1698405060020
Generated template: yyyy-MM-hhTHH:mm:ss.sss [INFO] [i.k.a.s.ApmServiceSyncHandler] [wmmM50Y] APM Service Sync startTime <*> endTime <*>
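The template mode rules can be illustrated with a short Java sketch. It assumes the template and the log have already been tokenized into word lists of the same granularity; `TemplateMatcher` is a hypothetical name, and regular expression mode is not covered here.

```java
import java.util.List;

/** Hypothetical helper: word-by-word matching of a user-defined template against a log. */
final class TemplateMatcher {
    /** A word of the form <...> is a mask (generic or named) and matches any word. */
    static boolean isMask(String word) {
        return word.length() >= 2 && word.startsWith("<") && word.endsWith(">");
    }

    static boolean matches(List<String> template, List<String> logWords) {
        if (template.size() != logWords.size()) {
            return false;                               // template mode compares position by position
        }
        for (int i = 0; i < template.size(); i++) {
            String t = template.get(i);
            if (isMask(t)) {
                continue;                               // mask word: accept whatever the log has here
            }
            // An escaped "\<" at the start stands for a literal "<" in the log.
            String literal = t.startsWith("\\<") ? t.substring(1) : t;
            if (!literal.equals(logWords.get(i))) {
                return false;
            }
        }
        return true;
    }
}
```

For example, the template words `<ts>`, `[INFO]`, `startTime`, `<*>` match the log words `[2023-10-27T19:11:00,020]`, `[INFO]`, `startTime`, `0`, but not a log whose second word is `[WARN]`.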
Step S14: inputting the word list of each log into the log parser one by one, searching for and updating the template to which the log belongs in the log parser, creating a new template when no such template is found, and generating the signature ID of the log;
The signature ID is generated by the log parser. If the log does not match any predefined pattern, pattern extraction is performed on it: a new template is generated or an existing template is updated. Thread safety must be ensured, and memory overflow must be prevented by limiting the number of patterns. The processing comprises the following steps:
(1) Initialization: create the root node as layer 1, and initialize several built-in special patterns, such as: empty log; log length exceeds a given threshold; the number of patterns has reached the threshold and no new pattern can be generated.
(2) Process the word list: based on the length of the list, route to the layer 2 node, creating it if it does not exist. In this step logs of different lengths are divided into different patterns.
(3) Point the position index at the first word of the current word list and route to the layer 3 node, creating it if it does not exist.
(4) Repeat step (3) until the algorithm depth threshold is reached or the position index reaches the end of the word list. At that node, compare the word list one by one with the existing patterns under the node, computing the similarity from the number of identical words and the number of masks. If the similarity threshold is met, join the pattern with the highest similarity; otherwise create a new pattern.
In particular, to avoid memory overflow, the number of patterns that can be created under each child node is limited, as is the total number of patterns in the whole tree.
After the number of patterns under a node reaches the given threshold, a new log ignores the similarity threshold and is forcibly merged into the most similar pattern. If there is no pattern under the node and the global pattern limit has been reached, the special predefined pattern from step (1) for exceeding the threshold is returned.
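Steps (1) to (4) and the per-node limit can be sketched as the single-threaded Java class below. It is a simplified illustration, not the application's implementation: `SketchLogParser`, the `routingWords` parameter (how many leading words are used for routing) and the similarity formula (matching words plus masks, divided by length) are assumptions, and the global pattern limit and built-in special patterns are omitted.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical single-threaded sketch of the tree-based parser. */
final class SketchLogParser {
    static final String WILDCARD = "<*>";

    static final class Node {
        final Map<String, Node> children = new HashMap<>();
        final List<List<String>> templates = new ArrayList<>();
    }

    private final Node root = new Node();               // layer 1
    private final int routingWords;
    private final double simThreshold;
    private final int maxTemplatesPerNode;

    SketchLogParser(int routingWords, double simThreshold, int maxTemplatesPerNode) {
        this.routingWords = routingWords;
        this.simThreshold = simThreshold;
        this.maxTemplatesPerNode = maxTemplatesPerNode;
    }

    /** Returns the template (possibly just created or updated) that the word list belongs to. */
    List<String> parse(List<String> words) {
        // Layer 2: route by word count, so logs of different lengths never share a pattern.
        Node node = root.children.computeIfAbsent("len=" + words.size(), k -> new Node());
        // Layer 3 and below: route by the leading words.
        for (int i = 0; i < words.size() && i < routingWords; i++) {
            node = node.children.computeIfAbsent(words.get(i), k -> new Node());
        }
        List<String> best = null;
        double bestSim = -1;
        for (List<String> t : node.templates) {
            double s = similarity(t, words);
            if (s > bestSim) { bestSim = s; best = t; }
        }
        boolean full = node.templates.size() >= maxTemplatesPerNode;
        if (best != null && (bestSim >= simThreshold || full)) {
            for (int i = 0; i < best.size(); i++) {     // mask the positions that differ
                if (!best.get(i).equals(words.get(i))) best.set(i, WILDCARD);
            }
            return best;
        }
        List<String> created = new ArrayList<>(words);
        node.templates.add(created);
        return created;
    }

    static double similarity(List<String> template, List<String> words) {
        if (template.isEmpty()) return 1.0;
        int same = 0;
        for (int i = 0; i < template.size(); i++) {
            String t = template.get(i);
            if (t.equals(WILDCARD) || t.equals(words.get(i))) same++;
        }
        return (double) same / template.size();
    }
}
```

With `new SketchLogParser(1, 0.5, 100)`, parsing `user alice login` and then `user bob login` leaves one template, `user <*> login`, because two of the three words agree.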
The application also includes thread safety processing for adding new templates and updating template content produced by new classifications, in combination with the data structure used by the log parser. As one thread-safe implementation, the data structure of the log parser in this embodiment is a fixed-length Trie; the members of a node are the current depth, the node keyword, a Map of the child node set, and the template set held by the current node. Taking a Java implementation as an example:
(1) Define the child node set Map as the ConcurrentHashMap type to ensure safe concurrent updates.
(2) Define the template reference set using the CopyOnWriteArraySet type, so that adding new templates is also thread safe.
(3) Update the child node set. With large-scale log data the log templates cannot be allowed to grow without bound, so the number under a single node is generally limited to about 100. A getOrCreateChildWithLimit method obtains or creates a child node under this limit in a thread-safe way; it relies on the computeIfAbsent method of ConcurrentHashMap to keep the update thread safe.
When no child node exists for the keyword, check whether the current number of children exceeds the given threshold. If not, create the child node for the keyword; if so, return the child node whose keyword is the wildcard, so that subsequent new patterns treat the word at this position as a masked part and the number of patterns does not explode.
(4) Update the template references. The number of templates is also limited, for example to 100. When the threshold is not exceeded, a new pattern can be put into the set directly and safely. When the threshold is exceeded, the first template in the template set is selected and forcibly merged with the new template: differing words at the same position are masked, and the whole update logic is locked, for example with the synchronized keyword.
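The four points above might look roughly like the following Java node class. This is a sketch under assumptions: the class name `TrieNode`, the method `addTemplate`, passing the limits as parameters, and storing templates as immutable word lists are all choices made for this example. Because the size checks happen outside the atomic operations, both limits are approximate under heavy concurrency.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArraySet;

/** Hypothetical thread-safe tree node. */
final class TrieNode {
    static final String WILDCARD = "<*>";
    final int depth;
    final String key;
    final ConcurrentHashMap<String, TrieNode> children = new ConcurrentHashMap<>();
    final Set<List<String>> templates = new CopyOnWriteArraySet<>();

    TrieNode(int depth, String key) {
        this.depth = depth;
        this.key = key;
    }

    /** Returns the child for the word, or the shared wildcard child once the limit is reached. */
    TrieNode getOrCreateChildWithLimit(String word, int maxChildren) {
        TrieNode existing = children.get(word);
        if (existing != null) {
            return existing;
        }
        // size() is only a snapshot, so concurrent callers may overshoot the limit slightly.
        String k = children.size() < maxChildren ? word : WILDCARD;
        return children.computeIfAbsent(k, kk -> new TrieNode(depth + 1, kk));
    }

    /** Adds a template, or force-merges it into the first one when the set is full (maxTemplates >= 1). */
    List<String> addTemplate(List<String> words, int maxTemplates) {
        if (templates.size() < maxTemplates) {
            List<String> t = List.copyOf(words);
            templates.add(t);                           // the set itself is thread safe
            return t;
        }
        synchronized (this) {                           // lock the whole merge-and-replace update
            List<String> first = templates.iterator().next();
            List<String> merged = new ArrayList<>(first.size());
            for (int i = 0; i < first.size(); i++) {
                String w = first.get(i);
                merged.add(i < words.size() && w.equals(words.get(i)) ? w : WILDCARD);
            }
            List<String> result = List.copyOf(merged);
            templates.remove(first);
            templates.add(result);
            return result;
        }
    }
}
```

The wildcard child is shared by all words that arrive after the limit is hit, which is what keeps the tree from growing with every new variable value.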
The signature ID is the unique token of a template and is of numeric type. It is generated from the template content, so the same template content always yields the same signature ID. This guarantees that logs of the same type that are processed independently and in parallel are ultimately classified into the same pattern. A numeric type also allows a quick comparison of whether two logs share a template, and aggregation is faster than with a string type.
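One way to obtain a numeric ID that depends only on template content is a deterministic hash. The sketch below uses 64-bit FNV-1a over the UTF-8 bytes of the template words; the choice of hash, the NUL separator and the name `SignatureIds` are assumptions for illustration, since the application does not state how the ID is computed.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

/** Hypothetical helper: derives a numeric signature ID from template content. */
final class SignatureIds {
    private static final long FNV_OFFSET = 0xcbf29ce484222325L;   // FNV-1a 64-bit offset basis
    private static final long FNV_PRIME = 0x100000001b3L;         // FNV-1a 64-bit prime

    static long signatureId(List<String> template) {
        // Join with NUL so that ["ab", "c"] and ["a", "bc"] produce different input bytes.
        String content = String.join("\u0000", template);
        long h = FNV_OFFSET;
        for (byte b : content.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= FNV_PRIME;
        }
        return h;
    }
}
```

Because the function has no state, two nodes that independently arrive at the same template compute the same ID without exchanging any data. Like any hash, distinct templates could in principle collide, which a production design would need to consider.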
In addition, different signature IDs may belong to the same type of log pattern. During streaming processing, a log template may discover new variable positions and update its content, thus generating a new signature. The different signatures of one log pattern are merged in the query stage, when the final patterns are generated. The merge sets the parameters of the pattern extraction method according to the merge level selected by the user: the default is 10 and the minimum is 1, and the smaller the value, the fewer log patterns remain after merging. For the specific merge logic, see step S23.
Therefore, in the writing stage, logs of the same or similar patterns produce the same signature as long as the algorithm parameters are configured consistently; log templates that are persisted more than once are completely identical and do not affect the recognition result. The number of preliminary log patterns is smaller than the number of original logs by tens to hundreds of times, so the cost of merging the intermediate log patterns in the read-merge stage is negligible.
Step S15: storing the log with its signature ID and the log templates with new signature IDs, where the log template is the template to which the log belongs in step S14.
Check whether the signature ID represents a new pattern. After processing by the writing-stage pattern extraction method, each log is assigned a template reference, which contains meta information such as the template's signature ID, word list, source type, tokenizer and identification field. The signature ID of each log is checked against the cache of discovered log pattern signatures: if it is present, the pattern already exists; if not, it is a new template, and new templates are collected and persisted together.
For the original log, a signature ID column is added when it is stored, so that it can be joined with the log patterns in the query stage.
Once the signature ID column has been added, the log can be written directly to the storage system. New log templates, to improve write efficiency, are written to the storage system once a given batch size (such as 1024) or a time threshold (such as 5 s) is reached.
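The size-or-time flush rule for new templates can be sketched as a small buffer. `TemplateWriteBuffer`, the `sink` callback standing in for the storage system, and the explicit `nowMillis` argument are assumptions made to keep the example self-contained; in this sketch the time threshold is only checked when an item is added, so a real implementation would also need a timer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical buffer: flushes when the batch is full or the oldest pending item is too old. */
final class TemplateWriteBuffer<T> {
    private final int maxBatch;
    private final long maxAgeMillis;
    private final Consumer<List<T>> sink;               // stands in for the storage write
    private final List<T> pending = new ArrayList<>();
    private long firstPendingAt;

    TemplateWriteBuffer(int maxBatch, long maxAgeMillis, Consumer<List<T>> sink) {
        this.maxBatch = maxBatch;
        this.maxAgeMillis = maxAgeMillis;
        this.sink = sink;
    }

    synchronized void add(T item, long nowMillis) {
        if (pending.isEmpty()) {
            firstPendingAt = nowMillis;                 // remember when this batch started
        }
        pending.add(item);
        if (pending.size() >= maxBatch || nowMillis - firstPendingAt >= maxAgeMillis) {
            flush();
        }
    }

    synchronized void flush() {
        if (pending.isEmpty()) {
            return;
        }
        sink.accept(new ArrayList<>(pending));          // hand over a copy, then reset
        pending.clear();
    }
}
```

With the values from the text, the buffer would be created as `new TemplateWriteBuffer<>(1024, 5000, batch -> ...)`, the lambda body being whatever persists a batch.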
S2, the query stage: the log signature IDs within the log range of the selected time are queried and deduplicated based on the log signature ID, the corresponding log templates are retrieved from the storage system, and similar preliminary log patterns are merged by the pattern extraction method to obtain the final log patterns.
For large-scale log data analysis, traditional log pattern extraction consumes large computing resources and takes a long time. The technical solution of this application divides pattern extraction into a writing (ingestion) stage and a reading (query) stage, and places most of the computation and time in the ingestion stage, so that the query stage only needs to read the log patterns. The application can thus support inspecting hundreds of millions of logs within seconds. As a preferred implementation, step S2 specifically comprises:
S21, inquiring data of a time range selected by a user, and carrying out grouping deduplication based on signature IDs to obtain a signature ID list of a preliminary log mode;
In the query stage, the system may provide interactive interfaces such as a pattern overview, pattern comparison, and other advanced application views; the signature ID field is retrieved from the data in the query time range selected by the user.
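As a rough illustration of step S21, the grouping and deduplication by signature ID might look like the sketch below. It assumes logs are available as `(timestamp, signature_id)` pairs in memory; in an actual storage system the same result would more likely come from a grouped query, and the function name is hypothetical.

```python
def signature_ids_in_range(logs, start, end):
    """Group logs in [start, end) by signature ID.

    logs: iterable of (timestamp, signature_id) pairs.
    Returns a dict mapping each distinct signature ID to its log count,
    in order of first appearance.
    """
    counts = {}
    for timestamp, signature_id in logs:
        if start <= timestamp < end:
            counts[signature_id] = counts.get(signature_id, 0) + 1
    return counts
```

The keys of the returned dict form the deduplicated signature ID list, and the counts can feed a pattern overview view.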
S22: based on the signature ID list, the corresponding log templates are looked up, grouped and counted, and deduplicated so that each signature ID is unique.
A lookup- or join-style method can be used to filter queries by signature ID in batches and fetch the corresponding log templates from the storage system. To improve query performance, an LRU algorithm is used to cache the most recently queried log patterns, avoiding repeated queries.
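The combination of batched lookup and LRU caching can be sketched as follows. The class name `TemplateLRUCache` and the `fetch_many` callback (standing in for the batched storage query) are assumptions for illustration, not part of the application.

```python
from collections import OrderedDict


class TemplateLRUCache:
    """LRU cache in front of a batched template lookup."""

    def __init__(self, fetch_many, capacity=1000):
        # fetch_many is a hypothetical callable: list of IDs -> {id: template}
        self._fetch_many = fetch_many
        self._capacity = capacity
        self._entries = OrderedDict()  # least recently used entry comes first

    def get_many(self, signature_ids):
        result = {}
        missing = []
        for sid in dict.fromkeys(signature_ids):  # de-duplicate, keep order
            if sid in self._entries:
                self._entries.move_to_end(sid)    # mark as recently used
                result[sid] = self._entries[sid]
            else:
                missing.append(sid)
        if missing:
            # One batched query for everything the cache could not answer.
            for sid, template in self._fetch_many(missing).items():
                self._put(sid, template)
                result[sid] = template
        return result

    def _put(self, sid, template):
        self._entries[sid] = template
        self._entries.move_to_end(sid)
        while len(self._entries) > self._capacity:
            self._entries.popitem(last=False)     # evict least recently used
```

Only the signature IDs missing from the cache reach the storage system, and they do so in a single batch.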
In addition, the write-stage pattern extraction method can be executed independently and in parallel on different processing nodes, so the same signature ID may be generated on several nodes without synchronization and written redundantly to the storage system. After deduplication, the specific templates are queried by signature ID to obtain the preliminary patterns of the logs.
S23: the preliminary log templates found are treated as original logs and the pattern extraction method is applied to them again, merging identical and similar log templates to obtain the final log patterns. The pattern extraction method used for merging is the one described in step S14 above.
Templates are compared by word similarity; in particular, a mask word can match any word, and templates that match in this way are considered the same log template. When log data is processed as a stream on a single node, the same pattern often produces several different signatures: words that later turn out to be variables are treated as constants the first few times the log appears. Once all variable words have been found, the resulting log template fully covers the templates of the earlier signature IDs and is the most general pattern for those logs.
The final log templates generated by different nodes may differ because the nodes processed different log data, but the differences are small enough to meet the similarity threshold, so the similar log templates are assigned to the same pattern and the final log pattern is generated.
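The merging of similar templates can be illustrated with a simplified sketch. It is not the method of step S14: it assumes templates are word lists of equal length, uses `<*>` as a stand-in mask token, and picks an arbitrary threshold of 0.8; all three are assumptions made for illustration.

```python
MASK = "<*>"  # assumed placeholder for a masked (variable) word


def words_match(a, b):
    """A mask word matches any word; otherwise words must be equal."""
    return a == b or a == MASK or b == MASK


def similarity(t1, t2):
    """Fraction of positions whose words match; 0.0 if lengths differ."""
    if len(t1) != len(t2):
        return 0.0
    if not t1:
        return 1.0
    return sum(words_match(a, b) for a, b in zip(t1, t2)) / len(t1)


def merge_pair(t1, t2):
    """Keep words that agree and mask the positions that differ."""
    return [a if a == b else MASK for a, b in zip(t1, t2)]


def merge_templates(templates, threshold=0.8):
    """Greedily fold each template into the first sufficiently similar one."""
    merged = []
    for template in templates:
        for i, existing in enumerate(merged):
            if similarity(existing, template) >= threshold:
                merged[i] = merge_pair(existing, template)
                break
        else:
            merged.append(list(template))
    return merged
```

With this rule, an early template that still holds a concrete word at a variable position is absorbed by the later, more general template for the same pattern.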
Based on the above embodiment, and considering the complex network nodes and transmission environment of existing distributed systems, the present application further provides a distributed log data pattern recognition system, which includes a writing device and a query device, wherein:
the writing device is used for classifying and aggregating the collected original log data and carrying out pattern matching extraction according to a preset pattern;
the query device is used for querying the log signature IDs within the selected time range, deduplicating them, retrieving the corresponding log templates from the storage system, and merging them to obtain the final log patterns.
The query device is a group of stateless, horizontally scalable worker processes made up of multiple query computing nodes; it executes pattern query tasks and serves as the pattern query module.
Referring to FIG. 2, the writing device further includes a log collection unit, a writing unit, and a data storage unit, wherein:
the log collection unit is a component made up of multiple log collection agents; it collects log data, reads the already recognized patterns from the storage system, and initializes a log parser.
In addition, the following functions are supported:
for a directory or a log file, monitoring changes to the directory and to file contents, tracking logs as they are generated in real time, and identifying the line-break pattern of the log;
merging the lines of exception information into a single row and, once a batch has been collected, uploading the data to the writing unit. For example, for log data in Kafka, the log collection unit acts as a data consumer that polls and pulls logs from the data stream system; here too, the logs may be merged and uploaded to the writing unit in batches.
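Merging multi-line exception output into one row can be sketched as below. The rule used here, that a new log entry starts with a `YYYY-MM-DD hh:mm:ss` timestamp and every other line is a continuation, is an assumption for illustration; the application only states that the line-break pattern of the log is identified.

```python
import re

# Assumed rule: a new log entry begins with a timestamp such as
# "2023-11-14 10:00:00"; any other line continues the previous entry.
NEW_ENTRY = re.compile(r"^\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}")


def merge_multiline(lines):
    """Fold continuation lines (e.g. stack traces) into the preceding entry."""
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        if NEW_ENTRY.match(line) or not entries:
            entries.append(line)
        else:
            entries[-1] += " " + line.strip()
    return entries
```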
The writing unit batch-processes the streaming log data to obtain word lists and checks whether each log matches a log pattern preset by the user. If it does, the log pattern is found and the template reference of that pattern is returned. If not, the word list of each log is fed into the log parser one by one; the log parser is updated, the template to which the log belongs is found and updated, and the signature ID of the log is generated.
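The control flow of the writing unit can be illustrated with a toy sketch. Everything concrete in it is an assumption: preset patterns are modeled as regular expressions, the parser is a stand-in that merely masks numeric words (with `<*>` as the mask token), and the signature ID is derived from a SHA-1 hash of the masked words. The application's actual parser and signature scheme are not reproduced here.

```python
import hashlib
import re


def signature_of(words):
    """Assumed scheme: signature ID = truncated SHA-1 of the template words."""
    return hashlib.sha1(" ".join(words).encode("utf-8")).hexdigest()[:16]


class SimpleLogParser:
    """Toy stand-in for the log parser: masks numeric words only."""

    def __init__(self):
        self.templates = {}  # signature ID -> template words

    def add(self, words):
        masked = tuple("<*>" if w.isdigit() else w for w in words)
        sig = signature_of(masked)
        self.templates.setdefault(sig, masked)  # create template if new
        return sig


def assign_signature(line, preset_patterns, parser):
    """Try user-preset patterns first; fall back to the parser."""
    for name, pattern in preset_patterns.items():
        if pattern.search(line):
            return "preset:" + name
    return parser.add(line.split())
```

Two logs that differ only in a numeric value receive the same signature ID, while a log matching a preset pattern bypasses the parser entirely.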
The writing unit is likewise a group of stateless, horizontally scalable worker processes, made up of multiple write worker nodes that independently perform log data preprocessing, log pattern recognition, and pattern caching.
Each write worker node receives log data from the log collection unit, completes the write-stage processing of the log patterns, and outputs the logs with signature IDs together with all log templates found.
The data storage unit is used for storing the log with the signature ID and all found log templates.
The data storage unit is likewise made up of multiple data storage nodes, a group of stateful, horizontally scalable storage processes; multiple replicas ensure high availability of the data, and a time-series data storage engine such as Lucene may be chosen for deployment.
To further speed up the writing phase and the querying phase, the data storage unit further comprises a pattern recognition cache module for:
caching the signatures of the most recently queried patterns and the log parser of each specific source type;
before pattern recognition, reading the contents of the already recognized templates, initializing the log pattern parser, and caching it so that it can be compared for similarity with word lists to decide whether a pattern is new;
at query time, caching the most recently queried signature IDs and their template contents, which avoids fetching the templates from the storage system again and speeds up queries.
The log collection unit, the writing unit, and the data storage unit in the embodiments of this application are groups of independently deployed, horizontally scalable processes that can be deployed on the nodes of each data source, so that horizontal scaling and concurrency are supported and large-scale log data can be processed.
The present application also provides a computer storage medium storing an executable program; when the executable program runs on a computer, the computer executes the distributed log data pattern recognition method according to any one of the embodiments.
The application also provides an electronic device, which comprises a processor and a memory, wherein the memory is used for storing an executable program, and the processor is used for executing the executable program to realize the distributed log data pattern recognition method.
It should be noted that those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a computer program instructing related hardware. The computer program may be stored in a computer-readable storage medium, which may include, but is not limited to, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disk.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A distributed log data pattern recognition method, characterized by comprising the following steps:
S1, a writing stage: classifying and aggregating the collected original log data, and performing pattern matching and extraction according to preset patterns;
S2, a query stage: querying the log signature IDs within the selected time range, deduplicating them, retrieving the corresponding log templates from the storage system, and merging them to obtain the final log patterns.
2. The log data pattern recognition method as set forth in claim 1, wherein the step S1 includes:
step S11: acquiring log data, reading the already recognized patterns from the storage system, and initializing a log parser;
step S12: batch-processing the streaming log data to obtain word lists;
step S13: checking whether the log matches a log pattern preset by the user; if so, the log pattern is found and the template reference of that pattern is returned; otherwise, executing step S14;
step S14: inputting the word list of each log into the log parser one by one, searching for and updating the template to which the log belongs in the log parser, creating a new template when no such template is found, and generating a signature ID of the log;
step S15: storing the logs with signature IDs and the log templates with new signature IDs.
3. The log data pattern recognition method as set forth in claim 2, wherein step S12, batch-processing the streaming log data to obtain word lists, comprises:
word segmentation step S121: splitting each log in the batch into a word list using a tokenizer;
combining step S122: combining the word list according to predefined word-combining rules to obtain a combined, simplified word list;
masking step S123: masking the combined, simplified word list according to domain knowledge, that is, masking domain-specific words, timestamps, IP addresses, and the like, to obtain a masked word list.
4. The log data pattern recognition method as set forth in claim 2, wherein step S14 further comprises a thread-safety processing step comprising:
S141, setting the child node set Map to the ConcurrentHashMap type;
S142, using the CopyOnWriteArraySet type for the template reference set;
S143, setting the number of child nodes and updating the child node set;
S144, updating the template reference: adding the new pattern to the set if the threshold is not exceeded, and, if the threshold is exceeded, selecting the first template in the template set and forcibly merging it with the new template;
S145, masking the differing words at the same position, and locking the entire update logic.
5. The log data pattern recognition method as set forth in claim 1, wherein the step S2 comprises:
S21: querying the data in the time range selected by the user, and grouping and deduplicating by signature ID to obtain a list of signature IDs of the preliminary log patterns;
S22: based on the signature ID list, looking up the corresponding log templates and deduplicating them;
S23: treating the log templates found as original logs, performing pattern extraction on them, and merging similar log templates to obtain the final log patterns.
6. A distributed log data pattern recognition system, comprising a writing device and a query device, wherein:
the writing device is used for classifying and aggregating the collected original log data and performing pattern matching and extraction according to preset patterns;
the query device is used for querying the log signature IDs within the selected time range, deduplicating them, retrieving the corresponding log templates from the storage system, and merging them to obtain the final log patterns.
7. The log data pattern recognition system of claim 6, wherein the writing device comprises a log collection unit, a writing unit, and a data storage unit, wherein:
the log collection unit is used for collecting log data, reading the already recognized patterns from the storage system, and initializing a log parser;
the writing unit is used for batch-processing the streaming log data to obtain word lists and checking whether each log matches a log pattern preset by the user; if so, the log pattern is found and the template reference of that pattern is returned; if not, the word list of each log is input into the log parser one by one, the log parser is updated, the template to which the log belongs is found and updated, and the signature ID of the log is generated;
the data storage unit is used for storing the logs with signature IDs and all log templates found.
8. The log data pattern recognition system of claim 7, wherein the data storage unit further comprises a pattern recognition cache module to:
caching the signatures of the most recently queried patterns and the log parser of each specific source type;
before pattern recognition, reading the contents of the already recognized templates, initializing the log pattern parser, and caching it so that it can be compared for similarity with word lists to decide whether a new pattern needs to be generated;
at query time, caching the most recently queried signature IDs and their template contents, which avoids fetching the templates from the storage system again and speeds up queries.
9. A computer storage medium storing an executable program which, when run on a computer, performs the distributed log data pattern recognition method of any one of claims 1 to 5.
10. An electronic device comprising a processor and a memory, the memory being used for storing an executable program, and the processor being used for executing the executable program to implement the distributed log data pattern recognition method of any one of claims 1 to 5.
CN202311509412.3A 2023-11-14 2023-11-14 Log data pattern recognition method, system and electronic device based on distribution Active CN117724925B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311509412.3A CN117724925B (en) 2023-11-14 2023-11-14 Log data pattern recognition method, system and electronic device based on distribution

Publications (2)

Publication Number Publication Date
CN117724925A (en) 2024-03-19
CN117724925B (en) 2024-08-02

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant