WO2023284132A1 - Cloud platform log analysis method, system, device, and medium - Google Patents

Cloud platform log analysis method, system, device, and medium

Info

Publication number
WO2023284132A1
Authority
WO
WIPO (PCT)
Prior art keywords
time
logs
time period
log
total number
Prior art date
Application number
PCT/CN2021/121902
Other languages
English (en)
French (fr)
Inventor
雷跃辉
Original Assignee
苏州浪潮智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 苏州浪潮智能科技有限公司
Publication of WO2023284132A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0766 Error or fault reporting or storing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities

Definitions

  • The present application relates to the field of log analysis, and more specifically to a cloud platform log analysis method, system, computer device, and readable medium.
  • A cloud platform can quickly build a development environment and allocate computing resources according to the needs of different users, and has the advantages of being flexible, fast, and on-demand.
  • For cloud platforms, ensuring system reliability is very important. A large enterprise-level cloud computing service may have tens of thousands of nodes, and with that many nodes failures are very likely. Coupled with the complexity of cloud platform services, some problems are difficult to find and solve in a timely manner, which creates a huge workload for operation and maintenance personnel.
  • The log is an important record of the system's running status. Operation and maintenance personnel can use logs to locate service abnormalities, which provides a basis for the stable operation of the system.
  • The system log management tools currently on the market generally collect and index logs in a centralized manner, so that operation and maintenance personnel can search, analyze, monitor, and visualize them.
  • However, these tools do not analyze the logs in depth; the logs still have to be interpreted and analyzed manually to determine whether the system is abnormal. Because of the large number of logs, manual investigation is extremely time-consuming, and system abnormalities cannot be detected in time or judged accurately.
  • The purpose of the embodiments of the present application is to propose a cloud platform log analysis method, system, computer device, and computer-readable storage medium that determine the time period in which a failure occurred by clustering and determine the cause of the failure according to word frequency and inverse text frequency, so that cloud platform logs can be analyzed quickly and the efficiency of operation and maintenance personnel is improved.
  • An aspect of the embodiments of the present application provides a cloud platform log analysis method, including the following steps: preprocessing the cloud platform logs, dividing the log recording time evenly into multiple time periods of a preset length, and counting the total number of logs in each time period; selecting a time window that includes multiple consecutive time periods, classifying each time period in the time window according to dissimilarity values to obtain an abnormal class, and determining the time period in which the failure occurred according to the time corresponding to the logs in the abnormal class; performing word segmentation on the logs of the time period in which the failure occurred, and calculating the word frequency and inverse text frequency of each word; and determining the cause of the failure according to the product of the word frequency and the inverse text frequency.
  • Classifying each time period in the time window according to dissimilarity values to obtain the abnormal class includes: randomly selecting a first number of time periods from the time window as initial center points; calculating, in turn, the dissimilarity values from each remaining time period to all initial center points, and assigning each remaining time period to the corresponding initial center point according to the dissimilarity values to form multiple clusters; and calculating the sum of squared errors of each cluster, determining a new center point in each cluster based on the sum of squared errors, calculating the dissimilarity values again based on the new center points, and repeating the above steps until the clustering condition is satisfied.
  • Assigning each remaining time period to the corresponding initial center point according to the dissimilarity values to form multiple clusters includes: determining the lowest dissimilarity value for the time period currently being assigned, and assigning that time period to the initial center point corresponding to the lowest dissimilarity value.
  • Repeating the above steps until the clustering condition is satisfied includes: judging whether the sum of squared errors of the cluster shows an inflection point; and stopping the repetition in response to the sum of squared errors of the cluster showing an inflection point.
  • Determining the time period in which the failure occurred according to the time corresponding to the logs in the abnormal class includes: obtaining the total number of logs in each category, and judging whether any category has a total number of logs less than a threshold; and in response to no category having a total number of logs less than the threshold, determining the time period in which the failure occurred according to the category with the smallest total number of logs.
  • Determining the time period in which the failure occurred according to the time corresponding to the logs in the abnormal class includes: in response to a category having a total number of logs less than the threshold, determining the time period in which the failure occurred according to both the category whose total is less than the threshold and the category with the smallest total among the categories whose totals are greater than or equal to the threshold.
  • Determining the cause of the failure according to the product of the word frequency and the inverse text frequency includes: calculating the product of the word frequency and the inverse text frequency of each word, and sorting the words by the product in descending order; and determining the cause of the failure according to a preset number of top-ranked words.
  • A preprocessing module, configured to preprocess the cloud platform logs, divide the log recording time evenly into multiple time periods of a preset length, and count the total number of logs in each time period;
  • a classification module, configured to select a time window that includes multiple consecutive time periods, classify each time period in the time window according to dissimilarity values to obtain an abnormal class, and determine the time period in which the failure occurred according to the time corresponding to the logs in the abnormal class;
  • a calculation module, configured to perform word segmentation on the logs of the time period in which the failure occurred, and calculate the word frequency and inverse text frequency of each word; and
  • an analysis module, configured to determine the cause of the failure according to the product of the word frequency and the inverse text frequency.
  • A computer device, including: at least one processor; and a memory storing computer instructions executable on the processor, the instructions implementing the steps of the above method when executed by the processor.
  • A computer-readable storage medium storing a computer program that implements the steps of the above method when executed by a processor.
  • The present application has the following beneficial technical effects: the time period in which a failure occurred is determined by clustering, and the cause of the failure is determined according to word frequency and inverse text frequency, so that cloud platform logs can be analyzed quickly and the efficiency of operation and maintenance personnel is improved.
  • FIG. 1 is a schematic diagram of an embodiment of the cloud platform log analysis method provided by the present application;
  • FIG. 2 is a schematic diagram of the hardware structure of an embodiment of the computer device for cloud platform log anomaly analysis provided by the present application;
  • FIG. 3 is a schematic diagram of an embodiment of the computer storage medium for cloud platform log anomaly analysis provided by the present application.
  • FIG. 1 is a schematic diagram of an embodiment of the cloud platform log analysis method provided by the present application. As shown in FIG. 1, the embodiment of the present application includes the following steps:
  • S1. Preprocess the cloud platform logs, divide the log recording time evenly into multiple time periods of a preset length, and count the total number of logs in each time period;
  • S2. Select a time window that includes multiple consecutive time periods, classify each time period in the time window according to dissimilarity values to obtain an abnormal class, and determine the time period in which the failure occurred according to the time corresponding to the logs in the abnormal class;
  • S3. Perform word segmentation on the logs of the time period in which the failure occurred, and calculate the word frequency and inverse text frequency of each word; and
  • S4. Determine the cause of the failure according to the product of the word frequency and the inverse text frequency.
  • The logs generated by a cloud platform contain many repeated entries, and large numbers of repeated entries interfere with the detection results.
  • The logs generated by the cloud platform are semi-structured, so they need to be preprocessed into a standardized format.
  • The processed logs are no longer stored as the original virtual machine objects; instead, the data is stored in a table structure that uses in-memory columns for efficient storage. The K-means clustering algorithm is then used to obtain the approximate fault time period, and finally the cause of the fault is output through the TF-IDF algorithm.
  • The K-means clustering algorithm is an iterative clustering analysis algorithm and the most commonly used clustering algorithm based on Euclidean distance: the closer two objects are, the greater their similarity.
  • TF-IDF (term frequency-inverse document frequency, that is, word frequency-inverse text frequency) is a commonly used weighting technique in information retrieval and data mining.
  • TF stands for term frequency, that is, how often a word occurs. The number of occurrences of each word and the total number of all words are counted as statistical information.
  • IDF stands for inverse document frequency. It reflects how frequently a word appears across all texts of the corpus: when a word appears in many texts, its inverse text frequency should be low, indicating that the word contributes little to judging the content of a text.
  • Preprocess the cloud platform logs, divide the log recording time evenly into multiple time periods of a preset length, and count the total number of logs in each time period.
  • Preprocessing the cloud platform logs includes: filtering duplicate logs, and converting the filtered logs into a standard format.
  • Cloud platform log preprocessing includes two steps: the first filters duplicate logs, and the second formats the logs. Each log can be divided into five parts: timestamp, log address, code module, log level, and specific log content.
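  • The two preprocessing steps can be sketched as follows. This is a minimal illustration, not the patent's implementation: the pipe-delimited raw line layout, the timestamp pattern, the sample lines, and the names `LogRecord` and `preprocess` are all assumptions, and "duplicate" is taken to mean an exact repeat of a line.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class LogRecord:
    """The five parts named in the text (field names are illustrative)."""
    timestamp: datetime
    address: str
    module: str
    level: str
    content: str


def preprocess(raw_lines):
    """Filter exact duplicate lines, then parse the rest into five fields."""
    seen = set()
    records = []
    for line in raw_lines:
        line = line.strip()
        if not line or line in seen:
            continue  # skip blank lines and exact duplicates
        seen.add(line)
        # Assumed layout: "timestamp | address | module | level | content"
        parts = [p.strip() for p in line.split("|", 4)]
        if len(parts) != 5:
            continue  # assumed policy: drop lines that do not fit the layout
        ts = datetime.strptime(parts[0], "%Y-%m-%d %H:%M:%S")
        records.append(LogRecord(ts, *parts[1:]))
    return records


# Made-up sample lines, for illustration only.
raw = [
    "2021-09-29 10:00:01 | 10.0.0.1 | compute | ERROR | instance spawn failed",
    "2021-09-29 10:00:01 | 10.0.0.1 | compute | ERROR | instance spawn failed",
    "2021-09-29 10:00:07 | 10.0.0.2 | network | INFO | port bound",
]
records = preprocess(raw)
```

  • Here the repeated first line is dropped, leaving two structured records that later steps can count and segment.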
  • The analysis method further includes: storing the logs in the standard format in a table structure, and storing the table structure in in-memory columns.
  • The cloud platform logs are no longer stored as the original virtual machine objects; instead, the data is stored in a table structure backed by in-memory columns, which greatly reduces the space occupied while improving read throughput, and is suitable for processing large numbers of logs.
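  • One way to picture column-oriented in-memory storage is a table held as one list per field rather than one object per log. The sketch below only illustrates that layout under this assumption; the patent does not name a storage engine, and `FIELDS`, `to_columns`, and the sample rows are hypothetical.

```python
FIELDS = ("timestamp", "address", "module", "level", "content")


def to_columns(rows):
    """Turn a list of row tuples into one list per column."""
    columns = {name: [] for name in FIELDS}
    for row in rows:
        for name, value in zip(FIELDS, row):
            columns[name].append(value)
    return columns


# Made-up rows, for illustration only.
rows = [
    ("2021-09-29 10:00:01", "10.0.0.1", "compute", "ERROR", "instance spawn failed"),
    ("2021-09-29 10:00:07", "10.0.0.2", "network", "INFO", "port bound"),
]
table = to_columns(rows)

# A scan over one field reads only that field's list.
error_count = table["level"].count("ERROR")
```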
  • Select a time window that includes multiple consecutive time periods, classify each time period in the time window according to dissimilarity values to obtain an abnormal class, and determine the time period in which the failure occurred according to the time corresponding to the logs in the abnormal class.
  • In a stable cloud platform system, the number of logs is distributed relatively uniformly over time. Based on this idea, log features can be extracted from the number of logs.
  • Use time as the primary key to count the number of logs in each time period. For example, if the time interval is set to one minute, each minute serves as the identifier of one row of data.
  • Select a certain moment as the center of the time window, and use the number of logs in the time period to which that moment belongs as a feature. With this moment as the center, select N time periods before and N after it to form a time window of 2N+1 time periods; the number of logs in each time period is one feature, giving 2N+1 features in total.
  • The length of the time period may or may not be fixed.
  • For example, the length can be fixed at one minute, and two one-minute periods are taken before and after the center point to form a time window of 5 time periods.
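  • The per-minute counting and the 2N+1 window can be sketched as follows, assuming fixed one-minute periods. The function names and the sample timestamps are illustrative and do not come from the patent.

```python
from collections import Counter
from datetime import datetime, timedelta


def counts_per_minute(timestamps):
    """Count logs per one-minute period, keyed by the start of the minute."""
    return Counter(ts.replace(second=0, microsecond=0) for ts in timestamps)


def window_features(counts, center, n, step=timedelta(minutes=1)):
    """Return the 2N+1 log counts for the periods centered on `center`."""
    return [counts.get(center + k * step, 0) for k in range(-n, n + 1)]


# Made-up timestamps, for illustration only.
timestamps = [
    datetime(2021, 9, 29, 10, 0, 5),
    datetime(2021, 9, 29, 10, 0, 40),
    datetime(2021, 9, 29, 10, 1, 10),
    datetime(2021, 9, 29, 10, 3, 59),
]
counts = counts_per_minute(timestamps)

# N = 2 gives five periods: 10:00, 10:01, 10:02, 10:03, 10:04.
features = window_features(counts, datetime(2021, 9, 29, 10, 2), n=2)
```

  • With these inputs `features` is `[2, 1, 0, 1, 0]`; a period with no logs contributes a count of zero.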
  • Classifying each time period in the time window according to dissimilarity values to obtain the abnormal class includes: randomly selecting a first number of time periods from the time window as initial center points; calculating, in turn, the dissimilarity values from each remaining time period to all initial center points, and assigning each remaining time period to the corresponding initial center point according to the dissimilarity values to form multiple clusters; and calculating the sum of squared errors of each cluster, determining a new center point in each cluster based on the sum of squared errors, calculating the dissimilarity values again based on the new center points, and repeating the above steps until the clustering condition is satisfied.
  • Assigning each remaining time period to the corresponding initial center point according to the dissimilarity values to form multiple clusters includes: determining the lowest dissimilarity value for the time period currently being assigned, and assigning that time period to the initial center point corresponding to the lowest dissimilarity value.
  • For example, suppose the time window has 100 time periods in total. Randomly select 4 time periods from the time window as the initial center points, say A, B, C, and D, and then calculate the dissimilarity values from the remaining 96 time periods to all initial center points.
  • Suppose a1 is one of the remaining 96 time periods. Calculate the dissimilarity value A1 from a1 to A, B1 from a1 to B, C1 from a1 to C, and D1 from a1 to D, and compare A1, B1, C1, and D1.
  • If C1 is the smallest, assign a1 to the cluster corresponding to C, and continue until all remaining 96 time periods have been assigned.
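  • The assignment step can be sketched as follows. The text does not define the dissimilarity value, so this sketch assumes it is the absolute difference between the log counts of two time periods; the center values and the function names are made up for illustration.

```python
def dissimilarity(a, b):
    """Assumed dissimilarity: absolute difference between two log counts."""
    return abs(a - b)


def assign_to_centers(values, centers):
    """Assign each value to the center with the lowest dissimilarity."""
    clusters = {i: [] for i in range(len(centers))}
    for v in values:
        nearest = min(
            range(len(centers)),
            key=lambda i: dissimilarity(v, centers[i]),
        )
        clusters[nearest].append(v)
    return clusters


centers = [100, 60, 20, 2]            # stand-ins for A, B, C and D
remaining = [95, 58, 25, 18, 3, 110]  # log counts of some remaining periods
clusters = assign_to_centers(remaining, centers)
```

  • Each remaining period lands in exactly one cluster, the one whose center it differs from least, which mirrors the a1-to-C example in the text.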
  • The sum of squared errors can be calculated as follows:
  • $$SSE = \sum_{i=1}^{k} \sum_{p \in C_i} \lvert p - m_i \rvert^2$$
  • where $k$ represents the number of clusters, $C_i$ represents the i-th cluster, $p$ represents a sample in $C_i$, and $m_i$ represents the mean of all samples in $C_i$.
  • SSE represents the clustering error over all sample points and indicates how good the clustering is.
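  • As a worked instance of the sum of squared errors described by these definitions (each sample's squared distance to its cluster mean, summed over all clusters), the sketch below uses scalar log counts as samples. The function name and the cluster values are illustrative assumptions.

```python
def sse(clusters):
    """Sum over clusters of squared differences between samples and the cluster mean."""
    total = 0.0
    for samples in clusters:
        if not samples:
            continue  # an empty cluster contributes nothing
        mean = sum(samples) / len(samples)
        total += sum((p - mean) ** 2 for p in samples)
    return total


# Made-up clusters of log counts, for illustration only.
example_clusters = [[95, 110], [58], [25, 18], [3]]
example_sse = sse(example_clusters)
```

  • For these clusters the means are 102.5, 58, 21.5, and 3, so the value is 112.5 + 0 + 24.5 + 0 = 137.0; a lower value means the samples sit closer to their cluster means.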
  • A new center point is determined according to the sum of squared errors in each cluster; specifically, the time period with the smallest sum of squared errors in the cluster is selected as the new center point.
  • For example, if the new center points are a2, B, a3, and a10, the dissimilarity values from the remaining 96 time periods (all time periods other than the new center points) to all center points are calculated.
  • For example, the dissimilarity value from A to a2 can be calculated.
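  • One possible reading of the center update is that the new center is the member of the cluster whose summed squared difference to all members of that cluster is smallest. The sketch below implements that reading only; it is an interpretation, not a confirmed detail of the patent, and the function name and values are made up.

```python
def new_center(samples):
    """Return the member whose summed squared difference to all members is smallest."""
    return min(samples, key=lambda c: sum((p - c) ** 2 for p in samples))


cluster = [25, 18, 30]  # made-up log counts in one cluster
center = new_center(cluster)
```

  • Here the candidate sums are 74 for 25, 193 for 18, and 169 for 30, so 25 becomes the new center. Note that this picks an actual time period as the center, which differs from textbook K-means, where the center is the cluster mean.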
  • Repeating the above steps until the clustering condition is satisfied includes: judging whether the sum of squared errors of the cluster shows an inflection point; and stopping the repetition in response to the sum of squared errors of the cluster showing an inflection point.
  • For example, if the values of the sum of squared errors over successive iterations are 10, 8, 7, 5, and 6, the values trend downward until the last one suddenly rises, which indicates that an inflection point has appeared and the above steps can be stopped.
  • The above steps may be continued for clusters whose sum of squared errors has not yet shown an inflection point, until all clusters have shown one.
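  • The stopping rule can be sketched as a scan for the first iteration at which the sum of squared errors rises after having fallen. Treating "inflection point" as the first increase is an assumption drawn from the 10, 8, 7, 5, 6 example; the function name is illustrative.

```python
def first_inflection(sse_history):
    """Return the index of the first value larger than its predecessor, or None."""
    for i in range(1, len(sse_history)):
        if sse_history[i] > sse_history[i - 1]:
            return i
    return None


history = [10, 8, 7, 5, 6]  # values from the example in the text
stop_at = first_inflection(history)
```

  • For the example sequence the increase from 5 to 6 is found at index 4, so iteration stops there; a sequence that only decreases returns `None` and iteration continues.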
  • The suspicious time interval of the failure can be found based on the original logs.
  • The class with the most logs is the normal class; the class with the next most is on the verge of failure; the class with fewer logs is the abnormal class that is completely in a fault state; and the class with the fewest logs is the one where very few logs exist because the system has just started or logs are missing.
  • Determining the time period in which the failure occurred according to the time corresponding to the logs in the abnormal class includes: obtaining the total number of logs in each category, and judging whether any category has a total number of logs less than a threshold; and in response to no category having a total number of logs less than the threshold, determining the time period in which the failure occurred according to the category with the smallest total number of logs.
  • The threshold is used to judge whether there is a category that corresponds to initial system startup or missing logs. If the total number of logs of every category is greater than or equal to the threshold, no such category exists, and the category with the smallest total number of logs can be used to determine the time period in which the failure occurred.
  • Determining the time period in which the failure occurred according to the time corresponding to the logs in the abnormal class includes: in response to a category having a total number of logs less than the threshold, determining the time period in which the failure occurred according to both the category whose total is less than the threshold and the category with the smallest total among the categories whose totals are greater than or equal to the threshold. If some category has a total number of logs less than the threshold, there are categories that correspond to initial system startup or missing logs, and these categories can be treated as abnormal classes.
  • In addition, among the categories whose totals are greater than or equal to the threshold, the category with the smallest total number of logs can also be treated as an abnormal class; the time period in which the failure occurred can therefore be determined according to the abnormal classes.
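  • The two threshold cases can be sketched in one function: every category below the threshold is treated as abnormal, and the smallest category at or above the threshold is added as well. The category labels, totals, and threshold values are made up for illustration.

```python
def abnormal_categories(category_totals, threshold):
    """Pick abnormal categories from a {category: total log count} mapping."""
    below = [c for c, n in category_totals.items() if n < threshold]
    at_or_above = {c: n for c, n in category_totals.items() if n >= threshold}
    abnormal = list(below)
    if at_or_above:
        # Smallest category among those that reach the threshold.
        abnormal.append(min(at_or_above, key=at_or_above.get))
    return abnormal


totals = {"c0": 5000, "c1": 1200, "c2": 300, "c3": 12}

with_sparse_class = abnormal_categories(totals, threshold=50)
without_sparse_class = abnormal_categories(totals, threshold=10)
```

  • With a threshold of 50, `c3` falls below it and `c2` is the smallest remaining category, so both are abnormal. With a threshold of 10 no category falls below it, so only the smallest category, `c3`, is used.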
  • Perform word segmentation on the logs and create a stop-word list. Indexing the words speeds up subsequent queries. Convert the words into word vectors, use the TF-IDF algorithm to calculate a value for each word, sort the values from high to low, and output a certain number of top-ranked words.
  • The logs are first segmented into words. After segmentation, a document originally composed of sentences becomes a collection of words, some of which are very common, such as "it", "of", and "I". These words contribute little to the analysis of the text and in many cases distort the results; too many words also increase the computational complexity of the algorithm. Such words are called stop words.
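  • Word segmentation with stop-word filtering can be sketched as follows. The regular expression, the small stop-word set, and the sample message are assumptions for illustration; a real deployment would use a segmenter and a stop-word list suited to its own log language.

```python
import re

# A tiny illustrative stop-word list; it includes the examples from the text.
STOP_WORDS = {"it", "of", "i", "the", "a", "to", "is"}


def tokenize(message, stop_words=STOP_WORDS):
    """Lowercase a log message, split it into words, and drop stop words."""
    words = re.findall(r"[a-z0-9_]+", message.lower())
    return [w for w in words if w not in stop_words]


tokens = tokenize("It failed: the volume of instance_42 is missing")
```

  • The result keeps only `failed`, `volume`, `instance_42`, and `missing`, which are the words that carry information about the event.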
  • Determining the cause of the failure according to the product of the word frequency and the inverse text frequency includes: calculating the product of the word frequency and the inverse text frequency of each word, and sorting the words by the product in descending order; and determining the cause of the failure according to a preset number of top-ranked words.
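  • The word frequency is calculated as $TF_w = \dfrac{count(w)}{\lvert D_i \rvert}$, where TF represents the word frequency, $count(w)$ represents the number of occurrences of the word $w$, and $\lvert D_i \rvert$ represents the number of all words in document $D_i$.
  • The inverse text frequency is calculated as $IDF_w = \log \dfrac{N}{\sum_{i=1}^{N} I(w, D_i)}$, where IDF represents the inverse text frequency, $N$ represents the total number of documents in the corpus, and $I(w, D_i)$ indicates whether the word $w$ appears in document $D_i$: it is 1 if the word appears and 0 if it does not.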
  • $TF\text{-}IDF = TF \times IDF$.
  • TF-IDF can extract the topic of the logs and find the most critical information in the logs of this time period, from which the fault can be judged. The larger the TF-IDF value, the better the word represents the main content of the text, so the words are sorted by value in descending order. The cause of the failure can be found from the first 20 words.
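  • The ranking step can be sketched as follows. Several choices here are assumptions rather than details from the patent: each log entry is treated as one document, word frequency is computed over all entries of the failure period taken together, and the inverse text frequency is the unsmoothed logarithm of the number of entries divided by the number of entries containing the word. The function name and sample tokens are illustrative.

```python
import math
from collections import Counter


def rank_words(documents, top_k=20):
    """Rank words by TF * IDF; `documents` is a list of token lists."""
    n_docs = len(documents)
    term_counts = Counter(w for doc in documents for w in doc)
    total_terms = sum(term_counts.values())
    # Number of documents in which each word appears at least once.
    doc_freq = Counter(w for doc in documents for w in set(doc))
    scores = {
        w: (count / total_terms) * math.log(n_docs / doc_freq[w])
        for w, count in term_counts.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_k]


# Made-up tokenized log entries from a failure period.
docs = [
    ["request", "ok"],
    ["request", "ok"],
    ["request", "timeout", "timeout", "disk"],
    ["request", "ok"],
]
ranked = rank_words(docs, top_k=4)
top_words = [word for word, _ in ranked]
```

  • In this example `request` appears in every entry and scores zero, while `timeout` and `disk`, which appear only in the entry describing the problem, rank first and second.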
  • The present application determines the time period of the fault by clustering and determines the cause of the fault according to word frequency and inverse text frequency, so that cloud platform logs can be analyzed quickly and the efficiency of operation and maintenance personnel is improved.
  • The second aspect of the embodiments of the present application proposes a system for abnormal analysis of cloud platform logs, including: a preprocessing module, configured to preprocess the cloud platform logs, divide the log recording time evenly into multiple time periods of a preset length, and count the total number of logs in each time period; a classification module, configured to select a time window that includes multiple consecutive time periods, classify each time period in the time window according to dissimilarity values to obtain an abnormal class, and determine the time period in which the failure occurred according to the time corresponding to the logs in the abnormal class; a calculation module, configured to perform word segmentation on the logs of the time period in which the failure occurred, and calculate the word frequency and inverse text frequency of each word; and an analysis module, configured to determine the cause of the failure according to the product of the word frequency and the inverse text frequency.
  • The classification module is configured to: randomly select a first number of time periods from the time window as initial center points; calculate, in turn, the dissimilarity values from each remaining time period to all initial center points, and assign each remaining time period to the corresponding initial center point according to the dissimilarity values to form multiple clusters; and calculate the sum of squared errors of each cluster, determine a new center point in each cluster based on the sum of squared errors, calculate the dissimilarity values again based on the new center points, and repeat the above steps until the clustering condition is satisfied.
  • The classification module is configured to: determine the lowest dissimilarity value for the time period currently being assigned, and assign that time period to the initial center point corresponding to the lowest dissimilarity value.
  • The classification module is configured to: judge whether the sum of squared errors of the cluster shows an inflection point; and stop repeating the above steps in response to the sum of squared errors of the cluster showing an inflection point.
  • The classification module is configured to: obtain the total number of logs in each category, and judge whether any category has a total number of logs less than a threshold; and in response to no category having a total number of logs less than the threshold, determine the time period in which the failure occurred according to the category with the smallest total number of logs.
  • The classification module is configured to: in response to a category having a total number of logs less than the threshold, determine the time period in which the failure occurred according to both the category whose total is less than the threshold and the category with the smallest total among the categories whose totals are greater than or equal to the threshold.
  • The analysis module is configured to: calculate the product of the word frequency and the inverse text frequency of each word, and sort the words by the product in descending order; and determine the cause of the failure according to a preset number of top-ranked words.
  • The third aspect of the embodiments of the present application proposes a computer device, including: at least one processor; and a memory storing computer instructions executable on the processor, the instructions being executed by the processor to implement the following steps: S1. Preprocess the cloud platform logs, divide the log recording time evenly into multiple time periods of a preset length, and count the total number of logs in each time period; S2. Select a time window that includes multiple consecutive time periods, classify each time period in the time window according to dissimilarity values to obtain an abnormal class, and determine the time period in which the failure occurred according to the time corresponding to the logs in the abnormal class; S3. Perform word segmentation on the logs of the time period in which the failure occurred, and calculate the word frequency and inverse text frequency of each word; and S4. Determine the cause of the failure according to the product of the word frequency and the inverse text frequency.
  • Classifying each time period in the time window according to dissimilarity values to obtain the abnormal class includes: randomly selecting a first number of time periods from the time window as initial center points; calculating, in turn, the dissimilarity values from each remaining time period to all initial center points, and assigning each remaining time period to the corresponding initial center point according to the dissimilarity values to form multiple clusters; and calculating the sum of squared errors of each cluster, determining a new center point in each cluster based on the sum of squared errors, calculating the dissimilarity values again based on the new center points, and repeating the above steps until the clustering condition is satisfied.
  • Assigning each remaining time period to the corresponding initial center point according to the dissimilarity values to form multiple clusters includes: determining the lowest dissimilarity value for the time period currently being assigned, and assigning that time period to the initial center point corresponding to the lowest dissimilarity value.
  • Repeating the above steps until the clustering condition is satisfied includes: judging whether the sum of squared errors of the cluster shows an inflection point; and stopping the repetition in response to the sum of squared errors of the cluster showing an inflection point.
  • Determining the time period in which the failure occurred according to the time corresponding to the logs in the abnormal class includes: obtaining the total number of logs in each category, and judging whether any category has a total number of logs less than a threshold; and in response to no category having a total number of logs less than the threshold, determining the time period in which the failure occurred according to the category with the smallest total number of logs.
  • Determining the time period in which the failure occurred according to the time corresponding to the logs in the abnormal class includes: in response to a category having a total number of logs less than the threshold, determining the time period in which the failure occurred according to both the category whose total is less than the threshold and the category with the smallest total among the categories whose totals are greater than or equal to the threshold.
  • Determining the cause of the failure according to the product of the word frequency and the inverse text frequency includes: calculating the product of the word frequency and the inverse text frequency of each word, and sorting the words by the product in descending order; and determining the cause of the failure according to a preset number of top-ranked words.
  • FIG. 2 it is a schematic diagram of the hardware structure of an embodiment of the computer device for analyzing the above-mentioned cloud platform log anomaly provided by the present application.
  • the device includes a processor 201 and a memory 202 , and may further include: an input device 203 and an output device 204 .
  • the processor 201, the memory 202, the input device 203, and the output device 204 may be connected through a bus or in other ways. In FIG. 2, connection through a bus is taken as an example.
  • the memory 202 can be used to store non-volatile software programs, non-volatile computer-executable programs and modules, such as the analysis method of the cloud platform log in the embodiment of the present application. program instructions/modules.
  • the processor 201 executes various functional applications and data processing of the server by running non-volatile software programs, instructions and modules stored in the memory 202, that is, implements the cloud platform log analysis method of the above method embodiment.
  • the memory 202 may include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required by at least one function; the data storage area may store data created according to the analysis method of the cloud platform log, etc. .
  • the memory 202 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage devices.
  • the memory 202 may optionally include memories that are remotely located relative to the processor 201, and these remote memories may be connected to the local module through a network. Examples of the aforementioned networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
  • the input device 203 can receive input information such as user name and password.
  • the output device 204 may include a display device such as a display screen.
  • the program instructions/modules corresponding to one or more cloud platform log analysis methods are stored in the memory 202, and when executed by the processor 201, the cloud platform log analysis method in any of the above method embodiments is executed.
  • Any one embodiment of the computer device that executes the analysis method of the above-mentioned cloud platform log can achieve the same or similar effects as any of the above-mentioned method embodiments corresponding to it.
  • the present application also provides a computer-readable storage medium, and the computer-readable storage medium stores a computer program for executing the above method when executed by a processor.
  • FIG. 3 is a schematic diagram of an embodiment of the computer storage medium for cloud platform log anomaly analysis provided by the present application.
  • the computer readable storage medium 3 stores a computer program 31 for executing the above method when executed by a processor.
  • the program of the cloud platform log analysis method can be stored in a computer-readable storage medium.
  • when executed, the program may include the processes of the embodiments of the above methods.
  • the storage medium of the program may be a magnetic disk, an optical disk, a read-only memory (ROM) or a random access memory (RAM), and the like.
  • the foregoing computer program embodiments can achieve the same or similar effects as any of the foregoing method embodiments corresponding thereto.
  • the storage medium may be a read-only memory, a magnetic disk or an optical disk, and the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Quality & Reliability (AREA)
  • Probability & Statistics with Applications (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Debugging And Monitoring (AREA)

Abstract

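The present application discloses a method, system, device, and storage medium for analyzing cloud platform logs. The method includes: preprocessing the cloud platform logs, dividing the time covered by the log records evenly into multiple time periods of a preset length, and counting the total number of logs in each time period; selecting a time window that includes multiple consecutive time periods, classifying each time period in the time window according to dissimilarity values to obtain an abnormal category, and determining the time period in which a fault occurred according to the times corresponding to the logs in the abnormal category; performing word segmentation on the logs of the time period in which the fault occurred, and calculating the term frequency and inverse document frequency of each word; and determining the cause of the fault according to the product of the term frequency and the inverse document frequency. By determining the time period of the fault through clustering and determining the cause of the fault from the term frequency and inverse document frequency, the present application can analyze cloud platform logs quickly and improve the efficiency of operations and maintenance personnel.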

Description

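Cloud platform log analysis method, system, device, and medium
This application claims priority to Chinese patent application No. 202110801817.9, filed with the China Patent Office on July 15, 2021 and entitled "Cloud platform log analysis method, system, device, and medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of log analysis, and more specifically to a method, system, computer device, and readable medium for analyzing cloud platform logs.
Background
With the rapid development of cloud computing, more and more enterprises are moving their business and systems onto cloud platforms. A cloud platform can quickly build development environments and allocate computing resources according to the needs of different users, and offers the advantages of being elastic, fast, and on-demand. For a cloud platform, ensuring system reliability is very important. Many large enterprise-grade cloud computing services may have thousands or tens of thousands of nodes, and with so many nodes failures occur easily. Combined with the complexity of cloud platform services, this makes some problems hard to detect and resolve in time, which creates a heavy workload for operations and maintenance personnel. Logs are an important record of a system's running state; operations personnel can use logs to locate service anomalies, which provides a basis for the stable operation of the system.
The system log management tools currently on the market generally collect logs centrally and index them so that operations personnel can search, analyze, monitor, and visualize them. These tools do not, however, analyze the logs in depth: the logs still have to be read and analyzed manually to judge whether the system is behaving abnormally. Because the volume of logs is large, manual troubleshooting is extremely time-consuming, and system anomalies cannot be discovered promptly or judged accurately.
Summary
In view of this, the purpose of the embodiments of the present application is to provide a method, system, computer device, and computer-readable storage medium for analyzing cloud platform logs. The present application determines the time period in which a fault occurred by clustering and determines the cause of the fault from the term frequency and inverse document frequency, so that cloud platform logs can be analyzed quickly and the efficiency of operations and maintenance personnel is improved.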
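Based on the above purpose, one aspect of the embodiments of the present application provides a method for analyzing cloud platform logs, including the following steps: preprocessing the cloud platform logs, dividing the time covered by the log records evenly into multiple time periods of a preset length, and counting the total number of logs in each time period; selecting a time window that includes multiple consecutive time periods, classifying each time period in the time window according to dissimilarity values to obtain an abnormal category, and determining the time period in which a fault occurred according to the times corresponding to the logs in the abnormal category; performing word segmentation on the logs of the time period in which the fault occurred, and calculating the term frequency and inverse document frequency of each word; and determining the cause of the fault according to the product of the term frequency and the inverse document frequency.
In some embodiments, classifying each time period in the time window according to dissimilarity values to obtain an abnormal category includes: randomly selecting a first number of time periods from the time window as initial center points; calculating, in turn, the dissimilarity value from each remaining time period to every initial center point, and assigning each remaining time period to the corresponding initial center point according to the dissimilarity values so as to form multiple clusters; and calculating the sum of squared errors of each cluster, determining a new center point within the cluster based on the sum of squared errors, recalculating the dissimilarity values based on the new center points, and repeating the above steps until a clustering condition is met.
In some embodiments, assigning each remaining time period to the corresponding initial center point according to the dissimilarity values so as to form multiple clusters includes: determining the lowest dissimilarity value for the current time period to be assigned, and assigning the current time period to the initial center point corresponding to that lowest dissimilarity value.
In some embodiments, repeating the above steps until a clustering condition is met includes: determining whether the sum of squared errors of any cluster shows an inflection point; and in response to the sum of squared errors of a cluster showing an inflection point, stopping the repetition of the above steps.
In some embodiments, determining the time period in which a fault occurred according to the times corresponding to the logs in the abnormal category includes: obtaining the total number of logs in each category, and determining whether any category has a total number of logs below a threshold; and in response to no category having a total number of logs below the threshold, determining the time period in which the fault occurred according to the category with the smallest total number of logs.
In some embodiments, determining the time period in which a fault occurred according to the times corresponding to the logs in the abnormal category includes: in response to a category having a total number of logs below the threshold, determining the time period in which the fault occurred according to both the category with the smallest total number of logs among the categories whose totals are greater than or equal to the threshold and the category whose total number of logs is below the threshold.
In some embodiments, determining the cause of the fault according to the product of the term frequency and the inverse document frequency includes: calculating the product of the term frequency and the inverse document frequency of each word, and sorting the words by that product in descending order; and determining the cause of the fault according to a preset number of top-ranked words.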
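Another aspect of the embodiments of the present application provides a system for cloud platform log anomaly analysis, including: a preprocessing module configured to preprocess the cloud platform logs, divide the time covered by the log records evenly into multiple time periods of a preset length, and count the total number of logs in each time period; a classification module configured to select a time window that includes multiple consecutive time periods, classify each time period in the time window according to dissimilarity values to obtain an abnormal category, and determine the time period in which a fault occurred according to the times corresponding to the logs in the abnormal category; a calculation module configured to perform word segmentation on the logs of the time period in which the fault occurred and calculate the term frequency and inverse document frequency of each word; and an analysis module configured to determine the cause of the fault according to the product of the term frequency and the inverse document frequency.
A further aspect of the embodiments of the present application provides a computer device, including: at least one processor; and a memory storing computer instructions executable on the processor, the instructions implementing the steps of the above method when executed by the processor.
Yet another aspect of the embodiments of the present application provides a computer-readable storage medium storing a computer program that implements the steps of the above method when executed by a processor.
The present application has the following beneficial technical effects: the time period in which a fault occurred is determined by clustering, and the cause of the fault is determined from the term frequency and inverse document frequency, so that cloud platform logs can be analyzed quickly and the efficiency of operations and maintenance personnel is improved.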
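Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application, and those of ordinary skill in the art can obtain other embodiments from these drawings without creative effort.
FIG. 1 is a schematic diagram of an embodiment of the cloud platform log analysis method provided by the present application;
FIG. 2 is a schematic diagram of the hardware structure of an embodiment of the computer device for cloud platform log anomaly analysis provided by the present application;
FIG. 3 is a schematic diagram of an embodiment of the computer storage medium for cloud platform log anomaly analysis provided by the present application.
Detailed Description
To make the purpose, technical solutions, and advantages of the present application clearer, the embodiments of the present application are described in further detail below with reference to specific embodiments and the accompanying drawings.
It should be noted that all uses of "first" and "second" in the embodiments of the present application serve to distinguish two different entities or two different parameters that share the same name. "First" and "second" are used only for convenience of expression and should not be understood as limiting the embodiments of the present application; this will not be repeated in the subsequent embodiments.
A first aspect of the embodiments of the present application provides an embodiment of a method for analyzing cloud platform logs. FIG. 1 shows a schematic diagram of an embodiment of the cloud platform log analysis method provided by the present application. As shown in FIG. 1, the embodiment includes the following steps: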
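S1. Preprocess the cloud platform logs, divide the time covered by the log records evenly into multiple time periods of a preset length, and count the total number of logs in each time period;
S2. Select a time window that includes multiple consecutive time periods, classify each time period in the time window according to dissimilarity values to obtain an abnormal category, and determine the time period in which a fault occurred according to the times corresponding to the logs in the abnormal category;
S3. Perform word segmentation on the logs of the time period in which the fault occurred, and calculate the term frequency and inverse document frequency of each word; and
S4. Determine the cause of the fault according to the product of the term frequency and the inverse document frequency.
The logs produced by a cloud platform contain a large number of duplicates, which interfere with the detection result if they occur in bulk. The logs are also semi-structured, so they need to be preprocessed into a normalized log format. The processed logs are no longer stored as raw virtual machine objects; instead, the data are stored in a table structure, using in-memory columnar storage for efficiency. The K-means clustering algorithm is then used to obtain the approximate time period of the fault, and finally the TF-IDF algorithm outputs the cause of the fault.
The K-means clustering algorithm (k-means clustering algorithm) is an iteratively solved cluster analysis algorithm and the most commonly used clustering algorithm based on Euclidean distance; it holds that the closer two objects are, the more similar they are. TF-IDF (term frequency-inverse document frequency) is a weighting technique commonly used in information retrieval and data mining. TF stands for term frequency, the frequency with which a word occurs: the number of occurrences of the word divided by the total number of occurrences of all words, used as a statistic. IDF stands for inverse document frequency, which reflects how often a word appears across all documents in the corpus: when a word appears in many documents, its inverse document frequency should be low, indicating that the word carries little weight in identifying the content of a document.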
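The cloud platform logs are preprocessed, the time covered by the log records is divided evenly into multiple time periods of a preset length, and the total number of logs in each time period is counted.
In some embodiments, preprocessing the cloud platform logs includes: filtering out duplicate logs, and converting the filtered logs into a standard format. Preprocessing consists of two steps. The first step filters out duplicate logs. The second step normalizes the log format: each log can be divided into five parts, namely the timestamp, the log address, the code module, the log level, and the specific log content.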
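The two preprocessing steps can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the space-delimited line layout, the `LogRecord` type, and the choice to treat only exact repeated lines as duplicates are all assumptions made for the example.

```python
from typing import NamedTuple, Optional


class LogRecord(NamedTuple):
    timestamp: str  # e.g. "2021-07-15 10:00:01"
    address: str    # log address (where the line came from)
    module: str     # code module that emitted the line
    level: str      # log level such as INFO or ERROR
    content: str    # free-text log message


def parse_line(line: str) -> Optional[LogRecord]:
    """Split one raw line into the five fields; return None if it does not fit.

    Assumed layout: "<date> <time> <address> <module> <level> <content...>".
    """
    parts = line.strip().split(" ", 5)
    if len(parts) < 6:
        return None
    date, time, address, module, level, content = parts
    return LogRecord(f"{date} {time}", address, module, level, content)


def preprocess(lines):
    """Step 1: drop exact duplicate lines. Step 2: parse the rest into records."""
    seen = set()
    records = []
    for line in lines:
        key = line.strip()
        if not key or key in seen:
            continue
        seen.add(key)
        record = parse_line(key)
        if record is not None:
            records.append(record)
    return records
```

A real deployment would likely need a looser notion of "duplicate" (for example, ignoring the timestamp), which the source does not specify.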
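In some embodiments, the analysis method further includes: storing the logs in standard format in a table structure, and storing the table structure in in-memory columns. To improve the efficiency of reading logs, the cloud platform logs are no longer stored as raw virtual machine objects; instead, the data are stored in a table structure held in in-memory columnar storage. In-memory columnar storage can greatly reduce the space occupied while increasing data read throughput, which makes it suitable for processing large volumes of logs.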
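The source does not name a storage engine, so the following is only a toy illustration of what column orientation means: each of the five fields is kept in its own list, so a scan over one field (for example, all timestamps) never touches the others. The `ColumnTable` class and its method names are hypothetical.

```python
class ColumnTable:
    """A tiny column-oriented table: one Python list per log field."""

    FIELDS = ("timestamp", "address", "module", "level", "content")

    def __init__(self):
        self.columns = {name: [] for name in self.FIELDS}

    def append(self, timestamp, address, module, level, content):
        """Add one row by appending each value to its own column."""
        row = (timestamp, address, module, level, content)
        for name, value in zip(self.FIELDS, row):
            self.columns[name].append(value)

    def __len__(self):
        return len(self.columns["timestamp"])

    def column(self, name):
        """Return a whole column without reading the other fields."""
        return self.columns[name]
```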
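A time window that includes multiple consecutive time periods is selected, each time period in the time window is classified according to dissimilarity values to obtain an abnormal category, and the time period in which a fault occurred is determined according to the times corresponding to the logs in the abnormal category.
The number of logs produced by a stably running cloud platform system is distributed fairly evenly. Based on this idea, log features can be extracted using the number of logs as the baseline. Time is used as the primary key and the number of logs in the current time period is counted; for example, if the time interval is set to one minute, each minute identifies one row of data. A moment is selected as the center of the time window, and the number of logs in the time period containing that moment is computed as one feature. Taking that moment as the center, N time periods before and N time periods after the center are selected, forming a time window of 2N+1 time periods; the number of logs in each time period is one feature, giving 2N+1 features in total. The time periods may be of fixed or variable length. For example, the length can be fixed at one minute and two minutes taken on each side of the center, forming a time window of 5 time periods. Alternatively, periods of 1 minute, 2 minutes, and 3 minutes can be taken going backward, and 1 minute, 2 minutes, and 3 minutes going forward, forming a time window of 7 time periods.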
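The fixed-length case described above (one-minute periods, N periods on each side of the center) can be sketched as below. The timestamp format and the function names are assumptions for the example; minutes with no logs are counted as zero.

```python
from collections import Counter
from datetime import datetime, timedelta


def count_per_minute(timestamps):
    """Map each minute (seconds truncated) to the number of logs in it."""
    counts = Counter()
    for ts in timestamps:
        minute = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").replace(second=0)
        counts[minute] += 1
    return counts


def window_features(counts, center, n):
    """Return the 2n+1 per-minute counts centered on `center`."""
    center = center.replace(second=0, microsecond=0)
    return [counts.get(center + timedelta(minutes=k), 0) for k in range(-n, n + 1)]
```

With N = 2 this yields the 5-feature window from the example in the text.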
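In some embodiments, classifying each time period in the time window according to dissimilarity values to obtain an abnormal category includes: randomly selecting a first number of time periods from the time window as initial center points; calculating, in turn, the dissimilarity value from each remaining time period to every initial center point, and assigning each remaining time period to the corresponding initial center point according to the dissimilarity values so as to form multiple clusters; and calculating the sum of squared errors of each cluster, determining a new center point within the cluster based on the sum of squared errors, recalculating the dissimilarity values based on the new center points, and repeating the above steps until a clustering condition is met.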
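In some embodiments, assigning each remaining time period to the corresponding initial center point according to the dissimilarity values so as to form multiple clusters includes: determining the lowest dissimilarity value for the current time period to be assigned, and assigning the current time period to the initial center point corresponding to that lowest dissimilarity value. For example, suppose the time window contains 100 time periods in total and 4 time periods are randomly selected from it as initial center points, say A, B, C, and D. The dissimilarity values from the remaining 96 time periods to all the initial center points are then calculated. For example, let a1 be one of the remaining 96 time periods: calculate the dissimilarity value A1 from a1 to A, B1 from a1 to B, C1 from a1 to C, and D1 from a1 to D, and compare A1, B1, C1, and D1. If C1 is the smallest, a1 is assigned to the cluster corresponding to C. Once all of the remaining 96 time periods have been assigned, the sum of squared errors of each cluster is calculated. The sum of squared errors may be calculated as follows:
$$\mathrm{SSE} = \sum_{i} \sum_{p \in C_i} \lvert p - m_i \rvert^{2}$$
where $C_i$ denotes the i-th cluster, $p$ denotes a sample in $C_i$, and $m_i$ denotes the mean of all samples in $C_i$. SSE represents the clustering error over all sample points and indicates how good the clustering is.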
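As a small worked illustration of the sum of squared errors, assuming each sample is a single number (the log count of one time period), the per-cluster and total values can be computed as follows. The function names are chosen for the example only.

```python
def cluster_sse(samples):
    """Sum of squared deviations of one cluster's samples from their mean."""
    mean = sum(samples) / len(samples)
    return sum((p - mean) ** 2 for p in samples)


def total_sse(clusters):
    """Sum the per-cluster SSE over all non-empty clusters."""
    return sum(cluster_sse(c) for c in clusters if c)
```

For the cluster [1, 2, 3] the mean is 2, so the SSE is 1 + 0 + 1 = 2; a cluster whose samples are all equal contributes 0.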
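A new center point is then determined in each cluster according to the sum of squared errors; specifically, the time period with the smallest sum of squared errors in the cluster may be selected as the new center point. Once the new center point of each cluster has been determined, the dissimilarity values from the remaining time periods to all center points are calculated again. For example, if the new center points are a2, B, a3, and a10, the dissimilarity values from the 96 time periods other than these new center points to all center points can be calculated: the dissimilarity value A2 from A to a2, B2 from A to B, C2 from A to a3, and D2 from A to a10. If B2 is the smallest, A is assigned to the cluster corresponding to B. This continues until the remaining 96 time periods have all been assigned; the sum of squared errors of each cluster is then calculated and new center points are selected again, until the clustering condition is met.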
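The assign-then-recenter loop can be sketched as follows. Several points are assumptions rather than details from the source: the dissimilarity is taken to be the absolute difference between two periods' log counts, the new center is the member whose summed squared dissimilarity to the other members is smallest, and the loop stops when the centers no longer change or after `max_rounds` rounds (a simpler stand-in for the inflection-point condition).

```python
import random


def dissimilarity(a, b):
    """Assumed measure: absolute difference between two per-period log counts."""
    return abs(a - b)


def assign(points, centers):
    """Put every non-center index into the cluster of its least-dissimilar center."""
    clusters = {c: [c] for c in centers}
    for idx in range(len(points)):
        if idx in clusters:  # idx is itself a center
            continue
        nearest = min(centers, key=lambda c: dissimilarity(points[idx], points[c]))
        clusters[nearest].append(idx)
    return clusters


def best_center(points, members):
    """Pick the member with the smallest summed squared dissimilarity to the rest."""
    def cost(m):
        return sum(dissimilarity(points[m], points[o]) ** 2 for o in members)
    return min(members, key=cost)


def cluster(points, k, max_rounds=20, seed=0):
    """Return {center index: [member indices]} after iterating assign/recenter."""
    rng = random.Random(seed)
    centers = rng.sample(range(len(points)), k)
    for _ in range(max_rounds):
        clusters = assign(points, centers)
        new_centers = [best_center(points, clusters[c]) for c in centers]
        if sorted(new_centers) == sorted(centers):
            break
        centers = new_centers
    return assign(points, centers)
```

Because the centers are always actual time periods, this behaves more like a medoid-based variant than textbook K-means, which matches the way the example in the text picks time periods such as a2 or a10 as new center points.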
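In some embodiments, repeating the above steps until a clustering condition is met includes: determining whether the sum of squared errors of any cluster shows an inflection point; and in response to the sum of squared errors of a cluster showing an inflection point, stopping the repetition of the above steps. For example, if the successive values of the sum of squared errors are 10, 8, 7, 5, and 6, the value has been decreasing throughout and then suddenly increases on the last iteration; this indicates that an inflection point has appeared, and the above steps can be stopped.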
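One simple reading of the inflection-point rule is "the first value that rises after the sequence had been falling". The helper below is a sketch under that reading; the function name is hypothetical and other definitions of an inflection point are possible.

```python
def first_inflection(sse_history):
    """Return the index of the first value that rises after a decrease, else None."""
    for i in range(2, len(sse_history)):
        was_falling = sse_history[i - 1] < sse_history[i - 2]
        now_rising = sse_history[i] > sse_history[i - 1]
        if was_falling and now_rising:
            return i
    return None
```

For the sequence 10, 8, 7, 5, 6 from the example, the rise from 5 to 6 at index 4 is reported as the inflection point.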
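In some embodiments, the above steps may be continued for the clusters whose sum of squared errors has not yet shown an inflection point, until every cluster has shown one.
Four categories of results are finally obtained. They can be divided into abnormal and normal categories according to the number of logs in each category, and from the times in the abnormal category the suspicious time interval of the fault can be located in the original logs. In general, the category with the most logs is the normal category; the category with a relatively large number is on the edge of a fault; the category with a relatively small number is the abnormal category that is fully in a fault state; and the category with the fewest logs is the one in which very few logs exist because the system was just starting up or logs are missing.
In some embodiments, determining the time period in which a fault occurred according to the times corresponding to the logs in the abnormal category includes: obtaining the total number of logs in each category, and determining whether any category has a total number of logs below a threshold; and in response to no category having a total number of logs below the threshold, determining the time period in which the fault occurred according to the category with the smallest total number of logs. The threshold can be used to determine whether a category corresponding to initial system startup or missing logs exists. If the total number of logs in every category is greater than or equal to the threshold, no such category exists, and the time period in which the fault occurred can be determined according to the category with the fewest logs.
In some embodiments, determining the time period in which a fault occurred according to the times corresponding to the logs in the abnormal category includes: in response to a category having a total number of logs below the threshold, determining the time period in which the fault occurred according to both the category with the smallest total number of logs among the categories whose totals are greater than or equal to the threshold and the category whose total number of logs is below the threshold. If a category has a total number of logs below the threshold, a category corresponding to initial system startup or missing logs exists, and such categories can be classified as abnormal. In addition, among the categories whose totals are greater than or equal to the threshold, the category with the fewest logs can also be classified as abnormal. The time period in which the fault occurred can therefore be determined from the abnormal categories.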
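The two threshold cases can be folded into one small selection rule, sketched below. The category names and the threshold values in the usage note are made up for illustration.

```python
def abnormal_categories(totals, threshold):
    """Select abnormal categories from {category: total log count}.

    Every category below the threshold is abnormal (startup or missing logs),
    and so is the smallest category among those at or above the threshold.
    If nothing is below the threshold, that reduces to the smallest category.
    """
    abnormal = {c for c, n in totals.items() if n < threshold}
    at_or_above = {c: n for c, n in totals.items() if n >= threshold}
    if at_or_above:
        abnormal.add(min(at_or_above, key=at_or_above.get))
    return abnormal
```

With totals of 5000, 3000, 800, and 20 and a threshold of 100, the last two categories are returned; with a threshold of 10, only the category with 20 logs is returned.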
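Word segmentation is performed on the logs of the time period in which the fault occurred, and the term frequency and inverse document frequency of each word are calculated. The cause of the fault is determined according to the product of the term frequency and the inverse document frequency. After the logs of the abnormal category are extracted, they are segmented into words and a stop-word list is built. The words are indexed, which speeds up subsequent queries. The words are converted into word vectors, the TF-IDF algorithm is used to compute a value for each word, the values are sorted from high to low, and a set number of them are output. When processing logs, the logs are first segmented; after segmentation, a document originally made up of sentences becomes a large collection of words. Some of these words are very common, such as "it", "of", and "I". Such words contribute little to text analysis and in many cases distort the results. Too many of them also increase the computational complexity of the algorithm. Words of this kind are called stop words.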
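A minimal segmentation step with stop-word removal might look like the sketch below. It assumes English, whitespace- and punctuation-delimited log messages; Chinese text would need a dedicated word segmenter, which the source does not name. The stop-word list is a tiny illustrative sample.

```python
import re

# Illustrative stop words only; a real list would be much longer.
STOP_WORDS = {"the", "a", "an", "of", "to", "is", "it", "i", "in", "on", "for"}


def tokenize(content):
    """Lower-case a log message, split it into words, and drop stop words."""
    words = re.findall(r"[a-z0-9_]+", content.lower())
    return [w for w in words if w not in STOP_WORDS]
```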
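In some embodiments, determining the cause of the fault according to the product of the term frequency and the inverse document frequency includes: calculating the product of the term frequency and the inverse document frequency of each word, and sorting the words by that product in descending order; and determining the cause of the fault according to a preset number of top-ranked words.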
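The term frequency and inverse document frequency are calculated as follows:
$$\mathrm{TF} = \frac{\mathrm{count}(w)}{\lvert D \rvert}$$
$$\mathrm{IDF} = \log \frac{N}{\sum_{i=1}^{N} I(w, D_i)}$$
where TF denotes the term frequency, $\mathrm{count}(w)$ denotes the number of occurrences of the word $w$, $\lvert D \rvert$ denotes the size of the document (the total number of words counted in it), IDF denotes the inverse document frequency, $N$ denotes the total number of documents in the corpus, and $I(w, D_i)$ indicates whether the word $w$ appears in document $D_i$: it is 1 if it does and 0 otherwise.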
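After the term frequency and inverse document frequency have been calculated, the two values are multiplied to give the final TF-IDF value: TF-IDF = TF × IDF. TF-IDF can extract the topic of the logs, finding the most critical information in the logs of the period so that the fault can be diagnosed. The larger the TF-IDF value, the better the word represents the main content of the text, so the words are sorted by value in descending order. The cause of the fault can be found among the top 20 words.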
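The scoring and ranking step can be sketched as follows. Two choices here are assumptions, since the source does not specify them: each log message is treated as one document, and a word that occurs in several documents keeps its highest per-document score. Ties are broken alphabetically so the output is deterministic.

```python
import math
from collections import Counter


def tf_idf_ranking(documents, top_n=20):
    """Rank words by TF x IDF; `documents` is a list of token lists."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each word once per document

    best = {}
    for doc in documents:
        if not doc:
            continue
        for word, count in Counter(doc).items():
            tf = count / len(doc)                     # share of the document
            idf = math.log(n_docs / doc_freq[word])   # rarer words score higher
            score = tf * idf
            if score > best.get(word, float("-inf")):
                best[word] = score

    ranked = sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]
```

A word that appears in every document gets an IDF of log(1) = 0 and therefore sinks to the bottom of the ranking, which is the behavior the text relies on to surface fault-specific words.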
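The present application determines the time period in which a fault occurred by clustering and determines the cause of the fault from the term frequency and inverse document frequency, so that cloud platform logs can be analyzed quickly and the efficiency of operations and maintenance personnel is improved.
It should be noted in particular that the steps in the various embodiments of the above cloud platform log analysis method can be interleaved, replaced, added, or deleted. Such reasonable permutations and combinations of the cloud platform log analysis method therefore also fall within the scope of protection of the present application, and the scope of protection of the present application should not be limited to the embodiments.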
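Based on the above purpose, a second aspect of the embodiments of the present application provides a system for cloud platform log anomaly analysis, including: a preprocessing module configured to preprocess the cloud platform logs, divide the time covered by the log records evenly into multiple time periods of a preset length, and count the total number of logs in each time period; a classification module configured to select a time window that includes multiple consecutive time periods, classify each time period in the time window according to dissimilarity values to obtain an abnormal category, and determine the time period in which a fault occurred according to the times corresponding to the logs in the abnormal category; a calculation module configured to perform word segmentation on the logs of the time period in which the fault occurred and calculate the term frequency and inverse document frequency of each word; and an analysis module configured to determine the cause of the fault according to the product of the term frequency and the inverse document frequency.
In some embodiments, the classification module is configured to: randomly select a first number of time periods from the time window as initial center points; calculate, in turn, the dissimilarity value from each remaining time period to every initial center point, and assign each remaining time period to the corresponding initial center point according to the dissimilarity values so as to form multiple clusters; and calculate the sum of squared errors of each cluster, determine a new center point within the cluster based on the sum of squared errors, recalculate the dissimilarity values based on the new center points, and repeat the above steps until a clustering condition is met.
In some embodiments, the classification module is configured to: determine the lowest dissimilarity value for the current time period to be assigned, and assign the current time period to the initial center point corresponding to that lowest dissimilarity value.
In some embodiments, the classification module is configured to: determine whether the sum of squared errors of any cluster shows an inflection point; and in response to the sum of squared errors of a cluster showing an inflection point, stop repeating the above steps.
In some embodiments, the classification module is configured to: obtain the total number of logs in each category, and determine whether any category has a total number of logs below a threshold; and in response to no category having a total number of logs below the threshold, determine the time period in which the fault occurred according to the category with the smallest total number of logs.
In some embodiments, the classification module is configured to: in response to a category having a total number of logs below the threshold, determine the time period in which the fault occurred according to both the category with the smallest total number of logs among the categories whose totals are greater than or equal to the threshold and the category whose total number of logs is below the threshold.
In some embodiments, the analysis module is configured to: calculate the product of the term frequency and the inverse document frequency of each word, and sort the words by that product in descending order; and determine the cause of the fault according to a preset number of top-ranked words.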
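Based on the above purpose, a third aspect of the embodiments of the present application provides a computer device, including: at least one processor; and a memory storing computer instructions executable on the processor, the instructions being executed by the processor to implement the following steps: S1, preprocessing the cloud platform logs, dividing the time covered by the log records evenly into multiple time periods of a preset length, and counting the total number of logs in each time period; S2, selecting a time window that includes multiple consecutive time periods, classifying each time period in the time window according to dissimilarity values to obtain an abnormal category, and determining the time period in which a fault occurred according to the times corresponding to the logs in the abnormal category; S3, performing word segmentation on the logs of the time period in which the fault occurred, and calculating the term frequency and inverse document frequency of each word; and S4, determining the cause of the fault according to the product of the term frequency and the inverse document frequency.
In some embodiments, classifying each time period in the time window according to dissimilarity values to obtain an abnormal category includes: randomly selecting a first number of time periods from the time window as initial center points; calculating, in turn, the dissimilarity value from each remaining time period to every initial center point, and assigning each remaining time period to the corresponding initial center point according to the dissimilarity values so as to form multiple clusters; and calculating the sum of squared errors of each cluster, determining a new center point within the cluster based on the sum of squared errors, recalculating the dissimilarity values based on the new center points, and repeating the above steps until a clustering condition is met.
In some embodiments, assigning each remaining time period to the corresponding initial center point according to the dissimilarity values so as to form multiple clusters includes: determining the lowest dissimilarity value for the current time period to be assigned, and assigning the current time period to the initial center point corresponding to that lowest dissimilarity value.
In some embodiments, repeating the above steps until a clustering condition is met includes: determining whether the sum of squared errors of any cluster shows an inflection point; and in response to the sum of squared errors of a cluster showing an inflection point, stopping the repetition of the above steps.
In some embodiments, determining the time period in which a fault occurred according to the times corresponding to the logs in the abnormal category includes: obtaining the total number of logs in each category, and determining whether any category has a total number of logs below a threshold; and in response to no category having a total number of logs below the threshold, determining the time period in which the fault occurred according to the category with the smallest total number of logs.
In some embodiments, determining the time period in which a fault occurred according to the times corresponding to the logs in the abnormal category includes: in response to a category having a total number of logs below the threshold, determining the time period in which the fault occurred according to both the category with the smallest total number of logs among the categories whose totals are greater than or equal to the threshold and the category whose total number of logs is below the threshold.
In some embodiments, determining the cause of the fault according to the product of the term frequency and the inverse document frequency includes: calculating the product of the term frequency and the inverse document frequency of each word, and sorting the words by that product in descending order; and determining the cause of the fault according to a preset number of top-ranked words.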
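FIG. 2 is a schematic diagram of the hardware structure of an embodiment of the computer device for cloud platform log anomaly analysis provided by the present application.
Taking the device shown in FIG. 2 as an example, the device includes a processor 201 and a memory 202, and may further include an input device 203 and an output device 204.
The processor 201, the memory 202, the input device 203, and the output device 204 may be connected by a bus or in other ways; connection by a bus is taken as the example in FIG. 2.
As a non-volatile computer-readable storage medium, the memory 202 can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions/modules corresponding to the cloud platform log analysis method in the embodiments of the present application. By running the non-volatile software programs, instructions, and modules stored in the memory 202, the processor 201 executes the various functional applications and data processing of the server, that is, it implements the cloud platform log analysis method of the above method embodiments.
The memory 202 may include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required by at least one function, and the data storage area may store data created through use of the cloud platform log analysis method. In addition, the memory 202 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, the memory 202 may optionally include memories located remotely from the processor 201, and these remote memories may be connected to the local module through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 203 can receive input information such as a user name and password. The output device 204 may include a display device such as a display screen.
Program instructions/modules corresponding to one or more cloud platform log analysis methods are stored in the memory 202 and, when executed by the processor 201, perform the cloud platform log analysis method of any of the above method embodiments.
Any embodiment of the computer device that executes the above cloud platform log analysis method can achieve the same or similar effects as the corresponding method embodiment described above.
The present application further provides a computer-readable storage medium storing a computer program that performs the above method when executed by a processor.
FIG. 3 is a schematic diagram of an embodiment of the computer storage medium for cloud platform log anomaly analysis provided by the present application. Taking the computer storage medium shown in FIG. 3 as an example, the computer-readable storage medium 3 stores a computer program 31 that performs the above method when executed by a processor.
Finally, it should be noted that those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be completed by a computer program instructing the relevant hardware. The program of the cloud platform log analysis method can be stored in a computer-readable storage medium, and when executed, the program may include the processes of the embodiments of the above methods. The storage medium of the program may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like. The above computer program embodiments can achieve the same or similar effects as the corresponding method embodiments described above.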
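The above are exemplary embodiments disclosed by the present application, but it should be noted that various changes and modifications can be made without departing from the scope of the disclosure of the embodiments as defined by the claims. The functions, steps, and/or actions of the method claims according to the disclosed embodiments described herein need not be performed in any particular order. In addition, although elements disclosed in the embodiments may be described or claimed in the singular, they may also be understood as plural unless explicitly limited to the singular.
It should be understood that, as used herein, the singular form "a" is intended to include the plural form as well, unless the context clearly supports an exception. It should also be understood that "and/or" as used herein refers to any and all possible combinations of one or more of the associated listed items.
The serial numbers of the embodiments disclosed above are for description only and do not represent the relative merits of the embodiments.
Those of ordinary skill in the art will understand that all or some of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, which may be a read-only memory, a magnetic disk, an optical disk, or the like.
Those of ordinary skill in the art should understand that the discussion of any of the above embodiments is exemplary only and is not intended to imply that the scope of the disclosure of the embodiments (including the claims) is limited to these examples. Within the spirit of the embodiments, the technical features of the above embodiments or of different embodiments may also be combined, and many other variations of the different aspects of the embodiments exist as described above, which are not provided in detail for the sake of brevity. Therefore, any omission, modification, equivalent replacement, or improvement made within the spirit and principles of the embodiments shall fall within the scope of protection of the embodiments of the present application.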

Claims (10)

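  1. A method for analyzing cloud platform logs, characterized by comprising the following steps:
    preprocessing the cloud platform logs, dividing the time covered by the log records evenly into multiple time periods of a preset length, and counting the total number of logs in each time period;
    selecting a time window that includes multiple consecutive time periods, classifying each time period in the time window according to dissimilarity values to obtain an abnormal category, and determining the time period in which a fault occurred according to the times corresponding to the logs in the abnormal category;
    performing word segmentation on the logs of the time period in which the fault occurred, and calculating the term frequency and inverse document frequency of each word; and
    determining the cause of the fault according to the product of the term frequency and the inverse document frequency.
  2. The analysis method according to claim 1, characterized in that classifying each time period in the time window according to dissimilarity values to obtain an abnormal category comprises:
    randomly selecting a first number of time periods from the time window as initial center points;
    calculating, in turn, the dissimilarity value from each remaining time period to every initial center point, and assigning each remaining time period to the corresponding initial center point according to the dissimilarity values so as to form multiple clusters; and
    calculating the sum of squared errors of each cluster, determining a new center point within the cluster based on the sum of squared errors, recalculating the dissimilarity values based on the new center points, and repeating the above steps until a clustering condition is met.
  3. The analysis method according to claim 2, characterized in that assigning each remaining time period to the corresponding initial center point according to the dissimilarity values so as to form multiple clusters comprises:
    determining the lowest dissimilarity value for the current time period to be assigned, and assigning the current time period to the initial center point corresponding to that lowest dissimilarity value.
  4. The analysis method according to claim 2, characterized in that repeating the above steps until a clustering condition is met comprises:
    determining whether the sum of squared errors of any cluster shows an inflection point; and
    in response to the sum of squared errors of a cluster showing an inflection point, stopping the repetition of the above steps.
  5. The method according to claim 1, characterized in that determining the time period in which a fault occurred according to the times corresponding to the logs in the abnormal category comprises:
    obtaining the total number of logs in each category, and determining whether any category has a total number of logs below a threshold; and
    in response to no category having a total number of logs below the threshold, determining the time period in which the fault occurred according to the category with the smallest total number of logs.
  6. The method according to claim 5, characterized in that determining the time period in which a fault occurred according to the times corresponding to the logs in the abnormal category comprises:
    in response to a category having a total number of logs below the threshold, determining the time period in which the fault occurred according to both the category with the smallest total number of logs among the categories whose totals are greater than or equal to the threshold and the category whose total number of logs is below the threshold.
  7. The analysis method according to claim 1, characterized in that determining the cause of the fault according to the product of the term frequency and the inverse document frequency comprises:
    calculating the product of the term frequency and the inverse document frequency of each word, and sorting the words by that product in descending order; and
    determining the cause of the fault according to a preset number of top-ranked words.
  8. A system for analyzing cloud platform logs, characterized by comprising:
    a preprocessing module configured to preprocess the cloud platform logs, divide the time covered by the log records evenly into multiple time periods of a preset length, and count the total number of logs in each time period;
    a classification module configured to select a time window that includes multiple consecutive time periods, classify each time period in the time window according to dissimilarity values to obtain an abnormal category, and determine the time period in which a fault occurred according to the times corresponding to the logs in the abnormal category;
    a calculation module configured to perform word segmentation on the logs of the time period in which the fault occurred and calculate the term frequency and inverse document frequency of each word; and
    an analysis module configured to determine the cause of the fault according to the product of the term frequency and the inverse document frequency.
  9. A computer device, characterized by comprising:
    at least one processor; and
    a memory storing computer instructions executable on the processor, the instructions implementing the steps of the method according to any one of claims 1-7 when executed by the processor.
  10. A computer-readable storage medium storing a computer program, characterized in that the computer program implements the steps of the method according to any one of claims 1-7 when executed by a processor.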
PCT/CN2021/121902 2021-07-15 2021-09-29 Cloud platform log analysis method, system, device, and medium WO2023284132A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110801817.9 2021-07-15
CN202110801817.9A CN113254255B (zh) 2021-07-15 2021-07-15 Cloud platform log analysis method, system, device, and medium

Publications (1)

Publication Number Publication Date
WO2023284132A1 true WO2023284132A1 (zh) 2023-01-19

Family

ID=77180450

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/121902 WO2023284132A1 (zh) 2021-07-15 2021-09-29 Cloud platform log analysis method, system, device, and medium

Country Status (2)

Country Link
CN (1) CN113254255B (zh)
WO (1) WO2023284132A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115858894A (zh) * 2023-02-14 2023-03-28 温州众成科技有限公司 Visual big data analysis method
CN115858794A (zh) * 2023-02-20 2023-03-28 北京特立信电子技术股份有限公司 Abnormal log data identification method for network operation security monitoring

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113254255B (zh) * 2021-07-15 2021-10-29 苏州浪潮智能科技有限公司 Cloud platform log analysis method, system, device, and medium
CN116541252B (zh) * 2023-07-06 2023-10-20 广州豪特节能环保科技股份有限公司 Method and device for processing machine room fault log data

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109634818A (zh) * 2018-10-24 2019-04-16 中国平安人寿保险股份有限公司 Log analysis method, system, terminal, and computer-readable storage medium
CN110288004A (zh) * 2019-05-30 2019-09-27 武汉大学 System fault diagnosis method and device based on log semantic mining
US20200159636A1 (en) * 2017-07-25 2020-05-21 Huawei Technologies Co., Ltd. Memory Anomaly Detection Method and Device
CN111538642A (zh) * 2020-07-02 2020-08-14 杭州海康威视数字技术股份有限公司 Abnormal behavior detection method and apparatus, electronic device, and storage medium
CN112685215A (zh) * 2021-01-22 2021-04-20 浪潮云信息技术股份公司 Cloud platform abnormal log analysis method
CN113254255A (zh) * 2021-07-15 2021-08-13 苏州浪潮智能科技有限公司 Cloud platform log analysis method, system, device, and medium

Family Cites Families (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103761173A (zh) * 2013-12-28 2014-04-30 华中科技大学 Log-based computer system fault diagnosis method and device
US20150347895A1 (en) * 2014-06-02 2015-12-03 Qualcomm Incorporated Deriving relationships from overlapping location data
US9678822B2 (en) * 2015-01-02 2017-06-13 Tata Consultancy Services Limited Real-time categorization of log events
CN105577440B (zh) * 2015-12-24 2019-06-11 华为技术有限公司 Network fault time locating method and analysis device
CN105471659B (zh) * 2015-12-25 2019-03-01 华为技术有限公司 Fault root cause analysis method and analysis device
CN105812177B (zh) * 2016-03-08 2019-10-18 华为技术有限公司 Network fault processing method and processing device
US11086919B2 (en) * 2018-02-19 2021-08-10 Harness Inc. Service regression detection using real-time anomaly detection of log data
CN110516034A (zh) * 2019-06-28 2019-11-29 中兴通讯股份有限公司 Log management method and apparatus, network device, and readable storage medium
CN110413500B (zh) * 2019-07-31 2024-01-09 口口相传(北京)网络技术有限公司 Fault analysis method and device based on big data fusion
CN110958136A (zh) * 2019-11-11 2020-04-03 国网山东省电力公司信息通信公司 Deep-learning-based log analysis and early-warning method
CN112948155B (zh) * 2019-12-11 2022-12-16 中移(苏州)软件技术有限公司 Model training method, state prediction method, apparatus, device, and storage medium
CN112488080A (zh) * 2020-12-23 2021-03-12 武汉烽火众智数字技术有限责任公司 Fault diagnosis and analysis method and system based on a clustering algorithm
CN112613309A (zh) * 2020-12-24 2021-04-06 北京浪潮数据技术有限公司 Log classification and analysis method, apparatus, device, and readable storage medium
CN112612887A (zh) * 2020-12-25 2021-04-06 北京天融信网络安全技术有限公司 Log processing method, apparatus, device, and storage medium
CN112988440B (zh) * 2021-02-23 2023-08-01 山东英信计算机技术有限公司 System fault prediction method and apparatus, electronic device, and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200159636A1 (en) * 2017-07-25 2020-05-21 Huawei Technologies Co., Ltd. Memory Anomaly Detection Method and Device
CN109634818A (zh) * 2018-10-24 2019-04-16 中国平安人寿保险股份有限公司 Log analysis method, system, terminal, and computer-readable storage medium
CN110288004A (zh) * 2019-05-30 2019-09-27 武汉大学 System fault diagnosis method and device based on log semantic mining
CN111538642A (zh) * 2020-07-02 2020-08-14 杭州海康威视数字技术股份有限公司 Abnormal behavior detection method and apparatus, electronic device, and storage medium
CN112685215A (zh) * 2021-01-22 2021-04-20 浪潮云信息技术股份公司 Cloud platform abnormal log analysis method
CN113254255A (zh) * 2021-07-15 2021-08-13 苏州浪潮智能科技有限公司 Cloud platform log analysis method, system, device, and medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115858894A (zh) * 2023-02-14 2023-03-28 温州众成科技有限公司 Visual big data analysis method
CN115858894B (zh) * 2023-02-14 2023-05-16 温州众成科技有限公司 Visual big data analysis method
CN115858794A (zh) * 2023-02-20 2023-03-28 北京特立信电子技术股份有限公司 Abnormal log data identification method for network operation security monitoring

Also Published As

Publication number Publication date
CN113254255A (zh) 2021-08-13
CN113254255B (zh) 2021-10-29

Similar Documents

Publication Publication Date Title
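WO2023284132A1 (zh) Cloud platform log analysis method, system, device, and medium
CN111984499B (zh) Fault detection method and device for a big data cluster
CN107609121B (zh) News text classification method based on LDA and word2vec algorithms
WO2021088385A1 (zh) Online log parsing method, system, and electronic terminal device thereof
CN107391772B (zh) Text classification method based on naive Bayes
US20130013597A1 (en) Processing Repetitive Data
JP2022118108A (ja) Log audit method, apparatus, electronic device, medium, and computer program
CN112100149B (zh) Automated log analysis system
Wang et al. Loguad: log unsupervised anomaly detection based on word2vec
WO2021051864A1 (zh) Dictionary expansion method and apparatus, electronic device, and storage medium
CN104239553A (zh) Entity recognition method based on the Map-Reduce framework
CN111177360B (zh) Adaptive filtering method and device based on user logs in the cloud
WO2022095637A1 (zh) Fault log classification method, system, device, and medium
WO2021109724A1 (zh) Log anomaly detection method and device
WO2024031930A1 (zh) Abnormal log detection method and apparatus, electronic device, and storage medium
CN116167370A (zh) Distributed system anomaly detection method based on spatiotemporal feature analysis of logs
CN111753070A (zh) System and method for processing server monitoring logs
WO2022257421A1 (zh) Cluster anomaly detection method, apparatus, and related device
CN112612832B (zh) Node analysis method, apparatus, device, and storage medium
Vandic et al. A semantic clustering-based approach for searching and browsing tag spaces
CN110826845B (zh) Multi-dimensional combined cost allocation device and method
CN112306820A (zh) Root cause analysis method and apparatus for log-based operations and maintenance, electronic device, and storage medium
Zou et al. Improving log-based fault diagnosis by log classification
CN115102848B (zh) Log data extraction method, system, device, and medium
Makinist et al. Preparation of improved Turkish dataset for sentiment analysis in social media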

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21949905

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE