WO2021082780A1 - 一种日志分类方法及装置 - Google Patents
一种日志分类方法及装置 Download PDFInfo
- Publication number
- WO2021082780A1 (PCT/CN2020/115409)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- log
- word
- classification
- feature
- feature word
- Prior art date
Links
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3346—Query execution using probabilistic model
Definitions
- the embodiments of the present invention relate to the field of financial technology (Fintech), and in particular to a log classification method and device.
- the current common log classification idea is a text classification algorithm based on machine learning.
- the text classification algorithm is based on statistical theory, using an algorithm to give the machine a human-like automatic learning ability, that is, to perform statistical analysis on known training data to obtain patterns, and then use those patterns to predict and analyze unknown data. Because machine learning technology performs well in practice in the field of text classification, it has become the mainstream in the field of log analysis and classification.
- the embodiment of the present invention provides a log classification method and device, which combines a machine learning algorithm and knowledge engineering to overcome the problem of unbalanced training data in a sample set, thereby improving the accuracy of model classification.
- the classification model determines the log classification to which the log to be classified belongs; the classification model is determined according to the conditional probability of each feature word in the sample logs under each log classification;
- the conditional probability of each feature word under each log classification is determined according to the word frequency model and the frequency modulation model; the word frequency model includes the number of times each feature word appears under each log classification, and the frequency modulation model includes an adjustment parameter of each feature word under each log classification, where the adjustment parameter is used to adjust the number of times the corresponding feature word appears under the corresponding log classification.
- conditional probability of each feature word under each log classification is determined according to the word frequency model and the frequency modulation model, including:
- the conditional probability of the feature word under the log classification is determined according to the number of times the feature word appears in the word frequency model, the adjustment parameter of the feature word in the frequency modulation model, and the sum of the number of times each feature word appears under the log classification.
- the word frequency model is a word frequency matrix of m rows ⁇ n columns, and the frequency modulation model is a frequency modulation matrix of m rows ⁇ n columns;
- the log classification corresponding to the i-th row in the word frequency matrix is the same as the log classification corresponding to the i-th row in the frequency modulation matrix, and the feature word corresponding to the j-th column in the word frequency matrix is the same as the feature word corresponding to the j-th column in the frequency modulation matrix; 0 < i ≤ m, 0 < j ≤ n;
- the conditional probability of the feature word under the log classification is determined according to formula (1);
- the formula (1) is: P(x_j | T_i) = (A(i,j) × B(i,j) + α) / (count(T_i) + n × α), where:
- x_j is the feature word corresponding to the j-th column;
- T_i is the log classification corresponding to the i-th row;
- P(x_j | T_i) is the conditional probability of x_j under T_i;
- A(i,j) is the number of occurrences of the feature word corresponding to the j-th column under the log classification corresponding to the i-th row;
- B(i,j) is the adjustment parameter of the feature word corresponding to the j-th column under the log classification corresponding to the i-th row;
- count(T_i) is the sum of the number of times each feature word appears under T_i;
- α is the smoothing coefficient;
- n is the number of columns of the word frequency matrix or the frequency modulation matrix.
- the classification model is determined according to the conditional probability of each feature word in the sample log under each log classification, and includes:
- for each feature word, determine the sum of the conditional probabilities of the feature word under each log classification; determine the ratio of the conditional probability of the feature word under each log classification to that sum as the feature weight of the feature word under each log classification;
- the feature weight of each feature word in each log classification is formed into a feature weight matrix, and the feature weight matrix is used as the classification model.
- the frequency modulation matrix is used to adjust the word frequency of feature words in log categories with fewer sample logs, so as to amplify the word frequency of the feature words under this log category, and simulate the sample logs in the log category.
- an embodiment of the present invention also provides a log classification device, including:
- the determining unit is used to determine the number of times each feature word appears in the log to be classified
- the classification unit is configured to determine the log classification to which the log to be classified belongs according to the number of occurrences of each feature word in the log to be classified and the classification model; the classification model is determined by the training unit according to the conditional probability of each feature word in the sample logs under each log classification;
- conditional probability of each feature word in each log category is determined by the training unit according to the word frequency model and the frequency modulation model;
- the word frequency model includes the number of times each feature word appears under each log classification, and
- the frequency modulation model includes an adjustment parameter of each feature word under each log classification, and the adjustment parameter is used by the training unit to adjust the number of times the corresponding feature word appears under the corresponding log classification.
- the training unit is specifically used for:
- the conditional probability of the feature word under the log classification is determined according to the number of times the feature word appears in the word frequency model, the adjustment parameter of the feature word in the frequency modulation model, and the sum of the number of times each feature word appears under the log classification.
- the word frequency model is a word frequency matrix of m rows ⁇ n columns
- the frequency modulation model is a frequency modulation matrix of m rows ⁇ n columns
- the log classification corresponding to the i-th row in the word frequency matrix is the same as the log classification corresponding to the i-th row in the frequency modulation matrix, and the feature word corresponding to the j-th column in the word frequency matrix is the same as the feature word corresponding to the j-th column in the frequency modulation matrix; 0 < i ≤ m, 0 < j ≤ n;
- the training unit is specifically used for:
- the formula (1) is: P(x_j | T_i) = (A(i,j) × B(i,j) + α) / (count(T_i) + n × α), where:
- x_j is the feature word corresponding to the j-th column;
- T_i is the log classification corresponding to the i-th row;
- P(x_j | T_i) is the conditional probability of x_j under T_i;
- A(i,j) is the number of occurrences of the feature word corresponding to the j-th column under the log classification corresponding to the i-th row;
- B(i,j) is the adjustment parameter of the feature word corresponding to the j-th column under the log classification corresponding to the i-th row;
- count(T_i) is the sum of the number of times each feature word appears under T_i;
- α is the smoothing coefficient;
- n is the number of columns of the word frequency matrix or the frequency modulation matrix.
- the training unit is specifically used for:
- for each feature word, determine the sum of the conditional probabilities of the feature word under each log classification; determine the ratio of the conditional probability of the feature word under each log classification to that sum as the feature weight of the feature word under each log classification;
- the feature weight of each feature word in each log classification is formed into a feature weight matrix, and the feature weight matrix is used as the classification model.
- an embodiment of the present invention also provides a computing device, including:
- processor, memory, and communication interface among them, the processor, memory and communication interface are connected by a bus;
- the processor is configured to read the program in the memory and execute the above log classification method
- the memory is used to store one or more executable programs, and can store data used by the processor when performing operations.
- the embodiment of the present invention also provides a non-transitory computer-readable storage medium.
- the non-transitory computer-readable storage medium stores computer instructions, which when run on a computer, causes the computer to execute the above log classification method.
- an embodiment of the present invention also provides a computer program product containing instructions.
- the computer program product includes a computer program stored on a non-transitory computer-readable storage medium.
- the computer program includes program instructions. When the program instructions are executed by the computer, the computer executes the above log classification method.
- FIG. 1 is a schematic diagram of a system architecture provided by an embodiment of the present invention
- FIG. 2 is a schematic flowchart of a log classification method provided by an embodiment of the present invention.
- FIG. 3 is a schematic diagram of a process for determining conditional probability according to an embodiment of the present invention.
- FIG. 4 is a schematic diagram of a process for determining feature weights according to an embodiment of the present invention.
- FIG. 5 is a schematic flowchart of another log classification method provided by an embodiment of the present invention.
- FIG. 6 is a schematic structural diagram of a log classification device provided by an embodiment of the present invention.
- FIG. 7 is a schematic structural diagram of a computing device provided by this application.
- Bayesian classification is a general term for a class of classification algorithms, which are based on Bayes' theorem, so they are collectively referred to as Bayesian classification.
- Naive Bayesian classification is the simplest and most common classification method in Bayesian classification.
- Bayes' theorem is named after the British mathematician Thomas Bayes and addresses the relationship between two conditional probabilities: simply put, how to obtain P(B|A) when P(A|B) is known.
- Naive Bayes further assumes that the features A are mutually independent given a certain result B.
- the Bayesian algorithm calculates the probability of occurrence P(B|A) for each possible result B and selects the most probable one.
- the calculation method can be attributed to the Bayesian formula.
- the Bayesian formula can be as shown in formula (2): P(B|A) = P(A|B) × P(B) / P(A).
- each probability has a specific name:
- P(B) is the probability of event B occurring in the sample space, also called the prior probability of event B.
- P(A) is the probability of event A occurring in the sample space, also called the prior probability of event A.
- P(A|B) is the conditional probability of A after B occurs, and is called the likelihood function.
- P(B|A) is the conditional probability of B after A occurs, and is called the posterior probability.
- P(A|B)/P(A) is the adjustment factor, also known as the standardized likelihood.
- the basic method of Naive Bayes: on the basis of statistical data, use the conditional probability formula to calculate the probability that a sample with the current features belongs to each category, and select the category with the largest probability. For a given item to be classified, find the probability of each category appearing under the condition that this item appears; whichever probability is largest determines the category of the item to be classified.
- x = {a1, a2, ..., am} is the item to be classified, and each a is a characteristic attribute of x;
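As an illustrative sketch (not part of the patent text), the Naive Bayes decision rule described above can be written as follows; the class names, priors, and conditional probabilities below are hypothetical values:

```python
def naive_bayes_classify(features, class_priors, cond_probs):
    """Pick the class B maximizing P(B) * prod_j P(a_j | B)."""
    best_class, best_score = None, -1.0
    for cls, prior in class_priors.items():
        score = prior
        for a in features:
            # a small floor stands in for smoothing, so an unseen word
            # does not zero out the whole product
            score *= cond_probs[cls].get(a, 1e-6)
        if score > best_score:
            best_class, best_score = cls, score
    return best_class

priors = {"http error": 0.6, "db error": 0.4}
cond = {
    "http error": {"connect": 0.3, "timeout": 0.4, "jdbc": 0.01},
    "db error":   {"connect": 0.1, "timeout": 0.1, "jdbc": 0.5},
}
print(naive_bayes_classify(["jdbc", "timeout"], priors, cond))  # prints: db error
```

Here 0.6 × 0.01 × 0.4 = 0.0024 for http error versus 0.4 × 0.5 × 0.1 = 0.02 for db error, so db error wins.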
- Fig. 1 exemplarily shows the system architecture applicable to the log classification method provided by the embodiment of the present invention.
- the system architecture may include a data source module, a front-end module, a back-end module, a classification algorithm module, and a database; the functions of each module are as follows:
- Data source module: provides the error log text used for model training in the embodiment of the present invention, which may also be referred to as source error logs.
- Front-end module: responsible for providing a web interface, mainly used to display log classification information and to provide users with operation entrances such as data management.
- Back-end module: mainly used for log processing; responsible for pulling the original log text from the data source, cleaning it (filtering valueless text content by means such as regular matching) and de-duplicating it (merging samples with too high similarity), and finally storing the generated sample set (training set) in the database.
- the back-end module is also responsible for providing data operation interfaces, automatically calling the classification algorithm module for model training, and storing model parameters in the database.
- Classification algorithm module: responsible for training the classifier model and classifying sample logs.
- Database: used to store processed standardized sample logs (the error sample log set), frequency modulation matrix information, configuration data, classification information and other types of data.
- FIG. 2 exemplarily shows the flow of a log classification method provided by an embodiment of the present invention.
- the flow may be executed by a log classification device, which may be located in a classification algorithm module or may itself be the classification algorithm module.
- the process specifically includes:
- Step 201 Determine the number of occurrences of each feature word in the log to be classified
- Step 202 Determine the log category to which the log to be classified belongs according to the number of occurrences of each feature word in the log to be classified and the classification model.
- a characteristic word refers to a word or phrase determined from multiple sample logs in a sample set. Since the sample log is essentially a text format and cannot be directly involved in calculation, the sample log needs to be vectorized first.
- the word set model can be used to vectorize the sample logs. With words as the basic processing unit, all words in the sample set are first collected to obtain a vocabulary of size N, and each sample log is mapped into an N-dimensional vector, where the value of each dimension represents the number of occurrences of the corresponding feature word in the sample log (that is, the word frequency of the feature word in the sample log); the N-dimensional vector thus reflects the word frequency information of the sample log.
- during text vectorization, the embodiment of the present invention can use n-word combinations to split the text, combining adjacent words of length n into new features and adding them to the vocabulary, where n can be set empirically; for example, when n is set to 2, two consecutive words in a sample log can be combined into one word to obtain a new feature word.
- splitting the text with n-word combinations can effectively retain semantic feature words.
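A minimal sketch of such n-word combination during vectorization (the tokenized log below is a hypothetical example):

```python
def ngram_features(tokens, n=2):
    """Split a tokenized log into unigrams plus adjacent n-word combinations."""
    combos = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return list(tokens) + combos

# "connect time out" yields the combined features "connect time" and "time out",
# which preserve more semantics than the single words alone.
print(ngram_features(["connect", "time", "out"]))
```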
- the log to be classified can also be vectorized, for example generating a vector of length 10: (0 1 1 0 1 1 1 1 0); then, according to the vector generated from the log to be classified and the classification model, combined with the Bayesian classification algorithm, the log classification to which the log to be classified belongs is determined.
- the classification model is determined according to the conditional probability of each feature word in the sample log under each log classification, where the conditional probability of each feature word under each log classification is based on the word frequency model and frequency modulation The model is determined.
- the word frequency model includes the number of times each feature word appears in each log category.
- the word frequency model may be expressed in the form of a word frequency matrix, or may be expressed in a word frequency array or other forms.
- the word frequency model can be determined according to the characteristic words in each sample log in the sample set.
- the sample logs in the sample set are as shown in Table 1; that is, there are three log classifications in the sample set, namely http error, db error, and redis error. http error includes sample log 1, sample log 2, and sample log 3; db error includes sample log 4, sample log 5, sample log 6, and sample log 7; redis error includes sample log 8 and sample log 9. Each sample log corresponds to its own vector; for example, the vector corresponding to sample log 1 is (2 0 3 0 4 0 0 0 3).
- the word frequency matrix generated after the statistics can be as shown in Table 2. For example, the number of occurrences of async in http error is 5, the number of occurrences of async in db error is 0, and the number of occurrences of async in redis error is 1. It can be observed that if the number of occurrences of a feature word in a log category is very high, its correlation with this category is generally also very high.
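A sketch of how such a word frequency matrix could be accumulated from a sample set; the two-log sample set and its feature words here are hypothetical, not the patent's Table 1/Table 2 data:

```python
import numpy as np

# Hypothetical mini sample set; each log is already split into feature words.
samples = [
    ("http error", ["connect", "timeout", "timeout"]),
    ("db error",   ["jdbc", "mysql", "timeout"]),
]
classes = sorted({c for c, _ in samples})                    # matrix rows
vocab = sorted({w for _, words in samples for w in words})   # matrix columns

A = np.zeros((len(classes), len(vocab)), dtype=int)          # word frequency matrix
for cls, words in samples:
    for w in words:
        A[classes.index(cls), vocab.index(w)] += 1

print(A)  # row i holds the word frequencies of class i over the vocabulary
```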
- the frequency modulation model can be determined according to the word frequency model.
- the frequency modulation model includes adjustment parameters for each feature word in each log category. The adjustment parameters are used to adjust the number of times the corresponding feature word is in the corresponding log category.
- the frequency modulation model can be expressed in the form of a frequency modulation matrix, or expressed as a frequency modulation array or in other forms.
- the frequency modulation matrix is an adjustment to the word frequency matrix. Its number of rows and columns is consistent with the word frequency matrix.
- the frequency modulation matrix is used to improve the naive Bayes classification algorithm.
- the frequency modulation matrix includes the adjustment parameters of each feature word under each log classification, and the adjustment parameters are used to adjust the number of times (the word frequency) of the feature word under the corresponding log classification according to manual rules. For example, feature words such as jdbc and mysql appear in db error log information in most cases; generally, if this type of feature word appears, it can be concluded that the log information belongs to the db error category.
- the frequency modulation matrix is a matrix of artificial rules, and the initial parameter of each item is 1, that is, it is not adjusted by default.
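A sketch of how such a frequency modulation matrix might be initialized and then adjusted by a manual rule; the class names, vocabulary, and the 10x factor are illustrative assumptions, not values from the patent:

```python
import numpy as np

classes = ["http error", "db error", "redis error"]
vocab = ["connect", "timeout", "jdbc", "mysql"]

# All entries start at 1: by default no word frequency is adjusted.
B = np.ones((len(classes), len(vocab)))

# Manual rule: jdbc and mysql strongly indicate db error, so amplify
# their word frequencies under that class (the factor 10 is illustrative).
for word in ("jdbc", "mysql"):
    B[classes.index("db error"), vocab.index(word)] = 10.0

print(B)
```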
- the conditional probability of each feature word under each log classification can be determined according to the word frequency matrix and the frequency modulation matrix.
- Step 301: Determine the sum of the number of times each feature word appears under each log classification, that is, count(T_i) is obtained by summing A(i,j) over all columns j.
- T_i is the log classification; count(T_i) is the sum of the number of times each feature word appears under T_i; A(i,j) is the number of times feature word x_j appears under T_i, i.e., its word frequency.
- For example, when T_i is http error or db error, the corresponding sums count(http) and count(db) are obtained in the same way; when T_i is redis error, the sum of the number of occurrences of each feature word is count(redis) = 50.
- Step 302 Determine the conditional probability of the feature word in the log classification according to the number of times the feature word is in the word frequency model, the adjustment parameters of the feature word in the frequency modulation model, and the sum of the number of times each feature word appears under the log classification.
- the word frequency model is a word frequency matrix with m rows ⁇ n columns
- the frequency modulation model is a frequency modulation matrix with m rows ⁇ n columns.
- the log classification corresponding to the i-th row in the word frequency matrix is the same as the log classification corresponding to the i-th row in the frequency modulation matrix, and the feature word corresponding to the j-th column in the word frequency matrix is the same as the feature word corresponding to the j-th column in the frequency modulation matrix; 0 < i ≤ m, 0 < j ≤ n.
- the conditional probability of a feature word under a log classification can be determined according to formula (1): P(x_j | T_i) = (A(i,j) × B(i,j) + α) / (count(T_i) + n × α).
- x_j is the feature word corresponding to the j-th column;
- T_i is the log classification corresponding to the i-th row;
- A(i,j) is the number of occurrences of the feature word corresponding to the jth column in the log classification corresponding to the ith row;
- B(i,j) is the adjustment parameter of the feature word corresponding to the jth column in the log classification corresponding to the ith row;
- count(T i ) is the sum of the number of times each feature word appears under T i;
- α is a smoothing coefficient, which adds a small word frequency value to all feature words and is used to avoid the negative impact on the classification calculation when a word frequency is 0 and the conditional probability would therefore also be 0.
- n is the number of columns of the word frequency matrix or the frequency modulation matrix.
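Reading formula (1) as P(x_j|T_i) = (A(i,j)·B(i,j) + α) / (count(T_i) + n·α), the conditional probability matrix can be computed as below. The word frequencies, the adjustment factor, and the choice to compute count(T_i) as the row sum of the adjusted frequencies A·B are all assumptions for illustration, not values or details fixed by the patent:

```python
import numpy as np

A = np.array([[5.0, 2.0, 0.0],    # word frequency matrix (2 classes x 3 words),
              [0.0, 1.0, 4.0]])   # illustrative counts
B = np.ones_like(A)               # frequency modulation matrix, default 1
B[1, 2] = 10.0                    # manual amplification of one feature word
alpha = 1.0                       # smoothing coefficient
n = A.shape[1]                    # number of columns

# count(T_i): sum of (adjusted) word frequencies per class -- an assumption,
# since the text does not say whether the sum uses adjusted frequencies.
count = (A * B).sum(axis=1, keepdims=True)
P = (A * B + alpha) / (count + n * alpha)     # formula (1), element-wise

print(P)
```

With these numbers the amplified feature word dominates its class row (41/44 ≈ 0.93), which is exactly the simulated-sample effect the frequency modulation matrix aims for.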
- a conditional probability matrix composed of the conditional probability of each feature word in each log classification can be used as the classification model.
- the classification model can be as shown in Table 4.
- the conditional probabilities can be normalized to obtain a new matrix that better reflects the degree of influence of each feature word in different categories; the normalized values are called the feature weights of the feature words. The higher the weight of a feature word in a certain category, the higher the probability that a sample log carrying this feature word is classified into this category.
- the feature weight matrix can then be extracted, as shown in the flowchart of Figure 4.
- Step 401: For each feature word, determine the sum of the conditional probabilities of the feature word under each log classification; determine the ratio of the conditional probability of the feature word under each log classification to that sum as the feature weight of the feature word under each log classification.
- the feature weight of each feature word under each log classification can be determined according to formula (4), where formula (4) can be: W(i,j) = P(x_j | T_i) / Σ_k P(x_j | T_k), summed over k = 1 to m;
- W(i,j) is the feature weight of x_j under T_i; m is the number of rows of the word frequency matrix or the frequency modulation matrix.
- Step 402: The feature weights of each feature word under each log classification are formed into a feature weight matrix, and the feature weight matrix is used as the classification model.
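The normalization of Steps 401 and 402 amounts to dividing each entry of the conditional probability matrix by its column sum; the matrix values below are illustrative, not taken from the patent:

```python
import numpy as np

# P: conditional probability matrix (m classes x n feature words), illustrative.
P = np.array([[0.6, 0.3, 0.1],
              [0.1, 0.2, 0.9]])

# Formula (4) as read above: each entry divided by its column sum, so the
# weights of one feature word across all classes sum to 1.
W = P / P.sum(axis=0, keepdims=True)

print(W)
```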
- another log classification process is provided below, as shown in FIG. 5; it is specifically as follows:
- the left half of the process is the model training process.
- the training set is obtained.
- the training set includes each sample log; the text of each sample log is vectorized, the word frequency of each feature word under each log classification is determined, the conditional probability of each feature word under each log classification is calculated, and the classification model is then generated.
- the right half of the process is the model use process: obtain the log to be classified, vectorize it, combine the classification model and use the Bayesian formula to calculate the probability of the log to be classified under each log classification, and then determine the log classification corresponding to the maximum probability as the log classification to which the log to be classified belongs.
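The model use process can be sketched as below. The weight matrix, class names, and the log-domain scoring (a common, numerically stable way to evaluate the Bayesian product) are illustrative assumptions rather than the patent's exact computation:

```python
import numpy as np

classes = ["http error", "db error"]
W = np.array([[0.8, 0.7, 0.1],   # trained feature weight matrix (hypothetical)
              [0.2, 0.3, 0.9]])

def classify(x, W, classes, eps=1e-12):
    """Score each class by sum_j x_j * log W(i,j) (a log-domain product of
    weights raised to the word counts) and return the top-scoring class."""
    scores = (np.log(W + eps) * x).sum(axis=1)
    return classes[int(np.argmax(scores))]

x = np.array([0, 1, 2])          # vectorized log to be classified
print(classify(x, W, classes))   # prints: db error
```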
- Sample imbalance is a common problem in the field of machine learning. Taking classification as an example, ideally the numbers of samples of different categories in the sample set should be evenly distributed, that is, each category should have enough samples for model training. Under realistic conditions, however, imbalanced sample distributions are widespread. In the field of log classification, logs of different levels and types often appear at different frequencies. For example, http connect time out is a common network request exception with a high probability of occurrence that may happen every day, while an OOM (out of memory) of the JVM (Java Virtual Machine) rarely appears but is a very serious error. In the sample set there are obviously far more http exception samples than JVM exception samples, which causes the sample imbalance problem and in turn affects the classification accuracy for JVM exception samples.
- Sample labeling is very laborious. To train a high-quality model, the size of the sample set is a critical decisive factor. In the past, samples had to be labeled manually one by one, which required a great deal of manpower for thousands of samples.
- after collecting enough initial features, we use the frequency modulation matrix to set a very large adjustment parameter for such features (for example, 1000 or more), then classify the sample set to be classified and use the result as the classification label; most samples fall correctly into the corresponding classification, and the small number of samples that do not contain the initial features fall into a default unknown classification and are then labeled manually.
- Model classification may be wrong. Under a naive Bayesian classification algorithm based on the word frequency model alone, a problem arises: we find that a sample is classified into the wrong category, manually correct this sample and put it into the sample set, retrain the model, and classify the sample again, yet the resulting model still gives the previous misclassification. This is because the word frequency model performs word frequency statistics over all samples under the same category; adjusting a single sample is only a drop in the bucket and cannot correct the model.
- FIG. 6 exemplarily shows the structure of a log classification device provided by an embodiment of the present invention, and the device can execute the flow of the log classification method.
- the device includes:
- the determining unit 601 is configured to determine the number of times each feature word appears in the log to be classified;
- the classification unit 602 is configured to determine the log classification to which the log to be classified belongs according to the number of occurrences of each feature word in the log to be classified and the classification model; the classification model is determined by the training unit 603 according to the conditional probability of each feature word in the sample logs under each log classification;
- conditional probability of each feature word in each log category is determined by the training unit 603 according to the word frequency model and the frequency modulation model;
- the word frequency model includes the number of times each feature word appears under each log classification, and
- the frequency modulation model includes adjustment parameters of each feature word under each log classification, and the adjustment parameters are used by the training unit 603 to adjust the number of times the corresponding feature word appears under the corresponding log classification.
- the training unit 603 is specifically configured to:
- the conditional probability of the feature word under the log classification is determined according to the number of times the feature word appears in the word frequency model, the adjustment parameter of the feature word in the frequency modulation model, and the sum of the number of times each feature word appears under the log classification.
- the word frequency model is a word frequency matrix of m rows ⁇ n columns
- the frequency modulation model is a frequency modulation matrix of m rows ⁇ n columns
- the log classification corresponding to the i-th row in the word frequency matrix is the same as the log classification corresponding to the i-th row in the frequency modulation matrix, and the feature word corresponding to the j-th column in the word frequency matrix is the same as the feature word corresponding to the j-th column in the frequency modulation matrix; 0 < i ≤ m, 0 < j ≤ n;
- the training unit 603 is specifically used for:
- the formula (1) is: P(x_j | T_i) = (A(i,j) × B(i,j) + α) / (count(T_i) + n × α), where:
- x_j is the feature word corresponding to the j-th column;
- T_i is the log classification corresponding to the i-th row;
- P(x_j | T_i) is the conditional probability of x_j under T_i;
- A(i,j) is the number of occurrences of the feature word corresponding to the j-th column under the log classification corresponding to the i-th row;
- B(i,j) is the adjustment parameter of the feature word corresponding to the j-th column under the log classification corresponding to the i-th row;
- count(T_i) is the sum of the number of times each feature word appears under T_i;
- α is the smoothing coefficient;
- n is the number of columns of the word frequency matrix or the frequency modulation matrix;
- the training unit 603 is specifically configured to:
- for each feature word, determine the sum of the conditional probabilities of the feature word under each log classification; determine the ratio of the conditional probability of the feature word under each log classification to that sum as the feature weight of the feature word under each log classification;
- the feature weight of each feature word in each log classification is formed into a feature weight matrix, and the feature weight matrix is used as the classification model.
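The feature-weight construction just described (each conditional probability divided by its sum across log classifications) can be sketched as follows; the function name and the two-class input values are hypothetical, not taken from the description.

```python
# Sketch: normalize each feature word's conditional probabilities across log
# classifications to obtain the feature weight matrix used as the classification model.

def feature_weights(cond_prob):
    """cond_prob[i][j] = P(x_j | T_i); returns weights[i][j] = P(x_j | T_i) / column sum."""
    m, n = len(cond_prob), len(cond_prob[0])
    # Sum of each feature word's conditional probabilities over all log classifications.
    col_sum = [sum(cond_prob[i][j] for i in range(m)) for j in range(n)]
    return [[cond_prob[i][j] / col_sum[j] for j in range(n)] for i in range(m)]


# Hypothetical two-class, two-word example:
weights = feature_weights([[0.6, 0.1],
                           [0.2, 0.3]])
```

Each column of the resulting matrix sums to 1, so the weight expresses how strongly a feature word indicates one log classification relative to the others.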
- the present application also provides a computing device.
- the computing device includes at least one processor 720, configured to implement any of the methods provided by the embodiments of the present application, such as the method shown in FIG. 2.
- the computing device 700 may also include at least one memory 730 for storing program instructions and/or data.
- the memory 730 and the processor 720 are coupled.
- the coupling in the embodiments of the present application is an indirect coupling or communication connection between devices, units or modules, and may be in electrical, mechanical or other forms, and is used for information exchange between devices, units or modules.
- the processor 720 may operate in cooperation with the memory 730.
- the processor 720 may execute program instructions stored in the memory 730. At least one of the at least one memory may be included in the processor.
- each step of the above method can be completed by an integrated logic circuit of hardware in the processor or instructions in the form of software.
- the steps of the method disclosed in the embodiments of the present application may be directly embodied as being executed and completed by a hardware processor, or executed and completed by a combination of hardware and software modules in the processor.
- the software module can be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register.
- the storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above method in combination with its hardware. To avoid repetition, it will not be described in detail here.
- the processor in the embodiment of the present application may be an integrated circuit chip with signal processing capability.
- the steps of the foregoing method embodiments can be completed by hardware integrated logic circuits in the processor or instructions in the form of software.
- the above-mentioned processor may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
- the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
- the steps of the method disclosed in the embodiments of the present application can be directly embodied as being executed and completed by a hardware decoding processor, or executed and completed by a combination of hardware and software modules in the decoding processor.
- the software module can be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register.
- the storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above method in combination with its hardware.
- the memory in the embodiments of the present application may be a volatile memory or a non-volatile memory, or may include both volatile and non-volatile memory.
- the non-volatile memory can be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory.
- the volatile memory may be random access memory (RAM), which is used as an external cache.
- By way of example but not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM).
- the computing device 700 may further include a communication interface 710 for communicating with other devices through a transmission medium, so that the apparatus used in the computing device 700 can communicate with other devices.
- the communication interface may be a transceiver, circuit, bus, module, or other type of communication interface.
- when the communication interface is a transceiver, the transceiver may include an independent receiver and an independent transmitter; it may also be a transceiver with integrated transceiving functions, or an interface circuit.
- the computing device 700 may also include a communication line 740.
- the communication interface 710, the processor 720, and the memory 730 may be connected to each other through a communication line 740;
- the communication line 740 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like.
- the communication line 740 can be divided into an address bus, a data bus, a control bus, and the like. For ease of representation, only one thick line is used in FIG. 7, but it does not mean that there is only one bus or one type of bus.
- the embodiments of the present invention also provide a non-transitory computer-readable storage medium.
- the non-transitory computer-readable storage medium stores computer instructions which, when run on a computer, cause the computer to execute the above log classification method.
- the embodiments of the present application also provide a computer program product.
- the computer program product includes a computer program stored on a non-transitory computer-readable storage medium; the computer program includes program instructions which, when executed by a computer, cause the computer to execute the above log classification method.
- These computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device that implements the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
- These computer program instructions can also be loaded onto a computer or other programmable data processing equipment, so that a series of operational steps are executed on the computer or other programmable equipment to produce computer-implemented processing; the instructions executed on the computer or other programmable equipment thereby provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
Abstract
Description
Word frequency matrix (number of occurrences of each feature word under each log classification):

log classification | async | at | connection | db | error | jdbc | mysql | redis | timeout |
---|---|---|---|---|---|---|---|---|---|
http error | 5 | 2 | 10 | 0 | 12 | 0 | 0 | 0 | 8 |
db error | 0 | 5 | 8 | 10 | 15 | 22 | 22 | 0 | 12 |
redis error | 1 | 12 | 4 | 0 | 8 | 0 | 0 | 20 | 5 |
Frequency modulation matrix (adjustment parameter of each feature word under each log classification):

log classification | async | at | connection | db | error | jdbc | mysql | redis | timeout |
---|---|---|---|---|---|---|---|---|---|
http error | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
db error | 0.2 | 1 | 1 | 20 | 1 | 20 | 20 | 1 | 1 |
redis error | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
Conditional probability matrix (P(x_j | T_i) computed by formula (1)):

log classification | async | at | connection | db | error | jdbc | mysql | redis | timeout |
---|---|---|---|---|---|---|---|---|---|
http error | 0.13 | 0.07 | 0.24 | 0.02 | 0.28 | 0.02 | 0.02 | 0.02 | 0.20 |
db error | 0.01 | 0.06 | 0.09 | 1.95 | 0.16 | 4.28 | 4.28 | 0.01 | 0.13 |
redis error | 0.03 | 0.22 | 0.08 | 0.02 | 0.15 | 0.02 | 0.02 | 0.36 | 0.10 |
Feature weight matrix (each conditional probability divided by its sum across log classifications):

log classification | async | at | connection | db | error | jdbc | mysql | redis | timeout |
---|---|---|---|---|---|---|---|---|---|
http error | 0.75 | 0.19 | 0.58 | 0.01 | 0.48 | 0.01 | 0.01 | 0.06 | 0.46 |
db error | 0.06 | 0.17 | 0.21 | 0.98 | 0.26 | 0.99 | 0.99 | 0.03 | 0.30 |
redis error | 0.19 | 0.64 | 0.21 | 0.01 | 0.26 | 0.00 | 0.00 | 0.92 | 0.24 |
Claims (11)
- A log classification method, characterized by comprising: determining the number of occurrences of each feature word in a log to be classified; and determining, according to the number of occurrences of each feature word in the log to be classified and a classification model, the log classification to which the log to be classified belongs; wherein the classification model is determined according to the conditional probability of each feature word in sample logs under each log classification; the conditional probability of each feature word under each log classification is determined according to a word frequency model and a frequency modulation model; the word frequency model includes the number of occurrences of each feature word under each log classification, the frequency modulation model includes an adjustment parameter of each feature word under each log classification, and the adjustment parameter is used to adjust the count of the corresponding feature word under the corresponding log classification.
- The method according to claim 1, characterized in that determining the conditional probability of each feature word under each log classification according to the word frequency model and the frequency modulation model comprises: performing the following operations for each feature word under each log classification: determining the sum of the numbers of occurrences of all feature words under the log classification; and determining the conditional probability of the feature word under the log classification according to the count of the feature word in the word frequency model, the adjustment parameter of the feature word in the frequency modulation model, and the sum of the numbers of occurrences of all feature words under the log classification.
- The method according to claim 2, characterized in that the word frequency model is a word frequency matrix of m rows × n columns, and the frequency modulation model is a frequency modulation matrix of m rows × n columns; the log classification corresponding to the i-th row of the word frequency matrix is the same as the log classification corresponding to the i-th row of the frequency modulation matrix, and the feature word corresponding to the j-th column of the word frequency matrix is the same as the feature word corresponding to the j-th column of the frequency modulation matrix; 0 < i ≤ m, 0 < j ≤ n; determining the conditional probability of the feature word under the log classification according to the count of the feature word in the word frequency model, the adjustment parameter of the feature word in the frequency modulation model, and the sum of the numbers of occurrences of all feature words under the log classification comprises: determining the conditional probability of the feature word under the log classification according to formula (1); the formula (1) being: P(x_j | T_i) = (A(i,j) · B(i,j) + α) / (count(T_i) + n · α); wherein x_j is the feature word corresponding to the j-th column; T_i is the log classification corresponding to the i-th row; P(x_j | T_i) is the conditional probability of x_j under T_i; A(i,j) is the number of occurrences of the feature word corresponding to the j-th column in the log classification corresponding to the i-th row; B(i,j) is the adjustment parameter of the feature word corresponding to the j-th column in the log classification corresponding to the i-th row; count(T_i) is the sum of the numbers of occurrences of all feature words under T_i; α is a smoothing coefficient; and n is the number of columns of the word frequency matrix or the frequency modulation matrix.
- The method according to claim 1, characterized in that determining the classification model according to the conditional probability of each feature word in the sample logs under each log classification comprises: for each feature word, determining the sum of the conditional probabilities of the feature word under all log classifications; determining the ratio of the conditional probability of the feature word under each log classification to that sum as the feature weight of the feature word under each log classification; and forming a feature weight matrix from the feature weights of all feature words under all log classifications, the feature weight matrix serving as the classification model.
- A log classification device, characterized by comprising a determining unit, a classification unit and a training unit; the determining unit being configured to determine the number of occurrences of each feature word in a log to be classified; the classification unit being configured to determine, according to the number of occurrences of each feature word in the log to be classified and a classification model, the log classification to which the log to be classified belongs; the classification model being determined by the training unit according to the conditional probability of each feature word in sample logs under each log classification; wherein the conditional probability of each feature word under each log classification is determined by the training unit according to a word frequency model and a frequency modulation model; the word frequency model includes the number of occurrences of each feature word under each log classification, the frequency modulation model includes an adjustment parameter of each feature word under each log classification, and the adjustment parameter is used by the training unit to adjust the count of the corresponding feature word under the corresponding log classification.
- The device according to claim 5, characterized in that the training unit is specifically configured to perform the following operations for each feature word under each log classification: determine the sum of the numbers of occurrences of all feature words under the log classification; and determine the conditional probability of the feature word under the log classification according to the count of the feature word in the word frequency model, the adjustment parameter of the feature word in the frequency modulation model, and the sum of the numbers of occurrences of all feature words under the log classification.
- The device according to claim 6, characterized in that the word frequency model is a word frequency matrix of m rows × n columns, and the frequency modulation model is a frequency modulation matrix of m rows × n columns; the log classification corresponding to the i-th row of the word frequency matrix is the same as the log classification corresponding to the i-th row of the frequency modulation matrix, and the feature word corresponding to the j-th column of the word frequency matrix is the same as the feature word corresponding to the j-th column of the frequency modulation matrix; 0 < i ≤ m, 0 < j ≤ n; the training unit is specifically configured to determine the conditional probability of the feature word under the log classification according to formula (1); the formula (1) being: P(x_j | T_i) = (A(i,j) · B(i,j) + α) / (count(T_i) + n · α); wherein x_j is the feature word corresponding to the j-th column; T_i is the log classification corresponding to the i-th row; P(x_j | T_i) is the conditional probability of x_j under T_i; A(i,j) is the number of occurrences of the feature word corresponding to the j-th column in the log classification corresponding to the i-th row; B(i,j) is the adjustment parameter of the feature word corresponding to the j-th column in the log classification corresponding to the i-th row; count(T_i) is the sum of the numbers of occurrences of all feature words under T_i; α is a smoothing coefficient; and n is the number of columns of the word frequency matrix or the frequency modulation matrix.
- The device according to claim 5, characterized in that the training unit is specifically configured to: for each feature word, determine the sum of the conditional probabilities of the feature word under all log classifications; determine the ratio of the conditional probability of the feature word under each log classification to that sum as the feature weight of the feature word under each log classification; and form a feature weight matrix from the feature weights of all feature words under all log classifications, the feature weight matrix serving as the classification model.
- A computing device, characterized by comprising a processor, a memory and a communication interface, wherein the processor, the memory and the communication interface are connected through a bus; the processor is configured to read a program in the memory and execute the method according to any one of claims 1 to 4; the memory is configured to store one or more executable programs and to store data used by the processor when performing operations.
- A non-transitory computer-readable storage medium, characterized in that the non-transitory computer-readable storage medium stores computer instructions for causing a computer to execute the method according to any one of claims 1 to 4.
- A computer program product, characterized in that the computer program product includes a computer program stored on a non-transitory computer-readable storage medium, the computer program includes program instructions, and when the program instructions are executed by a computer, the computer is caused to execute the method according to any one of claims 1 to 4.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911060648.7 | 2019-11-01 | ||
CN201911060648.7A CN110929028A (zh) | 2019-11-01 | 2019-11-01 | 一种日志分类方法及装置 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021082780A1 true WO2021082780A1 (zh) | 2021-05-06 |
Family
ID=69850230
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/115409 WO2021082780A1 (zh) | 2019-11-01 | 2020-09-15 | 一种日志分类方法及装置 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110929028A (zh) |
WO (1) | WO2021082780A1 (zh) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110929028A (zh) * | 2019-11-01 | 2020-03-27 | 深圳前海微众银行股份有限公司 | 一种日志分类方法及装置 |
CN112000502B (zh) * | 2020-08-11 | 2023-04-07 | 杭州安恒信息技术股份有限公司 | 海量错误日志的处理方法、装置、电子装置及存储介质 |
CN112199227B (zh) * | 2020-10-14 | 2022-09-27 | 北京紫光展锐通信技术有限公司 | 参数确定方法及相关产品 |
CN113704469B (zh) * | 2021-08-18 | 2022-04-15 | 百融至信(北京)征信有限公司 | 一种基于贝叶斯定理的短文本分类数据集矫正方法及系统 |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090234825A1 (en) * | 2008-02-28 | 2009-09-17 | Fujitsu Limited | Information distribution system and information distribution method |
CN103810264A (zh) * | 2014-01-27 | 2014-05-21 | 西安理工大学 | 基于特征选择的网页文本分类方法 |
CN105446495A (zh) * | 2015-12-08 | 2016-03-30 | 北京搜狗科技发展有限公司 | 一种候选排序方法和装置 |
CN105893225A (zh) * | 2015-08-25 | 2016-08-24 | 乐视网信息技术(北京)股份有限公司 | 一种错误自动处理方法及装置 |
CN110929028A (zh) * | 2019-11-01 | 2020-03-27 | 深圳前海微众银行股份有限公司 | 一种日志分类方法及装置 |
- 2019-11-01 CN CN201911060648.7A patent/CN110929028A/zh active Pending
- 2020-09-15 WO PCT/CN2020/115409 patent/WO2021082780A1/zh active Application Filing
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090234825A1 (en) * | 2008-02-28 | 2009-09-17 | Fujitsu Limited | Information distribution system and information distribution method |
CN103810264A (zh) * | 2014-01-27 | 2014-05-21 | 西安理工大学 | 基于特征选择的网页文本分类方法 |
CN105893225A (zh) * | 2015-08-25 | 2016-08-24 | 乐视网信息技术(北京)股份有限公司 | 一种错误自动处理方法及装置 |
CN105446495A (zh) * | 2015-12-08 | 2016-03-30 | 北京搜狗科技发展有限公司 | 一种候选排序方法和装置 |
CN110929028A (zh) * | 2019-11-01 | 2020-03-27 | 深圳前海微众银行股份有限公司 | 一种日志分类方法及装置 |
Also Published As
Publication number | Publication date |
---|---|
CN110929028A (zh) | 2020-03-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021082780A1 (zh) | 一种日志分类方法及装置 | |
US10459971B2 (en) | Method and apparatus of generating image characteristic representation of query, and image search method and apparatus | |
WO2021184554A1 (zh) | 数据库异常监测方法、装置、计算机装置及存储介质 | |
US20180349158A1 (en) | Bayesian optimization techniques and applications | |
WO2018090657A1 (zh) | 基于BP_Adaboost模型的信用卡用户违约的预测方法及系统 | |
US20180268296A1 (en) | Machine learning-based network model building method and apparatus | |
Filippi et al. | Parametric bandits: The generalized linear case | |
WO2022077646A1 (zh) | 一种用于图像处理的学生模型的训练方法及装置 | |
US6466946B1 (en) | Computer implemented scalable, incremental and parallel clustering based on divide and conquer | |
WO2019179403A1 (zh) | 基于序列宽深学习的欺诈交易检测方法 | |
WO2022042123A1 (zh) | 图像识别模型生成方法、装置、计算机设备和存储介质 | |
US10747961B2 (en) | Method and device for identifying a sentence | |
WO2020220758A1 (zh) | 一种异常交易节点的检测方法及装置 | |
WO2018040387A1 (zh) | 基于支持向量数据描述的特征提取及分类方法及其系统 | |
WO2018153201A1 (zh) | 深度学习训练方法及装置 | |
CN106599913A (zh) | 一种基于聚类的多标签不平衡生物医学数据分类方法 | |
CN110569289B (zh) | 基于大数据的列数据处理方法、设备及介质 | |
CN109766437A (zh) | 一种文本聚类方法、文本聚类装置及终端设备 | |
US20240037408A1 (en) | Method and apparatus for model training and data enhancement, electronic device and storage medium | |
WO2022116444A1 (zh) | 文本分类方法、装置、计算机设备和介质 | |
WO2019232844A1 (zh) | 手写模型训练方法、手写字识别方法、装置、设备及介质 | |
CN112329862A (zh) | 基于决策树的反洗钱方法及系统 | |
CN110880018B (zh) | 一种卷积神经网络目标分类方法 | |
WO2023016267A1 (zh) | 垃圾评论的识别方法、装置、设备及介质 | |
US20220092452A1 (en) | Automated machine learning tool for explaining the effects of complex text on predictive results |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 20881431 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 20881431 Country of ref document: EP Kind code of ref document: A1 |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 210922) |
|