WO2021082780A1 - 一种日志分类方法及装置 (A log classification method and device) - Google Patents

一种日志分类方法及装置 (A log classification method and device)

Info

Publication number
WO2021082780A1
WO2021082780A1 (PCT/CN2020/115409)
Authority
WO
WIPO (PCT)
Prior art keywords
log
word
classification
feature
feature word
Prior art date
Application number
PCT/CN2020/115409
Other languages
English (en)
French (fr)
Inventor
欧百川
尤嘉
叶金瓒
李泽宇
王雅琪
朱子豪
Original Assignee
深圳前海微众银行股份有限公司 (Shenzhen Qianhai WeBank Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳前海微众银行股份有限公司 (Shenzhen Qianhai WeBank Co., Ltd.)
Publication of WO2021082780A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval of unstructured textual data
    • G06F 16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval of unstructured textual data
    • G06F 16/33 - Querying
    • G06F 16/3331 - Query processing
    • G06F 16/334 - Query execution
    • G06F 16/3346 - Query execution using probabilistic model

Definitions

  • The embodiments of the present invention relate to the field of financial technology (Fintech), and in particular to a log classification method and device.
  • The currently common approach to log classification is text classification algorithms based on machine learning.
  • Text classification algorithms are grounded in statistical theory and use algorithms to give the machine a human-like ability to learn automatically, that is, to perform statistical analysis on known training data to extract regularities and then use those regularities to predict and analyze unknown data. Because machine learning performs well in practice in the field of text classification, it has become the mainstream of log analysis and classification.
  • the embodiment of the present invention provides a log classification method and device, which combines a machine learning algorithm and knowledge engineering to overcome the problem of unbalanced training data in a sample set, thereby improving the accuracy of model classification.
  • The classification model determines the log classification to which the log to be classified belongs; the classification model is determined according to the conditional probability of each feature word in the sample logs under each log classification.
  • The conditional probability of each feature word under each log classification is determined according to a word frequency model and a frequency modulation model; the word frequency model includes the number of times each feature word appears under each log classification, and the frequency modulation model includes an adjustment parameter for each feature word under each log classification, the adjustment parameter being used to adjust the count of the corresponding feature word under the corresponding log classification.
  • Determining the conditional probability of each feature word under each log classification according to the word frequency model and the frequency modulation model includes:
  • performing the following for each feature word under each log classification: determining the sum of the numbers of times the feature words appear under the log classification; and determining the conditional probability of the feature word under the log classification according to the count of the feature word in the word frequency model, the adjustment parameter of the feature word in the frequency modulation model, and the sum of the numbers of times the feature words appear under the log classification.
  • The word frequency model is a word frequency matrix of m rows × n columns, and the frequency modulation model is a frequency modulation matrix of m rows × n columns;
  • the log classification corresponding to the i-th row of the word frequency matrix is the same as the log classification corresponding to the i-th row of the frequency modulation matrix, and the feature word corresponding to the j-th column of the word frequency matrix is the same as the feature word corresponding to the j-th column of the frequency modulation matrix; 0 < i ≤ m, 0 < j ≤ n.
  • Determining the conditional probability of the feature word under the log classification includes determining it according to formula (1).
  • Formula (1) is: P(x_j | T_i) = (A(i,j) × B(i,j) + α) / (count(T_i) + n × α)
  • where x_j is the feature word of the j-th column; T_i is the log classification of the i-th row; P(x_j|T_i) is the conditional probability of x_j under T_i; A(i,j) is the number of times the feature word corresponding to column j appears under the log classification corresponding to row i; B(i,j) is the adjustment parameter of the feature word corresponding to column j under the log classification corresponding to row i; count(T_i) is the sum of the numbers of times the feature words appear under T_i; α is the smoothing coefficient; and n is the number of columns of the word frequency matrix or of the frequency modulation matrix.
  • The classification model is determined according to the conditional probability of each feature word in the sample logs under each log classification, which includes:
  • for each feature word, determining the sum of the conditional probabilities of the feature word under the log classifications, and determining the ratio of the conditional probability of the feature word under each log classification to that sum as the feature weight of the feature word under that log classification;
  • forming a feature weight matrix from the feature weights of the feature words under the log classifications, and using the feature weight matrix as the classification model.
  • The frequency modulation matrix is used to adjust the word frequencies of feature words in log classifications with few sample logs, amplifying the word frequency of such a feature word under that log classification and simulating an increase in the number of sample logs in that classification.
  • an embodiment of the present invention also provides a log classification device, including:
  • the determining unit is used to determine the number of times each feature word appears in the log to be classified
  • The classification unit is configured to determine, according to the numbers of occurrences of the feature words in the log to be classified and the classification model, the log classification to which the log to be classified belongs; the classification model is determined by the training unit according to the conditional probability of each feature word in the sample logs under each log classification.
  • conditional probability of each feature word in each log category is determined by the training unit according to the word frequency model and the frequency modulation model;
  • the word frequency model includes the number of times each feature word appears in each log category, and
  • the frequency modulation model includes an adjustment parameter of each feature word in each log category, and the adjustment parameter is used by the training unit to adjust the number of times the corresponding feature word is in the corresponding log category.
  • the training unit is specifically used for:
  • performing the following for each feature word under each log classification: determining the sum of the numbers of times the feature words appear under the log classification; and determining the conditional probability of the feature word under the log classification according to the count of the feature word in the word frequency model, the adjustment parameter of the feature word in the frequency modulation model, and that sum.
  • the word frequency model is a word frequency matrix of m rows ⁇ n columns
  • the frequency modulation model is a frequency modulation matrix of m rows ⁇ n columns
  • the log classification corresponding to the i-th row of the word frequency matrix is the same as the log classification corresponding to the i-th row of the frequency modulation matrix,
  • and the feature word corresponding to the j-th column of the word frequency matrix is the same as the feature word corresponding to the j-th column of the frequency modulation matrix; 0 < i ≤ m, 0 < j ≤ n;
  • the training unit is specifically used for:
  • Formula (1) is: P(x_j | T_i) = (A(i,j) × B(i,j) + α) / (count(T_i) + n × α)
  • where x_j is the feature word of the j-th column; T_i is the log classification of the i-th row; P(x_j|T_i) is the conditional probability of x_j under T_i; A(i,j) is the number of times the feature word corresponding to column j appears under the log classification corresponding to row i; B(i,j) is the adjustment parameter of the feature word corresponding to column j under the log classification corresponding to row i; count(T_i) is the sum of the numbers of times the feature words appear under T_i; α is the smoothing coefficient; and n is the number of columns of the word frequency matrix or of the frequency modulation matrix.
  • the training unit is specifically used for:
  • For each feature word, determine the sum of the conditional probabilities of the feature word under the log classifications; determine the ratio of the conditional probability of the feature word under each log classification to that sum as the feature weight of the feature word under that log classification;
  • form a feature weight matrix from the feature weights of the feature words under the log classifications, and use the feature weight matrix as the classification model.
  • an embodiment of the present invention also provides a computing device, including:
  • a processor, a memory, and a communication interface, wherein the processor, the memory, and the communication interface are connected by a bus;
  • the processor is configured to read the program in the memory and execute the above log classification method
  • the memory is used to store one or more executable programs, and can store data used by the processor when performing operations.
  • the embodiment of the present invention also provides a non-transitory computer-readable storage medium.
  • the non-transitory computer-readable storage medium stores computer instructions, which when run on a computer, causes the computer to execute the above log classification method.
  • an embodiment of the present invention also provides a computer program product containing instructions.
  • the computer program product includes a computer program stored on a non-transitory computer-readable storage medium.
  • the computer program includes program instructions. When the program instructions are executed by the computer, the computer executes the above log classification method.
  • FIG. 1 is a schematic diagram of a system architecture provided by an embodiment of the present invention
  • FIG. 2 is a schematic flowchart of a log classification method provided by an embodiment of the present invention.
  • FIG. 3 is a schematic diagram of a process for determining conditional probability according to an embodiment of the present invention.
  • FIG. 4 is a schematic diagram of a process for determining feature weights according to an embodiment of the present invention.
  • FIG. 5 is a schematic flowchart of another log classification method provided by an embodiment of the present invention.
  • FIG. 6 is a schematic structural diagram of a log classification device provided by an embodiment of the present invention.
  • FIG. 7 is a schematic structural diagram of a computing device provided by this application.
  • Bayesian classification is a general term for a class of classification algorithms, which are based on Bayes' theorem, so they are collectively referred to as Bayesian classification.
  • Naive Bayesian classification is the simplest and most common classification method in Bayesian classification.
  • Bayes' theorem is named after the British mathematician Bayes and addresses the relationship between two conditional probabilities: simply put, how to obtain the probability P(B|A) when P(A|B) is known.
  • Naive Bayes assumes that the features P(A) are independent given a particular outcome P(B).
  • The Bayesian algorithm calculates the probability of occurrence of P(B|A) from the three known probabilities P(A|B), P(A), and P(B).
  • The calculation method can be reduced to the Bayes formula.
  • The Bayes formula can be as shown in formula (2): P(B|A) = P(A|B) × P(B) / P(A).
  • each probability has a specific name:
  • P(B) is the probability of event B occurring in the sample space, also called the prior probability of event B.
  • P(A) is the probability of event A occurring in the sample space, also called the prior probability of event A.
  • P(A|B) is the conditional probability of A given that B has occurred, and is called the likelihood function.
  • P(B|A) is the conditional probability of B given that A has occurred, and is called the posterior probability.
  • P(A|B)/P(A) is the adjustment factor, also known as the standardized likelihood.
  • The basic method of naive Bayes: on the basis of the statistical data, use the conditional probability formula to calculate the probability that a sample with the current features belongs to each category, and select the category with the largest probability. For a given item to be classified, find the probability of each category appearing given that this item appears; the item to be classified is considered to belong to whichever category has the largest probability.
  • The calculation flow is: (1) x = {a1, a2, ..., am} is the item to be classified, and each a is a feature attribute of x; (2) there is a category set C = {y1, y2, ..., yn}; (3) compute P(y1|x), P(y2|x), ..., P(yn|x); (4) P(yk|x) = max{P(y1|x), P(y2|x), ..., P(yn|x)}, and x is assigned to category yk.
  • Fig. 1 exemplarily shows the system architecture applicable to the log classification method provided by the embodiment of the present invention.
  • the system architecture may include a data source module, a front-end module, a back-end module, a classification algorithm module, and a database; the functions of each module are as follows:
  • Data source module: provides the error log text used for model training in the embodiment of the present invention, which may also be referred to as the source error logs.
  • Front-end module: responsible for providing the web interface, mainly used to display log classification information and to provide users with entry points for operations such as data management.
  • Back-end module: mainly used for log processing; responsible for pulling the original log text from the data source, cleaning it (filtering out valueless text content by regular-expression matching and the like), de-duplicating it (merging samples with excessively high similarity), and finally storing the generated sample set (training set) in the database.
  • The back-end module is also responsible for providing data operation interfaces, automatically invoking the classification algorithm module for model training, and storing the model parameters in the database.
  • Classification algorithm module: responsible for training the classifier model and classifying sample logs.
  • Database: used to store the processed standardized sample logs (the error sample log set), frequency modulation matrix information, configuration data, classification information, and other types of data.
  • FIG. 2 exemplarily shows the flow of a log classification method provided by an embodiment of the present invention.
  • The flow may be executed by a log classification device, which may be located in the classification algorithm module or may be the classification algorithm module itself.
  • the process specifically includes:
  • Step 201: Determine the number of times each feature word appears in the log to be classified.
  • Step 202: Determine, according to the numbers of occurrences of the feature words in the log to be classified and the classification model, the log classification to which the log to be classified belongs.
  • A feature word is a word or phrase determined from the multiple sample logs in the sample set. Since a sample log is essentially text and cannot directly participate in computation, the sample log first needs to be vectorized.
  • A word set model can be used to vectorize the sample logs: with words as the basic processing unit, all the words in the sample set are first collected to obtain a vocabulary of size N, and each sample log is mapped to an N-dimensional vector, where the value in each dimension represents the count of the corresponding feature word in the sample log (that is, the word frequency of the feature word in the sample log); the N-dimensional vector thus reflects the word frequency information of the sample log.
  • During text vectorization, the embodiment of the present invention can split the text using combinations of n words, combining adjacent words of length n into new features and adding them to the vocabulary, where n can be set empirically; for example, when n is set to 2, two consecutive words in a sample log can be combined as one term to obtain a new feature word.
  • Splitting the text with combinations of n words effectively preserves feature words that carry semantics.
  • After the number of occurrences of each feature word in the log to be classified is determined, the log to be classified can likewise be vectorized, for example generating a vector of length 10: (0 1 1 0 1 1 1 1 1 0); the log classification to which the log to be classified belongs is then determined from the vector generated for the log to be classified and the classification model, combined with the Bayesian classification algorithm.
  • the classification model is determined according to the conditional probability of each feature word in the sample log under each log classification, where the conditional probability of each feature word under each log classification is based on the word frequency model and frequency modulation The model is determined.
  • the word frequency model includes the number of times each feature word appears in each log category.
  • the word frequency model may be expressed in the form of a word frequency matrix, or may be expressed in a word frequency array or other forms.
  • the word frequency model can be determined according to the characteristic words in each sample log in the sample set.
  • Suppose the sample logs in the sample set are as shown in Table 1; that is, there are three log classifications in the sample set, namely http error, db error, and redis error. http error includes sample log 1, sample log 2, and sample log 3; db error includes sample log 4, sample log 5, sample log 6, and sample log 7; redis error includes sample log 8 and sample log 9. Each sample log corresponds to its own vector; for example, the vector corresponding to sample log 1 is (2 0 3 0 4 0 0 0 3).
  • The word frequency matrix generated after the statistics are collected can be as shown in Table 2. For example, async appears 5 times under http error, 0 times under db error, and 1 time under redis error. It can be observed that if a feature word appears very frequently under a log classification, its correlation with that classification is generally also very high.
  • the frequency modulation model can be determined according to the word frequency model.
  • The frequency modulation model includes an adjustment parameter for each feature word under each log classification; the adjustment parameters are used to adjust the count of the corresponding feature word under the corresponding log classification.
  • The frequency modulation model may be expressed in the form of a frequency modulation matrix, or in the form of a frequency modulation array or other forms.
  • The frequency modulation matrix is an adjustment to the word frequency matrix; its numbers of rows and columns are consistent with those of the word frequency matrix.
  • the frequency modulation matrix is used to improve the naive Bayes classification algorithm.
  • The frequency modulation matrix includes the adjustment parameter of each feature word under each log classification, and the adjustment parameters are used to adjust the count (word frequency) of a feature word under the corresponding log classification according to manual rules. For example, features such as jdbc and mysql appear in db error log messages in the great majority of cases; generally, if this type of feature word appears, it can be concluded that the log message belongs to the db error classification.
  • The frequency modulation matrix is the matrix embodiment of manual rules; the initial parameter of every entry is 1, that is, no adjustment is made by default.
  • the conditional probability of each feature word under each log classification can be determined according to the word frequency matrix and the frequency modulation matrix.
  • Step 301: Determine the sum of the numbers of times the feature words appear under the log classification.
  • As shown in formula (3): count(T_i) = Σ_{j=1}^{n} A(i,j), where T_i is the log classification; count(T_i) is the sum of the numbers of times the feature words appear under T_i; and A(i,j) is the number of times the keyword x_j appears under T_i, i.e. its word frequency.
  • Taking Table 2 as an example, when T_i is http error, count(http) = 5+2+10+0+12+0+0+0+8 = 37; when T_i is db error, count(db) = 94; and when T_i is redis error, count(redis) = 50.
  • Step 302: Determine the conditional probability of the feature word under the log classification according to the count of the feature word in the word frequency model, the adjustment parameter of the feature word in the frequency modulation model, and the sum of the numbers of times the feature words appear under the log classification.
  • The word frequency model is a word frequency matrix of m rows × n columns, and
  • the frequency modulation model is a frequency modulation matrix of m rows × n columns.
  • The log classification corresponding to the i-th row of the word frequency matrix is the same as the log classification corresponding to the i-th row of the frequency modulation matrix, and the feature word corresponding to the j-th column of the word frequency matrix is the same as the feature word corresponding to the j-th column of the frequency modulation matrix; 0 < i ≤ m, 0 < j ≤ n.
  • The conditional probability of a feature word under the log classification can be determined according to formula (1): P(x_j | T_i) = (A(i,j) × B(i,j) + α) / (count(T_i) + n × α)
  • where x_j is the feature word of the j-th column;
  • T_i is the log classification of the i-th row;
  • A(i,j) is the number of times the feature word corresponding to column j appears under the log classification corresponding to row i;
  • B(i,j) is the adjustment parameter of the feature word corresponding to column j under the log classification corresponding to row i;
  • count(T_i) is the sum of the numbers of times the feature words appear under T_i;
  • α is the smoothing coefficient, which adds a small extra word frequency value to all feature words and is used to reduce the negative impact on the classification calculation of the conditional probability being 0 when a word frequency is 0;
  • n is the number of columns of the word frequency matrix or of the frequency modulation matrix.
  • a conditional probability matrix composed of the conditional probability of each feature word in each log classification can be used as the classification model.
  • the classification model can be as shown in Table 4.
  • The conditional probabilities are normalized to obtain a new matrix that better reflects the degree of influence of each feature word under the different classifications; these values are called the weights of the feature words. The higher the weight of a feature word under a certain classification, the higher the probability that a sample log carrying this feature word is classified into that classification.
  • After the conditional probability matrix is determined, the feature weight matrix can be extracted, specifically as in the flowchart shown in FIG. 4.
  • Step 401: For each feature word, determine the sum of the conditional probabilities of the feature word under the log classifications; determine the ratio of the conditional probability of the feature word under each log classification to that sum as the feature weight of the feature word under that log classification.
  • The feature weight of each feature word under each log classification can be determined according to formula (4): W(i,j) = P(x_j | T_i) / Σ_{k=1}^{m} P(x_j | T_k), where W(i,j) is the feature weight of x_j under T_i, and m is the number of rows of the word frequency matrix or of the frequency modulation matrix.
  • Step 402: Form a feature weight matrix from the feature weights of the feature words under the log classifications, and use the feature weight matrix as the classification model.
  • Another log classification flow is provided below, as shown in FIG. 5; it is specifically as follows:
  • The left half of the flow is the model training process: obtain the training set, which includes the sample logs; vectorize the text of each sample log; determine the word frequency of each feature word under each log classification; calculate the conditional probability of each feature word under each log classification; and then generate the classification model.
  • The right half of the flow is the model use process: obtain the log to be classified, vectorize it, use the Bayes formula together with the classification model to calculate the probability of the log to be classified under each log classification, and then determine the log classification corresponding to the largest probability as the log classification to which the log to be classified belongs.
  • Sample imbalance is a common problem in the field of machine learning. Taking classification as an example, ideally the numbers of samples of the different categories in the sample set should be evenly distributed, that is, every category should have enough samples for model training. Under realistic conditions, however, imbalanced sample distributions are widespread. In the field of log classification, logs of different levels and types often appear at different frequencies. For example, http connect time out is a common network request exception with a high probability of occurrence that may happen every day, whereas an OOM (out of memory) error of the JVM (Java Virtual Machine) rarely appears but is a very serious error. In the sample set, there are obviously many more http exception samples than JVM exception samples, which causes the sample imbalance problem and in turn affects the classification accuracy for JVM exception samples.
  • Sample labeling is a major headache. To train a high-quality model, the size of the sample set is a critical, decisive factor. In the past, samples had to be labeled manually one by one, and with samples easily numbering in the thousands this consumed considerable manpower.
  • After collecting enough initialization features, we use the frequency modulation matrix to set a very large adjustment parameter for such features (for example, 1000 or more) and then classify the sample set to be classified, taking the result as the classification label. Most samples fall correctly into the corresponding classification; the small portion of samples that contain no initialization feature fall into the default unknown classification and can then be labeled manually.
  • Model classification may go wrong. Under the naive Bayesian classification algorithm based on the word frequency model, a problem arises: we find that a sample has been classified into the wrong category, manually correct the sample and put it into the sample set, retrain the model, and classify the sample again, yet the resulting model still gives the previous misclassification. This is because the word frequency model performs word frequency statistics over all samples under the same classification; adjusting a single sample is only a drop in the bucket and cannot correct the model.
  • FIG. 6 exemplarily shows the structure of a log classification device provided by an embodiment of the present invention, and the device can execute the flow of the log classification method.
  • the device includes:
  • the determining unit 601 is configured to determine the number of times each feature word appears in the log to be classified;
  • The classification unit 602 is configured to determine, according to the numbers of occurrences of the feature words in the log to be classified and the classification model, the log classification to which the log to be classified belongs; the classification model is determined by the training unit 603 according to the conditional probability of each feature word in the sample logs under each log classification.
  • conditional probability of each feature word in each log category is determined by the training unit 603 according to the word frequency model and the frequency modulation model;
  • the word frequency model includes the number of times each feature word appears under each log classification, and
  • the frequency modulation model includes an adjustment parameter for each feature word under each log classification; the adjustment parameter is used by the training unit 603 to adjust the count of the corresponding feature word under the corresponding log classification.
  • the training unit 603 is specifically configured to:
  • perform the following for each feature word under each log classification: determine the sum of the numbers of times the feature words appear under the log classification; and determine the conditional probability of the feature word under the log classification according to the count of the feature word in the word frequency model, the adjustment parameter of the feature word in the frequency modulation model, and that sum.
  • the word frequency model is a word frequency matrix of m rows ⁇ n columns
  • the frequency modulation model is a frequency modulation matrix of m rows ⁇ n columns
  • the log classification corresponding to the i-th row of the word frequency matrix is the same as the log classification corresponding to the i-th row of the frequency modulation matrix,
  • and the feature word corresponding to the j-th column of the word frequency matrix is the same as the feature word corresponding to the j-th column of the frequency modulation matrix; 0 < i ≤ m, 0 < j ≤ n;
  • the training unit 603 is specifically used for:
  • Formula (1) is: P(x_j | T_i) = (A(i,j) × B(i,j) + α) / (count(T_i) + n × α)
  • where x_j is the feature word of the j-th column; T_i is the log classification of the i-th row; P(x_j|T_i) is the conditional probability of x_j under T_i; A(i,j) is the number of times the feature word corresponding to column j appears under the log classification corresponding to row i; B(i,j) is the adjustment parameter of the feature word corresponding to column j under the log classification corresponding to row i; count(T_i) is the sum of the numbers of times the feature words appear under T_i; α is the smoothing coefficient; and n is the number of columns of the word frequency matrix or of the frequency modulation matrix;
  • the training unit 603 is specifically configured to:
  • for each feature word, determine the sum of the conditional probabilities of the feature word under the log classifications; determine the ratio of the conditional probability of the feature word under each log classification to that sum as the feature weight of the feature word under that log classification;
  • form a feature weight matrix from the feature weights of the feature words under the log classifications, and use the feature weight matrix as the classification model.
  • the present application also provides a computing device.
  • The computing device includes at least one processor 720 configured to implement any of the methods in FIG. 2 provided by the embodiments of the present application.
  • the computing device 700 may also include at least one memory 730 for storing program instructions and/or data.
  • the memory 730 and the processor 720 are coupled.
  • the coupling in the embodiments of the present application is an indirect coupling or communication connection between devices, units or modules, and may be in electrical, mechanical or other forms, and is used for information exchange between devices, units or modules.
  • the processor 720 may operate in cooperation with the memory 730.
  • the processor 720 may execute program instructions stored in the memory 730. At least one of the at least one memory may be included in the processor.
  • each step of the above method can be completed by an integrated logic circuit of hardware in the processor or instructions in the form of software.
  • the steps of the method disclosed in the embodiments of the present application may be directly embodied as being executed and completed by a hardware processor, or executed and completed by a combination of hardware and software modules in the processor.
  • the software module can be located in a mature storage medium in the field, such as random access memory, flash memory, read-only memory, programmable read-only memory, or electrically erasable programmable memory, registers.
  • the storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above method in combination with its hardware. To avoid repetition, it will not be described in detail here.
  • the processor in the embodiment of the present application may be an integrated circuit chip with signal processing capability.
  • the steps of the foregoing method embodiments can be completed by hardware integrated logic circuits in the processor or instructions in the form of software.
  • The above processor may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
  • the steps of the method disclosed in the embodiments of the present application can be directly embodied as being executed and completed by a hardware decoding processor, or executed and completed by a combination of hardware and software modules in the decoding processor.
  • the software module can be located in a mature storage medium in the field, such as random access memory, flash memory, read-only memory, programmable read-only memory, or electrically erasable programmable memory, registers.
  • the storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above method in combination with its hardware.
  • the memory in the embodiments of the present application may be a volatile memory or a non-volatile memory, or may include both volatile and non-volatile memory.
  • the non-volatile memory can be read-only memory (ROM), programmable read-only memory (programmable ROM, PROM), erasable programmable read-only memory (erasable PROM, EPROM), and electrically available Erase programmable read-only memory (electrically EPROM, EEPROM) or flash memory.
  • The volatile memory may be random access memory (RAM), which is used as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM).
  • the computing device 700 may further include a communication interface 710 for communicating with other devices through a transmission medium, so that the apparatus used in the computing device 700 can communicate with other devices.
  • the communication interface may be a transceiver, circuit, bus, module, or other type of communication interface.
  • the transceiver when the communication interface is a transceiver, the transceiver may include an independent receiver and an independent transmitter; it may also be a transceiver with integrated transceiver functions, or an interface circuit.
  • the computing device 700 may also include a communication line 740.
  • the communication interface 710, the processor 720, and the memory 730 may be connected to each other through a communication line 740;
  • The communication line 740 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like.
  • the communication line 740 can be divided into an address bus, a data bus, a control bus, and the like. For ease of representation, only one thick line is used in FIG. 7, but it does not mean that there is only one bus or one type of bus.
  • the embodiments of the present invention also provide a non-transitory computer-readable storage medium.
  • The non-transitory computer-readable storage medium stores computer instructions which, when run on a computer, cause the computer to execute the above log classification method.
  • The embodiments of the present application provide a computer program product.
  • The computer program product includes a computer program stored on a non-transitory computer-readable storage medium.
  • The computer program includes program instructions which, when executed by a computer, cause the computer to execute the above log classification method.
  • These computer program instructions can also be stored in a computer-readable memory that can guide a computer or other programmable data processing equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including the instruction device.
  • the device implements the functions specified in one process or multiple processes in the flowchart and/or one block or multiple blocks in the block diagram.
  • These computer program instructions can also be loaded onto a computer or other programmable data processing equipment, so that a series of operational steps are executed on the computer or other programmable equipment to produce computer-implemented processing, such that
  • the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

Abstract

The present invention discloses a log classification method and device, the method including: determining the number of times each feature word appears in a log to be classified, and determining, according to the numbers of occurrences of the feature words in the log to be classified and a classification model, the log classification to which the log to be classified belongs; the classification model is determined according to the conditional probability of each feature word in the sample logs under each log classification; the conditional probability of each feature word under each log classification is determined according to a word frequency model and a frequency modulation model; the word frequency model includes the number of times each feature word appears under each log classification, the frequency modulation model includes an adjustment parameter for each feature word under each log classification, and the adjustment parameter is used to adjust the count of the corresponding feature word under the corresponding log classification. This technical solution combines a machine learning algorithm with knowledge engineering to overcome the problem of imbalanced training data in the sample set, thereby improving the classification accuracy of the model.

Description

A log classification method and device
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims priority to the Chinese patent application with application number 201911060648.7, entitled “一种日志分类方法及装置” (A log classification method and device), filed with the China Patent Office on November 1, 2019, the entire contents of which are incorporated herein by reference.
TECHNICAL FIELD
The embodiments of the present invention relate to the field of financial technology (Fintech), and in particular to a log classification method and device.
BACKGROUND
With the development of computer technology, more and more technologies are applied in the financial field, and the traditional financial industry is gradually shifting toward financial technology. Machine learning technology is no exception, but the security and real-time requirements of the finance and payment industries also place higher demands on machine learning technology.
The currently common approach to log classification is text classification algorithms based on machine learning. Text classification algorithms are grounded in statistical theory and use algorithms to give the machine a human-like ability to learn automatically: statistical analysis is performed on known training data to extract regularities, which are then used to predict and analyze unknown data. Because machine learning performs well in practice in the field of text classification, it has become the mainstream of log analysis and classification.
When training a classification model, the problem of imbalanced training data is frequently encountered. Taking error logs as an example, the higher the level and the more severe an error, the smaller its probability of occurrence in general, and hence the smaller the number of samples of that type. Training a model on an imbalanced sample set often does not give good results, and the classification accuracy of the model is low.
SUMMARY
The embodiments of the present invention provide a log classification method and device that combine a machine learning algorithm with knowledge engineering to overcome the problem of imbalanced training data in the sample set, thereby improving the classification accuracy of the model.
A log classification method provided by an embodiment of the present invention includes:
determining the number of times each feature word appears in a log to be classified;
determining, according to the numbers of occurrences of the feature words in the log to be classified and a classification model, the log classification to which the log to be classified belongs, the classification model being determined according to the conditional probability of each feature word in the sample logs under each log classification;
wherein the conditional probability of each feature word under each log classification is determined according to a word frequency model and a frequency modulation model; the word frequency model includes the number of times each feature word appears under each log classification, the frequency modulation model includes an adjustment parameter for each feature word under each log classification, and the adjustment parameter is used to adjust the count of the corresponding feature word under the corresponding log classification.
Optionally, the conditional probability of each feature word under each log classification being determined according to the word frequency model and the frequency modulation model includes:
performing the following operations for each feature word under each log classification:
determining the sum of the numbers of times the feature words appear under the log classification;
determining the conditional probability of the feature word under the log classification according to the count of the feature word in the word frequency model, the adjustment parameter of the feature word in the frequency modulation model, and the sum of the numbers of times the feature words appear under the log classification.
Optionally, the word frequency model is a word frequency matrix of m rows × n columns, and the frequency modulation model is a frequency modulation matrix of m rows × n columns; the log classification corresponding to the i-th row of the word frequency matrix is the same as the log classification corresponding to the i-th row of the frequency modulation matrix, and the feature word corresponding to the j-th column of the word frequency matrix is the same as the feature word corresponding to the j-th column of the frequency modulation matrix; 0 < i ≤ m, 0 < j ≤ n;
determining the conditional probability of the feature word under the log classification according to the count of the feature word in the word frequency model, the adjustment parameter of the feature word in the frequency modulation model, and the sum of the numbers of times the feature words appear under the log classification includes:
determining the conditional probability of the feature word under the log classification according to formula (1);
formula (1) being:
P(x_j | T_i) = (A(i,j) × B(i,j) + α) / (count(T_i) + n × α)    (1)
where x_j is the feature word of the j-th column; T_i is the log classification of the i-th row; P(x_j|T_i) is the conditional probability of x_j under T_i; A(i,j) is the number of times the feature word corresponding to column j appears under the log classification corresponding to row i; B(i,j) is the adjustment parameter of the feature word corresponding to column j under the log classification corresponding to row i; count(T_i) is the sum of the numbers of times the feature words appear under T_i; α is the smoothing coefficient; and n is the number of columns of the word frequency matrix or of the frequency modulation matrix.
Optionally, the classification model being determined according to the conditional probability of each feature word in the sample logs under each log classification includes:
for each feature word, determining the sum of the conditional probabilities of the feature word under the log classifications, and determining the ratio of the conditional probability of the feature word under each log classification to that sum as the feature weight of the feature word under that log classification;
forming a feature weight matrix from the feature weights of the feature words under the log classifications, and using the feature weight matrix as the classification model.
In the above technical solution, the frequency modulation matrix is used to adjust the word frequencies of feature words in log classifications that have few sample logs, amplifying the word frequency of such a feature word under that log classification and simulating the effect of increasing the number of sample logs in that classification, thereby reducing the problem of inaccurate model training caused by the imbalance of sample logs across log classifications.
Correspondingly, an embodiment of the present invention further provides a log classification device, including:
a determining unit, a classification unit, and a training unit;
the determining unit is configured to determine the number of times each feature word appears in a log to be classified;
the classification unit is configured to determine, according to the numbers of occurrences of the feature words in the log to be classified and a classification model, the log classification to which the log to be classified belongs, the classification model being determined by the training unit according to the conditional probability of each feature word in the sample logs under each log classification;
wherein the conditional probability of each feature word under each log classification is determined by the training unit according to a word frequency model and a frequency modulation model; the word frequency model includes the number of times each feature word appears under each log classification, the frequency modulation model includes an adjustment parameter for each feature word under each log classification, and the adjustment parameter is used by the training unit to adjust the count of the corresponding feature word under the corresponding log classification.
Optionally, the training unit is specifically configured to:
perform the following operations for each feature word under each log classification:
determine the sum of the numbers of times the feature words appear under the log classification;
determine the conditional probability of the feature word under the log classification according to the count of the feature word in the word frequency model, the adjustment parameter of the feature word in the frequency modulation model, and the sum of the numbers of times the feature words appear under the log classification.
Optionally, the word frequency model is a word frequency matrix of m rows × n columns, and the frequency modulation model is a frequency modulation matrix of m rows × n columns; the log classification corresponding to the i-th row of the word frequency matrix is the same as the log classification corresponding to the i-th row of the frequency modulation matrix, and the feature word corresponding to the j-th column of the word frequency matrix is the same as the feature word corresponding to the j-th column of the frequency modulation matrix; 0 < i ≤ m, 0 < j ≤ n;
the training unit is specifically configured to:
determine the conditional probability of the feature word under the log classification according to formula (1);
formula (1) being:
P(x_j | T_i) = (A(i,j) × B(i,j) + α) / (count(T_i) + n × α)    (1)
where x_j is the feature word of the j-th column; T_i is the log classification of the i-th row; P(x_j|T_i) is the conditional probability of x_j under T_i; A(i,j) is the number of times the feature word corresponding to column j appears under the log classification corresponding to row i; B(i,j) is the adjustment parameter of the feature word corresponding to column j under the log classification corresponding to row i; count(T_i) is the sum of the numbers of times the feature words appear under T_i; α is the smoothing coefficient; and n is the number of columns of the word frequency matrix or of the frequency modulation matrix.
Optionally, the training unit is specifically configured to:
for each feature word, determine the sum of the conditional probabilities of the feature word under the log classifications, and determine the ratio of the conditional probability of the feature word under each log classification to that sum as the feature weight of the feature word under that log classification;
form a feature weight matrix from the feature weights of the feature words under the log classifications, and use the feature weight matrix as the classification model.
Correspondingly, an embodiment of the present invention further provides a computing device, including:
a processor, a memory, and a communication interface, wherein the processor, the memory, and the communication interface are connected by a bus;
the processor is configured to read the program in the memory and execute the above log classification method;
the memory is configured to store one or more executable programs and may store data used by the processor when performing operations.
Correspondingly, an embodiment of the present invention further provides a non-transitory computer-readable storage medium storing computer instructions which, when run on a computer, cause the computer to execute the above log classification method.
Correspondingly, an embodiment of the present invention further provides a computer program product containing instructions, the computer program product including a computer program stored on a non-transitory computer-readable storage medium, the computer program including program instructions which, when executed by a computer, cause the computer to execute the above log classification method.
BRIEF DESCRIPTION OF THE DRAWINGS
To explain the technical solutions in the embodiments of the present invention more clearly, the drawings required for describing the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of a system architecture provided by an embodiment of the present invention;
FIG. 2 is a schematic flowchart of a log classification method provided by an embodiment of the present invention;
FIG. 3 is a schematic flowchart of determining a conditional probability provided by an embodiment of the present invention;
FIG. 4 is a schematic flowchart of determining feature weights provided by an embodiment of the present invention;
FIG. 5 is a schematic flowchart of another log classification method provided by an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a log classification device provided by an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a computing device provided by this application.
DETAILED DESCRIPTION
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described below in further detail with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
To better explain the embodiments of the present invention, the naive Bayesian classification algorithm involved in the embodiments of the present invention is first explained as follows:
There are many common classification algorithms at present, such as Bayes, neural networks, decision trees, KNN (K-Nearest Neighbor), and SVM (Support Vector Machine). Among them, Bayesian classification is the general term for a class of classification algorithms that are all based on Bayes' theorem, hence the collective name. Naive Bayesian classification is the simplest and most common classification method within Bayesian classification. Bayes' theorem, named after the British mathematician Bayes, addresses the relationship between two conditional probabilities: simply put, how to obtain the probability P(B|A) when P(A|B) is known. Naive Bayes assumes that the features P(A) are independent given a particular outcome P(B). The Bayesian algorithm calculates the probability of P(B|A) from the three known probabilities P(A|B), P(A), and P(B); the calculation can be reduced to the Bayes formula, which can be written as formula (2).
P(B|A) = P(A|B) × P(B) / P(A)    (2)
In the above Bayes formula, each probability has a specific name:
P(B) is the probability of event B occurring in the sample space, also called the prior probability of event B.
P(A) is the probability of event A occurring in the sample space, also called the prior probability of event A.
P(A|B) is the conditional probability of A given that B has occurred, called the likelihood function.
P(B|A) is the conditional probability of B given that A has occurred, called the posterior probability.
P(A|B)/P(A) is the adjustment factor, also known as the standardized likelihood.
The basic method of naive Bayes: on the basis of the statistical data, use the conditional probability formula to calculate the probability that a sample with the current features belongs to each category, and select the category with the largest probability. For a given item to be classified, find the probability of each category appearing given that this item appears; the item to be classified is considered to belong to whichever category has the largest probability.
The calculation flow is as follows:
(1) x = {a1, a2, ..., am} is the item to be classified, and each a is a feature attribute of x;
(2) there is a category set C = {y1, y2, ..., yn};
(3) compute P(y1|x), P(y2|x), ..., P(yn|x);
(4) P(yk|x) = max{P(y1|x), P(y2|x), ..., P(yn|x)}, and x is assigned to category yk.
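To make the four-step flow concrete, here is a minimal Python sketch of the naive Bayes decision rule; the priors and likelihoods below are hypothetical placeholders for illustration, not values from any real sample set.

```python
# Minimal sketch of the naive Bayes decision rule (steps (1)-(4) above).
# prior[y] approximates P(y); likelihood[y][a] approximates P(a|y).
def naive_bayes_classify(x, classes, prior, likelihood):
    best_class, best_score = None, float("-inf")
    for y in classes:
        score = prior[y]
        for a in x:
            # unseen feature attributes get a tiny floor probability
            score *= likelihood[y].get(a, 1e-9)
        if score > best_score:
            best_class, best_score = y, score
    return best_class

# Hypothetical usage:
classes = ["db error", "http error"]
prior = {"db error": 0.4, "http error": 0.6}
likelihood = {"db error": {"jdbc": 0.8, "timeout": 0.3},
              "http error": {"jdbc": 0.01, "timeout": 0.6}}
print(naive_bayes_classify(["jdbc", "timeout"], classes, prior, likelihood))
# -> "db error"
```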
FIG. 1 exemplarily shows the system architecture to which the log classification method provided by the embodiment of the present invention applies. The system architecture may include a data source module, a front-end module, a back-end module, a classification algorithm module, and a database; the functions of the modules are specifically as follows:
Data source module: provides the error log text used for model training in the embodiment of the present invention, which may also be referred to as the source error logs.
Front-end module: responsible for providing the web interface, mainly used to display log classification information and to provide users with entry points for operations such as data management.
Back-end module: mainly used for log processing; responsible for pulling the original log text from the data source, cleaning it (filtering out valueless text content by regular-expression matching and the like), de-duplicating it (merging samples with excessively high similarity), and finally storing the generated sample set (training set) in the database. In addition, the back-end module is responsible for providing data operation interfaces, automatically invoking the classification algorithm module for model training, and storing the model parameters in the database.
Classification algorithm module: responsible for training the classifier model and classifying sample logs.
Database: used to store the processed standardized sample logs (the error sample log set), frequency modulation matrix information, configuration data, classification information, and other types of data.
Based on the above description, FIG. 2 exemplarily shows the flow of a log classification method provided by an embodiment of the present invention. The flow may be executed by a log classification device, which may be located in the classification algorithm module or may be the classification algorithm module itself.
As shown in FIG. 2, the flow specifically includes:
Step 201: Determine the number of times each feature word appears in the log to be classified.
Step 202: Determine, according to the numbers of occurrences of the feature words in the log to be classified and the classification model, the log classification to which the log to be classified belongs.
In the embodiment of the present invention, a feature word is a word or phrase determined from the multiple sample logs in the sample set. Since a sample log is essentially text and cannot directly participate in computation, the sample log first needs to be vectorized. In one implementation, a word set model can be used to vectorize the sample logs: with words as the basic processing unit, all the words in the sample set are first collected to obtain a vocabulary of size N, and each sample log in the sample set is mapped to an N-dimensional vector, where the value in each dimension represents the count of the corresponding feature word in that sample log (that is, the word frequency of the feature word in the sample log); the N-dimensional vector thus reflects the word frequency information of the sample log.
For example, suppose a sample set yields a vocabulary of size 10: (“async” “at” “connection” “db” “error” “jdbc” “mysql” “redis” “timeout” “user”);
vectorizing the sample log “mysql jdbc connection timeout error” according to the above rule generates a vector of length 10: (0 0 1 0 1 1 1 0 1 0).
In the above example, when the word set model is used for vectorization, word frequencies are counted for each individual word, so word-order information is lost: for example, the phrase dead lock would be split into the two independent features dead and lock for counting, and the semantics of the phrase itself would be lost. To solve this problem, the embodiment of the present invention can split the text using combinations of n words during text vectorization, combining adjacent words of length n into new features and adding them to the vocabulary, where n can be set empirically; for example, when n is set to 2, two consecutive words in a sample log can be combined as one term to obtain a new feature word.
Taking the sample log above as an example again, with n = 2 the following feature words are generated:
(“mysql” “jdbc” “connection” “timeout” “error” “mysql jdbc” “jdbc connection” “connection timeout” “timeout error”);
splitting the text with combinations of n words effectively preserves feature words that carry semantics.
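The vocabulary construction and n-word combination just described can be sketched in a few lines of Python; this is a minimal illustration assuming whitespace-tokenized log text, not the actual implementation of the embodiment.

```python
from collections import Counter

def build_vocabulary(sample_logs, n=2):
    """Collect single words plus adjacent n-word combinations (n = 2 here)."""
    vocab = set()
    for log in sample_logs:
        words = log.split()
        vocab.update(words)
        vocab.update(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return sorted(vocab)

def vectorize(log, vocab, n=2):
    """Map a log line to a term-frequency vector over the given vocabulary."""
    words = log.split()
    counts = Counter(words)
    counts.update(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return [counts[w] for w in vocab]

vocab = build_vocabulary(["mysql jdbc connection timeout error"])
# 5 single words + 4 two-word combinations, matching the 9 features above
print(len(vocab))                                               # -> 9
print(vectorize("mysql jdbc connection timeout error", vocab))  # all ones
```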
After the number of occurrences of each feature word in the log to be classified is determined, the log to be classified can likewise be vectorized, for example generating a vector of length 10: (0 1 1 0 1 1 1 1 1 0); the log classification to which the log to be classified belongs is then determined from the vector generated for the log to be classified and the classification model, combined with the Bayesian classification algorithm.
In the embodiment of the present invention, the classification model is determined according to the conditional probability of each feature word in the sample logs under each log classification, where the conditional probability of each feature word under each log classification is determined according to the word frequency model and the frequency modulation model.
Specifically, the word frequency model includes the number of times each feature word appears under each log classification. The word frequency model may be expressed in the form of a word frequency matrix, or in the form of a word frequency array or other forms. The word frequency model may be determined according to the feature words in the sample logs in the sample set.
Taking the determination of the word frequency matrix from the feature words in the sample logs in the sample set as an example, the explanation is as follows:
Suppose the sample logs in the sample set are as shown in Table 1; that is, there are three log classifications in the sample set, namely http error, db error, and redis error. http error includes sample log 1, sample log 2, and sample log 3; db error includes sample log 4, sample log 5, sample log 6, and sample log 7; redis error includes sample log 8 and sample log 9. Each sample log corresponds to its own vector; for example, the vector corresponding to sample log 1 is (2 0 3 0 4 0 0 0 3).
Table 1: Sample logs in the sample set
[Table 1 is rendered as an image in the original publication.]
The sample logs in Table 1 are counted to determine the total number of occurrences of each feature word under each log classification; the word frequency matrix generated after counting can be as shown in Table 2. For example, async appears 5 times under http error, 0 times under db error, and 1 time under redis error. It can be observed that if a feature word appears very frequently under a log classification, its correlation with that classification is generally also very high.
Table 2: Word frequency matrix
  async at connection db error jdbc mysql redis timeout
http error 5 2 10 0 12 0 0 0 8
db error 0 5 8 10 15 22 22 0 12
redis error 1 12 4 0 8 0 0 20 5
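As a sketch, the word frequency matrix of Table 2 can be accumulated by summing the per-log term-frequency vectors within each classification. The data layout below (a dict from classification to log vectors) is an assumption for illustration; only the first http error vector is taken from the text (sample log 1), and the remaining vectors are hypothetical, chosen so the row sums match Table 2.

```python
import numpy as np

def word_frequency_matrix(sample_vectors, classes, vocab_size):
    """A[i, j] = total occurrences of feature word j over all sample logs
    labelled with log classification i (the matrix of Table 2)."""
    A = np.zeros((len(classes), vocab_size))
    for i, cls in enumerate(classes):
        for vec in sample_vectors[cls]:   # each vec is an N-dim log vector
            A[i] += vec
    return A

classes = ["http error", "db error", "redis error"]
sample_vectors = {
    "http error":  [[2, 0, 3, 0, 4, 0, 0, 0, 3],    # sample log 1
                    [3, 2, 7, 0, 8, 0, 0, 0, 5]],   # hypothetical
    "db error":    [[0, 5, 8, 10, 15, 22, 22, 0, 12]],
    "redis error": [[1, 12, 4, 0, 8, 0, 0, 20, 5]],
}
A = word_frequency_matrix(sample_vectors, classes, vocab_size=9)
print(A[0])  # http error row: [5, 2, 10, 0, 12, 0, 0, 0, 8], as in Table 2
```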
After the word frequency model is determined, the frequency modulation model can be determined from it. The frequency modulation model includes an adjustment parameter for each feature word under each log classification, the adjustment parameter being used to adjust the count of the corresponding feature word under the corresponding log classification. The frequency modulation model may be expressed in the form of a frequency modulation matrix, or in the form of a frequency modulation array or other forms.
The frequency modulation matrix is an adjustment to the word frequency matrix; its numbers of rows and columns are consistent with those of the word frequency matrix, and it is used to improve the naive Bayesian classification algorithm. As shown in Table 3, the frequency modulation matrix includes the adjustment parameter of each feature word under each log classification, and the adjustment parameter is used to adjust the count (word frequency) of a feature word under the corresponding log classification according to manual rules. For example, features such as jdbc and mysql appear in db error log messages in the great majority of cases; generally, if this type of feature word appears, it can be concluded that the log message belongs to the db error classification. We can therefore increase a feature's word frequency by configuring a preset value, increasing the feature's weight under the db error classification so that log messages containing these features have a higher probability of being classified under db error. Conversely, we can also decrease a feature's word frequency by configuring a preset value; for example, configuring an adjustment parameter smaller than 1 reduces the count of the feature word under a classification and thus lowers its weight under that classification.
The frequency modulation matrix is the matrix embodiment of manual rules; the initial parameter of every entry is 1, that is, no adjustment is made by default. By tuning the adjustment parameter of each entry in the frequency modulation matrix, we can precisely control the weight of every feature word under a particular classification, combining existing knowledge rules with the naive Bayesian classification algorithm and thereby improving the classification accuracy of the model.
Table 3: Frequency modulation matrix
  async at connection db error jdbc mysql redis timeout
http error 1 1 1 1 1 1 1 1 1
db error 0.2 1 1 20 1 20 20 1 1
redis error 1 1 1 1 1 1 1 1 1
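As a sketch, the frequency modulation matrix of Table 3 can be represented as an all-ones matrix with per-(classification, feature word) overrides encoding the manual rules; the helper function below is illustrative, not part of the original embodiment.

```python
import numpy as np

vocab = ["async", "at", "connection", "db", "error",
         "jdbc", "mysql", "redis", "timeout"]
classes = ["http error", "db error", "redis error"]

# Default of 1 everywhere: word frequencies are left unchanged.
B = np.ones((len(classes), len(vocab)))

def set_rule(B, cls, word, factor):
    """Encode one manual rule: scale `word`'s frequency under `cls` by `factor`."""
    B[classes.index(cls), vocab.index(word)] = factor

# The manual rules of Table 3:
for w in ("db", "jdbc", "mysql"):
    set_rule(B, "db error", w, 20)     # boost db-specific feature words
set_rule(B, "db error", "async", 0.2)  # a factor below 1 suppresses a feature
```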
After the word frequency matrix and the frequency modulation matrix are determined, the conditional probability of each feature word under each log classification can be determined from them. For convenience of description, any one feature word under any one log classification is taken as an example, as in the flowchart shown in FIG. 3:
Step 301: Determine the sum of the numbers of times the feature words appear under the log classification.
As shown in formula (3):
count(T_i) = Σ_{j=1}^{n} A(i,j)    (3)
where T_i is the log classification; count(T_i) is the sum of the numbers of times the feature words appear under T_i; and A(i,j) is the number of times the keyword x_j appears under T_i, i.e. its word frequency.
Taking Table 2 as an example, when T_i is http error, the sum of the occurrence counts of the feature words under T_i is count(http) = 5+2+10+0+12+0+0+0+8 = 37; similarly, when T_i is db error, the sum count(db) of the occurrence counts of the feature words is 94, and when T_i is redis error, the sum count(redis) is 50.
Step 302: Determine the conditional probability of the feature word under the log classification according to the count of the feature word in the word frequency model, the adjustment parameter of the feature word in the frequency modulation model, and the sum of the numbers of times the feature words appear under the log classification.
In one implementation, the word frequency model is a word frequency matrix of m rows × n columns, and the frequency modulation model is a frequency modulation matrix of m rows × n columns; the log classification corresponding to the i-th row of the word frequency matrix is the same as the log classification corresponding to the i-th row of the frequency modulation matrix, and the feature word corresponding to the j-th column of the word frequency matrix is the same as the feature word corresponding to the j-th column of the frequency modulation matrix; 0 < i ≤ m, 0 < j ≤ n. The conditional probability of the feature word under the log classification can then be determined according to formula (1).
Formula (1) is:
P(x_j | T_i) = (A(i,j) × B(i,j) + α) / (count(T_i) + n × α)    (1)
where x_j is the feature word of the j-th column;
T_i is the log classification of the i-th row;
P(x_j|T_i) is the conditional probability of x_j under T_i;
A(i,j) is the number of times the feature word corresponding to column j appears under the log classification corresponding to row i;
B(i,j) is the adjustment parameter of the feature word corresponding to column j under the log classification corresponding to row i;
count(T_i) is the sum of the numbers of times the feature words appear under T_i;
α is the smoothing coefficient, which adds a small extra word frequency value to all feature words and is used to reduce the negative impact on the classification calculation of the conditional probability being 0 when a word frequency is 0;
n is the number of columns of the word frequency matrix or of the frequency modulation matrix.
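Formula (1) can be sketched directly in Python; with the word frequency matrix of Table 2 and the frequency modulation matrix of Table 3, α = 1 reproduces the values of Table 4 up to rounding. The array layout below is an assumption for illustration.

```python
import numpy as np

def conditional_probability(A, B, alpha=1.0):
    """Formula (1): P(x_j|T_i) = (A[i,j]*B[i,j] + alpha) / (count(T_i) + n*alpha),
    with count(T_i) = sum_j A[i,j] as in formula (3)."""
    _, n = A.shape
    count = A.sum(axis=1, keepdims=True)   # count(T_i), one value per row
    return (A * B + alpha) / (count + n * alpha)

# Table 2 (rows: http error, db error, redis error):
A = np.array([[5, 2, 10, 0, 12, 0, 0, 0, 8],
              [0, 5, 8, 10, 15, 22, 22, 0, 12],
              [1, 12, 4, 0, 8, 0, 0, 20, 5]], dtype=float)
# Table 3: all ones except the db error rules:
B = np.ones_like(A)
B[1, [3, 5, 6]] = 20     # db, jdbc, mysql boosted
B[1, 0] = 0.2            # async suppressed
P = conditional_probability(A, B, alpha=1.0)
print(round(P[1, 5], 2))  # P(jdbc | db error) -> 4.28, as in Table 4
```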
Illustrating with the word frequency matrix of Table 2 and the frequency modulation matrix of Table 3, the conditional probability of each feature word under each log classification determined according to formula (1) can be as shown in Table 4, where it is assumed that α = 1.
Table 4: Conditional probability matrix
  async at connection db error jdbc mysql redis timeout
http error 0.13 0.07 0.24 0.02 0.28 0.02 0.02 0.02 0.20
db error 0.01 0.06 0.09 1.95 0.16 4.28 4.28 0.01 0.13
redis error 0.03 0.22 0.08 0.02 0.15 0.02 0.02 0.36 0.10
In one implementation, the conditional probability matrix composed of the conditional probabilities of the feature words under the log classifications can be used as the classification model; in that case, the classification model can be as shown in Table 4. In another implementation, considering that conditional probability values are generally very small, often on the order of 10e-3, the conditional probabilities are normalized to obtain a new matrix that better reflects the degree of influence of each feature word under the different classifications; we call these values the weights of the feature words. The higher the weight of a feature word under a certain classification, the higher the probability that a sample log carrying this feature word is assigned to that classification. After the conditional probability matrix is determined, the feature weight matrix can be extracted, specifically as in the flowchart shown in FIG. 4.
Step 401: For each feature word, determine the sum of the conditional probabilities of the feature word under the log classifications; determine the ratio of the conditional probability of the feature word under each log classification to that sum as the feature weight of the feature word under that log classification.
The feature weight of each feature word under each log classification can be determined according to formula (4), where formula (4) can be:
W(i,j) = P(x_j | T_i) / Σ_{k=1}^{m} P(x_j | T_k)    (4)
where W(i,j) is the feature weight of x_j under T_i, and m is the number of rows of the word frequency matrix or of the frequency modulation matrix.
Step 402: Form a feature weight matrix from the feature weights of the feature words under the log classifications, and use the feature weight matrix as the classification model.
Combining the conditional probability matrix of Table 4, the feature weight matrix is determined as shown in Table 5.
Table 5: Feature weight matrix
  async at connection db error jdbc mysql redis timeout
http error 0.75 0.19 0.58 0.01 0.48 0.01 0.01 0.06 0.46
db error 0.06 0.17 0.21 0.98 0.26 0.99 0.99 0.03 0.30
redis error 0.19 0.64 0.21 0.01 0.26 0.00 0.00 0.92 0.24
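Formula (4) is a per-column normalization of the conditional probability matrix. A minimal sketch, applied to the matrix P computed in the formula (1) sketch above, reproduces Table 5 up to rounding:

```python
def feature_weights(P):
    """Formula (4): W[i,j] = P(x_j|T_i) / sum_k P(x_j|T_k)."""
    return P / P.sum(axis=0, keepdims=True)

W = feature_weights(P)     # P from the formula (1) sketch above
print(round(W[1, 5], 2))   # weight of "jdbc" under db error -> 0.99
```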
After model training is complete, classification prediction can begin on logs to be classified. As in the model training process, the log to be classified also needs to be vectorized before classification, except that the vocabulary generated when the sample set was vectorized must be used when vectorizing the log to be classified. After vectorization, the computation of the classification probabilities is no different from the naive Bayesian classification process: the Bayes formula is used directly to compute the probability of the log to be classified under each log classification, and the log classification with the largest probability is taken as the final classification result, which is not elaborated further here.
To better explain the embodiment of the present invention, another log classification flow is provided below, as shown in FIG. 5, specifically as follows:
The left half of the flow is the model training process: obtain the training set, which includes the sample logs; vectorize the text of each sample log; determine the word frequency of each feature word under each log classification; compute the conditional probability of each feature word under each log classification; and then generate the classification model.
The right half of the flow is the model use process: obtain the log to be classified, vectorize it, use the Bayes formula together with the classification model to compute the probability of the log to be classified under each log classification, and then determine the log classification corresponding to the largest probability as the log classification to which the log to be classified belongs.
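A sketch of this model-use path: score the vectorized log under each classification with the product of the conditional probabilities of its features (computed in log space for numerical stability) and take the argmax. A uniform prior over classifications is assumed here, since the embodiment does not detail the prior; the names reuse the earlier sketches (vectorize, vocab, P, classes).

```python
import numpy as np

def classify(log_vector, P, classes):
    """Return the log classification maximizing sum_j count_j * log P(x_j|T_i),
    i.e. the Bayes-style score with a uniform prior assumed."""
    log_scores = (np.log(P) * np.asarray(log_vector, dtype=float)).sum(axis=1)
    return classes[int(np.argmax(log_scores))]

# Hypothetical usage with the matrices built in the sketches above:
x = vectorize("mysql jdbc connection timeout error", vocab)
print(classify(x, P, classes))  # -> "db error"
```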
Using the frequency modulation matrix in the embodiment of the present invention has the following beneficial effects:
(1) The frequency modulation matrix reduces the impact of sample imbalance.
Sample imbalance is a common problem in the field of machine learning. Taking classification as an example, ideally the numbers of samples of the different categories in the sample set should be evenly distributed, that is, every category should have enough samples for model training. Under realistic conditions, however, imbalanced sample distributions are widespread. In the field of log classification, logs of different levels and types often appear at different frequencies. For example, http connect time out is a common network request exception with a high probability of occurrence that may happen every day, whereas an OOM (out of memory) error of the JVM (Java Virtual Machine) rarely appears but is a very serious error. In the sample set, there are obviously many more http exception samples than JVM exception samples, which causes the sample imbalance problem and in turn affects the classification accuracy for JVM exception samples.
In this method, we can use the frequency modulation matrix to set a very high adjustment parameter for the categories with too few samples, thereby amplifying the word frequencies of their feature words under those log classifications and simulating the effect of adding samples of those categories to the sample set, which in turn reduces the impact of sample imbalance. Taking the JVM exception as an example, the adjustment parameters corresponding to the most salient feature words of JVM exception samples, such as “out of memory”, can be adjusted.
(2) The frequency modulation matrix enables fast sample labeling, saving labor costs.
Sample labeling is a major headache. To train a high-quality model, the size of the sample set is a critical, decisive factor. In the past, samples had to be labeled manually one by one, and with samples easily numbering in the thousands this consumed considerable manpower.
In the embodiment of the present invention, we can use the determined frequency modulation matrix to perform initial labeling of the sample sets that subsequently need to be classified, effectively reducing the manual labeling workload. In the error sample logs, most samples carry features that can significantly distinguish the categories, such as “mysql” “redis” “gns” “http” “timeout” “out of memory”; basically, whenever such a keyword appears, it can be concluded that the sample log belongs to a certain category. We call this type of feature an initialization feature. After enough initialization features have been collected, we use the frequency modulation matrix to set a very large adjustment parameter for such features (for example, 1000 or more) and then classify the sample set to be classified, taking the result as the classification label. Most samples fall correctly into the corresponding classification; the small portion of samples containing no initialization feature fall into the default unknown classification and can then be labeled manually.
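This bootstrap-labeling idea can be sketched by reusing the helpers from the earlier sketches (set_rule, conditional_probability, vectorize, classify): give the initialization features a very large adjustment parameter, classify, and route samples without any initialization feature to unknown for manual review. The seed lists below are illustrative.

```python
# Illustrative seed ("initialization") features per classification:
SEED = {"db error": ["jdbc", "mysql"], "redis error": ["redis"]}
for cls, words in SEED.items():
    for w in words:
        set_rule(B, cls, w, 1000)     # let the seed feature dominate

P = conditional_probability(A, B, alpha=1.0)
labels = []
for log in ["mysql jdbc connection refused", "entirely novel failure text"]:
    x = vectorize(log, vocab)
    has_seed = any(x[vocab.index(w)] > 0 for ws in SEED.values() for w in ws)
    labels.append(classify(x, P, classes) if has_seed else "unknown")
print(labels)  # -> ['db error', 'unknown']
```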
(3) Regression analysis is performed on misclassified samples, and features are adjusted in combination with the frequency modulation matrix.
Model classification may go wrong. Under the naive Bayesian classification algorithm based on the word frequency model, a problem arises: we find that a sample has been classified into the wrong category, manually correct the sample and put it into the sample set, retrain the model, and classify the sample again, yet the resulting model still gives the previous misclassification. This is because the word frequency model performs word frequency statistics over all samples under the same classification; adjusting a single sample is only a drop in the bucket and cannot correct the model.
In the embodiment of the present invention, we can use the feature weight matrix obtained from model training to perform regression analysis on samples. For example, a processed sample log “银行报告 sys TransDAO certNo 查询超时 costTime” (“bank report sys TransDAO certNo query timeout costTime”) should belong to the category “外部合作伙伴业务异常” (external partner business exception), but the model classified it into the default “unknown” category. Querying the feature weight matrix, we find that the five features of this sample with the highest weights under the “unknown” classification are as follows:
(“银行报告” (bank report), 0.8652419428703651)
(“查询超时” (query timeout), 0.5142907974010534)
(“sys”, 0.5142907974010534)
(“超时” (timeout), 0.15651730037704084)
(“costtime”, 0.1881949392920741)
From this we find that the feature “银行报告” (bank report) has the largest weight under the “unknown” classification, while “银行” (bank) should obviously belong to the feature scope of external partners, so we need to adjust the weight of this feature under these two categories. Using the frequency modulation matrix, we can lower the adjustment parameter of the feature word “银行报告” under the “unknown” classification and raise its adjustment parameter under the “外部合作伙伴业务异常” classification. After the adjustment is completed, the model is retrained and the classification test is rerun; as a result, the sample is successfully assigned to the “外部合作伙伴业务异常” classification.
Based on the same inventive concept, FIG. 6 exemplarily shows the structure of a log classification device provided by an embodiment of the present invention; the device can execute the flow of the log classification method.
The device includes:
a determining unit 601, a classification unit 602, and a training unit 603;
the determining unit 601 is configured to determine the number of times each feature word appears in the log to be classified;
the classification unit 602 is configured to determine, according to the numbers of occurrences of the feature words in the log to be classified and the classification model, the log classification to which the log to be classified belongs, the classification model being determined by the training unit 603 according to the conditional probability of each feature word in the sample logs under each log classification;
wherein the conditional probability of each feature word under each log classification is determined by the training unit 603 according to the word frequency model and the frequency modulation model; the word frequency model includes the number of times each feature word appears under each log classification, the frequency modulation model includes an adjustment parameter for each feature word under each log classification, and the adjustment parameter is used by the training unit 603 to adjust the count of the corresponding feature word under the corresponding log classification.
Optionally, the training unit 603 is specifically configured to:
perform the following operations for each feature word under each log classification:
determine the sum of the numbers of times the feature words appear under the log classification;
determine the conditional probability of the feature word under the log classification according to the count of the feature word in the word frequency model, the adjustment parameter of the feature word in the frequency modulation model, and the sum of the numbers of times the feature words appear under the log classification.
Optionally, the word frequency model is a word frequency matrix of m rows × n columns, and the frequency modulation model is a frequency modulation matrix of m rows × n columns; the log classification corresponding to the i-th row of the word frequency matrix is the same as the log classification corresponding to the i-th row of the frequency modulation matrix, and the feature word corresponding to the j-th column of the word frequency matrix is the same as the feature word corresponding to the j-th column of the frequency modulation matrix; 0 < i ≤ m, 0 < j ≤ n;
the training unit 603 is specifically configured to:
determine the conditional probability of the feature word under the log classification according to formula (1);
formula (1) being:
P(x_j | T_i) = (A(i,j) × B(i,j) + α) / (count(T_i) + n × α)    (1)
where x_j is the feature word of the j-th column; T_i is the log classification of the i-th row; P(x_j|T_i) is the conditional probability of x_j under T_i; A(i,j) is the number of times the feature word corresponding to column j appears under the log classification corresponding to row i; B(i,j) is the adjustment parameter of the feature word corresponding to column j under the log classification corresponding to row i; count(T_i) is the sum of the numbers of times the feature words appear under T_i; α is the smoothing coefficient; and n is the number of columns of the word frequency matrix or of the frequency modulation matrix.
Optionally, the training unit 603 is specifically configured to:
for each feature word, determine the sum of the conditional probabilities of the feature word under the log classifications, and determine the ratio of the conditional probability of the feature word under each log classification to that sum as the feature weight of the feature word under that log classification;
form a feature weight matrix from the feature weights of the feature words under the log classifications, and use the feature weight matrix as the classification model.
Based on the same concept as the method shown in FIG. 2 above, the present application further provides a computing device. As shown in FIG. 7, the computing device includes at least one processor 720 configured to implement any of the methods in FIG. 2 provided by the embodiments of the present application.
The computing device 700 may further include at least one memory 730 for storing program instructions and/or data. The memory 730 is coupled to the processor 720. The coupling in the embodiments of the present application is an indirect coupling or communication connection between devices, units, or modules, which may be electrical, mechanical, or in other forms, and is used for information exchange between the devices, units, or modules. The processor 720 may cooperate with the memory 730, and the processor 720 may execute the program instructions stored in the memory 730. At least one of the at least one memory may be included in the processor.
In the implementation process, each step of the above method can be completed by an integrated logic circuit of hardware in the processor or by instructions in the form of software. The steps of the methods disclosed in the embodiments of the present application can be directly embodied as being executed and completed by a hardware processor, or by a combination of hardware and software modules in the processor. The software module may be located in a storage medium mature in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or a register. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above method in combination with its hardware. To avoid repetition, this is not described in detail here.
It should be noted that the processor in the embodiments of the present application may be an integrated circuit chip with signal processing capability. In the implementation process, the steps of the above method embodiments can be completed by an integrated logic circuit of hardware in the processor or by instructions in the form of software. The above processor may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The methods, steps, and logic block diagrams disclosed in the embodiments of the present application can be implemented or executed accordingly. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the methods disclosed in the embodiments of the present application can be directly embodied as being executed and completed by a hardware decoding processor, or by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium mature in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or a register. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above method in combination with its hardware.
It can be understood that the memory in the embodiments of the present application may be a volatile memory or a non-volatile memory, or may include both. The non-volatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which is used as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM). It should be noted that the memories of the systems and methods described herein are intended to include, but are not limited to, these and any other suitable types of memory.
The computing device 700 may further include a communication interface 710 for communicating with other devices through a transmission medium, so that the apparatus in the computing device 700 can communicate with other devices. In the embodiments of the present application, the communication interface may be a transceiver, a circuit, a bus, a module, or another type of communication interface. When the communication interface is a transceiver, the transceiver may include an independent receiver and an independent transmitter; it may also be a transceiver with integrated transceiving functions, or an interface circuit.
The computing device 700 may further include a communication line 740, through which the communication interface 710, the processor 720, and the memory 730 may be connected to each other. The communication line 740 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. The communication line 740 can be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is used in FIG. 7, but this does not mean that there is only one bus or one type of bus.
Based on the same inventive concept, an embodiment of the present invention further provides a non-transitory computer-readable storage medium storing computer instructions which, when run on a computer, cause the computer to execute the above log classification method.
Based on the same inventive concept, an embodiment of the present application provides a computer program product, the computer program product including a computer program stored on a non-transitory computer-readable storage medium, the computer program including program instructions which, when executed by a computer, cause the computer to execute the above log classification method.
The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to the embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing equipment to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing equipment produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions can also be loaded onto a computer or other programmable data processing equipment, so that a series of operational steps are executed on the computer or other programmable equipment to produce computer-implemented processing, such that the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present invention have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn the basic inventive concept. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments and all changes and modifications falling within the scope of the present invention.
Obviously, those skilled in the art can make various changes and variations to the present invention without departing from the spirit and scope of the present invention. Thus, if these modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalent technologies, the present invention is also intended to include these changes and variations.

Claims (11)

  1. A log classification method, comprising:
    determining the number of times each feature word appears in a log to be classified;
    determining, according to the numbers of occurrences of the feature words in the log to be classified and a classification model, the log classification to which the log to be classified belongs, the classification model being determined according to the conditional probability of each feature word in the sample logs under each log classification;
    wherein the conditional probability of each feature word under each log classification is determined according to a word frequency model and a frequency modulation model; the word frequency model includes the number of times each feature word appears under each log classification, the frequency modulation model includes an adjustment parameter for each feature word under each log classification, and the adjustment parameter is used to adjust the count of the corresponding feature word under the corresponding log classification.
  2. The method according to claim 1, wherein the conditional probability of each feature word under each log classification being determined according to the word frequency model and the frequency modulation model comprises:
    performing the following operations for each feature word under each log classification:
    determining the sum of the numbers of times the feature words appear under the log classification;
    determining the conditional probability of the feature word under the log classification according to the count of the feature word in the word frequency model, the adjustment parameter of the feature word in the frequency modulation model, and the sum of the numbers of times the feature words appear under the log classification.
  3. The method according to claim 2, wherein the word frequency model is a word frequency matrix of m rows × n columns, and the frequency modulation model is a frequency modulation matrix of m rows × n columns; the log classification corresponding to the i-th row of the word frequency matrix is the same as the log classification corresponding to the i-th row of the frequency modulation matrix, and the feature word corresponding to the j-th column of the word frequency matrix is the same as the feature word corresponding to the j-th column of the frequency modulation matrix; 0 < i ≤ m, 0 < j ≤ n;
    determining the conditional probability of the feature word under the log classification according to the count of the feature word in the word frequency model, the adjustment parameter of the feature word in the frequency modulation model, and the sum of the numbers of times the feature words appear under the log classification comprises:
    determining the conditional probability of the feature word under the log classification according to formula (1);
    formula (1) being:
    P(x_j | T_i) = (A(i,j) × B(i,j) + α) / (count(T_i) + n × α)    (1)
    where x_j is the feature word of the j-th column; T_i is the log classification of the i-th row; P(x_j|T_i) is the conditional probability of x_j under T_i; A(i,j) is the number of times the feature word corresponding to column j appears under the log classification corresponding to row i; B(i,j) is the adjustment parameter of the feature word corresponding to column j under the log classification corresponding to row i; count(T_i) is the sum of the numbers of times the feature words appear under T_i; α is the smoothing coefficient; and n is the number of columns of the word frequency matrix or of the frequency modulation matrix.
  4. The method according to claim 1, wherein the classification model being determined according to the conditional probability of each feature word in the sample logs under each log classification comprises:
    for each feature word, determining the sum of the conditional probabilities of the feature word under the log classifications, and determining the ratio of the conditional probability of the feature word under each log classification to that sum as the feature weight of the feature word under that log classification;
    forming a feature weight matrix from the feature weights of the feature words under the log classifications, and using the feature weight matrix as the classification model.
  5. A log classification device, comprising:
    a determining unit, a classification unit, and a training unit;
    the determining unit being configured to determine the number of times each feature word appears in a log to be classified;
    the classification unit being configured to determine, according to the numbers of occurrences of the feature words in the log to be classified and a classification model, the log classification to which the log to be classified belongs, the classification model being determined by the training unit according to the conditional probability of each feature word in the sample logs under each log classification;
    wherein the conditional probability of each feature word under each log classification is determined by the training unit according to a word frequency model and a frequency modulation model; the word frequency model includes the number of times each feature word appears under each log classification, the frequency modulation model includes an adjustment parameter for each feature word under each log classification, and the adjustment parameter is used by the training unit to adjust the count of the corresponding feature word under the corresponding log classification.
  6. The device according to claim 5, wherein the training unit is specifically configured to:
    perform the following operations for each feature word under each log classification:
    determine the sum of the numbers of times the feature words appear under the log classification;
    determine the conditional probability of the feature word under the log classification according to the count of the feature word in the word frequency model, the adjustment parameter of the feature word in the frequency modulation model, and the sum of the numbers of times the feature words appear under the log classification.
  7. The device according to claim 6, wherein the word frequency model is a word frequency matrix of m rows × n columns, and the frequency modulation model is a frequency modulation matrix of m rows × n columns; the log classification corresponding to the i-th row of the word frequency matrix is the same as the log classification corresponding to the i-th row of the frequency modulation matrix, and the feature word corresponding to the j-th column of the word frequency matrix is the same as the feature word corresponding to the j-th column of the frequency modulation matrix; 0 < i ≤ m, 0 < j ≤ n;
    the training unit being specifically configured to:
    determine the conditional probability of the feature word under the log classification according to formula (1);
    formula (1) being:
    P(x_j | T_i) = (A(i,j) × B(i,j) + α) / (count(T_i) + n × α)    (1)
    where x_j is the feature word of the j-th column; T_i is the log classification of the i-th row; P(x_j|T_i) is the conditional probability of x_j under T_i; A(i,j) is the number of times the feature word corresponding to column j appears under the log classification corresponding to row i; B(i,j) is the adjustment parameter of the feature word corresponding to column j under the log classification corresponding to row i; count(T_i) is the sum of the numbers of times the feature words appear under T_i; α is the smoothing coefficient; and n is the number of columns of the word frequency matrix or of the frequency modulation matrix.
  8. The device according to claim 5, wherein the training unit is specifically configured to:
    for each feature word, determine the sum of the conditional probabilities of the feature word under the log classifications, and determine the ratio of the conditional probability of the feature word under each log classification to that sum as the feature weight of the feature word under that log classification;
    form a feature weight matrix from the feature weights of the feature words under the log classifications, and use the feature weight matrix as the classification model.
  9. A computing device, comprising a processor, a memory, and a communication interface, wherein the processor, the memory, and the communication interface are connected by a bus;
    the processor being configured to read the program in the memory and execute the method according to any one of claims 1 to 4;
    the memory being configured to store one or more executable programs and to store data used by the processor when performing operations.
  10. A non-transitory computer-readable storage medium, storing computer instructions for causing a computer to execute the method according to any one of claims 1 to 4.
  11. A computer program product, comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to execute the method according to any one of claims 1 to 4.
PCT/CN2020/115409 2019-11-01 2020-09-15 一种日志分类方法及装置 (A log classification method and device) WO2021082780A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911060648.7 2019-11-01
CN201911060648.7A CN110929028A (zh) 2019-11-01 2019-11-01 一种日志分类方法及装置 (A log classification method and device)

Publications (1)

Publication Number Publication Date
WO2021082780A1 true WO2021082780A1 (zh) 2021-05-06

Family

ID=69850230

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/115409 WO2021082780A1 (zh) 2019-11-01 2020-09-15 一种日志分类方法及装置

Country Status (2)

Country Link
CN (1) CN110929028A (zh)
WO (1) WO2021082780A1 (zh)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110929028A (zh) * 2019-11-01 2020-03-27 深圳前海微众银行股份有限公司 一种日志分类方法及装置
CN112000502B (zh) * 2020-08-11 2023-04-07 杭州安恒信息技术股份有限公司 海量错误日志的处理方法、装置、电子装置及存储介质
CN112199227B (zh) * 2020-10-14 2022-09-27 北京紫光展锐通信技术有限公司 参数确定方法及相关产品
CN113704469B (zh) * 2021-08-18 2022-04-15 百融至信(北京)征信有限公司 一种基于贝叶斯定理的短文本分类数据集矫正方法及系统

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090234825A1 (en) * 2008-02-28 2009-09-17 Fujitsu Limited Information distribution system and information distribution method
CN103810264A (zh) * 2014-01-27 2014-05-21 西安理工大学 基于特征选择的网页文本分类方法
CN105446495A (zh) * 2015-12-08 2016-03-30 北京搜狗科技发展有限公司 一种候选排序方法和装置
CN105893225A (zh) * 2015-08-25 2016-08-24 乐视网信息技术(北京)股份有限公司 一种错误自动处理方法及装置
CN110929028A (zh) * 2019-11-01 2020-03-27 深圳前海微众银行股份有限公司 一种日志分类方法及装置


Also Published As

Publication number Publication date
CN110929028A (zh) 2020-03-27

Similar Documents

Publication Publication Date Title
WO2021082780A1 (zh) 一种日志分类方法及装置
US10459971B2 (en) Method and apparatus of generating image characteristic representation of query, and image search method and apparatus
WO2021184554A1 (zh) 数据库异常监测方法、装置、计算机装置及存储介质
US20180349158A1 (en) Bayesian optimization techniques and applications
WO2018090657A1 (zh) 基于BP_Adaboost模型的信用卡用户违约的预测方法及系统
US20180268296A1 (en) Machine learning-based network model building method and apparatus
Filippi et al. Parametric bandits: The generalized linear case
WO2022077646A1 (zh) 一种用于图像处理的学生模型的训练方法及装置
US6466946B1 (en) Computer implemented scalable, incremental and parallel clustering based on divide and conquer
WO2019179403A1 (zh) 基于序列宽深学习的欺诈交易检测方法
WO2022042123A1 (zh) 图像识别模型生成方法、装置、计算机设备和存储介质
US10747961B2 (en) Method and device for identifying a sentence
WO2020220758A1 (zh) 一种异常交易节点的检测方法及装置
WO2018040387A1 (zh) 基于支持向量数据描述的特征提取及分类方法及其系统
WO2018153201A1 (zh) 深度学习训练方法及装置
CN106599913A (zh) 一种基于聚类的多标签不平衡生物医学数据分类方法
CN110569289B (zh) 基于大数据的列数据处理方法、设备及介质
CN109766437A (zh) 一种文本聚类方法、文本聚类装置及终端设备
US20240037408A1 (en) Method and apparatus for model training and data enhancement, electronic device and storage medium
WO2022116444A1 (zh) 文本分类方法、装置、计算机设备和介质
WO2019232844A1 (zh) 手写模型训练方法、手写字识别方法、装置、设备及介质
CN112329862A (zh) 基于决策树的反洗钱方法及系统
CN110880018B (zh) 一种卷积神经网络目标分类方法
WO2023016267A1 (zh) 垃圾评论的识别方法、装置、设备及介质
US20220092452A1 (en) Automated machine learning tool for explaining the effects of complex text on predictive results

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20881431

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20881431

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 210922)

122 Ep: pct application non-entry in european phase

Ref document number: 20881431

Country of ref document: EP

Kind code of ref document: A1