WO2021139279A1 - Data processing method and apparatus based on classification model, electronic device and medium - Google Patents

Data processing method and apparatus based on classification model, electronic device and medium (基于分类模型的数据处理方法、装置、电子设备及介质)

Info

Publication number
WO2021139279A1
WO2021139279A1 (application PCT/CN2020/119368; CN2020119368W)
Authority
WO
WIPO (PCT)
Prior art keywords
log data
unmarked
marked
network model
training
Prior art date
Application number
PCT/CN2020/119368
Other languages
English (en)
French (fr)
Inventor
邓悦
郑立颖
徐亮
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2021139279A1 publication Critical patent/WO2021139279A1/zh

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/1734 Details of monitoring file system events, e.g. by the use of hooks, filter drivers, logs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Definitions

  • This application relates to the technical field of neural networks in artificial intelligence, and in particular to a data processing method, apparatus, electronic device, and medium based on a classification model.
  • Anomaly detection is a basic but very important function in intelligent operations (AIOps) systems. It mainly uses algorithms and models to automatically discover abnormal behaviors in KPI (Key Performance Indicator) time-series data, providing the necessary basis for subsequent decisions such as alarming, automatic stop-loss, and root-cause analysis.
  • Logs are text messages generated by large-scale systems to record system state and runtime status. Each log includes a timestamp and a text message indicating what happened.
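  • As a concrete illustration (not part of the patent text; the timestamp format is an assumption matching the example logs shown later), a log line of this kind can be split into its timestamp and message as follows:

```python
import re

# Matches "YYYY-MM-DD hh:mm:ss <free-text message>".
LOG_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+(.*)$")

def parse_log(line: str):
    """Split a raw log line into (timestamp, message)."""
    m = LOG_RE.match(line)
    return (m.group(1), m.group(2)) if m else (None, line)

# parse_log("2008-11-09 20:55:54 PacketResponder 0 for block blk_321 terminating")
# -> ("2008-11-09 20:55:54", "PacketResponder 0 for block blk_321 terminating")
```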
  • To obtain accuracy, traditional abnormal-log classification models usually use supervised learning, relying on marked log data (data with explicit descriptions of normal and abnormal conditions). However, marked log data is very rare in massive logs, and labeling the unmarked log data is very labor- and time-consuming given the massive log information of modern systems.
  • The inventors realized that the wide variety of abnormality types and KPI types makes anomaly detection very difficult.
  • the embodiments of the present application provide a data processing method, device, electronic device, and storage medium based on a classification model.
  • In a first aspect, an embodiment of the present application provides a data processing method based on a classification model. The method includes: acquiring log data, the log data including marked log data and unmarked log data, the marked log data carrying marking information; performing data enhancement processing on the unmarked log data to obtain enhanced unmarked log data; based on a text classification network model, performing prediction processing on the enhanced unmarked log data according to the marked log data to obtain a consistency loss of the enhanced unmarked log data, where the consistency loss represents the distance between the outputs that the unmarked log data and the enhanced unmarked log data respectively produce when processed by the text classification network model; and training the text classification network model based on the consistency loss to obtain a target classification model and abnormality information of the unmarked log data.
  • In a second aspect, an embodiment of the present application provides a data processing apparatus based on a classification model, including: an acquisition module for acquiring log data, the log data including marked log data and unmarked log data, the marked log data carrying marking information; a data enhancement module for performing data enhancement processing on the unmarked log data to obtain enhanced unmarked log data; a prediction module for performing, based on a text classification network model and according to the marked log data, prediction processing on the enhanced unmarked log data to obtain the consistency loss of the enhanced unmarked log data, the consistency loss representing the distance between the outputs that the unmarked log data and the enhanced unmarked log data respectively produce in the text classification network model processing; and a training module for training the text classification network model based on the consistency loss to obtain the target classification model and the abnormality information of the unmarked log data.
  • In a third aspect, an embodiment of the present application also provides an electronic device, including a processor, an input device, an output device, and a memory that are connected to one another, where the memory is used to store a computer program, the computer program includes program instructions, and the processor is configured to invoke the program instructions to execute the method according to the first aspect and any one of its possible implementations, where the data processing method based on the classification model includes: acquiring log data, the log data including marked log data and unmarked log data, the marked log data carrying marking information; performing data enhancement processing on the unmarked log data to obtain enhanced unmarked log data; based on a text classification network model, performing prediction processing on the enhanced unmarked log data according to the marked log data to obtain the consistency loss of the enhanced unmarked log data, the consistency loss representing the distance between the outputs that the unmarked log data and the enhanced unmarked log data respectively produce in the text classification network model processing; and training the text classification network model based on the consistency loss to obtain the target classification model and the abnormality information of the unmarked log data.
  • In a fourth aspect, an embodiment of the present application provides a computer storage medium that stores a computer program, the computer program including program instructions that, when executed by a processor, cause the processor to execute the method of the first aspect and any one of its possible implementations.
  • In the embodiments of the application, log data is acquired, the log data including marked log data carrying marking information and unmarked log data. Data enhancement processing is performed on the unmarked log data to obtain enhanced unmarked log data. Based on a text classification network model, prediction processing is performed on the enhanced unmarked log data according to the marked log data to obtain the consistency loss of the enhanced unmarked log data, where the consistency loss indicates the distance between the outputs respectively produced for the unmarked log data and the enhanced unmarked log data; the text classification network model is then trained based on the consistency loss to obtain the target classification model and the abnormality information of the unmarked log data. When data samples are massively imbalanced, data enhancement of unlabeled log data can expand the number of abnormal log samples in the training set, replacing traditional noise-injection methods and thereby improving the model's recognition of abnormal points. AI operations staff do not need to perform a large amount of log labeling work: little labeled data is required and accuracy is high. Moreover, as training proceeds, the abnormality information of unlabeled log data is obtained, i.e., unlabeled log data is gradually labeled; compared with traditional unsupervised learning models, training is faster, the memory footprint is small, and the computational burden on hardware is greatly reduced, making the approach suitable for large-scale deployment.
  • FIG. 1 is a schematic flowchart of a data processing method based on a classification model provided by an embodiment of the present application;
  • FIG. 2 is a schematic architecture diagram of a method for enhancing unmarked abnormal log data provided by an embodiment of the present application;
  • FIG. 3 is a schematic flowchart of another data processing method based on a classification model provided by an embodiment of the present application;
  • FIG. 4 is a schematic diagram of a method for constructing word vectors provided by an embodiment of the present application;
  • FIG. 5 is a schematic structural diagram of a data processing apparatus based on a classification model provided by an embodiment of the present application;
  • FIG. 6 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • Referring to FIG. 1, a schematic flowchart of a data processing method based on a classification model provided by an embodiment of this application; as shown in FIG. 1, the method may include:
  • 101. Acquire log data, where the log data includes marked log data and unmarked log data, and the marked log data carries marking information.
  • The execution subject in the embodiment of the present application may be a data processing apparatus based on a classification model, and specifically may be the above-mentioned electronic device.
  • Logs are text messages generated by large-scale systems to record system state and runtime status. Each log includes a timestamp and a text message indicating what happened.
  • The marked log data refers to log data carrying marking information, i.e., an explicit description of normal and abnormal conditions (for example, an abnormality level such as severe, normal, or minor). However, marked log data is very rare in massive logs, and labeling unmarked log data is very labor- and time-consuming given the massive log information of modern systems.
  • This application can rely on only a small amount of marked log data to make correct predictions on unmarked log data, which greatly expands the number of abnormal logs available to the model and facilitates subsequent analysis and management based on abnormal logs. After the marked log data and the unmarked log data are acquired as sample data, step 102 may be performed.
  • 102. Perform data enhancement processing on the unmarked log data to obtain enhanced unmarked log data.
  • The embodiment of the present application may use a text classification network model (Text-CNN). Text-CNN is an algorithm that classifies text using a convolutional neural network: it extracts N-gram features of the text by convolution, applies max pooling, and then classifies through a fully connected layer. The model consists of four parts: an input layer, a convolutional layer, a pooling layer, and a fully connected layer.
  • For the marked log data, a supervised learning method can be used to calculate the cross-entropy loss function. For unmarked data, a consistency training mode can be applied, based on the principle that abnormal log data and its data-enhanced counterpart should produce the same output under the same model; following this principle, the model checks whether the predicted marking information (label) of the unmarked log data is similar to the prediction for the corresponding enhanced unmarked log data.
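  • As an illustration of this consistency objective, the following is a minimal PyTorch sketch (our own, not from the patent; the model is assumed to return logits) that computes a KL-divergence consistency loss between the predictions on unlabeled logs and on their augmented counterparts:

```python
import torch
import torch.nn.functional as F

def consistency_loss(model, x_unlabeled, x_augmented):
    """KL divergence between predictions on original and augmented
    unlabeled log data (the 'distance between outputs')."""
    with torch.no_grad():
        # Predictions on the original unlabeled logs act as the target
        # distribution and receive no gradient.
        p_orig = F.softmax(model(x_unlabeled), dim=-1)
    log_p_aug = F.log_softmax(model(x_augmented), dim=-1)
    # KL(p_orig || p_aug), averaged over the batch.
    return F.kl_div(log_p_aug, p_orig, reduction="batchmean")
```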
  • In an optional implementation, the above step 102 includes: performing back translation processing on the unmarked log data, determining keywords in the unmarked log data, and performing synonym replacement according to the keywords, to obtain the enhanced unmarked log data.
  • Specifically, in the consistency training mode, abnormal logs may be expanded by back translation. Back translation means translating B, a translation of text in language A, back into language A. It can be divided into two types: terminology-regression back translation and translation-accuracy-test back translation. Back translation can generate different expressions while keeping the semantics of the log text unchanged, enhancing the diversity of the text.
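  • A minimal sketch of the back-translation step (the `translate` callable is a stand-in for any machine-translation backend; the patent does not name one):

```python
def back_translate(log_text: str, translate, pivot: str = "fr") -> str:
    """Back translation: language A -> pivot -> A, producing a paraphrase
    that keeps the log's semantics while varying its wording."""
    # translate(text, src, tgt) must be supplied by the caller,
    # e.g. a local seq2seq model or a translation service.
    pivot_text = translate(log_text, "en", pivot)
    return translate(pivot_text, pivot, "en")
```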
  • Optionally, TF-IDF word replacement can also be used. TF-IDF is a weighting technique commonly used in information retrieval and data mining, where TF is term frequency and IDF is inverse document frequency. It is used to evaluate the importance of a word to one document in a document set or corpus: the importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to the frequency of its appearance in the corpus.
  • Using TF-IDF refines the random word-processing strategy of EDA (Easy Data Augmentation, which performs word-level operations on the input text such as replacement, deletion, insertion, and swapping): keywords can be determined from DBPedia prior knowledge and the word frequencies of the actual corpus, and synonyms are then substituted according to the determined keywords, avoiding the generation of useless or erroneous data. DBPedia is a knowledge graph or concept library that extracts various concepts from Wikipedia or web articles. In this way, the expanded log text is guaranteed to contain the necessary keywords while being expanded. Note that back translation performs data enhancement on the entire document, whereas TF-IDF operates at the word level.
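  • A sketch of TF-IDF-guided word replacement, assuming scikit-learn is available (the quantile cutoff and the `synonym` lookup are illustrative assumptions; a thesaurus or DBPedia could back the lookup):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def low_tfidf_words(corpus, quantile=0.3):
    """Return the low-TF-IDF (uninformative) words, which are the safe
    replacement candidates; high-TF-IDF words are kept as keywords."""
    vec = TfidfVectorizer()
    scores = np.asarray(vec.fit_transform(corpus).mean(axis=0)).ravel()
    vocab = np.array(vec.get_feature_names_out())
    return set(vocab[scores <= np.quantile(scores, quantile)])

def augment(log_text, replaceable, synonym):
    """Replace only low-importance words, leaving keywords intact."""
    return " ".join(
        synonym(w) if w.lower() in replaceable else w
        for w in log_text.split()
    )
```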
  • 103. Based on the text classification network model, perform prediction processing on the enhanced unmarked log data according to the marked log data to obtain the consistency loss of the enhanced unmarked log data, where the consistency loss indicates the distance between the outputs that the unmarked log data and the enhanced unmarked log data respectively produce in the text classification network model processing.
  • Specifically, refer to the schematic architecture diagram of the method for enhancing unmarked abnormal log data shown in FIG. 2. In FIG. 2, x represents log data, y represents the label of the log data, and x̂ is the augmented log-data input. M is a model that predicts y from x: p_θ(y|x) is the probability of predicting y from x, and p_θ(y|x̂) is the probability of predicting y from x̂; θ represents the parameters of the model. In the lower half of FIG. 2, x represents unmarked log data and x̂ represents the unmarked log data enhanced by the back translation processing and/or the TF-IDF word replacement described above. The Text-CNN model is applied simultaneously to the unmarked log data and to the corresponding enhanced unmarked log data; the distance between the two resulting model outputs, i.e., the consistency loss, is computed, and the final loss of the network is then calculated from it.
  • The training method shown in FIG. 2 is described in more detail below and will not be repeated here.
  • 104. Train the text classification network model based on the consistency loss to obtain the target classification model and the abnormality information of the unmarked log data.
  • Reducing the consistency loss to a minimum (for example, below a preset loss threshold) gradually propagates the marking information from the marked log data to the unmarked log data; that is, predicted marking information is obtained for the unmarked log data, from which the abnormal log data can be determined.
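  • To make the abnormality information concrete, a small sketch (the model interface and level names follow the abnormality levels described in this application) of reading off predicted marking information for unlabeled logs after training:

```python
import torch

LEVELS = ("major", "common", "minor", "normal")  # abnormality levels

@torch.no_grad()
def predict_abnormality(model, unlabeled_batch):
    """Once the consistency loss is low, the model's predictions serve
    as the marking information of the unlabeled log data."""
    probs = model(unlabeled_batch).softmax(dim=-1)
    return [LEVELS[i] for i in probs.argmax(dim=-1).tolist()]
```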
  • The scope of application of the model in the embodiments of this application is thus greatly broadened: only a small amount of marked abnormal logs is needed, and consistency prediction on unmarked logs based on the label information of the marked abnormal logs can greatly expand the number of abnormal log inputs to the model, improving the model's recognition of abnormal points, with accuracy comparable to, or even surpassing, supervised models that use large amounts of labeled data. Processing log data with this model can also reduce the cost of anomaly detection.
  • The abnormality information is the marking information predicted by the network model; it can be understood as determining, through prediction, the abnormality level or abnormality classification of the unmarked log data.
  • In one implementation, the above method further includes: analyzing system log data according to the target classification model to obtain an analysis result, the analysis result including the probability that the system log data belongs to each abnormality level.
  • From the Text-CNN model's analysis results on system logs, AI operations staff can learn the operating status of the system reflected in the logs and formulate specific operation and maintenance strategies, for example: managing the abnormal-log system by priority, focusing on operating conditions prone to major abnormalities; and, for high-priority abnormal logs, taking emergency measures as soon as a major abnormality occurs, responding quickly, locating the specific cause of the failure, and eliminating it.
  • The training method and the application method for analyzing log data in the embodiments of the present application may be executed in different apparatuses, respectively.
  • In summary, the embodiments of the application acquire log data including marked log data carrying marking information and unmarked log data; perform data enhancement processing on the unmarked log data to obtain enhanced unmarked log data; based on the text classification network model, perform prediction processing on the enhanced unmarked log data according to the marked log data to obtain the consistency loss, i.e., the distance between the outputs respectively produced for the unmarked log data and the enhanced unmarked log data; and then train the text classification network model based on the consistency loss to obtain the target classification model and the abnormality information of the unmarked log data. When data samples are massively imbalanced, data enhancement of unlabeled log data expands the number of abnormal log samples in training, replacing traditional noise-injection methods and improving the model's recognition of abnormal points. AI operations staff need not perform a large amount of log annotation: little labeled data is required and accuracy is high, which suits the new intelligent operation and maintenance digital service engine (AIOps). As training proceeds, the abnormality information of unlabeled log data is obtained, i.e., unlabeled log data is gradually labeled; compared with traditional unsupervised learning models, training is faster, the memory footprint is small, and the computational burden on hardware is greatly reduced, making the approach suitable for large-scale deployment.
  • Referring to FIG. 3, a schematic flowchart of another data processing method based on a classification model provided by an embodiment of the present application; the embodiment shown in FIG. 3 may be obtained on the basis of the embodiment shown in FIG. 1. As shown in FIG. 3, the method may include:
  • 301. Acquire log data, where the log data includes marked log data and unmarked log data, and the marked log data carries marking information.
  • 302. Perform data enhancement processing on the unmarked log data to obtain enhanced unmarked log data.
  • The execution subject in the embodiment of the present application may be a data processing apparatus based on a classification model, and specifically may be the above-mentioned electronic device.
  • For steps 301 and 302, reference may be made to the specific descriptions of steps 101 and 102 in the embodiment shown in FIG. 1, which will not be repeated here.
  • 303. Input the marked log data into the text classification network model for training to obtain the cross-entropy loss of the marked log data.
  • Specifically, for the marked log data, a supervised learning method can be used to calculate the cross-entropy loss function, as shown in the upper part of FIG. 2. Here M uses the Text-CNN model, whose specific structure can be described as follows:
  • 1) Input layer (word embedding layer):
  • In an optional implementation, the input layer of the text classification network model includes a set length threshold; inputting the marked log data into the text classification network model for training includes: inputting the sample sequences of the marked log data into the text classification network model, and at the input layer of the text classification network model: judging whether the text length of a sample sequence is less than the length threshold; if the text length of the sample sequence is less than the length threshold, padding the sample sequence with a custom filler until it meets the length threshold; if the text length of the sample sequence is greater than the length threshold, truncating the sample sequence to a subsequence that meets the length threshold; and constructing the word vectors of the sample sequence, the word vectors including the distributed representation corresponding to each word in the sample sequence.
  • Specifically, a fixed-length log text sequence must be input at the input layer of the Text-CNN model. The length L of an input sequence can be specified by analyzing the lengths of the corpus samples, i.e., the length threshold is preset. For the input log data, sample sequences shorter than L are padded and sequences longer than L are truncated.
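  • A sketch of the pad-or-truncate rule (the filler token name is an assumption):

```python
PAD = "<pad>"  # custom filler token

def fit_to_length(tokens: list[str], L: int) -> list[str]:
    """Pad sequences shorter than L with the filler and truncate
    sequences longer than L, so every input has exactly length L."""
    if len(tokens) < L:
        return tokens + [PAD] * (L - len(tokens))
    return tokens[:L]
```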
  • For example, abnormal logs may look as follows (abnormality marks shown in brackets at the end of marked lines):
2008-11-09 20:55:54 PacketResponder 0 for block blk_321 terminating [major abnormality]
2008-11-09 20:55:54 Received block blk_321 of size 67108864 from /10.251.195.70 [minor abnormality]
2008-11-09 20:55:54 PacketResponder 2 for block blk_321 terminating
2008-11-09 20:55:54 Received block blk_321 of size 67108864 from /10.251.126.5
2008-11-09 21:56:50 10.251.126.5:50010: Got exception while serving blk_321 to /10.251.127.243
2008-11-10 03:58:04 Vertification succeeded for blk_321 [normal]
2008-11-10 10:36:37 Deleting block blk_321 file /mnt/hadoop/dfs/data/current/subdir1/blk_321
2008-11-10 10:36:50 Deleting block blk_321 file /mnt/hadoop/dfs/data/current/subdir1/blk_321
  • Refer to the schematic diagram of constructing word vectors shown in FIG. 4, which corresponds to the first log line above: "2008-11-09 20:55:54 PacketResponder 0 for block blk_321 terminating [major abnormality]". This log message contains six words in total, each taken as a vector; since the number of words is 6, the dimension of each vector can be assumed to be 1*5 so that the words can be distinguished as much as possible. What the input layer finally receives is the distributed representation corresponding to each word in the log text sequence, i.e., the word vectors.
  • 304. Based on the text classification network model, perform prediction processing on the enhanced unmarked log data according to the marked log data to obtain the consistency loss of the enhanced unmarked log data. For step 304, reference may be made to the specific description of step 103 in the embodiment shown in FIG. 1, which will not be repeated here.
  • 305. Input the marked log data into the text classification network model for training to obtain the cross-entropy loss of the marked log data.
  • For marked log data, a supervised learning method is used to calculate the cross-entropy loss function, as shown in the upper half of FIG. 2. Cross entropy is an important concept in information theory, mainly used to measure the difference between two probability distributions: it measures the degree of difference between two probability distributions over the same random variable, and in machine learning it expresses the difference between the true probability distribution and the predicted probability distribution. The smaller the value of the cross entropy, the better the model's prediction.
  • This embodiment of the application selects the above Text-CNN model, whose input layer is as described in step 303. Further, the model also includes:
  • 2) Convolutional layer: in the field of Natural Language Processing (NLP), the convolution kernel generally slides in only one dimension; that is, the width of the convolution kernel equals the dimension of the word vectors, and the kernel slides only along the sequence.
  • The Text-CNN model in the embodiments of this application generally uses multiple convolution kernels of different sizes. The height of a convolution kernel, i.e., the window value, can be understood as the N in an N-gram model, i.e., the length of the local word order used: the content of the text is processed with a sliding window of size N, forming a sequence of fragments of length N. The window value is a hyperparameter that needs to be determined experimentally for the task; optionally, it can be an integer between 2 and 8.
  • 3) Pooling layer: max pooling (max-pool) is used in the pooling layer of the Text-CNN model, which reduces the number of model parameters while ensuring that a fixed-length input to the fully connected layer is obtained from the output of the variable-length convolutional layer.
  • The core role of the convolutional and pooling layers in the classification model is feature extraction: from the input fixed-length text sequence, local word-order information is used to extract primary features, which are combined into higher-level features; the convolution and pooling operations eliminate the feature-engineering step of traditional machine learning.
  • 4) Fully connected layer: the fully connected layer acts as the classifier. The original Text-CNN model uses a fully connected network with only one hidden layer, which is equivalent to feeding the abnormal-log features extracted by the convolution and pooling layers into a Softmax function for classification, outputting the probability of the log data belonging to each category. The output rule set in the embodiments of this application can be the abnormality level, including major abnormality, common abnormality, minor abnormality, and normal; the model then outputs the probability that each log belongs to each abnormality level, realizing classification of log abnormality levels.
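  • A minimal PyTorch sketch of a Text-CNN of the kind described (vocabulary size, embedding dimension, channel count, and window sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    """Embedding -> parallel 1-D convolutions over several window
    sizes -> max pooling -> fully connected classifier."""
    def __init__(self, vocab_size=10000, embed_dim=128,
                 windows=(2, 3, 4), channels=64, num_classes=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Kernel width equals the word-vector dimension, so each kernel
        # slides in one dimension along the token sequence.
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, channels, kernel_size=n) for n in windows
        )
        self.fc = nn.Linear(channels * len(windows), num_classes)

    def forward(self, token_ids):                  # (batch, L)
        x = self.embed(token_ids).transpose(1, 2)  # (batch, embed, L)
        feats = [F.relu(c(x)).max(dim=2).values for c in self.convs]
        return self.fc(torch.cat(feats, dim=1))    # logits per level
```

  • Per-abnormality-level probabilities (major, common, minor, normal) are then obtained with `model(batch).softmax(dim=-1)`.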
  • 306. Calculate the target loss according to the cross-entropy loss of the marked log data and the consistency loss of the unmarked log data.
  • Specifically, the unmarked abnormal-data enhancement technique in the embodiment of the present application computes the final loss, i.e., the target loss, by combining the cross-entropy loss of the marked log data with the unsupervised consistency loss of the unmarked log data; the formula may be as follows: J(θ) = J_CE(θ) + λ·J_KL(θ), where J(θ) is the target loss function, J_CE(θ) is the cross-entropy loss function on the labeled data, and J_KL(θ) is the relative-entropy (KL-divergence) loss function on the unlabeled data; λ is set to balance the supervised loss and the unsupervised loss, and θ represents the parameters of the model, which can include the weights of the neural network, the number of convolution kernels, the size of the sliding window, and so on.
  • 307. Train the text classification network model based on the target loss to obtain the target classification model. Specifically, according to the description in step 306, the text classification network model (Text-CNN model) can be trained with the above target loss function as its loss function, obtaining a target classification model for log analysis and anomaly detection.
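  • A sketch of one training step under the combined objective, reusing the `consistency_loss` sketch above (`lam` plays the role of λ):

```python
import torch.nn.functional as F

def training_step(model, labeled_x, labels, unlabeled_x, augmented_x,
                  lam=1.0):
    """J(theta) = cross-entropy on marked logs + lam * consistency
    loss on unmarked logs."""
    ce = F.cross_entropy(model(labeled_x), labels)
    kl = consistency_loss(model, unlabeled_x, augmented_x)
    return ce + lam * kl
```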
  • In an optional implementation, during the training of the text classification network model, the marked log data in training may be gradually deleted according to the increase of marked log data.
  • Because the marked log data in the first term of the target loss function is scarce while the unmarked log data in the second term is plentiful, overfitting is bound to occur at the beginning of model training as the number of training iterations increases. To prevent such overfitting, the embodiment of the present application proposes a training signal annealing method, which applies only to the marked log data. Specifically, overfitting can be prevented by dynamically changing a threshold. The basic principle is as follows: during training, as the unmarked log data increases, the marked log data in training is gradually deleted, so that the model does not overfit the marked log data.
  • In one implementation, gradually deleting the marked log data in training according to the increase of marked log data includes: when the number of training steps reaches a preset step threshold, deleting target marked log data from the loss function when the probability of a correct prediction obtained for that target marked log data is greater than a probability threshold; where the probability of a correct prediction is the probability that the predicted category result of the target marked log data is the same as the marking information of the target marked log data; and the probability threshold is updated according to the number of training steps and the total number of training steps.
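  • A sketch of this removal rule applied to the cross-entropy term (our own illustration; `eta_t` is the probability threshold at step t):

```python
import torch.nn.functional as F

def tsa_cross_entropy(logits, labels, eta_t):
    """Cross-entropy over marked logs, dropping examples whose
    predicted-correct probability already exceeds eta_t."""
    p_correct = logits.softmax(dim=-1).gather(1, labels.unsqueeze(1)).squeeze(1)
    keep = (p_correct <= eta_t).float()
    loss = F.cross_entropy(logits, labels, reduction="none")
    return (loss * keep).sum() / keep.sum().clamp(min=1)
```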
  • Specifically, a correspondence between preset step thresholds and probability thresholds can be set in advance, represented by the probability threshold η_t; that is, different probability thresholds can be used at different training steps t. At training step t, when the probability of a correct prediction p(y*|x) calculated for a piece of marked data is greater than the probability threshold η_t, that marked log data is removed from the loss function. With K set as the number of categories, the value of η_t can be gradually increased over an interval (e.g., from 1/K up to 1) to prevent overfitting to the labeled data.
  • In one implementation, the probability threshold η_t may be updated as η_t = α_t·(1 - 1/K) + 1/K, where α_t can be set as required; for example, α_t can take logarithmic, linear, and exponential forms, such as: logarithmic, α_t = 1 - exp(-5t/T); linear, α_t = t/T; exponential, α_t = exp(5(t/T - 1)); where T represents the total number of training steps and t is the current number of training steps.
  • The schedule α_t in the embodiment of the present application can be set to the above logarithmic, linear, or exponential form according to the data volume of the marked log data, corresponding specifically to the following three different applicable conditions:
  • (1) When the problem is relatively easy, the amount of labeled data is small, and the model overfits easily, the model can make high-probability predictions from the data in a short time; in this case the exp exponential function can be used to make the threshold grow more slowly, so as to delete more samples that are easy to train.
  • (2) When the amount of data is large and the model is unlikely to overfit, the model takes a long time to make high-probability predictions, fewer high-probability prediction samples are output in the same time, and fewer samples need to be deleted; in this case the log logarithmic function can be used to make the threshold grow faster, so that fewer samples are deleted.
  • (3) For general samples, a linear function can be used to adjust the threshold.
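  • The three schedules can be sketched as follows (the constant 5 in the log/exp forms is an assumption borrowed from common training-signal-annealing practice, not stated in the patent):

```python
import math

def tsa_threshold(t: int, T: int, K: int, schedule: str = "linear") -> float:
    """Threshold eta_t rising from 1/K toward 1 over T steps."""
    frac = t / T
    if schedule == "log":    # rises quickly early: fewer samples removed
        alpha = 1 - math.exp(-5 * frac)
    elif schedule == "exp":  # stays low until late: more samples removed
        alpha = math.exp(5 * (frac - 1))
    else:                    # linear
        alpha = frac
    return alpha * (1 - 1 / K) + 1 / K
```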
  • The target classification model obtained by training can be used for log data analysis. From the Text-CNN model's analysis results on system logs, the system operating status reflected in the logs can be learned, and specific operation and maintenance strategies can be formulated: managing the abnormal-log system by priority, focusing on operating conditions prone to major abnormalities; and, for high-priority abnormal logs, taking emergency measures as soon as a major abnormality occurs, responding quickly, locating the specific cause of the failure, and eliminating it.
  • The data processing method based on a classification model of the embodiments of the application requires little labeled data for training the text classification network model and achieves high accuracy; it does not require a large amount of manual log labeling and saves the considerable time and energy of manually labeling data, thereby greatly reducing the cost of anomaly detection.
  • At the same time, the scope of application of the model is greatly broadened: only a small amount of marked log data (including a small number of marked abnormal logs) is needed, and consistency prediction on unmarked logs based on the label information of the marked abnormal logs greatly expands the number of abnormal log inputs to the model, improving the model's recognition of abnormal points, with accuracy comparable to, or even surpassing, supervised models that use large amounts of labeled data.
  • In addition, since the amount of marked log data required is small and the unmarked log data is gradually labeled over time, training is faster than that of traditional unsupervised learning models, the memory footprint is small, and the computational burden on the hardware is greatly reduced, which is suitable for large-scale deployment.
  • Referring to FIG. 5, FIG. 5 is a schematic structural diagram of a data processing apparatus based on a classification model provided by an embodiment of the present application. The data processing apparatus 500 based on a classification model includes:
  • The obtaining module 510 is configured to obtain log data, the log data including marked log data and unmarked log data, the marked log data carrying marking information;
  • The data enhancement module 520 is configured to perform data enhancement processing on the unmarked log data to obtain enhanced unmarked log data;
  • The prediction module 530 is configured to perform, based on the text classification network model and according to the marked log data, prediction processing on the enhanced unmarked log data to obtain the consistency loss of the enhanced unmarked log data, the consistency loss indicating the distance between the outputs that the unmarked log data and the enhanced unmarked log data respectively produce in the text classification network model processing;
  • The training module 540 is configured to train the text classification network model based on the consistency loss to obtain the target classification model and the abnormality information of the unmarked log data.
  • Optionally, the training module 540 is further configured to, before the prediction module 530 performs prediction processing on the enhanced unmarked log data based on the text classification network model according to the marked log data: input the marked log data into the text classification network model for training to obtain the cross-entropy loss of the marked log data; calculate the target loss according to the cross-entropy loss of the marked log data and the consistency loss of the unmarked log data; and train the text classification network model based on the target loss to obtain the target classification model.
  • Optionally, the input layer of the text classification network model includes a set length threshold, and the training module 540 is specifically configured to: input the sample sequences of the marked log data into the text classification network model, and at the input layer of the text classification network model: judge whether the text length of a sample sequence is less than the length threshold; if the text length of the sample sequence is less than the length threshold, pad the sample sequence with a custom filler until it meets the length threshold; if the text length of the sample sequence is greater than the length threshold, truncate the sample sequence to a subsequence that meets the length threshold; and construct the word vectors of the sample sequence, the word vectors including the distributed representation corresponding to each word in the sample sequence.
  • Optionally, the training module 540 is further configured to, during the training of the text classification network model, gradually delete the marked log data in training according to the increase of marked log data.
  • Further optionally, the training module 540 is specifically configured to: when the number of training steps reaches a preset step threshold, delete target marked log data from the loss function when the probability of a correct prediction obtained for that target marked log data is greater than the probability threshold; where the probability of a correct prediction is the probability that the predicted category result of the target marked log data is the same as the marking information of the target marked log data, and the probability threshold is updated according to the number of training steps and the total number of training steps.
  • Optionally, the data processing apparatus 500 based on a classification model further includes an analysis module 550, configured to analyze system log data according to the target classification model to obtain an analysis result, the analysis result including the probability that the system log data belongs to each abnormality level.
  • According to specific implementations of the embodiments of the present application, the steps involved in the data processing methods based on a classification model shown in FIG. 1 and FIG. 3 may be executed by the respective modules in the data processing apparatus 500 based on a classification model shown in FIG. 5, and will not be repeated here.
  • Through the data processing apparatus 500 based on a classification model of the embodiments of the present application, the apparatus 500 can acquire log data including marked log data carrying marking information and unmarked log data, perform data enhancement processing on the unmarked log data to obtain enhanced unmarked log data, perform, based on the text classification network model and according to the marked log data, prediction processing on the enhanced unmarked log data to obtain the consistency loss of the enhanced unmarked log data (the distance between the outputs respectively produced for the unmarked and enhanced unmarked log data in the text classification network model processing), and then train the text classification network model based on the consistency loss to obtain the target classification model and the abnormality information of the unmarked log data. When data samples are massively imbalanced, data enhancement of unlabeled log data expands the number of abnormal log samples in training, replacing traditional noise-injection methods and improving the model's recognition of abnormal points; AI operations staff need not perform a large amount of log labeling, little labeled data is required, and accuracy is high. As training proceeds, the abnormality information of unlabeled log data is obtained, i.e., unlabeled log data is gradually labeled; compared with traditional unsupervised learning models, training is faster, the memory footprint is small, and the computational burden on hardware is greatly reduced, which is suitable for large-scale deployment.
  • Referring to FIG. 6, FIG. 6 is a schematic structural diagram of an electronic device disclosed in an embodiment of the present application. As shown in FIG. 6, the electronic device 600 includes a processor 601 and a memory 602, and may also include a bus 603 through which the processor 601 and the memory 602 are connected to each other; the bus 603 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus 603 can be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is used in FIG. 6, but this does not mean that there is only one bus or one type of bus.
  • the electronic device 600 may also include an input/output device 604, and the input/output device 604 may include a display screen, such as a liquid crystal display screen.
  • The memory 602 is used to store one or more programs containing instructions; the processor 601 is used to call the instructions stored in the memory 602 to execute some or all of the steps of the data processing method based on a classification model mentioned in the embodiments of FIG. 1 and FIG. 3, where the method includes: obtaining log data, the log data including marked log data and unmarked log data, the marked log data carrying marking information; performing data enhancement processing on the unmarked log data to obtain enhanced unmarked log data; based on the text classification network model, performing prediction processing on the enhanced unmarked log data according to the marked log data to obtain the consistency loss of the enhanced unmarked log data, the consistency loss representing the distance between the outputs that the unmarked log data and the enhanced unmarked log data respectively produce in the text classification network model processing; and training the text classification network model based on the consistency loss to obtain the target classification model and the abnormality information of the unmarked log data. This will not be repeated here.
  • It should be understood that in the embodiments of the present application, the processor 601 may be a central processing unit (CPU); the processor may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
  • The input device 602 may include a touch panel, a fingerprint sensor (used to collect user fingerprint information and fingerprint orientation information), a microphone, etc.; the output device 603 may include a display (LCD, etc.), a speaker, and the like.
  • The memory 604 may include read-only memory and random access memory, and provides instructions and data to the processor 601; a part of the memory 604 may also include non-volatile random access memory. For example, the memory 604 may also store device-type information.
  • Through the electronic device 600 of the embodiments of the present application, the electronic device 600 can obtain log data including marked log data carrying marking information and unmarked log data, perform data enhancement processing on the unmarked log data to obtain enhanced unmarked log data, perform prediction processing on the enhanced unmarked log data based on the text classification network model and according to the marked log data to obtain the consistency loss of the enhanced unmarked log data (the distance between the outputs respectively produced for the unmarked and enhanced unmarked log data in the text classification network model processing), and then train the model based on the consistency loss to obtain the target classification model and the abnormality information of the unmarked log data. When data samples are massively imbalanced, data enhancement of unlabeled log data expands the number of abnormal log samples in training, replacing traditional noise-injection methods and improving the model's recognition of abnormal points; AI operations staff need not perform a large amount of log labeling, little labeled data is required, and accuracy is high. As training proceeds, the abnormality information of unlabeled log data is obtained, i.e., unlabeled log data is gradually labeled; compared with traditional unsupervised learning models, training is faster, the memory footprint is small, and the computational burden on hardware is greatly reduced, which is suitable for large-scale deployment.
  • An embodiment of the present application also provides a computer storage medium, which is a volatile or non-volatile storage medium storing a computer program for electronic data exchange; the computer program causes a computer to execute some or all of the steps of any data processing method based on a classification model as recorded in the above method embodiments, where the method includes: obtaining log data, the log data including marked log data and unmarked log data, the marked log data carrying marking information; performing data enhancement processing on the unmarked log data to obtain enhanced unmarked log data; based on the text classification network model, performing prediction processing on the enhanced unmarked log data according to the marked log data to obtain the consistency loss of the enhanced unmarked log data, the consistency loss indicating the distance between the outputs that the unmarked log data and the enhanced unmarked log data respectively produce in the text classification network model processing; and training the text classification network model based on the consistency loss to obtain the target classification model and the abnormality information of the unmarked log data.
  • In the several embodiments provided in this application, it should be understood that the disclosed apparatus may be implemented in other ways. The apparatus embodiments described above are only illustrative; for example, the division into modules is only a logical functional division, and in actual implementation there may be other divisions, e.g., multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • In addition, the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses, or modules, and may be electrical or in other forms.
  • The modules described as separate components may or may not be physically separate, and the components displayed as modules may or may not be physical modules; that is, they may be located in one place or distributed over multiple network modules. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • If the integrated module is implemented in the form of a software functional module and sold or used as an independent product, it can be stored in a computer-readable memory. Based on this understanding, the essence of the technical solution, or the part that contributes to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments.
  • The aforementioned memory includes various media that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Medical Informatics (AREA)
  • Debugging And Monitoring (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of this application disclose a data processing method and apparatus based on a classification model, an electronic device, and a medium, relating to neural network technology in artificial intelligence. The method includes: acquiring log data, the log data including marked log data and unmarked log data, the marked log data carrying marking information; performing data enhancement processing on the unmarked log data to obtain enhanced unmarked log data; based on a text classification network model, performing prediction processing on the enhanced unmarked log data according to the marked log data to obtain a consistency loss of the enhanced unmarked log data, the consistency loss representing the distance between the outputs that the unmarked log data and the enhanced unmarked log data respectively produce when processed by the text classification network model; and training the text classification network model based on the consistency loss to obtain a target classification model and abnormality information of the unmarked log data.

Description

Data processing method and apparatus based on classification model, electronic device and medium
This application claims priority to the Chinese patent application filed with the China Patent Office on July 30, 2020, with application number 202010751730.0 and invention title "Data processing method and apparatus based on classification model, electronic device and medium", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the technical field of neural networks in artificial intelligence, and in particular to a data processing method and apparatus based on a classification model, an electronic device, and a medium.
Background
Anomaly detection is a basic but very important function in intelligent operations (AIOps) systems. It mainly uses algorithms and models to automatically discover abnormal behaviors in KPI (Key Performance Indicator) time-series data, providing the necessary basis for subsequent decisions such as alarming, automatic stop-loss, and root-cause analysis.
Logs are text messages generated by large-scale systems to record system state and runtime status; each log includes a timestamp and a text message indicating what happened. To obtain accuracy, traditional abnormal-log classification models usually use supervised learning with marked log data (data with explicit descriptions of normal and abnormal conditions). However, marked log data is very rare in massive logs, and labeling unmarked log data is very labor- and time-consuming given the massive log information of modern systems. Moreover, the inventors realized that the wide variety of abnormality types and KPI types makes anomaly detection very difficult.
Technical Problem
To obtain accuracy, traditional abnormal-log classification models usually use supervised learning with marked log data (data with explicit descriptions of normal and abnormal conditions). However, marked log data is very rare in massive logs, and labeling unmarked log data is very labor- and time-consuming given the massive log information of modern systems. Moreover, the wide variety of abnormality types and KPI types makes anomaly detection very difficult.
Technical Solution
The embodiments of this application provide a data processing method and apparatus based on a classification model, an electronic device, and a storage medium.
In a first aspect, an embodiment of this application provides a data processing method based on a classification model, the method including: acquiring log data, the log data including marked log data and unmarked log data, the marked log data carrying marking information; performing data enhancement processing on the unmarked log data to obtain enhanced unmarked log data; based on a text classification network model, performing prediction processing on the enhanced unmarked log data according to the marked log data to obtain a consistency loss of the enhanced unmarked log data, the consistency loss representing the distance between the outputs that the unmarked log data and the enhanced unmarked log data respectively produce when processed by the text classification network model; and training the text classification network model based on the consistency loss to obtain a target classification model and abnormality information of the unmarked log data.
In a second aspect, an embodiment of this application provides a data processing apparatus based on a classification model, including: an acquisition module for acquiring log data, the log data including marked log data and unmarked log data, the marked log data carrying marking information; a data enhancement module for performing data enhancement processing on the unmarked log data to obtain enhanced unmarked log data; a prediction module for performing, based on a text classification network model and according to the marked log data, prediction processing on the enhanced unmarked log data to obtain the consistency loss of the enhanced unmarked log data, the consistency loss representing the distance between the outputs that the unmarked log data and the enhanced unmarked log data respectively produce in the text classification network model processing; and a training module for training the text classification network model based on the consistency loss to obtain the target classification model and the abnormality information of the unmarked log data.
In a third aspect, an embodiment of this application further provides an electronic device, including a processor, an input device, an output device, and a memory that are connected to one another, where the memory is used to store a computer program including program instructions, and the processor is configured to invoke the program instructions to execute the method of the first aspect and any one of its possible implementations, where the data processing method based on the classification model includes: acquiring log data, the log data including marked log data and unmarked log data, the marked log data carrying marking information; performing data enhancement processing on the unmarked log data to obtain enhanced unmarked log data; based on the text classification network model, performing prediction processing on the enhanced unmarked log data according to the marked log data to obtain the consistency loss of the enhanced unmarked log data, the consistency loss representing the distance between the outputs that the unmarked log data and the enhanced unmarked log data respectively produce in the text classification network model processing; and training the text classification network model based on the consistency loss to obtain the target classification model and the abnormality information of the unmarked log data.
In a fourth aspect, an embodiment of this application provides a computer storage medium storing a computer program including program instructions that, when executed by a processor, cause the processor to execute the method of the first aspect and any one of its possible implementations, where the data processing method based on the classification model includes: acquiring log data, the log data including marked log data and unmarked log data, the marked log data carrying marking information; performing data enhancement processing on the unmarked log data to obtain enhanced unmarked log data; based on the text classification network model, performing prediction processing on the enhanced unmarked log data according to the marked log data to obtain the consistency loss of the enhanced unmarked log data, the consistency loss representing the distance between the outputs that the unmarked log data and the enhanced unmarked log data respectively produce in the text classification network model processing; and training the text classification network model based on the consistency loss to obtain the target classification model and the abnormality information of the unmarked log data.
Beneficial Effects
In the embodiments of this application, log data is acquired, the log data including marked log data carrying marking information and unmarked log data; data enhancement processing is performed on the unmarked log data to obtain enhanced unmarked log data; based on a text classification network model, prediction processing is performed on the enhanced unmarked log data according to the marked log data to obtain the consistency loss of the enhanced unmarked log data, the consistency loss representing the distance between the outputs that the unmarked log data and the enhanced unmarked log data respectively produce in the text classification network model processing; and the text classification network model is then trained based on the consistency loss to obtain the target classification model and the abnormality information of the unmarked log data. When data samples are massively imbalanced, data enhancement of unlabeled log data can expand the number of abnormal log samples in training, replacing traditional noise-injection methods and thereby improving the model's recognition of abnormal points. AI operations staff need not perform a large amount of log labeling work: little labeled data is required and accuracy is high. Moreover, as training proceeds, the abnormality information of unlabeled log data is obtained, i.e., unlabeled log data is gradually labeled; compared with traditional unsupervised learning models, training is faster, the memory footprint is small, and the computational burden on hardware is greatly reduced, making the approach suitable for large-scale deployment.
Description of the Drawings
FIG. 1 is a schematic flowchart of a data processing method based on a classification model provided by an embodiment of this application;
FIG. 2 is a schematic architecture diagram of a method for enhancing unmarked abnormal log data provided by an embodiment of this application;
FIG. 3 is a schematic flowchart of another data processing method based on a classification model provided by an embodiment of this application;
FIG. 4 is a schematic diagram of a method for constructing word vectors provided by an embodiment of this application;
FIG. 5 is a schematic structural diagram of a data processing apparatus based on a classification model provided by an embodiment of this application;
FIG. 6 is a schematic structural diagram of an electronic device provided by an embodiment of this application.
Best Mode for Carrying Out the Invention
To solve the above problems, this application provides a data processing method based on a classification model, relating to the technical field of neural networks in artificial intelligence. Referring to FIG. 1, a schematic flowchart of a data processing method based on a classification model provided by an embodiment of this application; as shown in FIG. 1, the method may include:
101. Acquire log data, where the log data includes marked log data and unmarked log data, and the marked log data carries marking information.
The execution subject in the embodiment of this application may be a data processing apparatus based on a classification model, and specifically may be the above-mentioned electronic device.
Logs are text messages generated by large-scale systems to record system state and runtime status; each log includes a timestamp and a text message indicating what happened.
The marked log data refers to log data with marking information, i.e., an explicit description of normal and abnormal conditions (for example, an abnormality level such as severe, normal, or minor). However, marked log data is very rare in massive logs, and labeling unmarked log data is very labor- and time-consuming given the massive log information of modern systems.
This application can rely on only a small amount of marked log data to make correct predictions on unmarked log data, which greatly expands the number of abnormal logs available to the model and facilitates subsequent analysis and management based on abnormal logs. After the marked log data and the unmarked log data are acquired as sample data, step 102 may be performed.
102. Perform data enhancement processing on the unmarked log data to obtain enhanced unmarked log data.
The embodiment of this application may use a text classification network model (Text-CNN). Text-CNN is an algorithm that classifies text using a convolutional neural network: it extracts N-gram features of the text by convolution, applies max pooling, and classifies through a fully connected layer. The model consists of four parts: an input layer, a convolutional layer, a pooling layer, and a fully connected layer.
For the marked log data, a supervised learning method can be used to calculate the cross-entropy loss function. Specifically, for unmarked data, the embodiment of this application can apply a consistency training mode, i.e., abnormal log data and its data-enhanced counterpart should produce the same output under the same model; according to this principle, it is predicted whether the marking information (label) of the unmarked log data is similar to the prediction for the corresponding enhanced unmarked log data.
In an optional implementation, the above step 102 includes:
performing back translation processing on the unmarked log data, determining keywords in the unmarked log data, and performing synonym replacement according to the keywords, to obtain the enhanced unmarked log data.
Specifically, in the consistency training mode, the specific method of expanding abnormal logs may be back translation. Back translation means translating B, a translation of text in language A, back into language A. Back translation can be divided into two types: terminology-regression back translation and translation-accuracy-test back translation. Back translation can generate different expressions while keeping the semantics of the log text unchanged, enhancing the diversity of the text.
Optionally, TF-IDF word replacement can also be used. TF-IDF is a weighting technique commonly used in information retrieval and data mining, where TF is term frequency and IDF is inverse document frequency; it is used to evaluate the importance of a word to one document in a document set or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to the frequency of its appearance in the corpus.
Using TF-IDF refines the random word-processing strategy of EDA (Easy Data Augmentation, which performs word-level operations on the input text such as replacement, deletion, insertion, and swapping): keywords can be determined from DBPedia prior knowledge and the word frequencies of the actual corpus, and synonyms are then substituted according to the determined keywords, avoiding the generation of useless or erroneous data. DBPedia is a knowledge graph or concept library that extracts various concepts from Wikipedia or web articles. In this way, the expanded log text is guaranteed to contain the necessary keywords while being expanded. Back translation performs data enhancement on the entire document, whereas TF-IDF operates at the word level.
103. Based on the text classification network model, perform prediction processing on the enhanced unmarked log data according to the marked log data to obtain the consistency loss of the enhanced unmarked log data, where the consistency loss represents the distance between the outputs that the unmarked log data and the enhanced unmarked log data respectively produce in the text classification network model processing.
Specifically, refer to the schematic architecture diagram of the method for enhancing unmarked abnormal log data shown in FIG. 2. As shown in FIG. 2, a consistency training mode is applied to predict the unmarked log data. In FIG. 2, x represents log data, y represents the label of the log data, and x̂ is the augmented log-data input. M is a model that predicts y from x, where p_θ(y|x) is the probability of predicting y from x and the corresponding M is the model that predicts y from x; p_θ(y|x̂) is the probability of predicting y from x̂ and the corresponding M is the model that predicts y from x̂; θ represents the parameters of the model. In the lower half of FIG. 2, x represents unmarked log data and x̂ represents the unmarked log data enhanced by the back translation processing and/or TF-IDF word replacement described above; the Text-CNN model is applied simultaneously to the unmarked log data and the corresponding enhanced unmarked log data, the distance between the two resulting model outputs, i.e., the consistency loss, is calculated, and the final loss of the network is then computed.
The training method shown in FIG. 2 is described in more detail below and will not be repeated here.
104. Train the text classification network model based on the consistency loss to obtain the target classification model and the abnormality information of the unmarked log data.
Reducing the consistency loss to a minimum (for example, below a preset loss threshold) gradually propagates the marking information from the marked log data to the unmarked log data; that is, predicted marking information is obtained for the unmarked log data, from which the abnormal log data can be determined. The scope of application of the model in the embodiments of this application is thus greatly broadened: only a small amount of marked abnormal logs is needed, and consistency prediction on unmarked logs based on the label information of the marked abnormal logs can greatly expand the number of abnormal log inputs to the model, improving the model's recognition of abnormal points, with accuracy comparable to, or even surpassing, supervised models that use large amounts of labeled data. Processing log data with this model can also reduce the cost of anomaly detection.
The abnormality information is the marking information predicted by the network model; it can be understood as determining, through prediction, the abnormality level or abnormality classification of the unmarked log data.
In one implementation, the above method further includes:
analyzing system log data according to the target classification model to obtain an analysis result, the analysis result including the probability that the system log data belongs to each abnormality level, and so on.
From the Text-CNN model's analysis results on system logs, AI operations staff can learn the operating status of the system reflected in the logs and formulate specific operation and maintenance strategies, for example:
managing the abnormal-log system by priority, focusing on operating conditions prone to major abnormalities;
for high-priority abnormal logs, taking emergency measures as soon as a major abnormality occurs, responding quickly, locating the specific cause of the failure, and eliminating it.
The training method and the application method for analyzing log data in the embodiments of this application may be executed in different apparatuses, respectively.
In the embodiments of this application, log data is acquired, the log data including marked log data carrying marking information and unmarked log data; data enhancement processing is performed on the unmarked log data to obtain enhanced unmarked log data; based on the text classification network model, prediction processing is performed on the enhanced unmarked log data according to the marked log data to obtain the consistency loss of the enhanced unmarked log data, the consistency loss representing the distance between the outputs that the unmarked log data and the enhanced unmarked log data respectively produce in the text classification network model processing; the text classification network model is then trained based on the consistency loss to obtain the target classification model and the abnormality information of the unmarked log data. When data samples are massively imbalanced, data enhancement of unlabeled log data expands the number of abnormal log samples in training, replacing traditional noise-injection methods and improving the model's recognition of abnormal points; AI operations staff need not perform a large amount of log labeling, little labeled data is required, and accuracy is high, which suits the new intelligent operation and maintenance digital service engine (AIOps). As training proceeds, the abnormality information of unlabeled log data is obtained, i.e., unlabeled log data is gradually labeled; compared with traditional unsupervised learning models, training is faster, the memory footprint is small, and the computational burden on hardware is greatly reduced, making the approach suitable for large-scale deployment.
Referring to FIG. 3, a schematic flowchart of another data processing method based on a classification model provided by an embodiment of this application; the embodiment shown in FIG. 3 may be obtained on the basis of the embodiment shown in FIG. 1. As shown in FIG. 3, the method may include:
301. Acquire log data, where the log data includes marked log data and unmarked log data, and the marked log data carries marking information.
302. Perform data enhancement processing on the unmarked log data to obtain enhanced unmarked log data.
The execution subject in the embodiment of this application may be a data processing apparatus based on a classification model, and specifically may be the above-mentioned electronic device.
For steps 301 and 302, reference may be made to the specific descriptions of steps 101 and 102 in the embodiment shown in FIG. 1, which will not be repeated here.
303. Input the marked log data into the text classification network model for training to obtain the cross-entropy loss of the marked log data.
Specifically, for the marked log data, a supervised learning method can be used to calculate the cross-entropy loss function, as shown in the upper part of FIG. 2. Here M uses the Text-CNN model, whose specific structure can be described as follows:
1) Input layer (word embedding layer):
In an optional implementation, the input layer of the text classification network model includes a set length threshold; inputting the marked log data into the text classification network model for training includes:
inputting the sample sequences of the marked log data into the text classification network model, and at the input layer of the text classification network model:
judging whether the text length of a sample sequence is less than the length threshold;
if the text length of the sample sequence is less than the length threshold, padding the sample sequence with a custom filler until it meets the length threshold; if the text length of the sample sequence is greater than the length threshold, truncating the sample sequence to a subsequence that meets the length threshold; and constructing the word vectors of the sample sequence, the word vectors including the distributed representation corresponding to each word in the sample sequence.
Specifically, a fixed-length log text sequence must be input at the input layer of the Text-CNN model; the length L of an input sequence can be specified by analyzing the lengths of the corpus samples, i.e., the length threshold is preset. For the input log data, sample sequences shorter than L are padded and sequences longer than L are truncated.
For example, abnormal logs are as follows (abnormality marks shown in brackets at the end of marked lines):
2008-11-09 20:55:54 PacketResponder 0 for block blk_321 terminating [major abnormality]
2008-11-09 20:55:54 Received block blk_321 of size 67108864 from /10.251.195.70 [minor abnormality]
2008-11-09 20:55:54 PacketResponder 2 for block blk_321 terminating
2008-11-09 20:55:54 Received block blk_321 of size 67108864 from /10.251.126.5
2008-11-09 21:56:50 10.251.126.5:50010: Got exception while serving blk_321 to /10.251.127.243
2008-11-10 03:58:04 Vertification succeeded for blk_321 [normal]
2008-11-10 10:36:37 Deleting block blk_321 file /mnt/hadoop/dfs/data/current/subdir1/blk_321
2008-11-10 10:36:50 Deleting block blk_321 file /mnt/hadoop/dfs/data/current/subdir1/blk_321
Refer to the schematic diagram of constructing word vectors shown in FIG. 4, which corresponds to the log line mentioned above:
"2008-11-09 20:55:54 PacketResponder 0 for block blk_321 terminating [major abnormality]";
this log message contains six words in total, each taken as a vector; since the number of words is 6, the dimension of each vector can be assumed to be 1*5 so that the words can be distinguished as much as possible. What the input layer finally receives is the distributed representation corresponding to each word in the log text sequence, i.e., the word vectors.
304. Based on the text classification network model, perform prediction processing on the enhanced unmarked log data according to the marked log data to obtain the consistency loss of the enhanced unmarked log data, where the consistency loss represents the distance between the outputs that the unmarked log data and the enhanced unmarked log data respectively produce in the text classification network model processing.
For step 304, reference may be made to the specific description of step 103 in the embodiment shown in FIG. 1, which will not be repeated here.
305. Input the marked log data into the text classification network model for training to obtain the cross-entropy loss of the marked log data.
For marked log data, a supervised learning method is used to calculate the cross-entropy loss function, as shown in the upper half of FIG. 2. Cross entropy is an important concept in information theory, mainly used to measure the difference between two probability distributions: it measures the degree of difference between two probability distributions over the same random variable, and in machine learning it expresses the difference between the true probability distribution and the predicted probability distribution. The smaller the value of the cross entropy, the better the model's prediction.
The embodiment of this application selects the above Text-CNN model, whose input layer is as described in step 303. Further, the model also includes:
2) Convolutional layer:
In the field of Natural Language Processing (NLP), the convolution kernel generally slides in only one dimension; that is, the width of the convolution kernel equals the dimension of the word vectors, and the kernel slides only along the sequence. The Text-CNN model in the embodiments of this application generally uses multiple convolution kernels of different sizes. The height of a convolution kernel, i.e., the window value, can be understood as the N in an N-gram model, i.e., the length of the local word order used: the content of the text is processed with a sliding window of size N, forming a sequence of fragments of length N. The window value is a hyperparameter that needs to be determined experimentally for the task; optionally, it can be an integer between 2 and 8.
3) Pooling layer:
Max pooling (max-pool) is used in the pooling layer of the Text-CNN model, which reduces the number of model parameters while ensuring that a fixed-length input to the fully connected layer is obtained from the output of the variable-length convolutional layer.
The core role of the convolutional and pooling layers in the classification model is feature extraction: from the input fixed-length text sequence, local word-order information is used to extract primary features, which are combined into higher-level features; the convolution and pooling operations eliminate the feature-engineering step of traditional machine learning.
4) Fully connected layer:
The fully connected layer acts as the classifier. The original Text-CNN model uses a fully connected network with only one hidden layer, which is equivalent to feeding the abnormal-log features extracted by the convolution and pooling layers into a Softmax function for classification, outputting the probability of the log data belonging to each category. The output rule set in the embodiments of this application can be the abnormality level, including major abnormality, common abnormality, minor abnormality, and normal; the model then outputs the probability that each log belongs to each abnormality level, realizing classification of log abnormality levels.
306. Calculate the target loss according to the cross-entropy loss of the marked log data and the consistency loss of the unmarked log data.
Specifically, the unmarked abnormal-data enhancement technique in the embodiment of this application computes the final loss, i.e., the above target loss, by combining the cross-entropy loss of the marked log data with the unsupervised consistency loss of the unmarked log data; the formula may be as follows:
J(θ) = J_CE(θ) + λ·J_KL(θ)
where J(θ) is the target loss function, J_CE(θ) is the cross-entropy loss function on the labeled data, and J_KL(θ) is the relative-entropy (KL-divergence) loss function on the unlabeled data; λ is set to balance the supervised loss and the unsupervised loss, and θ represents the parameters of the model, which can include the weights of the neural network, the number of convolution kernels, the size of the sliding window, and so on.
307. Train the text classification network model based on the target loss to obtain the target classification model.
Specifically, according to the description in step 306, the text classification network model (Text-CNN model) can be trained with the above target loss function as its loss function, obtaining a target classification model for log analysis and anomaly detection.
In an optional implementation, during the training of the text classification network model, the marked log data in training may be gradually deleted according to the increase of marked log data.
Because the marked log data in the first term of the target loss function is scarce while the unmarked log data in the second term is plentiful, overfitting is bound to occur at the beginning of model training as the number of training iterations increases. To prevent such overfitting, the embodiment of this application proposes a training signal annealing method that applies only to the marked log data; specifically, overfitting can be prevented by dynamically changing a threshold. The basic principle is as follows: during training, as the unmarked log data increases, the marked log data in training is gradually deleted, so that the model does not overfit the marked log data.
In one implementation, the gradually deleting the marked log data in training according to the increase of marked log data includes:
when the number of training steps reaches a preset step threshold, deleting target marked log data from the loss function when the probability of a correct prediction obtained for that target marked log data is greater than a probability threshold;
where the probability of a correct prediction is the probability that the predicted category result of the target marked log data is the same as the marking information of the target marked log data;
and the probability threshold is updated according to the number of training steps and the total number of training steps.
Specifically, a correspondence between preset step thresholds and probability thresholds can be set in advance, represented by the probability threshold η_t; that is, different probability thresholds can be used at different training steps t. At training step t, when the probability of a correct prediction p(y*|x) calculated for a piece of marked data is greater than the probability threshold η_t, that marked log data is removed from the loss function.
With K set as the number of categories, the value of η_t can be gradually increased over an interval (e.g., from 1/K up to 1) to prevent overfitting to the labeled data. In one implementation, the probability threshold η_t may be updated as:
η_t = α_t·(1 - 1/K) + 1/K
where α_t can be set as required; for example, α_t can take logarithmic, linear, and exponential forms, such as:
logarithmic: α_t = 1 - exp(-5t/T); linear: α_t = t/T; exponential: α_t = exp(5(t/T - 1));
where T represents the total number of training steps and t is the current number of training steps.
The schedule α_t in the embodiment of this application can be set to the above logarithmic, linear, or exponential form according to the data volume of the marked log data, corresponding specifically to the following three different applicable conditions:
(1) When the problem is relatively easy, the amount of labeled data is small, and the model overfits easily, the model can make high-probability predictions from the data in a short time; in this case the exp exponential function can be used to make the threshold grow more slowly, so as to delete more samples that are easy to train.
(2) When the amount of data is large and the model is unlikely to overfit, the model takes a long time to make high-probability predictions, fewer high-probability prediction samples are output by the model in the same time, and fewer samples need to be deleted; in this case the log logarithmic function can be used to make the threshold grow faster, so that fewer samples are deleted.
(3) For general samples, a linear function can be used to adjust the threshold.
The target classification model obtained by training can be used for log data analysis. From the Text-CNN model's analysis results on system logs, the system operating status reflected in the logs can be learned, and specific operation and maintenance strategies can be formulated:
managing the abnormal-log system by priority, focusing on operating conditions prone to major abnormalities;
for high-priority abnormal logs, taking emergency measures as soon as a major abnormality occurs, responding quickly, locating the specific cause of the failure, and eliminating it.
The data processing method based on a classification model of the embodiments of this application requires little labeled data for training the text classification network model and achieves high accuracy; it does not require a large amount of manual log labeling and saves the considerable time and energy of manually labeling data, thereby greatly reducing the cost of anomaly detection. At the same time, the scope of application of the model is greatly broadened: only a small amount of marked log data (including a small number of marked abnormal logs) is needed, and consistency prediction on unmarked logs based on the label information of the marked abnormal logs can greatly expand the number of abnormal log inputs to the model, improving the model's recognition of abnormal points, with accuracy comparable to, or even surpassing, supervised models that use large amounts of labeled data.
In addition, since the amount of marked log data required is small and the unmarked log data is gradually labeled over time, training is faster than that of traditional unsupervised learning models, the memory footprint is small, and the computational burden on the hardware is greatly reduced, which is suitable for large-scale deployment.
Referring to FIG. 5, FIG. 5 is a schematic structural diagram of a data processing apparatus based on a classification model provided by an embodiment of this application. The data processing apparatus 500 based on a classification model includes:
an acquisition module 510, configured to acquire log data, the log data including marked log data and unmarked log data, the marked log data carrying marking information;
a data enhancement module 520, configured to perform data enhancement processing on the unmarked log data to obtain enhanced unmarked log data;
a prediction module 530, configured to perform, based on a text classification network model and according to the marked log data, prediction processing on the enhanced unmarked log data to obtain the consistency loss of the enhanced unmarked log data, the consistency loss representing the distance between the outputs that the unmarked log data and the enhanced unmarked log data respectively produce in the text classification network model processing;
a training module 540, configured to train the text classification network model based on the consistency loss to obtain the target classification model and the abnormality information of the unmarked log data.
Optionally, the training module 540 is further configured to, before the prediction module 530 performs prediction processing on the enhanced unmarked log data based on the text classification network model according to the marked log data:
input the marked log data into the text classification network model for training to obtain the cross-entropy loss of the marked log data;
calculate the target loss according to the cross-entropy loss of the marked log data and the consistency loss of the unmarked log data;
train the text classification network model based on the target loss to obtain the target classification model.
Optionally, the input layer of the text classification network model includes a set length threshold, and the training module 540 is specifically configured to:
input the sample sequences of the marked log data into the text classification network model, and at the input layer of the text classification network model:
judge whether the text length of a sample sequence is less than the length threshold;
if the text length of the sample sequence is less than the length threshold, pad the sample sequence with a custom filler until it meets the length threshold; if the text length of the sample sequence is greater than the length threshold, truncate the sample sequence to a subsequence that meets the length threshold; and construct the word vectors of the sample sequence, the word vectors including the distributed representation corresponding to each word in the sample sequence.
Optionally, the training module 540 is further configured to, during the training of the text classification network model, gradually delete the marked log data in training according to the increase of marked log data.
Further optionally, the training module 540 is specifically configured to:
when the number of training steps reaches a preset step threshold, delete target marked log data from the loss function when the probability of a correct prediction obtained for that target marked log data is greater than the probability threshold;
where the probability of a correct prediction is the probability that the predicted category result of the target marked log data is the same as the marking information of the target marked log data;
and the probability threshold is updated according to the number of training steps and the total number of training steps.
Optionally, the data processing apparatus 500 based on a classification model further includes an analysis module 550, configured to analyze system log data according to the target classification model to obtain an analysis result, the analysis result including the probability that the system log data belongs to each abnormality level.
According to specific implementations of the embodiments of this application, the steps involved in the data processing methods based on a classification model shown in FIG. 1 and FIG. 3 may be executed by the respective modules in the data processing apparatus 500 based on a classification model shown in FIG. 5, and will not be repeated here.
Through the data processing apparatus 500 based on a classification model of the embodiments of this application, the apparatus 500 can acquire log data including marked log data carrying marking information and unmarked log data, perform data enhancement processing on the unmarked log data to obtain enhanced unmarked log data, perform prediction processing on the enhanced unmarked log data based on the text classification network model and according to the marked log data to obtain the consistency loss of the enhanced unmarked log data (the distance between the outputs respectively produced for the unmarked and enhanced unmarked log data in the text classification network model processing), and then train the text classification network model based on the consistency loss to obtain the target classification model and the abnormality information of the unmarked log data. When data samples are massively imbalanced, data enhancement of unlabeled log data expands the number of abnormal log samples in training, replacing traditional noise-injection methods and improving the model's recognition of abnormal points; AI operations staff need not perform a large amount of log labeling, little labeled data is required, and accuracy is high. As training proceeds, the abnormality information of unlabeled log data is obtained, i.e., unlabeled log data is gradually labeled; compared with traditional unsupervised learning models, training is faster, the memory footprint is small, and the computational burden on hardware is greatly reduced, which is suitable for large-scale deployment.
请参阅图6,图6是本申请实施例公开的一种电子设备的结构示意图。如图6所示,该电子设备600包括处理器601和存储器602,其中,电子设备600还可以包括总线603,处理器601和存储器602可以通过总线603相互连接,总线603可以是外设部件互连标准(Peripheral Component Interconnect,PCI)总线或扩展工业标准结构(Extended Industry Standard Architecture,EISA)总线等。总线603可以分为地址总线、数据总线、控制总线等。为便于表示,图6中仅用一条粗线表示,但并不表示仅有一根总线或一种类型的总线。其中,电子设备600还可以包括输入输出设备604,输入输出设备604可以包括显示屏,例如液晶显示屏。存储器602用于存储包含指令的一个或多个程序;处理器601用于调用存储在存储器602中的指令执行上述图1和图3实施例中提到的一种基 于分类模型的数据处理方法的部分或全部方法步骤,其中,所述方法包括:获取日志数据,所述日志数据包括标记日志数据和无标记日志数据,所述标记日志数据携带标记信息;对所述无标记日志数据进行数据增强处理,获得增强的无标记日志数据;基于文本分类网络模型,根据所述标记日志数据对所述增强的无标记日志数据进行预测处理,获得所述增强的无标记日志数据的一致性损失,所述一致性损失表示:所述无标记日志数据和所述增强的无标记日志数据在所述文本分类网络模型处理中,分别对应的输出之间的距离;基于所述一致性损失训练所述文本分类网络模型,获得目标分类模型,以及所述无标记日志数据的异常信息。在此不再赘述。
It should be understood that in the embodiments of this application, the processor 601 can be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor can be a microprocessor, or any conventional processor.
The input-output device 604 can include, on the input side, a touchpad, a fingerprint collection sensor (for collecting a user's fingerprint information and fingerprint orientation information), a microphone, and the like, and, on the output side, a display (such as an LCD), a speaker, and the like.
The memory 602 can include a read-only memory and a random access memory, and provides instructions and data to the processor 601. A portion of the memory 602 can also include a non-volatile random access memory. For example, the memory 602 can also store information about the device type.
With the electronic device 600 of the embodiments of this application, the electronic device 600 can acquire log data, where the log data includes labeled log data and unlabeled log data and the labeled log data carries label information; perform data augmentation on the unlabeled log data to obtain augmented unlabeled log data; perform, based on a text classification network model, prediction processing on the augmented unlabeled log data according to the labeled log data to obtain a consistency loss of the augmented unlabeled log data, where the consistency loss represents the distance between the respective outputs of the unlabeled log data and the augmented unlabeled log data when processed by the text classification network model; and then train the text classification network model based on the consistency loss to obtain a target classification model and anomaly information of the unlabeled log data. When the data samples are severely imbalanced, augmenting the unlabeled log data expands the amount of anomalous log data in the training samples and replaces traditional noise injection, improving the model's recognition of anomalies. No extensive log annotation by AI operations staff is required, little labeled data is needed, and accuracy is high. Moreover, the anomaly information of the unlabeled log data is obtained as training progresses, i.e., the unlabeled log data is gradually labeled, so training is faster than with traditional unsupervised learning models, the memory footprint is small, the computational burden on the hardware is greatly reduced, and the method is suitable for large-scale deployment.
An embodiment of this application further provides a computer storage medium, where the storage medium is a volatile or non-volatile storage medium storing a computer program for electronic data exchange. The computer program causes a computer to perform some or all of the steps of any classification-model-based data processing method described in the foregoing method embodiments, where the method includes: acquiring log data, the log data including labeled log data and unlabeled log data, the labeled log data carrying label information; performing data augmentation on the unlabeled log data to obtain augmented unlabeled log data; performing, based on a text classification network model, prediction processing on the augmented unlabeled log data according to the labeled log data to obtain a consistency loss of the augmented unlabeled log data, the consistency loss representing the distance between the respective outputs of the unlabeled log data and the augmented unlabeled log data when processed by the text classification network model; and training the text classification network model based on the consistency loss to obtain a target classification model and anomaly information of the unlabeled log data.
In the foregoing embodiments, the description of each embodiment has its own emphasis; for parts not detailed in one embodiment, refer to the related descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division into modules is merely a division by logical function, and other divisions are possible in actual implementation: multiple units or components can be combined or integrated into another system, or some features can be omitted or not executed. Furthermore, the mutual couplings, direct couplings, or communication connections shown or discussed can be indirect couplings or communication connections through some interfaces, apparatuses, or modules, and can be electrical or take other forms.
The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical modules; they can be located in one place or distributed across multiple network modules. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solutions of the embodiments.
If the integrated modules are implemented in the form of software functional modules and sold or used as independent products, they can be stored in a computer-readable memory. Based on this understanding, the technical solution of this application, in essence, or the part contributing to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a memory and includes several instructions for causing a computer device (which can be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of this application. The aforementioned memory includes various media that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.

Claims (20)

  1. A classification-model-based data processing method, wherein the method comprises:
    acquiring log data, wherein the log data comprises labeled log data and unlabeled log data, and the labeled log data carries label information;
    performing data augmentation processing on the unlabeled log data to obtain augmented unlabeled log data;
    performing, based on a text classification network model, prediction processing on the augmented unlabeled log data according to the labeled log data to obtain a consistency loss of the augmented unlabeled log data, wherein the consistency loss represents the distance between the respective outputs of the unlabeled log data and the augmented unlabeled log data when processed by the text classification network model; and
    training the text classification network model based on the consistency loss to obtain a target classification model and anomaly information of the unlabeled log data.
  2. The method according to claim 1, wherein before the performing, based on a text classification network model, prediction processing on the augmented unlabeled log data according to the labeled log data, the method further comprises:
    inputting the labeled log data into the text classification network model for training, to obtain a cross-entropy loss of the labeled log data;
    and the training the text classification network model based on the consistency loss to obtain a target classification model comprises:
    computing a target loss according to the cross-entropy loss of the labeled log data and the consistency loss of the unlabeled log data; and
    training the text classification network model based on the target loss to obtain the target classification model.
  3. The method according to claim 2, wherein the input layer of the text classification network model comprises a preset length threshold, and the inputting the labeled log data into the text classification network model for training comprises:
    inputting a sample sequence of the labeled log data into the text classification network model, and at the input layer of the text classification network model:
    determining whether the text length of the sample sequence is less than the length threshold; and
    if the text length of the sample sequence is less than the length threshold, padding the sample sequence with a custom padding token until it meets the length threshold, or, if the text length of the sample sequence is greater than the length threshold, truncating the sample sequence into a subsequence that meets the length threshold; and constructing a word vector of the sample sequence, wherein the word vector of the sample sequence comprises the distributed representation corresponding to each word in the sample sequence.
  4. The method according to any one of claims 1 to 3, wherein the method further comprises:
    during the training of the text classification network model, gradually removing labeled log data from training according to the increase in labeled log data.
  5. The method according to claim 4, wherein the gradually removing labeled log data from training according to the increase in labeled log data comprises:
    when the number of training steps reaches a preset step-count threshold, and the probability of a correct prediction obtained for target labeled log data among the labeled log data is greater than a probability threshold, removing the target labeled log data from the loss function;
    wherein the probability of a correct prediction is the probability that the predicted class result of the target labeled log data is identical to the label information of the target labeled log data; and
    the probability threshold is updated according to the number of training steps and the total number of training steps.
  6. The method according to any one of claims 1 to 3, wherein the method further comprises:
    analyzing system log data according to the target classification model to obtain an analysis result, wherein the analysis result comprises the probability that the system log data belongs to each anomaly level.
  7. The method according to any one of claims 1 to 3, wherein the performing data augmentation processing on the unlabeled log data to obtain augmented unlabeled log data comprises:
    performing back-translation processing on the unlabeled log data, determining keywords in the unlabeled log data, and performing synonym replacement according to the keywords, to obtain the augmented unlabeled log data.
  8. A classification-model-based data processing apparatus, wherein the apparatus comprises:
    an acquisition module, configured to acquire log data, wherein the log data comprises labeled log data and unlabeled log data, and the labeled log data carries label information;
    a data augmentation module, configured to perform data augmentation processing on the unlabeled log data to obtain augmented unlabeled log data;
    a prediction module, configured to perform, based on a text classification network model, prediction processing on the augmented unlabeled log data according to the labeled log data to obtain a consistency loss of the augmented unlabeled log data, wherein the consistency loss represents the distance between the respective outputs of the unlabeled log data and the augmented unlabeled log data when processed by the text classification network model; and
    a training module, configured to train the text classification network model based on the consistency loss to obtain a target classification model and anomaly information of the unlabeled log data.
  9. An electronic device, wherein the electronic device comprises a processor, an input device, an output device, and a memory that are connected to one another, the memory is configured to store a computer program, the computer program comprises program instructions, and the processor is configured to invoke the program instructions to perform a classification-model-based data processing method;
    wherein the classification-model-based data processing method comprises:
    acquiring log data, wherein the log data comprises labeled log data and unlabeled log data, and the labeled log data carries label information;
    performing data augmentation processing on the unlabeled log data to obtain augmented unlabeled log data;
    performing, based on a text classification network model, prediction processing on the augmented unlabeled log data according to the labeled log data to obtain a consistency loss of the augmented unlabeled log data, wherein the consistency loss represents the distance between the respective outputs of the unlabeled log data and the augmented unlabeled log data when processed by the text classification network model; and
    training the text classification network model based on the consistency loss to obtain a target classification model and anomaly information of the unlabeled log data.
  10. The electronic device according to claim 9, wherein before the performing, based on a text classification network model, prediction processing on the augmented unlabeled log data according to the labeled log data, the method further comprises:
    inputting the labeled log data into the text classification network model for training, to obtain a cross-entropy loss of the labeled log data;
    and the training the text classification network model based on the consistency loss to obtain a target classification model comprises:
    computing a target loss according to the cross-entropy loss of the labeled log data and the consistency loss of the unlabeled log data; and
    training the text classification network model based on the target loss to obtain the target classification model.
  11. The electronic device according to claim 10, wherein the input layer of the text classification network model comprises a preset length threshold, and the inputting the labeled log data into the text classification network model for training comprises:
    inputting a sample sequence of the labeled log data into the text classification network model, and at the input layer of the text classification network model:
    determining whether the text length of the sample sequence is less than the length threshold; and
    if the text length of the sample sequence is less than the length threshold, padding the sample sequence with a custom padding token until it meets the length threshold, or, if the text length of the sample sequence is greater than the length threshold, truncating the sample sequence into a subsequence that meets the length threshold; and constructing a word vector of the sample sequence, wherein the word vector of the sample sequence comprises the distributed representation corresponding to each word in the sample sequence.
  12. The electronic device according to any one of claims 9 to 11, wherein the method further comprises:
    during the training of the text classification network model, gradually removing labeled log data from training according to the increase in labeled log data.
  13. The electronic device according to claim 12, wherein the gradually removing labeled log data from training according to the increase in labeled log data comprises:
    when the number of training steps reaches a preset step-count threshold, and the probability of a correct prediction obtained for target labeled log data among the labeled log data is greater than a probability threshold, removing the target labeled log data from the loss function;
    wherein the probability of a correct prediction is the probability that the predicted class result of the target labeled log data is identical to the label information of the target labeled log data; and
    the probability threshold is updated according to the number of training steps and the total number of training steps.
  14. The electronic device according to any one of claims 9 to 11, wherein the method further comprises:
    analyzing system log data according to the target classification model to obtain an analysis result, wherein the analysis result comprises the probability that the system log data belongs to each anomaly level.
  15. A computer storage medium, wherein the computer storage medium stores a computer program, the computer program comprises program instructions, and the program instructions, when executed by a processor, cause the processor to perform a classification-model-based data processing method;
    wherein the classification-model-based data processing method comprises:
    acquiring log data, wherein the log data comprises labeled log data and unlabeled log data, and the labeled log data carries label information;
    performing data augmentation processing on the unlabeled log data to obtain augmented unlabeled log data;
    performing, based on a text classification network model, prediction processing on the augmented unlabeled log data according to the labeled log data to obtain a consistency loss of the augmented unlabeled log data, wherein the consistency loss represents the distance between the respective outputs of the unlabeled log data and the augmented unlabeled log data when processed by the text classification network model; and
    training the text classification network model based on the consistency loss to obtain a target classification model and anomaly information of the unlabeled log data.
  16. The computer storage medium according to claim 15, wherein before the performing, based on a text classification network model, prediction processing on the augmented unlabeled log data according to the labeled log data, the method further comprises:
    inputting the labeled log data into the text classification network model for training, to obtain a cross-entropy loss of the labeled log data;
    and the training the text classification network model based on the consistency loss to obtain a target classification model comprises:
    computing a target loss according to the cross-entropy loss of the labeled log data and the consistency loss of the unlabeled log data; and
    training the text classification network model based on the target loss to obtain the target classification model.
  17. The computer storage medium according to claim 16, wherein the input layer of the text classification network model comprises a preset length threshold, and the inputting the labeled log data into the text classification network model for training comprises:
    inputting a sample sequence of the labeled log data into the text classification network model, and at the input layer of the text classification network model:
    determining whether the text length of the sample sequence is less than the length threshold; and
    if the text length of the sample sequence is less than the length threshold, padding the sample sequence with a custom padding token until it meets the length threshold, or, if the text length of the sample sequence is greater than the length threshold, truncating the sample sequence into a subsequence that meets the length threshold; and constructing a word vector of the sample sequence, wherein the word vector of the sample sequence comprises the distributed representation corresponding to each word in the sample sequence.
  18. The computer storage medium according to claim 15 or 16, wherein the method further comprises:
    during the training of the text classification network model, gradually removing labeled log data from training according to the increase in labeled log data.
  19. The computer storage medium according to claim 18, wherein the gradually removing labeled log data from training according to the increase in labeled log data comprises:
    when the number of training steps reaches a preset step-count threshold, and the probability of a correct prediction obtained for target labeled log data among the labeled log data is greater than a probability threshold, removing the target labeled log data from the loss function;
    wherein the probability of a correct prediction is the probability that the predicted class result of the target labeled log data is identical to the label information of the target labeled log data; and
    the probability threshold is updated according to the number of training steps and the total number of training steps.
  20. The computer storage medium according to claim 15 or 16, wherein the method further comprises:
    analyzing system log data according to the target classification model to obtain an analysis result, wherein the analysis result comprises the probability that the system log data belongs to each anomaly level.
PCT/CN2020/119368 2020-07-30 2020-09-30 Classification model-based data processing method and apparatus, electronic device, and medium WO2021139279A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010751730.0 2020-07-30
CN202010751730.0A CN111881983B (zh) 2020-07-30 2020-07-30 Classification model-based data processing method and apparatus, electronic device, and medium

Publications (1)

Publication Number Publication Date
WO2021139279A1 true WO2021139279A1 (zh) 2021-07-15

Family

ID=73204632

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/119368 WO2021139279A1 (zh) 2020-07-30 2020-09-30 Classification model-based data processing method and apparatus, electronic device, and medium

Country Status (2)

Country Link
CN (1) CN111881983B (zh)
WO (1) WO2021139279A1 (zh)


Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112926631A (zh) * 2021-02-01 2021-06-08 大箴(杭州)科技有限公司 Financial text classification method and apparatus, and computer device
CN113011531B (zh) * 2021-04-29 2024-05-07 平安科技(深圳)有限公司 Classification model training method and apparatus, terminal device, and storage medium
CN113657461A (zh) * 2021-07-28 2021-11-16 北京宝兰德软件股份有限公司 Text-classification-based log anomaly detection method, system, device, and medium
CN113962737A (zh) * 2021-10-26 2022-01-21 北京沃东天骏信息技术有限公司 Target recognition model training method and apparatus, and target recognition method and apparatus
CN114943879B (zh) * 2022-07-22 2022-10-04 中国科学院空天信息创新研究院 SAR target recognition method based on domain-adaptive semi-supervised learning


Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107153630B (zh) * 2016-03-04 2020-11-06 阿里巴巴集团控股有限公司 Training method and training system for a machine learning system
EP3591561A1 (en) * 2018-07-06 2020-01-08 Synergic Partners S.L.U. An anonymized data processing method and computer programs thereof
CN109818929A (zh) * 2018-12-26 2019-05-28 天翼电子商务有限公司 Unknown threat perception method, system, storage medium, and terminal based on active self-paced learning
CN110321371B (zh) * 2019-07-01 2024-04-26 腾讯科技(深圳)有限公司 Log data anomaly detection method and apparatus, terminal, and medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108108351A (zh) * 2017-12-05 2018-06-01 华南理工大学 Text sentiment classification method based on a deep learning ensemble model
US20190197109A1 (en) * 2017-12-26 2019-06-27 The Allen Institute For Artificial Intelligence System and methods for performing nlp related tasks using contextualized word representations
US20200019642A1 (en) * 2018-07-12 2020-01-16 International Business Machines Corporation Question Answering Using Trained Generative Adversarial Network Based Modeling of Text
CN110110080A (zh) * 2019-03-29 2019-08-09 平安科技(深圳)有限公司 Text classification model training method and apparatus, computer device, and storage medium
CN110532377A (zh) * 2019-05-13 2019-12-03 南京大学 Semi-supervised text classification method based on adversarial training and adversarial learning networks
CN111522958A (zh) * 2020-05-28 2020-08-11 泰康保险集团股份有限公司 Text classification method and apparatus

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
CHEN ZHI; GUO WU: "Text Classification Based on Depth Learning on Unbalanced Data", Journal of Chinese Computer Systems, vol. 41, no. 1, 1 January 2020, pages 1-5, XP055827798, ISSN: 1000-1220 *
LIU LIZHEN; SONG HAN-TAO; LU YUCHANG: "The Method of Web Text Classification of Using Non-labeled Training Sample", Computer Science, vol. 33, no. 3, 1 January 2006, pages 200-211, XP055827797 *
WANG KUI; LIU BAISONG: "Review of Text Classification Research", Data Communication, no. 3, 1 January 2019, pages 37-47, XP055827800 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113806536A (zh) * 2021-09-14 2021-12-17 广州华多网络科技有限公司 Text classification method and apparatus, device, medium, and product
CN113806536B (zh) 2021-09-14 2024-04-16 广州华多网络科技有限公司 Text classification method and apparatus, device, medium, and product
CN114064434A (zh) * 2021-11-17 2022-02-18 建信金融科技有限责任公司 Log anomaly early-warning method and apparatus, electronic device, and storage medium
CN114119964A (zh) * 2021-11-29 2022-03-01 上海商汤临港智能科技有限公司 Network training method and apparatus, and target detection method and apparatus
CN114785606A (zh) * 2022-04-27 2022-07-22 哈尔滨工业大学 Log anomaly detection method based on a pre-trained LogXLNet model, electronic device, and storage medium
CN114785606B (zh) 2022-04-27 2024-02-02 哈尔滨工业大学 Log anomaly detection method based on a pre-trained LogXLNet model, electronic device, and storage medium
CN117421595A (zh) * 2023-10-25 2024-01-19 广东技术师范大学 System log anomaly detection method and system based on deep learning
CN117240700A (zh) * 2023-11-10 2023-12-15 浙江九州未来信息科技有限公司 Network fault diagnosis method and apparatus based on a Bayesian classifier
CN117240700B (zh) 2023-11-10 2024-02-06 浙江九州未来信息科技有限公司 Network fault diagnosis method and apparatus based on a Bayesian classifier

Also Published As

Publication number Publication date
CN111881983B (zh) 2024-05-28
CN111881983A (zh) 2020-11-03

Similar Documents

Publication Publication Date Title
WO2021139279A1 (zh) Classification model-based data processing method and apparatus, electronic device, and medium
CN111914090B (zh) Method and apparatus for enterprise industry classification recognition and characteristic pollutant recognition
CN113312447B (zh) Semi-supervised log anomaly detection method based on probabilistic label estimation
CN108549817A (zh) Software security vulnerability prediction method based on deep text learning
CN111625516A (zh) Method and apparatus for detecting data state, computer device, and storage medium
CN111339260A (zh) Fine-grained sentiment analysis method based on BERT and QA ideas
WO2021168617A1 (zh) Business risk control processing method and apparatus, electronic device, and storage medium
CN110245232A (zh) Text classification method and apparatus, medium, and computing device
CN112561320A (zh) Training method for an institutional risk prediction model, and institutional risk prediction method and apparatus
CN113111908A (zh) BERT anomaly detection method and device based on template sequences or word sequences
CN116541838A (zh) Malware detection method based on contrastive learning
CN114816962A (zh) Network fault prediction method based on attention-LSTM
CN117521063A (zh) Malware detection method and apparatus based on residual neural networks combined with transfer learning
CN116384223A (zh) Nuclear equipment reliability assessment method and system based on intelligent identification of degradation states
CN116164822A (zh) Knowledge-graph-based flowmeter fault diagnosis method, apparatus, and medium
CN115688101A (zh) Deep-learning-based file classification method and apparatus
CN115660101A (zh) Data service provision method and apparatus based on service node information
US20210241147A1 (en) Method and device for predicting pair of similar questions and electronic equipment
CN115333973A (zh) Device anomaly detection method and apparatus, computer device, and storage medium
CN113448860A (zh) Test case analysis method and apparatus
CN113821571A (zh) Food safety relation extraction method based on BERT and improved PCNN
CN118070775B (zh) Performance evaluation method and apparatus for abstract generation models, and computer device
AU2021312671B2 (en) Value over replacement feature (VORF) based determination of feature importance in machine learning
CN110728615B (zh) Steganalysis method based on sequential hypothesis testing, terminal device, and storage medium
Marali et al. Vulnerability Classification Based on Fine-Tuned BERT and Deep Neural Network Approaches

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20911538

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20911538

Country of ref document: EP

Kind code of ref document: A1