CN112069795A - Corpus detection method, apparatus, device and medium based on mask language model - Google Patents

Corpus detection method, apparatus, device and medium based on mask language model

Info

Publication number
CN112069795A
Authority
CN
China
Prior art keywords
corpus
word
discriminator
generator
probability distribution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010888877.4A
Other languages
Chinese (zh)
Other versions
CN112069795B (en)
Inventor
邓悦
郑立颖
徐亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202010888877.4A priority Critical patent/CN112069795B/en
Priority to PCT/CN2020/117434 priority patent/WO2021151292A1/en
Publication of CN112069795A publication Critical patent/CN112069795A/en
Application granted granted Critical
Publication of CN112069795B publication Critical patent/CN112069795B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to artificial intelligence, and in particular discloses a corpus detection method, apparatus, device and medium based on a mask language model. The method comprises the following steps: inputting the corpus words to be trained into the generator for training to obtain the probability distribution corresponding to the corpus words; inputting the probability distribution into the discriminator for training to obtain a prediction result corresponding to the probability distribution, where the prediction result indicates whether the corpus word has been replaced and is stored in a blockchain node; inputting a classification label into the discriminator according to the category of the corpus word, and adjusting the prediction result through the discriminator based on the classification label and the corpus word to obtain a context vector; and detecting the state of the corpus word to be trained according to the context vector. The method effectively improves model training efficiency and can efficiently and accurately judge abnormal conditions in log files.

Description

Corpus detection method, apparatus, device and medium based on mask language model
Technical Field
The present application relates to the field of intelligent decision making technologies, and in particular, to a corpus detection method and apparatus based on a mask language model, a computer device, and a medium.
Background
In text processing, anomaly detection of log files plays an important role in the management of modern large distributed systems; logs are widely used to record information during system operation. Currently, operation and maintenance personnel typically check logs using keyword searches and rule matching. However, as workload and business demand grow, the time required for manual detection also grows, becoming ever more time-consuming and labor-intensive. To reduce manual workload and improve detection accuracy, deep-learning-based log anomaly detection methods are increasingly applied in the anomaly detection field.
The currently popular text processing models are mask-based pre-trained language models; however, because of their high demands on computing resources, modifying and training such models is constrained by training cost and running time.
Disclosure of Invention
The application provides a corpus detection method, apparatus, computer device and medium based on a mask language model for intelligent decision making, which effectively improve model training efficiency and can efficiently and accurately judge abnormal conditions in log files.
In a first aspect, the present application provides a corpus detection method based on a mask language model, where the method is applied to a mask language model, and the mask language model includes a generator and a discriminator; the method comprises the following steps:
inputting the corpus words to be trained into the generator for training to obtain the probability distribution corresponding to the corpus words;
inputting the probability distribution into the discriminator for training to obtain a prediction result corresponding to the probability distribution, wherein the prediction result comprises whether the corpus word is replaced or not;
inputting a classification label in the discriminator according to the category of the corpus word, and adjusting the prediction result through the discriminator based on the classification label and the corpus word to obtain a context vector;
detecting the state of the corpus word to be trained according to the context vector; the category of the corpus word includes a log file category.
In a second aspect, the present application further provides a corpus detecting device, the device including:
the first training module is used for inputting the corpus words to be trained into the generator for training to obtain the probability distribution corresponding to the corpus words;
the second training module is used for inputting the probability distribution to the discriminator for training to obtain a prediction result corresponding to the probability distribution, wherein the prediction result comprises whether the corpus words are replaced or not;
the adjusting module is used for inputting a classification label into the discriminator according to the category of the corpus words, and adjusting the prediction result through the discriminator based on the classification label and the corpus words to obtain a context vector;
and the detection module is used for detecting the state of the corpus word to be trained according to the context vector.
In a third aspect, the present application further provides a computer device comprising a memory and a processor; the memory is used for storing a computer program; the processor is configured to execute the computer program and implement the corpus detection method based on the mask language model when executing the computer program.
In a fourth aspect, the present application further provides a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the processor is enabled to implement the corpus detecting method based on the mask language model.
The application discloses a corpus detection method, apparatus, computer device and storage medium based on a mask language model. A brand-new mask language model comprising a generator and a discriminator is adopted. During training, the corpus word to be trained is input into the generator for training to obtain the probability distribution corresponding to the corpus word, and the probability distribution is then input into the discriminator for training to obtain the prediction result corresponding to the probability distribution, thereby determining the prediction result of the mask language model, where the prediction result indicates whether the corpus word has been replaced. When the model is used, only the discriminator's classifier is needed to input classification labels for the categories of the corpus words, which greatly improves the efficiency of testing the model and effectively reduces testing time. After the context vector is obtained, the state of the corpus word to be trained is detected according to it, for example the abnormal condition of an operation and maintenance server's log file, so that abnormal results are detected more efficiently and quickly and the speed of daily detection tasks is greatly increased.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is an architecture diagram of a mask language model provided by an embodiment of the present application;
FIG. 2 is a generator architecture diagram provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of an input word vector for a generator provided by an embodiment of the present application;
FIG. 4 is a diagram of a discriminator architecture provided by an embodiment of the present application;
fig. 5 is a schematic flowchart of a corpus detection method based on a mask language model according to an embodiment of the present application;
FIG. 6 is a diagram of a discriminator structure for verifying a log file, provided by an embodiment of the present application;
FIG. 7 is a schematic block diagram of another corpus detection apparatus provided in an embodiment of the present application;
fig. 8 is a schematic block diagram of a structure of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The flow diagrams depicted in the figures are merely illustrative and do not necessarily include all of the elements and operations/steps, nor do they necessarily have to be performed in the order depicted. For example, some operations/steps may be decomposed, combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
It is to be understood that the terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the specification of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
The embodiment of the application provides a corpus detection method and device based on a mask language model, computer equipment and a storage medium. The corpus detection method based on the mask language model effectively improves the model training efficiency and can efficiently and accurately judge the abnormal condition of the log file.
Some embodiments of the present application will be described in detail below with reference to the accompanying drawings. The embodiments described below and the features of the embodiments can be combined with each other without conflict.
Referring to fig. 1, fig. 1 is a training architecture of a mask language model according to an embodiment of the present application. The training architecture includes a generator and a discriminator: the two are trained together during training, while only the discriminator is used during detection, which effectively improves training efficiency.
Generator architecture: as shown in fig. 2, the generator includes a series of Encoders with corresponding inputs and outputs. The generator's inputs are word vectors and its outputs are probability distributions, i.e., for the word at each position, a probability over every candidate word, from which the most likely word can be selected by taking the maximum probability.
Specifically, as shown in FIG. 3, the input to the encoder of the generator is the vectors w1, w2, ..., wn corresponding to the words. Each word vector is generated as the superposition of 3 partial vectors, which may include a word dimension vector, a sentence dimension vector, and a position dimension vector.
The generator's task is inherently simple and it does not have particularly many parameters; it is structured so as to feed the more complex task to the discriminator.
Discriminator architecture: as shown in fig. 4, the architecture of the discriminator is generally similar to that of the generator and includes a series of encoders together with inputs and outputs. O1, ..., On in fig. 4 are the outputs from the generator; these inputs O1, ..., On likewise pass through the word vector layer (Embedding) of the discriminator before being fed into the discriminator's Encoder structure. Unlike in the generator, a Classifier layer is added after the discriminator's encoder at output time to judge whether each word has been replaced; the corresponding outputs are R1, ..., Rn with probabilities, a 0/1 classification of whether the word was replaced.
Based on the structure of the mask language model, a corpus detection method based on the mask language model is provided.
Referring to fig. 5, fig. 5 is a schematic flowchart of a corpus detection method based on a mask language model according to an embodiment of the present application. The corpus detection method based on the mask language model can be applied to the mask language model in the graph 1, effectively improves the model training efficiency, and can efficiently and accurately judge the abnormal condition of the log file.
As shown in fig. 5, the corpus detecting method based on the mask language model specifically includes steps S101 to S104.
S101, inputting the corpus words to be trained into the generator for training to obtain the probability distribution corresponding to the corpus words.
When a user needs to perform partial masking on a sentence, the whole sentence is input into the mask language model as the corpus word to be trained and fed into the generator of the mask language model for training. Specifically, this comprises the following steps:
s11, inputting word vectors corresponding to the corpus words to be trained into a generator, wherein the word vectors comprise word dimensions, sentence dimensions and position dimensions.
As shown in FIG. 3, the input to the encoder of the generator is the vectors w1, w2, ..., wn corresponding to the words. Each word vector is generated as the superposition of 3 partial vectors, which may include a word dimension vector, a sentence dimension vector, and a position dimension vector.
As shown in fig. 4, the corpus words to be trained are sentences. Taking "the value is high" and "it is urgent new" as examples, these sentences serve as the generator's input, which incorporates the word dimension, sentence dimension and position dimension.
S12, inputting the word vector obtained by superposing the word dimension, sentence dimension and position dimension into the encoder of the generator for encoding to obtain each dimension word vector, where the encoder comprises a plurality of encoder layers.
The generator superposes the input word vectors across the three dimensions (word, sentence and position) to obtain superposed word vectors, which it then feeds into its encoder for encoding to obtain the word vector of each dimension. The generator's encoder has multiple layers, and the word vectors of each dimension are obtained by encoding layer by layer.
The generator may be a pre-trained model or may be trained during input. For the word dimension, for example, the total length of a word vector may be 768; if the input corpus word to be trained corresponds to 6 words, the word-dimension output of the generator model is (6, 768).
In some embodiments, for the sentence dimension, when the input corpus word to be trained includes two sentences, the generator may add different word vectors (embeddings) for different sentences; correspondingly, the first sentence has a dimensionality of (1, 768) and the second sentence a dimensionality of (2, 768).
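The superposition can be illustrated with a short sketch. This is a minimal sketch assuming PyTorch; the vocabulary size and maximum length are assumptions, and the position part is shown as a learned table for brevity, while the sinusoidal form the text actually describes is sketched after the formula below.

```python
import torch
import torch.nn as nn

class GeneratorEmbedding(nn.Module):
    """Word vector as the superposition of word, sentence and position parts."""

    def __init__(self, vocab_size=30000, hidden=768, max_len=512, num_sentences=2):
        super().__init__()
        self.word = nn.Embedding(vocab_size, hidden)          # word dimension
        self.sentence = nn.Embedding(num_sentences, hidden)   # sentence dimension
        self.position = nn.Embedding(max_len, hidden)         # position dimension

    def forward(self, token_ids, sentence_ids):
        # token_ids, sentence_ids: (batch, sequence length)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        # The word vector is the sum (superposition) of the three partial vectors.
        return (self.word(token_ids)
                + self.sentence(sentence_ids)
                + self.position(positions))
```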
In some embodiments, for the position dimension, when the corpus word to be trained includes the same word at different positions, the position information of the word must also be considered. For example, if the input sentence is "I come and I watch", then for the generator the two instances of "I" are different. To encode position information, the generator adopts sinusoidal encoding, with the following formulas:
PE_{(pos, 2i)} = \sin\left( pos / 10000^{2i/d} \right)

PE_{(pos, 2i+1)} = \cos\left( pos / 10000^{2i/d} \right)
where pos is the position index, representing the position of the word in the sequence, i is the index within the vector, and d is the dimension of the generator model, which here is 768. These formulas encode position information with a sine function at the even positions of the vector and a cosine function at the odd positions, so that each dimension of the position-encoding vector is a waveform of a different frequency, with each value between -1 and 1; the position dimension is thus obtained.
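A minimal sketch of this sinusoidal encoding, assuming NumPy; the function name is illustrative.

```python
import numpy as np

def position_encoding(max_len: int, d: int = 768) -> np.ndarray:
    """Sine at the even indices of each vector, cosine at the odd indices,
    per the formulas above; every value lies between -1 and 1."""
    pos = np.arange(max_len)[:, None]             # pos: position of the word
    i = np.arange(d // 2)[None, :]                # i: index pair in the vector
    angles = pos / np.power(10000.0, 2.0 * i / d)
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angles)                  # even positions: sine
    pe[:, 1::2] = np.cos(angles)                  # odd positions: cosine
    return pe
```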
And S13, randomly replacing part of words in the dimension word vectors according to a preset replacement rule, and obtaining the probability distribution corresponding to each dimension word vector.
When the generator performs masking, as the corpus word to be trained is input, the generator replaces part of its words according to a preset rule, e.g., the token at each [mask] position is replaced. At output time, the generator predicts each masked word from the context of the unmasked words in the sentence. For example, as shown in fig. 1, when the input corpus word to be trained is "the value is high" and the [mask] positions correspond to "the" and "value", the masked words are "the" and "value", the unmasked context is "is" and "high", and "the" and "value" are predicted from "is" and "high".
In some embodiments, masking uses a preset replacement rule. For example, the generator randomly selects 20% of the input tokens for [mask] masking and, within that selected 20%, applies the following rules:
1. 10% are replaced with any word;
2. 10% of words do not change;
3. 80% of the words are replaced with [mask].
And randomly replacing part of words in the dimension word vectors according to a preset replacement rule to obtain the probability distribution corresponding to each dimension word vector.
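A minimal sketch of this preset replacement rule in Python; the tokenization, vocabulary, and function name are assumptions.

```python
import random

MASK = "[mask]"

def mask_tokens(tokens, vocab, select_rate=0.2):
    """Randomly select select_rate of the tokens, then apply the 80/10/10 rule."""
    out = list(tokens)
    for idx in range(len(out)):
        if random.random() >= select_rate:
            continue                          # token not selected for masking
        r = random.random()
        if r < 0.8:
            out[idx] = MASK                   # 80%: replaced with [mask]
        elif r < 0.9:
            out[idx] = random.choice(vocab)   # 10%: replaced with any word
        # remaining 10%: the word does not change
    return out

print(mask_tokens("the value is high".split(), vocab=["key", "new", "low"]))
```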
It can be understood that the Encoder of the generator uses an attention mechanism, whose purpose, when processing a single word, is to find the related words in the sentence where that word is located and merge them into the word being processed, thereby achieving a better encoding effect. The attention mechanism here is a multi-head attention mechanism: built on self-attention, the Encoder is set to 16 layers, so the attention mechanism is applied 16 times, and a final output is obtained through a linear mapping. The multi-head attention mechanism attends to different positions, thereby capturing information from the different dimensions of the sentence.
By using a multi-head attention mechanism and a brand-new word embedding method, encoded information from three dimensions (position, sentence and word) is introduced, so that the understanding of words has more dimensions.
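As an illustration of the 16-layer multi-head attention encoder just described, here is a minimal sketch assuming PyTorch; the number of attention heads and the feed-forward size are not stated in the text and are assumptions.

```python
import torch
import torch.nn as nn

# 16 encoder layers of hidden size 768, per the text; 12 heads is an assumption.
layer = nn.TransformerEncoderLayer(d_model=768, nhead=12,
                                   dim_feedforward=3072, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=16)

word_vectors = torch.randn(2, 6, 768)   # (batch, words, hidden), e.g. 6 words
contextual = encoder(word_vectors)      # same shape, contextually encoded
```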
In some embodiments, after inputting the corpus word to be trained into the generator for training, and obtaining the probability distribution corresponding to the corpus word, the method may further include:
a loss function of the generator is calculated and the generator is adjusted according to the loss function of the generator.
The loss function of the generator measures whether, for the [mask] words, each word is predicted correctly from context, and is given by Equation 3:
L_{MLM}(x, \theta_G) = \mathbb{E}\left[ \sum_{i \in masked} -\log p_G\left( x_i \mid x^{masked} \right) \right]    (Equation 3)

wherein L_{MLM} is the loss function of the generator, x is a sample, x^{masked} is the sample occluded by the mask in the Embedding process, \theta_G denotes the parameters of the generator, and p_G(x_i | x^{masked}) is the conditional distribution of the word x_i given the masked sample.
Through the generator, word-vector superposition and encoding are performed on the corpus words to be trained to realize the mask, and each word vector is output together with its corresponding probability distribution.
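A minimal sketch of the generator loss of Equation 3, assuming PyTorch; the tensor shapes and the function name are assumptions.

```python
import torch
import torch.nn.functional as F

def generator_loss(logits, target_ids, masked):
    """-log p_G(x_i | x_masked), averaged over the occluded positions only.
    logits: (batch, seq, vocab); target_ids: (batch, seq);
    masked: boolean (batch, seq), True where a token was occluded."""
    vocab = logits.size(-1)
    per_token = F.cross_entropy(logits.reshape(-1, vocab),
                                target_ids.reshape(-1), reduction="none")
    return per_token[masked.reshape(-1)].mean()
```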
In some embodiments, when a user performs business operations, the business system usually generates corresponding log files, and the mask language model of this scheme is applied during log detection. If the category of the corpus word to be trained is the log file category, then before the log file to be trained is input into the generator for training to obtain the probability distribution corresponding to the corpus word, the method may include:
and preprocessing the log file to be trained.
Specifically, the preprocessing may convert upper-case text in the log file to lower case, filter out fixed texts with identical structure, and replace unimportant information (addresses/times/dates).
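A minimal sketch of such preprocessing in Python; the placeholder tokens and regular expressions are illustrative assumptions, not the exact rules of the scheme.

```python
import re

def preprocess_log(line: str) -> str:
    """Lower-case the log line and replace unimportant variable fields."""
    line = line.lower()
    line = re.sub(r"\b\d{1,3}(?:\.\d{1,3}){3}(?::\d+)?\b", "<addr>", line)  # address
    line = re.sub(r"\b\d{4}-\d{2}-\d{2}\b", "<date>", line)                 # date
    line = re.sub(r"\b\d{2}:\d{2}:\d{2}\b", "<time>", line)                 # time
    return line

print(preprocess_log("2020-08-28 12:30:01 ERROR 10.0.0.1:8080 Connection refused"))
```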
After the log file is preprocessed, the processed log file is input into a generator for training.
Different preprocessing is needed for different corpus words to be trained; this case takes log files as an example, but is not limited to log files.
S102, inputting the probability distribution into the discriminator for training to obtain a prediction result corresponding to the probability distribution, wherein the prediction result comprises whether the corpus word is replaced or not.
Specifically, as shown in fig. 4, inputting the probability distribution into the discriminator for training to obtain a prediction result corresponding to the probability distribution, where the prediction result indicates whether the corpus word has been replaced, may include:
and replacing the word vectors corresponding to the probability distribution according to a preset replacement probability through the discriminator so as to predict whether the word vectors corresponding to the probability distribution are replaced or not, and obtain a prediction result.
In some embodiments, after the discriminator receives the generator's output, it replaces the input word vectors with a certain probability in order to predict whether each word output by the generator has been replaced. Specifically, O1, ..., On are the outputs from the generator; these inputs likewise pass through an Embedding encoding layer and are fed into the Encoder structure, to which a Classifier is added to judge whether each word is original or replaced; the corresponding outputs are R1, ..., Rn. The prediction result indicates whether each word vector was replaced, i.e., it covers both the replaced and the not-replaced cases.
For example, the corpus word to be trained, "the value is high", is input into the generator, undergoes the three-dimension superposition and partial masking, and "the key is high" is output; evidently "value" has been subjected to masking. The generator's output "the key is high" is then input into the discriminator for discrimination. During discrimination, substitution is performed according to the preset substitution probability: "the", "is" and "high" are all in the original state, while "key" is in a replaced state, i.e., the discriminator identifies the word that was replaced.
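A minimal sketch of this discriminator structure, assuming PyTorch; the hidden size and layer count follow the text, while the head count and class names are assumptions.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Encoder plus the per-word Classifier layer: for every position it
    outputs the probability that the word was replaced (the outputs R1..Rn)."""

    def __init__(self, hidden=768, num_layers=16):
        super().__init__()
        block = nn.TransformerEncoderLayer(d_model=hidden, nhead=12,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(block, num_layers=num_layers)
        self.classifier = nn.Linear(hidden, 1)    # the added Classifier layer

    def forward(self, embedded):                  # (batch, seq, hidden)
        h = self.encoder(embedded)
        return torch.sigmoid(self.classifier(h)).squeeze(-1)  # P(replaced)
```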
In some embodiments, after detecting the state of the corpus word to be trained according to the context vector, the method further comprises:
calculating a loss function of the discriminator, and adjusting the discriminator according to the loss function of the discriminator. The loss function of the discriminator is given by Equation 4:
L_{Disc}(x, \theta_D) = \mathbb{E}\left[ \sum_{t=1}^{n} -\mathbb{1}\left( x_t^{corrupt} = x_t \right) \log D\left( x^{corrupt}, t \right) - \mathbb{1}\left( x_t^{corrupt} \neq x_t \right) \log\left( 1 - D\left( x^{corrupt}, t \right) \right) \right]    (Equation 4)

wherein L_{Disc} is the loss function of the discriminator, \mathbb{1}(\cdot) is the indicator function, x is a sample, t is the time step (position), x^{corrupt} is the sample after replacement, \theta_D denotes the parameters of the discriminator, and D(x^{corrupt}, t) is the discriminator's output for position t.
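A minimal sketch of this loss, assuming PyTorch. Equation 4 writes D as the probability that a word is original; the sketch uses the equivalent binary cross-entropy form in which the network predicts "replaced", matching the 0/1 output described above. The names and shapes are assumptions.

```python
import torch.nn.functional as F

def discriminator_loss(replace_prob, is_replaced):
    """Binary cross-entropy over every time step.
    replace_prob: predicted probability of replacement, (batch, seq);
    is_replaced: 1.0 where the token differs from the original, else 0.0."""
    return F.binary_cross_entropy(replace_prob, is_replaced)
```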
And superposing the loss function of the generator and the loss function of the discriminator to obtain a total loss function so as to adjust the mask language model.
Specifically, the loss function of the generator is superimposed with the loss function of the discriminator to obtain the total loss function of the model, which is formula 5:
\min_{\theta_G, \theta_D} \sum_{x \in X} L_{MLM}(x, \theta_G) + \lambda L_{Disc}(x, \theta_D)    (Equation 5)

where X is the set of training samples and \lambda is a coefficient weighting the discriminator loss.
Since the generator and the discriminator have the same structure, model training can be made more efficient by sharing parameters between them. Moreover, during training the generator and the discriminator are trained together, while at usage time only the discriminator is put into use; the model therefore runs with fewer parameters and achieves better training efficiency.
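A minimal sketch of the superposition of the two losses in Equation 5; the value of the weighting coefficient is an assumption, as the text does not state it.

```python
def total_loss(l_mlm, l_disc, lam=50.0):
    """Superpose the generator loss and the (weighted) discriminator loss."""
    return l_mlm + lam * l_disc
```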
It is emphasized that the prediction result may also be stored in a blockchain node in order to further ensure its privacy and security.
S103, inputting a classification label into the discriminator according to the category of the corpus word, and adjusting the prediction result through the discriminator based on the classification label and the corpus word to obtain a context vector.
In some embodiments, when the corpus words are log files, the log files are preprocessed and the preprocessed log files are then input into the generator and discriminator for training. At detection time, the prediction result obtained after training is input into the model's discriminator, the model's input being the words corresponding to each log text.
Specifically, when the category of the corpus word is a log file category, inputting a classification tag in the discriminator according to the category of the corpus word, and adjusting the prediction result by the discriminator based on the classification tag and the corpus word to obtain a context vector, which may include:
and S31, replacing the first word corresponding to the log file with a classification label in the discriminator.
Specifically, the input length may be set to 512, and at input time the start O1 of each sentence is replaced with a [CLS] classification tag, corresponding to whether the log is abnormal or not.
S32, inputting all words corresponding to the log file into an encoder for training, inputting vectors corresponding to the classification labels into a binary neural network for training, and outputting context vectors, wherein the first position of the context vectors corresponds to the classification labels.
Specifically, considering that anomaly detection is a two-class task, after layer-by-layer training through the encoder a vector is obtained for each layer, whose length may be set to 768. The vector of the [CLS] classification tag is fed directly into a two-class neural network, corresponding to the classifier in the figure above, which judges whether there is an anomaly; the output result is 0/1, i.e., the judgment corresponding to abnormal or not abnormal.
In some embodiments, for a multi-classification task, a multi-class neural network may replace the classifier: a SoftMax logistic regression function gives the probability of each class, and the sample is assigned to the class with the maximum probability, which completes the classification and yields the classification result.
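A minimal sketch of this [CLS]-based classification head, assuming PyTorch; it covers both the two-class anomaly judgment and the SoftMax multi-class case, and everything beyond the 768-length vector and the first-position [CLS] convention is an assumption.

```python
import torch
import torch.nn as nn

class ClsHead(nn.Module):
    """Classify a log from the [CLS] vector at the first position."""

    def __init__(self, hidden=768, num_classes=2):
        super().__init__()
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, context_vectors):     # (batch, seq, hidden), seq up to 512
        cls_vec = context_vectors[:, 0]     # first position = [CLS] label
        probs = torch.softmax(self.fc(cls_vec), dim=-1)
        return probs.argmax(dim=-1), probs  # 0/1 judgment and class probabilities
```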
S104, detecting the state of the corpus word to be trained according to the context vector; the category of the corpus word includes a log file category.
In some embodiments, when the category of the corpus word is a log file category, detecting the state of the corpus word to be trained according to the context vector may include: and judging the abnormal condition of the log file according to the first position of the context vector.
When detecting an abnormal log file, after the vector of the [CLS] classification label is fed directly into the two-class neural network, only the output vector whose first position is [CLS] is taken as the context vector.
As shown in fig. 6, fig. 6 is a structure diagram of the discriminator detecting an anomaly in a log file. For log anomaly detection, the input is the sentence corresponding to each log file, with the input length set to 512. At the same time, the beginning of each sentence is replaced with [CLS] (the classification label) at input time, and the detection result, abnormal or not, for the corresponding log is obtained.
During detection only the discriminator is needed, which reduces the CPU and memory load on the server used by operations and maintenance to judge abnormal information; abnormal results are detected more efficiently and rapidly, and the speed of daily detection tasks is greatly increased.
This embodiment provides a corpus detection method based on a mask language model. A brand-new mask language model comprising a generator and a discriminator is adopted. During training, the corpus word to be trained is input into the generator for training to obtain the probability distribution corresponding to the corpus word, and the probability distribution is then input into the discriminator for training to obtain the prediction result corresponding to the probability distribution, thereby determining the prediction result of the mask language model, where the prediction result indicates whether the corpus word has been replaced. When the model is used, only the discriminator's classifier is needed to input classification labels for the categories of the corpus words, which greatly improves the efficiency of testing the model and effectively reduces testing time. After the context vector is obtained, the state of the corpus word to be trained is detected according to it, for example the abnormal condition of an operation and maintenance server's log file, so that abnormal results are detected more efficiently and quickly and the speed of daily detection tasks is greatly increased.
Referring to fig. 7, fig. 7 is a schematic block diagram of a corpus detecting device according to an embodiment of the present application, the corpus detecting device is configured to perform the above-mentioned corpus detecting method based on a mask language model. The corpus detecting device may be configured in a terminal or a server.
As shown in fig. 7, the corpus detecting device 400 includes: a first training module 401, a second training module 402, an adjustment module 403, and a detection module 404.
A first training module 401, configured to input a corpus word to be trained into the generator for training, so as to obtain a probability distribution corresponding to the corpus word;
a second training module 402, configured to input the probability distribution to the discriminator for training, so as to obtain a prediction result corresponding to the probability distribution, where the prediction result includes whether the corpus word is replaced;
an adjusting module 403, configured to input a classification label into the discriminator according to the category of the corpus word, and adjust the prediction result based on the classification label and the corpus word by the discriminator to obtain a context vector;
a detecting module 404, configured to detect a state of the corpus word to be trained according to the context vector.
It should be noted that, as will be clear to those skilled in the art, for convenience and brevity of description, the specific working processes of the apparatus and the modules described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The apparatus described above may be implemented in the form of a computer program which is executable on a computer device as shown in fig. 8.
Referring to fig. 8, fig. 8 is a schematic block diagram of a computer device according to an embodiment of the present application. The computer device may be a server.
Referring to fig. 8, the computer device includes a processor, a memory, and a network interface connected through a system bus, wherein the memory may include a nonvolatile storage medium and an internal memory.
The non-volatile storage medium may store an operating system and a computer program. The computer program includes program instructions that, when executed, cause a processor to perform the corpus detection method based on the mask language model.
The processor is used for providing calculation and control capability and supporting the operation of the whole computer equipment.
The internal memory provides an environment for running a computer program in the non-volatile storage medium, and the computer program, when executed by the processor, causes the processor to perform any one of the corpus detection methods based on the mask language model.
The network interface is used for network communication, such as sending assigned tasks and the like. Those skilled in the art will appreciate that the architecture shown in fig. 8 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
It should be understood that the Processor may be a Central Processing Unit (CPU), and the Processor may be other general purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, etc. Wherein a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
Wherein, in one embodiment, the processor is configured to execute a computer program stored in the memory to implement the steps of:
inputting the corpus words to be trained into the generator for training to obtain the probability distribution corresponding to the corpus words;
inputting the probability distribution into the discriminator for training to obtain a prediction result corresponding to the probability distribution, wherein the prediction result comprises whether the corpus word is replaced or not;
inputting a classification label in the discriminator according to the category of the corpus word, and adjusting the prediction result through the discriminator based on the classification label and the corpus word to obtain a context vector;
and detecting the state of the corpus word to be trained according to the context vector.
In some embodiments, the inputting, by the processor, the corpus word to be trained into the generator for training to obtain the probability distribution corresponding to the corpus word includes:
inputting a word vector corresponding to the corpus word to be trained into a generator, wherein the word vector comprises a word dimension, a sentence dimension and a position dimension;
inputting, through the generator, the word vectors obtained by superposing the word dimension, sentence dimension and position dimension into an encoder of the generator for encoding to obtain word vectors of all dimensions, wherein the encoder comprises a plurality of encoder layers;
and randomly replacing part of words in the dimension word vectors according to a preset replacement rule, and obtaining the probability distribution corresponding to each dimension word vector.
In some embodiments, when the category of the corpus word to be trained is the log file category, before the processor inputs the corpus word to be trained into the generator for training to obtain the probability distribution corresponding to the corpus word, the processor implements:
and preprocessing the log file to be trained.
In some embodiments, after the processor inputs the corpus word to be trained into the generator for training to obtain the probability distribution corresponding to the corpus word, the method includes:
a loss function of the generator is calculated and the generator is adjusted according to the loss function of the generator.
In some embodiments, after the processor implements the detecting the state of the corpus word to be trained according to the context vector, the processor comprises:
calculating a loss function of the discriminator, and adjusting the discriminator according to the loss function of the discriminator.
In some embodiments, the processor further implements:
and superposing the loss function of the generator and the loss function of the discriminator to obtain a total loss function so as to adjust the mask language model.
In some embodiments, the inputting, by the processor, the probability distribution to the discriminator for training to obtain a prediction result corresponding to the probability distribution, where the prediction result includes whether the corpus word is replaced, includes:
and replacing the word vectors corresponding to the probability distribution according to a preset replacement probability through the discriminator, so as to predict whether the word vectors corresponding to the probability distribution are replaced or not and obtain a prediction result, wherein the prediction result is stored in the blockchain node.
In some embodiments, the processor further implements a classification label input in the discriminator according to the category of the corpus word, and the obtaining, by the discriminator, a context vector by adjusting the prediction result based on the classification label and the corpus word includes:
in the discriminator, replacing the first word corresponding to the log file with a classification label;
inputting all words corresponding to the log file into an encoder for training, inputting vectors corresponding to classification labels into a two-class neural network for training, and outputting context vectors, wherein the first position of each context vector corresponds to the classification label;
the detecting the state of the corpus word to be trained according to the context vector includes:
and judging the abnormal condition of the log file according to the first position of the context vector.
The embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, where the computer program includes program instructions, and the processor executes the program instructions to implement any one of the corpus detection methods based on the mask language model provided in the embodiment of the present application.
The computer-readable storage medium may be an internal storage unit of the computer device described in the foregoing embodiment, for example, a hard disk or a memory of the computer device. The computer readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the computer device.
Further, the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the blockchain node, and the like.
The blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks associated by cryptographic methods, each data block containing the information of a batch of network transactions, used to verify the validity (anti-counterfeiting) of the information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
While the invention has been described with reference to specific embodiments, the scope of the invention is not limited thereto, and those skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the invention. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A corpus detection method based on a mask language model, wherein the method is applied to a mask language model, and the mask language model comprises a generator and a discriminator; the method comprises the following steps:
inputting the corpus words to be trained into the generator for training to obtain the probability distribution corresponding to the corpus words;
inputting the probability distribution into the discriminator for training to obtain a prediction result corresponding to the probability distribution, wherein the prediction result comprises whether the corpus word is replaced or not;
inputting a classification label in the discriminator according to the category of the corpus word, and adjusting the prediction result through the discriminator based on the classification label and the corpus word to obtain a context vector;
and detecting the state of the corpus word to be trained according to the context vector.
2. The method according to claim 1, wherein the inputting the corpus word to be trained into the generator for training to obtain the probability distribution corresponding to the corpus word comprises:
inputting a word vector corresponding to the corpus word to be trained into a generator, wherein the word vector comprises a word dimension, a sentence dimension and a position dimension;
inputting, through the generator, the word vectors obtained by superposing the word dimension, sentence dimension and position dimension into an encoder of the generator for encoding to obtain word vectors of all dimensions, wherein the encoder comprises a plurality of encoder layers;
and randomly replacing part of words in the dimension word vectors according to a preset replacement rule to obtain the probability distribution corresponding to each dimension word vector.
3. The method according to claim 1, wherein the category of the corpus word to be trained is a log file category, and before the corpus word to be trained is input into the generator for training and a probability distribution corresponding to the corpus word is obtained, the method comprises:
and preprocessing the log file to be trained.
4. The method according to claim 1, wherein after the corpus words to be trained are input into the generator for training, and a probability distribution corresponding to the corpus words is obtained, the method further comprises:
calculating a loss function of the generator and adjusting the generator according to the loss function of the generator;
after the detecting the state of the corpus word to be trained according to the context vector, the method further includes:
calculating a loss function of the discriminator, and adjusting the discriminator according to the loss function of the discriminator.
5. The method of claim 4, further comprising:
and superposing the loss function of the generator and the loss function of the discriminator to obtain a total loss function so as to adjust the mask language model.
6. The method according to claim 1, wherein the inputting the probability distribution into the discriminator for training to obtain a prediction result corresponding to the probability distribution, the prediction result including whether the corpus word is replaced comprises:
and replacing the word vectors corresponding to the probability distribution according to a preset replacement probability through the discriminator, so as to predict whether the word vectors corresponding to the probability distribution are replaced or not and obtain a prediction result, wherein the prediction result is stored in a blockchain.
7. The method according to claim 1, wherein the class of the corpus word to be trained is a log file class, the classifying label is input into the discriminator according to the class of the corpus word, and the context vector is obtained by the discriminator by adjusting the prediction result based on the classifying label and the corpus word, comprising:
in the discriminator, replacing the first word corresponding to the log file with a classification label;
inputting all words corresponding to the log file into an encoder for training, inputting vectors corresponding to classification labels into a two-class neural network for training, and outputting context vectors, wherein the first position of each context vector corresponds to the classification label;
the detecting the state of the corpus word to be trained according to the context vector includes:
and judging the abnormal condition of the log file according to the first position of the context vector.
8. A corpus detecting device, comprising:
the first training module is used for inputting the corpus words to be trained into the generator for training to obtain the probability distribution corresponding to the corpus words;
the second training module is used for inputting the probability distribution to the discriminator for training to obtain a prediction result corresponding to the probability distribution, wherein the prediction result comprises whether the corpus words are replaced or not;
the adjusting module is used for inputting a classification label into the discriminator according to the category of the corpus words, and adjusting the prediction result through the discriminator based on the classification label and the corpus words to obtain a context vector;
and the detection module is used for detecting the state of the corpus word to be trained according to the context vector.
9. A computer device, wherein the computer device comprises a memory and a processor;
the memory is used for storing a computer program;
the processor, configured to execute the computer program and when executing the computer program, implement the mask language model-based corpus detection method according to any one of claims 1 to 7.
10. A computer-readable storage medium, wherein a computer program is stored, and when executed by a processor, causes the processor to implement the mask language model-based corpus detection method according to any one of claims 1 to 7.
CN202010888877.4A 2020-08-28 2020-08-28 Corpus detection method, device, equipment and medium based on mask language model Active CN112069795B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010888877.4A CN112069795B (en) 2020-08-28 2020-08-28 Corpus detection method, device, equipment and medium based on mask language model
PCT/CN2020/117434 WO2021151292A1 (en) 2020-08-28 2020-09-24 Corpus monitoring method based on mask language model, corpus monitoring apparatus, device, and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010888877.4A CN112069795B (en) 2020-08-28 2020-08-28 Corpus detection method, device, equipment and medium based on mask language model

Publications (2)

Publication Number Publication Date
CN112069795A true CN112069795A (en) 2020-12-11
CN112069795B CN112069795B (en) 2023-05-30

Family

ID=73660536

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010888877.4A Active CN112069795B (en) 2020-08-28 2020-08-28 Corpus detection method, device, equipment and medium based on mask language model

Country Status (2)

Country Link
CN (1) CN112069795B (en)
WO (1) WO2021151292A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113011177A (en) * 2021-03-15 2021-06-22 北京百度网讯科技有限公司 Model training and word vector determination methods, apparatus, devices, media and products
CN113094482A (en) * 2021-03-29 2021-07-09 中国地质大学(北京) Lightweight semantic intelligent service adaptation training evolution method and system
CN117332038A (en) * 2023-09-19 2024-01-02 鹏城实验室 Text information detection method, device, equipment and storage medium
CN117786104A (en) * 2023-11-17 2024-03-29 中信建投证券股份有限公司 Model training method and device, electronic equipment and storage medium

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113642312A (en) * 2021-08-19 2021-11-12 平安医疗健康管理股份有限公司 Physical examination data processing method, physical examination data processing device, physical examination equipment and storage medium
CN113657104A (en) * 2021-08-31 2021-11-16 平安医疗健康管理股份有限公司 Text extraction method and device, computer equipment and storage medium
CN114049662B (en) * 2021-10-18 2024-05-28 天津大学 Facial feature transfer learning-based expression recognition network device and method
CN114723073B (en) * 2022-06-07 2023-09-05 阿里健康科技(杭州)有限公司 Language model pre-training method, product searching method, device and computer equipment
CN114936327B (en) * 2022-07-22 2022-10-28 腾讯科技(深圳)有限公司 Element recognition model acquisition method and device, computer equipment and storage medium
CN115495314A (en) * 2022-09-30 2022-12-20 中国电信股份有限公司 Log template identification method and device, electronic equipment and readable medium
CN116662579B (en) * 2023-08-02 2024-01-26 腾讯科技(深圳)有限公司 Data processing method, device, computer and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110137653A1 (en) * 2009-12-04 2011-06-09 At&T Intellectual Property I, L.P. System and method for restricting large language models
CN108009628A (en) * 2017-10-30 2018-05-08 杭州电子科技大学 A kind of method for detecting abnormality based on generation confrontation network
CN108734276A (en) * 2018-04-28 2018-11-02 同济大学 A kind of learning by imitation dialogue generation method generating network based on confrontation
CN110196894A (en) * 2019-05-30 2019-09-03 北京百度网讯科技有限公司 The training method and prediction technique of language model
CN111028206A (en) * 2019-11-21 2020-04-17 万达信息股份有限公司 Prostate cancer automatic detection and classification system based on deep learning
CN111241291A (en) * 2020-04-24 2020-06-05 支付宝(杭州)信息技术有限公司 Method and device for generating countermeasure sample by utilizing countermeasure generation network
CN111414772A (en) * 2020-03-12 2020-07-14 北京小米松果电子有限公司 Machine translation method, device and medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111539223B (en) * 2020-05-29 2023-08-18 北京百度网讯科技有限公司 Language model training method and device, electronic equipment and readable storage medium

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110137653A1 (en) * 2009-12-04 2011-06-09 At&T Intellectual Property I, L.P. System and method for restricting large language models
CN108009628A (en) * 2017-10-30 2018-05-08 杭州电子科技大学 A kind of method for detecting abnormality based on generation confrontation network
CN108734276A (en) * 2018-04-28 2018-11-02 同济大学 A kind of learning by imitation dialogue generation method generating network based on confrontation
CN110196894A (en) * 2019-05-30 2019-09-03 北京百度网讯科技有限公司 The training method and prediction technique of language model
CN111028206A (en) * 2019-11-21 2020-04-17 万达信息股份有限公司 Prostate cancer automatic detection and classification system based on deep learning
CN111414772A (en) * 2020-03-12 2020-07-14 北京小米松果电子有限公司 Machine translation method, device and medium
CN111241291A (en) * 2020-04-24 2020-06-05 支付宝(杭州)信息技术有限公司 Method and device for generating countermeasure sample by utilizing countermeasure generation network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Kevin Clark et al.: "ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators", arXiv:2003.10555v1 *
任璐; 杨亮; 徐琳宏; 樊小超; 刁宇峰; 林鸿飞: "Construction and Application of a Chinese Joke Corpus" (中文笑话语料库的构建与应用), Journal of Chinese Information Processing (中文信息学报)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113011177A (en) * 2021-03-15 2021-06-22 北京百度网讯科技有限公司 Model training and word vector determination methods, apparatus, devices, media and products
CN113011177B (en) * 2021-03-15 2023-09-29 北京百度网讯科技有限公司 Model training and word vector determining method, device, equipment, medium and product
CN113094482A (en) * 2021-03-29 2021-07-09 中国地质大学(北京) Lightweight semantic intelligent service adaptation training evolution method and system
CN113094482B (en) * 2021-03-29 2023-10-17 中国地质大学(北京) Lightweight semantic intelligent service adaptation training evolution method and system
CN117332038A (en) * 2023-09-19 2024-01-02 鹏城实验室 Text information detection method, device, equipment and storage medium
CN117332038B (en) * 2023-09-19 2024-07-02 鹏城实验室 Text information detection method, device, equipment and storage medium
CN117786104A (en) * 2023-11-17 2024-03-29 中信建投证券股份有限公司 Model training method and device, electronic equipment and storage medium
CN117786104B (en) * 2023-11-17 2024-06-21 中信建投证券股份有限公司 Model training method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
WO2021151292A1 (en) 2021-08-05
CN112069795B (en) 2023-05-30

Similar Documents

Publication Publication Date Title
CN112069795B (en) Corpus detection method, device, equipment and medium based on mask language model
Nedelkoski et al. Self-attentive classification-based anomaly detection in unstructured logs
Mehdiyev et al. A multi-stage deep learning approach for business process event prediction
Chang et al. A hybrid system integrating a wavelet and TSK fuzzy rules for stock price forecasting
CN109657947B (en) Enterprise industry classification-oriented anomaly detection method
US11763091B2 (en) Automated content tagging with latent dirichlet allocation of contextual word embeddings
Gomes et al. BERT-and TF-IDF-based feature extraction for long-lived bug prediction in FLOSS: a comparative study
WO2020197757A1 (en) Data-driven deep learning model generalization analysis and improvement
CN113590451B (en) Root cause positioning method, operation and maintenance server and storage medium
CN110658905B (en) Early warning method, early warning system and early warning device for equipment operation state
CN116402630B (en) Financial risk prediction method and system based on characterization learning
Doğan Analysis of the relationship between LSTM network traffic flow prediction performance and statistical characteristics of standard and nonstandard data
CN113821418A (en) Fault tracking analysis method and device, storage medium and electronic equipment
CN110472231B (en) Method and device for identifying legal document case
CN111709225A (en) Event cause and effect relationship judging method and device and computer readable storage medium
CN114881173A (en) Resume classification method and device based on self-attention mechanism
Hu et al. Estimate remaining useful life for predictive railways maintenance based on LSTM autoencoder
CN114357171A (en) Emergency event processing method and device, storage medium and electronic equipment
Chang et al. An ensemble of neural networks for stock trading decision making
Bernardelli et al. The BeMi stardust: a structured ensemble of binarized neural networks
US20220164705A1 (en) Method and apparatus for providing information based on machine learning
CN114579761A (en) Information security knowledge entity relation connection prediction method, system and medium
CN113919357A (en) Method, device and equipment for training address entity recognition model and storage medium
Li et al. LogPS: A robust log sequential anomaly detection approach based on natural language processing
Verenich et al. Tell me what’s ahead? predicting remaining activity sequences of business process instances

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40040160

Country of ref document: HK

SE01 Entry into force of request for substantive examination
GR01 Patent grant