CN112069795A - Corpus detection method, apparatus, device and medium based on mask language model - Google Patents

Corpus detection method, apparatus, device and medium based on mask language model

Info

Publication number
CN112069795A
Authority
CN
China
Prior art keywords
corpus
word
discriminator
generator
probability distribution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010888877.4A
Other languages
Chinese (zh)
Other versions
CN112069795B (en)
Inventor
邓悦
郑立颖
徐亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202010888877.4A priority Critical patent/CN112069795B/en
Priority to PCT/CN2020/117434 priority patent/WO2021151292A1/en
Publication of CN112069795A publication Critical patent/CN112069795A/en
Application granted granted Critical
Publication of CN112069795B publication Critical patent/CN112069795B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to artificial intelligence, and in particular discloses a corpus detection method, apparatus, device and medium based on a mask language model. The method comprises the following steps: inputting the corpus words to be trained into the generator for training to obtain the probability distribution corresponding to the corpus words; inputting the probability distribution into the discriminator for training to obtain a prediction result corresponding to the probability distribution, where the prediction result indicates whether the corpus word has been replaced and is stored in a blockchain node; inputting a classification label into the discriminator according to the category of the corpus word, and adjusting the prediction result through the discriminator based on the classification label and the corpus word to obtain a context vector; and detecting the state of the corpus word to be trained according to the context vector. The method effectively improves model training efficiency and can efficiently and accurately judge abnormal conditions in log files.

Description

Corpus detection method, apparatus, device and medium based on mask language model
Technical Field
The present application relates to the field of intelligent decision making technologies, and in particular, to a corpus detection method and apparatus based on a mask language model, a computer device, and a medium.
Background
In text processing, anomaly detection of log files plays an important role in the management of modern large distributed systems; logs are widely used to record information during system operation. Currently, operation and maintenance personnel typically check logs using keyword searches and rule matching. However, as workload and business demand grow, the time required for manual detection also grows, becoming ever more time-consuming and labor-intensive. To reduce manual workload and improve detection accuracy, deep-learning-based log anomaly detection methods are increasingly applied in the anomaly detection field.
The currently popular text processing models are mask-based pre-trained language models; however, because of their high demands on computing resources, modifying and training such models is constrained by training cost and running time.
Disclosure of Invention
The application provides a corpus detection method, apparatus, computer device and medium based on a mask language model for intelligent decision making, which effectively improve model training efficiency and can efficiently and accurately judge abnormal conditions in log files.
In a first aspect, the present application provides a corpus detection method based on a mask language model, where the method is applied to a mask language model, and the mask language model includes a generator and a discriminator; the method comprises the following steps:
inputting the corpus words to be trained into the generator for training to obtain the probability distribution corresponding to the corpus words;
inputting the probability distribution into the discriminator for training to obtain a prediction result corresponding to the probability distribution, wherein the prediction result comprises whether the corpus word is replaced or not;
inputting a classification label in the discriminator according to the category of the corpus word, and adjusting the prediction result through the discriminator based on the classification label and the corpus word to obtain a context vector;
detecting the state of the corpus word to be trained according to the context vector; the category of the corpus word includes a log file category.
In a second aspect, the present application further provides a corpus detecting device, the device including:
the first training module is used for inputting the corpus words to be trained into the generator for training to obtain the probability distribution corresponding to the corpus words;
the second training module is used for inputting the probability distribution to the discriminator for training to obtain a prediction result corresponding to the probability distribution, wherein the prediction result comprises whether the corpus words are replaced or not;
the adjusting module is used for inputting a classification label into the discriminator according to the category of the corpus words, and adjusting the prediction result through the discriminator based on the classification label and the corpus words to obtain a context vector;
and the detection module is used for detecting the state of the corpus word to be trained according to the context vector.
In a third aspect, the present application further provides a computer device comprising a memory and a processor; the memory is used for storing a computer program; the processor is configured to execute the computer program and implement the corpus detection method based on the mask language model when executing the computer program.
In a fourth aspect, the present application further provides a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the processor is enabled to implement the corpus detecting method based on the mask language model.
The application discloses a corpus detection method, apparatus, computer device and storage medium based on a mask language model. A brand-new mask language model comprising a generator and a discriminator is adopted. During training, the corpus word to be trained is input into the generator for training to obtain the probability distribution corresponding to the corpus word, and the probability distribution is then input into the discriminator for training to obtain the prediction result corresponding to the probability distribution, thereby determining the prediction result of the mask language model, where the prediction result indicates whether the corpus word has been replaced. When the model is used, only the discriminator's classifier is needed to input classification labels for the categories of the corpus words, which greatly improves the efficiency of testing the model and effectively reduces testing time. After the context vector is obtained, the state of the corpus word to be trained is detected according to it, for example the abnormal condition of an operation and maintenance server's log file, so that abnormal results are detected more efficiently and quickly and the speed of daily detection tasks is greatly increased.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is an architecture diagram of a mask language model provided by an embodiment of the present application;
FIG. 2 is a generator architecture diagram provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of an input word vector for a generator provided by an embodiment of the present application;
FIG. 4 is a diagram of a discriminator architecture provided by an embodiment of the present application;
fig. 5 is a schematic flowchart of a corpus detection method based on a mask language model according to an embodiment of the present application;
FIG. 6 is a diagram of a discriminator structure for verifying a log file, provided by an embodiment of the present application;
FIG. 7 is a schematic block diagram of another corpus detection apparatus provided in an embodiment of the present application;
fig. 8 is a schematic block diagram of a structure of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The flow diagrams depicted in the figures are merely illustrative and do not necessarily include all of the elements and operations/steps, nor do they necessarily have to be performed in the order depicted. For example, some operations/steps may be decomposed, combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
It is to be understood that the terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the specification of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
The embodiment of the application provides a corpus detection method and device based on a mask language model, computer equipment and a storage medium. The corpus detection method based on the mask language model effectively improves the model training efficiency and can efficiently and accurately judge the abnormal condition of the log file.
Some embodiments of the present application will be described in detail below with reference to the accompanying drawings. The embodiments described below and the features of the embodiments can be combined with each other without conflict.
Referring to fig. 1, fig. 1 is a training architecture of a mask language model according to an embodiment of the present application. The training architecture includes a generator and a discriminator: the two are trained together during training, while only the discriminator is used during detection, which effectively improves training efficiency.
Generator architecture: as shown in fig. 2, the generator includes a series of Encoders with corresponding inputs and outputs. The generator's inputs are word vectors and its outputs are probability distributions, i.e., for the word at each position, a probability over every candidate word, from which the most likely word can be selected by taking the maximum probability.
Specifically, as shown in FIG. 3, the input to the encoder of the generator is the vectors w1, w2, ..., wn corresponding to the words. Each word vector is generated as the superposition of 3 partial vectors, which may include a word dimension vector, a sentence dimension vector, and a position dimension vector.
The generator's task is inherently simple and it does not have particularly many parameters; it is structured so as to feed the more complex task to the discriminator.
Discriminator architecture: as shown in fig. 4, the architecture of the discriminator is generally similar to that of the generator and includes a series of encoders together with inputs and outputs. O1, ..., On in fig. 4 are the outputs from the generator; these inputs O1, ..., On likewise pass through the word vector layer (Embedding) of the discriminator before being fed into the discriminator's Encoder structure. Unlike in the generator, a Classifier layer is added after the discriminator's encoder at output time to judge whether each word has been replaced; the corresponding outputs are R1, ..., Rn with probabilities, a 0/1 classification of whether the word was replaced.
Based on the structure of the mask language model, a corpus detection method based on the mask language model is provided.
Referring to fig. 5, fig. 5 is a schematic flowchart of a corpus detection method based on a mask language model according to an embodiment of the present application. The corpus detection method based on the mask language model can be applied to the mask language model in the graph 1, effectively improves the model training efficiency, and can efficiently and accurately judge the abnormal condition of the log file.
As shown in fig. 5, the corpus detecting method based on the mask language model specifically includes steps S101 to S104.
S101, inputting the corpus words to be trained into the generator for training to obtain the probability distribution corresponding to the corpus words.
When a user needs to perform partial masking on a sentence, the whole sentence is input into the mask language model as the corpus word to be trained and fed into the generator of the mask language model for training. Specifically, this comprises the following steps:
s11, inputting word vectors corresponding to the corpus words to be trained into a generator, wherein the word vectors comprise word dimensions, sentence dimensions and position dimensions.
As shown in FIG. 3, the input to the encoder of the generator is the vectors w1, w2, ..., wn corresponding to the words. Each word vector is generated as the superposition of 3 partial vectors, which may include a word dimension vector, a sentence dimension vector, and a position dimension vector.
As shown in fig. 4, the corpus words to be trained are sentences. Taking "the value is high" and "it is urgent new" as examples, these sentences serve as the generator's input, which incorporates the word dimension, sentence dimension and position dimension.
S12, inputting the word vector obtained by superposing the word dimension, sentence dimension and position dimension into the encoder of the generator for encoding to obtain each dimension word vector, where the encoder comprises a plurality of encoder layers.
The generator superposes the input word vectors across the three dimensions (word, sentence and position) to obtain superposed word vectors, which it then feeds into its encoder for encoding to obtain the word vector of each dimension. The generator's encoder has multiple layers, and the word vectors of each dimension are obtained by encoding layer by layer.
The generator may be a pre-trained model or may be trained during input. For the word dimension, for example, the total length of a word vector may be 768; if the input corpus word to be trained corresponds to 6 words, the word-dimension output of the generator model is (6, 768).
In some embodiments, for the sentence dimension, when the input corpus word to be trained includes two sentences, the generator may add different word vectors (embeddings) for different sentences; correspondingly, the first sentence has a dimensionality of (1, 768) and the second sentence a dimensionality of (2, 768).
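The superposition can be illustrated with a short sketch. This is a minimal sketch assuming PyTorch; the vocabulary size and maximum length are assumptions, and the position part is shown as a learned table for brevity, while the sinusoidal form the text actually describes is sketched after the formula below.

```python
import torch
import torch.nn as nn

class GeneratorEmbedding(nn.Module):
    """Word vector as the superposition of word, sentence and position parts."""

    def __init__(self, vocab_size=30000, hidden=768, max_len=512, num_sentences=2):
        super().__init__()
        self.word = nn.Embedding(vocab_size, hidden)          # word dimension
        self.sentence = nn.Embedding(num_sentences, hidden)   # sentence dimension
        self.position = nn.Embedding(max_len, hidden)         # position dimension

    def forward(self, token_ids, sentence_ids):
        # token_ids, sentence_ids: (batch, sequence length)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        # The word vector is the sum (superposition) of the three partial vectors.
        return (self.word(token_ids)
                + self.sentence(sentence_ids)
                + self.position(positions))
```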
In some embodiments, for the position dimension, when the corpus word to be trained includes the same word at different positions, the position information of the word must also be considered. For example, if the input sentence is "I come and I watch", then for the generator the two instances of "I" are different. To encode position information, the generator adopts sinusoidal encoding, with the following formulas:
PE_{(pos, 2i)} = \sin\left( pos / 10000^{2i/d} \right)

PE_{(pos, 2i+1)} = \cos\left( pos / 10000^{2i/d} \right)
where pos is the position index, representing the position of the word in the sequence, i is the index within the vector, and d is the dimension of the generator model, which here is 768. These formulas encode position information with a sine function at the even positions of the vector and a cosine function at the odd positions, so that each dimension of the position-encoding vector is a waveform of a different frequency, with each value between -1 and 1; the position dimension is thus obtained.
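A minimal sketch of this sinusoidal encoding, assuming NumPy; the function name is illustrative.

```python
import numpy as np

def position_encoding(max_len: int, d: int = 768) -> np.ndarray:
    """Sine at the even indices of each vector, cosine at the odd indices,
    per the formulas above; every value lies between -1 and 1."""
    pos = np.arange(max_len)[:, None]             # pos: position of the word
    i = np.arange(d // 2)[None, :]                # i: index pair in the vector
    angles = pos / np.power(10000.0, 2.0 * i / d)
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angles)                  # even positions: sine
    pe[:, 1::2] = np.cos(angles)                  # odd positions: cosine
    return pe
```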
And S13, randomly replacing part of words in the dimension word vectors according to a preset replacement rule, and obtaining the probability distribution corresponding to each dimension word vector.
When the generator performs masking, as the corpus word to be trained is input, the generator replaces part of its words according to a preset rule, e.g., the token at each [mask] position is replaced. At output time, the generator predicts each masked word from the context of the unmasked words in the sentence. For example, as shown in fig. 1, when the input corpus word to be trained is "the value is high" and the [mask] positions correspond to "the" and "value", the masked words are "the" and "value", the unmasked context is "is" and "high", and "the" and "value" are predicted from "is" and "high".
In some embodiments, masking uses a preset replacement rule. For example, the generator randomly selects 20% of the input tokens for [mask] masking and, within that selected 20%, applies the following rules:
1. 10% are replaced with any word;
2. 10% of words do not change;
3. 80% of the words are replaced with [mask].
And randomly replacing part of words in the dimension word vectors according to a preset replacement rule to obtain the probability distribution corresponding to each dimension word vector.
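A minimal sketch of this preset replacement rule in Python; the tokenization, vocabulary, and function name are assumptions.

```python
import random

MASK = "[mask]"

def mask_tokens(tokens, vocab, select_rate=0.2):
    """Randomly select select_rate of the tokens, then apply the 80/10/10 rule."""
    out = list(tokens)
    for idx in range(len(out)):
        if random.random() >= select_rate:
            continue                          # token not selected for masking
        r = random.random()
        if r < 0.8:
            out[idx] = MASK                   # 80%: replaced with [mask]
        elif r < 0.9:
            out[idx] = random.choice(vocab)   # 10%: replaced with any word
        # remaining 10%: the word does not change
    return out

print(mask_tokens("the value is high".split(), vocab=["key", "new", "low"]))
```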
It can be understood that the Encoder of the generator uses an attention mechanism, whose purpose, when processing a single word, is to find the related words in the sentence where that word is located and merge them into the word being processed, thereby achieving a better encoding effect. The attention mechanism here is a multi-head attention mechanism: built on self-attention, the Encoder is set to 16 layers, so the attention mechanism is applied 16 times, and a final output is obtained through a linear mapping. The multi-head attention mechanism attends to different positions, thereby capturing information from the different dimensions of the sentence.
By using a multi-head attention mechanism and a brand-new word embedding method, encoded information from three dimensions (position, sentence and word) is introduced, so that the understanding of words has more dimensions.
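As an illustration of the 16-layer multi-head attention encoder just described, here is a minimal sketch assuming PyTorch; the number of attention heads and the feed-forward size are not stated in the text and are assumptions.

```python
import torch
import torch.nn as nn

# 16 encoder layers of hidden size 768, per the text; 12 heads is an assumption.
layer = nn.TransformerEncoderLayer(d_model=768, nhead=12,
                                   dim_feedforward=3072, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=16)

word_vectors = torch.randn(2, 6, 768)   # (batch, words, hidden), e.g. 6 words
contextual = encoder(word_vectors)      # same shape, contextually encoded
```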
In some embodiments, after inputting the corpus word to be trained into the generator for training, and obtaining the probability distribution corresponding to the corpus word, the method may further include:
a loss function of the generator is calculated and the generator is adjusted according to the loss function of the generator.
The loss function of the generator measures whether, for the [mask] words, each word is predicted correctly from context, and is given by Equation 3:
L_{MLM}(x, \theta_G) = \mathbb{E}\left[ \sum_{i \in masked} -\log p_G\left( x_i \mid x^{masked} \right) \right]    (Equation 3)

wherein L_{MLM} is the loss function of the generator, x is a sample, x^{masked} is the sample occluded by the mask in the Embedding process, \theta_G denotes the parameters of the generator, and p_G(x_i | x^{masked}) is the conditional distribution of the word x_i given the masked sample.
Through the generator, word-vector superposition and encoding are performed on the corpus words to be trained to realize the mask, and each word vector is output together with its corresponding probability distribution.
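A minimal sketch of the generator loss of Equation 3, assuming PyTorch; the tensor shapes and the function name are assumptions.

```python
import torch
import torch.nn.functional as F

def generator_loss(logits, target_ids, masked):
    """-log p_G(x_i | x_masked), averaged over the occluded positions only.
    logits: (batch, seq, vocab); target_ids: (batch, seq);
    masked: boolean (batch, seq), True where a token was occluded."""
    vocab = logits.size(-1)
    per_token = F.cross_entropy(logits.reshape(-1, vocab),
                                target_ids.reshape(-1), reduction="none")
    return per_token[masked.reshape(-1)].mean()
```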
In some embodiments, when a user performs business operations, the business system usually generates corresponding log files, and the mask language model of this scheme is applied during log detection. If the category of the corpus word to be trained is the log file category, then before the log file to be trained is input into the generator for training to obtain the probability distribution corresponding to the corpus word, the method may include:
and preprocessing the log file to be trained.
Specifically, the preprocessing may convert upper-case text in the log file to lower case, filter out fixed texts with identical structure, and replace unimportant information (addresses/times/dates).
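A minimal sketch of such preprocessing in Python; the placeholder tokens and regular expressions are illustrative assumptions, not the exact rules of the scheme.

```python
import re

def preprocess_log(line: str) -> str:
    """Lower-case the log line and replace unimportant variable fields."""
    line = line.lower()
    line = re.sub(r"\b\d{1,3}(?:\.\d{1,3}){3}(?::\d+)?\b", "<addr>", line)  # address
    line = re.sub(r"\b\d{4}-\d{2}-\d{2}\b", "<date>", line)                 # date
    line = re.sub(r"\b\d{2}:\d{2}:\d{2}\b", "<time>", line)                 # time
    return line

print(preprocess_log("2020-08-28 12:30:01 ERROR 10.0.0.1:8080 Connection refused"))
```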
After the log file is preprocessed, the processed log file is input into a generator for training.
Different preprocessing is needed for different corpus words to be trained; this case takes log files as an example, but is not limited to log files.
S102, inputting the probability distribution into the discriminator for training to obtain a prediction result corresponding to the probability distribution, wherein the prediction result comprises whether the corpus word is replaced or not.
Specifically, as shown in fig. 4, inputting the probability distribution into the discriminator for training to obtain a prediction result corresponding to the probability distribution, where the prediction result indicates whether the corpus word has been replaced, may include:
and replacing the word vectors corresponding to the probability distribution according to a preset replacement probability through the discriminator so as to predict whether the word vectors corresponding to the probability distribution are replaced or not, and obtain a prediction result.
In some embodiments, after the discriminator receives the generator's output, it replaces the input word vectors with a certain probability in order to predict whether each word output by the generator has been replaced. Specifically, O1, ..., On are the outputs from the generator; these inputs likewise pass through an Embedding encoding layer and are fed into the Encoder structure, to which a Classifier is added to judge whether each word is original or replaced; the corresponding outputs are R1, ..., Rn. The prediction result indicates whether each word vector was replaced, i.e., it covers both the replaced and the not-replaced cases.
For example, the corpus word to be trained, "the value is high", is input into the generator, undergoes the three-dimension superposition and partial masking, and "the key is high" is output; evidently "value" has been subjected to masking. The generator's output "the key is high" is then input into the discriminator for discrimination. During discrimination, substitution is performed according to the preset substitution probability: "the", "is" and "high" are all in the original state, while "key" is in a replaced state, i.e., the discriminator identifies the word that was replaced.
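A minimal sketch of this discriminator structure, assuming PyTorch; the hidden size and layer count follow the text, while the head count and class names are assumptions.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Encoder plus the per-word Classifier layer: for every position it
    outputs the probability that the word was replaced (the outputs R1..Rn)."""

    def __init__(self, hidden=768, num_layers=16):
        super().__init__()
        block = nn.TransformerEncoderLayer(d_model=hidden, nhead=12,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(block, num_layers=num_layers)
        self.classifier = nn.Linear(hidden, 1)    # the added Classifier layer

    def forward(self, embedded):                  # (batch, seq, hidden)
        h = self.encoder(embedded)
        return torch.sigmoid(self.classifier(h)).squeeze(-1)  # P(replaced)
```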
In some embodiments, after detecting the state of the corpus word to be trained according to the context vector, the method further comprises:
calculating a loss function of the discriminator, and adjusting the discriminator according to the loss function of the discriminator. The loss function of the discriminator is given by Equation 4:
L_{Disc}(x, \theta_D) = \mathbb{E}\left[ \sum_{t=1}^{n} -\mathbb{1}\left( x_t^{corrupt} = x_t \right) \log D\left( x^{corrupt}, t \right) - \mathbb{1}\left( x_t^{corrupt} \neq x_t \right) \log\left( 1 - D\left( x^{corrupt}, t \right) \right) \right]    (Equation 4)

wherein L_{Disc} is the loss function of the discriminator, \mathbb{1}(\cdot) is the indicator function, x is a sample, t is the time step (position), x^{corrupt} is the sample after replacement, \theta_D denotes the parameters of the discriminator, and D(x^{corrupt}, t) is the discriminator's output for position t.
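A minimal sketch of this loss, assuming PyTorch. Equation 4 writes D as the probability that a word is original; the sketch uses the equivalent binary cross-entropy form in which the network predicts "replaced", matching the 0/1 output described above. The names and shapes are assumptions.

```python
import torch.nn.functional as F

def discriminator_loss(replace_prob, is_replaced):
    """Binary cross-entropy over every time step.
    replace_prob: predicted probability of replacement, (batch, seq);
    is_replaced: 1.0 where the token differs from the original, else 0.0."""
    return F.binary_cross_entropy(replace_prob, is_replaced)
```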
And superposing the loss function of the generator and the loss function of the discriminator to obtain a total loss function so as to adjust the mask language model.
Specifically, the loss function of the generator is superimposed with the loss function of the discriminator to obtain the total loss function of the model, which is formula 5:
\min_{\theta_G, \theta_D} \sum_{x \in X} L_{MLM}(x, \theta_G) + \lambda L_{Disc}(x, \theta_D)    (Equation 5)

where X is the set of training samples and \lambda is a coefficient weighting the discriminator loss.
Since the generator and the discriminator have the same structure, model training can be made more efficient by sharing parameters between them. Moreover, during training the generator and the discriminator are trained together, while at usage time only the discriminator is put into use; the model therefore runs with fewer parameters and achieves better training efficiency.
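A minimal sketch of the superposition of the two losses in Equation 5; the value of the weighting coefficient is an assumption, as the text does not state it.

```python
def total_loss(l_mlm, l_disc, lam=50.0):
    """Superpose the generator loss and the (weighted) discriminator loss."""
    return l_mlm + lam * l_disc
```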
It is emphasized that the prediction result may also be stored in a blockchain node in order to further ensure its privacy and security.
S103, inputting a classification label into the discriminator according to the category of the corpus word, and adjusting the prediction result through the discriminator based on the classification label and the corpus word to obtain a context vector.
In some embodiments, when the corpus words are log files, the log files are preprocessed and the preprocessed log files are then input into the generator and discriminator for training. At detection time, the prediction result obtained after training is input into the model's discriminator, the model's input being the words corresponding to each log text.
Specifically, when the category of the corpus word is a log file category, inputting a classification tag in the discriminator according to the category of the corpus word, and adjusting the prediction result by the discriminator based on the classification tag and the corpus word to obtain a context vector, which may include:
and S31, replacing the first word corresponding to the log file with a classification label in the discriminator.
Specifically, the input length may be set to 512, and at input time the start O1 of each sentence is replaced with a [CLS] classification tag, corresponding to whether the log is abnormal or not.
S32, inputting all words corresponding to the log file into an encoder for training, inputting vectors corresponding to the classification labels into a binary neural network for training, and outputting context vectors, wherein the first position of the context vectors corresponds to the classification labels.
Specifically, considering that anomaly detection is a two-class task, after layer-by-layer training through the encoder a vector is obtained for each layer, whose length may be set to 768. The vector of the [CLS] classification tag is fed directly into a two-class neural network, corresponding to the classifier in the figure above, which judges whether there is an anomaly; the output result is 0/1, i.e., the judgment corresponding to abnormal or not abnormal.
In some embodiments, for a multi-classification task, a multi-class neural network may replace the classifier: a SoftMax logistic regression function gives the probability of each class, and the sample is assigned to the class with the maximum probability, which completes the classification and yields the classification result.
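A minimal sketch of this [CLS]-based classification head, assuming PyTorch; it covers both the two-class anomaly judgment and the SoftMax multi-class case, and everything beyond the 768-length vector and the first-position [CLS] convention is an assumption.

```python
import torch
import torch.nn as nn

class ClsHead(nn.Module):
    """Classify a log from the [CLS] vector at the first position."""

    def __init__(self, hidden=768, num_classes=2):
        super().__init__()
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, context_vectors):     # (batch, seq, hidden), seq up to 512
        cls_vec = context_vectors[:, 0]     # first position = [CLS] label
        probs = torch.softmax(self.fc(cls_vec), dim=-1)
        return probs.argmax(dim=-1), probs  # 0/1 judgment and class probabilities
```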
S104, detecting the state of the corpus word to be trained according to the context vector; the category of the corpus word includes a log file category.
In some embodiments, when the category of the corpus word is a log file category, detecting the state of the corpus word to be trained according to the context vector may include: and judging the abnormal condition of the log file according to the first position of the context vector.
When detecting an abnormal log file, after the vector of the [CLS] classification label is fed directly into the two-class neural network, only the output vector whose first position is [CLS] is taken as the context vector.
As shown in fig. 6, fig. 6 is a structure diagram of the discriminator detecting an anomaly in a log file. For log anomaly detection, the input is the sentence corresponding to each log file, with the input length set to 512. At the same time, the beginning of each sentence is replaced with [CLS] (the classification label) at input time, and the detection result, abnormal or not, for the corresponding log is obtained.
During detection only the discriminator is needed, which reduces the CPU and memory load on the server used by operations and maintenance to judge abnormal information; abnormal results are detected more efficiently and rapidly, and the speed of daily detection tasks is greatly increased.
This embodiment provides a corpus detection method based on a mask language model. A brand-new mask language model comprising a generator and a discriminator is adopted. During training, the corpus word to be trained is input into the generator for training to obtain the probability distribution corresponding to the corpus word, and the probability distribution is then input into the discriminator for training to obtain the prediction result corresponding to the probability distribution, thereby determining the prediction result of the mask language model, where the prediction result indicates whether the corpus word has been replaced. When the model is used, only the discriminator's classifier is needed to input classification labels for the categories of the corpus words, which greatly improves the efficiency of testing the model and effectively reduces testing time. After the context vector is obtained, the state of the corpus word to be trained is detected according to it, for example the abnormal condition of an operation and maintenance server's log file, so that abnormal results are detected more efficiently and quickly and the speed of daily detection tasks is greatly increased.
Referring to fig. 7, fig. 7 is a schematic block diagram of a corpus detecting device according to an embodiment of the present application, the corpus detecting device is configured to perform the above-mentioned corpus detecting method based on a mask language model. The corpus detecting device may be configured in a terminal or a server.
As shown in fig. 7, the corpus detecting device 400 includes: a first training module 401, a second training module 402, an adjustment module 403, and a detection module 404.
A first training module 401, configured to input a corpus word to be trained into the generator for training, so as to obtain a probability distribution corresponding to the corpus word;
a second training module 402, configured to input the probability distribution to the discriminator for training, so as to obtain a prediction result corresponding to the probability distribution, where the prediction result includes whether the corpus word is replaced;
an adjusting module 403, configured to input a classification label into the discriminator according to the category of the corpus word, and adjust the prediction result based on the classification label and the corpus word by the discriminator to obtain a context vector;
a detecting module 404, configured to detect a state of the corpus word to be trained according to the context vector.
It should be noted that, as will be clear to those skilled in the art, for convenience and brevity of description, the specific working processes of the apparatus and the modules described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The apparatus described above may be implemented in the form of a computer program which is executable on a computer device as shown in fig. 8.
Referring to fig. 8, fig. 8 is a schematic block diagram of a computer device according to an embodiment of the present application. The computer device may be a server.
Referring to fig. 8, the computer device includes a processor, a memory, and a network interface connected through a system bus, wherein the memory may include a nonvolatile storage medium and an internal memory.
The non-volatile storage medium may store an operating system and a computer program. The computer program includes program instructions that, when executed, cause a processor to perform the corpus detection method based on the mask language model.
The processor is used for providing calculation and control capability and supporting the operation of the whole computer equipment.
The internal memory provides an environment for running a computer program in the non-volatile storage medium, and the computer program, when executed by the processor, causes the processor to perform any one of the corpus detection methods based on the mask language model.
The network interface is used for network communication, such as sending assigned tasks and the like. Those skilled in the art will appreciate that the architecture shown in fig. 8 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
It should be understood that the Processor may be a Central Processing Unit (CPU), and the Processor may be other general purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, etc. Wherein a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
Wherein, in one embodiment, the processor is configured to execute a computer program stored in the memory to implement the steps of:
inputting the corpus words to be trained into the generator for training to obtain the probability distribution corresponding to the corpus words;
inputting the probability distribution into the discriminator for training to obtain a prediction result corresponding to the probability distribution, wherein the prediction result comprises whether the corpus word is replaced or not;
inputting a classification label in the discriminator according to the category of the corpus word, and adjusting the prediction result through the discriminator based on the classification label and the corpus word to obtain a context vector;
and detecting the state of the corpus word to be trained according to the context vector.
In some embodiments, the inputting, by the processor, the corpus word to be trained into the generator for training to obtain the probability distribution corresponding to the corpus word includes:
inputting a word vector corresponding to the corpus word to be trained into a generator, wherein the word vector comprises a word dimension, a sentence dimension and a position dimension;
inputting, through the generator, the word vectors obtained by superposing the word dimension, sentence dimension and position dimension into an encoder of the generator for encoding to obtain word vectors of all dimensions, wherein the encoder comprises a plurality of encoder layers;
and randomly replacing part of words in the dimension word vectors according to a preset replacement rule, and obtaining the probability distribution corresponding to each dimension word vector.
In some embodiments, when the category of the corpus word to be trained is the log file category, before the processor inputs the corpus word to be trained into the generator for training to obtain the probability distribution corresponding to the corpus word, the processor implements:
and preprocessing the log file to be trained.
In some embodiments, after the processor inputs the corpus word to be trained into the generator for training to obtain the probability distribution corresponding to the corpus word, the method includes:
a loss function of the generator is calculated and the generator is adjusted according to the loss function of the generator.
In some embodiments, after the processor implements the detecting the state of the corpus word to be trained according to the context vector, the processor comprises:
calculating a loss function of the discriminator, and adjusting the discriminator according to the loss function of the discriminator.
In some embodiments, the processor further implements:
and superposing the loss function of the generator and the loss function of the discriminator to obtain a total loss function so as to adjust the mask language model.
In some embodiments, the inputting, by the processor, the probability distribution to the discriminator for training to obtain a prediction result corresponding to the probability distribution, where the prediction result includes whether the corpus word is replaced, includes:
and replacing the word vectors corresponding to the probability distribution according to a preset replacement probability through the discriminator, so as to predict whether the word vectors corresponding to the probability distribution are replaced or not and obtain a prediction result, wherein the prediction result is stored in the blockchain node.
In some embodiments, the processor further implements a classification label input in the discriminator according to the category of the corpus word, and the obtaining, by the discriminator, a context vector by adjusting the prediction result based on the classification label and the corpus word includes:
in the discriminator, replacing the first word corresponding to the log file with a classification label;
inputting all words corresponding to the log file into an encoder for training, inputting vectors corresponding to classification labels into a two-class neural network for training, and outputting context vectors, wherein the first position of each context vector corresponds to the classification label;
the detecting the state of the corpus word to be trained according to the context vector includes:
and judging the abnormal condition of the log file according to the first position of the context vector.
The embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, where the computer program includes program instructions, and the processor executes the program instructions to implement any one of the corpus detection methods based on the mask language model provided in the embodiment of the present application.
The computer-readable storage medium may be an internal storage unit of the computer device described in the foregoing embodiment, for example, a hard disk or a memory of the computer device. The computer readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the computer device.
Further, the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the blockchain node, and the like.
The blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks associated by cryptographic methods, each data block containing the information of a batch of network transactions, used to verify the validity (anti-counterfeiting) of the information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
While the invention has been described with reference to specific embodiments, the scope of the invention is not limited thereto, and those skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the invention. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A corpus detection method based on a mask language model, wherein the method is applied to a mask language model, and the mask language model comprises a generator and a discriminator; the method comprises the following steps:
inputting the corpus words to be trained into the generator for training to obtain the probability distribution corresponding to the corpus words;
inputting the probability distribution into the discriminator for training to obtain a prediction result corresponding to the probability distribution, wherein the prediction result comprises whether the corpus word is replaced or not;
inputting a classification label in the discriminator according to the category of the corpus word, and adjusting the prediction result through the discriminator based on the classification label and the corpus word to obtain a context vector;
and detecting the state of the corpus word to be trained according to the context vector.
2. The method according to claim 1, wherein the inputting the corpus word to be trained into the generator for training to obtain the probability distribution corresponding to the corpus word comprises:
inputting a word vector corresponding to the corpus word to be trained into a generator, wherein the word vector comprises a word dimension, a sentence dimension and a position dimension;
inputting, through the generator, the word vectors obtained by superposing the word dimension, sentence dimension and position dimension into an encoder of the generator for encoding to obtain word vectors of all dimensions, wherein the encoder comprises a plurality of encoder layers;
and randomly replacing part of words in the dimension word vectors according to a preset replacement rule to obtain the probability distribution corresponding to each dimension word vector.
3. The method according to claim 1, wherein the category of the corpus word to be trained is a log file category, and before the corpus word to be trained is input into the generator for training and a probability distribution corresponding to the corpus word is obtained, the method comprises:
and preprocessing the log file to be trained.
4. The method according to claim 1, wherein after the corpus words to be trained are input into the generator for training, and a probability distribution corresponding to the corpus words is obtained, the method further comprises:
calculating a loss function of the generator and adjusting the generator according to the loss function of the generator;
after the detecting the state of the corpus word to be trained according to the context vector, the method further includes:
calculating a loss function of the discriminator, and adjusting the discriminator according to the loss function of the discriminator.
5. The method of claim 4, further comprising:
and superposing the loss function of the generator and the loss function of the discriminator to obtain a total loss function so as to adjust the mask language model.
6. The method according to claim 1, wherein the inputting the probability distribution into the discriminator for training to obtain a prediction result corresponding to the probability distribution, the prediction result including whether the corpus word is replaced comprises:
and replacing the word vectors corresponding to the probability distribution according to a preset replacement probability through the discriminator, so as to predict whether the word vectors corresponding to the probability distribution are replaced or not and obtain a prediction result, wherein the prediction result is stored in a blockchain.
7. The method according to claim 1, wherein the class of the corpus word to be trained is a log file class, the classifying label is input into the discriminator according to the class of the corpus word, and the context vector is obtained by the discriminator by adjusting the prediction result based on the classifying label and the corpus word, comprising:
in the discriminator, replacing the first word corresponding to the log file with a classification label;
inputting all words corresponding to the log file into an encoder for training, inputting vectors corresponding to classification labels into a two-class neural network for training, and outputting context vectors, wherein the first position of each context vector corresponds to the classification label;
the detecting the state of the corpus word to be trained according to the context vector includes:
and judging the abnormal condition of the log file according to the first position of the context vector.
8. A corpus detecting device, comprising:
the first training module is used for inputting the corpus words to be trained into the generator for training to obtain the probability distribution corresponding to the corpus words;
the second training module is used for inputting the probability distribution to the discriminator for training to obtain a prediction result corresponding to the probability distribution, wherein the prediction result comprises whether the corpus words are replaced or not;
the adjusting module is used for inputting a classification label into the discriminator according to the category of the corpus words, and adjusting the prediction result through the discriminator based on the classification label and the corpus words to obtain a context vector;
and the detection module is used for detecting the state of the corpus word to be trained according to the context vector.
9. A computer device, wherein the computer device comprises a memory and a processor;
the memory is used for storing a computer program;
the processor, configured to execute the computer program and when executing the computer program, implement the mask language model-based corpus detection method according to any one of claims 1 to 7.
10. A computer-readable storage medium, wherein a computer program is stored, and when executed by a processor, causes the processor to implement the mask language model-based corpus detection method according to any one of claims 1 to 7.
CN202010888877.4A 2020-08-28 2020-08-28 Corpus detection method, device, equipment and medium based on mask language model Active CN112069795B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010888877.4A CN112069795B (en) 2020-08-28 2020-08-28 Corpus detection method, device, equipment and medium based on mask language model
PCT/CN2020/117434 WO2021151292A1 (en) 2020-08-28 2020-09-24 Corpus monitoring method based on mask language model, corpus monitoring apparatus, device, and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010888877.4A CN112069795B (en) 2020-08-28 2020-08-28 Corpus detection method, device, equipment and medium based on mask language model

Publications (2)

Publication Number Publication Date
CN112069795A true CN112069795A (en) 2020-12-11
CN112069795B CN112069795B (en) 2023-05-30

Family

ID=73660536

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010888877.4A Active CN112069795B (en) 2020-08-28 2020-08-28 Corpus detection method, device, equipment and medium based on mask language model

Country Status (2)

Country Link
CN (1) CN112069795B (en)
WO (1) WO2021151292A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113011177A (en) * 2021-03-15 2021-06-22 北京百度网讯科技有限公司 Model training and word vector determination methods, apparatus, devices, media and products
CN113094482A (en) * 2021-03-29 2021-07-09 中国地质大学(北京) Lightweight semantic intelligent service adaptation training evolution method and system
CN117332038A (en) * 2023-09-19 2024-01-02 鹏城实验室 Text information detection method, device, equipment and storage medium
CN117786104A (en) * 2023-11-17 2024-03-29 中信建投证券股份有限公司 Model training method and device, electronic equipment and storage medium

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113642312A (en) * 2021-08-19 2021-11-12 平安医疗健康管理股份有限公司 Physical examination data processing method, physical examination data processing device, physical examination equipment and storage medium
CN113657104A (en) * 2021-08-31 2021-11-16 平安医疗健康管理股份有限公司 Text extraction method and device, computer equipment and storage medium
CN114049662B (en) * 2021-10-18 2024-05-28 天津大学 Facial feature transfer learning-based expression recognition network device and method
CN114723073B (en) * 2022-06-07 2023-09-05 阿里健康科技(杭州)有限公司 Language model pre-training method, product searching method, device and computer equipment
CN114936327B (en) * 2022-07-22 2022-10-28 腾讯科技(深圳)有限公司 Element recognition model acquisition method and device, computer equipment and storage medium
CN115495314A (en) * 2022-09-30 2022-12-20 中国电信股份有限公司 Log template identification method and device, electronic equipment and readable medium
CN116662579B (en) * 2023-08-02 2024-01-26 腾讯科技(深圳)有限公司 Data processing method, device, computer and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110137653A1 (en) * 2009-12-04 2011-06-09 At&T Intellectual Property I, L.P. System and method for restricting large language models
CN108009628A (en) * 2017-10-30 2018-05-08 杭州电子科技大学 A kind of method for detecting abnormality based on generation confrontation network
CN108734276A (en) * 2018-04-28 2018-11-02 同济大学 A kind of learning by imitation dialogue generation method generating network based on confrontation
CN110196894A (en) * 2019-05-30 2019-09-03 北京百度网讯科技有限公司 The training method and prediction technique of language model
CN111028206A (en) * 2019-11-21 2020-04-17 万达信息股份有限公司 Prostate cancer automatic detection and classification system based on deep learning
CN111241291A (en) * 2020-04-24 2020-06-05 支付宝(杭州)信息技术有限公司 Method and device for generating countermeasure sample by utilizing countermeasure generation network
CN111414772A (en) * 2020-03-12 2020-07-14 北京小米松果电子有限公司 Machine translation method, device and medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111539223B (en) * 2020-05-29 2023-08-18 北京百度网讯科技有限公司 Language model training method and device, electronic equipment and readable storage medium

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110137653A1 (en) * 2009-12-04 2011-06-09 At&T Intellectual Property I, L.P. System and method for restricting large language models
CN108009628A (en) * 2017-10-30 2018-05-08 杭州电子科技大学 A kind of method for detecting abnormality based on generation confrontation network
CN108734276A (en) * 2018-04-28 2018-11-02 同济大学 A kind of learning by imitation dialogue generation method generating network based on confrontation
CN110196894A (en) * 2019-05-30 2019-09-03 北京百度网讯科技有限公司 The training method and prediction technique of language model
CN111028206A (en) * 2019-11-21 2020-04-17 万达信息股份有限公司 Prostate cancer automatic detection and classification system based on deep learning
CN111414772A (en) * 2020-03-12 2020-07-14 北京小米松果电子有限公司 Machine translation method, device and medium
CN111241291A (en) * 2020-04-24 2020-06-05 支付宝(杭州)信息技术有限公司 Method and device for generating countermeasure sample by utilizing countermeasure generation network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Kevin Clark et al.: "ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators", arXiv:2003.10555v1 *
任璐; 杨亮; 徐琳宏; 樊小超; 刁宇峰; 林鸿飞: "Construction and Application of a Chinese Joke Corpus" (中文笑话语料库的构建与应用), Journal of Chinese Information Processing (中文信息学报)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113011177A (en) * 2021-03-15 2021-06-22 北京百度网讯科技有限公司 Model training and word vector determination methods, apparatus, devices, media and products
CN113011177B (en) * 2021-03-15 2023-09-29 北京百度网讯科技有限公司 Model training and word vector determining method, device, equipment, medium and product
CN113094482A (en) * 2021-03-29 2021-07-09 中国地质大学(北京) Lightweight semantic intelligent service adaptation training evolution method and system
CN113094482B (en) * 2021-03-29 2023-10-17 中国地质大学(北京) Lightweight semantic intelligent service adaptation training evolution method and system
CN117332038A (en) * 2023-09-19 2024-01-02 鹏城实验室 Text information detection method, device, equipment and storage medium
CN117332038B (en) * 2023-09-19 2024-07-02 鹏城实验室 Text information detection method, device, equipment and storage medium
CN117786104A (en) * 2023-11-17 2024-03-29 中信建投证券股份有限公司 Model training method and device, electronic equipment and storage medium
CN117786104B (en) * 2023-11-17 2024-06-21 中信建投证券股份有限公司 Model training method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
WO2021151292A1 (en) 2021-08-05
CN112069795B (en) 2023-05-30

Similar Documents

Publication Publication Date Title
CN112069795B (en) Corpus detection method, device, equipment and medium based on mask language model
Nedelkoski et al. Self-attentive classification-based anomaly detection in unstructured logs
Mehdiyev et al. A multi-stage deep learning approach for business process event prediction
Chang et al. A hybrid system integrating a wavelet and TSK fuzzy rules for stock price forecasting
CN109657947B (en) Enterprise industry classification-oriented anomaly detection method
US11763091B2 (en) Automated content tagging with latent dirichlet allocation of contextual word embeddings
Gomes et al. BERT-and TF-IDF-based feature extraction for long-lived bug prediction in FLOSS: a comparative study
WO2020197757A1 (en) Data-driven deep learning model generalization analysis and improvement
CN113590451B (en) Root cause positioning method, operation and maintenance server and storage medium
CN110658905B (en) Early warning method, early warning system and early warning device for equipment operation state
CN116402630B (en) Financial risk prediction method and system based on characterization learning
Doğan Analysis of the relationship between LSTM network traffic flow prediction performance and statistical characteristics of standard and nonstandard data
CN113821418A (en) Fault tracking analysis method and device, storage medium and electronic equipment
CN110472231B (en) Method and device for identifying legal document case
CN111709225A (en) Event cause and effect relationship judging method and device and computer readable storage medium
CN114881173A (en) Resume classification method and device based on self-attention mechanism
Hu et al. Estimate remaining useful life for predictive railways maintenance based on LSTM autoencoder
CN114357171A (en) Emergency event processing method and device, storage medium and electronic equipment
Chang et al. An ensemble of neural networks for stock trading decision making
Bernardelli et al. The BeMi stardust: a structured ensemble of binarized neural networks
US20220164705A1 (en) Method and apparatus for providing information based on machine learning
CN114579761A (en) Information security knowledge entity relation connection prediction method, system and medium
CN113919357A (en) Method, device and equipment for training address entity recognition model and storage medium
Li et al. LogPS: A robust log sequential anomaly detection approach based on natural language processing
Verenich et al. Tell me what’s ahead? predicting remaining activity sequences of business process instances

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40040160

Country of ref document: HK

SE01 Entry into force of request for substantive examination
GR01 Patent grant