CN115587007A - RoBERTa-based weblog security detection method and system - Google Patents

RoBERTa-based weblog security detection method and system

Info

Publication number
CN115587007A
CN115587007A
Authority
CN
China
Prior art keywords
weblog
log
roberta
model
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211178487.3A
Other languages
Chinese (zh)
Inventor
宋厚营
张铭伦
尹雷
陈浩
臧磊
王瑞
刘景雯
陈境宇
李琦
赵厚凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lianyungang Power Supply Co of State Grid Jiangsu Electric Power Co Ltd
Original Assignee
Lianyungang Power Supply Co of State Grid Jiangsu Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lianyungang Power Supply Co of State Grid Jiangsu Electric Power Co Ltd filed Critical Lianyungang Power Supply Co of State Grid Jiangsu Electric Power Co Ltd
Priority to CN202211178487.3A priority Critical patent/CN115587007A/en
Publication of CN115587007A publication Critical patent/CN115587007A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/3065Monitoring arrangements determined by the means or processing involved in reporting the monitored data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/3089Monitoring arrangements determined by the means or processing involved in sensing the monitored data, e.g. interfaces, connectors, sensors, probes, agents
    • G06F11/3093Configuration details thereof, e.g. installation, enabling, spatial arrangement of the probes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Quality & Reliability (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a RoBERTa-based weblog security detection method and system, wherein the method comprises the following steps: acquiring tagged weblog data sets of all network devices; preprocessing the tagged weblog data; constructing a RoBERTa model and training the RoBERTa model on the tagged weblog data set, wherein the RoBERTa model adopts a bidirectional Transformer network structure as an encoder and adopts a Softmax classifier to obtain the risk probability of the log; screening an optimal model through a dropout function; and inputting the tagged weblog data into the optimal RoBERTa model to obtain the risk probability of the log. The invention can process logs of unknown types and formats and improves the accuracy of weblog security detection.

Description

RoBERTa-based weblog security detection method and system
Technical Field
The invention relates to the technical field of network security, and in particular to a RoBERTa-based weblog security detection method and system.
Background
Weblog data are very important to network administrators because they record every event that occurs in the network, including system errors, alarms, and packet delivery status. Effectively analyzing large volumes of heterogeneous log data makes it possible to identify issues and prevent future cyber attacks before they cause damage; however, processing such log data presents challenges in volume, speed, and accuracy. The invention simplifies advanced network-attack detection through a RoBERTa model. By learning network attack behavior and cross-validating with a log analysis system, the characteristics of various network attacks can be learned by the model.
The weblog includes various types of messages, from severe failures to normal console logs. Log messages are typically composed of three components: a timestamp, a host identifier (e.g., an IP address), and a message. The format of the log message depends on the vendor or service and there is no uniform description rule. This is why it is very time consuming to describe regular expressions and to define new alert rules for each message.
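As a concrete illustration of the three components above, the following is a minimal Python sketch of splitting one syslog-style line into timestamp, host identifier, and message. The regular expression and the sample line are illustrative assumptions; real vendor formats vary, which is exactly why per-message rules are so time-consuming to maintain.

```python
import re

# Hypothetical pattern for a BSD-syslog-style line: "MMM DD HH:MM:SS host message".
# Real vendors deviate from this; each format generally needs its own pattern.
LOG_PATTERN = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+"
    r"(?P<message>.*)$"
)

def parse_log_line(line: str) -> dict:
    """Return {'timestamp', 'host', 'message'} or raise ValueError."""
    m = LOG_PATTERN.match(line)
    if m is None:
        raise ValueError(f"unrecognized log format: {line!r}")
    return m.groupdict()

# Illustrative sample line, not from any real device.
record = parse_log_line("Oct 11 22:14:15 192.168.1.1 interface eth0 link down")
```

A line from a different vendor would simply fail to match, which motivates the format-agnostic, model-based approach the patent proposes.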
The syslog protocol is widely used in industry as the standard for message delivery over the internet protocol, mainly for network information management and security auditing. Syslog messages have a partially structured format, so a log server can receive them directly and parse their content, which simplifies analysis.
The current syslog ecosystem nevertheless has numerous drawbacks: there is no strict format control, so operation and maintenance engineers must master a large amount of specialist knowledge; log warning levels follow no unified standard, so effective correlation analysis cannot be carried out. For network operation and maintenance engineers, a log processing method that is simple to operate and demands little specialist knowledge is therefore urgently needed.
Logs have become an important information resource generated by current information systems. The log-based anomaly detection technology can effectively find out the security problems existing in the system, explore potential security threats and become a hotspot of the current invention. With the development and popularization of artificial intelligence technology, more and more related invention achievements have been applied to log-based anomaly detection. The log-based anomaly detection method comprises the steps of log collection, log analysis, feature extraction, anomaly detection and the like. The log analysis and the anomaly detection are core parts and are also the contents of the important discussion of the patent.
Currently, log parsing has evolved from hand-written regular expressions to automated methods, mainly including code analysis, machine learning, and natural language processing. Log-based anomaly detection methods are mainly classified into supervised learning, unsupervised learning, deep learning, and so on. Most anomaly detection methods perform offline analysis on specific scenes and data sets, and practical methods with generality and high accuracy are lacking. When the number of samples is small, the model cannot always achieve the best detection performance; obtaining an ideal model requires many iterations of training on a large labeled data set, which consumes considerable manpower and material resources. Moreover, attacks are increasingly stealthy and their steps increasingly complex, and joint analysis of the logs of related devices can effectively uncover such latent attacks.
In summary, to solve these problems, the model must not only attend to a single log source but also combine different events and different devices for log analysis and subsequent anomaly detection; furthermore, machine-learning techniques should be extended to online detection, building a general and effective online log-based anomaly detection method and applying it in practice.
Disclosure of Invention
The invention aims to provide a RoBERTa-based weblog security detection method and system that can process logs of unknown types and formats and improve the accuracy of weblog security detection.
The technical solution for realizing the purpose of the invention is as follows: a RoBERTa-based weblog security detection method comprises the following steps:
acquiring tagged weblog data sets of all network devices;
preprocessing the tagged weblog data;
constructing a RoBERTa model and training the RoBERTa model on the tagged weblog data set, wherein the RoBERTa model adopts a bidirectional Transformer network structure as an encoder and adopts a Softmax classifier to obtain the risk probability of the log;
screening an optimal model through a dropout function;
and inputting the tagged weblog data into the optimal RoBERTa model to obtain the risk probability of the log.
Further, the RoBERTa model converts the input log data into 768-dimensional vectors.
Further, the BiLSTM of the RoBERTa model includes a forward LSTM and a backward LSTM.
Further, the Transformer block comprises a plurality of sub-layers, each comprising a multi-head self-attention mechanism and a fully connected feed-forward network, with a residual connection module and a normalization module added around each sub-layer.
Further, the multi-head self-attention mechanism applies multiple sets of linear transformations to the Query, Key, and Value vectors of each character, performs self-attention calculation for each set, and then splices all calculation results.
Further, the length of each of the Query vector, the Key vector and the Value vector is 64.
Further, the multi-head self-attention mechanism is modified by a scaling factor.
Further, the RoBERTa model adds [CLS] and [SEP] characters to the incoming log text data, divides the log text data into individual characters, and then stores the characters as a vocabulary, each character corresponding to a unique identifier.
Further, adding the [CLS] and [SEP] characters to the log text data specifically comprises: the 1st vector of each log text is the [CLS] flag, used for the downstream weblog classification task; the end-of-sentence vector is the [SEP] flag, which serves as a separator between different logs; and the log text data input to the RoBERTa model uses only one sentence vector.
A RoBERTa-based weblog security detection system comprises a data acquisition module, a log word-segmentation module, a weblog security detection module, a training module, and a database. The data acquisition module acquires device information and log files in the network environment and stores the acquired data in the database; the log word-segmentation module preprocesses the data; the weblog security detection module is based on a RoBERTa model that adopts a bidirectional Transformer network structure as an encoder and a Softmax classifier to obtain the risk probability of the logs; the training module trains the updated weblog security detection module and screens an optimal model through a dropout function; and the database stores the log data.
Compared with the prior art, the invention has the following beneficial effects: the method not only focuses on a single log source but also analyzes logs by combining different events and different devices, performing efficient and accurate detection through the constructed RoBERTa model; it processes logs of unknown types and formats, overcoming the weakness of conventional template-based methods on undefined log formats and improving the usability of the system and its operability for users; and it offers low cost, low resource consumption, and high execution efficiency.
Drawings
FIG. 1 is a block diagram of a log-based anomaly detection framework.
FIG. 2 is a flow chart for training the RoBERTa model.
Fig. 3 is a model overall architecture diagram.
FIG. 4 is a diagram showing a structure of a Transformer.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the description of the invention and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be understood that the term "and/or" as used in this patent merely describes an association between related objects, indicating that three relationships may exist; for example, A and/or B may indicate: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" in this patent generally indicates that the related objects before and after it are in an "or" relationship.
The invention can simplify an advanced network-attack detection model through the RoBERTa model (a robustly optimized BERT variant). By learning network attack behavior and cross-validating with a log analysis system, the characteristics of various network attacks can be learned by the model.
The RoBERTa-based weblog security detection system provided by this embodiment comprises a natural language processing component and a database. The natural language processing component comprises a word-segmentation system and the RoBERTa algorithm; the database comprises a classified word bank and a number of logs exported from network devices of different types, each log carrying a corresponding label; the database is associated with the natural language processing component; and the word-segmentation system uses a split function with the space character as the delimiter.
The natural language processing component classifies and parses syslog source data and device log files, determines the meaning contained in log sentences, and collects a number of sentences to train the database's language processing model. It trains on sentences according to a preset word-sense library to generate training words as keywords, produces parsing information for those keywords, and generates the language processing model from the keywords and their parsing information; the language processing model adopts the RoBERTa model with a bidirectional Transformer structure.
Preferably, the natural language processing component further comprises a data acquisition module, a log word-segmentation module, and a parsing module. The acquisition module receives basic information of a device source or caches training sentences in the database, converts each sentence into a 768-dimensional vector through the RoBERTa model, and classifies it according to predefined rules.
Preferably, the acquisition module includes a device determination module, a log acquisition module, and an association analysis module. The device determination module discovers device information in the network environment and stores the devices' basic information in the word-segmentation library of the database; the log acquisition module collects weblog files from the network devices monitored by the syslog server as the data source for RoBERTa model training, and also collects the logs to be analyzed.
Preferably, the system further includes a training module, and the training module is configured to update the RoBERTa model with the parsing information obtained from the parsing module, so as to update the model of the corresponding device.
Preferably, the devices include, but are not limited to, switches, servers, gateways, routers, and network security devices.
Preferably, the basic information of the device includes, but is not limited to, a device name, a device type, a device IP, and a manufacturer name.
The invention provides a RoBERTa-based weblog processing method with the following specific steps: the acquisition module acquires device information in the network environment, records its basic information as base sentences for sentence collection, and collects the devices' weblogs from the syslog server; the logs and device information are then fed into the trained RoBERTa model for analysis; finally, a judgment on whether the log presents a potential security risk is output.
A RoBERTa-based weblog security detection system is characterized by comprising:
a model building module for obtaining a weblog data set and building a log anomaly detection network model from the mapping vectors of all weblogs in the data set; and
a model analysis module for identifying abnormal logs in the weblog data set according to each user's network access characteristics in the user behavior network model and the access states of each node and path in that model.
Example 1
Referring to fig. 1, an abnormal user detection method based on a weblog according to an embodiment of the present invention includes:
step 1, firstly, various network equipment tagged weblog data sets are obtained.
And 2, after a weblog data set consisting of a large number of weblogs is obtained, because the data of the original logs are of dictionary type, the fields of user IP, request time, request method, request size, state code, request UL and the like are mainly contained. The request UL can be expressed as a set of request resource path and optional request parameters. And identifying abnormal users in the network log data set according to the network access characteristics of each user in the pre-training network model and the access states of each node and path in the user behavior network model. At this time, the collected network log is preprocessed by word segmentation, word stop and the like, the needed parameters are extracted from the original log, and the extracted parameters are processed into a data format which can be directly processed.
Step 3: the preprocessed log data set is processed with the pre-trained RoBERTa model, and the corresponding feature characterization vectors are extracted after the data pass through the bidirectional Transformer encoder network. RoBERTa is introduced as a pre-trained vectorized feature representation of text, using a bidirectional Transformer network structure as its encoder.
Step 4: the security polarity probability is calculated through Softmax normalization, and the probability that the log is risky is finally output. The extracted [CLS] feature vector is processed by a Softmax classifier, and comparing the resulting probabilities infers whether an abnormal condition exists. Logs sharing the same IP are treated as logs generated by the same user, and after the preprocessed log data are obtained, a user behavior network model is generated from the access relations among paths. For each user, a smaller behavior network model N1 can be constructed from that user's access logs, and a larger pre-training model N2 can then be constructed from the access logs of all users.
Specifically, after the anomaly detection model is constructed, each user and node in the user behavior network model is analyzed, using a single user's network access characteristics and the access states of each node and path as analysis indices, so that abnormal users and abnormal nodes are detected.
In this method, after the original log data are preprocessed, the needed data are extracted and a user behavior network model is built on them. During log analysis, the user's network access characteristics and the state of each node in the user behavior network model serve as analysis indices, so abnormal-log detection can be performed quantitatively on the log data, and fast, efficient, and scientific weblog analysis is achieved through cosine-similarity detection.
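The cosine-similarity detection mentioned above can be sketched in a few lines of Python: compare a user's access-feature vector against a baseline and flag it when the similarity falls below a threshold. The vectors, the baseline, and the threshold of 0.5 below are illustrative assumptions, not values from the patent.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def is_anomalous(vec, baseline, threshold=0.5):
    # Low similarity to the baseline behaviour => flag as anomalous.
    return cosine_similarity(vec, baseline) < threshold

# Toy access-feature vectors (e.g., normalized path-visit frequencies).
baseline = [0.9, 0.1, 0.0]   # typical behaviour of the user population
normal   = [0.8, 0.2, 0.1]   # close to the baseline
suspect  = [0.0, 0.1, 0.9]   # visits almost only unusual paths
```

In practice the feature vectors would come from the characterization vectors of step 3 rather than hand-built counts.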
Referring to fig. 2, the RoBERTa model training procedure includes:
and 5, artificially marking risk levels according to the acquired data set.
And 6, preprocessing the labeled data set and deleting irrelevant sequences.
Step 7: the data set is divided into a training set (80%) and a test set (20%), while ensuring that the proportions of the different risk types are consistent between the two sets.
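Keeping the risk-type proportions consistent, as step 7 requires, amounts to a stratified split: group the logs by label first, then split each group 80/20. A hedged sketch with toy data (the labels and samples are illustrative):

```python
import random
from collections import defaultdict

def stratified_split(samples, labels, train_frac=0.8, seed=0):
    """Split per label so each label keeps the same proportion in both sets."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for s, y in zip(samples, labels):
        by_label[y].append(s)
    train, test = [], []
    for y, group in by_label.items():
        rng.shuffle(group)
        cut = int(len(group) * train_frac)
        train += [(s, y) for s in group[:cut]]
        test += [(s, y) for s in group[cut:]]
    return train, test

# Toy data set: 5 high-risk and 5 low-risk logs.
samples = [f"log{i}" for i in range(10)]
labels = ["high"] * 5 + ["low"] * 5
train, test = stratified_split(samples, labels)
```

Each label contributes 4 logs to the training set and 1 to the test set, so both sets see the same label mix.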
Step 8: [CLS] and [SEP] characters are added to the data set. The 1st vector of each log is the [CLS] flag, usable for the downstream weblog classification task, and the tail [SEP] flag serves as a separator between different logs; since log classification is a sentence-level problem, i.e., the input is one sentence, only one sentence vector is used.
Step 9: the logs are converted into 768-dimensional vectors with the pre-trained RoBERTa model; each log comprises several word-embedding vectors, which are static encodings of each word.
Step 10: the high-dimensional vectors are fed into the BiLSTM network for training, which outputs characterization vectors.
Step 11: the [CLS] vector is passed through a Softmax normalization function to compute the probabilities of the different classes.
Step 12: an optimal model is screened through a dropout function.
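The Softmax normalization of step 11 turns the class scores of the [CLS] vector into probabilities that sum to one. A minimal sketch; the two logits below (normal vs. risky) and the class order are illustrative assumptions:

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical class scores from the [CLS] vector: [normal, risky].
cls_logits = [1.2, 3.4]
probs = softmax(cls_logits)
risk_probability = probs[1]
```

The max-subtraction does not change the result but prevents overflow for large logits, which matters once real model outputs are fed in.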
Referring to fig. 3, the RoBERTa model uses stacked Transformer Encoder blocks because it is a typical bidirectional encoding model. In a Transformer block, the data first pass through a multi-head attention module to obtain a weighted feature vector. In the attention mechanism, each character has three different vectors: a Query vector (Q), a Key vector (K), and a Value vector (V), each of length 64. The model specifically comprises a word-segmentation input layer 19, a division layer 18, a Transformer encoder block 17, a vector output layer 16, a forward LSTM 15, a backward LSTM 14, and a hidden layer 13, connected in sequence. Specifically:
Unlike a simple weighted recombination of RoBERTa and BiLSTM outputs, the hidden layer 13 of the RoBERTa model uses the character vectors generated by RoBERTa as the character embedding layer in the upstream part and uses the BiLSTM as the feature extractor that models and mines the weblog in the downstream part. RoBERTa dynamically constructs the character-vector representation, while the BiLSTM integrates the textual information and the sequential nature of the sentence. Combining the two yields richer semantic features and a more accurate semantic representation.
Backward LSTM 14: thanks to its linear structure, an LSTM easily captures the sequential information of text, but a unidirectional LSTM cannot encode information from back to front, whereas the BiLSTM can.
Forward LSTM 15: the BiLSTM of the RoBERTa model is the combination of the forward LSTM 15 and the backward LSTM 14. With the character-vector representations from the RoBERTa output layer as input, the BiLSTM serves as a language model with long-term sequence information and produces the log mining results.
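The combination step described above can be sketched compactly: run one recurrence left-to-right, another right-to-left, and concatenate the hidden states per position. The recurrence below is a deliberately fake stand-in (a damped running sum) so the sketch stays short; real LSTM gates are omitted and all numbers are illustrative.

```python
def fake_lstm(seq):
    """Stand-in for a unidirectional LSTM: h_t = 0.5 * h_{t-1} + x_t."""
    h, out = 0.0, []
    for x in seq:
        h = 0.5 * h + x
        out.append(h)
    return out

def bilstm(seq):
    forward = fake_lstm(seq)
    # Run the same recurrence on the reversed sequence, then re-reverse
    # so backward[i] corresponds to position i.
    backward = list(reversed(fake_lstm(list(reversed(seq)))))
    # Per-position concatenation of the two directions.
    return [(f, b) for f, b in zip(forward, backward)]

states = bilstm([1.0, 2.0, 3.0])
```

Note how the backward state at the last position depends only on the last input, mirroring how the forward state at the first position depends only on the first input: each direction sees the sequence from its own end.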
The vector output layer 16 of the RoBERTa model outputs a high-dimensional token vector.
The concrete structure of the Transformer encoder block 17 of the RoBERTa model is shown in fig. 4. The Transformer is the core structure of RoBERTa; each Trm in the RoBERTa structure corresponds to one Transformer block. Each Transformer block consists of two sub-layers, namely a multi-head self-attention mechanism and a fully connected feed-forward network, with residual connection and normalization added for each sub-layer. The Transformer is a model proposed by Google to solve the Seq2Seq problem: it replaces the LSTM structure entirely with attention, exploiting the attention mechanism to the fullest and making substantial progress in machine translation. Because the Transformer does not use an LSTM structure to model sequential information, a position-embedding layer is added during encoding to compensate for the self-attention mechanism's inability to capture word order. The Transformer Encoder block 17 comprises an input module 25, a weighting module 24, a multi-head attention mechanism module 23, a second residual module 22, an Encoder module 21, and a first residual module 20, connected stage by stage. Specifically:
the first residual block 20 of the transform coder block, the residual is added, and the residual is introduced to solve the network degradation caused by the depth, and in practice, many experiments show that the residual contributes to the network performance greatly. Norm is a normalization module, which refers to normalization processing of output values.
The Encoder module 21 part of the Transformer model is formed by overlapping N sublayers with the same structure. Each sub-layer is a combination of two parts of network structures, namely a Multi-Head self attention network (MSAN) and a fully connected feedforward neural network (FFN). And, for each layer output, the degradation problem in the multi-layer network stacking process is processed through layer normalization and residual concatenation.
The second residual module 22 is identical to the first residual module 20.
The multi-head attention mechanism module 23 expands the feature representation space by performing attention calculations on the input in different subspaces, achieving deep modeling of the text. It is implemented by applying multiple sets of linear transformations to Q, K, and V, performing self-attention calculation for each set, and then splicing all results. The multi-head self-attention network in the encoder adds a scaling factor and multiple heads to the traditional self-attention mechanism. The scaling factor corrects the traditional attention calculation, mitigating overly large results and vanishing gradients caused by high dimensionality.
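The splicing step just described — run each head in its own subspace, then concatenate the per-head outputs into one vector — can be shown in isolation. To keep the sketch short, each head's attention is stubbed as a uniform average of its value vectors; the head count and dimensions are illustrative assumptions.

```python
def uniform_attention_head(V):
    """Stand-in for one head's self-attention: average the value vectors."""
    n = len(V)
    return [sum(v[j] for v in V) / n for j in range(len(V[0]))]

def multi_head(V_per_head):
    """Run every head, then splice (concatenate) their outputs."""
    outputs = [uniform_attention_head(V) for V in V_per_head]
    return [x for out in outputs for x in out]

# Two toy heads, each with two 2-dimensional value vectors.
heads = [
    [[1.0, 2.0], [3.0, 4.0]],   # head 1 value vectors
    [[5.0, 6.0], [7.0, 8.0]],   # head 2 value vectors
]
spliced = multi_head(heads)     # length = num_heads * head_dim
```

The output dimension is the number of heads times the per-head dimension, which is why 12 heads of length 64 recombine into the model's 768-dimensional vectors.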
In the weighting module 24 of the Transformer block, the data first pass through the multi-head attention module to obtain the weighted feature vector. In the attention mechanism, each character has three different vectors: a Query vector (Q), a Key vector (K), and a Value vector (V), each of length 64.
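The weighting described above is scaled dot-product attention: each query is scored against every key, the scores are divided by the square root of the key dimension (the scaling factor mentioned earlier), softmax turns them into weights, and the value vectors are averaged with those weights. A self-contained sketch on a toy 2-token sequence with d_k = 4 rather than 64; all numbers are illustrative.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, on plain lists."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Score this query against every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

Q = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
K = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
V = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
attended = scaled_dot_product_attention(Q, K, V)
```

Each query ends up pulled toward the value vector whose key it matches, which is the "weighted feature vector" the text refers to.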
The input module 25 implements input word embedding.
Division layer 18: after word embedding, position embedding, and segment embedding produce the high-dimensional vectors, the division layer 18 splits the input text data into individual characters (E1, E2, …, En) and stores them as a vocabulary, i.e., each character has a corresponding unique identifier (denoted [character: ID]).
Word-segmentation input layer 19: receives the log data after preprocessing such as word segmentation and the addition of the [CLS] and [SEP] characters.
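The preprocessing described for layers 18 and 19 can be sketched end to end: wrap the log text in [CLS]/[SEP], split it into characters, and build the [character: ID] vocabulary. This is a hedged illustration only; the tokenizer of a real RoBERTa model is considerably more elaborate, and the sample log text is an assumption.

```python
def tokenize_log(text: str):
    """Prepend [CLS], append [SEP], split the text into characters."""
    return ["[CLS]"] + list(text) + ["[SEP]"]

def build_vocab(tokens):
    """Assign each distinct character a unique identifier: [character: ID]."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

tokens = tokenize_log("login failed")          # illustrative log text
vocab = build_vocab(tokens)
ids = [vocab[t] for t in tokens]               # the IDs fed to the model
```

Downstream, it is the embedding of the [CLS] position in `ids` that the Softmax classifier reads to score the whole log.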
A RoBERTa-based weblog security detection system comprises a data acquisition module, a log word-segmentation module, a weblog security detection module, a training module, and a database. The data acquisition module acquires device information and log files in the network environment and stores the acquired data in the database; the log word-segmentation module preprocesses the data; the weblog security detection module is based on a RoBERTa model that adopts a bidirectional Transformer network structure as an encoder and a Softmax classifier to obtain the risk probability of the logs; the training module trains the updated weblog security detection module and screens an optimal model through a dropout function; and the database stores the log data. The system embodies all technical features of the abnormal-user detection method above.
It should be noted that, since RoBERTa is a well-known model, this embodiment elaborates only the points of innovation; common general knowledge in the art that is not elaborated will not be reiterated.

Claims (10)

1. A RoBERTa-based weblog security detection method, characterized by comprising the following steps:
acquiring tagged weblog data sets of all network devices;
preprocessing the tagged weblog data;
constructing a RoBERTa model and training the RoBERTa model on the tagged weblog data set, wherein the RoBERTa model adopts a bidirectional Transformer network structure as an encoder and adopts a Softmax classifier to obtain the risk probability of the log;
screening an optimal model through a dropout function;
and inputting the tagged weblog data into the optimal RoBERTa model to obtain the risk probability of the log.
2. The weblog security detection method of claim 1, wherein the RoBERTa model converts incoming log data into 768-dimensional vectors.
3. The weblog security detection method of claim 1, wherein the BilTM of the RoBERTA model comprises forward LSTM and backward LSTM.
4. The method of claim 1, wherein the Transformer block comprises a plurality of sub-layers, each sub-layer comprising a multi-head self-attention mechanism and a fully connected feed-forward network, with a residual connection module and a normalization module added between every two sub-layers.
5. The weblog security detection method of claim 4, wherein the multi-head self-attention mechanism applies multiple sets of linear transformations to the Query, Key and Value vectors of each character, performs a self-attention calculation for each set separately, and then concatenates all calculation results.
6. The weblog security detection method of claim 1, wherein the Query, Key and Value vectors each have a length of 64.
7. The weblog security detection method of claim 1, wherein the multi-head self-attention mechanism is modified using a scaling factor.
8. The weblog security detection method of claim 1, wherein the RoBERTa model adds [CLS] and [SEP] characters to the incoming log text data, divides the log text data into individual characters, and stores these characters as a vocabulary in which each character corresponds to a unique identifier.
9. The weblog security detection method of claim 1, wherein adding the [CLS] and [SEP] characters to the log text data is specifically: the first vector of each piece of log text data is the [CLS] flag, used for the downstream weblog classification task; the sentence-end vector is the [SEP] flag, which serves as a separator between different logs; and the log text data input to the RoBERTa model uses only one sentence vector.
10. A weblog security detection system based on RoBERTa, characterized by comprising a data acquisition module, a log word segmentation module, a weblog security detection module, a training module and a database, wherein the data acquisition module is used for acquiring the device information and log files of each device in a network environment and storing the acquired data in the database; the log word segmentation module is used for preprocessing the data; the weblog security detection module is based on a RoBERTa model, which adopts a bidirectional Transformer network structure as an encoder and adopts a Softmax classifier to obtain the risk probability of the logs; the training module is used for training the updated weblog security detection module and screening an optimal model through a dropout function; and the database is used for storing the log data.
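Claims 4–7 describe the standard scaled dot-product multi-head self-attention: per-head linear projections of the input into Query/Key/Value vectors (length 64 per claim 6), a softmax over scores scaled by 1/sqrt(d_k) (the scaling factor of claim 7), and concatenation of the head outputs (claim 5). The sketch below, with random toy weights, shows the mechanism in NumPy; the head count, sequence length and weight initialization are illustrative assumptions, not values taken from the patent.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.
    The 1/sqrt(d_k) term is the scaling factor referred to in claim 7."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # numerically stable row-wise softmax
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def multi_head_attention(X, num_heads=2, d_head=64, seed=0):
    """Per claim 5: several independent linear projections into Query/Key/
    Value vectors (length 64 per claim 6), per-head self-attention, then
    concatenation of all head outputs."""
    rng = np.random.default_rng(seed)
    d_model = X.shape[-1]
    heads = []
    for _ in range(num_heads):
        Wq = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
        Wk = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
        Wv = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
        heads.append(scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv))
    return np.concatenate(heads, axis=-1)  # (seq_len, num_heads * d_head)

# Toy sequence: 5 character embeddings of dimension 768 (claim 2).
X = np.random.default_rng(1).standard_normal((5, 768))
out = multi_head_attention(X)
```

Without the 1/sqrt(d_k) scaling, the dot products grow with the vector length and push the softmax into regions with vanishing gradients, which is why the scaling factor of claim 7 is applied.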
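Claims 8–9 describe the input preparation: each log is split into individual characters, a vocabulary maps every character to a unique identifier, and the sequence is wrapped with a leading [CLS] token (used for the downstream classification) and a trailing [SEP] separator. A minimal sketch of that scheme follows; the sample log lines and function names are hypothetical, chosen only to illustrate the claimed tokenization.

```python
def build_vocab(logs):
    """Character-level vocabulary: each distinct character (plus the
    special [CLS]/[SEP] tokens) is mapped to a unique integer id."""
    vocab = {"[CLS]": 0, "[SEP]": 1}
    for log in logs:
        for ch in log:
            vocab.setdefault(ch, len(vocab))
    return vocab

def encode(log, vocab):
    """Per claim 9: the first token of every log is [CLS] and the log ends
    with the [SEP] separator; the model input is a single sentence vector."""
    tokens = ["[CLS]"] + list(log) + ["[SEP]"]
    return [vocab[t] for t in tokens]

# Toy log lines standing in for preprocessed weblog entries.
logs = ["GET /index HTTP/1.1 200", "POST /login HTTP/1.1 403"]
vocab = build_vocab(logs)
ids = encode(logs[0], vocab)
```

The id sequence `ids` is what would be fed to the embedding layer of the RoBERTa encoder; since only one sentence is used per input, no second [SEP]-delimited segment is needed.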
CN202211178487.3A 2022-09-26 2022-09-26 RoBERTa-based weblog security detection method and system Pending CN115587007A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211178487.3A CN115587007A (en) 2022-09-26 2022-09-26 RoBERTa-based weblog security detection method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211178487.3A CN115587007A (en) 2022-09-26 2022-09-26 RoBERTa-based weblog security detection method and system

Publications (1)

Publication Number Publication Date
CN115587007A true CN115587007A (en) 2023-01-10

Family

ID=84773028

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211178487.3A Pending CN115587007A (en) 2022-09-26 2022-09-26 RoBERTa-based weblog security detection method and system

Country Status (1)

Country Link
CN (1) CN115587007A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116915511A (en) * 2023-09-13 2023-10-20 中移(苏州)软件技术有限公司 Information processing method, device, equipment and storage medium
CN116915511B (en) * 2023-09-13 2023-12-08 中移(苏州)软件技术有限公司 Information processing method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN113094200A (en) Application program fault prediction method and device
CN111600919B (en) Method and device for constructing intelligent network application protection system model
Zhou et al. Deepsyslog: Deep anomaly detection on syslog using sentence embedding and metadata
Dong et al. Towards interpreting recurrent neural networks through probabilistic abstraction
CN117235745B (en) Deep learning-based industrial control vulnerability mining method, system, equipment and storage medium
CN115344414A (en) Log anomaly detection method and system based on LSTM-Transformer
CN116955604A (en) Training method, detection method and device of log detection model
CN116662184B (en) Industrial control protocol fuzzy test case screening method and system based on Bert
CN116561748A (en) Log abnormality detection device for component subsequence correlation sensing
CN116827656A (en) Network information safety protection system and method thereof
CN113194064A (en) Webshell detection method and device based on graph convolution neural network
CN116361147A (en) Method for positioning root cause of test case, device, equipment, medium and product thereof
CN116361788A (en) Binary software vulnerability prediction method based on machine learning
CN115587007A (en) RoBERTa-based weblog security detection method and system
CN114416479A (en) Log sequence anomaly detection method based on out-of-stream regularization
Tian et al. An event knowledge graph system for the operation and maintenance of power equipment
CN117874662A (en) Micro-service log anomaly detection method based on graph mode
CN116828087B (en) Information security system based on block chain connection
CN117632659A (en) Log exception processing method, device, equipment and medium
CN117763144A (en) Log abnormality detection method and terminal
CN117354207A (en) Reverse analysis method and device for unknown industrial control protocol
CN115757062A (en) Log anomaly detection method based on sentence embedding and Transformer-XL
CN115344563A (en) Data deduplication method and device, storage medium and electronic equipment
Liao et al. LogBASA: Log Anomaly Detection Based on System Behavior Analysis and Global Semantic Awareness
Liu et al. The runtime system problem identification method based on log analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination