CN115587007A - RoBERTa-based weblog security detection method and system - Google Patents
RoBERTa-based weblog security detection method and system
- Publication number
- CN115587007A (application CN202211178487.3A)
- Authority
- CN
- China
- Prior art keywords
- log
- network
- roberta
- model
- security detection
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3065—Monitoring arrangements determined by the means or processing involved in reporting the monitored data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3089—Monitoring arrangements determined by the means or processing involved in sensing the monitored data, e.g. interfaces, connectors, sensors, probes, agents
- G06F11/3093—Configuration details thereof, e.g. installation, enabling, spatial arrangement of the probes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Quality & Reliability (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Debugging And Monitoring (AREA)
Abstract
The invention discloses a RoBERTa-based network log security detection method and system. The method includes: acquiring a labeled network log data set from all network devices; preprocessing the labeled network log data; constructing a RoBERTa model and training it on the labeled network log data set, where the RoBERTa model uses a bidirectional Transformer network structure as its encoder and a Softmax classifier to obtain the probability that a log is at risk; selecting the optimal model via the dropout function; and feeding labeled network log data into the optimal RoBERTa model to obtain the probability that the log is at risk. The invention can process logs of various unknown types and formats and improves the accuracy of network log security detection.
Description
Technical Field
The invention relates to the technical field of network security, and in particular to a RoBERTa-based network log security detection method and system.
Background
Network log data is very important to network administrators because it contains information on every event that occurs in the network, including system errors, alerts, and packet transmission status. Effectively analyzing large volumes of heterogeneous log data creates opportunities to identify issues before they become problems and to prevent future network attacks; however, processing heterogeneous NetFlow data poses challenges in terms of the volume, velocity, and accuracy of log data. The invention uses the RoBERTa model to simplify advanced network attack detection models. By understanding network attack behavior and cross-validating with a log analysis system, the characteristics of various network attacks can be learned from this model.
Network logs include messages of various types, ranging from critical failures to routine console logs. A log message typically consists of three components: a timestamp, a host identifier (such as an IP address), and the message body. The format of log messages depends on the vendor or service, and there is no unified description rule. This is why writing regular expressions and defining new alert rules for each message is so time-consuming.
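As an illustration of this three-component structure (the pattern and the sample line below are hypothetical and not taken from the patent; real syslog formats vary by vendor, which is exactly the difficulty described here), a minimal parse might look like:

```python
import re

# Hypothetical pattern for a "timestamp host message" line.
LOG_PATTERN = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d+\s[\d:]{8})\s+"   # e.g. "Sep 26 12:01:33"
    r"(?P<host>\S+)\s+"                           # host identifier / IP address
    r"(?P<message>.*)$"                           # free-form message body
)

line = "Sep 26 12:01:33 192.168.1.1 %LINK-3-UPDOWN: Interface Gi0/1, changed state to down"
match = LOG_PATTERN.match(line)
if match:
    print(match.groupdict())
```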
At present, the industry commonly uses the syslog protocol as the standard for recording and transmitting messages over Internet protocols; the protocol is mainly used for network information management and security auditing. Syslog messages have a partially structured format, so a log server can directly receive syslog messages and parse their content to make simple judgments about events.
The current syslog logging component has many shortcomings: there is no strict format control, so operation and maintenance engineers must acquire a large amount of specialized knowledge, and there is no unified standard for classifying log warning levels, so effective correlation analysis cannot be performed. For network operation and maintenance engineers, there is therefore an urgent need for a log processing method that is simple to operate and requires little background knowledge.
Logs have become an important information resource generated by modern information systems. Log-based anomaly detection technology can effectively discover security problems in a system and uncover potential security threats, and has become a hot research topic. With the development and popularization of artificial intelligence technology, more and more related results have been applied to log-based anomaly detection. A log-based anomaly detection method includes steps such as log collection, log parsing, feature extraction, and anomaly detection. Among these, log parsing and anomaly detection are the core parts and the focus of this patent.
Log parsing has evolved from manually defined regular expressions to automated methods, mainly including code analysis, machine learning, and natural language processing. Log-based anomaly detection methods are mainly divided into supervised learning, unsupervised learning, and deep learning. Most anomaly detection methods perform offline analysis for specific scenarios and data sets and lack practical approaches that are both general and highly accurate. When samples are scarce, a model often cannot achieve its best detection performance; obtaining an ideal model requires massive labeled data sets and many training iterations, which consumes considerable manpower and material resources. Moreover, attacks are becoming increasingly covert and multi-staged, and joint analysis of the logs of related devices can effectively uncover potential attacks.
In summary, to solve these problems, a model should not focus on a single log source only but should combine different events and different devices for log parsing and subsequent anomaly detection. In addition, applying machine learning results to online detection, and building a general and effective online log-based anomaly detection method that can be used in practice, has become very important.
Summary of the Invention
The object of the present invention is to provide a RoBERTa-based network log security detection method and system that can process logs of various unknown types and formats and improve the accuracy of network log security detection.
The technical solution that achieves the object of the present invention is a RoBERTa-based network log security detection method, comprising the steps of:
acquiring a labeled network log data set from all network devices;
preprocessing the labeled network log data;
constructing a RoBERTa model and training it on the labeled network log data set, the RoBERTa model using a bidirectional Transformer network structure as its encoder and a Softmax classifier to obtain the probability that a log is at risk;
selecting the optimal model via the dropout function;
feeding labeled network log data into the optimal RoBERTa model to obtain the probability that the log is at risk.
Further, the RoBERTa model converts the input log data into 768-dimensional high-dimensional vectors.
Further, the BiLSTM of the RoBERTa model includes a forward LSTM and a backward LSTM.
Further, the Transformer block includes multiple sub-layers, each sub-layer comprising a multi-head self-attention mechanism and a fully connected feed-forward network, with a residual connection module and a normalization module added between every two sub-layers.
Further, the multi-head self-attention mechanism performs multiple sets of linear transformations on the Query, Key, and Value vectors of each character, computes self-attention for each set separately, and then concatenates all the results.
Further, the Query, Key, and Value vectors all have length 64.
Further, the multi-head self-attention mechanism is corrected with a scaling factor.
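A minimal PyTorch sketch of the scaled multi-head self-attention summarized above (head dimension 64 as stated; the number of heads and the 768-dimensional model width are chosen here only for illustration):

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Multi-group Q/K/V projections, per-head scaled dot-product attention,
    then concatenation of all heads."""
    def __init__(self, d_model=768, n_heads=12, d_head=64):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_head
        self.q_proj = nn.Linear(d_model, n_heads * d_head)
        self.k_proj = nn.Linear(d_model, n_heads * d_head)
        self.v_proj = nn.Linear(d_model, n_heads * d_head)
        self.out_proj = nn.Linear(n_heads * d_head, d_model)

    def forward(self, x):                       # x: (batch, seq_len, d_model)
        b, t, _ = x.shape
        # Project and reshape to (batch, heads, seq_len, d_head).
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        # Scaled dot-product attention: 1/sqrt(d_head) is the scaling-factor
        # correction mentioned in the claim.
        scores = q @ k.transpose(-2, -1) / (self.d_head ** 0.5)
        weights = torch.softmax(scores, dim=-1)
        context = weights @ v                    # (batch, heads, seq_len, d_head)
        context = context.transpose(1, 2).reshape(b, t, -1)   # concatenate heads
        return self.out_proj(context)
```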
Further, the RoBERTa model adds [CLS] and [SEP] characters to the input log text data, splits the log text data into individual characters, and then stores the individual characters in a vocabulary, with each character corresponding to a unique identifier.
Further, adding the [CLS] and [SEP] characters to the log text data specifically means: the first vector of each piece of log text data is the [CLS] token, which is used for the downstream network log classification task, and the sentence-final vector is the [SEP] token, which serves as a separator between different logs; the log text data input to the RoBERTa model uses only one sentence vector.
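A minimal sketch of this tokenization step using the Hugging Face `transformers` library; the checkpoint name is an assumption (many Chinese RoBERTa checkpoints reuse a BERT-style tokenizer with [CLS]/[SEP] tokens, which matches the scheme described above), and the log line is hypothetical:

```python
from transformers import AutoTokenizer

# Placeholder checkpoint: any BERT-style Chinese RoBERTa tokenizer would do.
tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext")

log_line = "Sep 26 12:01:33 192.168.1.1 interface Gi0/1 changed state to down"

# One sentence vector per log: [CLS] <tokens> [SEP]
encoded = tokenizer(log_line, truncation=True, max_length=128, return_tensors="pt")
tokens = tokenizer.convert_ids_to_tokens(encoded["input_ids"][0])
print(tokens[0], tokens[-1])   # expected: [CLS] [SEP]
```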
A RoBERTa-based network log security detection system comprises a data collection module, a log segmentation module, a network log security detection module, a training module, and a database. The data collection module collects device information and the corresponding log files from the network environment and saves the collected data to the database; the log segmentation module preprocesses the data; the network log security detection module is based on the RoBERTa model, which uses a bidirectional Transformer network structure as its encoder and a Softmax classifier to obtain the probability that a log is at risk; the training module trains and updates the network log security detection module and selects the optimal model via the dropout function; the database stores the log data.
Compared with the prior art, the beneficial effects of the present invention are as follows: the invention does not focus on a single log source only but combines different events and different devices for log parsing, and performs efficient and accurate detection through the constructed RoBERTa model; it can process logs of various unknown types and formats, overcoming the weakness of previous template-based methods when analyzing logs with unknown definitions and improving system usability and ease of operation; and it offers low cost, low resource consumption, and efficient execution.
Brief Description of the Drawings
Fig. 1 is a framework diagram of log-based anomaly detection.
Fig. 2 is a flowchart of training the RoBERTa model.
Fig. 3 is the overall architecture diagram of the model.
Fig. 4 is a structural diagram of the Transformer.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by persons of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
The terms used in the embodiments of the present invention are for the purpose of describing specific embodiments only and are not intended to limit the present invention. As used in the embodiments of the present invention and the appended claims, the singular forms "a", "said", and "the" are also intended to include the plural forms unless the context clearly indicates otherwise.
It should be understood that the term "and/or" used in this patent merely describes an association between related objects and indicates that three relationships may exist; for example, A and/or B may mean: A alone, A and B together, or B alone. In addition, the character "/" in this patent generally indicates an "or" relationship between the preceding and following objects.
The present invention can simplify advanced network attack detection models through the BERT model. By understanding network attack behavior and cross-validating with a log analysis system, the characteristics of various network attacks can be learned from this model.
The RoBERTa-based network log security detection system provided by this embodiment includes a natural language processing component and a database. The natural language processing component includes a word segmentation system and the RoBERTa algorithm. The database contains a classification lexicon and a large number of logs exported from different types of network devices, each log accompanied by a corresponding label. The database is associated with the natural language processing component. The word segmentation system uses the split function and segments words using whitespace as the delimiter, as shown in the sketch below.
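A minimal sketch of this whitespace-based segmentation (the example log line is hypothetical):

```python
log_line = "Sep 26 12:01:33 192.168.1.1 %SYS-5-CONFIG_I: Configured from console"
tokens = log_line.split()   # split() with no argument splits on any whitespace
print(tokens)
# ['Sep', '26', '12:01:33', '192.168.1.1', '%SYS-5-CONFIG_I:', 'Configured', 'from', 'console']
```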
The natural language processing component is used to classify and analyze the device's syslog source data and log files, determine the meaning contained in log statements, and collect a certain number of statements to train the language processing model of the database. Based on a preset word-sense lexicon, the natural language processing component extracts valid fields from statements for training and learning, generates a number of training words as keywords, generates parsing information for the keywords, and builds a language processing model from the keywords and the corresponding parsing information. The language processing model adopts the RoBERTa model with a bidirectional Transformer structure.
Preferably, the natural language processing component further includes a data collection module, a log segmentation module, and a parsing module. The collection module receives basic information from device sources or training statements and caches them in the database; statements are converted into 768-dimensional vectors by the RoBERTa model and classified according to predefined rules.
Preferably, the collection module includes a device determination module, a log collection module, and an association analysis module. The device determination module uses device discovery techniques to obtain device information in the network environment and stores the basic information of the devices in the segmentation lexicon of the database. The log collection module collects network log files from the network devices monitored by the syslog log server as the data source for RoBERTa model training, and simultaneously collects the logs to be analyzed.
Preferably, the system further includes a training module, which updates the RoBERTa model with the parsing information obtained from the parsing module, so as to update the model of the corresponding device.
Preferably, the devices include, but are not limited to, switches, servers, gateways, routers, and network security devices.
Preferably, the basic information of a device includes, but is not limited to, the device name, device type, device IP, and vendor name.
The present invention provides a RoBERTa-based network log processing method, with the following specific steps: the collection module obtains device information in the network environment and records its basic information as the base statements for statement collection, while also obtaining the devices' network logs from the syslog log server; the logs and device information are then fed into the trained RoBERTa model for analysis; finally, a judgment of whether the log poses a security risk is output.
A RoBERTa-based network log security detection method is characterized by comprising:
a model construction module, configured to obtain a network log data set and construct a log anomaly detection network model according to the mapping vectors of all network logs in the network log data set;
a model analysis module, configured to identify abnormal logs in the network log data set according to the network access characteristics of each user in the user behavior network model and the access status of each node and path in the user behavior network model.
Embodiment 1
Referring to Fig. 1, an abnormal-user detection method based on network logs provided by an embodiment of the present invention includes:
Step 1: obtain labeled network log data sets from the various network devices.
Step 2: after obtaining the network log data set composed of a large number of network logs, note that the raw log data is of dictionary type and mainly contains fields such as the user IP, request time, request method, request size, status code, and request URL, so the request URL can be represented as the combination of the requested resource path and optional request parameters. Abnormal users in the network log data set are identified according to the network access characteristics of each user in the pre-trained network model and the access status of each node and path in the user behavior network model. At this point the collected network logs are preprocessed with word segmentation, stop-word removal and similar steps, the required parameters are extracted from the raw logs, and the data is converted into a format that can be processed directly (see the sketch below).
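A minimal preprocessing sketch under the field names listed above (the exact dictionary keys and the stop-word list are assumptions for illustration, not prescribed by the patent):

```python
# Hypothetical raw log record of dictionary type, as described in Step 2.
raw_log = {
    "user_ip": "10.0.0.7",
    "request_time": "2022-09-26T12:01:33",
    "request_method": "GET",
    "request_size": 512,
    "status_code": 404,
    "request_url": "/admin/login.php?user=../../etc/passwd",
}

STOP_WORDS = {"http", "https", "www"}   # assumed stop-word list

def preprocess(record):
    """Flatten the needed fields into a whitespace-separated text line,
    then segment it and drop stop words."""
    path, _, params = record["request_url"].partition("?")
    text = " ".join([
        record["user_ip"],
        record["request_method"],
        str(record["status_code"]),
        path,
        params,
    ])
    tokens = [t for t in text.split() if t.lower() not in STOP_WORDS]
    return " ".join(tokens)

print(preprocess(raw_log))
```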
Step 3: process the prepared log data set with the pre-trained RoBERTa model; after the data set passes through the bidirectional Transformer encoder network structure, the corresponding feature representation vectors are extracted. RoBERTa is introduced as the vectorized feature representation of the pre-trained text; it uses a bidirectional Transformer network structure as its encoder.
Step 4: then compute the security polarity probability through Softmax normalization, and finally output the probability that the log is at risk. The extracted [CLS] feature vector is processed by the Softmax classifier function, and the result is compared with the safe-class probability to infer whether an anomaly exists. Logs with the same IP are treated as logs generated by the same user; after the preprocessed log data is obtained, a user behavior network model can be generated from the access relationships between paths. For each user, a smaller behavior network model N1 can be constructed from that user's access logs, and a larger pre-trained model N2 can then be constructed from the access logs of all users.
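A minimal PyTorch sketch of the classifier described in Steps 3-4 and in Fig. 3 (RoBERTa encoder, BiLSTM feature extractor, dropout, Softmax over risk classes). The checkpoint name, the LSTM hidden size, and the two-class output are assumptions for illustration; only the 768-dimensional encoder output is taken from the patent:

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class RobertaBiLstmClassifier(nn.Module):
    def __init__(self, checkpoint="hfl/chinese-roberta-wwm-ext",
                 lstm_hidden=256, num_classes=2, dropout=0.1):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(checkpoint)   # 768-dim outputs
        self.bilstm = nn.LSTM(input_size=768, hidden_size=lstm_hidden,
                              batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(2 * lstm_hidden, num_classes)

    def forward(self, input_ids, attention_mask):
        # Character-level contextual embeddings from RoBERTa.
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        # Forward + backward LSTM over the sequence of embeddings.
        lstm_out, _ = self.bilstm(hidden)
        # Use the representation at the [CLS] position (index 0) for classification.
        cls_repr = self.dropout(lstm_out[:, 0, :])
        logits = self.classifier(cls_repr)
        return torch.softmax(logits, dim=-1)   # class probabilities per log
```

Given a batch produced by the tokenizer sketch above, `model(encoded["input_ids"], encoded["attention_mask"])[:, 1]` would then be read as the per-log probability of the risky class (assuming index 1 denotes "at risk").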
Specifically, after the anomaly detection model is constructed, the network access characteristics of each individual user in the model and the access status of each node and path in the user behavior network model are used as analysis indicators to analyze every user and node in the user behavior network model, so as to detect abnormal users and abnormal nodes.
With this method, after preprocessing the raw log data, the required data is extracted and a user behavior network model is built on top of it. When analyzing the logs, the user network access characteristics in the user behavior network model and the status of each node are used as analysis indicators, so that abnormal-log detection can be performed quantitatively from the log data; through cosine-similarity detection, network logs can be analyzed quickly, efficiently, and scientifically.
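A minimal sketch of the cosine-similarity check mentioned here, comparing a log's representation vector against a reference of normal behavior (the reference vector and the threshold are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def cosine_anomaly_score(log_vector: torch.Tensor,
                         normal_center: torch.Tensor) -> float:
    """Return 1 - cosine similarity: larger means further from normal behavior."""
    return 1.0 - F.cosine_similarity(log_vector, normal_center, dim=-1).item()

# Hypothetical 768-dimensional representation vectors.
log_vec = torch.randn(768)
normal_center = torch.randn(768)   # e.g. mean vector of known-normal logs
THRESHOLD = 0.5                    # assumed decision threshold

if cosine_anomaly_score(log_vec, normal_center) > THRESHOLD:
    print("flag as potentially abnormal")
```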
Referring to Fig. 2, the RoBERTa model training procedure includes:
Step 5: manually annotate the risk level of the collected data set.
Step 6: preprocess the labeled data set and remove irrelevant sequences.
Step 7: split the data set into a training set (80%) and a test set (20%), while ensuring that the amounts of data for the different risk types are consistent between the training set and the test set.
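A minimal sketch of this stratified 80/20 split using scikit-learn (the sample logs and labels are illustrative placeholders):

```python
from sklearn.model_selection import train_test_split

# Placeholder data: preprocessed log strings and their risk-type labels.
logs = [f"GET /page{i} 200" for i in range(8)] + \
       [f"GET /admin/../etc/passwd?x={i} 404" for i in range(8)]
labels = [0] * 8 + [1] * 8

train_logs, test_logs, train_labels, test_labels = train_test_split(
    logs, labels,
    test_size=0.2,        # 20% test set
    stratify=labels,      # keep the per-risk-type proportions identical
    random_state=42,
)
```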
Step 8: add the [CLS] and [SEP] characters to the data set. The first vector of each log is the [CLS] token, which can be used for the downstream network log classification task, and the sentence-final [SEP] token is used as a separator between different logs; since log classification is a sentence-level classification problem, i.e. the input is a single sentence, only one sentence vector is used.
Step 9: convert the logs into 768-dimensional high-dimensional vectors through the pre-trained RoBERTa model, where each log contains multiple word embedding vectors and a word embedding vector is the static encoding of each word.
Step 10: feed the high-dimensional vectors into the BiLSTM network for training and output the representation vectors.
Step 11: pass the [CLS] vector through the Softmax normalization function to compute the probabilities of the different classes.
Step 12: select the optimal model via the dropout function.
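A compressed training-loop sketch covering Steps 9-12, reusing the `RobertaBiLstmClassifier` sketched earlier; dropout is active during training and the checkpoint with the best validation loss is kept. The optimizer choice, learning rate, epoch count, and validation criterion are assumptions, and the data loaders are assumed to exist:

```python
import torch
import torch.nn as nn

model = RobertaBiLstmClassifier()                 # from the earlier sketch
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
criterion = nn.NLLLoss()                          # model already outputs probabilities
best_val_loss = float("inf")

# train_loader / val_loader: assumed PyTorch DataLoaders yielding
# (input_ids, attention_mask, labels) batches built from the tokenized logs.
for epoch in range(3):
    model.train()                                 # dropout enabled during training
    for input_ids, attention_mask, labels in train_loader:
        optimizer.zero_grad()
        probs = model(input_ids, attention_mask)
        loss = criterion(torch.log(probs), labels)
        loss.backward()
        optimizer.step()

    model.eval()                                  # dropout disabled for evaluation
    with torch.no_grad():
        val_loss = sum(criterion(torch.log(model(i, m)), y).item()
                       for i, m, y in val_loader)
    if val_loss < best_val_loss:                  # keep the best checkpoint
        best_val_loss = val_loss
        torch.save(model.state_dict(), "best_roberta_log_model.pt")
```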
Referring to Fig. 3, the RoBERTa model is connected using Transformer encoder blocks, since it is a typical bidirectional encoding model. In a Transformer block, the data first passes through the multi-head attention module to obtain a weighted feature vector. In the attention mechanism, each character has three different vectors, namely the Query vector (Q), Key vector (K), and Value vector (V), each of length 64. Specifically, the model includes, connected in sequence, a segmentation input layer 19, a splitting layer 18, a Transformer encoder block 17, a vector output layer 16, a forward LSTM 15, a backward LSTM 14, and a hidden layer 13. Specifically:
Hidden layer 13 of the RoBERTa model: unlike a simple weighted combination of RoBERTa and BiLSTM, the present invention uses the character vectors generated by RoBERTa as the character embedding layer in the upstream part, and uses BiLSTM as the feature extractor in the downstream part to model and mine the network logs. RoBERTa dynamically constructs the character vector representations, while BiLSTM integrates the text information with the sequential characteristics of the sentence. Combining the two yields more complex semantic features and a more accurate semantic representation.
Backward LSTM 14 of the RoBERTa model: thanks to its linear structure, a unidirectional LSTM can easily capture the sequential information of text, but it cannot encode information running from back to front, whereas a BiLSTM can.
Forward LSTM 15 of the RoBERTa model: the BiLSTM is the combination of the forward LSTM 15 and the backward LSTM 14. The BiLSTM is adopted as the language model with long-term sequence information to generate the log mining results, with the character-vectorized representations from the RoBERTa output layer used as the input of the BiLSTM layer.
The vector output layer 16 of the RoBERTa model outputs the high-dimensional representation vectors.
Transformer encoder block 17 of the RoBERTa model; its specific structure is shown in Fig. 4. The Transformer is the core structure of RoBERTa: each Trm in the RoBERTa structure corresponds to one Transformer block on the right. Each Transformer block consists of two sub-layers, a multi-head self-attention mechanism and a fully connected feed-forward network, and each sub-layer adds residual connections and normalization. The Transformer is the model proposed by Google to solve Seq2Seq problems; it replaces the LSTM structure entirely with attention, exploits the attention-mechanism idea to the fullest, and has achieved substantial progress in machine translation. Because the Transformer model does not use an LSTM structure to model temporal information, a position embedding layer is added during encoding to compensate for the self-attention mechanism's inability to capture temporal information. The Transformer encoder block 17 includes, connected stage by stage, an input module 25, a weighting module 24, a multi-head attention mechanism module 23, a second residual module 22, an Encoder module 21, and a first residual module 20. Specifically:
First residual module 20 of the Transformer encoder block: residual addition. The residual is introduced to address the network degradation caused by depth; in fact, many experiments show that the residual here contributes greatly to network performance. Norm is the normalization module, which normalizes the output values.
The Encoder module 21 of the Transformer model is formed by stacking N sub-layers with identical structure. Each sub-layer is a combination of two network parts, a Multi-Head Self-Attention Network (MSAN) and a fully connected Feed-Forward Network (FFN). Moreover, the output of each layer is processed with layer normalization and residual connections to handle the degradation problem that arises when stacking multi-layer networks.
The second residual module 22 is the same as the first residual module 20.
The starting point of the multi-head attention mechanism module 23 is to expand the feature representation space by performing attention computations on the input in different subspaces, thereby modeling the text at a deeper level. It is implemented by applying multiple sets of linear transformations to Q, K, and V, computing self-attention for each set separately, and then concatenating all the results. The multi-head self-attention network used in the encoder adds a scaling factor and the multi-head mechanism on top of the traditional self-attention mechanism. The scaling factor is a correction to the traditional attention computation that prevents problems such as excessively high dimensionality making the results too large and the gradients too small.
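For reference, the scaled dot-product attention that this scaling factor refers to is conventionally written as follows, with d_k = 64 being the Key-vector length given above:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V,
\qquad
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O}
```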
Weighting module 24: in the Transformer block, the data first passes through the multi-head attention module to obtain a weighted feature vector. In the attention mechanism, each character has three different vectors, the Query vector (Q), Key vector (K), and Value vector (V), each of length 64.
The input module 25 implements the input word embedding.
After word embedding, position embedding, and segment embedding, the splitting layer 18 splits the high-dimensional vector into characters, dividing the input text data into individual characters (E1, E2, ..., En), and then stores the individual characters in a vocabulary, i.e. each character has a corresponding unique identifier (denoted [character: ID]).
After preprocessing such as word segmentation, the log data with the [CLS] and [SEP] characters added is fed in through the segmentation input layer 19.
A RoBERTa-based network log security detection system comprises a data collection module, a log segmentation module, a network log security detection module, a training module, and a database. The data collection module collects device information and the corresponding log files from the network environment and saves the collected data to the database; the log segmentation module preprocesses the data; the network log security detection module is based on the RoBERTa model, which uses a bidirectional Transformer network structure as its encoder and a Softmax classifier to obtain the probability that a log is at risk; the training module trains and updates the network log security detection module and selects the optimal model via the dropout function; the database stores the log data. The system includes all the technical features of the abnormal-user detection method described above.
It should be pointed out that, since RoBERTa is a well-known model, this embodiment only elaborates on the improvements and innovations; common knowledge in the field that is not elaborated here is not repeated.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211178487.3A CN115587007A (en) | 2022-09-26 | 2022-09-26 | RoBERTa-based weblog security detection method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211178487.3A CN115587007A (en) | 2022-09-26 | 2022-09-26 | RoBERTa-based weblog security detection method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115587007A true CN115587007A (en) | 2023-01-10 |
Family
ID=84773028
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211178487.3A Pending CN115587007A (en) | 2022-09-26 | 2022-09-26 | Robertta-based weblog security detection method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115587007A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116915511A (en) * | 2023-09-13 | 2023-10-20 | 中移(苏州)软件技术有限公司 | Information processing method, device, equipment and storage medium |
CN118784360A (en) * | 2024-08-12 | 2024-10-15 | 北京弘勤安技术有限公司 | A network security detection individual system based on BERT |
-
2022
- 2022-09-26 CN CN202211178487.3A patent/CN115587007A/en active Pending
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116915511A (en) * | 2023-09-13 | 2023-10-20 | 中移(苏州)软件技术有限公司 | Information processing method, device, equipment and storage medium |
CN116915511B (en) * | 2023-09-13 | 2023-12-08 | 中移(苏州)软件技术有限公司 | Information processing method, device, equipment and storage medium |
CN118784360A (en) * | 2024-08-12 | 2024-10-15 | 北京弘勤安技术有限公司 | A network security detection individual system based on BERT |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20220197923A1 (en) | Apparatus and method for building big data on unstructured cyber threat information and method for analyzing unstructured cyber threat information | |
US20210037032A1 (en) | Methods and systems for automated parsing and identification of textual data | |
Zhou et al. | Deepsyslog: Deep anomaly detection on syslog using sentence embedding and metadata | |
Huang et al. | Protocol reverse-engineering methods and tools: A survey | |
CN114553983A (en) | An efficient industrial control protocol analysis method based on deep learning | |
CN115587007A (en) | Robertta-based weblog security detection method and system | |
CN113194064B (en) | Webshell detection method and device based on graph convolution neural network | |
CN118672990B (en) | Log analysis method, device, program product and medium | |
CN116167370B (en) | Anomaly detection method for distributed systems based on log spatiotemporal feature analysis | |
CN117763144A (en) | Log abnormality detection method and terminal | |
CN116561748A (en) | Log abnormality detection device for component subsequence correlation sensing | |
CN117874662A (en) | Microservice log anomaly detection method based on graph mode | |
CN115344414A (en) | Log anomaly detection method and system based on LSTM-Transformer | |
Yu et al. | Self-supervised log parsing using semantic contribution difference | |
CN118611962A (en) | A log analysis and APT attack tracing method based on unsupervised learning | |
CN118227361A (en) | Unsupervised log anomaly detection method independent of log analyzer | |
Pal et al. | DLME: distributed log mining using ensemble learning for fault prediction | |
Zhan et al. | Toward automated field semantics inference for binary protocol reverse engineering | |
CN117135038A (en) | Network fault monitoring methods, devices and electronic equipment | |
CN114707508A (en) | Event detection method based on multi-hop neighbor information fusion of graph structure | |
Chen et al. | Unsupervised Anomaly Detection Based on System Logs. | |
Li et al. | Logspy: System log anomaly detection for distributed systems | |
CN104636404A (en) | Method and device for generating large-scale data used for testing | |
CN112256838A (en) | Similar domain name searching method and device and electronic equipment | |
CN118013440A (en) | An abnormal detection method for personal sensitive information desensitization operation based on event graph |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |