CN115587007A - RoBERTa-based weblog security detection method and system - Google Patents
RoBERTa-based weblog security detection method and system
- Publication number
- CN115587007A (application CN202211178487.3A)
- Authority
- CN
- China
- Prior art keywords
- log
- network
- roberta
- model
- security detection
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3065—Monitoring arrangements determined by the means or processing involved in reporting the monitored data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3089—Monitoring arrangements determined by the means or processing involved in sensing the monitored data, e.g. interfaces, connectors, sensors, probes, agents
- G06F11/3093—Configuration details thereof, e.g. installation, enabling, spatial arrangement of the probes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Quality & Reliability (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Debugging And Monitoring (AREA)
Abstract
The invention discloses a RoBERTa-based network log security detection method and system. The method includes: acquiring a labeled network log data set from all network devices; preprocessing the labeled network log data; constructing a RoBERTa model and training it on the labeled network log data set, where the RoBERTa model uses a bidirectional Transformer network structure as its encoder and a Softmax classifier to obtain the probability that a log is at risk; selecting the optimal model via the dropout function; and feeding labeled network log data into the optimal RoBERTa model to obtain the probability that the log is at risk. The invention can process logs of various unknown types and formats and improves the accuracy of network log security detection.
Description
Technical Field
The invention relates to the technical field of network security, and in particular to a RoBERTa-based network log security detection method and system.
Background
Network log data is very important to network administrators because it contains information on every event that occurs in the network, including system errors, alerts, and packet transmission status. Effectively analyzing large volumes of heterogeneous log data creates opportunities to identify issues before they become problems and to prevent future network attacks; however, processing heterogeneous NetFlow data poses challenges in terms of the volume, velocity, and accuracy of log data. The invention uses the RoBERTa model to simplify advanced network attack detection models. By understanding network attack behavior and cross-validating with a log analysis system, the characteristics of various network attacks can be learned from this model.
Network logs include messages of various types, ranging from critical failures to routine console logs. A log message typically consists of three components: a timestamp, a host identifier (such as an IP address), and the message body. The format of log messages depends on the vendor or service, and there is no unified description rule. This is why writing regular expressions and defining new alert rules for each message is so time-consuming.
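As an illustration of this three-component structure (the pattern and the sample line below are hypothetical and not taken from the patent; real syslog formats vary by vendor, which is exactly the difficulty described here), a minimal parse might look like:

```python
import re

# Hypothetical pattern for a "timestamp host message" line.
LOG_PATTERN = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d+\s[\d:]{8})\s+"   # e.g. "Sep 26 12:01:33"
    r"(?P<host>\S+)\s+"                           # host identifier / IP address
    r"(?P<message>.*)$"                           # free-form message body
)

line = "Sep 26 12:01:33 192.168.1.1 %LINK-3-UPDOWN: Interface Gi0/1, changed state to down"
match = LOG_PATTERN.match(line)
if match:
    print(match.groupdict())
```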
At present, the industry commonly uses the syslog protocol as the standard for recording and transmitting messages over Internet protocols; the protocol is mainly used for network information management and security auditing. Syslog messages have a partially structured format, so a log server can directly receive syslog messages and parse their content to make simple judgments about events.
The current syslog logging component has many shortcomings: there is no strict format control, so operation and maintenance engineers must acquire a large amount of specialized knowledge, and there is no unified standard for classifying log warning levels, so effective correlation analysis cannot be performed. For network operation and maintenance engineers, there is therefore an urgent need for a log processing method that is simple to operate and requires little background knowledge.
Logs have become an important information resource generated by modern information systems. Log-based anomaly detection technology can effectively discover security problems in a system and uncover potential security threats, and has become a hot research topic. With the development and popularization of artificial intelligence technology, more and more related results have been applied to log-based anomaly detection. A log-based anomaly detection method includes steps such as log collection, log parsing, feature extraction, and anomaly detection. Among these, log parsing and anomaly detection are the core parts and the focus of this patent.
Log parsing has evolved from manually defined regular expressions to automated methods, mainly including code analysis, machine learning, and natural language processing. Log-based anomaly detection methods are mainly divided into supervised learning, unsupervised learning, and deep learning. Most anomaly detection methods perform offline analysis for specific scenarios and data sets and lack practical approaches that are both general and highly accurate. When samples are scarce, a model often cannot achieve its best detection performance; obtaining an ideal model requires massive labeled data sets and many training iterations, which consumes considerable manpower and material resources. Moreover, attacks are becoming increasingly covert and multi-staged, and joint analysis of the logs of related devices can effectively uncover potential attacks.
In summary, to solve these problems, a model should not focus on a single log source only but should combine different events and different devices for log parsing and subsequent anomaly detection. In addition, applying machine learning results to online detection, and building a general and effective online log-based anomaly detection method that can be used in practice, has become very important.
Summary of the Invention
The object of the present invention is to provide a RoBERTa-based network log security detection method and system that can process logs of various unknown types and formats and improve the accuracy of network log security detection.
The technical solution that achieves the object of the present invention is a RoBERTa-based network log security detection method, comprising the steps of:
acquiring a labeled network log data set from all network devices;
preprocessing the labeled network log data;
constructing a RoBERTa model and training it on the labeled network log data set, the RoBERTa model using a bidirectional Transformer network structure as its encoder and a Softmax classifier to obtain the probability that a log is at risk;
selecting the optimal model via the dropout function;
feeding labeled network log data into the optimal RoBERTa model to obtain the probability that the log is at risk.
Further, the RoBERTa model converts the input log data into 768-dimensional high-dimensional vectors.
Further, the BiLSTM of the RoBERTa model includes a forward LSTM and a backward LSTM.
Further, the Transformer block includes multiple sub-layers, each sub-layer comprising a multi-head self-attention mechanism and a fully connected feed-forward network, with a residual connection module and a normalization module added between every two sub-layers.
Further, the multi-head self-attention mechanism performs multiple sets of linear transformations on the Query, Key, and Value vectors of each character, computes self-attention for each set separately, and then concatenates all the results.
Further, the Query, Key, and Value vectors all have length 64.
Further, the multi-head self-attention mechanism is corrected with a scaling factor.
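A minimal PyTorch sketch of the scaled multi-head self-attention summarized above (head dimension 64 as stated; the number of heads and the 768-dimensional model width are chosen here only for illustration):

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Multi-group Q/K/V projections, per-head scaled dot-product attention,
    then concatenation of all heads."""
    def __init__(self, d_model=768, n_heads=12, d_head=64):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_head
        self.q_proj = nn.Linear(d_model, n_heads * d_head)
        self.k_proj = nn.Linear(d_model, n_heads * d_head)
        self.v_proj = nn.Linear(d_model, n_heads * d_head)
        self.out_proj = nn.Linear(n_heads * d_head, d_model)

    def forward(self, x):                       # x: (batch, seq_len, d_model)
        b, t, _ = x.shape
        # Project and reshape to (batch, heads, seq_len, d_head).
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        # Scaled dot-product attention: 1/sqrt(d_head) is the scaling-factor
        # correction mentioned in the claim.
        scores = q @ k.transpose(-2, -1) / (self.d_head ** 0.5)
        weights = torch.softmax(scores, dim=-1)
        context = weights @ v                    # (batch, heads, seq_len, d_head)
        context = context.transpose(1, 2).reshape(b, t, -1)   # concatenate heads
        return self.out_proj(context)
```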
Further, the RoBERTa model adds [CLS] and [SEP] characters to the input log text data, splits the log text data into individual characters, and then stores the individual characters in a vocabulary, with each character corresponding to a unique identifier.
Further, adding the [CLS] and [SEP] characters to the log text data specifically means: the first vector of each piece of log text data is the [CLS] token, which is used for the downstream network log classification task, and the sentence-final vector is the [SEP] token, which serves as a separator between different logs; the log text data input to the RoBERTa model uses only one sentence vector.
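A minimal sketch of this tokenization step using the Hugging Face `transformers` library; the checkpoint name is an assumption (many Chinese RoBERTa checkpoints reuse a BERT-style tokenizer with [CLS]/[SEP] tokens, which matches the scheme described above), and the log line is hypothetical:

```python
from transformers import AutoTokenizer

# Placeholder checkpoint: any BERT-style Chinese RoBERTa tokenizer would do.
tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext")

log_line = "Sep 26 12:01:33 192.168.1.1 interface Gi0/1 changed state to down"

# One sentence vector per log: [CLS] <tokens> [SEP]
encoded = tokenizer(log_line, truncation=True, max_length=128, return_tensors="pt")
tokens = tokenizer.convert_ids_to_tokens(encoded["input_ids"][0])
print(tokens[0], tokens[-1])   # expected: [CLS] [SEP]
```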
A RoBERTa-based network log security detection system comprises a data collection module, a log segmentation module, a network log security detection module, a training module, and a database. The data collection module collects device information and the corresponding log files from the network environment and saves the collected data to the database; the log segmentation module preprocesses the data; the network log security detection module is based on the RoBERTa model, which uses a bidirectional Transformer network structure as its encoder and a Softmax classifier to obtain the probability that a log is at risk; the training module trains and updates the network log security detection module and selects the optimal model via the dropout function; the database stores the log data.
Compared with the prior art, the beneficial effects of the present invention are as follows: the invention does not focus on a single log source only but combines different events and different devices for log parsing, and performs efficient and accurate detection through the constructed RoBERTa model; it can process logs of various unknown types and formats, overcoming the weakness of previous template-based methods when analyzing logs with unknown definitions and improving system usability and ease of operation; and it offers low cost, low resource consumption, and efficient execution.
Brief Description of the Drawings
Fig. 1 is a framework diagram of log-based anomaly detection.
Fig. 2 is a flowchart of training the RoBERTa model.
Fig. 3 is the overall architecture diagram of the model.
Fig. 4 is a structural diagram of the Transformer.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by persons of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
The terms used in the embodiments of the present invention are for the purpose of describing specific embodiments only and are not intended to limit the present invention. As used in the embodiments of the present invention and the appended claims, the singular forms "a", "said", and "the" are also intended to include the plural forms unless the context clearly indicates otherwise.
It should be understood that the term "and/or" used in this patent merely describes an association between related objects and indicates that three relationships may exist; for example, A and/or B may mean: A alone, A and B together, or B alone. In addition, the character "/" in this patent generally indicates an "or" relationship between the preceding and following objects.
The present invention can simplify advanced network attack detection models through the BERT model. By understanding network attack behavior and cross-validating with a log analysis system, the characteristics of various network attacks can be learned from this model.
The RoBERTa-based network log security detection system provided by this embodiment includes a natural language processing component and a database. The natural language processing component includes a word segmentation system and the RoBERTa algorithm. The database contains a classification lexicon and a large number of logs exported from different types of network devices, each log accompanied by a corresponding label. The database is associated with the natural language processing component. The word segmentation system uses the split function and segments words using whitespace as the delimiter, as shown in the sketch below.
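A minimal sketch of this whitespace-based segmentation (the example log line is hypothetical):

```python
log_line = "Sep 26 12:01:33 192.168.1.1 %SYS-5-CONFIG_I: Configured from console"
tokens = log_line.split()   # split() with no argument splits on any whitespace
print(tokens)
# ['Sep', '26', '12:01:33', '192.168.1.1', '%SYS-5-CONFIG_I:', 'Configured', 'from', 'console']
```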
The natural language processing component is used to classify and analyze the device's syslog source data and log files, determine the meaning contained in log statements, and collect a certain number of statements to train the language processing model of the database. Based on a preset word-sense lexicon, the natural language processing component extracts valid fields from statements for training and learning, generates a number of training words as keywords, generates parsing information for the keywords, and builds a language processing model from the keywords and the corresponding parsing information. The language processing model adopts the RoBERTa model with a bidirectional Transformer structure.
Preferably, the natural language processing component further includes a data collection module, a log segmentation module, and a parsing module. The collection module receives basic information from device sources or training statements and caches them in the database; statements are converted into 768-dimensional vectors by the RoBERTa model and classified according to predefined rules.
Preferably, the collection module includes a device determination module, a log collection module, and an association analysis module. The device determination module uses device discovery techniques to obtain device information in the network environment and stores the basic information of the devices in the segmentation lexicon of the database. The log collection module collects network log files from the network devices monitored by the syslog log server as the data source for RoBERTa model training, and simultaneously collects the logs to be analyzed.
Preferably, the system further includes a training module, which updates the RoBERTa model with the parsing information obtained from the parsing module, so as to update the model of the corresponding device.
Preferably, the devices include, but are not limited to, switches, servers, gateways, routers, and network security devices.
Preferably, the basic information of a device includes, but is not limited to, the device name, device type, device IP, and vendor name.
The present invention provides a RoBERTa-based network log processing method, with the following specific steps: the collection module obtains device information in the network environment and records its basic information as the base statements for statement collection, while also obtaining the devices' network logs from the syslog log server; the logs and device information are then fed into the trained RoBERTa model for analysis; finally, a judgment of whether the log poses a security risk is output.
A RoBERTa-based network log security detection method is characterized by comprising:
a model construction module, configured to obtain a network log data set and construct a log anomaly detection network model according to the mapping vectors of all network logs in the network log data set;
a model analysis module, configured to identify abnormal logs in the network log data set according to the network access characteristics of each user in the user behavior network model and the access status of each node and path in the user behavior network model.
Embodiment 1
Referring to Fig. 1, an abnormal-user detection method based on network logs provided by an embodiment of the present invention includes:
Step 1: obtain labeled network log data sets from the various network devices.
Step 2: after obtaining the network log data set composed of a large number of network logs, note that the raw log data is of dictionary type and mainly contains fields such as the user IP, request time, request method, request size, status code, and request URL, so the request URL can be represented as the combination of the requested resource path and optional request parameters. Abnormal users in the network log data set are identified according to the network access characteristics of each user in the pre-trained network model and the access status of each node and path in the user behavior network model. At this point the collected network logs are preprocessed with word segmentation, stop-word removal and similar steps, the required parameters are extracted from the raw logs, and the data is converted into a format that can be processed directly (see the sketch below).
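A minimal preprocessing sketch under the field names listed above (the exact dictionary keys and the stop-word list are assumptions for illustration, not prescribed by the patent):

```python
# Hypothetical raw log record of dictionary type, as described in Step 2.
raw_log = {
    "user_ip": "10.0.0.7",
    "request_time": "2022-09-26T12:01:33",
    "request_method": "GET",
    "request_size": 512,
    "status_code": 404,
    "request_url": "/admin/login.php?user=../../etc/passwd",
}

STOP_WORDS = {"http", "https", "www"}   # assumed stop-word list

def preprocess(record):
    """Flatten the needed fields into a whitespace-separated text line,
    then segment it and drop stop words."""
    path, _, params = record["request_url"].partition("?")
    text = " ".join([
        record["user_ip"],
        record["request_method"],
        str(record["status_code"]),
        path,
        params,
    ])
    tokens = [t for t in text.split() if t.lower() not in STOP_WORDS]
    return " ".join(tokens)

print(preprocess(raw_log))
```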
Step 3: process the prepared log data set with the pre-trained RoBERTa model; after the data set passes through the bidirectional Transformer encoder network structure, the corresponding feature representation vectors are extracted. RoBERTa is introduced as the vectorized feature representation of the pre-trained text; it uses a bidirectional Transformer network structure as its encoder.
Step 4: then compute the security polarity probability through Softmax normalization, and finally output the probability that the log is at risk. The extracted [CLS] feature vector is processed by the Softmax classifier function, and the result is compared with the safe-class probability to infer whether an anomaly exists. Logs with the same IP are treated as logs generated by the same user; after the preprocessed log data is obtained, a user behavior network model can be generated from the access relationships between paths. For each user, a smaller behavior network model N1 can be constructed from that user's access logs, and a larger pre-trained model N2 can then be constructed from the access logs of all users.
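A minimal PyTorch sketch of the classifier described in Steps 3-4 and in Fig. 3 (RoBERTa encoder, BiLSTM feature extractor, dropout, Softmax over risk classes). The checkpoint name, the LSTM hidden size, and the two-class output are assumptions for illustration; only the 768-dimensional encoder output is taken from the patent:

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class RobertaBiLstmClassifier(nn.Module):
    def __init__(self, checkpoint="hfl/chinese-roberta-wwm-ext",
                 lstm_hidden=256, num_classes=2, dropout=0.1):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(checkpoint)   # 768-dim outputs
        self.bilstm = nn.LSTM(input_size=768, hidden_size=lstm_hidden,
                              batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(2 * lstm_hidden, num_classes)

    def forward(self, input_ids, attention_mask):
        # Character-level contextual embeddings from RoBERTa.
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        # Forward + backward LSTM over the sequence of embeddings.
        lstm_out, _ = self.bilstm(hidden)
        # Use the representation at the [CLS] position (index 0) for classification.
        cls_repr = self.dropout(lstm_out[:, 0, :])
        logits = self.classifier(cls_repr)
        return torch.softmax(logits, dim=-1)   # class probabilities per log
```

Given a batch produced by the tokenizer sketch above, `model(encoded["input_ids"], encoded["attention_mask"])[:, 1]` would then be read as the per-log probability of the risky class (assuming index 1 denotes "at risk").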
Specifically, after the anomaly detection model is constructed, the network access characteristics of each individual user in the model and the access status of each node and path in the user behavior network model are used as analysis indicators to analyze every user and node in the user behavior network model, so as to detect abnormal users and abnormal nodes.
With this method, after preprocessing the raw log data, the required data is extracted and a user behavior network model is built on top of it. When analyzing the logs, the user network access characteristics in the user behavior network model and the status of each node are used as analysis indicators, so that abnormal-log detection can be performed quantitatively from the log data; through cosine-similarity detection, network logs can be analyzed quickly, efficiently, and scientifically.
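A minimal sketch of the cosine-similarity check mentioned here, comparing a log's representation vector against a reference of normal behavior (the reference vector and the threshold are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def cosine_anomaly_score(log_vector: torch.Tensor,
                         normal_center: torch.Tensor) -> float:
    """Return 1 - cosine similarity: larger means further from normal behavior."""
    return 1.0 - F.cosine_similarity(log_vector, normal_center, dim=-1).item()

# Hypothetical 768-dimensional representation vectors.
log_vec = torch.randn(768)
normal_center = torch.randn(768)   # e.g. mean vector of known-normal logs
THRESHOLD = 0.5                    # assumed decision threshold

if cosine_anomaly_score(log_vec, normal_center) > THRESHOLD:
    print("flag as potentially abnormal")
```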
Referring to Fig. 2, the RoBERTa model training procedure includes:
Step 5: manually annotate the risk level of the collected data set.
Step 6: preprocess the labeled data set and remove irrelevant sequences.
Step 7: split the data set into a training set (80%) and a test set (20%), while ensuring that the amounts of data for the different risk types are consistent between the training set and the test set.
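A minimal sketch of this stratified 80/20 split using scikit-learn (the sample logs and labels are illustrative placeholders):

```python
from sklearn.model_selection import train_test_split

# Placeholder data: preprocessed log strings and their risk-type labels.
logs = [f"GET /page{i} 200" for i in range(8)] + \
       [f"GET /admin/../etc/passwd?x={i} 404" for i in range(8)]
labels = [0] * 8 + [1] * 8

train_logs, test_logs, train_labels, test_labels = train_test_split(
    logs, labels,
    test_size=0.2,        # 20% test set
    stratify=labels,      # keep the per-risk-type proportions identical
    random_state=42,
)
```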
Step 8: add the [CLS] and [SEP] characters to the data set. The first vector of each log is the [CLS] token, which can be used for the downstream network log classification task, and the sentence-final [SEP] token is used as a separator between different logs; since log classification is a sentence-level classification problem, i.e. the input is a single sentence, only one sentence vector is used.
Step 9: convert the logs into 768-dimensional high-dimensional vectors through the pre-trained RoBERTa model, where each log contains multiple word embedding vectors and a word embedding vector is the static encoding of each word.
Step 10: feed the high-dimensional vectors into the BiLSTM network for training and output the representation vectors.
Step 11: pass the [CLS] vector through the Softmax normalization function to compute the probabilities of the different classes.
Step 12: select the optimal model via the dropout function.
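A compressed training-loop sketch covering Steps 9-12, reusing the `RobertaBiLstmClassifier` sketched earlier; dropout is active during training and the checkpoint with the best validation loss is kept. The optimizer choice, learning rate, epoch count, and validation criterion are assumptions, and the data loaders are assumed to exist:

```python
import torch
import torch.nn as nn

model = RobertaBiLstmClassifier()                 # from the earlier sketch
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
criterion = nn.NLLLoss()                          # model already outputs probabilities
best_val_loss = float("inf")

# train_loader / val_loader: assumed PyTorch DataLoaders yielding
# (input_ids, attention_mask, labels) batches built from the tokenized logs.
for epoch in range(3):
    model.train()                                 # dropout enabled during training
    for input_ids, attention_mask, labels in train_loader:
        optimizer.zero_grad()
        probs = model(input_ids, attention_mask)
        loss = criterion(torch.log(probs), labels)
        loss.backward()
        optimizer.step()

    model.eval()                                  # dropout disabled for evaluation
    with torch.no_grad():
        val_loss = sum(criterion(torch.log(model(i, m)), y).item()
                       for i, m, y in val_loader)
    if val_loss < best_val_loss:                  # keep the best checkpoint
        best_val_loss = val_loss
        torch.save(model.state_dict(), "best_roberta_log_model.pt")
```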
Referring to Fig. 3, the RoBERTa model is connected using Transformer encoder blocks, since it is a typical bidirectional encoding model. In a Transformer block, the data first passes through the multi-head attention module to obtain a weighted feature vector. In the attention mechanism, each character has three different vectors, namely the Query vector (Q), Key vector (K), and Value vector (V), each of length 64. Specifically, the model includes, connected in sequence, a segmentation input layer 19, a splitting layer 18, a Transformer encoder block 17, a vector output layer 16, a forward LSTM 15, a backward LSTM 14, and a hidden layer 13. Specifically:
Hidden layer 13 of the RoBERTa model: unlike a simple weighted combination of RoBERTa and BiLSTM, the present invention uses the character vectors generated by RoBERTa as the character embedding layer in the upstream part, and uses BiLSTM as the feature extractor in the downstream part to model and mine the network logs. RoBERTa dynamically constructs the character vector representations, while BiLSTM integrates the text information with the sequential characteristics of the sentence. Combining the two yields more complex semantic features and a more accurate semantic representation.
Backward LSTM 14 of the RoBERTa model: thanks to its linear structure, a unidirectional LSTM can easily capture the sequential information of text, but it cannot encode information running from back to front, whereas a BiLSTM can.
Forward LSTM 15 of the RoBERTa model: the BiLSTM is the combination of the forward LSTM 15 and the backward LSTM 14. The BiLSTM is adopted as the language model with long-term sequence information to generate the log mining results, with the character-vectorized representations from the RoBERTa output layer used as the input of the BiLSTM layer.
The vector output layer 16 of the RoBERTa model outputs the high-dimensional representation vectors.
Transformer encoder block 17 of the RoBERTa model; its specific structure is shown in Fig. 4. The Transformer is the core structure of RoBERTa: each Trm in the RoBERTa structure corresponds to one Transformer block on the right. Each Transformer block consists of two sub-layers, a multi-head self-attention mechanism and a fully connected feed-forward network, and each sub-layer adds residual connections and normalization. The Transformer is the model proposed by Google to solve Seq2Seq problems; it replaces the LSTM structure entirely with attention, exploits the attention-mechanism idea to the fullest, and has achieved substantial progress in machine translation. Because the Transformer model does not use an LSTM structure to model temporal information, a position embedding layer is added during encoding to compensate for the self-attention mechanism's inability to capture temporal information. The Transformer encoder block 17 includes, connected stage by stage, an input module 25, a weighting module 24, a multi-head attention mechanism module 23, a second residual module 22, an Encoder module 21, and a first residual module 20. Specifically:
First residual module 20 of the Transformer encoder block: residual addition. The residual is introduced to address the network degradation caused by depth; in fact, many experiments show that the residual here contributes greatly to network performance. Norm is the normalization module, which normalizes the output values.
The Encoder module 21 of the Transformer model is formed by stacking N sub-layers with identical structure. Each sub-layer is a combination of two network parts, a Multi-Head Self-Attention Network (MSAN) and a fully connected Feed-Forward Network (FFN). Moreover, the output of each layer is processed with layer normalization and residual connections to handle the degradation problem that arises when stacking multi-layer networks.
The second residual module 22 is the same as the first residual module 20.
The starting point of the multi-head attention mechanism module 23 is to expand the feature representation space by performing attention computations on the input in different subspaces, thereby modeling the text at a deeper level. It is implemented by applying multiple sets of linear transformations to Q, K, and V, computing self-attention for each set separately, and then concatenating all the results. The multi-head self-attention network used in the encoder adds a scaling factor and the multi-head mechanism on top of the traditional self-attention mechanism. The scaling factor is a correction to the traditional attention computation that prevents problems such as excessively high dimensionality making the results too large and the gradients too small.
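For reference, the scaled dot-product attention that this scaling factor refers to is conventionally written as follows, with d_k = 64 being the Key-vector length given above:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V,
\qquad
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O}
```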
Weighting module 24: in the Transformer block, the data first passes through the multi-head attention module to obtain a weighted feature vector. In the attention mechanism, each character has three different vectors, the Query vector (Q), Key vector (K), and Value vector (V), each of length 64.
The input module 25 implements the input word embedding.
After word embedding, position embedding, and segment embedding, the splitting layer 18 splits the high-dimensional vector into characters, dividing the input text data into individual characters (E1, E2, ..., En), and then stores the individual characters in a vocabulary, i.e. each character has a corresponding unique identifier (denoted [character: ID]).
After preprocessing such as word segmentation, the log data with the [CLS] and [SEP] characters added is fed in through the segmentation input layer 19.
A RoBERTa-based network log security detection system comprises a data collection module, a log segmentation module, a network log security detection module, a training module, and a database. The data collection module collects device information and the corresponding log files from the network environment and saves the collected data to the database; the log segmentation module preprocesses the data; the network log security detection module is based on the RoBERTa model, which uses a bidirectional Transformer network structure as its encoder and a Softmax classifier to obtain the probability that a log is at risk; the training module trains and updates the network log security detection module and selects the optimal model via the dropout function; the database stores the log data. The system includes all the technical features of the abnormal-user detection method described above.
It should be pointed out that, since RoBERTa is a well-known model, this embodiment only elaborates on the improvements and innovations; common knowledge in the field that is not elaborated here is not repeated.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211178487.3A CN115587007A (en) | 2022-09-26 | 2022-09-26 | RoBERTa-based weblog security detection method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211178487.3A CN115587007A (en) | 2022-09-26 | 2022-09-26 | RoBERTa-based weblog security detection method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115587007A true CN115587007A (en) | 2023-01-10 |
Family
ID=84773028
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211178487.3A Pending CN115587007A (en) | 2022-09-26 | 2022-09-26 | Robertta-based weblog security detection method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115587007A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116915511A (en) * | 2023-09-13 | 2023-10-20 | 中移(苏州)软件技术有限公司 | Information processing method, device, equipment and storage medium |
CN118784360A (en) * | 2024-08-12 | 2024-10-15 | 北京弘勤安技术有限公司 | A network security detection individual system based on BERT |
-
2022
- 2022-09-26 CN CN202211178487.3A patent/CN115587007A/en active Pending
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116915511A (en) * | 2023-09-13 | 2023-10-20 | 中移(苏州)软件技术有限公司 | Information processing method, device, equipment and storage medium |
CN116915511B (en) * | 2023-09-13 | 2023-12-08 | 中移(苏州)软件技术有限公司 | Information processing method, device, equipment and storage medium |
CN118784360A (en) * | 2024-08-12 | 2024-10-15 | 北京弘勤安技术有限公司 | A network security detection individual system based on BERT |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20220197923A1 (en) | Apparatus and method for building big data on unstructured cyber threat information and method for analyzing unstructured cyber threat information | |
US20210037032A1 (en) | Methods and systems for automated parsing and identification of textual data | |
Zhou et al. | Deepsyslog: Deep anomaly detection on syslog using sentence embedding and metadata | |
Huang et al. | Protocol reverse-engineering methods and tools: A survey | |
CN114553983A (en) | An efficient industrial control protocol analysis method based on deep learning | |
CN115587007A (en) | Robertta-based weblog security detection method and system | |
CN113194064B (en) | Webshell detection method and device based on graph convolution neural network | |
CN118672990B (en) | Log analysis method, device, program product and medium | |
CN116167370B (en) | Anomaly detection method for distributed systems based on log spatiotemporal feature analysis | |
CN117763144A (en) | Log abnormality detection method and terminal | |
CN116561748A (en) | Log abnormality detection device for component subsequence correlation sensing | |
CN117874662A (en) | Microservice log anomaly detection method based on graph mode | |
CN115344414A (en) | Log anomaly detection method and system based on LSTM-Transformer | |
Yu et al. | Self-supervised log parsing using semantic contribution difference | |
CN118611962A (en) | A log analysis and APT attack tracing method based on unsupervised learning | |
CN118227361A (en) | Unsupervised log anomaly detection method independent of log analyzer | |
Pal et al. | DLME: distributed log mining using ensemble learning for fault prediction | |
Zhan et al. | Toward automated field semantics inference for binary protocol reverse engineering | |
CN117135038A (en) | Network fault monitoring methods, devices and electronic equipment | |
CN114707508A (en) | Event detection method based on multi-hop neighbor information fusion of graph structure | |
Chen et al. | Unsupervised Anomaly Detection Based on System Logs. | |
Li et al. | Logspy: System log anomaly detection for distributed systems | |
CN104636404A (en) | Method and device for generating large-scale data used for testing | |
CN112256838A (en) | Similar domain name searching method and device and electronic equipment | |
CN118013440A (en) | An abnormal detection method for personal sensitive information desensitization operation based on event graph |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |