CN114444485A - Cloud environment network equipment entity identification method - Google Patents


Info

Publication number
CN114444485A
Authority
CN
China
Prior art keywords
character
cloud environment
sequence
network equipment
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210078810.3A
Other languages
Chinese (zh)
Other versions
CN114444485B (en)
Inventor
陈兴蜀
郑涛
袁磊
刘朋
黄铁脉
廖志红
宋可儿
王海舟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University
Priority to CN202210078810.3A
Publication of CN114444485A
Application granted
Publication of CN114444485B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a cloud environment network equipment entity identification method. First, raw data of a text corpus corresponding to cloud environment network equipment information is collected. A small number of features are manually formulated into a feature template according to the characteristics of the cloud environment network equipment information, and input sentences are converted into corresponding word vector sequences. Then, character-level features are extracted for each word of the converted word vector sequence. The character-level feature vectors and the local context features are combined as input information, and semantic features of the input information are extracted. Finally, according to the obtained semantic features, a conditional random field labels each character with a cloud environment network equipment entity tag, completing the identification of cloud environment network equipment entities. The method can identify and classify network equipment entities in a cloud environment, facilitating rapid reconstruction of the internal system architecture from outside the closed cloud environment.

Description

Cloud environment network equipment entity identification method
Technical Field
The invention relates to the technical field of cloud environment network equipment entity identification, and in particular to a cloud environment network equipment entity identification method based on a convolutional neural network, a bidirectional long short-term memory neural network, and a conditional random field.
Background
Cloud environment: a cloud environment is a distributed computing system based on virtualization technology. Its system architecture can be divided into three layers: the infrastructure layer, the platform layer, and the software service layer, corresponding respectively to Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS). IaaS mainly comprises computer servers, communication equipment, storage equipment, and the like, which virtualization technology unifies into computing, storage, and network resources. PaaS mainly provides users with a complete supporting platform for developing, running, and operating application software. SaaS provides users with software services over the internet: users only pay application usage fees, while the software and hardware consumption and maintenance arising during application development are managed by the cloud service provider.
Cloud environment network equipment entity identification: cloud environment network equipment mainly refers to computing resource equipment (CPU, memory, etc.), storage resource equipment (databases, etc.), and network resource equipment (firewalls, data center switches, VPNs, routers, etc.). The entity identification here is domain-specific named entity recognition, whose main task is to identify the various network equipment entities of a cloud environment from network security text data, so as to identify and classify them.
In recent years, deep learning methods based on neural networks have performed well in general-domain named entity recognition: a multi-layer network structure extracts features from the data, which are then identified and classified. Common deep neural network structures include the Recurrent Neural Network (RNN), the Convolutional Neural Network (CNN), and the Long Short-Term Memory network (LSTM). Although deep neural networks perform well in general-domain named entity recognition, in a cloud environment physical nodes and virtual devices are difficult to distinguish after layered virtualization, dynamically changing cloud resources are difficult to locate, and various in-cloud security protections intervene. As a result, the full picture of cloud environment network equipment entities is difficult to characterize through conventional port scanning, IP scanning, and penetration measures, which may even be quickly detected and blocked by the cloud's security protections. It is therefore necessary to process the network security text information obtained by network probing with a deep neural network, and thereby identify and classify the various network equipment entity information in the cloud environment.
Disclosure of Invention
In view of the above problems, an object of the present invention is to provide a cloud environment network device entity identification method based on a deep neural network, implemented with a convolutional neural network, a long short-term memory neural network, and a conditional random field, which achieves cloud environment network device entity identification while ensuring high accuracy. The technical scheme is as follows:
A cloud environment network equipment entity identification method comprises:
Step 1: collecting raw data of the text corpus corresponding to the cloud environment network equipment information;
Step 2: manually formulating a small number of features to form a feature template according to the characteristics of the cloud environment network equipment information; then extracting local context information, and converting the input sentence into the corresponding word vector sequence using a pre-trained word vector file;
Step 3: performing convolution and pooling on each word of the converted word vector sequence with a convolutional neural network, and extracting the character-level features of each word;
Step 4: combining the character-level feature vectors and the local context features as input information, feeding them into a bidirectional long short-term memory neural network for training, and extracting the semantic features of the input information;
Step 5: according to the obtained semantic features of the input information, labeling each character with a cloud environment network equipment entity tag using a conditional random field, and marking the cloud environment network equipment entity information in the sentence sequence, so as to obtain the optimal label sequence and complete the identification of the cloud environment network equipment entities.
Further, the step 2 specifically comprises:
step 2.1: comprehensively considering the category information and the specific position information of the collected text data equipment, and manually making a characteristic template as follows:
T[-2,0],T[-1,0],T[0,0],T[1,0],T[2,0];
wherein T represents the current character, the first number in brackets represents the position relative to the current character, and the second number represents the column index of the selected specific feature;
step 2.2: a set of feature functions is formulated according to the features:
$F_t(L_{loc-1}, L_{loc}, w, loc)$
wherein $L_{loc}$ represents the label of the current character, $L_{loc-1}$ represents the previous label, w represents the current character, and loc represents the current position;
step 2.3: assigning a weight $\delta_{loc}$ to every feature function; if a feature function is activated, its weight $\delta_{loc}$ is added to the final cumulative value, and the higher the feature score $F_t$, the more accurate the prediction result;
step 2.4: training the original corpus of the cloud environment network equipment with the CBOW model of word2vec to obtain the corresponding word vector file.
Further, the step 3 specifically includes:
step 3.1: setting different character vectors for characters of different types of equipment in a word embedding mode, and distinguishing the case and the type of the characters;
step 3.2: converting each word of the input sequence into a corresponding character vector, and filling placeholders on the basis of the longest word to make all character vector matrixes consistent in size;
step 3.3: extracting local features of the character vector matrix with the convolution layers of the convolutional neural network, reducing dimensionality with the pooling layers, and thereby extracting character-level features.
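Step 3 can be sketched as a one-dimensional convolution over the padded character-vector matrix followed by max-over-time pooling, which yields one fixed-size character-level feature vector per word regardless of word length. The dimensions, random initialization, and function name below are illustrative assumptions, not values from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)

def char_cnn_features(char_vecs, filters, width=3):
    """Convolve filters of the given width over the character-vector matrix
    (one row per character) and max-pool over time, producing a single
    fixed-size character-level feature vector for the word."""
    n_chars, dim = char_vecs.shape
    # Zero-pad both ends so even short words yield at least one window.
    pad = np.zeros((width - 1, dim))
    padded = np.vstack([pad, char_vecs, pad])
    windows = [padded[i:i + width].ravel() for i in range(len(padded) - width + 1)]
    conv = np.array(windows) @ filters.T   # (n_windows, n_filters)
    return conv.max(axis=0)                # max-over-time pooling

char_dim, n_filters, width = 8, 5, 3
filters = rng.normal(size=(n_filters, width * char_dim))
word = rng.normal(size=(6, char_dim))      # a 6-character word
feat = char_cnn_features(word, filters, width)
print(feat.shape)                          # one value per filter
```

Because the max is taken over the time axis, words of different lengths map to equally sized vectors, which is what allows them to be concatenated with word embeddings in the next step.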
Further, the step 4 specifically includes:
taking the extracted character-level features and the local context information of the input feature information jointly as input, and concatenating the semantic feature sequence $\overrightarrow{h_h}$ output by the forward LSTM at time h with the semantic feature vector sequence $\overleftarrow{h_h}$ of the backward LSTM to obtain the complete semantic feature vector sequence $h_h = [\overrightarrow{h_h}; \overleftarrow{h_h}]$; preprocessing is completed using the tanh activation function, and the output of the LSTM hidden layer is obtained as:
$Out_h = \tanh(w_h h_h + T_k)$
wherein $Out_h$ represents the output value at time h computed by the LSTM hidden layer, $w_h$ is the corresponding weight, and $T_k$ is a bias vector.
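The projection $Out_h = \tanh(w_h h_h + T_k)$ concatenates the two directions before a tanh transform. A small numpy sketch under assumed dimensions follows; the LSTM recurrences themselves are omitted and the hidden states are random stand-ins, so this only illustrates the concatenation and projection step.

```python
import numpy as np

rng = np.random.default_rng(1)

def bilstm_output(h_fwd, h_bwd, w_h, T_k):
    """Concatenate forward and backward hidden states at each time step and
    apply the projection Out_h = tanh(w_h . h_h + T_k)."""
    h = np.concatenate([h_fwd, h_bwd], axis=-1)  # (T, 2*hidden)
    return np.tanh(h @ w_h.T + T_k)              # (T, out_dim)

T_steps, hidden, out_dim = 4, 6, 3
h_fwd = rng.normal(size=(T_steps, hidden))       # stand-in forward states
h_bwd = rng.normal(size=(T_steps, hidden))       # stand-in backward states
w_h = rng.normal(size=(out_dim, 2 * hidden))
T_k = rng.normal(size=out_dim)
out = bilstm_output(h_fwd, h_bwd, w_h, T_k)
print(out.shape)
```

The tanh keeps every output component in (-1, 1), giving the CRF layer bounded emission inputs.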
Further, the step 5 specifically includes:
step 5.1: defining the input sequence as $S = [s_1, \ldots, s_n]$, where $s_n$ is the input vector of the n-th character, and $L = [l_1, \ldots, l_n]$ is the predicted label sequence of S;
step 5.2: calculating the label score corresponding to each character C at the moment T:
$Score_T = W_c \times Out_T + b_c + F_S$
wherein $Score_T$ is the total label score of each character at time T; $W_c \times Out_T$ is the label score computed by the LSTM hidden layer, with $W_c$ the weight matrix corresponding to each character C and $Out_T$ the output value of character C at time T computed by the LSTM hidden layer; $b_c$ is the bias vector of the LSTM hidden-layer input gate for character C; and $F_S$ is the weight value of the feature template over the input sequence;
step 5.3: calculating the transition score between labels:
defining a transition matrix U and adding two special labels, start and end, to the e labels; the total score of a label sequence is determined by the transition of each label, computed as:
$X(S, L) = \sum_{i=0}^{e} \left( U_{L_i, L_{i+1}} + Score(L_{i+1}) \right)$
wherein $U_{L_i, L_{i+1}}$ is the transition score from label $L_i$ to label $L_{i+1}$; $Score(L_{i+1})$ is the score of the current label $L_{i+1}$; $L_0$ and $L_e$ are the two special start and end labels, whose label scores are 0;
step 5.4: calculating the generation probability of the label sequence:
$P(L \mid S) = \dfrac{\exp(X(S, L))}{\sum_{L' \in X_L} \exp(X(S, L'))}$
wherein L' represents a candidate label sequence of the input sequence excluding the start and end labels, and $X_L$ represents the space formed by the total score X and the predicted labels L;
step 5.5: obtaining a loss function by utilizing maximum likelihood estimation, so that the probability of generating a correct label sequence L is maximum under the condition of an input sequence S; the loss function is calculated by the formula:
$loss = \log \sum_{L'} \exp(X(S, L')) - X(S, L)$
wherein the summation accumulates $\exp(X(S, L'))$ over all candidate label sequences L';
step 5.6: obtaining the highest-scoring sequence $L^*$ through a gradient descent learning algorithm and the Viterbi algorithm as the final result of the cloud environment network device entity identification, namely:
$L^* = \arg\max_{L'} X(S, L')$.
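The argmax in step 5.6 is computed with the Viterbi algorithm over the per-character label scores and the transition matrix U. The sketch below is a standard Viterbi decoder under toy scores of my own choosing, not the patent's trained parameters; the start/end labels are omitted for brevity.

```python
import numpy as np

def viterbi(emissions, transitions):
    """Return the highest-scoring label sequence.

    emissions:   (T, n_labels) per-character label scores (the Score_T values)
    transitions: (n_labels, n_labels) scores U[i, j] for moving label i -> j
    """
    T, n = emissions.shape
    score = emissions[0].copy()     # best score of any path ending in each label
    back = np.zeros((T, n), dtype=int)
    for t in range(1, T):
        # cand[i, j]: best path ending in i at t-1, then transitioning to j.
        cand = score[:, None] + transitions + emissions[t][None, :]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    best = [int(score.argmax())]
    for t in range(T - 1, 0, -1):   # follow back-pointers to recover the path
        best.append(int(back[t, best[-1]]))
    return best[::-1]

# Toy example: 3 characters, 2 labels; transitions favour alternating labels.
em = np.array([[2.0, 0.0], [0.0, 1.0], [1.5, 0.0]])
U = np.array([[-1.0, 1.0], [1.0, -1.0]])
print(viterbi(em, U))  # -> [0, 1, 0]
```

Dynamic programming keeps the search linear in sequence length, versus the exponential cost of scoring every candidate label sequence.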
the invention has the beneficial effects that: according to the method, a cloud environment characteristic template is manually established, and data text labeling of a cloud environment network equipment entity is completed through a convolutional neural network, a bidirectional long-short term memory neural network and a conditional random field; constructing a complete cloud environment network equipment entity identification method, and providing a theoretical method for comprehensively and completely depicting a cloud environment network equipment entity; the method can be used for effectively realizing the identification and classification of the network equipment entities in the cloud environment, and is convenient for quickly reconstructing the internal system architecture outside the closed cloud environment.
Drawings
Fig. 1 is a schematic diagram of a cloud environment network device entity identification method according to the present invention.
FIG. 2 is a schematic diagram of a neural network model designed by the present invention.
Detailed Description
The invention is described in further detail below with reference to the figures and specific embodiments. The method for identifying the cloud environment network equipment entity comprises the following steps of firstly collecting a data information text of the cloud environment network equipment entity, manually formulating a characteristic template by combining with proprietary information of cloud equipment, and then transmitting the characteristic template into a convolutional neural network and a bidirectional long-short term memory neural network for characteristic information extraction:
(1) and collecting original data of a text corpus corresponding to the cloud environment network equipment information.
Network text data related to cloud environment network equipment entities was obtained from various network security websites and cloud vendors, 18920 items in total; 10000 items were selected as unlabeled data for training the word vector file, and the remaining 8920 items were used as the labeled corpus for cloud environment network equipment entity identification.
(2) Manually formulating a small amount of characteristics to form a characteristic template according to the characteristics of the cloud environment network equipment information; and then extracting local context information, and converting the input sentence into a corresponding word vector sequence by using the pre-trained word vector file.
The main body identified by the method is the equipment name, the software version number information and the category (computing resource equipment, storage resource equipment and network resource equipment) in the text data of the cloud environment network equipment; firstly, according to the information characteristics of the class to which the equipment belongs, a small number of characteristics are manually formulated to form a characteristic template.
The specific steps are as follows: comprehensively considering the category information and specific position information of the equipment in the collected text data, the feature template is manually set as T[-2,0], T[-1,0], T[0,0], T[1,0], T[2,0]. T denotes the current character, the first number in brackets represents the position relative to the current character, and the second number denotes the column index of the selected specific feature, such as word or part of speech. A set of feature functions $F_t(L_{loc-1}, L_{loc}, w, loc)$ is formulated according to the features, where $L_{loc}$ represents the label of the current character, $L_{loc-1}$ the previous label, w the current character, and loc the current position. All feature functions are then assigned a weight $\delta_{loc}$; if a feature function is activated, its weight $\delta_{loc}$ is added to the final cumulative value, and the higher the feature score $F_t$, the more accurate the prediction result. The invention trains the original corpus of the cloud environment network equipment with the CBOW model of word2vec to obtain the corresponding word vector file.
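The CBOW model used in step 2.4 averages the context vectors to predict the center word. Below is a minimal numpy sketch of a single CBOW gradient step with full softmax; the vocabulary size, dimensions, and learning rate are illustrative, and a real system would use an off-the-shelf word2vec implementation with negative sampling rather than this toy update.

```python
import numpy as np

rng = np.random.default_rng(2)

def cbow_step(W_in, W_out, context_ids, center_id, lr=0.1):
    """One CBOW update: average the context embeddings, apply a softmax over
    the vocabulary, and take a gradient step on both embedding matrices."""
    h = W_in[context_ids].mean(axis=0)        # averaged context vector
    logits = W_out @ h
    p = np.exp(logits - logits.max())
    p /= p.sum()                              # softmax over vocabulary
    grad = p.copy()
    grad[center_id] -= 1.0                    # d(cross-entropy)/d(logits)
    W_in[context_ids] -= lr * (W_out.T @ grad) / len(context_ids)
    W_out -= lr * np.outer(grad, h)
    return float(-np.log(p[center_id]))       # cross-entropy loss

vocab, dim = 20, 8
W_in = rng.normal(scale=0.1, size=(vocab, dim))
W_out = rng.normal(scale=0.1, size=(vocab, dim))
# Repeatedly train on one (context -> center) pair; the loss should fall.
losses = [cbow_step(W_in, W_out, [1, 2, 4, 5], 3) for _ in range(50)]
print(losses[0] > losses[-1])
```

After training on the full corpus, the rows of `W_in` would serve as the pre-trained word vectors loaded in step 2.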
(3) And (4) carrying out convolution pooling on each word by utilizing a convolution neural network aiming at the converted word vector sequence, and extracting the character-level characteristics of the word.
After the word vector file corresponding to the original corpus is obtained, the invention uses a convolutional neural network to extract character-level features from it, improving model performance. The specific operations are as follows. First, different character vectors are set, via word embedding, for characters of the different categories to which the equipment belongs (computing resource equipment, storage resource equipment, and network resource equipment), distinguishing character case and type. Second, each word of the input sequence is converted into the corresponding character vectors, and placeholders are padded with reference to the longest word so that all character vector matrices have the same size. Finally, local features of the character vector matrix are extracted with the convolution layers of the convolutional neural network, dimensionality is reduced with the pooling layers, and character-level features are thereby extracted.
(4) Combining the character-level feature vector and the local context feature as input information, transmitting the input information into a bidirectional long-short term memory neural network for training, and extracting semantic features of the input information.
The invention trains the character feature vector sequence with a bidirectional long short-term memory network (BiLSTM). To improve the accuracy of cloud environment network equipment entity identification, the invention also takes the local context information of the input features as input at the same time, and concatenates the feature sequence $\overrightarrow{h_h}$ output by the forward LSTM at time h with the backward LSTM feature vector sequence $\overleftarrow{h_h}$ to obtain the complete feature vector sequence $h_h = [\overrightarrow{h_h}; \overleftarrow{h_h}]$. Preprocessing is completed with the tanh activation function, giving the hidden-layer output $Out_h = \tanh(w_h h_h + T_k)$, where $w_h$ is the corresponding weight and $T_k$ is a bias vector.
(5) And aiming at the semantic features of the obtained input information, carrying out cloud environment network equipment entity labeling on each character by using a conditional random field, and labeling cloud environment network equipment entity information in the sentence sequence, so as to obtain an optimal labeling sequence and further finish the identification of the cloud environment network equipment entity.
And (5) by labeling the hidden layer output result obtained in the step (4), converting the cloud environment network equipment entity identification problem into a character sequence labeling problem. In order to improve the accuracy of cloud environment network equipment entity identification, the invention utilizes the chain random condition field CRF to calculate the probability of the whole tag sequence, thereby obtaining the globally optimal tag sequence, namely the optimal output result of cloud environment network equipment entity identification.
Specifically, assume the input sequence of the invention is $S = [s_1, \ldots, s_n]$, where $s_n$ is the input vector of the n-th character, and $L = [l_1, \ldots, l_n]$ is the label sequence of S. To obtain the correct label for each character, assume each character in S has a corresponding score value: the larger the score, the more likely the label is correct. Clearly, the larger the label score of the input sequence S, the higher the probability that the corresponding label is correct, and the label whose accumulated score over the sequence L is highest is the final prediction for each character. The label score computed by the hidden layer of the bidirectional LSTM in step (4) and the weight computed by the feature template in step (2) are added to obtain the label score of each character C at time T, computed as:
$Score_T = W_c \times Out_T + b_c + F_S$
where $Score_T$ is the total label score of each character at time T; $W_c \times Out_T$ is the label score computed by the LSTM hidden layer, with $W_c$ the weight matrix corresponding to each character C and $Out_T$ the LSTM hidden-layer output of character C at time T; $b_c$ is the bias vector of the LSTM hidden-layer input gate for character C; and $F_S$ is the weight value of the feature template over the input sequence.
Because transitions between labels arise when the linear-chain conditional random field (CRF) computes the probability of the whole label sequence, the invention defines a transition matrix U to compute the transition score between labels, and adds two special labels, start and end, to the e labels. The total score of a label sequence is therefore determined by the transition of each label, computed as:
$X(S, L) = \sum_{i=0}^{e} \left( U_{L_i, L_{i+1}} + Score(L_{i+1}) \right)$
where S is the input sequence and L the predicted label sequence; $U_{L_i, L_{i+1}}$ is the transition score from label $L_i$ to label $L_{i+1}$; $Score(L_{i+1})$ is the score of the current label $L_{i+1}$; $L_0$ and $L_e$ are the two special start and end labels, whose scores are 0.
Given an input sequence S, the generation probability of a label sequence L can be calculated by:
$P(L \mid S) = \dfrac{\exp(X(S, L))}{\sum_{L'} \exp(X(S, L'))}$
where L' represents a candidate label sequence of the input sequence excluding the start and end labels. The invention then obtains a loss function by maximum likelihood estimation, so that the probability of generating the correct label sequence L given the input sequence S is maximized:
$loss = \log \sum_{L'} \exp(X(S, L')) - X(S, L)$
obtaining the sequence L with the highest score through a gradient descent learning algorithm and a Viterbi algorithm*As a final result of the cloud environment network device entity identification, namely:
L*=argmaxX(S,L')
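The total score X(S, L) described above sums emission and transition terms along the label path, with the start and end labels contributing only transitions. A compact sketch with toy numbers follows; the label indices and score values are illustrative, not the patent's parameters.

```python
import numpy as np

def sequence_score(emissions, transitions, labels, start, end):
    """Total score X(S, L): the emission score of each assigned label plus
    the transition scores along start -> l_1 -> ... -> l_n -> end."""
    path = [start] + list(labels) + [end]
    trans = sum(transitions[a, b] for a, b in zip(path, path[1:]))
    emit = sum(emissions[t, l] for t, l in enumerate(labels))
    return trans + emit

# Toy setup: 3 characters, 2 real labels (0, 1) plus start (2) and end (3).
em = np.array([[1.0, 0.5], [0.2, 2.0], [0.3, 0.1]])
U = np.zeros((4, 4))
U[2, 0] = 0.5   # start -> label 0
U[0, 1] = 1.0   # label 0 -> label 1
U[1, 0] = 0.8   # label 1 -> label 0
U[0, 3] = 0.2   # label 0 -> end
print(sequence_score(em, U, [0, 1, 0], start=2, end=3))
```

Exponentiating this score and normalizing over all candidate sequences yields exactly the P(L|S) softmax above, which is why maximizing X(S, L') with Viterbi also maximizes the sequence probability.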
according to the cloud environment network equipment entity identification method, various equipment entities in the cloud environment network are accurately identified from network text data by manually establishing characteristic templates of cloud environment computing resource equipment, storage resource equipment and network resource equipment and combining methods such as a convolutional neural network, a bidirectional long-short term memory neural network and a conditional random field.

Claims (5)

1. A cloud environment network equipment entity identification method, characterized by comprising:
Step 1: collecting raw data of the text corpus corresponding to the cloud environment network equipment information;
Step 2: manually formulating a small number of features to form a feature template according to the characteristics of the cloud environment network equipment information; then extracting local context information, and converting the input sentence into the corresponding word vector sequence using a pre-trained word vector file;
Step 3: performing convolution and pooling on each word of the converted word vector sequence with a convolutional neural network, and extracting the character-level features of each word;
Step 4: combining the character-level feature vectors and the local context features as input information, feeding them into a bidirectional long short-term memory neural network for training, and extracting the semantic features of the input information;
Step 5: according to the obtained semantic features of the input information, labeling each character with a cloud environment network equipment entity tag using a conditional random field, and marking the cloud environment network equipment entity information in the sentence sequence, so as to obtain the optimal label sequence and complete the identification of the cloud environment network equipment entities.
2. The cloud environment network device entity identification method according to claim 1, wherein the step 2 specifically is:
step 2.1: comprehensively considering the category information and the specific position information of the collected text data equipment, and manually making a characteristic template as follows:
T[-2,0],T[-1,0],T[0,0],T[1,0],T[2,0];
wherein T represents the current character, the first number in brackets represents the position relative to the current character, and the second number represents the column index of the selected specific feature;
step 2.2: a set of feature functions is formulated according to the features:
$F_t(L_{loc-1}, L_{loc}, w, loc)$
wherein $L_{loc}$ represents the label of the current character, $L_{loc-1}$ represents the previous label, w represents the current character, and loc represents the current position;
step 2.3: assigning a weight $\delta_{loc}$ to every feature function; if a feature function is activated, its weight $\delta_{loc}$ is added to the final cumulative value, and the higher the feature score $F_t$, the more accurate the prediction result;
step 2.4: and training the original corpus of the cloud environment network equipment by using a CBOW model of word2vec to obtain a corresponding word vector file.
3. The cloud environment network device entity identification method according to claim 1, wherein the step 3 specifically is:
step 3.1: setting different character vectors for characters of different types of equipment in a word embedding mode, and distinguishing the case and the type of the characters;
step 3.2: converting each word of the input sequence into a corresponding character vector, and filling placeholders on the basis of the longest word to make all character vector matrixes consistent in size;
step 3.3: extracting local features of the character vector matrix with the convolution layers of the convolutional neural network, reducing dimensionality with the pooling layers, and thereby extracting character-level features.
4. The cloud environment network device entity identification method according to claim 1, wherein the step 4 specifically is:
taking the extracted character-level features and the local context information of the input feature information jointly as input, and concatenating the semantic feature sequence $\overrightarrow{h_h}$ output by the forward LSTM at time h with the semantic feature vector sequence $\overleftarrow{h_h}$ of the backward LSTM to obtain the complete semantic feature vector sequence $h_h = [\overrightarrow{h_h}; \overleftarrow{h_h}]$; preprocessing is completed using the tanh activation function, and the output of the LSTM hidden layer is obtained as:
$Out_h = \tanh(w_h h_h + T_k)$
wherein $Out_h$ represents the output value at time h computed by the LSTM hidden layer, $w_h$ is the corresponding weight, and $T_k$ is a bias vector.
5. The cloud environment network device entity identification method according to claim 4, wherein the step 5 specifically comprises:
step 5.1: defining the input sequence as $S = [s_1, \ldots, s_n]$, where $s_n$ is the input vector of the n-th character, and $L = [l_1, \ldots, l_n]$ is the predicted label sequence of S;
step 5.2: calculating the label score corresponding to each character C at time T:

$$Score_T = W_c \times Out_T + b_c + F_S$$

where $Score_T$ represents the total label score of each character at time T; $W_c \times Out_T$ is the label score computed by the LSTM hidden layer, $W_c$ being the weight matrix corresponding to each character C and $Out_T$ being the output value of each character C at time T computed by the LSTM hidden layer; $b_c$ is the bias vector of the input gate of the LSTM hidden layer for character C; $F_S$ represents the weight of the feature templates in the input sequence;
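The step-5.2 score can be illustrated directly; all tensors below are random stand-ins with invented sizes (4 hidden units, 5 candidate labels):

```python
import numpy as np

# Label score for one character at time T: Score_T = W_c x Out_T + b_c + F_S.
rng = np.random.default_rng(3)
n_labels, hidden = 5, 4
Out_T = rng.normal(size=hidden)            # LSTM hidden-layer output at time T
W_c = rng.normal(size=(n_labels, hidden))  # per-character weight matrix
b_c = rng.normal(size=n_labels)            # input-gate bias vector
F_S = rng.normal(size=n_labels)            # feature-template weights

Score_T = W_c @ Out_T + b_c + F_S          # one score per candidate label
```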
step 5.3: calculating the transition scores between labels:

defining a transition matrix U and adding two special start and end labels to the e labels; the total score of a label sequence is determined by the transitions of its labels, and is calculated as:

$$X(S, L) = \sum_{i=0}^{e} \left( U_{L_i, L_{i+1}} + Score(L_{i+1}) \right)$$

where $U_{L_i, L_{i+1}}$ represents the transition score from label $L_i$ to label $L_{i+1}$; $Score(L_{i+1})$ represents the score of the current label $L_{i+1}$; $L_0$ and $L_e$ represent the two special start and end labels, whose corresponding score is 0;
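A sketch of the step-5.3 total score X(S, L) for one label path, under the assumption that the start and end labels carry no emission score; the label count, sequence length, and all score values are illustrative:

```python
import numpy as np

# Total score of one label sequence: emission scores plus transition scores,
# with special start/end labels appended at the ends of the path.
n_labels = 3                               # "real" labels
start, end = n_labels, n_labels + 1        # two special labels
rng = np.random.default_rng(4)
U = rng.normal(size=(n_labels + 2, n_labels + 2))  # transition matrix
scores = rng.normal(size=(4, n_labels))    # per-position label scores (4 chars)

def sequence_score(labels):
    """X(S, L) for a label path, padded with the start/end labels."""
    path = [start] + list(labels) + [end]
    total = 0.0
    for i in range(len(path) - 1):
        total += U[path[i], path[i + 1]]   # transition score
        if i + 1 < len(path) - 1:          # start/end labels score 0
            total += scores[i, path[i + 1]]  # emission score
    return total

X = sequence_score([0, 2, 1, 0])
```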
step 5.4: calculating the generation probability of the label sequence:

$$P(L \mid S) = \frac{e^{X(S, L)}}{\sum_{L' \in XL} e^{X(S, L')}}$$

where $L'$ represents a candidate label sequence, excluding the start and end labels, of the input sequence; $XL$ represents the space composed of the total score $X$ and the predicted labels $L$;
step 5.5: obtaining the loss function by maximum likelihood estimation, so that the probability of generating the correct label sequence L is maximized given the input sequence S; the loss function is calculated as:

$$loss = \log P(L \mid S) = X(S, L) - \log\left( \underset{L' \in XL}{add}\; e^{X(S, L')} \right)$$

where $add$ represents the accumulation function;
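On a toy instance, steps 5.4 and 5.5 can be checked by brute force: enumerate every label path, accumulate the exponentiated scores (the "add" function), and take the negative log-probability of a gold path. This is tractable only at toy sizes; real CRFs compute the sum with the forward algorithm. Sizes and scores below are invented:

```python
import numpy as np
from itertools import product

# Brute-force P(L|S) and the maximum-likelihood loss on a toy CRF
# (start/end labels omitted here for brevity).
n_labels, n_chars = 2, 3
rng = np.random.default_rng(5)
U = rng.normal(size=(n_labels, n_labels))      # transition scores
scores = rng.normal(size=(n_chars, n_labels))  # per-character label scores

def path_score(path):
    s = sum(scores[t, l] for t, l in enumerate(path))
    s += sum(U[path[i], path[i + 1]] for i in range(len(path) - 1))
    return s

paths = list(product(range(n_labels), repeat=n_chars))
Z = sum(np.exp(path_score(p)) for p in paths)  # the "add" accumulation

gold = (0, 1, 0)                               # illustrative gold sequence
prob = np.exp(path_score(gold)) / Z            # P(L | S)
loss = -np.log(prob)                           # negative log-likelihood
```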
step 5.6: obtaining the sequence $L^*$ with the highest score through the gradient descent learning algorithm and the Viterbi algorithm as the final result of the cloud environment network device entity identification, namely:

$$L^* = \underset{L'}{\arg\max}\, X(S, L')$$
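The step-5.6 decoding can be sketched as a standard Viterbi dynamic program over toy scores; a brute-force search confirms the arg max at this size. The transition and emission scores are random stand-ins, not the patent's trained values:

```python
import numpy as np
from itertools import product

# Viterbi decoding: recover the highest-scoring label sequence L*.
n_labels, n_chars = 3, 4
rng = np.random.default_rng(6)
U = rng.normal(size=(n_labels, n_labels))      # transition scores
scores = rng.normal(size=(n_chars, n_labels))  # per-character label scores

def viterbi(scores, U):
    n, k = scores.shape
    best = scores[0].copy()                    # best score ending in each label
    back = np.zeros((n, k), dtype=int)         # backpointers
    for t in range(1, n):
        cand = best[:, None] + U + scores[t]   # cand[prev, cur] path scores
        back[t] = cand.argmax(axis=0)
        best = cand.max(axis=0)
    path = [int(best.argmax())]
    for t in range(n - 1, 0, -1):              # follow backpointers
        path.append(int(back[t][path[-1]]))
    return path[::-1], float(best.max())

L_star, best_score = viterbi(scores, U)

# Brute-force check of the arg max at this toy size.
def path_score(p):
    return sum(scores[t, l] for t, l in enumerate(p)) + \
           sum(U[p[i], p[i + 1]] for i in range(len(p) - 1))
brute = max(product(range(n_labels), repeat=n_chars), key=path_score)
```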
CN202210078810.3A 2022-01-24 2022-01-24 Cloud environment network equipment entity identification method Active CN114444485B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210078810.3A CN114444485B (en) 2022-01-24 2022-01-24 Cloud environment network equipment entity identification method


Publications (2)

Publication Number Publication Date
CN114444485A true CN114444485A (en) 2022-05-06
CN114444485B CN114444485B (en) 2023-06-06

Family

ID=81368979


Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106569998A (en) * 2016-10-27 2017-04-19 浙江大学 Text named entity recognition method based on Bi-LSTM, CNN and CRF
CN109117472A (en) * 2018-11-12 2019-01-01 新疆大学 A kind of Uighur name entity recognition method based on deep learning
KR20190004541A (en) * 2017-07-04 2019-01-14 심동희 Apparatus for managing using could based platform of animal hospital management method for recognizing animal individual with based noseprint constructed with based it
CN109871535A (en) * 2019-01-16 2019-06-11 四川大学 A kind of French name entity recognition method based on deep neural network
CN111368996A (en) * 2019-02-14 2020-07-03 谷歌有限责任公司 Retraining projection network capable of delivering natural language representation
CN111444726A (en) * 2020-03-27 2020-07-24 河海大学常州校区 Method and device for extracting Chinese semantic information of long-time and short-time memory network based on bidirectional lattice structure
CN113190842A (en) * 2018-09-05 2021-07-30 甲骨文国际公司 Context-aware feature embedding using deep recurrent neural networks and anomaly detection of sequential log data


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
WANWAN CAO et al.: "Power entity identification method in cloud environment", pages 2400-2405 *
周胜利 et al.: "Research on a public security reputation model for cloud computing users based on scorecard and random forest", vol. 39, no. 05, pages 143-152 *


Similar Documents

Publication Publication Date Title
CN108920622B (en) Training method, training device and recognition device for intention recognition
CN110097085B (en) Lyric text generation method, training method, device, server and storage medium
WO2021179570A1 (en) Sequence labeling method and apparatus, and computer device and storage medium
CN110619051B (en) Question sentence classification method, device, electronic equipment and storage medium
CN111159485B (en) Tail entity linking method, device, server and storage medium
CN111666427A (en) Entity relationship joint extraction method, device, equipment and medium
CN110347790B (en) Text duplicate checking method, device and equipment based on attention mechanism and storage medium
CN113515632B (en) Text classification method based on graph path knowledge extraction
CN113312452A (en) Chapter-level text continuity classification method based on multi-task learning
CN113255320A (en) Entity relation extraction method and device based on syntax tree and graph attention machine mechanism
CN113051914A (en) Enterprise hidden label extraction method and device based on multi-feature dynamic portrait
CN113590810B (en) Abstract generation model training method, abstract generation device and electronic equipment
CN112926324A (en) Vietnamese event entity recognition method integrating dictionary and anti-migration
CN110334186A (en) Data query method, apparatus, computer equipment and computer readable storage medium
CN111881256B (en) Text entity relation extraction method and device and computer readable storage medium equipment
CN112650833A (en) API (application program interface) matching model establishing method and cross-city government affair API matching method
CN115408488A (en) Segmentation method and system for novel scene text
CN111368066B (en) Method, apparatus and computer readable storage medium for obtaining dialogue abstract
CN113609819B (en) Punctuation mark determination model and determination method
CN113486143A (en) User portrait generation method based on multi-level text representation and model fusion
CN113076744A (en) Cultural relic knowledge relation extraction method based on convolutional neural network
CN110717316B (en) Topic segmentation method and device for subtitle dialog flow
WO2023137903A1 (en) Reply statement determination method and apparatus based on rough semantics, and electronic device
CN110377753A (en) Relation extraction method and device based on relationship trigger word Yu GRU model
CN113434698B (en) Relation extraction model establishing method based on full-hierarchy attention and application thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant