Method for automatically extracting information of Internet of things equipment
Technical Field
The invention relates to the field of information security, in particular to a method for automatically extracting information of equipment of the Internet of things.
Background
Hundreds of millions of Internet of Things (IoT) devices are connected to cyberspace, and they are of many kinds, including office equipment, monitoring equipment, network equipment, industrial control equipment and the like. IoT devices are the most important assets in cyberspace, and detecting, discovering and identifying them has become an effective means of safeguarding the critical infrastructure of cyberspace. IoT device information records the type of a device, its manufacturer, its specific product model and other related information, and is important for security auditing and security defense. At present, existing methods for extracting IoT device information either depend on manually written rules or can extract only a limited range of information, which limits their large-scale application and field deployment.
Therefore, when a network space contains various IoT devices, such as routers, network cameras and network printers, effectively and automatically extracting the triple (device type, device manufacturer, product model) from application layer message information has practical application value.
Disclosure of Invention
The invention aims to provide a method for automatically extracting information of Internet of Things devices, so as to solve the problems described in the Background above.
The technical scheme of the invention is as follows:
A method for automatically extracting Internet of Things device information comprises the following steps:
Step one: determining the device type information, which includes: step a, preprocessing the application layer message information to delete interfering content and, once the preprocessing module has finished, converting the banner into text format as the input to all subsequent steps; step b, converting the words in plain text format into word vectors and training a device type classifier; step c, processing the application layer message to obtain the device type information;
Step two: determining the device manufacturer information, which includes: step d, using named entity recognition to identify the entity category to which each word of the text belongs; step e, obtaining the device manufacturer information using a recurrent neural network model;
Step three: determining the product model information, which includes: applying similarity calculation to the words near the device manufacturer text and extracting the words whose similarity exceeds a threshold as the product model information;
Step four: determining the Internet of Things device information, which includes: combining the results of the above three steps to obtain the Internet of Things device information, namely the triple (device type, device manufacturer, product model).
Preferably, the preprocessing in step a comprises: a1, deleting the error status codes of the application layer; a2, deleting irrelevant content of the hypertext markup language; a3, removing special characters; a4, deleting timestamps, numbers, punctuation and stop words; a5, extracting the plain text from the remaining message content and splitting it into individual words, i.e. tokenization;
step b specifically comprises: processing the training data with Word2Vec to obtain a pre-trained model, converting the words in plain text format into word vectors, and training a device type classifier using an attention-based bidirectional long short-term memory network model with the word vectors as input;
step c specifically comprises: given an application layer message, converting it into tokens and their vector representations as the input of the model; the classifier then decides the type of the Internet of Things device and provides a device type label: (device type, #, #).
Preferably, step d specifically comprises: the application layer message information processed in step one has become plain text; the category to which each word belongs is identified and marked as V or O, wherein V represents the device manufacturer category and O represents all other categories;
step e specifically comprises: vectorizing the plain text information from step one in three different ways, namely a word vector, a letter (character-level) vector and a mixed vector; using a gated recurrent unit (GRU) model to produce the letter vector representation of each word, and finally combining the word vector and the letter vector into a single sequence vector, namely the mixed vector representation; taking the mixed vector representation as the input of each gated recurrent unit and training a recurrent neural network model so as to label each word in the plain text information from step one; finding the text labeled V, taking it as the manufacturer of the Internet of Things device, and providing a device manufacturer label: (#, device manufacturer, #).
Preferably, step three specifically comprises: setting a window of length W around the text of device manufacturer category V found in step two, finding all the words appearing in the window, and generating a candidate set B; producing a letter-level (character-level) word vector representation and a general word vector representation for each word in set B; taking the known Internet of Things product model names as a set A and comparing the vector representations of the words in set B with those of the words in set A; if the similarity exceeds a threshold T, the word is taken as the product model of the device, yielding a product model label: (#, #, product model).
The beneficial effects of the invention are as follows: the method provides an effective automated technique that extracts Internet of Things device information (device type, device manufacturer, product model) from application layer messages. The method is easy to deploy, requires no manually written rules, and is a low-cost, high-efficiency technique for extracting Internet of Things device information.
Drawings
Fig. 1 is a flowchart of a method for automatically extracting information of an internet of things device according to an embodiment of the present invention;
Fig. 2 is a flowchart of extracting device types using a classifier according to an embodiment of the present invention;
Fig. 3 is a model structure diagram of an internet of things device type according to an embodiment of the present invention;
Fig. 4 is a diagram illustrating extraction of information of a device manufacturer of the internet of things by using a named entity recognition technology according to an embodiment of the present invention;
Fig. 5 is a flowchart of extracting a product model based on a device manufacturer and an existing product information set according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
Fig. 1 is a flowchart of a method for automatically extracting information of an internet of things device. Specifically, a method for automatically extracting information of an internet of things device includes:
Step one: a device type classifier is trained using a deep neural network model. Given an application layer message, the device type information is obtained through the trained classifier.
Step two: based on the application layer message of step one, the text naming the manufacturer of the Internet of Things device is extracted from the message as the device manufacturer information, using named entity recognition.
Step three: based on the device manufacturer information obtained in step two, similarity calculation is applied to the words surrounding the manufacturer text, and the words whose similarity exceeds a threshold are extracted as the product model information.
Step four: by treating each kind of Internet of Things device information according to its own characteristics, the method automatically extracts the device type, device manufacturer and product model from the application layer message.
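As an illustration of how the partial labels of steps one to three, written in this document as (device type, #, #), (#, device manufacturer, #) and (#, #, product model), might be combined into one triple, the following is a minimal Python sketch. The function name `merge_labels` and the use of the string "#" as the placeholder are assumptions made for this example, not part of the described method.

```python
PLACEHOLDER = "#"  # assumption: "#" marks a field that a step did not fill


def merge_labels(*labels):
    """Merge partial labels such as ("camera", "#", "#") into one triple."""
    merged = [PLACEHOLDER, PLACEHOLDER, PLACEHOLDER]
    for label in labels:
        for position, value in enumerate(label):
            # Keep the first non-placeholder value seen for each field.
            if value != PLACEHOLDER and merged[position] == PLACEHOLDER:
                merged[position] = value
    return tuple(merged)


triple = merge_labels(
    ("camera", "#", "#"),    # output of step one
    ("#", "D-Link", "#"),    # output of step two
    ("#", "#", "DCS-930L"),  # output of step three
)
```

A field that no step could fill simply stays as "#", so an incomplete result remains distinguishable from a complete triple.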
Fig. 2 is a flowchart of extracting device types using a classifier.
Step one specifically comprises the following:
The application layer message information must first be preprocessed to delete interfering content, as follows: (1) the error status codes of the application layer, e.g. 4XX and 5XX, are deleted; for example, 400 indicates a bad request and 500 indicates an internal server error; (2) irrelevant content in the hypertext markup language (HTML), such as tags, CSS and JS, is deleted; tags are enclosed in angle brackets, such as <br>; (3) special characters, such as "$" and "%", are removed; (4) timestamps, numbers, punctuation and stop words are deleted; (5) the plain text is extracted from the remaining message content and split into individual words, which is called tokenization. Once the preprocessing module has finished, the banner has been converted into text format and serves as the input to all subsequent steps.
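The five preprocessing operations above can be sketched in Python as follows. This is only an illustration under stated assumptions: the stop-word list, the regular expressions and the function name `preprocess` are invented for the example; the sketch handles HTTP banners only; and it interprets operation (1) as discarding a banner whose status line carries a 4XX or 5XX code, which is one possible reading of the text.

```python
import html
import re

# Assumption: a tiny illustrative stop-word list; a real one would be larger.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on"}

ERROR_STATUS = re.compile(r"^HTTP/\d(?:\.\d)?\s+[45]\d\d\b")
SCRIPT_STYLE = re.compile(r"<(script|style)\b.*?</\1\s*>", re.IGNORECASE | re.DOTALL)
TAG = re.compile(r"<[^>]+>")
TIMESTAMP = re.compile(r"\b\d{1,2}:\d{2}(?::\d{2})?\b")
TOKEN = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def preprocess(banner: str) -> list[str]:
    """Return the tokens of a banner, or [] for a 4XX/5XX error response."""
    # (1) Discard error responses (assumed reading of "error status codes").
    if ERROR_STATUS.match(banner):
        return []
    # (2) Remove JS/CSS bodies first, then every remaining HTML tag.
    text = SCRIPT_STYLE.sub(" ", banner)
    text = TAG.sub(" ", text)
    text = html.unescape(text)
    # (4) Remove timestamps such as 10:20:30.
    text = TIMESTAMP.sub(" ", text)
    # (3)+(4)+(5) Tokenizing on alphanumeric runs drops special characters
    # and punctuation; hyphenated names such as model numbers stay whole.
    tokens = TOKEN.findall(text)
    # (4) Drop tokens without any letter (plain numbers) and stop words.
    return [
        t for t in tokens
        if any(c.isalpha() for c in t) and t.lower() not in STOP_WORDS
    ]


sample = (
    "HTTP/1.1 200 OK\r\nServer: TP-LINK Router\r\n"
    "Date: Mon, 01 Jan 2018 10:20:30 GMT\r\n\r\n"
    "<html><head><style>body{color:red}</style></head>"
    "<body>Welcome to the TL-WR740N $ 100%<br></body></html>"
)
tokens = preprocess(sample)
```

Tokens that mix letters and digits are kept on purpose, since product model names usually have that shape and are needed by step three.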
This step converts each word in plain text format into a word vector. Specifically, the method uses Word2Vec to process the training data to obtain a pre-trained model, with which the words in plain text format are converted into word vectors. This step then uses a bidirectional long short-term memory network model based on an attention mechanism (full name: Attention-Based Bidirectional Long Short-Term Memory Networks), taking the word vectors as input, to train a device type classifier. Given an application layer message, it is converted into tokens and their vector representations as the input of the model; the classifier then decides the type of the Internet of Things device, i.e. it provides a label of the form (device type, #, #) for it.
Fig. 3 is a model structure diagram of the Internet of Things device type classifier in the embodiment of the present invention. The attention mechanism model contains 5 parts:
(1) Input layer: a sentence is input into the model through this layer.
(2) Embedding layer: each word is mapped to a low-dimensional vector. Given a sentence consisting of $T$ words, $S = \{x_1, x_2, \ldots, x_T\}$, every word $x_i$ is converted into its word vector $e_i$ by the formula $e_i = W^{wrd} v_i$, where $W^{wrd}$ is a matrix obtained by learning and $v_i$ is a vector whose dimension equals the total number of words in the vocabulary.
(3) LSTM layer: a bidirectional long short-term memory network obtains high-level features from the embedding layer; the model combines the forward-pass and backward-pass outputs by element-wise summation.
(4) Attention layer: a weight vector $w$ is generated, the word-level features of each time step are multiplied by their weights, and the results are combined into a sentence-level feature vector. The sentence representation used for classification is $h^* = \tanh(r)$, where $r = H\alpha^T$, $\alpha = \mathrm{softmax}(w^T M)$, $M = \tanh(H)$, and $H = [h_1, h_2, \ldots, h_T]$ is the output of the LSTM layer.
(5) Output layer: the sentence-level feature vector is finally used for classification; the softmax activation function gives the probability of belonging to each device type, and the device type with the highest probability is taken as the type of the Internet of Things device.
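To make the attention layer concrete, the sketch below computes M = tanh(H), the attention weights softmax(w^T M), the weighted sum r and the sentence representation h* = tanh(r) in plain Python, together with the element-wise sum of the forward and backward LSTM outputs. The function names and the toy numbers are assumptions for illustration; in the described model the LSTM outputs and the weight vector w are learned, which this sketch does not attempt.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def combine_directions(forward, backward):
    """Element-wise sum of forward and backward outputs per time step."""
    return [[f + b for f, b in zip(fw, bw)] for fw, bw in zip(forward, backward)]


def attention_pool(H, w):
    """H: list of T hidden vectors of size d; w: weight vector of size d.

    Returns (alpha, h_star): the attention weights over the T time steps
    and the sentence-level representation of size d.
    """
    # M = tanh(H), applied element-wise.
    M = [[math.tanh(x) for x in h_t] for h_t in H]
    # alpha = softmax(w^T M): one scalar score per time step.
    alpha = softmax([sum(wi * mi for wi, mi in zip(w, m_t)) for m_t in M])
    # r = H alpha^T: attention-weighted sum of the hidden vectors.
    r = [sum(alpha[t] * H[t][k] for t in range(len(H))) for k in range(len(w))]
    # h* = tanh(r).
    return alpha, [math.tanh(x) for x in r]


# Toy values standing in for real LSTM outputs (T = 2 time steps, d = 2).
H = combine_directions([[0.1, 0.2], [0.3, 0.4]], [[0.5, 0.6], [0.7, 0.8]])
alpha, h_star = attention_pool(H, w=[1.0, -0.5])
```

With an all-zero w every time step receives the same weight, so the pooling reduces to a plain average followed by tanh; a trained w shifts the weights toward the words that matter most for the device type.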
Fig. 4 illustrates the method for extracting the manufacturer information of the internet of things device by using the named entity recognition technology. Namely, the second step specifically comprises:
Named entity recognition is a technique for identifying entities with a specific meaning in natural language text. After step one, the application layer message information has become plain text, and the method uses named entity recognition to identify the category of each word. In this step there are two categories, labeled V and O, wherein V represents the device manufacturer category and O represents all other categories.
In the named entity recognition task, this step first vectorizes the plain text information from step one in three different ways: word vectors, letter (character-level) vectors and mixed vectors. A gated recurrent unit (GRU) model produces the letter vector representation of each word; the word vector and the letter vector are then combined into a single sequence vector, namely the mixed vector representation, which is used as the input of each gated recurrent unit (GRU) to train a recurrent neural network model that labels each word in the plain text information from step one. The text labeled V is then found and taken as the manufacturer of the Internet of Things device, i.e. a label of the form (#, device manufacturer, #) is provided for the device.
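Once the tagger has assigned V or O to every word, reading off the manufacturer is a simple pass over the tag sequence. The Python sketch below shows that final lookup only; the GRU tagger itself is not implemented. The function names, the joining of consecutive V words into one name, and the choice of the first V span when several exist are all assumptions made for this example.

```python
def vendor_spans(tokens, tags):
    """Join each run of consecutive tokens tagged "V" into one string."""
    spans, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "V":
            current.append(token)
        elif current:
            spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


def vendor_label(tokens, tags):
    """Return a (#, device manufacturer, #) label, or None if no V tag."""
    spans = vendor_spans(tokens, tags)
    # Assumption: when several V spans exist, the first one is used.
    return ("#", spans[0], "#") if spans else None


# Tags as a trained tagger might emit them (hand-written for this example).
label = vendor_label(
    ["Welcome", "D-Link", "DCS-930L", "camera"],
    ["O", "V", "O", "O"],
)
```
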
Fig. 5 is a flowchart of extracting a product model based on an equipment manufacturer and an existing product information set in the embodiment of the present invention, that is, step three specifically includes:
Based on the text of device manufacturer category V found in step two, this step sets a window of length W, finds all the words appearing in the window, and generates a candidate set B. A letter-level word vector (character embedding) representation and a general word vector (word embedding) representation are produced for each word in set B. The known Internet of Things product model names are taken as a set A, and the vector representations of the words in set B are compared with those of the words in set A; if the similarity exceeds a threshold T, the word (that is, a word in set B whose similarity to a word in set A exceeds the threshold T) is taken as the product model of the device, i.e. a label of the form (#, #, product model) is provided for the device.
Letter-level word vectors and general word vectors are word vectors built at the character level and at the word level, respectively. Specifically, a letter-level word vector is obtained by first vectorizing the letters in a word and then deriving the vector of the word from them; a general word vector is obtained for the word directly. The former is better suited to representing low-frequency words and the latter to representing high-frequency words.
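The window search and threshold test of step three can be sketched in Python as follows. Several things here are assumptions for illustration only: character-bigram count vectors stand in for the learned letter-level embeddings, cosine similarity is used as the similarity measure, the window is taken as W words on either side of the manufacturer word, and only the best-scoring candidate is returned. The names `extract_model`, `char_bigrams` and `cosine` are hypothetical.

```python
import math
from collections import Counter


def char_bigrams(word):
    """Stand-in for a letter-level embedding: counts of character bigrams."""
    w = word.lower()
    return Counter(w[i:i + 2] for i in range(len(w) - 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def extract_model(tokens, vendor_index, known_models, W=3, T=0.6):
    """Return a (#, #, product model) label, or None if nothing exceeds T."""
    # Candidate set B: words within W positions of the manufacturer word.
    lo = max(0, vendor_index - W)
    hi = min(len(tokens), vendor_index + W + 1)
    candidates = [tokens[i] for i in range(lo, hi) if i != vendor_index]
    best, best_sim = None, T
    for cand in candidates:
        cand_vec = char_bigrams(cand)
        for known in known_models:  # set A: known product model names
            sim = cosine(cand_vec, char_bigrams(known))
            if sim > best_sim:  # must strictly exceed the threshold T
                best, best_sim = cand, sim
    return ("#", "#", best) if best is not None else None


label = extract_model(
    ["Server", "D-Link", "DCS-930L", "Internet", "Camera"],
    vendor_index=1,
    known_models=["DCS-932L", "DIR-615"],
)
```

Comparing at the letter level lets an unseen model name match a known name from the same product line, which is consistent with the point above that letter-level vectors favor low-frequency words.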