CN113191149A - Method for automatically extracting information of Internet of things equipment - Google Patents
Method for automatically extracting information of Internet of things equipment
- Publication number
- CN113191149A
- Authority
- CN
- China
- Prior art keywords
- equipment
- information
- internet
- things
- characters
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02P—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
- Y02P90/00—Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
- Y02P90/30—Computing systems specially adapted for manufacturing
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computational Linguistics (AREA)
- Evolutionary Computation (AREA)
- Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Molecular Biology (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Probability & Statistics with Applications (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a method for automatically extracting information about Internet of Things (IoT) devices, comprising the following steps. Step one: a device type classifier is trained using a deep neural network model; given an application-layer message, the trained classifier outputs the device type information. Step two: from the application-layer message of step one, the words naming the IoT device manufacturer are extracted as the device manufacturer information using named entity recognition. Step three: based on the device manufacturer information obtained in step two, similarity calculation is applied to the words surrounding the manufacturer name, and words whose similarity exceeds a threshold are extracted as the product model information. Step four: by treating each kind of IoT device information according to its own characteristics, the method automatically extracts the device type, device manufacturer, and product model from the application-layer message. The method is easy to deploy, requires no manually written rules, and is a low-cost, efficient technique for extracting IoT device information.
Description
Technical Field
The invention relates to the field of information security, and in particular to a method for automatically extracting information about Internet of Things (IoT) devices.
Background
Hundreds of millions of IoT devices are connected to cyberspace, and they come in many varieties, including office equipment, monitoring equipment, network equipment, and industrial control equipment. IoT devices are among the most important assets in cyberspace, and detecting, discovering, and identifying them has become an effective means of securing critical cyberspace infrastructure. IoT device information records the type of a device, its manufacturer, its specific product model, and other related details, and this information is important for security auditing and security defense. Existing methods for extracting IoT device information either depend on manually written rules or can extract only a limited range of information, which limits their large-scale application and field deployment.
Therefore, when a network contains many kinds of IoT devices, such as routers, network cameras, and network printers, it is of practical value to extract the triple (device type, device manufacturer, product model) from application-layer messages effectively and automatically.
Disclosure of Invention
The invention aims to provide a method for automatically extracting IoT device information, so as to solve the problems described in the Background section.
The technical scheme of the invention is as follows:
A method for automatically extracting IoT device information comprises the following steps:
Step one: determining the device type information, which includes: step a, preprocessing the application-layer message to delete interfering content, and, once the preprocessing module has finished, converting the banner into plain-text format as the input of all subsequent steps; step b, converting the words of the plain text into word vectors and training a device type classifier; step c, processing the application-layer message to obtain the device type information;
Step two: determining the device manufacturer information, which includes: step d, using named entity recognition to identify the entity category to which each word of the text belongs; step e, obtaining the device manufacturer information using a recurrent neural network model;
Step three: determining the product model information, which includes: applying similarity calculation to the words near the device manufacturer name and extracting the words whose similarity exceeds a threshold as the product model information;
Step four: determining the IoT device information, which includes: combining the results of the three preceding steps to obtain the IoT device information, namely (device type, device manufacturer, product model).
Preferably, the preprocessing in step a comprises the following steps: a1, deleting application-layer error status codes; a2, deleting irrelevant hypertext markup language (HTML) content; a3, removing special characters; a4, deleting timestamps, numbers, punctuation, and stop words; a5, extracting the plain text from the remaining message content and splitting it into individual words (tokenization);
Step b specifically comprises: processing the training data with Word2Vec to obtain a pre-trained model, converting the words of the plain text into word vectors, and training a device type classifier using an attention-based bidirectional long short-term memory network model with the word vectors as input;
Step c specifically comprises: given an application-layer message, converting it into tokens and their vector representations as the input of the model; the classifier decides the IoT device type and provides a device type label: (device type, #, #).
Preferably, step d specifically comprises: the application-layer message processed in step one has become plain text; the category to which each word belongs is identified and marked as V or O, where V denotes the device manufacturer category and O denotes all other categories;
Step e specifically comprises: vectorizing the plain text from step one in three different ways, namely as word vectors, letter vectors, and mixed vectors; using a gated recurrent unit (GRU) model to produce the letter vector representation of each word, and then combining the word vector and the letter vector into a single sequence vector, i.e., the mixed vector representation; taking the mixed vector representation as the input of each gated recurrent unit and training a recurrent neural network model to tag each word of the plain text from step one; finding the text tagged V, taking it as the manufacturer of the IoT device, and providing a device manufacturer label: (#, device manufacturer, #).
Preferably, step three specifically comprises: based on the word tagged with the device manufacturer category V in step two, setting a window of length W, finding all words that appear within the window, and generating a candidate set B; producing a letter-level word vector representation and a general word vector representation for each word in set B; taking the known IoT product model names as set A and comparing the vector representations of the words in set B with those of the words in set A; if the similarity exceeds a threshold T, the word is taken as the product model of the device, giving a product model label: (#, #, product model).
The beneficial effects of the invention are as follows: the method provides an effective automated technique that extracts the IoT device information (device type, device manufacturer, product model) from application-layer messages. The method is easy to deploy, requires no manually written rules, and is a low-cost, efficient technique for extracting IoT device information.
Drawings
Fig. 1 is a flowchart of a method for automatically extracting IoT device information according to an embodiment of the present invention;
Fig. 2 is a flowchart of extracting the device type using a classifier according to an embodiment of the present invention;
Fig. 3 is a structure diagram of the IoT device type model according to an embodiment of the present invention;
Fig. 4 is a diagram illustrating extraction of IoT device manufacturer information using named entity recognition according to an embodiment of the present invention;
Fig. 5 is a flowchart of extracting the product model based on the device manufacturer and an existing product information set according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
Fig. 1 is a flowchart of a method for automatically extracting IoT device information. Specifically, the method includes:
Step one: a device type classifier is trained using a deep neural network model. Given an application-layer message, the trained classifier outputs the device type information.
Step two: from the application-layer message of step one, the words naming the IoT device manufacturer are extracted as the device manufacturer information using named entity recognition.
Step three: based on the device manufacturer information obtained in step two, similarity calculation is applied to the words surrounding the manufacturer name, and words whose similarity exceeds a threshold are extracted as the product model information.
Step four: by treating each kind of IoT device information according to its own characteristics, the method automatically extracts the device type, device manufacturer, and product model from the application-layer message.
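By way of non-limiting illustration, the combination of the three partial labels into one triple can be sketched as follows. The function name `merge_labels` and the sample values are hypothetical; the description only specifies that "#" marks a slot that has not been filled.

```python
from typing import Tuple

PLACEHOLDER = "#"  # marks an unfilled slot, as in (device type, #, #)
Label = Tuple[str, str, str]


def merge_labels(*labels: Label) -> Label:
    """Merge partial labels slot by slot; a non-placeholder value fills the slot."""
    merged = [PLACEHOLDER] * 3
    for label in labels:
        for i, value in enumerate(label):
            if value != PLACEHOLDER:
                merged[i] = value
    return (merged[0], merged[1], merged[2])


# Hypothetical outputs of steps one to three.
type_label = ("camera", "#", "#")
vendor_label = ("#", "ExampleVendor", "#")
model_label = ("#", "#", "EX-100")

device_info = merge_labels(type_label, vendor_label, model_label)
print(device_info)
```

A slot stays "#" when the corresponding step produced no result, so a partially identified device is still representable.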
FIG. 2 is a flow chart of extracting device types using a classifier.
The first step specifically comprises the following steps:
The application-layer message must first be preprocessed to delete interfering content, as follows: (1) application-layer error status codes, e.g. 4XX and 5XX, are deleted; for example, 400 indicates a bad request and 500 indicates an internal server error; (2) irrelevant content in the hypertext markup language (HTML), such as tags, CSS, and JS, is deleted; tags are enclosed in angle brackets, such as <br>; (3) special characters, such as "$" and "%", are removed; (4) timestamps, numbers, punctuation, and stop words are deleted; (5) the plain text is extracted from the remaining message content and split into individual words, which is called tokenization. Once the preprocessing module has finished, the banner has been converted into plain-text format, which serves as the input of all subsequent steps.
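By way of non-limiting illustration, the five preprocessing operations can be sketched with regular expressions. The stop-word list, the timestamp patterns, the choice to keep hyphens inside tokens, and the sample banner are all assumptions made for this sketch; the description does not fix them.

```python
import re
from typing import List

# Illustrative stop-word list (an assumption; the description names none).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "is", "in", "on", "for"}


def preprocess_banner(banner: str) -> List[str]:
    """Return the plain-text tokens left after the five cleaning operations."""
    # (2) Drop CSS/JS blocks first, then any remaining HTML tags.
    text = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", banner)
    text = re.sub(r"(?s)<[^>]*>", " ", text)
    # (1) Drop 4XX / 5XX error status codes.
    text = re.sub(r"\b[45]\d{2}\b", " ", text)
    # (4) Drop timestamps such as 08:15:30 and dates such as 2021-05-12.
    text = re.sub(r"\b\d{1,2}:\d{2}(:\d{2})?\b", " ", text)
    text = re.sub(r"\b\d{4}-\d{2}-\d{2}\b", " ", text)
    # (3) + (4) Drop special characters and punctuation; hyphens are kept so
    # that names such as "DCS-930L" stay in one piece (a design assumption).
    text = re.sub(r"[^A-Za-z0-9\s-]", " ", text)
    # (5) Tokenize on whitespace; drop bare numbers and stop words.
    tokens = []
    for token in text.split():
        token = token.strip("-")
        if not token or token.isdigit() or token.lower() in STOP_WORDS:
            continue
        tokens.append(token)
    return tokens


# Hypothetical HTTP banner used only for demonstration.
sample = (
    "HTTP/1.1 401 Unauthorized\r\nServer: D-Link DCS-930L\r\n"
    "Date: Mon, 10 May 2021 08:15:30 GMT\r\n\r\n"
    "<html><head><style>body{color:red}</style></head>"
    "<body><br>Login to the camera</body></html>"
)
tokens = preprocess_banner(sample)
print(tokens)
```

Here the error status code is interpreted as a token to delete rather than as a reason to discard the whole message; the description leaves that choice open.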
This step converts each word of the plain text into a word vector. Specifically, the method processes the training data with Word2Vec to obtain a pre-trained model, which converts the words of the plain text into word vectors. An attention-based bidirectional long short-term memory network model (Attention-Based Bidirectional Long Short-Term Memory Networks) is then trained with the word vectors as input to obtain the device type classifier. Given an application-layer message, it is converted into tokens and their vector representations as the input of the model; the classifier then decides the type of the IoT device, i.e., it provides a label of the form (device type, #, #).
Fig. 3 is a structure diagram of the IoT device type model in the embodiment of the present invention. The attention-based model contains five parts: (1) Input layer: a sentence is fed into the model through this layer. (2) Embedding layer: each word is mapped to a low-dimensional vector. Given a sentence consisting of $T$ words, $S = \{x_1, x_2, \ldots, x_T\}$, every word $x_i$ is converted into its word vector $e_i$ by the formula $e_i = W^{wrd} v^i$, where $W^{wrd}$ is a matrix obtained by learning and $v^i$ is a vector whose dimension is the total number of words in the vocabulary. (3) LSTM layer: a bidirectional long short-term memory network obtains high-level features from the embedding layer; the model combines the outputs of the forward and backward passes by element-wise summation. (4) Attention layer: a weight vector $w$ is generated, the word-level features of each time step are weighted by it, and the results are combined into a sentence-level feature vector. The sentence representation used for classification is $h^* = \tanh(r)$, where $r = H\alpha^T$, $\alpha = \mathrm{softmax}(w^T M)$, $M = \tanh(H)$, and $H = [h_1, h_2, \ldots, h_T]$ is the output of the LSTM layer. (5) Output layer: the sentence-level feature vector is finally used for classification; the softmax activation function yields the probability of each device type, and the device type with the highest probability is taken as the type of the IoT device.
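By way of non-limiting illustration, the attention layer computation, M = tanh(H), α = softmax(wᵀM), r = Hαᵀ, h* = tanh(r), can be sketched in plain Python. The matrix H (d rows by T columns, one column per time step) and the weight vector w below are made-up numbers; in the described model they would come from the LSTM layer and from training.

```python
import math
from typing import List, Tuple


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(H: List[List[float]], w: List[float]) -> Tuple[List[float], List[float]]:
    """Return (alpha, h_star) for H of shape d x T and w of length d."""
    d, T = len(H), len(H[0])
    # M = tanh(H), applied element-wise.
    M = [[math.tanh(v) for v in row] for row in H]
    # alpha = softmax(w^T M): one attention weight per time step.
    scores = [sum(w[k] * M[k][t] for k in range(d)) for t in range(T)]
    alpha = softmax(scores)
    # r = H alpha^T: attention-weighted sum of the columns of H.
    r = [sum(H[k][t] * alpha[t] for t in range(T)) for k in range(d)]
    # h* = tanh(r): sentence-level representation passed to the output layer.
    h_star = [math.tanh(v) for v in r]
    return alpha, h_star


# Illustrative values only: d = 2 features, T = 3 time steps.
H = [[0.1, 0.9, -0.3],
     [0.4, -0.2, 0.8]]
w = [1.0, 0.5]
alpha, h_star = attention_pool(H, w)
print(alpha, h_star)
```

With a zero weight vector every time step receives the same attention weight, so the pooling reduces to a plain average of the LSTM outputs followed by tanh.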
Fig. 4 illustrates the method for extracting the manufacturer information of the internet of things device by using the named entity recognition technology. Namely, the second step specifically comprises:
Named entity recognition is a technique for identifying entities with specific meanings in natural-language text. After step one, the application-layer message has become plain text, and the method uses named entity recognition to identify the category to which each word belongs. In this step there are two categories, marked V and O, where V denotes the device manufacturer category and O denotes all other categories.
For the named entity recognition task, this step first vectorizes the plain text from step one in three different ways: word vectors, letter vectors, and mixed vectors. A gated recurrent unit (GRU) model produces the letter vector representation of each word; the word vector and the letter vector are then combined into a single sequence vector, i.e., the mixed vector representation, which serves as the input of each gated recurrent unit (GRU) to train a recurrent neural network model that tags each word of the plain text from step one. The text tagged V is then found and taken as the manufacturer of the IoT device, i.e., the device is given a label of the form (#, device manufacturer, #).
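By way of non-limiting illustration, once every word carries a V or O tag, the manufacturer label can be read off as sketched below. The tagger itself (the GRU-based recurrent network) is not reproduced; the tags are supplied by hand, and both joining consecutive V words and keeping only the first V span are assumptions of this sketch.

```python
from typing import List, Tuple


def vendor_label(tokens: List[str], tags: List[str]) -> Tuple[str, str, str]:
    """Build the (#, device manufacturer, #) label from V/O tags."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    spans: List[str] = []
    current: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == "V":
            current.append(token)          # extend the running V span
        elif current:
            spans.append(" ".join(current))  # an O tag closes the span
            current = []
    if current:
        spans.append(" ".join(current))
    # Assumption: the first V span is taken as the manufacturer.
    vendor = spans[0] if spans else "#"
    return ("#", vendor, "#")


# Hand-written tags standing in for the output of the trained tagger.
tokens = ["Server", "D-Link", "DCS-930L", "camera"]
tags = ["O", "V", "O", "O"]
label = vendor_label(tokens, tags)
print(label)
```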
Fig. 5 is a flowchart of extracting a product model based on an equipment manufacturer and an existing product information set in the embodiment of the present invention, that is, step three specifically includes:
Based on the word tagged with the device manufacturer category V in step two, this step sets a window of length W, finds all words that appear within the window, and generates a candidate set B. Each word in set B is given a letter-level word vector (character embedding) representation and a general word vector (word embedding) representation. The known IoT product model names are used as set A. The vector representations of the words in set B are compared with those of the words in set A; if the similarity exceeds a threshold T, the word from set B is taken as the product model of the device, i.e., the device is given a label of the form (#, #, product model).
Letter-level word vectors and general word vectors are word vectors built at the character level and at the word level, respectively. Specifically, a letter-level word vector is obtained by first vectorizing the letters of a word and then deriving the vector of the word from them, whereas a general word vector is obtained directly for the word as a whole. The former is better suited to representing low-frequency words, and the latter to representing high-frequency words.
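By way of non-limiting illustration, the window search and threshold comparison of step three can be sketched as follows. Character-bigram count vectors and cosine similarity are used here only as a simple stand-in for the learned letter-level and general word embeddings, and the window length, threshold, and sample data are assumed values.

```python
import math
from collections import Counter
from typing import List, Tuple


def char_ngram_vector(word: str, n: int = 2) -> Counter:
    """Stand-in letter-level vector: counts of character n-grams."""
    padded = f"<{word.lower()}>"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def extract_model(tokens: List[str], vendor_index: int, known_models: List[str],
                  window: int = 3, threshold: float = 0.6) -> Tuple[str, str, str]:
    """Return (#, #, product model), or (#, #, #) if nothing exceeds the threshold."""
    lo = max(0, vendor_index - window)
    hi = min(len(tokens), vendor_index + window + 1)
    # Candidate set B: words inside the window around the manufacturer word.
    candidates = [t for i, t in enumerate(tokens[lo:hi], start=lo) if i != vendor_index]
    best_word, best_score = "#", threshold
    for candidate in candidates:
        cand_vec = char_ngram_vector(candidate)
        for model in known_models:  # set A: known product model names
            score = cosine(cand_vec, char_ngram_vector(model))
            if score > best_score:
                best_word, best_score = candidate, score
    return ("#", "#", best_word)


# Hypothetical tokens and known-model list for demonstration.
tokens = ["HTTP", "Server", "D-Link", "DCS-930L", "camera", "Login"]
model_label = extract_model(tokens, vendor_index=2, known_models=["DCS-930L", "DCS-5020L"])
print(model_label)
```

Keeping only the single best-scoring candidate is a choice of this sketch; the description requires only that the similarity exceed the threshold T.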
Claims (4)
1. A method for automatically extracting Internet of Things (IoT) device information, characterized by comprising the following steps:
Step one: determining the device type information, which includes: step a, preprocessing the application-layer message to delete interfering content, and, once the preprocessing module has finished, converting the banner into plain-text format as the input of all subsequent steps; step b, converting the words of the plain text into word vectors and training a device type classifier; step c, processing the application-layer message to obtain the device type information;
Step two: determining the device manufacturer information, which includes: step d, using named entity recognition to identify the entity category to which each word of the text belongs; step e, obtaining the device manufacturer information using a recurrent neural network model;
Step three: determining the product model information, which includes: applying similarity calculation to the words near the device manufacturer information and extracting the words whose similarity exceeds a threshold as the product model information;
Step four: determining the IoT device information, which includes: combining the results of the three preceding steps to obtain the IoT device information, namely (device type, device manufacturer, product model).
2. The method for automatically extracting IoT device information according to claim 1, wherein
the preprocessing in step a comprises the following steps: a1, deleting application-layer error status codes; a2, deleting irrelevant hypertext markup language (HTML) content; a3, removing special characters; a4, deleting timestamps, numbers, punctuation, and stop words; a5, extracting the plain text from the remaining message content and splitting it into individual words (tokenization);
step b specifically comprises: processing the training data with Word2Vec to obtain a pre-trained model, converting the words of the plain text into word vectors, and training a device type classifier using an attention-based bidirectional long short-term memory network model with the word vectors as input;
step c specifically comprises: given an application-layer message, converting it into tokens and their vector representations as the input of the model; the classifier decides the IoT device type and provides a device type label: (device type, #, #).
3. The method for automatically extracting IoT device information according to claim 1, wherein step d specifically comprises: the application-layer message processed in step one has become plain text; the category to which each word belongs is identified and marked as V or O, where V denotes the device manufacturer category and O denotes all other categories;
step e specifically comprises: vectorizing the plain text from step one in three different ways, namely as word vectors, letter vectors, and mixed vectors; using a gated recurrent unit (GRU) model to produce the letter vector representation of each word, and then combining the word vector and the letter vector into a single sequence vector, i.e., the mixed vector representation; taking the mixed vector representation as the input of each gated recurrent unit and training a recurrent neural network model to tag each word of the plain text from step one; finding the text tagged V, taking it as the manufacturer of the IoT device, and providing a device manufacturer label: (#, device manufacturer, #).
4. The method for automatically extracting IoT device information according to claim 1, wherein step three specifically comprises: based on the word tagged with the device manufacturer category V in step two, setting a window of length W, finding all words that appear within the window, and generating a candidate set B; producing a letter-level word vector representation and a general word vector representation for each word in set B; taking the known IoT product model names as set A and comparing the vector representations of the words in set B with those of the words in set A; if the similarity exceeds a threshold T, the word is taken as the product model of the device, giving a product model label: (#, #, product model).
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110516557.0A CN113191149B (en) | 2021-05-12 | 2021-05-12 | Method for automatically extracting information of Internet of things equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110516557.0A CN113191149B (en) | 2021-05-12 | 2021-05-12 | Method for automatically extracting information of Internet of things equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113191149A (en) | 2021-07-30 |
CN113191149B (en) | 2023-04-07 |
Family
ID=76981573
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110516557.0A Active CN113191149B (en) | 2021-05-12 | 2021-05-12 | Method for automatically extracting information of Internet of things equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113191149B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090326923A1 (en) * | 2006-05-15 | 2009-12-31 | Panasonic Corporatioin | Method and apparatus for named entity recognition in natural language |
CN111726336A (en) * | 2020-05-14 | 2020-09-29 | 北京邮电大学 | Method and system for extracting identification information of networked intelligent equipment |
CN111783466A (en) * | 2020-07-15 | 2020-10-16 | 电子科技大学 | Named entity identification method for Chinese medical records |
CN111897962A (en) * | 2020-07-27 | 2020-11-06 | 绿盟科技集团股份有限公司 | Internet of things asset marking method and device |
CN112564974A (en) * | 2020-12-08 | 2021-03-26 | 武汉大学 | Deep learning-based fingerprint identification method for Internet of things equipment |
Also Published As
Publication number | Publication date |
---|---|
CN113191149B (en) | 2023-04-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111444721B (en) | Chinese text key information extraction method based on pre-training language model | |
CN107729309B (en) | Deep learning-based Chinese semantic analysis method and device | |
CN110851596A (en) | Text classification method and device and computer readable storage medium | |
CN110457689B (en) | Semantic processing method and related device | |
CN111767725B (en) | Data processing method and device based on emotion polarity analysis model | |
CN111950273A (en) | Network public opinion emergency automatic identification method based on emotion information extraction analysis | |
CN111198948A (en) | Text classification correction method, device and equipment and computer readable storage medium | |
CN112836509B (en) | Expert system knowledge base construction method and system | |
CN108829823A (en) | A kind of file classification method | |
CN108388554A (en) | Text emotion identifying system based on collaborative filtering attention mechanism | |
CN116416480B (en) | Visual classification method and device based on multi-template prompt learning | |
Sheshikala et al. | Natural language processing and machine learning classifier used for detecting the author of the sentence | |
CN114416979A (en) | Text query method, text query equipment and storage medium | |
CN114676255A (en) | Text processing method, device, equipment, storage medium and computer program product | |
CN111401064A (en) | Named entity identification method and device and terminal equipment | |
CN112257425A (en) | Power data analysis method and system based on data classification model | |
CN115718792A (en) | Sensitive information extraction method based on natural semantic processing and deep learning | |
CN115064154A (en) | Method and device for generating mixed language voice recognition model | |
CN115859980A (en) | Semi-supervised named entity identification method, system and electronic equipment | |
CN113204975A (en) | Sensitive character wind identification method based on remote supervision | |
CN109446299A (en) | The method and system of searching email content based on event recognition | |
CN115098673A (en) | Business document information extraction method based on variant attention and hierarchical structure | |
CN112134858B (en) | Sensitive information detection method, device, equipment and storage medium | |
CN112445862A (en) | Internet of things equipment data set construction method and device, electronic equipment and storage medium | |
CN113191149B (en) | Method for automatically extracting information of Internet of things equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||