CN113191149A - Method for automatically extracting information of Internet of things equipment - Google Patents

Method for automatically extracting information of Internet of things equipment

Info

Publication number
CN113191149A
Authority
CN
China
Prior art keywords
equipment
information
internet
things
characters
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110516557.0A
Other languages
Chinese (zh)
Other versions
CN113191149B (en)
Inventor
李强
黄敏
万上锋
张雅鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jiaotong University
Original Assignee
Beijing Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jiaotong University
Priority to CN202110516557.0A
Publication of CN113191149A
Application granted
Publication of CN113191149B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method for automatically extracting information about Internet of Things (IoT) devices, comprising the following steps. Step one: train a device type classifier using a deep neural network model; given an application layer message, obtain the device type information from the trained classifier. Step two: based on the application layer message of step one, use named entity recognition to extract the words in the message that name the IoT device manufacturer as the device manufacturer information. Step three: based on the device manufacturer information obtained in step two, use similarity calculation to extract, from the words surrounding the manufacturer information, those whose similarity exceeds a threshold as the product model information. Step four: exploiting the different characteristics of the different kinds of IoT device information, the method automatically extracts the device type, device manufacturer, and product model from the application layer message. The method is easy to deploy, requires no manually written rules, and is a low-cost, efficient technique for extracting IoT device information.

Description

Method for automatically extracting information of Internet of things equipment
Technical Field
The invention relates to the field of information security, and in particular to a method for automatically extracting information about Internet of Things (IoT) devices.
Background
Hundreds of millions of IoT devices are connected to cyberspace, and they come in many varieties, including office equipment, surveillance equipment, network equipment, and industrial control equipment. IoT devices are the most important assets in cyberspace, and detecting, discovering, and identifying them has become an effective means of securing the critical infrastructure of cyberspace. IoT device information records the type of a device, its manufacturer, its specific product model, and other related details, and this information is important for security auditing and security defense. Existing methods for extracting IoT device information either depend on manually written rules or cover only a limited range of information, which limits their large-scale application and field deployment.
Therefore, when cyberspace contains many kinds of IoT devices, including routers, network cameras, and network printers, a way to effectively and automatically extract the triple (device type, device manufacturer, product model) from application layer message information has practical value.
Disclosure of Invention
The invention aims to provide a method for automatically extracting IoT device information, so as to solve the problems described in the Background section.
The technical scheme of the invention is as follows:
a method for automatically extracting IoT device information comprises the following steps:
step one: determining the device type information, including: step a, preprocessing the application layer message information and deleting interfering content, so that once the preprocessing module has finished, the banner has been converted into text format as the input of all subsequent steps; step b, converting the words in plain text format into word vectors and training a device type classifier; step c, processing the application layer message to obtain the device type information;
step two: determining the device manufacturer information, including: step d, using named entity recognition to identify the entity category to which the text belongs; step e, obtaining the device manufacturer information using a recurrent neural network model;
step three: determining the product model information, including: using similarity calculation near the words of the device manufacturer information to extract the words whose similarity exceeds a threshold, thereby obtaining the product model information;
step four: determining the IoT device information, including: combining the above three steps to obtain the IoT device information, namely (device type, device manufacturer, product model).
Preferably, the preprocessing in step a comprises the following steps: a1, deleting application layer error status codes; a2, deleting irrelevant content of the Hypertext Markup Language; a3, removing special characters; a4, deleting timestamps, numbers, punctuation, and stop words; a5, extracting the plain text from the remaining message content and splitting it into individual words (tokenization);
step b specifically comprises: processing the training data with Word2Vec to obtain a pre-trained model, converting the words in plain text format into word vectors, and training a classifier of the device type using an attention-based bidirectional long short-term memory network model with the word vectors as input;
step c specifically comprises: given an application layer message, converting it into text tokens and vector representations as the input of the model; the classifier decides the type of the IoT device and provides a label of the device type: (device type, #, #).
Preferably, step d specifically comprises: the application layer message information processed in step one has become plain text; the category to which each word belongs is identified and marked as V or O, where V represents the device manufacturer category and O represents all other categories;
step e specifically comprises: vectorizing the plain text information of step one in three different ways, namely as word vectors, letter vectors, and mixed vectors; using a gated recurrent unit (GRU) model to produce the letter vector representation of each word, and finally combining the word vector and the letter vector into a single sequence vector, namely the mixed vector representation; taking the mixed vector representation as the input of each gated recurrent unit and training a recurrent neural network model, so as to mark each word in the plain text information of step one; finding the text marked as V, taking it as the manufacturer of the IoT device, and providing a label of the device manufacturer: (#, device manufacturer, #).
Preferably, step three specifically comprises: setting a window of length W around the device manufacturer category V of step two, finding all the words that appear in the window, and generating a candidate set B; producing a letter-level word vector representation and a general word vector representation for each word in set B; taking the known IoT product model names as set A and comparing the vector representations of the words in set B with those of the words in set A; if the similarity exceeds a threshold T, the word is taken as the product model of the device, giving a label of the product model: (#, #, product model).
The beneficial effects of the invention are as follows: the method provides an effective automation technique that automatically extracts the IoT device information (device type, device manufacturer, product model) from application layer messages. The method is easy to deploy, requires no manually written rules, and is a low-cost, efficient technique for extracting IoT device information.
Drawings
Fig. 1 is a flowchart of a method for automatically extracting IoT device information according to an embodiment of the present invention;
Fig. 2 is a flowchart of extracting device types using a classifier according to an embodiment of the present invention;
Fig. 3 is a structure diagram of the IoT device type model according to an embodiment of the present invention;
Fig. 4 is a diagram illustrating the extraction of IoT device manufacturer information using named entity recognition according to an embodiment of the present invention;
Fig. 5 is a flowchart of extracting a product model based on the device manufacturer and an existing product information set according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
Fig. 1 is a flowchart of a method for automatically extracting IoT device information. Specifically, the method includes:
Step one: train a device type classifier using a deep neural network model. Given an application layer message, obtain the device type information from the trained classifier.
Step two: based on the application layer message of step one, use named entity recognition to extract the words in the message that name the IoT device manufacturer as the device manufacturer information.
Step three: based on the device manufacturer information obtained in step two, use similarity calculation to extract, from the words surrounding the manufacturer information, those whose similarity exceeds a threshold as the product model information.
Step four: exploiting the different characteristics of the different kinds of IoT device information, the method automatically extracts the device type, device manufacturer, and product model from the application layer message.
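As a minimal sketch of how the partial labels produced by steps one to three could be combined in step four, the snippet below merges labels of the form (device type, #, #), (#, device manufacturer, #), and (#, #, product model) into one triple. The function name `merge_labels` and the sample values `acme` and `ac-1000` are illustrative assumptions, not part of the patent.

```python
from typing import Tuple

PLACEHOLDER = "#"  # marks a field that a given step does not fill in

Label = Tuple[str, str, str]  # (device type, device manufacturer, product model)


def merge_labels(*labels: Label) -> Label:
    """Merge partial labels into a single (type, manufacturer, model) triple.

    Later labels overwrite earlier ones only where they carry a real value.
    """
    merged = [PLACEHOLDER] * 3
    for label in labels:
        for position, value in enumerate(label):
            if value != PLACEHOLDER:
                merged[position] = value
    return (merged[0], merged[1], merged[2])


# Hypothetical outputs of steps one, two, and three.
triple = merge_labels(
    ("camera", "#", "#"),
    ("#", "acme", "#"),
    ("#", "#", "ac-1000"),
)
```

A field that no step could fill simply stays `#`, so a partially identified device is still representable.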
Fig. 2 is a flowchart of extracting device types using a classifier.
Step one specifically comprises the following:
The application layer message information must first be preprocessed to delete interfering content, as follows: (1) delete application layer error status codes, e.g. 4XX and 5XX, where 400 indicates a bad request and 500 indicates an internal server error; (2) delete irrelevant content in the Hypertext Markup Language (HTML), such as tags, CSS, and JS; tags are enclosed in angle brackets, such as <br>; (3) remove special characters, such as "$" and "%"; (4) delete timestamps, numbers, punctuation, and stop words; (5) from the remaining message content, extract the plain text and split it into individual words, which is called tokenization. Once the preprocessing module has finished, the banner has been converted into text format, which serves as the input of all subsequent steps.
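The five preprocessing rules can be sketched with the standard `re` module. This is an illustration under stated assumptions rather than the patent's implementation: the function name `preprocess_banner`, the tiny stop-word list, the choice to drop the whole HTTP status line for 4XX/5XX responses, and the sample banner are all hypothetical.

```python
import re
from typing import List

# Assumed, deliberately tiny stop-word list; a real one would be much larger.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def preprocess_banner(banner: str) -> List[str]:
    """Turn a raw application layer banner into a list of plain-text tokens."""
    # (1) Drop status lines that carry a 4XX or 5XX error code.
    lines = [
        line for line in banner.splitlines()
        if not re.match(r"^HTTP/\d(?:\.\d)?\s+[45]\d\d\b", line)
    ]
    text = "\n".join(lines)
    # (2) Drop CSS/JS blocks first, then every remaining HTML tag.
    text = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", text)
    text = re.sub(r"(?s)<[^>]*>", " ", text)
    # (3) + (4) Replace special characters and punctuation with spaces.
    text = re.sub(r"[^A-Za-z0-9]+", " ", text)
    # (5) Tokenize on whitespace.
    tokens = text.lower().split()
    # (4) Drop pure numbers (this also removes timestamp fragments) and stop words.
    return [t for t in tokens if not t.isdigit() and t not in STOP_WORDS]


sample = (
    "HTTP/1.1 404 Not Found\n"
    "Server: Acme-Cam/2.0\n"
    "<html><title>Acme AC1000 Login</title><br>$ 12:30:05 %</html>"
)
tokens = preprocess_banner(sample)
```

Tokens that mix letters and digits, such as `ac1000`, are kept on purpose, since product models often look like this; only purely numeric tokens are discarded.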
This step converts the words in plain text format into word vectors. Specifically, the method processes the training data with Word2Vec to obtain a pre-trained model, which converts words in plain text format into word vectors. The step then uses an attention-based bidirectional long short-term memory network model (full name: Attention-Based Bidirectional Long Short-Term Memory Networks), with the word vectors as input, to train a classifier of the device type. Given an application layer message, it is converted into text tokens and vector representations that serve as the input of the model; the classifier then decides the type of the IoT device, i.e. it provides the device with a label of the form (device type, #, #).
Fig. 3 is a structure diagram of the IoT device type model in the embodiment of the present invention. The attention-based model contains 5 parts: (1) Input layer: a sentence is fed into the model through this layer. (2) Embedding layer: each word is mapped to a low-dimensional vector. Given a sentence consisting of $T$ words, $S = \{x_1, x_2, \ldots, x_T\}$, every word $x_i$ is converted into its word vector $e_i$ by the formula $e_i = W_{wrd} v_i$, where $W_{wrd}$ is a matrix obtained by learning and $v_i$ is a vector whose dimension is the total number of words. (3) LSTM layer: high-level features are obtained from the embedding layer using a bidirectional long short-term memory network, where the model combines the forward and backward outputs by element-wise sum. (4) Attention layer: a weight vector $w$ is generated, the word-level features of each time step are multiplied by the weight vector, and the results are combined into a sentence-level feature vector. The sentence representation finally used for classification is $h^* = \tanh(r)$, where $r = H\alpha^T$, $\alpha = \mathrm{softmax}(w^T M)$, $M = \tanh(H)$, and $H = [h_1, h_2, \ldots, h_T]$ is the output of the LSTM layer. (5) Output layer: the sentence-level feature vector is finally used for classification; the softmax activation function yields the probability of belonging to each device type, and the device type with the maximum probability is taken as the type of the IoT device.
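The attention layer formulas can be traced with a small pure-Python sketch. It assumes the LSTM outputs are already available as a list `H` of per-time-step vectors $h_t$ and that the weight vector `w` has been learned; the function name `attention_pool` and the toy numbers are illustrative assumptions, and no training is shown.

```python
import math
from typing import List, Tuple


def attention_pool(H: List[List[float]], w: List[float]) -> Tuple[List[float], List[float]]:
    """Compute h* = tanh(H alpha^T) with alpha = softmax(w^T tanh(H)).

    H[t] is the LSTM output h_t at time step t; w is the attention weight vector.
    Returns (h_star, alpha).
    """
    # M = tanh(H); the score of step t is w^T m_t.
    scores = [sum(w_k * math.tanh(h_k) for w_k, h_k in zip(w, h_t)) for h_t in H]
    # alpha = softmax(scores), shifted by the maximum for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]
    # r = H alpha^T: attention-weighted sum of the time-step vectors.
    r = [sum(alpha[t] * H[t][k] for t in range(len(H))) for k in range(len(w))]
    # h* = tanh(r) is the sentence-level representation used for classification.
    return [math.tanh(r_k) for r_k in r], alpha


# Toy input: two time steps, two features, and an all-zero weight vector.
h_star, alpha = attention_pool([[1.0, 0.0], [-1.0, 0.0]], [0.0, 0.0])
```

With an all-zero `w` every time step receives the same weight, so the two opposite vectors cancel and `h_star` is the zero vector; a learned `w` would instead emphasize the time steps most indicative of the device type.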
Fig. 4 illustrates how the method extracts the manufacturer information of an IoT device using named entity recognition. Step two specifically comprises the following:
Named entity recognition is a technique for recognizing entities with specific meanings in natural language text. After step one, the application layer message information has become plain text, and the method uses named entity recognition to identify the category of each word. In this step there are two categories, marked V and O, where V represents the device manufacturer category and O represents all other categories.
In the named entity recognition task, this step first vectorizes the plain text information of step one in three different ways: word vectors, letter vectors, and mixed vectors. A Gated Recurrent Unit (GRU) model produces the letter vector representation of each word; the word vector and the letter vector are then combined into a single sequence vector, namely the mixed vector representation, which is used as the input of each GRU to train a recurrent neural network model, so that each word in the plain text information of step one is marked. The text marked as V is then found and taken as the manufacturer of the IoT device, i.e. the device is given a label of the form (#, device manufacturer, #).
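Once the tagger has marked every token as V or O, reading off the manufacturer is a simple scan. The sketch below covers only that final lookup, not the GRU model itself; the function name `extract_vendors`, the rule of joining adjacent V tokens into one name, and the sample tokens and tags are assumptions made for illustration.

```python
from typing import List


def extract_vendors(tokens: List[str], tags: List[str]) -> List[str]:
    """Return the manufacturer names formed by maximal runs of V-tagged tokens."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    vendors: List[str] = []
    current: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == "V":
            current.append(token)      # still inside a manufacturer name
        elif current:
            vendors.append(" ".join(current))  # a run of V tokens just ended
            current = []
    if current:                        # the text ended inside a V run
        vendors.append(" ".join(current))
    return vendors


# Hypothetical tagger output for a preprocessed banner.
tokens = ["server", "acme", "cam", "login"]
tags = ["O", "V", "O", "O"]
vendors = extract_vendors(tokens, tags)
label = ("#", vendors[0], "#") if vendors else ("#", "#", "#")
```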
Fig. 5 is a flowchart of extracting a product model based on the device manufacturer and an existing product information set in the embodiment of the present invention. Step three specifically comprises the following:
Based on the device manufacturer category V of step two, this step sets a window of length W, finds all the words that appear in the window, and generates a candidate set B. Each word in set B is given a letter-level word vector (character embedding) representation and a general word vector (word embedding) representation. The known IoT product model names are taken as set A, and the vector representations of the words in set B are compared with those of the words in set A. If the similarity between a word in set B and a word in set A exceeds a threshold T, the word is taken as the product model of the device, i.e. the device is given a label of the form (#, #, product model).
Letter-level word vectors and general word vectors are character-level and word-level word vectors, respectively. Specifically, a letter-level word vector is obtained by first vectorizing the letters in a word and then deriving the vector of the word from them; a general word vector is obtained directly for the word as a whole. The former favors the representation of low-frequency words, and the latter favors the representation of high-frequency words.
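The window-and-threshold procedure of step three can be sketched as follows. Because the patent does not specify the embedding models, this sketch swaps in character-bigram count vectors with cosine similarity as a simple stand-in for the letter-level word vectors; that substitution, the reading of W as a number of tokens on each side of the manufacturer, and all names and sample values are assumptions.

```python
import math
from collections import Counter
from typing import List


def char_ngram_vector(word: str, n: int = 2) -> Counter:
    """Stand-in for a letter-level word vector: counts of character n-grams."""
    padded = f"<{word}>"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[key] * v[key] for key in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def candidate_models(tokens: List[str], vendor_index: int, known_models: List[str],
                     window: int = 3, threshold: float = 0.5) -> List[str]:
    """Return the words near the manufacturer whose similarity to set A exceeds T."""
    lo = max(0, vendor_index - window)
    hi = min(len(tokens), vendor_index + window + 1)
    # Candidate set B: every word inside the window except the manufacturer itself.
    set_b = [t for i, t in enumerate(tokens[lo:hi], start=lo) if i != vendor_index]
    # Set A: vectors of the known product model names.
    set_a = [char_ngram_vector(model) for model in known_models]
    hits = []
    for word in set_b:
        vec = char_ngram_vector(word)
        best = max((cosine(vec, known) for known in set_a), default=0.0)
        if best > threshold:
            hits.append(word)
    return hits


models = candidate_models(["server", "acme", "ac1000", "login"], vendor_index=1,
                          known_models=["ac1000", "ac2000"], window=2, threshold=0.8)
```

A character-level representation is used here because model strings such as `ac1000` are typically rare tokens, which matches the remark above that letter-level vectors favor low-frequency words.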

Claims (4)

1. A method for automatically extracting Internet of Things (IoT) device information, characterized by comprising the following steps:
step one: determining the device type information, including: step a, preprocessing the application layer message information and deleting interfering content, so that once the preprocessing module has finished, the banner has been converted into text format as the input of all subsequent steps; step b, converting the words in plain text format into word vectors and training a device type classifier; step c, processing the application layer message to obtain the device type information;
step two: determining the device manufacturer information, including: step d, using named entity recognition to identify the entity category to which the text belongs; step e, obtaining the device manufacturer information using a recurrent neural network model;
step three: determining the product model information, including: using similarity calculation near the device manufacturer information to extract the words whose similarity exceeds a threshold, thereby obtaining the product model information;
step four: determining the IoT device information, including: combining the above three steps to obtain the IoT device information, namely (device type, device manufacturer, product model).
2. The method for automatically extracting IoT device information according to claim 1, wherein
the preprocessing in step a comprises the following steps: a1, deleting application layer error status codes; a2, deleting irrelevant content of the Hypertext Markup Language; a3, removing special characters; a4, deleting timestamps, numbers, punctuation, and stop words; a5, extracting the plain text from the remaining message content and splitting it into individual words (tokenization);
step b specifically comprises: processing the training data with Word2Vec to obtain a pre-trained model, converting the words in plain text format into word vectors, and training a classifier of the device type using an attention-based bidirectional long short-term memory network model with the word vectors as input;
step c specifically comprises: given an application layer message, converting it into text tokens and vector representations as the input of the model; the classifier decides the type of the IoT device and provides a label of the device type: (device type, #, #).
3. The method for automatically extracting IoT device information according to claim 1, wherein step d specifically comprises: the application layer message information processed in step one has become plain text; the category of each word is identified and marked as V or O, where V represents the device manufacturer category and O represents all other categories;
step e specifically comprises: vectorizing the plain text information of step one in three different ways, namely as word vectors, letter vectors, and mixed vectors; using a gated recurrent unit (GRU) model to produce the letter vector representation of each word, and finally combining the word vector and the letter vector into a single sequence vector, namely the mixed vector representation; taking the mixed vector representation as the input of each gated recurrent unit and training a recurrent neural network model, so as to mark each word in the plain text information of step one; finding the text marked as V, taking it as the manufacturer of the IoT device, and providing a label of the device manufacturer: (#, device manufacturer, #).
4. The method for automatically extracting IoT device information according to claim 1, wherein step three specifically comprises: setting a window of length W around the device manufacturer category V of step two, finding all the words that appear in the window, and generating a candidate set B; producing a letter-level word vector representation and a general word vector representation for each word in set B; taking the known IoT product model names as set A and comparing the vector representations of the words in set B with those of the words in set A; if the similarity exceeds a threshold T, the word is taken as the product model of the device, giving a label of the product model: (#, #, product model).
CN202110516557.0A 2021-05-12 2021-05-12 Method for automatically extracting information of Internet of things equipment Active CN113191149B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110516557.0A CN113191149B (en) 2021-05-12 2021-05-12 Method for automatically extracting information of Internet of things equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110516557.0A CN113191149B (en) 2021-05-12 2021-05-12 Method for automatically extracting information of Internet of things equipment

Publications (2)

Publication Number Publication Date
CN113191149A (en) 2021-07-30
CN113191149B (en) 2023-04-07

Family

ID=76981573

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110516557.0A Active CN113191149B (en) 2021-05-12 2021-05-12 Method for automatically extracting information of Internet of things equipment

Country Status (1)

Country Link
CN (1) CN113191149B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090326923A1 (en) * 2006-05-15 2009-12-31 Panasonic Corporation Method and apparatus for named entity recognition in natural language
CN111726336A (en) * 2020-05-14 2020-09-29 北京邮电大学 Method and system for extracting identification information of networked intelligent equipment
CN111783466A (en) * 2020-07-15 2020-10-16 电子科技大学 Named entity identification method for Chinese medical records
CN111897962A (en) * 2020-07-27 2020-11-06 绿盟科技集团股份有限公司 Internet of things asset marking method and device
CN112564974A (en) * 2020-12-08 2021-03-26 武汉大学 Deep learning-based fingerprint identification method for Internet of things equipment

Also Published As

Publication number Publication date
CN113191149B (en) 2023-04-07

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant