CN112256840A - Device for carrying out industrial internet discovery and extracting information by improving transfer learning model

Info

Publication number: CN112256840A
Application number: CN202011256306.5A
Authority: CN
Inventors: 林飞; 汪致伦; 王丹; 易永波; 古元
Original assignee: Beijing Act Technology Development Co ltd
Current assignee: Beijing Act Technology Development Co ltd
Priority date: 2020-11-12
Filing date: 2020-11-12
Publication date: 2021-01-22

Abstract

A device that uses an improved transfer learning model to discover industrial internet platforms and extract information, relating to the technical field of information. The device consists of a web crawler, a text cleaning module, a content classification execution module, an improved transfer learning model and an entity identification module. First, the invention does not need massive labeled texts for training, which saves a great deal of labor cost; second, the method is not affected by word segmentation and can obtain more relevant text features for website classification and for key service information extraction from industrial internet platform websites.

Description

Device for carrying out industrial internet discovery and extracting information by improving transfer learning model

Technical Field

The invention relates to the technical field of information, in particular to the technical field of information security.

Background

As the manufacturing industry moves ever faster from the digitalization stage to the networking stage, industrial internet platforms in China have grown rapidly, and the timely discovery and management of platform information has become a problem that urgently needs to be solved. The internet contains many types of websites. The first problem is how to automatically find industrial internet platform websites among a large number of websites, and the second is how to extract key platform information from the content of those websites.

At present, industrial internet platform information is mainly collected manually, which wastes manpower and time, so a method for automatically discovering platforms and extracting platform information is urgently needed.

In recent years, the rapid development of artificial intelligence technology has brought considerable progress in the field of natural language processing, in which text classification is used to distinguish texts with different characteristics, and named entity recognition is mainly used for information extraction and for structuring text data.

Current website classification methods are mainly based either on traditional machine learning algorithms or on deep learning. A traditional machine learning approach, such as invention patent CN106168968A, determines the website category by calculating the weight of the data matched against a dictionary. Because the dictionary is difficult to construct and the types of websites are numerous, such an approach struggles to classify websites accurately from the dictionary. A deep learning approach, such as invention patent CN110442823A, requires a large number of training samples to train the parameters of the neural network, and collecting that many samples takes a long time and consumes a large amount of human resources.

In the prior art, named entity identification methods are mainly based either on traditional machine learning or on deep learning. In an entity identification method based on traditional machine learning, such as invention patent CN111274804A, a model is learned statistically from labeled data, the data to be predicted is fed to the model, and the model computes the most probable entities using the Viterbi algorithm. The biggest defect of this method is that it cannot understand semantics and is therefore not suited to complex entity identification tasks. A named entity recognition method based on deep learning, such as invention patent CN111126068A, builds a neural network model to learn semantic features and can learn more complex semantics, but it needs a large amount of labeled data, and data labeling is very time-consuming and labor-intensive.

In view of the high complexity, high implementation cost and heavy labor consumption of the prior art, the device of the invention improves the transfer learning model: by sharing parameters across the layers of the model, it improves the model's computational efficiency, so that classified industrial internet sample data can be modeled quickly to obtain an industrial internet classification model. Real-time data is then obtained through network information capture and data cleaning and input into the industrial internet classification model, which yields the industrial internet classification of the real-time data. Key information is then captured from the real-time data to obtain updated industrial internet sample data, which is merged into the classified industrial internet sample data. The invention thus completes the classification and information capture of the industrial internet automatically throughout the whole process, and gradually corrects and enriches the classified industrial internet sample data, so that the industrial internet classification model continuously evolves and improves. The invention is efficient and operates in real time.

Description of the general techniques used

Transfer learning model: the transfer learning model used in this patent application is StructBERT, an NLP pre-training model proposed by the Alibaba DAMO Academy that makes improvements on the basis of the original BERT. Its authors argue that BERT's pre-training tasks ignore language structure information, so StructBERT adds two training objectives based on language structure to BERT's original masked language model (MaskLM) objective: a word order task and a sentence order task.
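The word order objective can be illustrated with a minimal sketch. It assumes, following the later description of the algorithm, that a fraction (5%) of length-3 subsequences is drawn from unmasked positions and shuffled; the function name `shuffle_trigrams`, its parameters, and the choice to select at least one span are hypothetical and are not taken from the patent.

```python
import random

def shuffle_trigrams(tokens, masked_positions, ratio=0.05, rng=None):
    """Shuffle a fraction of non-overlapping, unmasked trigrams.

    Returns the corrupted token list and a dict mapping each chosen start
    index to the original trigram the model would have to reconstruct.
    """
    rng = rng or random.Random()
    tokens = list(tokens)  # work on a copy
    # Start positions whose three tokens are all unmasked.
    candidates = [
        i for i in range(len(tokens) - 2)
        if not any(j in masked_positions for j in (i, i + 1, i + 2))
    ]
    # Assumption: always pick at least one span when any candidate exists.
    n_select = max(1, round(len(candidates) * ratio)) if candidates else 0
    rng.shuffle(candidates)
    targets, used = {}, set()
    for start in candidates:
        if len(targets) == n_select:
            break
        span = (start, start + 1, start + 2)
        if not used.isdisjoint(span):
            continue  # keep selected spans non-overlapping
        used.update(span)
        original = tokens[start:start + 3]
        permuted = original[:]
        rng.shuffle(permuted)
        tokens[start:start + 3] = permuted
        targets[start] = original
    return tokens, targets
```

The returned `targets` dictionary plays the role of the training labels: for each shuffled span, the model is asked to recover the original order.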

Named entity recognition: named entity recognition refers to the recognition of specific objects in text, the semantic categories of which are usually predefined before recognition, such as people, addresses, organizations, etc. Named entity recognition is not just an independent information extraction task, it also plays a key role in many large NLP applications such as information retrieval, automatic text summarization, question and answer systems, machine translation, and knowledge base building.

Disclosure of Invention

In view of the defects of the prior art, the device for discovering the industrial internet and extracting the information by the improved transfer learning model provided by the invention consists of a web crawler, a text cleaning module, a content classification execution module, the improved transfer learning model and an entity identification module;

the web crawler is responsible for crawling web page content and sending the web page content and the web page address to the text cleaning module;

the text cleaning module is responsible for removing noise characters from the text formed by the webpage content and the webpage address to generate clean webpage information, and the text cleaning module sends the clean webpage information to the content classification execution module; the noise characters include: HTML tags, stop words, forwarding symbols, URLs and marking information;
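A minimal sketch of such a cleaning step is shown below, using only the Python standard library. The stop-word list, the forwarding-marker pattern and the function name `clean_text` are illustrative assumptions; the patent does not specify them, and "marking information" is not handled here because its format is not described.

```python
import re
from html import unescape

# Hypothetical stop-word list; a real deployment would load a full list.
STOP_WORDS = {"the", "a", "an", "of", "and"}

TAG_RE = re.compile(r"<[^>]+>")                # HTML tags
URL_RE = re.compile(r"https?://\S+|www\.\S+")  # URLs
FORWARD_RE = re.compile(r"\bRT\b|//@\w+:?")    # assumed forwarding markers

def clean_text(raw: str) -> str:
    """Remove tags, URLs, forwarding markers and stop words from raw text."""
    text = unescape(raw)              # decode entities such as &amp;
    text = TAG_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    text = FORWARD_RE.sub(" ", text)
    tokens = [t for t in text.split() if t.lower() not in STOP_WORDS]
    return " ".join(tokens)
```

Whitespace tokenization is used only to keep the sketch short; Chinese webpage text would need a different stop-word filter.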

the content classification execution module comprises an industrial internet classification model, and the industrial internet classification model is obtained by performing language training on classified internet sample data through an improved transfer learning model; the industrial internet classification model consists of classification labels of classified internet sample data and the probability that the content of the classified internet sample data belongs to each classification label;

the algorithm of the improved transfer learning model is as follows: 1) StructBERT is used to represent each word of each sentence in the text, and a bidirectional Transformer then learns from the represented text; the Transformer is a standard component of StructBERT; in a conventional Transformer the parameters of each layer are independent, so the number of parameters grows markedly as the number of layers increases, whereas this model shares the parameters of all layers and therefore learns only one layer's worth of parameters; 2) in the improved StructBERT, each word is represented jointly by a word vector, a segment vector and a position vector; the vector of the first word is used for the subsequent classification task, the segment vector distinguishes the two sentences, and the position vector carries word position information; 3) semantic features are learned through four training tasks: i) a masked language model task, ii) a next sentence prediction task, iii) a word order task, iv) a sentence structure task;

in the masked language model task, 15% of the words are randomly selected during training and the model predicts them; of these selected words, 80% are replaced by the mask symbol, 10% are left unchanged, and 10% are replaced by other words; through this task the model learns the semantic information of the text;

the next sentence prediction task lets the model learn the relationship between sentences: the training input is a pair of sentences S1 and S2, where S2 is the actual next sentence of S1 with probability one half, and the model predicts whether S2 is the next sentence of S1;

the word order task selects a portion (5%) of the length-3 subsequences in the unmasked sequence, shuffles the words within each selected subsequence, and has the model reconstruct the original order, so that the model learns the word order relations within sentences;

in the sentence structure task, given a sentence pair (S1, S2), the model judges whether S2 is the context of S1 (its next or previous sentence) or is unrelated to S1; during sampling, for a sentence S, with probability 1/3 the next sentence of S is sampled to form the pair, with probability 1/3 the previous sentence of S is sampled, and with probability 1/3 a sentence is sampled at random from another document;
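The 15% selection and the 80/10/10 replacement rule of the masked language model task can be sketched as follows. This is an illustration under stated assumptions rather than the patented implementation: the `[MASK]` string, the function name `mask_tokens`, and drawing replacement words uniformly from a supplied vocabulary are all hypothetical choices.

```python
import random

MASK = "[MASK]"  # assumed mask symbol

def mask_tokens(tokens, vocab, rng=None, select_prob=0.15):
    """Apply masked-language-model corruption to a token list.

    Returns the corrupted tokens and a dict {position: original token}
    holding the prediction targets.
    """
    rng = rng or random.Random()
    out = list(tokens)
    labels = {}
    for i, tok in enumerate(tokens):
        if rng.random() >= select_prob:
            continue                    # position not selected (about 85%)
        labels[i] = tok                 # the model must predict this token
        r = rng.random()
        if r < 0.8:
            out[i] = MASK               # 80%: replace with the mask symbol
        elif r < 0.9:
            out[i] = rng.choice(vocab)  # 10%: replace with another word
        # remaining 10%: leave the token unchanged
    return out, labels
```

Leaving some selected tokens unchanged or swapping them for other words means the model cannot rely on the mask symbol alone to know which positions it must predict.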

the content classification execution module compares the clean webpage information with the industrial internet classification model, discards clean webpage information that does not belong to the industrial internet classification, and sends clean webpage information that belongs to the industrial internet classification to the entity identification module;
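The keep-or-discard decision can be sketched as a small routing function over the per-label probabilities produced by the classification model. The label name `industrial_internet`, the 0.5 threshold and the function name `route_page` are assumptions made for illustration; the patent does not state a decision rule beyond comparing the page with the model.

```python
from typing import Dict, Optional

INDUSTRIAL_LABEL = "industrial_internet"  # hypothetical label name

def route_page(label_probs: Dict[str, float],
               threshold: float = 0.5) -> Optional[str]:
    """Return the label if the page should go on to entity identification.

    Returns None when the page should be discarded.
    """
    best = max(label_probs, key=label_probs.get)  # most probable label
    if best == INDUSTRIAL_LABEL and label_probs[best] >= threshold:
        return best
    return None
```

For example, a page scored `{"industrial_internet": 0.9, "news": 0.1}` would be forwarded, while one scored `{"industrial_internet": 0.3, "news": 0.7}` would be discarded.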

the entity identification module comprises an entity category model, the entity category model is obtained by performing language training on classified industrial internet sample data with entity category labels through an improved transfer learning model, and the entity category model is composed of the classification labels of the classified industrial internet sample data with the entity category labels and the probability that the content of the classified industrial internet sample data with the entity category labels belongs to each classification label;

the entity identification module compares the clean webpage information with the entity category model, outputs the content in the clean webpage information together with the entity category label corresponding to that content, and generates updated classified industrial internet data with entity category labels;

the entity identification module incorporates the updated categorized industrial internet data with the entity category label into categorized industrial internet sample data with the entity category label.
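The incorporation step amounts to merging newly labeled entities into the existing sample set. The sketch below assumes that a sample is a (text, entity category label) pair and that exact duplicates are not stored twice; the class `LabeledEntity`, the function `incorporate` and the example labels are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Set

@dataclass(frozen=True)
class LabeledEntity:
    text: str   # content extracted from the clean webpage information
    label: str  # entity category label assigned by the model

def incorporate(sample_data: Set[LabeledEntity],
                updates: Iterable[LabeledEntity]) -> int:
    """Merge updated labeled data into the sample set in place.

    Returns the number of samples that were actually new.
    """
    before = len(sample_data)
    sample_data.update(updates)
    return len(sample_data) - before
```

Retraining the entity category model on the enlarged sample set is what lets it evolve over time; the sketch does not cover when or how often that retraining happens.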

Advantageous effects

Compared with the traditional text classification and information extraction technology, the method does not need massive texts with labels for training, and saves a large amount of labor cost; and secondly, the method is not influenced by word segmentation, and more relevant text features can be obtained for website classification and key service information extraction of the industrial internet platform website.

Drawings

FIG. 1 is a system block diagram of the present invention.

Detailed Description

With reference to FIG. 1, the device provided by the invention for industrial internet discovery and information extraction with an improved transfer learning model consists of a web crawler 1, a text cleaning module 2, a content classification execution module 3, an improved transfer learning model 4 and an entity identification module 5;

the web crawler 1 is responsible for crawling web page contents and sending the web page contents and the web page addresses 10 to the text cleaning module 2;

the text cleaning module 2 is responsible for removing noise characters from the text formed by the webpage content and the webpage address 10 to generate clean webpage information, and the text cleaning module 2 sends the clean webpage information to the content classification execution module 3; the noise characters include: HTML tags, stop words, forwarding symbols, URLs and marking information;

the content classification execution module 3 comprises an industrial internet classification model 41, and the industrial internet classification model 41 is obtained by performing language training on classified internet sample data 40 through an improved transfer learning model 4; the industrial internet classification model 41 is composed of classification labels of the classified internet sample data 40 and probabilities that the contents of the classified internet sample data 40 belong to each classification label;

the algorithm of the improved transfer learning model 4 is as follows: 1) StructBERT is used to represent each word of each sentence in the text, and a bidirectional Transformer then learns from the represented text; the Transformer is a standard component of StructBERT; in a conventional Transformer the parameters of each layer are independent, so the number of parameters grows markedly as the number of layers increases, whereas this model shares the parameters of all layers and therefore learns only one layer's worth of parameters; 2) in the improved StructBERT, each word is represented jointly by a word vector, a segment vector and a position vector; the vector of the first word is used for the subsequent classification task, the segment vector distinguishes the two sentences, and the position vector carries word position information; 3) semantic features are learned through four training tasks: i) a masked language model task, ii) a next sentence prediction task, iii) a word order task, iv) a sentence structure task;

in the masked language model task, 15% of the words are randomly selected during training and the model predicts them; of these selected words, 80% are replaced by the mask symbol, 10% are left unchanged, and 10% are replaced by other words; through this task the model learns the semantic information of the text;

the next sentence prediction task lets the model learn the relationship between sentences: the training input is a pair of sentences S1 and S2, where S2 is the actual next sentence of S1 with probability one half, and the model predicts whether S2 is the next sentence of S1;

the word order task selects a portion (5%) of the length-3 subsequences in the unmasked sequence, shuffles the words within each selected subsequence, and has the model reconstruct the original order, so that the model learns the word order relations within sentences;

in the sentence structure task, given a sentence pair (S1, S2), the model judges whether S2 is the context of S1 (its next or previous sentence) or is unrelated to S1; during sampling, for a sentence S, with probability 1/3 the next sentence of S is sampled to form the pair, with probability 1/3 the previous sentence of S is sampled, and with probability 1/3 a sentence is sampled at random from another document;
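The effect of sharing parameters across layers can be made concrete with a short parameter count for a standard Transformer encoder layer. The sizes used in the usage note (hidden size 768, feed-forward size 3072, 12 layers) are the common BERT-base sizes and are assumptions, since the patent does not state the model dimensions; the function names are likewise hypothetical.

```python
def encoder_layer_params(hidden: int, ffn: int) -> int:
    """Count weights and biases in one standard Transformer encoder layer."""
    # Query, key, value and output projections, each hidden x hidden + bias.
    attention = 4 * (hidden * hidden + hidden)
    # Two feed-forward projections with their biases.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    # Two layer norms, each with a scale and a shift vector.
    layer_norms = 2 * 2 * hidden
    return attention + feed_forward + layer_norms

def encoder_params(hidden: int, ffn: int, num_layers: int,
                   share_layers: bool) -> int:
    """Total encoder-stack parameters, with or without cross-layer sharing."""
    per_layer = encoder_layer_params(hidden, ffn)
    return per_layer if share_layers else per_layer * num_layers
```

With the assumed sizes, `encoder_params(768, 3072, 12, share_layers=False)` gives 85,054,464 parameters, while `share_layers=True` gives 7,087,872, one layer's worth, which matches the statement that the shared model learns only the parameter quantity of a single layer. Embedding parameters are not included in either count.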

the content classification execution module 3 compares the clean webpage information with the industrial internet classification model 41, discards clean webpage information that does not belong to the industrial internet classification, and sends clean webpage information that belongs to the industrial internet classification to the entity identification module 5;

the entity identification module 5 comprises an entity category model 51, the entity category model 51 is obtained by language training of the classified industrial internet sample data 50 with the entity category label through the improved transfer learning model 4, and the entity category model 51 is composed of the classification label of the classified industrial internet sample data 50 with the entity category label and the probability that the content of the classified industrial internet sample data 50 with the entity category label belongs to each classification label;

the entity identification module 5 compares the clean webpage information with the entity category model 51, outputs the content in the clean webpage information together with the entity category label corresponding to that content, and generates the updated classified industrial internet data 52 with entity category labels;

the entity identification module 5 incorporates the updated entity class tagged classified industrial internet data 52 into the entity class tagged classified industrial internet sample data 50.

Claims

1. The device for carrying out industrial internet discovery and extracting information by improving the transfer learning model is characterized by consisting of a web crawler, a text cleaning module, a content classification execution module, an improved transfer learning model and an entity identification module;