CN111581476A - Intelligent webpage information extraction method based on BERT and LSTM

Info

Publication number: CN111581476A
Application number: CN202010351978.8A
Authority: CN
Inventors: 王敏
Original assignee: Shenzhen Hezong Data Technology Co ltd
Current assignee: Shenzhen Hezong Data Technology Co ltd
Priority date: 2020-04-28
Filing date: 2020-04-28
Publication date: 2020-08-25

Abstract

The invention belongs to the field of Internet information extraction and mining, and discloses an intelligent webpage information extraction method based on BERT and LSTM, comprising the following steps: S1, crawling webpages; S2, preprocessing the webpages; S3, labeling webpage targets; S4, training a neural network model; S5, deploying the neural network model; S6, after training is completed, using the obtained model to recognize target information in input webpages. With this scheme, information can still be located accurately after page content is updated and the webpage structure changes, so that accurate extraction results are obtained. Information can thus be extracted from webpages intelligently, which reduces the cost and increases the speed of webpage information extraction. In testing, the model's recognition accuracy on pages in the training set was above 98%, and its recall was above 97%.

Description

An Intelligent Webpage Information Extraction Method Based on BERT and LSTM

Technical Field

The invention relates to the technical field of Internet information extraction and mining, and more particularly to an intelligent webpage information extraction method based on BERT and LSTM.

Background

With the rapid development of science and technology, the Web has become the largest encyclopedic database in the world. Although users can conveniently obtain information on the Web through applications such as browsers, retrieving information on the Internet remains relatively difficult. If the information in Internet webpages can be extracted and stored in structured form to build a knowledge graph, search can be made more accurate and convenient.

Information extraction based on HTML structure is the method used by web crawlers in the usual sense. After the position of the information in the HTML has been located, regular expressions are combined with HTML selectors and similar tools to write extraction rules that match the format of the current HTML page, and the data are extracted from the HTML document. This is currently the most widely used extraction method. However, it scales poorly, requires a large amount of manual work, and is very costly.

The concept of deep learning originated from research on artificial neural networks. It is a branch of machine learning and comprises a family of algorithms for modeling high-level abstractions in data. Compared with shallow learning, it can combine low-level features across multiple layers to form more abstract high-level features, so that features are learned automatically without human involvement in feature selection.

Deep learning has developed rapidly in recent years because, in 2006, Hinton et al. at the University of Toronto, Canada, proposed the "layer-by-layer initialization" (layer-wise pre-training) algorithm based on the Deep Belief Network (DBN). This algorithm successfully addressed the unsatisfactory learning performance of deep architectures and brought hope for solving the optimization difficulties associated with them.

BERT (Bidirectional Encoder Representations from Transformers) is a Transformer-based bidirectional encoder representation. "Bidirectional" means that, when processing a word, the model takes into account the words both before and after it, and thereby captures the semantics of the context.

The long short-term memory model (LSTM, Long Short-Term Memory) is a specific RNN architecture. Its purpose is to solve the long-term dependency problem in neural networks, that is, to provide the ability to use past sequence information to infer current sequence information. The basic node in an LSTM is called a "cell". The state parameter in the cell is responsible for memorizing information; inputs and outputs interact with the cell through the input gate and the output gate respectively; and a forget gate is added to the model to discard expired memory, so that a certain degree of memory is achieved.
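The gate mechanism described above can be illustrated with a minimal sketch of one LSTM step. This is the standard textbook cell written for scalar inputs, not code from the patent; the function name `lstm_cell_step` and the weight layout `w` are assumptions made for illustration.

```python
import math


def sigmoid(x):
    """Logistic function used by the three gates."""
    return 1.0 / (1.0 + math.exp(-x))


def lstm_cell_step(x, h_prev, c_prev, w):
    """Run one scalar LSTM step.

    `w` is a hypothetical dict mapping each gate name to a tuple
    (input weight, recurrent weight, bias).
    """
    def pre_activation(name):
        w_x, w_h, bias = w[name]
        return w_x * x + w_h * h_prev + bias

    forget = sigmoid(pre_activation("forget"))        # how much old state to keep
    write = sigmoid(pre_activation("input"))          # how much new content to store
    read = sigmoid(pre_activation("output"))          # how much state to expose
    candidate = math.tanh(pre_activation("candidate"))  # proposed new content

    c = forget * c_prev + write * candidate  # updated cell state (the memory)
    h = read * math.tanh(c)                  # output passed to the next step
    return h, c
```

With all weights set to zero, each gate evaluates to 0.5, so the cell keeps half of its previous state and writes nothing new, which makes the role of the forget gate easy to see.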

The prior art has the following problems: precise information extraction requires considerable manual work, and the information to be extracted is located manually by analyzing the page structure. Because webpage information is dynamic and updated in real time, once page content is updated or the webpage structure changes, the location information easily becomes invalid, causing extraction to fail or to produce inaccurate results.

Summary of the Invention

1. Technical problem to be solved

In view of the problems in the prior art, the purpose of the present invention is to provide an intelligent webpage information extraction method based on BERT and LSTM that can still locate information accurately and obtain accurate extraction results after page content is updated and the webpage structure changes.

2. Technical solution

To solve the above problems, the present invention adopts the following technical solution.

An intelligent webpage information extraction method based on BERT and LSTM comprises the following steps:

S1. Crawling webpages: crawl the vertical-domain webpages from which information is to be extracted, collecting at least 3000 company pages as the training set and 500 pages as the test set;

S2. Webpage preprocessing: preprocess the webpages and clean out useless HTML tags;

S3. Webpage target labeling: use manually written rules to label webpage targets automatically; a person first inspects the HTML tags that contain the information to be extracted, and then uses XPath to attach the corresponding class label to those tags;

S4. Training the neural network model: first build a vocabulary table and establish the mapping; then split the webpage source into individual tokens, and fine-tune a pre-trained BERT model to obtain word vectors;

S5. Deploying the neural network model: perform automatic information extraction on webpages and evaluate the accuracy and recall of the model;

S6. After training is completed, use the obtained model to recognize target information in input webpages, and use the 500 pages in the test set to test the recognition performance; during recognition, the labels are ignored and the type of the target information is determined from the content and context of each text node.

Preferably, the company page information includes the unified social credit code, type, business scope, address, and the like.

Preferably, the neural network model is trained with the Nesterov Momentum algorithm as the optimization algorithm; the training set is trained for 100 epochs, and the resulting training loss is less than 0.09.

3. Beneficial effects

Compared with the prior art, the advantages of the present invention are:

(1) The scheme applies to fields such as Internet information extraction and mining and open knowledge graph construction. Information can still be located accurately after page content is updated and the webpage structure changes, so that accurate extraction results are obtained. Information is thus extracted from webpages intelligently, which reduces the cost and increases the speed of webpage information extraction.

(2) In testing, the model's recognition accuracy on pages in the training set was above 98%, and its recall was above 97%.

Description of Drawings

FIG. 1 is a schematic diagram of the neural network architecture of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the present invention.

In the description of the present invention, it should be noted that orientations or positional relationships indicated by terms such as "upper", "lower", "inner", "outer", and "top/bottom" are based on those shown in the drawings. They are used only to facilitate and simplify the description, and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation; they therefore should not be construed as limiting the present invention. In addition, the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance.

In the description of the present invention, it should also be noted that, unless otherwise expressly specified and limited, terms such as "installed", "provided with", "sleeved/connected", and "connected" should be understood broadly. For example, a "connection" may be a fixed connection, a detachable connection, or an integral connection; it may be a mechanical connection or an electrical connection; it may be a direct connection, an indirect connection through an intermediate medium, or an internal communication between two elements. Those of ordinary skill in the art can understand the specific meanings of these terms in the present invention according to the specific situation.

Referring to FIG. 1, an intelligent webpage information extraction method based on BERT and LSTM comprises the following steps:

S1. Crawling webpages: crawl the vertical-domain webpages from which information is to be extracted, collecting at least 3000 company pages as the training set and 500 pages as the test set. Information extraction aims, through neural network training, to automatically extract target information from websites of the same kind that share a similar structure, which removes the need to write extraction rules separately for each type of website.

S2. Webpage preprocessing: preprocess the webpages and clean out useless HTML tags. The HTML code of enterprise information pages contains a great deal of useless information; it is overly cluttered, and most of it does not help in recognizing the target information. The pages are therefore preprocessed before training to simplify their content. After preprocessing, the structure of the page information is very clear, and the positional structure of the target information is easy to observe: for example, the type information immediately follows the "Type" node, and the unified social credit code immediately follows the "Unified Social Credit Code" node. Because positional structure is particularly important for the system in determining the type of the target information, preprocessing is an important step in adding ordinary webpages to the training set.
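As an illustration of the preprocessing in step S2, the following sketch reduces a page to its ordered text nodes using Python's standard `html.parser`. The patent does not say which tags count as useless; the `SKIP_TAGS` set and the names `TextNodeExtractor` and `extract_text_nodes` are assumptions chosen for this example.

```python
from html.parser import HTMLParser

# Assumed set of tags whose content is treated as noise (not from the patent).
SKIP_TAGS = {"script", "style", "noscript"}


class TextNodeExtractor(HTMLParser):
    """Collect visible text nodes in document order, dropping all markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.nodes.append(text)


def extract_text_nodes(html):
    """Return the non-empty text nodes of `html` as a list of strings."""
    parser = TextNodeExtractor()
    parser.feed(html)
    parser.close()
    return parser.nodes
```

Because the nodes are kept in document order, the positional pattern noted above (a value node directly following its "Type" node) survives the cleaning.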

S3. Webpage target labeling: use manually written rules to label webpage targets automatically. A person first inspects the HTML tags that contain the information to be extracted, and then uses XPath to attach the corresponding class label to those tags. The purpose is to tell the system which general pieces of information in the webpage structure need to be obtained, so that such information can be recognized during training. Correctly labeled pages can be added to the training set as valid samples, which ensures the correctness of the training results.
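A minimal sketch of rule-based labeling in the spirit of step S3 follows. It assumes the preprocessed page is well-formed XHTML so that the standard-library `xml.etree.ElementTree` (which supports a limited XPath subset) can parse it. The rule table `LABEL_RULES`, the `id` attributes, the label names, and the sample credit code are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical hand-written rules: class label -> XPath of the value element.
LABEL_RULES = {
    "credit_code": ".//td[@id='code']",
    "company_type": ".//td[@id='type']",
}


def label_page(xhtml, rules=LABEL_RULES, default="other"):
    """Return (text, label) pairs for every element that has leading text.

    Elements matched by a rule receive that rule's label; all remaining
    text-bearing elements receive `default`. Tail text is ignored here.
    """
    root = ET.fromstring(xhtml)
    matched = {}
    for label, xpath in rules.items():
        for element in root.findall(xpath):
            matched[element] = label

    samples = []
    for element in root.iter():  # document order
        text = (element.text or "").strip()
        if text:
            samples.append((text, matched.get(element, default)))
    return samples
```

The resulting pairs can serve as training samples: the model sees only the text and its context, while the XPath rules are needed only to generate the labels.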

S4. Training the neural network model: first build a vocabulary table and establish the mapping. In order to recognize the key characters in text nodes, characterize the text nodes, and generate word vectors as input to the neural network, a vocabulary table must be built for the characters and words encountered in the webpage information.

Given the structural characteristics of information in webpages, the text in each text node behaves more like a single whole. A mapping to the vocabulary table is therefore established for each individual Chinese character rather than for each phrase. A separate mapping is established for each parsed foreign word: unlike Chinese characters, foreign words in webpages (usually English words) acquire meaning only as combinations of letters, so each word receives its own mapping.
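The mapping rule above (one entry per Chinese character, one entry per foreign word) can be sketched as follows. The regular expression, the treatment of digits as part of a word, and the special tokens `[PAD]` and `[UNK]` are assumptions for illustration; the patent does not specify them.

```python
import re

# One token per CJK character, or per run of Latin letters/digits.
TOKEN_PATTERN = re.compile(r"[\u4e00-\u9fff]|[A-Za-z0-9]+")


def tokenize(text):
    """Split text into single Chinese characters and whole foreign words."""
    return TOKEN_PATTERN.findall(text)


def build_vocab(texts, specials=("[PAD]", "[UNK]")):
    """Assign consecutive integer ids to tokens in order of first appearance."""
    vocab = {token: index for index, token in enumerate(specials)}
    for text in texts:
        for token in tokenize(text):
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(text, vocab):
    """Map text to vocabulary ids, using the [UNK] id for unseen tokens."""
    unknown = vocab["[UNK]"]
    return [vocab.get(token, unknown) for token in tokenize(text)]
```

For example, `tokenize("深圳Data科技")` yields four single-character tokens and the whole word `Data`, so no separate Chinese word segmentation step is required.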

After the vocabulary is built, the webpage source is split into individual tokens, and a pre-trained BERT model is fine-tuned to obtain word vectors.

Based on the characteristics of information in Web pages, a multi-layer neural network is used to build the model. The BERT layer and the softmax layer are used for the word-vector representation of page information and for result classification, respectively, and the two layers in between use LSTM to learn the features of the text information across pages. Given the strong capability of LSTM in natural language processing, combining Web information extraction with LSTM brings two additional benefits: it removes the word segmentation step of text processing and the feature engineering step of information extraction. Text in a Web page is typically embedded in HTML tags in small segments, and most segments already carry independent semantics. The model is therefore built directly on the text nodes inside the tags as its units, rather than segmenting all the text in the page into words. In addition, one of the strengths of neural networks is their ability to capture fuzzy relationships between entities, and through such relationships the descriptive relationships among pieces of page information can be established.
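The layer stack described above can be written compactly as a forward pass. This is a sketch under assumptions: the symbols ($x_t$ for token ids, $T$ for sequence length, $d$ for embedding width, $W$, $b$ for the classifier parameters) are introduced here, and classifying from the last hidden state $h^{(2)}_T$ of the second LSTM layer is an assumption, since the patent does not state which hidden state feeds the softmax layer.

```latex
\begin{aligned}
E       &= \mathrm{BERT}(x_1, \dots, x_T) \in \mathbb{R}^{T \times d}
        && \text{(word vectors from the fine-tuned BERT layer)} \\
H^{(1)} &= \mathrm{LSTM}_1(E)
        && \text{(first LSTM layer)} \\
H^{(2)} &= \mathrm{LSTM}_2\bigl(H^{(1)}\bigr)
        && \text{(second LSTM layer)} \\
\hat{y} &= \mathrm{softmax}\bigl(W\, h^{(2)}_T + b\bigr)
        && \text{(class probabilities for the text node)}
\end{aligned}
```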

S5. Deploying the neural network model: perform automatic information extraction on webpages and evaluate the accuracy and recall of the model.

Further, the company page information includes the unified social credit code, type, business scope, address, and the like.

Further, the neural network model is trained with the Nesterov Momentum algorithm as the optimization algorithm; the training set is trained for 100 epochs, and the resulting training loss is less than 0.09. A momentum factor of 0.9 is used when updating all weights. In each training iteration, softmax cross-entropy is used as the classification loss to train every layer of the network. During backpropagation of the error, the Nesterov Momentum algorithm is used to compute the gradients, and the computed gradients are used to update the network parameters. An L2 regularization term is applied to all parameters of the network, with a weight decay factor of 0.0005.
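The training settings above can be illustrated with a small, framework-free sketch of a softmax cross-entropy loss and one Nesterov momentum update with L2 weight decay. The momentum factor 0.9 and decay factor 0.0005 come from the text; the learning rate `lr` is an assumption because the patent does not give one, and the function names are hypothetical.

```python
import math


def softmax_cross_entropy(logits, target):
    """Cross-entropy of softmax(logits) against the class index `target`."""
    top = max(logits)  # subtract the max for numerical stability
    log_sum = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_sum - logits[target]


def nesterov_step(params, velocity, grad_fn, lr=0.01, momentum=0.9,
                  weight_decay=0.0005):
    """Apply one Nesterov momentum update to a flat list of parameters.

    `grad_fn` returns the loss gradient at a given parameter list. The
    gradient is evaluated at the look-ahead point params + momentum * velocity.
    """
    lookahead = [p + momentum * v for p, v in zip(params, velocity)]
    grads = grad_fn(lookahead)
    # L2 regularization contributes weight_decay * parameter to each gradient.
    grads = [g + weight_decay * p for g, p in zip(grads, lookahead)]
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_params = [p + v for p, v in zip(params, new_velocity)]
    return new_params, new_velocity
```

Evaluating the gradient at the look-ahead point, rather than at the current parameters, is what distinguishes Nesterov momentum from classical momentum.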

In testing, the model's recognition accuracy on pages in the training set was above 98%, and its recall was above 97%.
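The patent reports these two figures without defining how they are computed. The sketch below shows one common way to evaluate an extractor, assuming that "accuracy" is meant as precision over extracted fields and that text nodes carrying no target information have the label `other`; both are assumptions, as is the function name.

```python
def precision_recall(predicted, gold, negative="other"):
    """Compute precision and recall over text-node labels.

    A node counts as a true positive when its predicted label equals the
    gold label and that label is not the `negative` (non-target) class.
    """
    true_pos = sum(1 for p, g in zip(predicted, gold)
                   if p == g and g != negative)
    predicted_pos = sum(1 for p in predicted if p != negative)
    gold_pos = sum(1 for g in gold if g != negative)
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / gold_pos if gold_pos else 0.0
    return precision, recall
```

Precision penalizes nodes wrongly extracted as target fields, while recall penalizes target fields that the model missed.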

The above is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any equivalent replacement or modification made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical solution and inventive concept, shall fall within the protection scope of the present invention.

Claims

1. An intelligent webpage information extraction method based on BERT and LSTM, characterized in that the method comprises the following steps:

S1, crawling webpages: crawling the vertical-domain webpages from which information is to be extracted, with at least 3000 company pages as a training set and 500 pages as a test set;

S2, webpage preprocessing: preprocessing the webpages and cleaning out useless HTML tags;

S3, webpage target labeling: automatically labeling webpage targets using manually written rules, in which a person first inspects the HTML tags containing the information to be extracted and then uses XPath to attach the corresponding class label to those tags;

S4, training a neural network model: first building a vocabulary table and establishing the mapping, then splitting the webpage source into individual tokens, and fine-tuning a pre-trained BERT model to obtain word vectors;

S5, deploying the neural network model: performing automatic information extraction on webpages and evaluating the accuracy and recall of the model;

and S6, after training is completed, recognizing target information in input webpages using the obtained model, testing the recognition performance using the 500 pages in the test set, ignoring the labels during recognition, and determining the type of the target information from the content and context of each text node.

2. The intelligent webpage information extraction method based on BERT and LSTM as claimed in claim 1, wherein the company page information includes the unified social credit code, type, business scope, address, and the like.

3. The intelligent webpage information extraction method based on BERT and LSTM as claimed in claim 1, wherein the neural network model is trained with the Nesterov Momentum algorithm as the optimization algorithm, the training set is trained for 100 epochs, and the resulting training loss is less than 0.09.