CN108897749A - Web page information extraction method and system based on syntax tree and text block density

Info

Publication number: CN108897749A
Application number: CN201810355382.8A
Authority: CN
Inventors: 舒琦赟; 汪立东; 刘晓飞; 王慧; 俞晓明; 赵忠华; 刘悦; 王卿; 程学旗
Original assignee: Institute of Computing Technology of CAS; National Computer Network and Information Security Management Center
Current assignee: Institute of Computing Technology of CAS; National Computer Network and Information Security Management Center
Priority date: 2018-04-19
Filing date: 2018-04-19
Publication date: 2018-11-27

Abstract

The invention relates to a web page information extraction method based on syntax trees and text block density, comprising: acquiring the title text information of a web page; setting a screening threshold, calculating the text block density of every node of the web page, taking the nodes whose text block density is greater than the screening threshold as collection nodes, and extracting the node text information of the collection nodes; if the number of collection nodes is 1, extracting the text information of that node as the target information; if the number of collection nodes is greater than 1, converting the title text information and the node text information into a title deep syntax tree and node deep syntax trees, respectively, that uniquely express the semantics of the sentences; and obtaining the overall similarity between each node deep syntax tree and the title deep syntax tree, and extracting the node text information corresponding to the maximum overall similarity as the target information.

Description

Web page information extraction method and system based on syntax tree and text block density

Technical field

The invention belongs to the field of network data collection, and in particular relates to a method and system for extracting web page information based on syntax-tree semantic recognition and the tag text block density ratio.

Background

With the rapid development of information technology, data is increasingly digital, and fast, effective network data collection has become particularly important. Effective network data collection is a prerequisite for enterprises to analyze the market environment and customer needs, and enterprises with efficient data collection capabilities are highly competitive in the era of big data. Efficient data collection technology also bears on national political security. As information technology matures, information in many forms spreads rapidly across the Internet, the population of Internet users keeps growing, public opinion shifts quickly, and ideas and speech spread at a speed that is hard to rein in, which poses new challenges for the management of public opinion. Under these circumstances, efficient collection of network information becomes especially important: it provides network research departments with the network data they need and bears on national political security.

Driven by this demand, network data collection techniques have proliferated in recent years, with collection techniques for different kinds of web page data emerging one after another. One technical difficulty that remains hard to solve is extracting short body text from web pages. When the body text of a web page is short, identifying the main content becomes harder: compared with a page with long body text, a short body is less distinguishable from useless information on the page such as advertisements ("noise"). During web page filtering it is therefore more likely to be mistaken for junk and filtered out, while some advertising content is wrongly extracted as the body text.

Summary of the invention

In view of the above problems, the present invention proposes a web page information extraction method based on syntax trees and text block density, comprising: obtaining the title text information of a web page through a regular expression; setting a screening threshold, calculating the text block density of all nodes of the web page, taking the nodes whose text block density is greater than the screening threshold as collection nodes, and extracting the node text information of the collection nodes; if the number of collection nodes is 1, extracting that node's text information as the target information; if the total number of collection nodes is greater than 1, converting the title text information and the node text information into a title syntax tree and node syntax trees, respectively, through a probabilistic context-free model; converting the title syntax tree and the node syntax trees, through a synchronous tree substitution grammar, into a title deep syntax tree and node deep syntax trees that uniquely express sentence semantics; and calculating the overall similarity between the title deep syntax tree and each node deep syntax tree, and extracting the node text information corresponding to the maximum overall similarity as the target information.

In the web page information extraction method of the present invention, the text block density is obtained as follows:

$$TBD(v) = \sum_{v_i \in v.children} \frac{CN_{v_i} - LCN_{v_i}}{TN_{v_i} - LTN_{v_i}}$$

where $TBD(v)$ is the text block density of node $v$, $v.children$ is the set of child nodes of node $v$, $v_i$ is a child node of node $v$, $CN_{v_i}$ is the number of text characters contained in the text block of child node $v_i$, $LCN_{v_i}$ is the number of hyperlink characters contained in the text block of child node $v_i$, $TN_{v_i}$ is the number of tags contained in the text block of child node $v_i$, and $LTN_{v_i}$ is the number of hyperlink tags contained in the text block of child node $v_i$.

In the web page information extraction method of the present invention, when the total number of collection nodes is greater than 1, the following steps are carried out:

Extract the title word vector $t_i$ of the title deep syntax tree and the text word vector $a_i$ of a node deep syntax tree that has the same structure as the title deep syntax tree. From the word vector similarity between $t_i$ and $a_i$,

$$S_i = \frac{t_i \cdot a_i}{\lVert t_i \rVert \, \lVert a_i \rVert},$$

the overall similarity $S = S_1 \cdot S_2 \cdot S_3 \cdots S_n$ is obtained, where $0 < i \le n$, $n$ is a positive integer, and $n$ is the number of nodes of the title deep syntax tree.

The present invention also relates to a web page information extraction system based on syntax trees and text block density, comprising:

a text information acquisition module, configured to obtain the title text information of a web page through a regular expression and to obtain the node text information of the collection nodes, which includes setting a screening threshold, calculating the text block density of all nodes of the web page, taking the nodes whose text block density is greater than the screening threshold as collection nodes, and extracting the node text information of the collection nodes;

a first target information acquisition module, configured to obtain the target information when the number of pieces of node text information is 1;

a second target information acquisition module, configured to obtain the target information for extraction when the number of pieces of node text information is greater than 1, wherein the title text information and the node text information are converted into a title syntax tree and node syntax trees, respectively, through a probabilistic context-free model; the title syntax tree and the node syntax trees are converted, through a synchronous tree substitution grammar, into a title deep syntax tree and node deep syntax trees that uniquely express sentence semantics; the overall similarity between the title deep syntax tree and each node deep syntax tree is obtained; and the node text information corresponding to the maximum overall similarity is extracted.

In the web page information extraction system of the present invention, in the text information acquisition module, the text block density is obtained as follows:

$$TBD(v) = \sum_{v_i \in v.children} \frac{CN_{v_i} - LCN_{v_i}}{TN_{v_i} - LTN_{v_i}}$$

In the web page information extraction system of the present invention, the second target information acquisition module specifically includes:

a word vector acquisition module, configured to obtain the title word vector $t_i$ of the title deep syntax tree and the text word vector $a_i$ of a node deep syntax tree that has the same structure as the title deep syntax tree;

a similarity acquisition module, configured to obtain, from the word vector similarity $S_i = \frac{t_i \cdot a_i}{\lVert t_i \rVert \, \lVert a_i \rVert}$ between the title word vector $t_i$ and the text word vector $a_i$, the overall similarity $S = S_1 \cdot S_2 \cdot S_3 \cdots S_n$, where $0 < i \le n$, $n$ is a positive integer, and $n$ is the number of nodes of the title deep syntax tree.

Description of drawings

FIG. 1 is a flowchart of a method for extracting web page information according to an embodiment of the present invention.

Detailed description

To make the purpose, technical solution, and advantages of the present invention clearer, the web page information extraction method and system based on syntax trees and text block density proposed by the present invention are described in further detail below with reference to the accompanying drawing. It should be understood that the specific implementations described here only explain the present invention and do not limit it.

The purpose of the present invention is to remedy the deficiencies of the prior art by proposing a data collection method that handles web pages whose body texts vary in length. The result selection of the traditional data collection algorithm based on the tag text density ratio is adjusted: instead of selecting a single target tag, multiple target tags with similar density ratios are selected, so that a short text body is no longer wrongly filtered out. The news title, which is easy to obtain, can be collected with high accuracy, and summarizes the semantics of the text body, is taken as the object against which the semantics of the text body is compared, giving the semantic matching of web page body text a reference standard. The candidate texts and the article title are parsed into syntax trees in preparation for the subsequent semantic matching. All constructed syntax trees are then transformed: the principal components are extracted, keywords and structures that carry the key meaning (subject, predicate, object, and so on) are retained, and sentences with different structures but the same semantics are unified in form, in preparation for comparing text semantics. Finally, the preprocessed text body syntax trees are matched as whole trees against the news title syntax tree, and semantics is used to identify which short text is the collection target.

FIG. 1 is a flowchart of a method for extracting web page information according to an embodiment of the present invention. As shown in FIG. 1, the steps of the web page information extraction method of the present invention are as follows:

Step S1: obtain the title text information of the web page;

Step S2: run the web page tag text density algorithm;

Step S3: set a screening threshold, select the collection nodes, and extract the node text information of the collection nodes;

Step S4: determine the number of collection nodes;

Step S5: if there is only one collection node, take the node text information of that collection node as the target information and extract it;

Step S6: if the number of collection nodes is greater than 1, process the title text information and the node text information separately to generate a title syntax tree and node syntax trees;

Step S7: normalize all syntax trees to generate a title deep syntax tree and node deep syntax trees;

Step S8: calculate the overall similarity between the title deep syntax tree and each node deep syntax tree;

Step S9: select the node text information corresponding to the maximum overall similarity as the target information and extract it.

Specifically, in the embodiment of the present invention, regular expressions dedicated to recognizing data with a fixed position and uniform form, such as the article title, author, and publication time, are first matched against the page to obtain the title and related information of the article.
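As a rough illustration of the title-matching step, the sketch below pulls a title out of raw HTML with a regular expression. The source does not give its patterns; the choice of the `<h1>` and `<title>` elements, the preference for `<h1>`, and the names `TITLE_RE` and `extract_title` are all assumptions made for this example.

```python
import re
from html import unescape
from typing import Optional

# Hypothetical pattern: assumes the article title sits in an <h1> or <title> element.
TITLE_RE = re.compile(r"<(title|h1)\b[^>]*>(.*?)</\1\s*>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")

def extract_title(html: str) -> Optional[str]:
    """Return the text of the first <h1>, falling back to <title>; None if neither exists."""
    found = {}
    for match in TITLE_RE.finditer(html):
        name = match.group(1).lower()
        # Drop inline markup such as <b>...</b> and decode entities.
        text = unescape(TAG_RE.sub("", match.group(2))).strip()
        if text and name not in found:
            found[name] = text
    return found.get("h1") or found.get("title")

page = ("<html><head><title>Site - News</title></head>"
        "<body><h1 class='t'>Flood <b>warning</b> issued</h1></body></html>")
print(extract_title(page))
```

A production system would likely keep one pattern per site template, since the text describes the title as data with a fixed position and uniform form.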

Next, the text block density TBD of each node is calculated according to formula (1):

$$TBD(v) = \sum_{v_i \in v.children} \frac{CN_{v_i} - LCN_{v_i}}{TN_{v_i} - LTN_{v_i}} \qquad (1)$$

where TBD (text block density) is the text block density. Let v be a node in the web page parse tree T and Blk(v) the text block rooted at node v. The text block density TBD(v) of node v is defined as the sum, over the text blocks rooted at each child node of v, of the ratio of the number of non-link text characters to the number of non-link tags.

CN (Content Number) is the character count of a text block, that is, the number of text characters the text block contains. Text under a body text block is usually concentrated, so its character count is relatively large; text under a noise text block is scattered, so its character count is relatively small.

LCN (Link Content Number) is the hyperlink character count of a text block, that is, the number of hyperlink characters the text block contains. A body text block contains little hyperlink text, whereas a noise text block contains a lot.

TN (Tag Number) is the tag count of a text block, that is, the number of tags the text block contains. A body text block holds mostly continuous text and few tags; a noise text block holds scattered text and many tags.

LTN (Link Tag Number) is the hyperlink tag count of a text block, that is, the number of hyperlink tags the text block contains. Text under hyperlink tags is mostly noise; a body text block contains few hyperlink tags, whereas a noise text block contains many.
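The four counts and the density they feed can be sketched with the Python standard library alone. This is an illustrative reading of the definitions above, not the patent's implementation: it assumes the tag counts TN and LTN include the root tag of each child block, that `<a>` is the only hyperlink tag, and that a zero denominator is clamped to 1. The names `Node`, `TreeBuilder`, `counts`, `tbd`, and `parse` are hypothetical.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser

VOID = {"br", "img", "hr", "meta", "link", "input"}  # elements with no closing tag

@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)
    text_len: int = 0  # characters of text directly inside this element

class TreeBuilder(HTMLParser):
    """Builds a minimal parse tree under an artificial '#root' node."""
    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; stray closing tags are ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1].text_len += len(data.strip())

def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root

def counts(node, in_link=False):
    """Return (CN, LCN, TN, LTN) for the text block rooted at node."""
    is_link = node.tag == "a"
    in_link = in_link or is_link
    cn = node.text_len
    lcn = node.text_len if in_link else 0
    tn, ltn = 1, int(is_link)  # assumption: the block's own root tag is counted
    for child in node.children:
        c, lc, t, lt = counts(child, in_link)
        cn, lcn, tn, ltn = cn + c, lcn + lc, tn + t, ltn + lt
    return cn, lcn, tn, ltn

def tbd(node):
    """Sum over child blocks of non-link characters / non-link tags."""
    total = 0.0
    for child in node.children:
        cn, lcn, tn, ltn = counts(child)
        total += (cn - lcn) / max(1, tn - ltn)  # assumption: clamp a zero denominator
    return total

root = parse("<div><p>Hello world</p><p>More text here</p></div>"
             "<div><a href='#'>Home</a><a href='#'>About</a></div>")
main, nav = root.children
print(tbd(main), tbd(nav))
```

In this toy page the paragraph block scores far higher than the navigation block, whose characters all sit inside links, which matches the behaviour the definitions describe.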

Once the text block density TBD of every node has been obtained, a screening threshold is set, all nodes whose TBD value is greater than the threshold are designated collection nodes, and the text information in all of these collection nodes is extracted.

If the threshold yields exactly one collection node, that is, the obtained text information is unique, the node text information of that collection node is extracted as the target information. If the threshold yields more than one collection node, that is, the obtained text information is not unique, the title text information and the node text information of the collection nodes must be processed further so that the node text information most similar to the title text information can be extracted as the target information.
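The branching on the number of collection nodes might look like the sketch below. The function names are hypothetical, the `None` result for zero collection nodes is an assumption (the source does not describe that case), and `word_overlap` is only a stand-in scorer so the example runs; it is not the deep-syntax-tree matching the patent describes.

```python
from typing import Callable, List, Optional, Tuple

def select_target(title: str,
                  nodes: List[Tuple[str, float]],
                  threshold: float,
                  similarity: Callable[[str, str], float]) -> Optional[str]:
    """nodes holds (node_text, text_block_density) pairs."""
    collected = [text for text, density in nodes if density > threshold]
    if not collected:
        return None  # assumption: the source does not cover this case
    if len(collected) == 1:
        return collected[0]  # unique collection node: extract it directly
    # Several collection nodes: keep the one most similar to the title.
    return max(collected, key=lambda text: similarity(title, text))

def word_overlap(a: str, b: str) -> float:
    """Stand-in scorer (Jaccard overlap of lowercase words), not the patent's tree matching."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

nodes = [("Buy cheap phones now", 12.0),
         ("City issues flood warning", 9.5),
         ("Home About", 0.5)]
print(select_target("Flood warning issued for city", nodes, 5.0, word_overlap))
```

With the threshold raised to 10.0 only the advertisement survives and is returned, which is the single-winner failure on short body text that the multi-candidate selection is meant to avoid.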

In the embodiment of the present invention, a probabilistic context-free grammar (PCFG) model is used to preprocess the title text information and the node text information; that is, the PCFG model parses them and generates the title syntax tree and the node syntax trees, respectively. PCFG is a commonly used model for natural-language syntactic parsing. Its parsing algorithm is the same as that of a non-probabilistic context-free grammar: both expand from nonterminal symbols. In addition, the PCFG computes a probability for each distinct parse tree. When a sentence is ambiguous, these probabilities decide which parse to select, the criterion being the highest generation probability. Let T be a candidate tree; when the sentence is ambiguous, its parse T* is selected by probability, namely:

$$T^{*} = \arg\max_{T} P(T)$$

The generation probability of a candidate tree T is the product of the conditional probabilities of all the rules needed to generate T:

$$P(T) = \prod_{r \in T} P(r)$$

where r is a rule and P(r) is the probability of that rule.
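To make the two formulas concrete, the following sketch scores candidate parse trees by multiplying rule probabilities and keeps the most probable one. Trees are nested tuples `(label, child, ...)` with plain strings as leaves; this representation, the toy grammar, and its probabilities are all invented for illustration and do not come from the source.

```python
from math import prod

def rules_of(tree):
    """Yield (lhs, rhs) for every rule application in a nested-tuple parse tree."""
    label, *kids = tree
    rhs = tuple(kid if isinstance(kid, str) else kid[0] for kid in kids)
    yield (label, rhs)
    for kid in kids:
        if not isinstance(kid, str):
            yield from rules_of(kid)

def tree_prob(tree, rule_probs):
    """P(T): product of the probabilities of all rules used in T."""
    return prod(rule_probs.get(rule, 0.0) for rule in rules_of(tree))

def best_parse(trees, rule_probs):
    """T*: the candidate tree with the highest generation probability."""
    return max(trees, key=lambda tree: tree_prob(tree, rule_probs))

# Toy grammar with made-up probabilities (prepositional-phrase attachment ambiguity).
rule_probs = {
    ("VP", ("V", "NP", "PP")): 0.25,
    ("VP", ("V", "NP")): 0.75,
    ("NP", ("NP", "PP")): 0.25,
    ("NP", ("man",)): 0.75,
    ("V", ("saw",)): 1.0,
    ("PP", ("with_telescope",)): 1.0,
}
attach_to_verb = ("VP", ("V", "saw"), ("NP", "man"), ("PP", "with_telescope"))
attach_to_noun = ("VP", ("V", "saw"), ("NP", ("NP", "man"), ("PP", "with_telescope")))

print(tree_prob(attach_to_verb, rule_probs), tree_prob(attach_to_noun, rule_probs))
```

A real parser would not enumerate candidate trees explicitly; chart algorithms such as CKY find the maximum-probability tree directly, but the selection criterion is the same.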

As a mature natural-language parsing model, PCFG has some ability to resolve ambiguity and generates syntax trees with high accuracy. Moreover, because of the Markov property of the model, it does not take the surrounding context into account and is therefore insensitive to data sparsity, which gives its parsing results a degree of robustness.

The next step processes all the syntax trees generated above (the title syntax tree and the node syntax trees). The embodiment of the present invention uses a synchronous tree substitution grammar (STSG) to convert the title syntax trees and node syntax trees into title deep syntax trees and node deep syntax trees, respectively.

The transformational grammar referred to here is a theory about the relationship between the syntax of a sentence and its underlying semantics. It holds that every natural-language sentence has two structures, a deep one and a surface one. The surface structure is the text visible in the document, that is, the actual sequence of words. The deep structure differs from the surface structure and is what actually determines the meaning of the sentence. Multiple sentences with the same meaning but different surface structures correspond to the same deep structure.

For example: "My lunch today is a hamburger."

"I had a hamburger for lunch today."

Although these two sentences differ in structure, their underlying deep structure is exactly the same, because they express the same meaning.

STSG is a rule self-learning algorithm based on syntax trees: it learns grammar rules on its own from syntax trees. It converts the surface structure of a sentence into its deep structure, so that sentences with different syntax but the same semantics produce the same sentence syntax tree.

The basic STSG rule extraction algorithm is as follows:

Input: a syntax tree pair <T(f), T(e), A>, where A is the alignment between T(f) and T(e).

Create an empty basic rule set P.

t(p) is the subtree of T(f) rooted at p;

t(q) is the subtree of T(e) rooted at q;

A(t(p), t(q)) is the set of word alignments in A that involve t(p) and t(q).

If <t(p), t(q), A(t(p), t(q))> satisfies the word alignment restriction and the syntactic restriction,

then add <t(p), t(q), A(t(p), t(q))> to the rule set P.

Output: the basic rule set P.
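The extraction loop can be sketched as below, under several assumptions that go beyond the text. The source names a "word alignment restriction" and a "syntactic restriction" without defining them. Here the former is read as the usual consistency condition (every alignment link touching either subtree must fall entirely inside the pair, and at least one link must exist), the syntactic restriction is omitted, and every pair of internal nodes (p, q) is tried. Trees are nested tuples with word strings as leaves, and the alignment is a set of (source leaf index, target leaf index) pairs.

```python
def spans(tree, start=0, out=None):
    """Collect (subtree, frozenset of leaf indices) for every internal node, children first."""
    if out is None:
        out = []
    pos = start
    for kid in tree[1:]:
        if isinstance(kid, str):
            pos += 1  # a leaf word occupies one position
        else:
            pos = spans(kid, pos, out)[1]
    out.append((tree, frozenset(range(start, pos))))
    return out, pos

def extract_rules(tree_f, tree_e, alignment):
    """Return the basic rule set P as a list of (t(p), t(q), A(t(p), t(q))) triples."""
    rules = []
    spans_e = spans(tree_e)[0]
    for sub_f, span_f in spans(tree_f)[0]:
        for sub_e, span_e in spans_e:
            # Links that involve either subtree.
            links = {(i, j) for i, j in alignment if i in span_f or j in span_e}
            # Assumed word-alignment restriction: no link crosses the pair boundary.
            if links and all(i in span_f and j in span_e for i, j in links):
                rules.append((sub_f, sub_e, links))
    return rules

tree_f = ("S", ("NP", "a"), ("VP", "b"))   # leaves: a=0, b=1
tree_e = ("S", ("VP", "y"), ("NP", "x"))   # leaves: y=0, x=1
alignment = {(0, 1), (1, 0)}               # a<->x, b<->y
for sub_f, sub_e, links in extract_rules(tree_f, tree_e, alignment):
    print(sub_f[0], "<->", sub_e[0], sorted(links))
```

In this toy pair the two phrases swap order between the trees, and the extracted rules pair NP with NP, VP with VP, and S with S, which is the kind of correspondence that lets differently ordered sentences map to one form.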

The trained STSG algorithm is used to standardize all sentences corresponding to the title syntax tree and the node syntax trees, converting them into a title deep syntax tree and node deep syntax trees that uniquely express sentence semantics. When the text inside a single tag produces several syntax trees, each is standardized in turn and attributed to that tag.

A word vector is a language model for natural language processing. Its core idea is to map the characters or words of a language to high-dimensional vectors according to various semantic criteria. Every dimension of such a vector is a real number, and the relationships between word vectors make the abstract semantic relationships between words concrete, so that a computer can approximate abstract semantic relationships through explicit calculation. The directional similarity between word vectors also reflects the semantic similarity between words.
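Directional similarity between two word vectors is commonly measured with cosine similarity, which is also what the matching step later in the text uses. A minimal sketch follows; the two-dimensional vectors are toy values, since real word vectors come from a trained model and have many more dimensions, and returning 0.0 for a zero vector is a choice made here.

```python
import math

def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (|u| |v|); returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy vectors, invented for the example.
same_direction = cosine_similarity([3.0, 4.0], [6.0, 8.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```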

After the title deep syntax tree and the node deep syntax trees are obtained, the semantic similarity between syntax trees is compared using trained word vectors. All title deep syntax trees are matched against the node deep syntax trees that come from the same text.

The matching uses preorder traversal of the trees. Here T is the title deep syntax tree, and trees $A_1, A_2, A_3, \ldots, A_m$ are the m syntax trees, derived from the deep syntax trees of text $L_i$, that are to be matched against T; m is the number of candidate texts, m and i are positive integers, and $0 < i \le m$. The m trees are matched in order. T and $A_i$ are traversed in preorder synchronously. If T and $A_i$ have the same structure, the similarity of the two trees is calculated; otherwise $A_i$ is skipped. When calculating the similarity, for nodes $a \in T$ and $b \in A_i$ at the same position, let their word vectors be $t_i$ and $a_i$, and compute the cosine similarity of $t_i$ and $a_i$ to obtain the word vector similarity $S_i$. If T has n nodes in total, with $0 < i \le n$, the overall similarity between tree $A_i$ and tree T is

$$S = \prod_{i=1}^{n} S_i = S_1 \cdot S_2 \cdot S_3 \cdots S_n$$

Finally, the $A_i$ with the maximum similarity to T is taken as the collection target, and its node text information is extracted as the target information.

The specific algorithm is as follows:

Input: the title tree T and the set of trees $\{A_1, A_2, A_3, \ldots, A_n\}$ corresponding to text $L_i$.

1. [Step content missing from the source text.]

2. Traverse $A_i$ and T synchronously in preorder, where $t_i$ and $a_i$ are the word vectors of the nodes currently visited and $S_i$ is the semantic similarity of the current node pair; initialize S = 1.

3. If neither $t_i$ nor $a_i$ is empty, compute $S_i = \frac{t_i \cdot a_i}{\lVert t_i \rVert \, \lVert a_i \rVert}$ and update $S = S \cdot S_i$; otherwise skip $A_i$ and go to step 1.

4. Obtain the S corresponding to each tree $A_i$ above.

Output: the overall similarity between T and $L_i$.
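The matching procedure can be sketched as follows. The representation is an assumption: each deep syntax tree is a pair `(word, [children])` and `vectors` maps every word to a toy word vector. A candidate whose shape differs from the title tree is skipped by returning `None`. Words missing from `vectors` would raise a `KeyError` here; the source does not say how such words are handled.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def tree_similarity(title_tree, candidate, vectors):
    """Synchronous preorder walk; S is the product of node-pair cosine similarities."""
    stack = [(title_tree, candidate)]
    s = 1.0
    while stack:
        (t_word, t_kids), (a_word, a_kids) = stack.pop()
        if len(t_kids) != len(a_kids):
            return None  # structures differ: skip this candidate
        s *= cosine(vectors[t_word], vectors[a_word])
        # Push children right to left so the leftmost pair is visited next (preorder).
        stack.extend(zip(reversed(t_kids), reversed(a_kids)))
    return s

def best_match(title_tree, candidates, vectors):
    """Index of the candidate with maximum overall similarity, or None if all are skipped."""
    scored = [(tree_similarity(title_tree, tree, vectors), i)
              for i, tree in enumerate(candidates)]
    scored = [(s, i) for s, i in scored if s is not None]
    return max(scored)[1] if scored else None

# Toy word vectors and trees, invented for the example.
vectors = {"eat": [1, 0], "I": [0, 1], "hamburger": [1, 1],
           "buy": [1, 1], "you": [0, 1], "phone": [1, 0]}
title = ("eat", [("I", []), ("hamburger", [])])
candidates = [
    ("buy", [("you", []), ("phone", [])]),    # same shape, different words
    ("eat", [("I", []), ("hamburger", [])]),  # same shape, same words
    ("eat", [("I", [])]),                     # different shape, skipped
]
print(best_match(title, candidates, vectors))
```

Because S is a product, a single node pair with low similarity pulls the whole score down, so a candidate has to agree with the title at every position to score well.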

After the overall similarities have been calculated, the node text information with the maximum overall similarity to the title text information is selected as the final target text and extracted.

The present invention uses the title, which is easy to collect through template matching, can be collected with high accuracy, and closely summarizes the semantics of the text body, as the standard for semantic matching. Multiple short texts that are candidates for the text body are semantically matched against it, and the text with the highest degree of semantic match is taken as the collection target. When information such as web page tags is not enough to select the collection target precisely, the semantics of the candidate short texts are used as the screening information. This provides a semantics-based web page data collection method and overcomes the limitation of earlier collection methods, which could not use the semantics of the collected content itself for identification.

Claims

1. A web page information extraction method based on syntax tree and text block density, characterized by comprising:

obtaining the title text information of a web page; setting a screening threshold, calculating the text block density of all nodes of the web page, taking the nodes whose text block density is greater than the screening threshold as collection nodes, and extracting the node text information of the collection nodes;

if the number of collection nodes is 1, extracting the text information of that node as the target information;

if the number of collection nodes is greater than 1, converting the title text information and the node text information into a title deep syntax tree and node deep syntax trees, respectively, that uniquely express sentence semantics; obtaining the overall similarity between each node deep syntax tree and the title deep syntax tree; and extracting the node text information corresponding to the maximum overall similarity as the target information.

2. The method for extracting web page information according to claim 1, characterized in that the title text information is obtained through a regular expression.

3. The method for extracting web page information as claimed in claim 1, wherein the text block density is obtained as follows:

$$TBD(v) = \sum_{v_i \in v.children} \frac{CN_{v_i} - LCN_{v_i}}{TN_{v_i} - LTN_{v_i}}$$

where $TBD(v)$ is the text block density of node $v$, $v.children$ is the set of child nodes of node $v$, $v_i$ is a child node of node $v$, $CN_{v_i}$ is the number of text characters contained in the text block of child node $v_i$, $LCN_{v_i}$ is the number of hyperlink characters contained in the text block of child node $v_i$, $TN_{v_i}$ is the number of tags contained in the text block of child node $v_i$, and $LTN_{v_i}$ is the number of hyperlink tags contained in the text block of child node $v_i$.

4. The method for extracting web page information as claimed in claim 1, wherein the title text information and the node text information are converted into a title syntax tree and a node syntax tree, respectively, by using a probabilistic context-free model, and a synchronous tree substitution grammar is used to convert the title syntax tree and the node syntax tree into the title deep syntax tree and the node deep syntax tree, respectively.

5. The method for extracting webpage information as claimed in claim 1, wherein the overall similarity is obtained by the following steps:

extracting the title word vector $t_i$ of the title deep syntax tree, and the text word vector $a_i$ of the node deep syntax tree having the same structure as the title deep syntax tree;

obtaining, from the word vector similarity $S_i = \frac{t_i \cdot a_i}{\lVert t_i \rVert \, \lVert a_i \rVert}$ between the title word vector $t_i$ and the text word vector $a_i$, the overall similarity $S = S_1 \cdot S_2 \cdot S_3 \cdots S_n$;

Where 0<i≤n, n is a positive integer, and n is the number of deep syntax tree nodes of the title.

6. A web page information extraction system based on syntax tree and text block density, characterized in that it comprises:

a text information acquisition module, configured to obtain the title text information of a web page and the node text information of the collection nodes, which includes setting a screening threshold, calculating the text block density of all nodes of the web page, taking the nodes whose text block density is greater than the screening threshold as collection nodes, and extracting the node text information of the collection nodes;

a first target information acquisition module, configured to obtain the target information for extraction when the number of pieces of node text information is 1;

a second target information acquisition module, configured to obtain the target information for extraction when the number of pieces of node text information is greater than 1, wherein the title text information and the node text information are converted into a title deep syntax tree and node deep syntax trees, respectively, that uniquely express sentence semantics; the overall similarity between each node deep syntax tree and the title deep syntax tree is obtained; and the node text information corresponding to the maximum overall similarity is taken as the target information.

7. The web page information extraction system according to claim 6, wherein, in the text information acquisition module, the title text information is acquired through a regular expression.

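8. The web page information extraction system according to claim 6, wherein, in the text information acquisition module, the text block density is obtained as follows:

$$TBD(v) = \sum_{v_i \in v.children} \frac{CN_{v_i} - LCN_{v_i}}{TN_{v_i} - LTN_{v_i}}$$

where $TBD(v)$ is the text block density of node $v$, $v.children$ is the set of child nodes of node $v$, $v_i$ is a child node of node $v$, $CN_{v_i}$ is the number of text characters contained in the text block of child node $v_i$, $LCN_{v_i}$ is the number of hyperlink characters contained in the text block of child node $v_i$, $TN_{v_i}$ is the number of tags contained in the text block of child node $v_i$, and $LTN_{v_i}$ is the number of hyperlink tags contained in the text block of child node $v_i$.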

9. The web page information extraction system as claimed in claim 6, wherein, in the second target information acquisition module, a probabilistic context-free model is used to convert the title text information and the node text information into a title syntax tree and a node syntax tree, respectively, and a synchronous tree substitution grammar is used to convert the title syntax tree and the node syntax tree into the title deep syntax tree and the node deep syntax tree, respectively.

10. The web page information extraction system according to claim 6, wherein the second target information acquisition module further comprises:

a word vector acquisition module, configured to obtain the title word vector $t_i$ of the title deep syntax tree, and the text word vector $a_i$ of the node deep syntax tree having the same structure as the title deep syntax tree;

a similarity acquisition module, configured to obtain, from the word vector similarity $S_i = \frac{t_i \cdot a_i}{\lVert t_i \rVert \, \lVert a_i \rVert}$ between the title word vector $t_i$ and the text word vector $a_i$, the overall similarity $S = S_1 \cdot S_2 \cdot S_3 \cdots S_n$, where $0 < i \le n$, $n$ is a positive integer, and $n$ is the number of nodes of the title deep syntax tree.