CN104317903A

CN104317903A - Chapter type text chapter integrity identification method and device

Info

Publication number: CN104317903A
Application number: CN201410578534.2A
Authority: CN
Inventors: 魏少俊; 郑燕琴
Original assignee: Beijing Qihoo Technology Co Ltd; Qizhi Software Beijing Co Ltd
Current assignee: Beijing Qihoo Technology Co Ltd
Priority date: 2014-10-24
Filing date: 2014-10-24
Publication date: 2015-01-28
Anticipated expiration: 2034-10-24
Also published as: CN104317903B

Abstract

The present invention provides a method and device for identifying the chapter integrity of a chapter-style text. The method includes: identifying, from each of multiple sites, a catalog page and multiple content pages of the chapter-style text, wherein each site corresponds to one catalog page and each catalog page corresponds to multiple content pages; determining, according to the multiple content pages corresponding to each catalog page, the set of catalog pages of the chapter-style text on different sites; and analyzing each catalog page in the catalog page set and/or the content pages corresponding to each catalog page, and identifying the chapter integrity of each catalog page in the catalog page set according to the analysis result. The technical solution provided by the invention can flexibly and quickly identify the chapter integrity of a chapter-style text, and the identification result is accurate and objective.

Description

Method and device for identifying chapter integrity of chapter text

Technical field

The invention relates to the field of Internet technology, and in particular to a method and a device for identifying the chapter integrity of a chapter-style text.

Background

With the growing popularity of computers and computer networks, the Internet has reached into every area of people's work, study and life, and has become an important channel for publishing and obtaining information.

At present, chapter-style texts exist in large numbers on the Internet, and the same text may be widely reposted by different websites. Because reposting is affected by various objective factors, the content of the text may be incomplete, or even false, on some websites. Take novels as an example: reading novels is a strong demand among Internet users, and accounts for a considerable share of demand on mobile devices in particular. Novel websites are numerous but vary in quality. The same online novel is widely reposted by different websites, but on some of them its content may be incomplete (for example, missing chapters) or even false (fake chapters patched together). When indexing these novel sites, a search engine needs to judge the chapter integrity of the novel so that it can, as far as possible, present users with sites whose content is complete, improving both the quality of the novel content users obtain and the user experience.

In the related art, chapter integrity is judged by manually configuring templates for different novel sites. Although this approach is highly accurate, its drawbacks are also obvious: the number of websites that manual effort can cover is limited, the approach is not intelligent, and it responds slowly to changes in website templates. How to identify the chapter integrity of a chapter-style text flexibly, quickly and accurately has therefore become a technical problem in urgent need of a solution.

Summary of the invention

In view of the above problems, the present invention is proposed to provide a method for identifying the chapter integrity of a chapter-style text, and a corresponding device, that overcome the above problems or at least partially solve them.

According to one aspect of the present invention, a method for identifying the chapter integrity of a chapter-style text is provided, including: identifying, from each of multiple sites, a catalog page and multiple content pages of the chapter-style text, wherein each site corresponds to one catalog page and each catalog page corresponds to multiple content pages; determining, according to the multiple content pages corresponding to each catalog page, the set of catalog pages of the chapter-style text on different sites; and analyzing each catalog page in the catalog page set and/or the content pages corresponding to each catalog page, and identifying the chapter integrity of each catalog page in the catalog page set according to the analysis result.

Optionally, determining the set of catalog pages of the chapter-style text on different sites according to the multiple content pages corresponding to each catalog page includes: calculating the intersection between the content pages corresponding to every two catalog pages, and taking it as the intersection of those two catalog pages; and determining, according to the intersection of every two catalog pages, the set of catalog pages of the chapter-style text on different sites.

Optionally, calculating the intersection between the content pages corresponding to every two catalog pages includes: extracting the text feature vector of each content page among the content pages corresponding to the multiple catalog pages; clustering the content pages that have the same text feature vector to generate multiple content page groups; and calculating the intersection between the content pages corresponding to every two catalog pages according to the multiple content page groups and the mapping between each catalog page and its corresponding content pages.

Optionally, determining the set of catalog pages of the chapter-style text on different sites according to the intersection of every two catalog pages includes: merging every two catalog pages whose intersection contains a number of elements greater than or equal to a preset threshold, to obtain a merged result; and taking the merged result as the set of catalog pages of the chapter-style text on different sites.

Optionally, analyzing each catalog page in the catalog page set and/or the content pages corresponding to each catalog page, and identifying the chapter integrity of each catalog page in the catalog page set according to the analysis result, includes: calculating the average number of elements in the intersection of every two catalog pages in the catalog page set; and, if the number of elements in the intersection of a given catalog page with each of multiple other catalog pages in the catalog page set is less than the average, determining that the chapters corresponding to that catalog page are incomplete.

Optionally, analyzing each catalog page in the catalog page set and/or the content pages corresponding to each catalog page, and identifying the chapter integrity of each catalog page in the catalog page set according to the analysis result, includes: if the content pages corresponding to a given catalog page include the content pages corresponding to multiple other catalog pages in the catalog page set, and the content pages corresponding to that catalog page also include other content pages, determining that those other content pages are content pages of the latest chapters and that the catalog page has the ability to keep contributing new chapters.

Optionally, analyzing each catalog page in the catalog page set and/or the content pages corresponding to each catalog page, and identifying the chapter integrity of each catalog page in the catalog page set according to the analysis result, includes: if a content page corresponding to a given catalog page does not appear among the content pages corresponding to the other catalog pages in the catalog page set, and the length of that content page falls outside the range associated with the average length of the content pages corresponding to that catalog page, determining that the content page is a fake content page.

Optionally, analyzing each catalog page in the catalog page set and/or the content pages corresponding to each catalog page, and identifying the chapter integrity of each catalog page in the catalog page set according to the analysis result, includes: if a content page corresponding to a given catalog page does not appear among the content pages corresponding to the other catalog pages in the catalog page set, the length of that content page falls within the range associated with the average length of the content pages corresponding to that catalog page, and the catalog page does not have the ability to keep contributing new chapters, determining that the content page is a fake content page.
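The two fake-page conditions above can be combined into a single check. The following is a minimal sketch, not the patented implementation: the function name `is_fake_page`, its parameters, and the `tolerance` value that defines the "range associated with the average length" are all hypothetical, since the text does not specify how that range is chosen.

```python
def is_fake_page(vector, length, own_lengths, other_vectors,
                 contributes_new, tolerance=0.5):
    """Return True if a content page looks fake under the two rules above.

    vector          -- text feature vector (any hashable label) of the page
    length          -- length of the page's text
    own_lengths     -- lengths of all content pages of the same catalog page
    other_vectors   -- feature vectors seen under the other catalog pages
    contributes_new -- whether the catalog page can contribute new chapters
    tolerance       -- ASSUMPTION: accepted range is mean * (1 +/- tolerance)
    """
    # A page that also exists on other sites is not suspicious here.
    if vector in other_vectors:
        return False
    mean = sum(own_lengths) / len(own_lengths)
    in_range = (1 - tolerance) * mean <= length <= (1 + tolerance) * mean
    # Rule 1: unique page with an abnormal length.
    if not in_range:
        return True
    # Rule 2: unique page with a normal length, but the site is not
    # known to publish genuinely new chapters.
    return not contributes_new


others = {"a", "b", "c", "d"}
fake_b = is_fake_page("b'", 1000, [1000, 1000], others, contributes_new=False)
new_d = is_fake_page("d", 1000, [1000, 1000], {"a", "b", "c"}, contributes_new=True)
```

Here `fake_b` is true because the page is unique to a site that does not contribute new chapters, while `new_d` is false because the unique page comes from a site that does.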

Optionally, identifying, from each of multiple sites, a catalog page and multiple content pages of the chapter-style text includes: searching the multiple sites for web pages related to the chapter-style text; and identifying the catalog page and the multiple content pages of the chapter-style text from the web pages found.

Optionally, identifying the catalog page and the multiple content pages of the chapter-style text from the web pages found includes: parsing each web page found into a text object model tree structure; classifying each node in the text object model tree structure to determine the structural blocks of the web page; and extracting the catalog page and the multiple content pages of the chapter-style text according to the structural blocks.

According to another aspect of the present invention, a device for identifying the chapter integrity of a chapter-style text is also provided, including:

an acquisition module, adapted to identify, from each of multiple sites, a catalog page and multiple content pages of the chapter-style text, wherein each site corresponds to one catalog page and each catalog page corresponds to multiple content pages;

a determination module, adapted to determine, according to the multiple content pages corresponding to each catalog page, the set of catalog pages of the chapter-style text on different sites; and

an identification module, adapted to analyze each catalog page in the catalog page set and/or the content pages corresponding to each catalog page, and to identify the chapter integrity of each catalog page in the catalog page set according to the analysis result.

Optionally, the determination module is further adapted to: calculate the intersection between the content pages corresponding to every two catalog pages, and take it as the intersection of those two catalog pages; and determine, according to the intersection of every two catalog pages, the set of catalog pages of the chapter-style text on different sites.

Optionally, the determination module is further adapted to: extract the text feature vector of each content page among the content pages corresponding to the multiple catalog pages; cluster the content pages that have the same text feature vector to generate multiple content page groups; and calculate the intersection between the content pages corresponding to every two catalog pages according to the multiple content page groups and the mapping between each catalog page and its corresponding content pages.

Optionally, the determination module is further adapted to: merge every two catalog pages whose intersection contains a number of elements greater than or equal to a preset threshold, to obtain a merged result; and take the merged result as the set of catalog pages of the chapter-style text on different sites.

Optionally, the identification module is further adapted to: calculate the average number of elements in the intersection of every two catalog pages in the catalog page set; and, if the number of elements in the intersection of a given catalog page with each of multiple other catalog pages in the catalog page set is less than the average, determine that the chapters corresponding to that catalog page are incomplete.

Optionally, the identification module is further adapted to: if the content pages corresponding to a given catalog page include the content pages corresponding to multiple other catalog pages in the catalog page set, and the content pages corresponding to that catalog page also include other content pages, determine that those other content pages are content pages of the latest chapters and that the catalog page has the ability to keep contributing new chapters.

Optionally, the identification module is further adapted to: if a content page corresponding to a given catalog page does not appear among the content pages corresponding to the other catalog pages in the catalog page set, and the length of that content page falls outside the range associated with the average length of the content pages corresponding to that catalog page, determine that the content page is a fake content page.

Optionally, the identification module is further adapted to: if a content page corresponding to a given catalog page does not appear among the content pages corresponding to the other catalog pages in the catalog page set, the length of that content page falls within the range associated with the average length of the content pages corresponding to that catalog page, and the catalog page does not have the ability to keep contributing new chapters, determine that the content page is a fake content page.

Optionally, the acquisition module is further adapted to: search the multiple sites for web pages related to the chapter-style text; and identify the catalog page and the multiple content pages of the chapter-style text from the web pages found.

Optionally, the acquisition module is further adapted to: parse each web page found into a text object model tree structure; classify each node in the text object model tree structure to determine the structural blocks of the web page; and extract the catalog page and the multiple content pages of the chapter-style text according to the structural blocks.

According to the technical solution provided by the present invention, a catalog page and multiple content pages of the chapter-style text are identified from each of multiple sites, and the set of catalog pages of the chapter-style text on different sites is then determined according to the multiple content pages corresponding to each catalog page. Each catalog page in the catalog page set and/or the content pages corresponding to each catalog page are then analyzed, and the chapter integrity of each catalog page in the set is identified according to the analysis result. The invention thus automates three things: acquiring the data source (the catalog pages and content pages of the chapter-style text on multiple sites), determining the catalog page set, and analyzing the catalog page set. This solves the low efficiency that results, in the related art, from judging chapter integrity with manually configured templates. Moreover, the invention can acquire the data source flexibly and then determine and analyze the catalog page set, which solves the related art's slow response to changes in website templates. In addition, catalog pages and content pages accurately and objectively reflect the chapter integrity of a chapter-style text. The invention specifically analyzes each catalog page in the set of catalog pages of the chapter-style text on different sites and/or the content pages corresponding to each catalog page, and identifies the chapter integrity of each catalog page according to the analysis result, which makes the identification result more accurate. In summary, the technical solution provided by the invention can identify the chapter integrity of a chapter-style text flexibly and quickly, and the identification result is accurate and objective.

The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the contents of the description, and so that the above and other objects, features and advantages of the invention are more readily apparent, specific embodiments of the invention are set out below.

From the following detailed description of specific embodiments of the present invention, taken together with the accompanying drawings, those skilled in the art will better understand the above and other objects, advantages and features of the invention.

Brief description of the drawings

Various other advantages and benefits will become clear to those of ordinary skill in the art on reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be regarded as limiting the invention. Throughout the drawings, the same reference symbols denote the same parts. In the drawings:

Fig. 1 shows a flowchart of a method for identifying the chapter integrity of a chapter-style text according to an embodiment of the present invention; and

Fig. 2 shows a schematic structural diagram of a device for identifying the chapter integrity of a chapter-style text according to an embodiment of the present invention.

Detailed description

Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and its scope conveyed completely to those skilled in the art.

To solve the above technical problem, an embodiment of the present invention provides a method for identifying the chapter integrity of a chapter-style text. Fig. 1 shows a flowchart of the method according to an embodiment of the present invention. As shown in Fig. 1, the method includes at least the following steps S102 to S106.

Step S102: identify, from each of multiple sites, a catalog page and multiple content pages of the chapter-style text, wherein each site corresponds to one catalog page and each catalog page corresponds to multiple content pages.

Step S104: determine, according to the multiple content pages corresponding to each catalog page, the set of catalog pages of the chapter-style text on different sites.

Step S106: analyze each catalog page in the catalog page set and/or the content pages corresponding to each catalog page, and identify the chapter integrity of each catalog page in the catalog page set according to the analysis result.

The chapter-style text mentioned in step S102 above is a text made up of several chapters, such as a novel or a thesis. A catalog page is the table of contents of a chapter-style text; for example, when a user searches for a novel, what the user usually wants to find is the novel's catalog page. A content page holds the actual content of one chapter of the chapter-style text. The present invention provides a preferred scheme for identifying the catalog page and the multiple content pages of the chapter-style text from multiple sites: web pages related to the chapter-style text are searched for across the multiple sites, and the catalog page and the multiple content pages of the chapter-style text are then identified from the web pages found.

Further, the catalog page and the multiple content pages of the chapter-style text may be identified and extracted from the web pages found by means of manually written rules. Alternatively, annotated templates may be used: for each extraction, the best-matching template is looked up in a template library and then used to identify and extract the catalog page and the content pages. In addition, to improve identification efficiency, the present invention may parse each web page found into a text object model tree structure and classify each node in that tree structure to determine the structural blocks of the web page, and then extract the catalog page and the multiple content pages of the chapter-style text according to the structural blocks. A preferred scheme for classifying the nodes in the text object model tree structure to determine the structural blocks of the web page is as follows: the text object model tree structure is traversed to obtain the content of each node, the content of each node is then input into a decision tree according to preset rules, and the decision tree classifies each node. Alternatively, the text object model tree structure may be traversed to obtain the dimensional features of each node, the dimensional features of each node are then input into the decision tree according to preset rules, and the decision tree classifies each node.
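The parse-and-traverse step can be illustrated with a short sketch. It assumes that the text object model tree is a DOM-like tree built from HTML, which the text does not state explicitly. The `Node` and `TreeBuilder` classes and the two features computed per node are hypothetical, and the parser is deliberately naive: it ignores void tags such as `<br>` and mismatched closing tags.

```python
from html.parser import HTMLParser


class Node:
    """One node of a simplified DOM-like tree (hypothetical structure)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""  # text found directly inside this node


class TreeBuilder(HTMLParser):
    """Builds a Node tree from well-formed HTML using the standard library."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def subtree_text(node):
    """Concatenate the text of a node and all of its descendants."""
    return node.text + "".join(subtree_text(c) for c in node.children)


def count_links(node):
    """Count <a> elements in the subtree rooted at node."""
    return (node.tag == "a") + sum(count_links(c) for c in node.children)


def traverse(node):
    """Yield (node, features) for every node in depth-first order."""
    yield node, {"text_length": len(subtree_text(node)),
                 "link_count": count_links(node)}
    for child in node.children:
        yield from traverse(child)


html = ('<div><ul><li><a href="/c1">Chapter 1</a></li>'
        '<li><a href="/c2">Chapter 2</a></li></ul></div>')
builder = TreeBuilder()
builder.feed(html)
ul_features = next(f for n, f in traverse(builder.root) if n.tag == "ul")
```

In this example the `<ul>` node carries two links and 18 characters of text, which is the kind of per-node feature record a classifier could consume.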

The decision tree is trained on known statistics of the various dimensional features found in the various block types, and uses the dimensional features of each node to derive the block type to which that node corresponds. The following describes in detail how the decision tree classifies the nodes in the text object model tree structure of a web page to determine the structural blocks of the web page.

First, the dimensional features used for block division are determined. In this embodiment of the invention, as many as 105 dimensional features can be used, mainly covering the following: text length, number of hyperlinks, hyperlink text length, highlighted text length (including enlarged and bold text), Chinese character length, English character length, numeric character length, specific keywords, specific punctuation marks, and so on. That is, a block type can be determined by one or more of these 105 dimensional features taking specific values. It should be noted that the dimensional features determined in practice are not limited to 105 and can be extended later.
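A few of the dimensional features listed above can be computed as follows. This is a minimal sketch covering only eight of the features. The function name, the keyword list, the punctuation set, and the use of the CJK Unified Ideographs range `\u4e00-\u9fff` to count Chinese characters are assumptions made for illustration.

```python
import re
import string


def dimension_features(text, link_texts=(), keywords=("chapter",)):
    """Compute a handful of dimensional features for one block of text.

    link_texts -- anchor texts of the hyperlinks inside the block
    keywords   -- ASSUMPTION: illustrative "specific keywords"
    """
    return {
        "text_length": len(text),
        "link_count": len(link_texts),
        "link_text_length": sum(len(t) for t in link_texts),
        # Characters in the basic CJK Unified Ideographs block.
        "chinese_length": len(re.findall(r"[\u4e00-\u9fff]", text)),
        "english_length": sum(ch in string.ascii_letters for ch in text),
        "digit_length": sum(ch.isdigit() for ch in text),
        "keyword_hits": sum(text.lower().count(k) for k in keywords),
        "punctuation_count": sum(ch in "，。！？,.!?" for ch in text),
    }


feats = dimension_features("第1章 Chapter 1", link_texts=["第1章 Chapter 1"])
```

For the sample anchor text, the record contains 13 characters in total, of which 2 are Chinese, 7 are English letters and 2 are digits.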

Second, the dimensional features determined for block division are input into the decision tree, in order to train and build the decision tree.

Then, according to preset rules, the content of each node in the text object model tree structure of the web page is input into the decision tree. The decision tree analyzes the content of each node to obtain its dimensional features, and then classifies each node according to those features.
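To make the classification step concrete, the sketch below uses a small hand-written decision tree in place of a trained one. It is a stand-in only: the text describes a tree trained on feature statistics, whereas the thresholds and the labels `chapter_list`, `chapter_body` and `other` used here are invented for illustration.

```python
def classify_node(features):
    """Classify one node from its dimensional features.

    Hand-written stand-in for a trained decision tree; every threshold
    and label below is a hypothetical value chosen for illustration.
    """
    if features["text_length"] == 0:
        return "empty"
    # Share of the node's text that sits inside hyperlinks.
    link_ratio = features["link_text_length"] / features["text_length"]
    if features["link_count"] >= 3 and link_ratio > 0.8:
        # Mostly links: looks like a list of chapter titles.
        return "chapter_list"
    if features["text_length"] >= 200 and link_ratio < 0.1:
        # Long text with few links: looks like the body of a chapter.
        return "chapter_body"
    return "other"


toc_block = {"text_length": 90, "link_count": 10, "link_text_length": 90}
body_block = {"text_length": 3000, "link_count": 1, "link_text_length": 4}
labels = [classify_node(toc_block), classify_node(body_block)]
```

A page whose dominant block is a `chapter_list` would be a candidate catalog page, and one dominated by a `chapter_body` block a candidate content page. In practice the splits would be learned from labeled blocks rather than written by hand.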

The above describes in detail several ways of obtaining the data source (the catalog pages and content pages of the chapter-style text on multiple sites) in step S102. One or more ways of determining the catalog page set are described below.

For step S104 above, in which the set of catalog pages of the chapter-style text on different sites is determined according to the multiple content pages corresponding to each catalog page, the present invention provides a preferred scheme: the intersection between the content pages corresponding to every two catalog pages is calculated and taken as the intersection of those two catalog pages, and the set of catalog pages of the chapter-style text on different sites is then determined according to the intersection of every two catalog pages.

Further, in the preferred scheme of the present invention, the intersection between the content pages corresponding to every two catalog pages is calculated using a clustering approach. The text feature vector of each content page among the content pages corresponding to the multiple catalog pages may be extracted; the content pages that have the same text feature vector are then clustered to generate multiple content page groups; and the intersection between the content pages corresponding to every two catalog pages is calculated according to the multiple content page groups and the mapping between each catalog page and its corresponding content pages. For example, suppose the multiple sites are sites A, B and C, and the corresponding catalog pages of the chapter-style text are catalog pages A, B and C. Catalog page A corresponds to content pages A1, A2 and A3; catalog page B corresponds to content pages B1 and B2; and catalog page C corresponds to content pages C1, C2, C3 and C4. The text feature vectors extracted from content pages A1, A2, A3, B1, B2, C1, C2, C3 and C4 are a, b, c, a, b', a, b, c and d respectively. Clustering the content pages that have the same text feature vector yields the content page groups {a, a, a}, {b, b}, {b'}, {c, c} and {d}. The intersection between the content pages corresponding to every two catalog pages is then calculated according to these groups and the mapping between each catalog page and its content pages: the intersection between the content pages of catalog page A and catalog page B is {a}, the intersection between the content pages of catalog page A and catalog page C is {a, b, c}, and the intersection between the content pages of catalog page B and catalog page C is {a}.
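The A/B/C example can be reproduced in a few lines. This is a minimal sketch in which each text feature vector is represented by an opaque string label, and "clustering" means grouping pages whose labels are exactly equal, as in the example. How the feature vectors are actually extracted is not specified in the text, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Text feature vector of each content page, as in the example.
page_vectors = {
    "A1": "a", "A2": "b", "A3": "c",
    "B1": "a", "B2": "b'",
    "C1": "a", "C2": "b", "C3": "c", "C4": "d",
}
# Mapping between each catalog page and its content pages.
catalog_to_pages = {
    "A": ["A1", "A2", "A3"],
    "B": ["B1", "B2"],
    "C": ["C1", "C2", "C3", "C4"],
}


def cluster_pages(vectors):
    """Group content pages that share the same text feature vector."""
    groups = defaultdict(list)
    for page, vec in vectors.items():
        groups[vec].append(page)
    return dict(groups)


def pairwise_intersections(catalogs, groups):
    """Intersect, for every pair of catalog pages, the groups they touch."""
    page_to_group = {p: g for g, pages in groups.items() for p in pages}
    catalog_groups = {c: {page_to_group[p] for p in pages}
                      for c, pages in catalogs.items()}
    return {(x, y): catalog_groups[x] & catalog_groups[y]
            for x, y in combinations(sorted(catalog_groups), 2)}


groups = cluster_pages(page_vectors)
intersections = pairwise_intersections(catalog_to_pages, groups)
```

The group for `a` holds pages A1, B1 and C1, and the three pairwise intersections come out as {a}, {a, b, c} and {a}, matching the example.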

The set of catalog pages of the chapter-style text on different sites is then determined according to the intersection of every two catalog pages. This may be done by merging every two catalog pages whose intersection contains a number of elements greater than or equal to a preset threshold, to obtain a merged result, and taking the merged result as the set of catalog pages of the chapter-style text on different sites. Continuing the example above, the intersection between the content pages corresponding to every two catalog pages is taken as the intersection of those two catalog pages: the intersection of catalog page A and catalog page B is {a}, the intersection of catalog page A and catalog page C is {a, b, c}, and the intersection of catalog page B and catalog page C is {a}. With a preset threshold of 1, every two catalog pages whose intersection contains 1 or more elements are merged. The merged result is catalog pages A, B and C, so the set of catalog pages of the chapter-style text on different sites is catalog pages A, B and C.
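One way to carry out the merge is with a union-find structure, sketched below. The text only says that pairs meeting the threshold are merged. Treating the merge as transitive (if A merges with B and B with C, all three end up in one set) is an assumption of this sketch, and the function name is hypothetical.

```python
def merge_catalog_pages(intersections, threshold=1):
    """Merge catalog pages whose pairwise intersection size >= threshold.

    intersections -- dict mapping (page_x, page_y) to their common elements
    Returns a list of sets, one per merged group of catalog pages.
    """
    parent = {}

    def find(x):
        # Find the representative of x, compressing the path as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (x, y), common in intersections.items():
        root_x, root_y = find(x), find(y)
        if len(common) >= threshold:
            parent[root_x] = root_y

    merged = {}
    for x in list(parent):
        merged.setdefault(find(x), set()).add(x)
    return list(merged.values())


intersections = {
    ("A", "B"): {"a"},
    ("A", "C"): {"a", "b", "c"},
    ("B", "C"): {"a"},
}
merged_t1 = merge_catalog_pages(intersections, threshold=1)
merged_t2 = merge_catalog_pages(intersections, threshold=2)
```

With a threshold of 1 all three catalog pages fall into one set, as in the example. Raising the threshold to 2 would leave catalog page B on its own, which shows how the threshold controls which sites are treated as carrying the same text.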

After step S104 above determines the set of catalog pages of the chapter-style text on different sites according to the multiple content pages corresponding to each catalog page, step S106 analyzes each catalog page in the catalog page set and/or the content pages corresponding to each catalog page, and identifies the chapter integrity of each catalog page in the catalog page set according to the analysis result. The present invention provides several analysis methods, which are described in detail below.

In the first method, the average number of elements in the intersections of every two catalog pages in the catalog page set is calculated; if the number of elements in the intersection of a given catalog page with each of multiple other catalog pages in the set is less than this average, the chapters corresponding to that catalog page are determined to be incomplete. In the example above, the catalog page set is catalog pages A, B, and C; the intersection of catalog pages A and B has 1 element, that of catalog pages A and C has 3 elements, and that of catalog pages B and C has 1 element, so the average is (1 + 3 + 1)/3 = 5/3. The intersections of catalog page B with catalog page A and with catalog page C each have 1 element, which is less than the average of 5/3, so the chapters corresponding to catalog page B are determined to be incomplete.
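A minimal sketch of this first method, assuming each catalog page is represented by the set of feature vector labels of its content pages (`FEATURES` and `incomplete_catalog_pages` are hypothetical names):

```python
from itertools import combinations

# Hypothetical input: catalog page -> set of feature vector labels of its content pages.
FEATURES = {
    "A": {"a", "b", "c"},
    "B": {"a", "b'"},
    "C": {"a", "b", "c", "d"},
}


def incomplete_catalog_pages(features):
    """Return (average intersection size, catalog pages judged incomplete)."""
    sizes = {
        frozenset((x, y)): len(features[x] & features[y])
        for x, y in combinations(sorted(features), 2)
    }
    if not sizes:  # fewer than two catalog pages: nothing to compare
        return 0.0, []
    average = sum(sizes.values()) / len(sizes)
    incomplete = [
        catalog
        for catalog in sorted(features)
        # Incomplete when every intersection involving this page is below average.
        if all(n < average for pair, n in sizes.items() if catalog in pair)
    ]
    return average, incomplete
```

For the example data the average is 5/3 and only catalog page B is reported as incomplete.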

In the second method, if the content pages corresponding to a given catalog page include the content pages corresponding to multiple other catalog pages in the catalog page set, and the content pages of that catalog page also contain further content pages, those further content pages are determined to be the content pages of the latest chapters, and the catalog page is determined to have the ability to keep contributing new chapters. Here, the latest chapter means the most recently published chapter of the chapter-style text, for example the most recently published chapter of a serialized novel. Readers of novels typically follow a book as it is serialized: as soon as the author publishes a new chapter, they want to read it immediately, so novel sites that publish the latest chapters faster tend to be more popular with users. In the example above, catalog page A corresponds to content pages A1, A2, and A3 (with text feature vectors a, b, and c respectively), catalog page B corresponds to content pages B1 and B2 (with text feature vectors a and b' respectively), and catalog page C corresponds to content pages C1, C2, C3, and C4 (with text feature vectors a, b, c, and d respectively). The content pages corresponding to catalog page C thus include the content pages corresponding to catalog page A and catalog page B, and catalog page C also has a further content page (content page C4), so content page C4 is determined to be the content page of the latest chapter, and catalog page C is determined to have the ability to keep contributing new chapters.
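One way this second method might be sketched is shown below. The text does not define how "includes" is tested; the sketch assumes a strict subset test on feature vector labels. Under that assumption content page B2 (feature vector b') is absent from catalog page C, so only catalog page A is fully covered in the example, and the number of other catalog pages that must be covered is therefore exposed as a hypothetical parameter `min_covered`. `FEATURES` and `latest_chapters` are likewise hypothetical names.

```python
# Hypothetical input: catalog page -> set of feature vector labels of its content pages.
FEATURES = {
    "A": {"a", "b", "c"},
    "B": {"a", "b'"},
    "C": {"a", "b", "c", "d"},
}


def latest_chapters(features, min_covered=2):
    """Map each catalog page able to contribute new chapters to its latest chapters."""
    result = {}
    for catalog, own in features.items():
        others = {o: f for o, f in features.items() if o != catalog}
        # Other catalog pages whose content pages are all present on this one.
        covered = [o for o, f in others.items() if f <= own]
        # Content pages found on this catalog page and on no other.
        extra = own.difference(*others.values())
        if len(covered) >= min_covered and extra:
            result[catalog] = extra
    return result
```

Calling `latest_chapters(FEATURES, min_covered=1)` on the example data reports catalog page C with the latest chapter d (content page C4).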

In the third method, if a content page corresponding to a given catalog page does not exist among the content pages corresponding to the other catalog pages in the catalog page set, and the length of that content page falls outside the interval range associated with the average length of the content pages corresponding to that catalog page, the content page is determined to be a false content page. In the example above, catalog page A corresponds to content pages A1, A2, and A3 (with text feature vectors a, b, and c respectively), catalog page B corresponds to content pages B1 and B2 (with text feature vectors a and b' respectively), and catalog page C corresponds to content pages C1, C2, C3, and C4 (with text feature vectors a, b, c, and d respectively). Content page B2 of catalog page B does not exist among the content pages corresponding to catalog pages A and C; if the length of content page B2 falls outside the interval range associated with the average length of the content pages corresponding to catalog page B, content page B2 is determined to be a false content page.
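The third method could be sketched as follows. The text does not specify the interval range, so the sketch assumes it is the average length plus or minus a fraction `tolerance` (a hypothetical parameter), with the average taken over all content pages of the catalog page. The page lengths in `CATALOG` are invented purely for illustration, and `fake_pages_by_length` is a hypothetical name.

```python
# Hypothetical input: catalog page -> list of (content page, feature label, length).
CATALOG = {
    "A": [("A1", "a", 3000), ("A2", "b", 3100), ("A3", "c", 2900)],
    "B": [("B1", "a", 3000), ("B2", "b'", 200)],
    "C": [("C1", "a", 3000), ("C2", "b", 3100), ("C3", "c", 2900), ("C4", "d", 3050)],
}


def fake_pages_by_length(catalog, tolerance=0.5):
    """Flag unmatched content pages whose length is outside the assumed interval."""
    fakes = []
    for name, pages in catalog.items():
        if not pages:
            continue
        # Feature labels seen on any other catalog page.
        other_features = {
            feature
            for other, other_pages in catalog.items()
            if other != name
            for _, feature, _ in other_pages
        }
        average = sum(length for _, _, length in pages) / len(pages)
        low, high = average * (1 - tolerance), average * (1 + tolerance)
        for page, feature, length in pages:
            if feature not in other_features and not (low <= length <= high):
                fakes.append(page)
    return fakes
```

With the invented lengths, content page B2 is flagged, while content page C4 is not, because its length lies within the assumed interval for catalog page C.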

In the fourth method, if a content page corresponding to a given catalog page does not exist among the content pages corresponding to the other catalog pages in the catalog page set, the length of that content page falls within the interval range associated with the average length of the content pages corresponding to that catalog page, and the catalog page does not have the ability to keep contributing new chapters, the content page is determined to be a false content page. In the example above, catalog page A corresponds to content pages A1, A2, and A3 (with text feature vectors a, b, and c respectively), catalog page B corresponds to content pages B1 and B2 (with text feature vectors a and b' respectively), and catalog page C corresponds to content pages C1, C2, C3, and C4 (with text feature vectors a, b, c, and d respectively). Content page B2 of catalog page B does not exist among the content pages corresponding to catalog pages A and C, and the length of content page B2 falls within the interval range associated with the average length of the content pages corresponding to catalog page B; if catalog page B does not have the ability to keep contributing new chapters, content page B2 is determined to be a false content page.
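A sketch of the fourth method follows, under the same assumptions as before: the interval is taken to be the average length plus or minus a fraction `tolerance`, and the lengths are invented. The set `contributors` (hypothetical) names the catalog pages already judged able to keep contributing new chapters, for instance by the second method.

```python
# Hypothetical input: catalog page -> list of (content page, feature label, length).
CATALOG = {
    "A": [("A1", "a", 3000), ("A2", "b", 3100), ("A3", "c", 2900)],
    "B": [("B1", "a", 3000), ("B2", "b'", 2800)],
    "C": [("C1", "a", 3000), ("C2", "b", 3100), ("C3", "c", 2900), ("C4", "d", 3050)],
}


def fake_pages_without_update_ability(catalog, contributors, tolerance=0.5):
    """Flag unmatched, normal-length pages on catalog pages that do not contribute new chapters."""
    fakes = []
    for name, pages in catalog.items():
        if not pages or name in contributors:
            continue
        # Feature labels seen on any other catalog page.
        other_features = {
            feature
            for other, other_pages in catalog.items()
            if other != name
            for _, feature, _ in other_pages
        }
        average = sum(length for _, _, length in pages) / len(pages)
        low, high = average * (1 - tolerance), average * (1 + tolerance)
        for page, feature, length in pages:
            if feature not in other_features and low <= length <= high:
                fakes.append(page)
    return fakes
```

With catalog page C as the only contributor, content page B2 is flagged even though its length looks normal, whereas content page C4 is kept as a genuine new chapter.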

It should be noted that each of the four analysis methods above can be used on its own to analyze chapter integrity, and any two or more of them can also be combined. For example, the first method may determine that the chapters corresponding to catalog page B are incomplete, after which the third or fourth method is applied for further analysis and determines that content page B2 of catalog page B is a false content page, making the identification result more accurate and objective. In addition, the example above (in which the multiple sites are sites A, B, and C; the corresponding catalog pages of the chapter-style text are catalog pages A, B, and C; catalog page A corresponds to content pages A1, A2, and A3; catalog page B corresponds to content pages B1 and B2; and catalog page C corresponds to content pages C1, C2, C3, and C4) is merely illustrative and does not limit the present invention.

Based on the same inventive concept, an embodiment of the present invention also provides a device for identifying the chapter integrity of a chapter-style text, which implements the above method for identifying the chapter integrity of a chapter-style text.

Fig. 2 shows a schematic structural diagram of a device for identifying the chapter integrity of a chapter-style text according to an embodiment of the present invention. Referring to Fig. 2, the device includes at least an acquisition module 210, a determination module 220, and an identification module 230.

The functions of the components of the device according to the embodiment of the present invention, and the connections between them, are now described:

the acquisition module 210 is adapted to identify a catalog page and multiple content pages of the chapter-style text from each of multiple sites, wherein each site corresponds to one catalog page and each catalog page corresponds to multiple content pages;

the determination module 220, coupled to the acquisition module 210, is adapted to determine the set of catalog pages of the chapter-style text on different sites according to the multiple content pages corresponding to each catalog page;

the identification module 230, coupled to the determination module 220, is adapted to analyze each catalog page in the catalog page set and/or the content pages corresponding to each catalog page, and to identify the chapter integrity of each catalog page in the catalog page set according to the analysis result.

In one embodiment of the present invention, the determination module 220 is further adapted to: calculate the intersection between the content pages corresponding to every two catalog pages and take it as the intersection of those two catalog pages; and determine the set of catalog pages of the chapter-style text on different sites according to the intersection of every two catalog pages.

In one embodiment of the present invention, the determination module 220 is further adapted to: extract the text feature vector of each content page among the content pages corresponding to the multiple catalog pages; cluster content pages that have the same text feature vector to generate multiple content page groups; and calculate the intersection between the content pages corresponding to every two catalog pages according to the multiple content page groups and the mapping relationship between each catalog page and its corresponding content pages.

In one embodiment of the present invention, the determination module 220 is further adapted to: merge every two catalog pages whose intersection has a number of elements greater than or equal to a preset threshold to obtain a merged result; and take the merged result as the set of catalog pages of the chapter-style text on different sites.

In one embodiment of the present invention, the identification module 230 is further adapted to: calculate the average number of elements in the intersections of every two catalog pages in the catalog page set; and, if the number of elements in the intersection of a given catalog page with each of multiple other catalog pages in the catalog page set is less than the average, determine that the chapters corresponding to that catalog page are incomplete.

In one embodiment of the present invention, the identification module 230 is further adapted to: if the content pages corresponding to a given catalog page include the content pages corresponding to multiple other catalog pages in the catalog page set, and the content pages of that catalog page also contain further content pages, determine that the further content pages are the content pages of the latest chapters and that the catalog page has the ability to keep contributing new chapters.

In one embodiment of the present invention, the identification module 230 is further adapted to: if a content page corresponding to a given catalog page does not exist among the content pages corresponding to the other catalog pages in the catalog page set, and the length of that content page falls outside the interval range associated with the average length of the content pages corresponding to that catalog page, determine that the content page is a false content page.

In one embodiment of the present invention, the identification module 230 is further adapted to: if a content page corresponding to a given catalog page does not exist among the content pages corresponding to the other catalog pages in the catalog page set, the length of that content page falls within the interval range associated with the average length of the content pages corresponding to that catalog page, and the catalog page does not have the ability to keep contributing new chapters, determine that the content page is a false content page.

In one embodiment of the present invention, the acquisition module 210 is further adapted to: search multiple sites for web pages related to the chapter-style text; and identify the catalog page and multiple content pages of the chapter-style text from the web pages found.

In one embodiment of the present invention, the acquisition module 210 is further adapted to: parse the web pages found into a text object model tree structure; classify each node in the text object model tree structure to determine the structural blocks of the web page; and extract the catalog page and multiple content pages of the chapter-style text according to the structural blocks.
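As a rough, non-limiting sketch of the parsing and node classification just described, the snippet below builds a small DOM-like tree with Python's standard `html.parser` module and labels each `div` block. The classification rule is purely an assumption, since the text does not specify one: a block with at least `min_links` links is treated as a chapter list (typical of a catalog page), and a block with at least `min_chars` characters of text as chapter text (typical of a content page). `Node`, `TreeBuilder`, `subtree_stats`, and `classify_blocks` are hypothetical names.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.children, self.text = tag, parent, [], ""


class TreeBuilder(HTMLParser):
    """Build a minimal DOM-like tree from well-formed HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:  # void elements never receive children
            self.current = node

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def subtree_stats(node):
    """Return (number of <a> nodes, total text length) for a subtree."""
    links = 1 if node.tag == "a" else 0
    chars = len(node.text)
    for child in node.children:
        child_links, child_chars = subtree_stats(child)
        links += child_links
        chars += child_chars
    return links, chars


def classify_blocks(html, min_links=3, min_chars=50):
    """Label every <div> block in document order (assumed heuristic)."""
    builder = TreeBuilder()
    builder.feed(html)
    labels = []

    def visit(node):
        if node.tag == "div":
            links, chars = subtree_stats(node)
            if links >= min_links:
                labels.append("chapter_list")
            elif chars >= min_chars:
                labels.append("chapter_text")
            else:
                labels.append("other")
        for child in node.children:
            visit(child)

    visit(builder.root)
    return labels
```

A production classifier would likely use richer node features than link and character counts; the thresholds here only illustrate how structural blocks might separate catalog pages from content pages.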

According to any one of the above preferred embodiments or a combination of several of them, the embodiments of the present invention can achieve the following beneficial effects:

The present invention also discloses:

A1. A method for identifying the chapter integrity of a chapter-style text, comprising:

identifying a catalog page and multiple content pages of the chapter-style text from each of multiple sites, wherein each site corresponds to one catalog page and each catalog page corresponds to multiple content pages;

determining the set of catalog pages of the chapter-style text on different sites according to the multiple content pages corresponding to each catalog page;

analyzing each catalog page in the catalog page set and/or the content pages corresponding to each catalog page, and identifying the chapter integrity of each catalog page in the catalog page set according to the analysis result.

A2. The method according to A1, wherein determining the set of catalog pages of the chapter-style text on different sites according to the multiple content pages corresponding to each catalog page comprises:

calculating the intersection between the content pages corresponding to every two catalog pages and taking it as the intersection of those two catalog pages;

determining the set of catalog pages of the chapter-style text on different sites according to the intersection of every two catalog pages.

A3. The method according to A1 or A2, wherein calculating the intersection between the content pages corresponding to every two catalog pages comprises:

extracting the text feature vector of each content page among the content pages corresponding to the multiple catalog pages;

clustering content pages that have the same text feature vector to generate multiple content page groups;

calculating the intersection between the content pages corresponding to every two catalog pages according to the multiple content page groups and the mapping relationship between each catalog page and its corresponding content pages.

A4. The method according to any one of A1 to A3, wherein determining the set of catalog pages of the chapter-style text on different sites according to the intersection of every two catalog pages comprises:

merging every two catalog pages whose intersection has a number of elements greater than or equal to a preset threshold to obtain a merged result;

taking the merged result as the set of catalog pages of the chapter-style text on different sites.

A5. The method according to any one of A1 to A4, wherein analyzing each catalog page in the catalog page set and/or the content pages corresponding to each catalog page, and identifying the chapter integrity of each catalog page in the catalog page set according to the analysis result, comprises:

calculating the average number of elements in the intersections of every two catalog pages in the catalog page set;

if the number of elements in the intersection of a given catalog page with each of multiple other catalog pages in the catalog page set is less than the average, determining that the chapters corresponding to that catalog page are incomplete.

A6. The method according to any one of A1 to A5, wherein analyzing each catalog page in the catalog page set and/or the content pages corresponding to each catalog page, and identifying the chapter integrity of each catalog page in the catalog page set according to the analysis result, comprises:

if the content pages corresponding to a given catalog page include the content pages corresponding to multiple other catalog pages in the catalog page set, and the content pages of that catalog page also contain further content pages, determining that the further content pages are the content pages of the latest chapters and that the catalog page has the ability to keep contributing new chapters.

A7. The method according to any one of A1 to A6, wherein analyzing each catalog page in the catalog page set and/or the content pages corresponding to each catalog page, and identifying the chapter integrity of each catalog page in the catalog page set according to the analysis result, comprises:

if a content page corresponding to a given catalog page does not exist among the content pages corresponding to the other catalog pages in the catalog page set, and the length of that content page falls outside the interval range associated with the average length of the content pages corresponding to that catalog page, determining that the content page is a false content page.

A8. The method according to any one of A1 to A7, wherein analyzing each catalog page in the catalog page set and/or the content pages corresponding to each catalog page, and identifying the chapter integrity of each catalog page in the catalog page set according to the analysis result, comprises:

if a content page corresponding to a given catalog page does not exist among the content pages corresponding to the other catalog pages in the catalog page set, the length of that content page falls within the interval range associated with the average length of the content pages corresponding to that catalog page, and the catalog page does not have the ability to keep contributing new chapters, determining that the content page is a false content page.

A9. The method according to any one of A1 to A8, wherein identifying a catalog page and multiple content pages of the chapter-style text from each of multiple sites comprises:

searching multiple sites for web pages related to the chapter-style text;

identifying the catalog page and multiple content pages of the chapter-style text from the web pages found.

A10. The method according to any one of A1 to A9, wherein identifying the catalog page and multiple content pages of the chapter-style text from the web pages found comprises:

parsing the web pages found into a text object model tree structure;

classifying each node in the text object model tree structure to determine the structural blocks of the web page;

extracting the catalog page and multiple content pages of the chapter-style text according to the structural blocks.

B11. A device for identifying the chapter integrity of a chapter-style text, comprising:

B12. The device according to B11, wherein the determination module is further adapted to:

B13. The device according to B11 or B12, wherein the determination module is further adapted to:

B14. The device according to any one of B11 to B13, wherein the determination module is further adapted to:

B15. The device according to any one of B11 to B14, wherein the identification module is further adapted to:

calculate the average number of elements in the intersections of every two catalog pages in the catalog page set;

B16. The device according to any one of B11 to B15, wherein the identification module is further adapted to:

B17. The device according to any one of B11 to B16, wherein the identification module is further adapted to:

B18. The device according to any one of B11 to B17, wherein the identification module is further adapted to:

B19. The device according to any one of B11 to B18, wherein the acquisition module is further adapted to:

B20. The device according to any one of B11 to B19, wherein the acquisition module is further adapted to:

Numerous specific details are set forth in the description provided herein. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description.

Similarly, it should be appreciated that, in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in order to streamline the disclosure and aid the understanding of one or more of the various inventive aspects. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single foregoing disclosed embodiment. Thus, the claims following the Detailed Description are hereby expressly incorporated into this Detailed Description, with each claim standing on its own as a separate embodiment of the invention.

Those skilled in the art will understand that the modules of the device in an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units, or components of the embodiments may be combined into one module, unit, or component, and may furthermore be divided into multiple sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings), and all processes or units of any method or device so disclosed, may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.

Furthermore, those skilled in the art will understand that, although some embodiments described herein include certain features included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the claims, any of the claimed embodiments can be used in any combination.

The various component embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the device for identifying the chapter integrity of a chapter-style text according to embodiments of the present invention. The present invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the methods described herein. Such a program implementing the present invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an Internet site, provided on a carrier signal, or provided in any other form.

It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means can be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any order; these words may be interpreted as names.

Those skilled in the art should thus appreciate that, although multiple exemplary embodiments of the present invention have been shown and described in detail herein, many other variations or modifications consistent with the principles of the invention can still be directly determined or derived from the disclosure without departing from the spirit and scope of the invention. Accordingly, the scope of the present invention should be understood and deemed to cover all such other variations or modifications.

Claims

1. a recognition methods for the chapters and sections integrality of chapters and sections formula text, comprising:

The catalogue page of chapters and sections formula text and multiple content pages is identified respectively from multiple website, wherein, the corresponding catalogue page of each website, the corresponding multiple content pages of each catalogue page;

The multiple content pages corresponding according to each catalogue page, determine the catalogue page set of described chapters and sections formula text on different website;

Analyze each catalogue page in described catalogue page set and/or content pages corresponding to each catalogue page, according to the chapters and sections integrality analyzing the result obtained and identify each catalogue page in described catalogue page set.

2. The method according to claim 1, wherein determining, according to the multiple content pages corresponding to each catalog page, the catalog page set of the chapter-style text on different sites comprises:

calculating the intersection between the content pages corresponding to every two catalog pages, and taking it as the intersection of those two catalog pages; and

determining, according to the intersection of every two catalog pages, the catalog page set of the chapter-style text on different sites.

3. method according to claim 1 and 2, wherein, the common factor between the content pages that every two catalogue pages of described calculating are corresponding, comprising:

Extract the Text eigenvector of each content pages in content pages corresponding to multiple catalogue page;

The content pages possessing same text proper vector is carried out cluster, generates the grouping of multiple content pages;

According to the mapping relations of the content pages that described multiple content pages is divided into groups and each catalogue page is corresponding with it, calculate the common factor between content pages corresponding to every two catalogue pages.
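
By way of non-limiting illustration only, the grouping and intersection steps of this claim might be sketched as follows. The claim does not say how the text feature vector is computed; the `feature` function below (a hash of whitespace-normalized text) is a hypothetical stand-in, and the function and variable names are likewise assumptions made for the example.

```python
import hashlib
from itertools import combinations
from typing import Dict, List, Set, Tuple


def feature(text: str) -> str:
    """Hypothetical stand-in for a text feature vector:
    a SHA-256 digest of the whitespace-normalized page text."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def pairwise_intersections(
    catalogs: Dict[str, List[str]]
) -> Dict[Tuple[str, str], Set[str]]:
    """Map each pair of catalog pages to the set of content page groups they share.

    `catalogs` maps a catalog page name to the texts of its content pages.
    Pages with the same feature fall into the same group, so the feature
    value itself serves as the group identifier.
    """
    groups = {name: {feature(t) for t in pages} for name, pages in catalogs.items()}
    return {
        (a, b): groups[a] & groups[b]
        for a, b in combinations(sorted(groups), 2)
    }
```

In this sketch two content pages count as the same chapter only when their normalized text is identical; a practical system would likely use a feature that tolerates small differences between sites.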

4. The method according to any one of claims 1 to 3, wherein determining, according to the intersection of every two catalog pages, the catalog page set of the chapter-style text on different sites comprises:

merging every two catalog pages whose intersection has a number of elements greater than or equal to a predetermined threshold, to obtain a merged result; and

taking the merged result as the catalog page set of the chapter-style text on different sites.
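
As an illustrative sketch only, the merging step could be realized with a union-find structure over catalog pages. The claim does not state whether merging is transitive; the sketch assumes it is, and the name `merge_catalogs` and the input layout are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def merge_catalogs(
    intersection_sizes: Dict[Tuple[str, str], int], threshold: int
) -> List[Set[str]]:
    """Merge catalog pages whose pairwise intersection size reaches the threshold.

    Assumption: merging is transitive (if A merges with B and B with C,
    all three end up in one catalog page set).
    """
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), size in intersection_sizes.items():
        root_a, root_b = find(a), find(b)  # registers both pages
        if size >= threshold and root_a != root_b:
            parent[root_b] = root_a

    merged: Dict[str, Set[str]] = {}
    for page in list(parent):
        merged.setdefault(find(page), set()).add(page)
    return list(merged.values())
```

Each returned set is a candidate catalog page set for one chapter-style text; pages that never reach the threshold with any other page remain in singleton sets.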

5. The method according to any one of claims 1 to 4, wherein analyzing each catalog page in the catalog page set and/or the content pages corresponding to each catalog page, and identifying the chapter integrity of each catalog page in the catalog page set according to the analysis result, comprises:

calculating the mean of the numbers of elements of the intersections of every two catalog pages in the catalog page set; and

if the numbers of elements of the intersections between a given catalog page and multiple other catalog pages in the catalog page set are all less than the mean, determining that the chapters corresponding to that catalog page are incomplete.
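
The mean-based test of this claim can be illustrated with the following sketch. It assumes, for simplicity, that "multiple other catalog pages" means every other catalog page in the set; the function name and input layout are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Set, Tuple


def incomplete_catalogs(intersection_sizes: Dict[Tuple[str, str], int]) -> Set[str]:
    """Return catalog pages whose intersections with the others are all below the mean.

    `intersection_sizes` maps each pair of catalog pages in one catalog page
    set to the number of elements in their intersection.
    """
    avg = mean(intersection_sizes.values())
    per_catalog: Dict[str, List[int]] = {}
    for (a, b), size in intersection_sizes.items():
        per_catalog.setdefault(a, []).append(size)
        per_catalog.setdefault(b, []).append(size)
    return {page for page, sizes in per_catalog.items() if all(s < avg for s in sizes)}
```

For instance, with intersection sizes of 100 for (A, B) and 40 for both (A, C) and (B, C), the mean is 60, and only C has all of its intersections below it.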

6. The method according to any one of claims 1 to 5, wherein analyzing each catalog page in the catalog page set and/or the content pages corresponding to each catalog page, and identifying the chapter integrity of each catalog page in the catalog page set according to the analysis result, comprises:

if the content pages corresponding to a given catalog page include the content pages corresponding to multiple other catalog pages in the catalog page set, and the content pages corresponding to that catalog page further include other content pages, determining that the other content pages are content pages of the latest chapters and that the catalog page has the ability to continue contributing new chapters.
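
A minimal sketch of this superset check is given below, again assuming that the candidate catalog page must cover every other catalog page in the set. Content pages are represented by hypothetical group identifiers, and the name `latest_chapters` is an assumption made for the example.

```python
from typing import Dict, Hashable, Set


def latest_chapters(
    catalog_groups: Dict[str, Set[Hashable]], candidate: str
) -> Set[Hashable]:
    """Return the content pages treated as latest chapters for `candidate`.

    A non-empty result means the candidate covers all content pages of the
    other catalog pages and also has extra pages, so it is regarded as able
    to continue contributing new chapters. An empty result means the
    condition of the claim is not met.
    """
    own = catalog_groups[candidate]
    others = [g for name, g in catalog_groups.items() if name != candidate]
    if not others or not all(g <= own for g in others):
        return set()
    covered = set().union(*others)
    return own - covered
```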

7. The method according to any one of claims 1 to 6, wherein analyzing each catalog page in the catalog page set and/or the content pages corresponding to each catalog page, and identifying the chapter integrity of each catalog page in the catalog page set according to the analysis result, comprises:

if a content page corresponding to a given catalog page is not present among the content pages corresponding to the other catalog pages in the catalog page set, and the length of that content page does not fall within the interval corresponding to the average length of the content pages corresponding to that catalog page, determining that the content page is a false content page.

8. The method according to any one of claims 1 to 7, wherein analyzing each catalog page in the catalog page set and/or the content pages corresponding to each catalog page, and identifying the chapter integrity of each catalog page in the catalog page set according to the analysis result, comprises:

if a content page corresponding to a given catalog page is not present among the content pages corresponding to the other catalog pages in the catalog page set, the length of that content page falls within the interval corresponding to the average length of the content pages corresponding to that catalog page, and the catalog page does not have the ability to continue contributing new chapters, determining that the content page is a false content page.
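
The two false-page conditions of claims 7 and 8 can be combined in a single illustrative check. The claims do not define the interval around the average length; the `tolerance` parameter below (a symmetric fraction of the average) is purely an assumption, as are the function and parameter names.

```python
from statistics import mean
from typing import Sequence


def is_false_page(
    page_length: int,
    unique_to_catalog: bool,
    catalog_page_lengths: Sequence[int],
    can_contribute_new: bool,
    tolerance: float = 0.5,  # hypothetical interval half-width, as a fraction of the average
) -> bool:
    """Decide whether a content page is a false content page.

    `unique_to_catalog` is True when the page appears under no other
    catalog page in the catalog page set.
    """
    if not unique_to_catalog:
        return False
    avg = mean(catalog_page_lengths)
    in_range = avg * (1 - tolerance) <= page_length <= avg * (1 + tolerance)
    if not in_range:
        return True  # claim 7: unique page with abnormal length
    # claim 8: normal length, but the catalog page cannot contribute new chapters
    return not can_contribute_new
```

A unique page of normal length on a catalog page that can contribute new chapters is not flagged, since under claim 6 it may simply be a latest chapter.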

9. The method according to any one of claims 1 to 8, wherein identifying the catalog page and the multiple content pages of the chapter-style text from each of multiple sites comprises:

searching the multiple sites for web pages related to the chapter-style text; and

identifying the catalog page and the multiple content pages of the chapter-style text from the web pages found.

10. A device for identifying chapter integrity of a chapter-style text, comprising:

an acquisition module, adapted to identify a catalog page and multiple content pages of the chapter-style text from each of multiple sites, wherein each site corresponds to one catalog page and each catalog page corresponds to multiple content pages;

a determination module, adapted to determine, according to the multiple content pages corresponding to each catalog page, a catalog page set of the chapter-style text on different sites; and

an identification module, adapted to analyze each catalog page in the catalog page set and/or the content pages corresponding to each catalog page, and to identify the chapter integrity of each catalog page in the catalog page set according to the analysis result.